It is probably just coincidence, but two articles for which I gave interviews appeared online today. The first, which I discussed in an earlier post, appeared in Wired: “One Scientist’s Junk Is a Creationist’s Treasure” by Catherine Shaffer. The second appeared in the online edition of Scientific American: “The 1 Percent Genome Solution” by JR Minkel. Both deal with non-coding DNA, though from rather different perspectives. The first is about creationists invoking the discovery (by evolutionary biologists and other scientists) of (indirect indication of) function in (small sections of) non-coding DNA. The second is about the search for those functions through detailed, rigorous scientific analysis.
I know that science writers have a tough job. And I know that we scientists grumble about a lot of what they generate. But this time I want to do something a little different. I want to give readers some idea of what science writers are faced with when they interview a scientist. This is possible because the interview for Scientific American was conducted by email rather than by phone (which I actually prefer). Have a look at the article, and then see how the interview actually proceeded, and think about the challenge of summarizing my answers, which were admittedly somewhat long-winded (some might say carefully worded so as to avoid confusion and to not overlook important points). Note also the kinds of questions that a writer has to develop.
Here are the pertinent sections from the article:
The consortium found that 5 percent of the studied sequence has been conserved among 23 mammals, suggesting that it plays an important enough role for evolution to preserve while species have evolved. But of all the new ENCODE sequences identified as potentially important, only half fall into the conserved group.
These unconserved sequences may be “bystanders,” Birney says—consequences of the genome’s other functions—that neither help nor hurt cells and may have provided fodder for past evolution.
They could also simply maintain a useful DNA structure or spacing between pieces of DNA regardless of their particular sequence, says genomics researcher T. Ryan Gregory of the University of Guelph in Ontario, who was not part of the consortium.
“The biological insights are mainly incremental at this point,” says genome biologist George Weinstock of the Baylor College of Medicine in Houston, which he says is to be expected of such a pilot study. “This is a ‘community resource’ project, like a genome project, that makes lots of new data available to the community, who then dig into it and mine it for discoveries.”
Gregory says the results, although still cryptic, do hint at new functions and a more complicated genome. “This study shows us how far we are from a comprehensive understanding of the human genome.”
And here are my answers to Minkel’s questions reproduced in full:
How much of what the consortium found is new?
- What is new about this study is the fine focus being applied to the search for functional elements. By way of analogy, this study is like a group of 35 treasure hunters with metal detectors and sifters combing the same 35m of a 3.5km long beach. (In fact, the 35m are broken up into 44 discrete stretches of beach, half of them chosen because they are known to contain lots of interesting objects and the other half selected to include areas with varying properties. The plan is eventually to comb the entire beach this way, but this first pass should be taken more as a proof-of-principle than a conclusive assessment.)
- Some of the conclusions reinforce ideas that have already been in the literature for several years, for example that the majority of the human genome is transcribed (see, e.g., Wong et al. 2000; Wong et al. 2001). The identification of non-protein-coding transcripts, particularly in areas where this was not thought to occur, is novel. But, again, this particular study is based on only 1% of the genome and one should exercise caution in extrapolating it to the entire human genome.
- Other ideas, such as the importance of chromatin structure in regulation, are also not entirely new, but these data provide interesting new evidence for them.
How much of what was identified is likely to be functional?
- 5% of the genome sequence is conserved across mammals, and for about 60% of this (i.e., 3% of the genome) there is additional evidence of function. This includes the protein-coding exons as well as regulatory elements and other functional sequences. So, at this stage, we have increasingly convincing evidence of function for about 3% of the genome, with another 2% likely to fall into this category as it becomes more thoroughly characterized.
- The authors report the presence of sequences that are not conserved but show experimental (in the genomics sense) evidence of function. There need not be constraint on base pair sequences if merely the presence of non-coding DNA would fill the role independent of what that DNA is. For example, if it is simply a matter of physical spacing or structural arrangement, then it may not matter what the actual sequence of bases is. On the other hand, the authors argue that these elements “may serve as a ‘warehouse’ for natural selection, potentially acting as the source of lineage-specific elements and functionally conserved but non-orthologous elements between species”. Of course, this would be an effect, not a function, because natural selection does not have foresight and cannot maintain elements because they may someday be useful. Also, they suggest that these regions are “neutral”, meaning that they are “biochemically active” but “do not confer a selective advantage or disadvantage to the organism”. If they have no fitness effects then they cannot have a function in the usual sense of the term; however, it could be that their absence would be detrimental, in which case there would be convincing evidence of function of some sort.
- A large fraction of the sequences analyzed, both in introns and intergenic regions, appears to be transcribed. However, most of this DNA is not conserved and there is no clear indication of function. It could be that the transcripts themselves play a functional role or that the process of transcription but not the transcripts per se contributes an important effect. It could be that the regions they examined, which were typically gene-dense, included transcribed introns (no surprise) plus longer-than-expected regulatory regions such as promoters near but outside of genes (e.g., Cooper et al. 2007), but that on the whole the long stretches of non-coding DNA in between genes are not actually transcribed. Or, it could be that transcription in the human genome simply is very inefficient. For example, the data in this study suggest that 19% of pseudogenes in their sample are transcribed, even though by definition they cannot encode a protein and are unlikely to play a regulatory role. It also appears that in other groups, e.g., plants (Wong et al. 2000), there is lots of intergenic DNA that is not transcribed, which may indicate that this is a process unique to mammals and is not typical of eukaryotic genomes.
- Looking at a broader scale, we must bear in mind that about half the human genome consists of transposable elements. Some of these clearly do have functions (e.g., in gene regulation), but others persist as disease-causing mutagens. It could be that a large portion of these have taken on functions, but this remains to be shown. We are also left with the question of why a pufferfish would require only 10% as much non-coding DNA as a human whereas an average salamander needs 10 times more than we do. The well known patterns of genome size diversity make it difficult to explain the presence of all non-coding DNA in functional terms, even as there is growing evidence that a significant portion of non-coding DNA is indeed functionally important.
What does this tell us about the genome’s organization and evolution?
- This work follows the growing trend in which simplistic assumptions about genome form and function are being overturned. Previous examples include the assumption that each gene encodes one protein product and the associated expectation that there would be a relatively large number of genes in our genome. This study deals a blow to the notion that the human genome is organized and regulated in a simple way, and further suggests that our definition of “gene” may need to be expanded.
- This study shows us how far we are from a comprehensive understanding of the human genome, but it also provides some of the tools that will be needed to achieve this goal.
- The authors begin their paper with a conclusion (p. 799): “The human genome is an elegant but cryptic store of information.” Elegant, in the scientific sense, means “concise, simple, succinct”. This does not strike me as an accurate descriptor for such a complex, redundant evolutionary patchwork.
- This study reinforces the notion that the genome is a legitimate level of biological organization with its own complex evolutionary history.
You advise caution in extrapolating the results. Do you think it more likely that the study over- or under-represents the amount of complexity or underappreciated function in the genome? Why, or what other biases would you expect?
The concern is that it may over-estimate the level of function in the genome, given that they specifically selected regions rich in well-characterized genes for at least half the dataset. Of course, the objective of the study is to identify functional elements, so an aggressive approach to the question is warranted in that context. However, they probably considered few sequences that were not associated with genes in some way, such as long stretches of short repeats or transposable elements. The study does suggest that regulation is more complex than we thought, shows some evidence of function for some noncoding DNA, and indicates that lots of noncoding DNA is transcribed, but beyond that it hasn’t really clarified these issues — nor should it be expected to, as this was a pilot project only.
You say the study “deals a blow to the notion that the human genome is organized and regulated in a simple way, and further suggests that our definition of ‘gene’ may need to be expanded.” Do most biologists believe the genome is simply organized and regulated? What’s the dominant view?
I would say that, for obvious pragmatic reasons, people assume that a system is simple until it is shown to be otherwise. At first, it was surprising that genome size is decoupled from organismal complexity. Then it was surprising that genes are split into coding exons and noncoding introns. Then it was surprising that half the human genome is transposable elements. Then it was surprising that there are only 25,000 genes. Now it is surprising that a significant portion of the noncoding DNA is transcribed and that gene regulation is not a simple on-off system but involves interactions – perhaps even networks – of coding and noncoding segments. I wouldn’t want to speak for “most biologists”, but I think overall we are coming to appreciate that less has been figured out about genome function than we first thought. And that is what makes the future of genomic science exciting.
And what further evidence would tell us whether we should redefine “gene”? (E.g., would we need to find disease mutations associated with these chimeric transcripts?)
It depends on what you want the term “gene” to represent. In its original definition, it did not specify “protein-coding exons” (because these were not discovered until decades later), and instead referred to a generalized notion of a genetic “determiner” (according to Johannsen 1909, “The word gene is completely free from any hypothesis; it expresses only the evident fact that, in any case, many characteristics of the organism are specified in the germ cells by means of special conditions, foundations, and determiners which are present in unique, separate, and thereby independent ways”). After the rise of molecular genetics in the ’50s, the focus shifted to individual protein-coding sequences (hence, “one gene, one protein”), though this was expanded to include the intron-exon arrangement after it was described by Gilbert in 1978. Now we see that “units of genetic specification”, or what we might want the term “gene” to describe, can include exons, introns (especially as they play a role in alternative splicing to generate several proteins from one “gene”), regulatory regions, promoters, noncoding RNAs, and other elements. Maybe we need a word to mean “an associated unit of protein-coding exons that specifies a particular set of protein products” and one for “all sequences that are involved in generating a particular set of protein products, including coding, regulation, and associated processes”. One of these could be “gene” but we’d need another term to refer to the other. It may be rendered more complex if some regulatory elements affect multiple coding regions (hence discussion regarding relative contributions of cis vs. trans mechanisms). I think it is becoming clear enough, even without linking changes in non-exonic elements to deleterious effects, that there is more to it than simply transcribing the stretch of DNA and splicing out the introns.
So, it’s not so much a requirement of more experimental work to identify disease mutations as a conceptual decision about what we want the word to mean based on the more fundamental discoveries about regulation, protein-coding, and non-protein-coding function.
Overall, I think Minkel did a very good job with this piece — especially given the complex issues being discussed and the input offered by several scientists.