The coining of the term “junk DNA” is credited to Susumu Ohno, who used it in two conference presentations that were later published as Ohno (1972) and Ohno (1973). However, Ohno used the term only once per paper — in the titles. The first detailed discussion of “junk DNA” was by Comings (1972), and in fact his review appeared in print slightly before Ohno’s papers (Comings cites Ohno’s work as “in press”). So, if we’re going to frame the findings of ENCODE or any other genome analysis as a comparison with what the “junk DNA” view argued, we should probably refer to Comings (1972) even more so than to Ohno (1972). In that regard, I thought it would be interesting to compare some of the major claims made by the ENCODE authors in 2012 with Comings’ review of ideas on similar topics from 40 years earlier.
(Note that emphases added below are mine).
Here’s one of the major claims, from the main ENCODE (2012) paper:
“The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions.”
“The vast majority (80.4%) of the human genome participates in at least one biochemical RNA- and/or chromatin-associated event in at least one cell type.”
But wait. Here’s ENCODE’s lead coordinator, Ewan Birney, on the controversial “80% functional” figure:
Q. Hmmm. Let’s move onto the science. I don’t buy that 80% of the genome is functional.
A. It’s clear that 80% of the genome has a specific biochemical activity – whatever that might be. This question hinges on the word “functional” so let’s try to tackle this first. Like many English language words, “functional” is a very useful but context-dependent word. Does a “functional element” in the genome mean something that changes a biochemical property of the cell (i.e., if the sequence was not here, the biochemistry would be different) or is it something that changes a phenotypically observable trait that affects the whole organism? At their limits (considering all the biochemical activities being a phenotype), these two definitions merge. Having spent a long time thinking about and discussing this, not a single definition of “functional” works for all conversations. We have to be precise about the context. Pragmatically, in ENCODE we define our criteria as “specific biochemical activity” – for example, an assay that identifies a series of bases. This is not the entire genome (so, for example, things like “having a phosphodiester bond” would not qualify). We then subset this into different classes of assay; in decreasing order of coverage these are: RNA, “broad” histone modifications, “narrow” histone modifications, DNaseI hypersensitive sites, Transcription Factor ChIP-seq peaks, DNaseI Footprints, Transcription Factor bound motifs, and finally Exons.
Q. So remind me which one do you think is “functional”?
A. Back to that word “functional”: There is no easy answer to this. In ENCODE we present this hierarchy of assays with cumulative coverage percentages, ending up with 80%. As I’ve pointed out in presentations, you shouldn’t be surprised by the 80% figure. After all, 60% of the genome with the new detailed manually reviewed (GenCode) annotation is either exonic or intronic, and a number of our assays (such as PolyA- RNA, and H3K36me3/H3K79me2) are expected to mark all active transcription. So seeing an additional 20% over this expected 60% is not so surprising.
However, on the other end of the scale – using very strict, classical definitions of “functional” like bound motifs and DNaseI footprints; places where we are very confident that there is a specific DNA:protein contact, such as a transcription factor binding site to the actual bases – we see a cumulative occupation of 8% of the genome. With the exons (which most people would always classify as “functional” by intuition) that number goes up to 9%. Given what most people thought earlier this decade, that the regulatory elements might account for perhaps a similar amount of bases as exons, this is surprisingly high for many people – certainly it was to me!
In addition, in this phase of ENCODE we did sample broadly but nowhere near completely in terms of cell types or transcription factors. We estimated how well we have sampled, and our most generous view of our sampling is that we’ve seen around 50% of the elements. There are lots of reasons to think we have sampled less than this (e.g., the inability to sample developmental cell types; classes of transcription factors which we have not seen). A conservative estimate of our expected coverage of exons + specific DNA:protein contacts gives us 18%, easily further justified (given our sampling) to 20%.”
“Q. Ok, fair enough. But are you most comfortable with the 10% to 20% figure for the hard-core functional bases? Why emphasize the 80% figure in the abstract and press release?
A. (Sigh.) Indeed. Originally I pushed for using an “80% overall” figure and a “20% conservative floor” figure, since the 20% was extrapolated from the sampling. But putting two percentage-based numbers in the same breath/paragraph is asking a lot of your listener/reader – they need to understand why there is such a big difference between the two numbers, and that takes perhaps more explaining than most people have the patience for. We had to decide on a percentage, because that is easier to visualize, and we chose 80% because (a) it is inclusive of all the ENCODE experiments (and we did not want to leave any of the sub-projects out) and (b) 80% best conveys the difference between a genome made mostly of dead wood and one that is alive with activity. We refer also to “4 million switches”, and that represents the bound motifs and footprints.
But for me, this is the kicker:
We use the bigger number because it brings home the impact of this work to a much wider audience. But we are in fact using an accurate, well-defined figure when we say that 80% of the genome has specific biological activity.”
Actually, the paper says “specific biochemical function”. In any case, this 80% figure rests on an extremely loose definition of “function” or “activity”. Basically, if a region is transcribed at all, in any cell type, it’s “functional”. If a protein can bind to it, it’s functional. If it carries any histone modifications or other patterns of chromatin structure, it’s functional. As Birney noted, a figure more in line with actual function is somewhere between 10% and 20% of the genome.
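For readers who want to see how Birney’s 18–20% “conservative floor” follows from the interview numbers, here is a back-of-envelope sketch. This is only an illustration of the arithmetic he describes, not ENCODE’s actual computation; the variable names and the ~1-point exon increment are my own reading of the quoted figures.

```python
# Illustrative reconstruction of the figures quoted above
# (my own arithmetic sketch, not ENCODE's actual analysis).

strict_contacts = 0.08    # bound motifs + DNaseI footprints: "8% of the genome"
with_exons = 0.09         # adding exons "goes up to 9%"
sampling_fraction = 0.50  # Birney's generous estimate of elements sampled so far

# If only ~50% of such elements have been seen, the expected total
# coverage is roughly the observed fraction scaled up accordingly:
extrapolated = with_exons / sampling_fraction

print(f"observed strict + exons: {with_exons:.0%}")   # 9%
print(f"extrapolated coverage:   {extrapolated:.0%}") # 18%
```

Under-sampling (developmental cell types, unseen transcription factor classes) would push the figure above 18%, which is how the quoted “easily further justified to 20%” arises.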
Let’s turn now to Comings (1972). Note, in particular, that “junk DNA” absolutely was not introduced because we didn’t know what its function was, nor because it was simply assumed to be useless.
As Comings (1972) wrote:
“Why should the disturbing possibility that some of the DNA of our genome is relatively useless junk even be considered? There are several reasons: (1) Some organisms have an unreasonable excess of DNA, clearly more than they require. (2) Reasonable estimates of the number of genes necessary to run a eukaryote seem significantly less than the amount of DNA available. (3) The mutational load would be too great to allow survival if all the DNA that most eukaryotes carry was composed of essential genes. (4) Some junk DNA, such as mouse satellite, clearly exists.”
Thus, the early discussions of junk DNA were based on both theoretical considerations and experimental observations. And it was considered an open question as to how much non-coding DNA would turn out to be functional:
“These considerations suggest that up to 20% of the genome is actively used and the remaining 80+% is junk. But being junk doesn’t mean it is entirely useless. Common sense suggests that anything that is completely useless would be discarded. There are several possible functions for junk DNA.”
So, the figure for “actively used” DNA given in the very first paper to discuss “junk DNA” was the same one that the ENCODE data give.
How about the issue of large amounts of non-coding DNA being transcribed and its possible role in gene regulation?
Here’s Comings (1972) again:
“The observation that up to 25% of the genome of fetal mice is transcribed into rapidly labeled RNA, despite the fact that probably less than half this much of the genome serves a useful function, indicates that much of the junk DNA must be transcribed. It is thus not too surprising that much of this is rapidly broken down within the nucleus. There are several possible reasons why it is transcribed: (1) it may serve some unknown, obscure purpose; (2) it may play a role in gene regulation; or (3) the promoters which allow its transcription may remain sufficiently intact to allow RNA transcription long after the structural genes have become degenerate.”
“The question of whether all or only some of the different classes of repetitious DNA are transcribed is central to a number of theories concerning both gene regulation and the role of repetitious DNA in heterochromatin. For example, if satellite DNA is luxuriantly transcribed, this would be difficult to fit with its relationship to heterochromatin; if structural genes exist as single copies, then cytoplasmic messenger RNA should hybridize predominantly to nonrepetitious DNA; and if moderately repetitious DNA plays a role in gene regulation at the posttranscriptional level, then some of the cytoplasmic messenger RNA should hybridize to moderately repetitious DNA.”
Comings (1972) also discusses histone proteins, chromatin structure, DNA-protein binding, GC-content, DNA methylation, physical arrangement of chromosomes in the nucleus, and various other aspects of genome form and function. He generally considers each of these in the context of how it may be involved in genome structural or regulatory processes.
Now, I am not claiming that there is nothing new in the ENCODE data. Far from it — they provide the most detailed overview of genome elements we’ve ever seen and will surely lead to a flood of interesting research for many years to come. But it is folly to assume that ENCODE has supplanted a previously naive view of the genome as “mostly useless junk”.
On the contrary, ENCODE specifically addresses many of the same questions that were raised in early discussions of junk DNA. The big difference is that the ENCODE authors’ treatment of “function” is far less sophisticated than what was found in the literature decades ago.