If you read many of the media reports that came out today, the ENCODE project has demonstrated that 80% of the DNA in our genome has a biological function. This runs counter to traditional views of the genome, the story goes, because most of the genome had been dismissed as useless junk. I have blogged about why this cliché is historically inaccurate so many times that I just can’t bring myself to rehash it again right now. Instead, I’ll just direct you to the Junk DNA: Quotes of Interest series and you can see for yourself what was written in the scientific literature.
Some reports have been better than others. New Scientist ran a story today that presents a rather balanced treatment of the debate regarding how much of the genome is functional. (It so happens that I was quoted in that story, so that’s a bonus!) Ed Yong, one of the most reliable science writers around, wrote about it on his blog Not Exactly Rocket Science as well. Larry Moran didn’t like it, but to be fair I think Ed was trying to report what the ENCODE authors were claiming. No doubt he’s open to further discussion on the topic; in fact, he has invited comments from the other side. Maybe we’ll see a second post in the coming days.
The claim that “lots of the genome isn’t junk after all!” is not new; people have been using this straw man for nearly 20 years. What’s novel is that the ENCODE authors are claiming that there is now evidence that 80% of the genome shows signs of function, or at least of “specific biological activity”. Many people are not convinced by this, me among them. I am especially unimpressed by this figure when I read the ENCODE project lead’s own words on the subject of “function” and the 80% figure.
Here’s Ewan Birney:
Q. Hmmm. Let’s move onto the science. I don’t buy that 80% of the genome is functional.
A. It’s clear that 80% of the genome has a specific biochemical activity – whatever that might be. This question hinges on the word “functional” so let’s try to tackle this first. Like many English language words, “functional” is a very useful but context-dependent word. Does a “functional element” in the genome mean something that changes a biochemical property of the cell (i.e., if the sequence was not here, the biochemistry would be different) or is it something that changes a phenotypically observable trait that affects the whole organism? At their limits (considering all the biochemical activities being a phenotype), these two definitions merge. Having spent a long time thinking about and discussing this, not a single definition of “functional” works for all conversations. We have to be precise about the context. Pragmatically, in ENCODE we define our criteria as “specific biochemical activity” – for example, an assay that identifies a series of bases. This is not the entire genome (so, for example, things like “having a phosphodiester bond” would not qualify). We then subset this into different classes of assay; in decreasing order of coverage these are: RNA, “broad” histone modifications, “narrow” histone modifications, DNaseI hypersensitive sites, Transcription Factor ChIP-seq peaks, DNaseI Footprints, Transcription Factor bound motifs, and finally Exons.
Q. Ok, fair enough. But are you most comfortable with the 10% to 20% figure for the hard-core functional bases? Why emphasize the 80% figure in the abstract and press release?
A. (Sigh.) Indeed. Originally I pushed for using an “80% overall” figure and a “20% conservative floor” figure, since the 20% was extrapolated from the sampling. But putting two percentage-based numbers in the same breath/paragraph is asking a lot of your listener/reader – they need to understand why there is such a big difference between the two numbers, and that takes perhaps more explaining than most people have the patience for. We had to decide on a percentage, because that is easier to visualize, and we choose 80% because (a) it is inclusive of all the ENCODE experiments (and we did not want to leave any of the sub-projects out) and (b) 80% best coveys the difference between a genome made mostly of dead wood and one that is alive with activity. We refer also to “4 million switches”, and that represents the bound motifs and footprints.
We use the bigger number because it brings home the impact of this work to a much wider audience. But we are in fact using an accurate, well-defined figure when we say that 80% of the genome has specific biological activity.
So, “functional” is a pretty big stretch here, and 80% rather than 20% was used because it generates more interest. Not surprisingly, this has irritated many biologists and thrilled anti-evolutionists.
But here’s my slightly different take on the kerfuffle, and why people who deny the existence of non-functional DNA have little reason to rejoice. Consider the following:
1) Even after 5 years, $185 million, and a massive study by hundreds of researchers, there is still evidence of function for only 80% of the human genome, and that is under the most generous interpretation possible. That leaves 20% without any signs of function whatsoever. That’s more than 600 million base pairs, or about 200 million more than the entire pufferfish genome.
That said, people like Ewan Birney and John Mattick think the figure will actually go to 100% functional once additional cell types are analyzed. Here’s a quote from Ed Yong’s piece:
And what’s in the remaining 20 percent? Possibly not junk either, according to Ewan Birney, the project’s Lead Analysis Coordinator and self-described “cat-herder-in-chief”. He explains that ENCODE only (!) looked at 147 types of cells, and the human body has a few thousand. A given part of the genome might control a gene in one cell type, but not others. If every cell is included, functions may emerge for the phantom proportion. “It’s likely that 80 percent will go to 100 percent,” says Birney. “We don’t really have any large chunks of redundant DNA. This metaphor of junk isn’t that useful.”
2) To get that 80% figure, you have to have a very loose definition of “function” indeed. Actual evidence (which itself may not convince many experts) suggests 20% is functional in the sense of, well, having a biological function. The 80% value refers only to “specific biological activity”. Some comments from the interwebs sum up the critique of this criterion rather nicely:
Michael Eisen: “Measurable biochemical activity is a meaningless measure of functional significance.”
Leonid Kruglyak: “80% includes definitions of “activity” barely more interesting than “replicated” (e.g. transcribed).”
A Sandwalk reader named “Argon”: “Basically his trigger for ‘functionality’ being ‘specific biochemical activity’ sets a pretty low bar. It’s about the lowest set-point I think you can have short of ‘having a sequence that can be digested with a DNAse’.”
Also, I haven’t read the primary papers in detail yet, but the immediate question that comes to mind is how one distinguishes “specific biological activity” that is functional for the organism from “specific biological activity” of parasitic transposable elements or their remnants. Simply having a site to which a protein can bind, or being transcribed into RNA, does not seem like very good evidence of an important biological role to me.
3) The onion test. Maybe 80% of the human genome is “functional”, even in a biologically meaningful sense of that word. Even so, we’d still be left with the question of why onions need so much more non-coding DNA than humans, or how pufferfishes can get by just fine with only 1/10 as much.
4) Common sense. Probably 2/3 of the human genome is made up of transposable elements and the defunct remains thereof. Some of these elements exist in millions of copies in the genome, and many are known to cause disease by their insertion activity. Some are undoubtedly functional, but it is quite a stretch to suggest that millions of these elements are needed to regulate our 20,000 genes (but not the 30,000 genes of a pufferfish). As Carl Sagan said, “extraordinary claims require extraordinary evidence”, and so far we simply do not have it when it comes to claiming that the majority of the elements in the genome have a biological function.
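For anyone who wants to check the arithmetic behind points 1 and 3, here is a quick back-of-the-envelope sketch. The genome sizes are round figures that I am assuming for illustration (roughly 3.1 billion bp for human, and the roughly 400 million bp for pufferfish that the comparison in point 1 implies); they are not taken from the ENCODE papers, and the function name is hypothetical.

```python
# Rough genome sizes in base pairs. Both are assumed round figures,
# not values reported by ENCODE.
HUMAN_GENOME_BP = 3_100_000_000      # assumption: ~3.1 billion bp
PUFFERFISH_GENOME_BP = 400_000_000   # assumption: ~400 million bp


def bases_without_activity(genome_bp: int, active_percent: int) -> int:
    """Number of bases left over once the 'active' share is set aside.

    Integer arithmetic is used so the result is exact.
    """
    return genome_bp * (100 - active_percent) // 100


# The 20% of the human genome with no sign of activity, even at 80% "functional".
leftover = bases_without_activity(HUMAN_GENOME_BP, 80)

# How much larger that leftover is than an entire pufferfish genome.
excess = leftover - PUFFERFISH_GENOME_BP

# Pufferfish genome size as a fraction of the human genome (the "1/10" in point 3).
ratio = PUFFERFISH_GENOME_BP / HUMAN_GENOME_BP

print(f"Leftover human DNA: {leftover:,} bp")
print(f"Excess over pufferfish genome: {excess:,} bp")
print(f"Pufferfish / human genome size: {ratio:.2f}")
```

Under these assumed sizes, the leftover 20% comes to a little over 600 million bp, about 200 million more than the whole pufferfish genome, which in turn is on the order of one-tenth the size of ours.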
So, even the most rigorous efforts to find function for non-coding DNA in the human genome have come up with a figure of 80% at best, and only when they use a very flexible definition of “function”. 20% remains a more realistic number, and that would leave a heck of a lot of non-functional DNA in the human genome. If anyone should be pleased by these results, it’s those who maintain that a sizeable portion of the human genome is without a biological function at the level of the organism.
Also, there’s this:
These considerations suggest that up to 20% of the genome is actively used and the remaining 80+% is junk. But being junk doesn’t mean it is entirely useless. Common sense suggests that anything that is completely useless would be discarded. There are several possible functions for junk DNA.
That was written by D.E. Comings in 1972, in the very first detailed discussion of “junk DNA”.