There is a story on Science News Online entitled “Genome 2.0”. The author has certainly done a lot of legwork and has tried to present a detailed discussion of a complex topic, and for that he deserves considerable credit. That said, it is unfortunate that he has fallen into the trap of repeating the usual claims about the history of “junk DNA” (everyone thought it was merely irrelevant garbage) and its potential function (some is conserved and lots is transcribed, so it all must be serving a role). (He clearly hasn’t taken my guide to heart.) As a result, I won’t comment much more on it.

One thing that is worth pointing out about this story in particular is the first figure it uses. This is a figure I have seen in a few places, including in the scientific literature. It makes me cringe every time because it reveals a real problem with how some people approach the issue of non-coding DNA. And so, 10 points to the first person who can point out what is deeply problematic about the interpretation it is often granted. I include the legend as provided in the original report.
JUNK BOOM. Simpler organisms such as bacteria (blue) have a smaller percentage of DNA that doesn’t code for proteins than more-complex organisms such as fungi (grey), plants (green), animals (purple), and people (orange).
(See also Genome size and gene number)
The 10 points have been awarded twice, once for each of two major problems that readers pointed out.
The first is that the graph arranges species according to % non-coding DNA and assumes that everyone will agree that the X-axis proceeds from less to more complex. This is classic “great chain of being” thinking. No criteria are specified by which the bacteria are ranked, which is bad enough, and the figure simply ignores the fact that Rickettsia has many pseudogenes that appear to be non-functional. Worse yet, there is really no justification for ranking C. elegans as more complex than A. thaliana other than the animal-centric assumption that all animals must be more sophisticated than all plants.
The second, and the one I had in mind, is that this is an extremely biased dataset. Specifically, it is based only on species whose genomes have been sequenced. These target species were chosen in large part because they have very small genomes with minimal non-coding DNA. The one exception is the human genome, which was chosen because we’re humans. As has been pointed out, if you added even a few of the more recently sequenced genomes (say, pufferfish at 400 Mb and mosquito at 1,400 Mb), this pattern would start to disintegrate. If you look at the actual ranges or means of genome size among different groups, you will see that there are no clear links between complexity and DNA content, despite what some authors (who focus only on sequenced genomes) continue to argue.
To illustrate this point, the figure below shows the means (dots) and ranges in genome size for the various groups of organisms for which data are available. It represents estimates for more than 10,000 species. The groups are intentionally arranged along the same kind of axis of intuitive notions of complexity, just to show how discordant “complexity” and genome size actually are. Humans, it will be noted, are average in genome size for mammals and not particularly special in the larger eukaryote picture.
Means and ranges of haploid DNA content (C-value) among different groups of organisms. Source: Gregory, TR (2005). Nature Reviews Genetics 6: 699-708.
Maybe you will join me in cringing the next time you see a figure like the one in the story above.
Others have criticized this kind of figure before. As a case in point, see John Mattick’s (2004) article in Nature Reviews Genetics and the critical commentary by Anthony Poole (and Mattick’s reply). Obviously, I am with Poole on this one.