What’s wrong with this figure?

There is a story on Science News Online entitled “Genome 2.0”. The author has certainly done a lot of legwork and has tried to present a detailed discussion of a complex topic, and for that he deserves considerable credit. That said, it is unfortunate that the author has fallen into the trap of repeating the usual claims about the history (everyone thought it was merely irrelevant garbage) and potential function (some is conserved and lots is transcribed, so it all must be serving a role) of “junk DNA”. (He clearly hasn’t taken my guide to heart.) As a result, I won’t comment much more on it.

One thing that may be relevant to point out about this story in particular is the first figure it uses. This is a figure I have seen in a few places, including in the scientific literature. It makes me cringe every time because it reveals a real problem with how some people approach the issue of non-coding DNA. And so, 10 points to the first person who can point out what is deeply problematic about the interpretation it is often granted. I include the legend as provided in the original report.

JUNK BOOM. Simpler organisms such as bacteria (blue) have a smaller percentage of DNA that doesn’t code for proteins than more-complex organisms such as fungi (grey), plants (green), animals (purple), and people (orange).

(See also Genome size and gene number)


The 10 points have been awarded twice, on the basis of two major problems being pointed out.

The first is that the graph arranges species according to % noncoding DNA and assumes that everyone will agree that the X-axis proceeds from less to more complex. This is classic “great chain of being” thinking. No criteria are specified by which the bacteria are ranked (and it is simply ignored that Rickettsia has a lot of pseudogenes which appear to be non-functional), which is bad enough. Worse yet, there is really no justification for ranking C. elegans as more complex than A. thaliana other than the animal-centric assumption that all animals must be more sophisticated than all plants.

The second, and the one I had in mind, is that this is an extremely biased dataset. Specifically, it is based on a set of species whose genomes have been sequenced. These target species were chosen in large part because they have very small genomes with minimal non-coding DNA. The one exception is humans, who were chosen because we’re humans. As has been pointed out, even if you included just a few of the more recently sequenced genomes (say, pufferfish at 400 Mb and mosquito at 1,400 Mb), this pattern would start to disintegrate. If you look at the actual ranges or means of genome size among different groups, you will see that there are no clear links between complexity and DNA content, despite what some authors (who focus only on sequenced genomes) continue to argue.

To illustrate this point, this figure shows the means (dots) and ranges in genome size for the various groups of organisms for which data are available. This represents estimates for more than 10,000 species. This is intentionally arranged along the same kind of axis of intuitive notions of complexity just to show how discordant “complexity” and genome size actually are. Humans, it will be noted, are average in genome size for mammals and not particularly special in the larger eukaryote picture.

Means and ranges of haploid DNA content (C-value) among different groups of organisms. Source: Gregory, TR (2005). Nature Reviews Genetics 6: 699-708.

Maybe you will join me in cringing the next time you see a figure like the one in the story above.

Update (again):

Others have criticized this kind of figure before. As a case in point, see John Mattick’s (2004) article in Nature Reviews Genetics and the critical commentary by Anthony Poole (and Mattick’s reply). Obviously, I am with Poole on this one.

18 thoughts on “What’s wrong with this figure?”

  1. Um, the fact that the organisms are arranged in a “great chain of being” fashion, with “simple” organisms coming before the humans at the pinnacle? I hate that!

  2. Indeed, although in this case they are also arranged according to % noncoding.

    Will you settle for 5 points?


  3. Yes, that is also very annoying. It seems we have similarly low tolerance for this sort of thing.

    But that’s not the biggie.

  4. It should be a huge red flag to people that the organisms that “happen to have” the largest % of non-coding DNA are also the ones that we know most about… I suspect that the % noncoding in A. thaliana, C. elegans, D. melan, and humans are not significantly different from each other… I bet that as we learn more about the other species, their % non-coding will rise as well…

    Another issue: the guys in the “low % non-coding” group are prokaryotes.

  5. Matt – Getting close…

    I’ll grant that H. sapiens has a lot more noncoding DNA than, say, C. elegans, so that’s not quite it.

  6. On second thought, I am going to give Jonathan the full 10 points, so two people can potentially get full scores.

    Who decided that this is the order for the bacteria in terms of increasing complexity?

    And is A. thaliana really less complex than C. elegans? I doubt it, at least not in terms of number of cell types or some quasi-objective assessment.

    The more I think about it, the more this figure provides a superb example of how to mislead with bar graphs.

    So congrats to Jonathan, but the competition is still open…

  7. Isn’t the graph just wrong in detail? For example, salamanders have a bigger genome than humans, and much of it could be noncoding. There are probably other examples. If that is true, then the trend in the graph is artificial, constructed by picking and choosing data.

  8. Definite data-plucking. Why weren’t maize or Anopheles added to provide some contrast, or perhaps both zebrafish and fugu?

    To me that is the far more interesting conundrum: that species as similar as two bony fish can have a 10X difference in genome size. It’s particularly disappointing to hear this is from Science News, because they generally are pretty good.

  9. What, you mean that the sequenced eukaryote genomes are not a good representative sample of eukaryote genomes? Shocker! It’s like they were trying to be cheap and sequence small genomes, except for humans which we have a particular interest in.

    At least some of this traces back to John Mattick who managed to get a figure like this published in a Scientific American article along with a lot of fluff about meaning in the junk.

  10. I’m psychic, I tell you, I sussed out Mattick-ian influence before I even read the article!

    While the number of genes isn’t much different in roundworms and people, the human genome is 30 times the size of the roundworms’. People have a much larger quantity of DNA beyond what codes for proteins. Since much of this “junk” DNA is being transcribed into RNA, perhaps it’s responsible for much of the complexity of human bodies and brains. In fact, organisms simpler than roundworms, such as single-celled bacteria, carry little noncoding DNA and may have no regulatory RNA at all.

    “Scientists have been suspecting that it is the regulatory networks that lead to this amazing complexity” in higher organisms, Ge says.

    John S. Mattick of the University of Queensland in Brisbane, Australia, points to a known example of the importance of regulatory RNAs: their crucial role in fetal development. For example, most multicellular animals possess a gene called Notch that helps guide neural development. While the gene itself has much the same form in both simple and complex animals, its activity is regulated by miRNAs that are highly variable from one animal to another. Such miRNAs also influence a gene called Hox, which acts in many animals to define a fetus’ body axis and the placement of its limbs.

  11. I’m a little bit confused by this. You say that the amount of non-coding DNA is not indicative of complexity. Okay. You say that there is a bias in the dataset because we’ve only sequenced genomes that are small – which tend to have fewer non-coding bases. Okay. But, then the chart about the amount of DNA shows there is no real correlation between “organismal complexity” and the number of bases. Okay. Although, that last chart of yours doesn’t really “shoot down” the idea that the percentage of non-coding DNA = organismal complexity. Sure, maybe salamanders have a lot of DNA, but maybe they also have an awful lot of genes (perhaps lots of slight variations on one gene – like how humans have multiple versions of hemoglobin). So, I really don’t know what percentage of the salamander genome is non-coding DNA. It could be higher or lower than the human values. It would’ve been a great rebuttal if your second chart had shown the percentage of non-coding DNA.

  12. It’s possible that some salamanders (and algae, and plants, and protists) have a million genes, but I strongly doubt it — for one thing, it was established quite some time ago that so many genes would cause major difficulties in terms of mutational loads.

  13. Just a second: isn’t accumulation of deleterious mutations a major problem with our current model of genetics, since minor mutations don’t get eliminated and they accumulate over time? It was called “Muller’s Ratchet” in my undergraduate days.

    Correct me on this if it has been sorted out in the years between it being taught to me and what I do now ;).

    P.S. As a microbiologist, the idea that bacteria aren’t complex is laughable. We can’t even classify them properly (their genetic interactions, variation and horizontal gene transfers are mind-bogglingly complex). To make matters worse, we can’t even grow or study 95-99% of the species we know exist (the figures keep changing on this one).

    Even trying to measure the complexity of these guys is too hard for us, so how can we even line them up with other creatures?
