Non-coding DNA and the opossum genome.

The genome sequence of the gray short-tailed opossum, Monodelphis domestica, was published in today’s issue of Nature (Mikkelsen et al. 2007). It is interesting for many reasons, including its status as the first marsupial genome to be sequenced, its relatively large genome size, and its low chromosome number (2n = 18). It is also interesting because it contains about as many genes (18,000–20,000) as the human genome, the vast majority of which exhibit close associations with the genes of placental mammals. Also, in keeping with the hypothesis that transposable elements are the dominant type of DNA in most eukaryotic genomes, the comparatively large opossum genome is composed of 52% transposable elements, the most for any amniote sequenced so far.

One of the most intriguing discoveries about the opossum genome is that changes to protein-coding genes seem not to have been the driving force behind mammalian diversification. Instead, non-coding elements with regulatory functions, mostly derived from formerly parasitic transposable elements, appear to underlie much of the difference.

Now, I would prefer to just talk about the science here, noting that this is yet another great example of the complex nature of genome evolution, the key role played by “non-standard” genetic processes (Gregory 2005), and the ever-increasing relevance of non-coding DNA in genomics. But, inevitably, I must comment on how this discovery has been reported. Here is what ScienceDaily (which I otherwise like a great deal) said about it:

Opossum Genome Shows ‘Junk’ DNA Source Of Genetic Innovation

(…)

The research, released Wednesday (May 9) also illustrated a mechanism for those regulatory changes. It showed that an important source of genetic innovation comes from bits of DNA, called transposons, that make up roughly half of our genome and that were previously thought to be genetic “junk.”

The research shows that this so-called junk DNA is anything but, and that it instead can help drive evolution by moving between chromosomes, turning genes on and off in new ways.

(…)

It had been initially thought that most of a creature’s DNA was made up of protein-coding genes and that a relatively small part of the DNA was made up of regulatory portions that tell the rest when to turn on and off.

As studies of mammalian genomes advanced, however, it became apparent that that view was incorrect. The regulatory part of the genome was two to three times larger than the portion that actually held the instructions for individual proteins.

I will just reiterate two brief points, as I have already dealt with some of these topics in earlier posts (and will undoubtedly have to do so again in the future). One, very few people have actually argued that all non-coding DNA is 100% functionless “junk”, and no one is surprised anymore when a regulatory or other function is observed for some non-coding DNA sequences. Moreover, transposable elements are more commonly labeled as “selfish DNA”, and it has been noted in countless articles that they can and do take on functions at the organism level even if they begin as parasites at the genome level. Two, yet again we are talking about a small portion of the genome, such that this should not be considered a demonstration that all non-coding DNA is functional. In particular, the authors identified about 104 million base pairs of DNA that is conserved (i.e., shared and mostly invariant) among mammals, about 29% of which overlapped with protein-coding genes. In other words, the remaining 71%, or about 74 million base pairs of non-coding DNA, much of it derived from former transposable elements, is found to be conserved among mammals and shows signs of being functional in regulation. The genome size of the opossum is probably around 3,500 million bases, which means that this functional non-coding DNA makes up about 2% of the genome.
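For anyone who wants to check the arithmetic, here is a minimal sketch using only the three figures quoted above. The variable and function names are my own, and the genome size is the rough estimate given in the text rather than a precise measurement.

```python
# Figures as quoted in the post; the names below are illustrative only.
CONSERVED_BP = 104e6      # DNA conserved among mammals, in base pairs
CODING_OVERLAP = 0.29     # share of conserved DNA overlapping protein-coding genes
GENOME_BP = 3.5e9         # approximate opossum genome size, in base pairs


def conserved_noncoding_bp(conserved_bp, coding_overlap):
    """Conserved base pairs that do not overlap protein-coding genes."""
    return conserved_bp * (1.0 - coding_overlap)


def fraction_of_genome(part_bp, genome_bp):
    """Share of the whole genome represented by part_bp."""
    return part_bp / genome_bp


noncoding = conserved_noncoding_bp(CONSERVED_BP, CODING_OVERLAP)
share = fraction_of_genome(noncoding, GENOME_BP)

# Prints: 74 million bp, 2.1% of the genome
print(f"{noncoding / 1e6:.0f} million bp, {share:.1%} of the genome")
```

Changing the genome size estimate by a few hundred million bases in either direction moves the result only slightly, so the "about 2%" conclusion does not hinge on the exact value.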

A note to science writers. There is nothing surprising about some sequences of non-coding DNA having an important function. The notion that all non-coding DNA has long been assumed to be completely functionless junk is a straw man. And to avoid misleading readers, you really need to specify that most examples of non-coding DNA with a function represent a very small portion of the total genome.

___________

References

Gregory, T.R. 2005. Macroevolution and the genome. In The Evolution of the Genome (ed. T.R. Gregory), pp. 679-729. Elsevier, San Diego.

Mikkelsen, T.S., M.J. Wakefield, B. Aken, C.T. Amemiya, J.L. Chang, S. Duke, M. Garber, A.J. Gentles, L. Goodstadt, A. Heger, J. Jurka, M. Kamal, E. Mauceli, S.M.J. Searle, T. Sharpe, M.L. Baker, M.A. Batzer, P.V. Benos, K. Belov, M. Clamp, A. Cook, J. Cuff, R. Das, L. Davidow, J.E. Deakin, M.J. Fazzari, J.L. Glass, M. Grabherr, J.M. Greally, W. Gu, T.A. Hore, G.A. Huttley, M. Kleber, R.L. Jirtle, E. Koina, J.T. Lee, S. Mahony, M.A. Marra, R.D. Miller, R.D. Nicholls, M. Oda, A.T. Papenfuss, Z.E. Parra, D.D. Pollock, D.A. Ray, J.E. Schein, T.P. Speed, K. Thompson, J.L. VandeBerg, C.M. Wade, J.A. Walker, P.D. Waters, C. Webber, J.R. Weidman, X. Xie, M.C. Zody, J.A.M. Graves, C.P. Ponting, M. Breen, P.B. Samollow, E.S. Lander, and K. Lindblad-Toh. 2007. Genome of the marsupial Monodelphis domestica reveals innovation in non-coding sequences. Nature 447: 167-177.


Biodiversity databases.

The recent launch of the Encyclopedia of Life has generated quite a bit of excitement. It is my hope that advances such as this will help to make information about the millions of species that inhabit the planet accessible to everyone. It is the ultimate in open access science. In keeping with this, here is a list of biodiversity databases that are freely available to anyone. (I am sure to have missed some and I left out many taxon-specific pages — please leave me a comment or send me an email if you know of any other major resources and I will update the compilation).


What do non-coding DNA and sleep have in common?

Like many scientists, I make use of Web of Science and PubMed to alert me when papers in my field of research are published or when one of my articles is cited. The latter may sound vain, but in fact it is helpful because one’s papers sometimes are cited in unexpected ways that would probably not be discovered by one’s routine literature searches.

This post is about an intriguing example of a connection that I never would have drawn myself and which I probably would not have seen in the literature had I not been alerted that some of my articles on genome size and non-coding DNA had been cited. Specifically, this involves two recent papers on what I would have assumed was a totally unrelated topic: sleep.


I am sure that everyone reading this blog is familiar with the importance of sleep in a physiological sense — if you don’t sleep for a large portion (~1/3!) of every day, both your mind and body suffer. However, in an evolutionary sense, why we sleep remains something of a puzzle. As Savage and West (2007) put it, “Sleep is one of the most noticeable and widespread phenomena occurring in multicellular animals. Nevertheless, no consensus for a theory of its origins has emerged.”

So what does this have to do with non-coding DNA?

In the first paper mentioned above, Savage and West (2007) put forth a framework for studying the function of sleep that relates to cellular damage repair and brain reorganization. This includes linkages with metabolism, body size, and cell size, the latter of which is related to genome size, and thus to the quantity of non-coding DNA. Moreover, the amount of DNA per genome may be related to the genome’s susceptibility to mutational damage (though this remains to be established, and one could come up with a priori arguments for why the relationship might be positive or negative). So, in this case, the amount of non-coding DNA may relate to one or more of the proposed functions of sleep, and thus be relevant to an adaptive interpretation of the question. Interesting stuff.

The second paper, by Rial et al. (2007), takes a rather different approach to the question. They argue that sleep per se is not really necessary; rest would suffice. These authors invoke non-coding DNA not as a possible factor of interest in explanations of sleep, but as an analogy. In particular, Rial and colleagues lament the fact that sleep is almost always approached from an adaptive standpoint because it is relatively complex. “However,” they write, “complexity is not by itself firm proof of adaptation. A well known, complex, but seemingly useless structure, could be invoked as a metaphor for the uselessness of many sleep signs.” That complex, (possibly) largely functionless feature is, you guessed it, non-coding DNA. Sleep, under their interpretation, may be more like non-coding DNA than like the eye, in that it is not the product of adaptation but rather reflects a byproduct of other processes. In addition, just as some non-coding DNA takes on a secondary function, so may some components of sleep.

I am not qualified to comment on the scientific merit of either hypothesis, so I will not say any more about their specific arguments. I am, however, pleasantly surprised to see non-coding DNA making an appearance, both mechanistically and conceptually, in discussions of what I would have thought was an unrelated subject of inquiry. It just goes to show the interconnectedness of science, and the importance of reading outside the bounds of one’s own specialized field.

______________

References

Rial, R.B., M. del Carmen Nicolau, A. Gamundi, M. Akaarir, S. Aparicio, C. Garau, S. Tejada, C. Roca, L. Gene, D. Moranta, and S. Esteban. 2007. The trivial function of sleep. Sleep Medicine Reviews, in press.

Savage, V.M. and G.B. West. 2007. A quantitative, theoretical framework for understanding mammalian sleep. Proceedings of the National Academy of Sciences of the USA 104: 1051-1056.


Gene number and complexity.

Leaving aside the difficulty in defining terms such as “complexity” and “gene”, there has been for many decades an underlying assumption that there ought to be some relationship between morphological complexity and the number of protein-coding genes within a genome. This is a holdover from the pre-molecular era of genetics, when it was at first thought that total genome size should be related to gene number, and thus to complexity. Indeed, the constancy of DNA content within chromosome sets (“C-values”) was taken as evidence that DNA is the substance of heredity, and yet it was recognized as early as 1951 that there is no clear relationship between the amount of DNA per genome and organismal complexity (e.g., Mirsky and Ris 1951; Gregory 2005). By 1971, this had become known as the “C-value paradox” because it seemed so self-contradictory (Thomas 1971). (The solution to the C-value paradox was that most eukaryotic DNA is non-coding, although this raises plenty of questions of its own.)

Nevertheless, one sometimes encounters arguments that there is a positive correlation between complexity and genome size, even in the scientific literature. Let me put to rest the notion that genome size is related to complexity on the broad scale of eukaryotic diversity. Here is a figure from Gregory (2005) showing the known ranges and means for more than 10,000 species of animals, plants, fungi, protists, bacteria, and archaea.


The notion that gene number and complexity should be related has survived largely intact into the post-genomic era, in no small part due to the popular tendency to describe genomes as “blueprints”. Genomes are not blueprints because there is no direct correspondence between a given bit of the genome and a particular piece of the organism. If one must have an analogy for how genomes operate, then a far more appropriate one is with recipes and cakes. No single word in a recipe specifies a particular crumb of a cake, but following the recipe correctly will result in a cake nonetheless. It probably does not need spelling out, but genomes are the recipe, development is the process of mixing ingredients and baking, and organisms are the cake.


Now, one might expect that a more complex cake would require a more verbose recipe, and indeed on a very general level this is true: viruses have very few genes, bacteria and archaea have more, and eukaryotes have more still. Beyond that, however, it is not necessarily the case that a complex cake needs a recipe with more individual instructions. If the language is very efficient — for example, if one sentence in the recipe can convey several steps, or if one can combine the same basic instructions in different ways to make different parts of the cake — then a short recipe might easily produce a more complex cake than one that goes on for several pages.

While predictions regarding human gene number varied considerably prior to the completion of the human genome sequence in 2001, it was nevertheless somewhat surprising that the gene count is only about 20,000-25,000 for a human (International Human Genome Sequencing Consortium 2004). In fact, some people started calling this the “G-value paradox” or “N-value paradox” (for Gene or Number) in reference to the older C-value paradox (Claverie 2001; Betrán and Long 2002; Hahn and Wray 2002).

Here is how Comings (1972) described the C-value paradox:

Being a little chauvinistic toward our own species, we like to think that man is surely one of the most complicated species on earth and thus needs just about the maximum number of genes. However, the lowly liverwort has 18 times as much DNA as we, and the slimy, dull salamander known as Amphiuma has 26 times our complement of DNA. To further add to the insult, the unicellular Euglena has almost as much DNA as man.

And here are Harrison et al. (2002) (probably mostly facetiously):

The sequencing of the genomes of six eukaryotes has provided us with a related quandary: namely, how is the number of genes related to the biological complexity of an organism (termed an ‘N-value’ paradox by Claverie [2001])? How can our own supremely sophisticated species be governed by just 50-100% more genes than the nematode worm?

Of course, neither the “C-value paradox” nor the “G-value paradox” is a paradox at all. As I have said elsewhere, this simply follows the common but erroneous equation of simplistic expectation + contradictory data = “paradox”. Some genes may encode multiple proteins and gene regulation may be more important than gene number, which means that constructing a complex organism does not require a large number of genes any more than it requires a large genome. No paradoxes.

But why might less complex organisms possess large numbers of genes? Rice (Oryza sativa), for example, is thought to have about 50,000 genes, or twice as many as humans (Goff et al. 2002; Yu et al. 2002). One possible explanation is that rice is an ancient polyploid whose entire genome was duplicated in its ancestry. (At least one round of genome duplication also happened early in the evolution of vertebrates, though most lineages now behave genetically as diploids).

But what about something like a purple sea urchin (Strongylocentrotus purpuratus), whose genome apparently encodes 23,300 genes? As deuterostomes, sea urchins are more closely related to vertebrates than to other invertebrates, but that alone does not explain the fact that they have a gene number roughly equivalent to humans (at least, not under the simplified view of genome evolution being discussed). Further, relatedness to self-described complex organisms certainly can’t explain why corals, which are very distant relatives of vertebrates and considered to be relatively “simple” animals, also have somewhere around 20,000 to 25,000 genes.


It turns out that genes involved in immunity are extraordinarily abundant in sea urchins and corals, and that this could account for a significant portion of their total gene number. (Sensory and developmental genes also appear to be very well represented in the sea urchin genome.) It is well known that pathogen populations can evolve rapidly and thus that a single host defense mechanism may not remain effective for long. Vertebrates handle the infectious onslaught with a two-tiered system. The first tier is “innate immunity”, which is based on non-specific immune reactions to pathogen attack and is the first response of the body’s immune system. This sort of immunity involves a suite of genes that generate a generalized but limited immune response. In this case there is something of a link with complexity, namely that in order to have a more complex set of possible responses, one would need to have more such genes. All animals possess innate immunity, but only the jawed vertebrates also exhibit the second tier, “adaptive immunity”, which provides a tailored response to individual pathogens. This system does not involve an individual gene for every possible pathogen, but rather employs an array of duplicated genes that can be shuffled in an effectively limitless number of combinations, like railway cars on a long train, to produce a wide variety of antibodies.
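To make the railway-car point concrete, here is a rough counting sketch. The pool sizes are made up purely for illustration and are not real counts of immune gene segments; the function name is likewise my own. It simply compares how many pieces must be stored in the genome with how many distinct products can be assembled by picking one piece from each pool.

```python
from math import prod


def combinatorial_repertoire(pool_sizes):
    """Distinct products obtainable by choosing one segment from each pool."""
    return prod(pool_sizes)


# Hypothetical pool sizes, chosen only to illustrate the counting argument.
pools = [40, 25, 6]

segments_stored = sum(pools)                      # pieces the genome must carry
products_possible = combinatorial_repertoire(pools)

# Prints: 71 stored segments -> 6000 possible combinations
print(f"{segments_stored} stored segments -> {products_possible} possible combinations")
```

A one-gene-per-response strategy would need as many genes as there are responses, whereas in this toy case shuffling covers thousands of combinations with a few dozen stored pieces. The sketch counts combinations only and ignores every other source of diversity.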

The net result is that vertebrate immunity is more flexible, but that this is achieved not through the addition of tens of thousands of new genes, but through the evolution of a system that can recombine existing genes. Groups like echinoderms and cnidarians, by contrast, may require more immune genes to accomplish an effective level of defense because they lack this ability to use existing genes in a large number of combinations. While analogies between human inventions and biological systems can be very problematic, it does seem apt to point out that more sophisticated technologies are frequently simpler, smaller, and more efficient, with fewer parts. A large number of components and a high degree of physical complexity can represent the primitive rather than the derived state in both engineering and evolution.

More DNA generally, or more genes in particular, need not relate to morphological complexity. The more knowledge has accumulated about the size, content, and regulation of genomes, the more the basis for expecting such an association has eroded. Being shocked by, or even ashamed of, the fact that humans do not reign supreme in terms of genome size or number of genes is not the appropriate reaction. Rather, realizations such as these should be exciting and should stimulate the next generation of genomic investigation.

_________

References

Betrán, E., and M. Long. 2002. Expansion of genome coding regions by acquisition of new genes. Genetica 115: 65–80.

Claverie, J.-M. 2001. What if there are only 30,000 human genes? Science 291: 1255–1257.

Comings, D.E. 1972. The structure and function of chromatin. Advances in Human Genetics 3: 237-431.

Goff, S.A. et al. 2002. A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296: 92-100.

Gregory, T.R. 2005. Synergy between sequence and size in large-scale genomics. Nature Reviews Genetics 6: 699-708.

Gregory, T.R. 2006. Genomic puzzles old and new. ActionBioScience.org.

Hahn, M.W. and G.A. Wray. 2002. The g-value paradox. Evolution & Development 4: 73-75.

Harrison, P.M., A. Kumar, N. Lang, M. Snyder, and M. Gerstein. 2002. A question of size: the eukaryotic proteome and the problems in defining it. Nucleic Acids Research 30: 1083-1090.

International Human Genome Sequencing Consortium. 2004. Finishing the euchromatic sequence of the human genome. Nature 431: 931–945.

Mirsky, A.E. and H. Ris. 1951. The desoxyribonucleic acid content of animal cells and its evolutionary significance. Journal of General Physiology 34: 451-462.

Pennisi, E. 2006. Sea urchin genome confirms kinship to humans and other vertebrates. Science 314: 908-909.

Rast, J.P., L.C. Smith, M. Loza-Coll, T. Hibino, and G.W. Litman. 2006. Genomic insights into the immune system of the sea urchin. Science 314: 952-956.

Sea Urchin Genome Sequencing Consortium. 2006. The genome of the sea urchin Strongylocentrotus purpuratus. Science 314: 941-952.

Thomas, C.A. 1971. The genetic organization of chromosomes. Annual Review of Genetics 5: 237-256.

Yu, J. et al. 2002. A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296: 79-92.

Hits.

I began this blog largely as an experiment in public outreach. It is my belief that many people are interested in science, and that they would like an opportunity to interact with practicing scientists in a blog format. In this regard, I have been very happy to come across the blogs of other front line researchers, such as Jonathan Eisen’s The Tree of Life, Rosie Redfield’s RRResearch, John Dennehy’s The Evilutionary Biologist, Rod Page’s iPhylo, and John Logsdon’s Sex, Genes, and Evolution (Best. Title. Ever.), along with those of quite a number of active grad students.

The question was, would anyone visit my blog, in light of established (and unabashedly political and correspondingly popular) options such as PZ Myers’s Pharyngula (like anyone still needs a link) and Larry Moran’s Sandwalk, or the excellent science reporting of Carl Zimmer’s The Loom?

Well, after just under three weeks in the blogosphere, Genomicron has received over 1,250 hits from roughly 850 unique visitors. Not bad for an upstart. At least, the null hypothesis that a blog is not useful for outreach has taken a bit of a thrashing. Thanks to those who have stopped by, and I hope to see you again soon.


Comments on "Noncoding DNA and Junk DNA" (re-post).

The following is a re-post of my comments on the recently posted Noncoding DNA and Junk DNA at Sandwalk. Needless to say, I am quite pleased to see such active discussion about non-coding DNA. Passages in italics are excerpts from the original article.

TR Gregory said…

Ryan Gregory has serious doubts about the usefulness of the term as he explains in his excellent article A word about “junk DNA”.

Just to clarify, I think the term could be useful — indeed, it was useful when Ohno coined it. The problem is that it is seldom used in an appropriate way. If the meaning were specified explicitly to be “regions strongly suspected of being non-functional with evidence to back it up” (which, incidentally, is not the original definition according to Ohno (1972) or Comings (1972)), and if people used it only in this way, then I would not have a problem with this. But given the difficulty that people seem to have in accepting that some DNA may truly not have a function at the organism level, I don’t know if we could ever get it to be used with such precision.

…a new term, Junctional DNA, to describe DNA that probably has a function but that function isn’t known… think we don’t need to go there. It’s sufficient to remind people that lots of DNA outside of genes has a function and these functions have been known for decades.

That neologism was suggested in response to Minkel’s appeal for a term that would “make the distinction between functional and nonfunctional noncoding DNA clear to a popular audience”. My main suggestion was to call DNA by what it is known to be, if at all possible, by function (“regulatory DNA”, “structural DNA”) or by type (“pseudogene”, “transposable element”, “intron”). Your definition of “junk DNA” is also more precise than most usages, meaning that you specify that the term only be applied to sequences for which there is evidence (not just assumption) of non-function. That leaves us with something in between for journalists to talk about with a catchy buzzword. “Junctional DNA” lets them specify that we’re not talking about “junk DNA” or “functional DNA”; i.e., there is some evidence for function (e.g., being conserved) but no evidence of what that function is. The main utility would be to stop the very frustrating leap that gets made from “this 1% of the genome may have a function, so the whole thing must have this function” kind of reporting. Now they could say “another 1% has moved into the category of ‘junctional DNA’”. I think that would be considerably less misleading than current wording.

Note that I’m avoiding the term “noncoding” DNA here. This is because to me the term “coding DNA” only refers to the coding region of a gene that encodes a protein … there are many genes for RNAs that are not properly called coding regions so they would fall into the noncoding DNA category … introns in eukaryotic genomes would be “noncoding DNA” as far as I’m concerned. I think that Ryan Gregory and others use the term “noncoding DNA” to refer to all DNA that’s not part of a gene instead of all DNA that’s not part of the coding region of a protein encoding gene. I’m not certain of this.

By definition, non-coding DNA is, and always has been, everything other than exons. The reason this is relevant is that early work in genome biology assumed that there should be a 1 to 1 correspondence between DNA content and protein-coding gene number. This is work that occurred for at least two decades before the discovery of introns, pseudogenes, and other non-coding DNA. Now we have more descriptive names for the categories of DNA that are not the genes, all the genes, and nothing but the genes. I actually don’t know of anyone else who would have a problem calling introns, pseudogenes, and regulatory regions “non-coding DNA”. Certainly, Ohno, Crick, and many others have historically put introns in the same non-protein-coding grouping as pseudogenes. It’s just a category — you also have more specific subcategories to apply to each of the types of non-coding DNA. Perhaps your objection relates to an undue emphasis on the distinction between exons and everything else — well, that’s the history of the past half century of this field, so it should be no surprise that the terminology reflects this.

Read Gregory’s article for the short concise version of this dispute. What it means is that junk DNA threatens the worldviews of both Dembski and Dawkins!

Not quite. What you’re leaving out of this is the possibility of multiple levels of selection. In the original edition of The Selfish Gene (1976, p.76), Dawkins argued that “the simplest way to explain the surplus DNA is to suppose that it is a parasite, or at best a harmless but useless passenger, hitching a ride in the survival machines created by the other DNA”. Cavalier-Smith (1977) drew a similar conclusion (before he had read Dawkins), and Doolittle and Sapienza (1980) and Orgel and Crick (1980) [yes, that Crick] independently developed the concept of “selfish DNA” a few years later. This is an explicitly multi-level selection approach because it specifies that non-coding DNA can be present due to selection within the genome rather than exclusively on the organism (or gene, in Dawkins’s case) (see, e.g., Gregory 2004, 2005). (Incidentally, this idea of parasitic DNA dates back at least to 1945, when Gunnar Östergren characterized B chromosomes in this fashion.) Of course, they tended to do what Ohno did and applied this one idea to all non-coding DNA, which is too ambitious. The modern view is more pluralistic (see, e.g., Pagel and Johnstone 1992 vs. Gregory 2003). Some non-coding DNA is just accumulated “junk” (in the definition of evidence-supported non-function that you espouse). Some (perhaps most) is “selfish” or “parasitic” and persists because there is selection within the genome as well as on organisms (in fact, an argument could be, and has been, made that “selfish DNA” would be a much more accurate term than “junk DNA” for most non-coding DNA). Some non-coding DNA is clearly functional at the organism level, including regulatory regions and chromosome structure components. Some of these latter functional non-coding DNA sequences are derived from elements that originally were of one of the first two types, most notably transposable elements that take on a regulatory function through co-option (or, in another manner of thinking, that undergo a shift in level of selection).

Junk DNA is not noncoding DNA and anyone who claims otherwise just doesn’t know what they’re talking about.

I’m afraid I don’t follow what you mean here. By your definition, “junk DNA” is any non-functional sequence of DNA, including pseudogenes (i.e., the original meaning). Those sequences do not encode proteins. Hence, your version of junk DNA is non-coding. I think this reflects the confusion that is imposed by the term “junk DNA”, which is why I generally think it is more obfuscating than enlightening.

________

References

Cavalier-Smith, T. 1977. Visualising jumping genes. Nature 270: 10-12.

Comings, D.E. 1972. The structure and function of chromatin. Advances in Human Genetics 3: 237-431.

Dawkins, R. 1976. The Selfish Gene. Oxford University Press, Oxford.

Doolittle, W.F. and C. Sapienza. 1980. Selfish genes, the phenotype paradigm and genome evolution. Nature 284: 601-603.

Gregory, T.R. 2003. Variation across amphibian species in the size of the nuclear genome supports a pluralistic, hierarchical approach to the C-value enigma. Biological Journal of the Linnean Society 79: 329-339.

Gregory, T.R. 2004. Macroevolution, hierarchy theory, and the C-value enigma. Paleobiology 30: 179-202.

Gregory, T.R. 2005. Macroevolution and the genome. In The Evolution of the Genome (ed. T.R. Gregory), pp. 679-729. Elsevier, San Diego.

Ohno, S. 1972. So much “junk” DNA in our genome. In Evolution of Genetic Systems (ed. H.H. Smith), pp. 366-370. Gordon and Breach, New York.

Orgel, L.E. and F.H.C. Crick. 1980. Selfish DNA: the ultimate parasite. Nature 284: 604-607.

Östergren, G. 1945. Parasitic nature of extra fragment chromosomes. Botaniska Notiser 2: 157-163.

Pagel, M. and R.A. Johnstone. 1992. Variation across species in the size of the nuclear genome supports the junk-DNA explanation for the C-value paradox. Proceedings of the Royal Society of London, Series B: Biological Sciences 249: 119-124.


Peer review.

John Dennehy has posted an interesting summary on The Evilutionary Biologist about professional peer review [1]. He notes, along with Marc Hauser and Ernst Fehr, that delays imposed by slow reviewers can be a significant source of frustration with the peer review process. The suggestion by Hauser and Fehr (2007) is to institute a system of punishments and rewards to get reviewers to submit reviews on schedule. Interesting idea, though I strongly oppose intentionally subjecting anyone’s work to delay as punishment, no matter how dawdling they are as reviewers. The scientific community at large should not be held back in order to punish specific individuals. There is also the obvious difficulty that reviewers may begin to substitute speed for quality in their review of manuscripts. I am currently reviewing four papers for four different journals. It will take time to get through them, and I hope to get them all in on time, but rushing them to meet a deadline won’t help the peer review process.

Long turnaround times are a real issue, and I have my own stories (one paper took over a year to show up in print). But my complaint regarding peer review comes as a reviewer rather than as an author. One of the biggest frustrations comes when one reviews a paper carefully, provides detailed comments, points out significant problems with the data, analysis, or interpretation, and recommends that the paper be rejected in its present format — and then it shows up in one’s mailbox again, unaltered, after simply having been submitted to a different journal, or worse, appears in print in another journal with none of the errors corrected. If anything shakes my confidence in the efficacy of peer review, it is this.

I understand full well the pressure to publish, but something has to be done about the tendency to submit a rejected paper — sometimes without even fixing typos that have been pointed out — to journal after journal (my current record is reviewing the same paper three times for three journals) until it gets through reviewers who are willing to let the mistakes slide or who lack the expertise to recognize the problems.

My suggested solution is that authors should be required to submit all previous reviews to any new journal to which they are sending the same paper. They should be required to show the editor that changes have been made or to justify why they have not. Otherwise, the peer review process is undermined, the quality of the science suffers, and the reviewers’ time is completely wasted.

End rant.

_________

Notes

1 Not to be confused with the spectacle currently going on with regard to the flagellum paper.

References

Hauser M, Fehr E (2007) An incentive solution to the peer review problem. PLoS Biology 5: e107.


Genome size databases.

In case anyone is unaware of their existence, here are the links to the available genome size databases.

For a summary of the databases, see Gregory et al. (2007).

For a discussion about units of measurement in genome size, see here.

A summary of genome size ranges in various animals is available here.

A much smaller database of genome sizes that also includes some taxa besides animals, plants, and fungi is posted here.

For bacterial and archaeal (“prokaryote”) genome size data, see here and here and here.

For a list of completed and ongoing genome sequencing initiatives, see the Genomes OnLine Database (GOLD).

For vertebrate red blood cell sizes, see here.


Junctional DNA.

JR Minkel at the Scientific American blog has responded to the post on Evolgen about his earlier story regarding “junk DNA” (did you catch all that?). At the end of the post, he asks:

Scientists and scientist bloggers: Again, do you care [if journalists call it junk DNA]? If so, what term would you propose instead, or how would you make the distinction between functional and nonfunctional noncoding DNA clear to a popular audience?

Yes, I care, and here are my suggestions. If you mean the general category without any speculation either way about function, then it is simply and accurately “noncoding DNA”. If it has a function, then you specify what that function is: “regulatory DNA” or “structural DNA” or what have you. If the type of sequence is known, then you can use that as well or instead: “transposable elements” or “mobile DNA” or “pseudogenes” or “introns”. Maybe readers won’t know what those terms mean. This is a good opportunity to inform them.

What is missing is a term to describe a given collection of noncoding DNA for which there is thought to be some function, but for which that function and/or the type of sequence is unknown. This would reside somewhere between “junk DNA” (in the vernacular sense) and “functional DNA” (to which specific names can be applied). I therefore suggest the neologism “junctional DNA” to encompass this category. Note that Petsko (2003) suggested “funk DNA” to represent “functionally unknown DNA”, but I think “junctional DNA” is a little less, uh, funky.

Let me be even more specific. The proposed term “junctional DNA” derives from a dual etymology: 1) a simple portmanteau of “junk” and “functional”; 2) an indication that the sequences so described reside at the crossroads between DNA with no evident function and that with a clear function.

Two terms in one day — “the onion test” and “junctional DNA” — how ’bout that.

Incidentally, my annoyance with such reports has less to do with the terminology than with the fact that the highly conserved sequences in question make up about 5% of the total genome. To jump from this to imply that all noncoding DNA is recognized as functional is inappropriate and misleading. I also wish they would cite the source papers they reference; some of us would like to look up the primary material when we see a summary in a news story.

_______________

Update: Other bloggers (RPM of Evolgen in personal correspondence, Sandwalk) seem to think this term is not needed. I point out that this post was written in direct response to Minkel’s appeal for a term that would “make the distinction between functional and nonfunctional noncoding DNA clear to a popular audience”. Given that a journalist sees the need for such a term, and that it was coined in response to that need, I think “junctional DNA” could be a useful term.


The onion test.

I am not sure how official this is, but here is a term I would like to coin right here on my blog: “The onion test”.

The onion test is a simple reality check for anyone who thinks they have come up with a universal function for non-coding DNA1. Whatever your proposed function, ask yourself this question: Can I explain why an onion needs about five times more non-coding DNA for this function than a human?

The onion, Allium cepa, is a diploid (2n = 16) plant with a haploid genome size of about 17 pg. Human, Homo sapiens, is a diploid (2n = 46) animal with a haploid genome size of about 3.5 pg. This comparison is chosen more or less arbitrarily (there are far bigger genomes than onion, and far smaller ones than human), but it makes the problem of universal function for non-coding DNA clear2.

Further, if you think perhaps onions are somehow special, consider that members of the genus Allium range in genome size from 7 pg to 31.5 pg. So why can A. altyncolicum make do with one fifth as much regulation, structural maintenance, protection against mutagens, or [insert preferred universal function] as A. ursinum?

Left, A. altyncolicum (7 pg); centre, A. cepa (17 pg); right, A. ursinum (31.5 pg).
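
To put numbers on these comparisons, here is a minimal Python sketch that uses only the genome sizes quoted above. The conversion factor of 978 Mb per picogram is a commonly used value that is not given in this post, so treat it as an assumption; the function names are made up for illustration. The sketch compares total haploid genome sizes, which serve here as a rough stand-in for the amount of non-coding DNA.

```python
# Haploid genome sizes in picograms (pg), as quoted in the post.
GENOME_SIZE_PG = {
    "Homo sapiens": 3.5,
    "Allium altyncolicum": 7.0,
    "Allium cepa": 17.0,
    "Allium ursinum": 31.5,
}

# Assumption (not stated in the post): 1 pg of DNA is roughly 978 megabases.
MB_PER_PG = 978.0


def fold_difference(species_a: str, species_b: str) -> float:
    """Return how many times larger the genome of species_a is than species_b."""
    return GENOME_SIZE_PG[species_a] / GENOME_SIZE_PG[species_b]


def size_in_mb(species: str) -> float:
    """Convert a quoted genome size from picograms to megabases."""
    return GENOME_SIZE_PG[species] * MB_PER_PG


if __name__ == "__main__":
    # Onion versus human: the "about five times" in the onion test.
    print(f"A. cepa / H. sapiens: {fold_difference('Allium cepa', 'Homo sapiens'):.1f}x")
    # Largest versus smallest Allium species mentioned in the post.
    print(f"A. ursinum / A. altyncolicum: "
          f"{fold_difference('Allium ursinum', 'Allium altyncolicum'):.1f}x")
    # Approximate sizes in megabases, under the assumed conversion factor.
    for name in GENOME_SIZE_PG:
        print(f"{name}: about {size_in_mb(name):,.0f} Mb")
```

Any proposed universal function would have to account for ratios of this size, both between onion and human and within a single genus.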


There you have it. The onion test. To be applied to any ambitious claims that a universal function has been found for non-coding DNA.

____________

1 I do not endorse the use of the term “junk DNA”, which I think has deviated far too much from its original meaning and is now little more than a loaded buzzword; the descriptive term “non-coding DNA” is what I use to refer to the majority of eukaryotic sequences (of various types) that do not encode protein products.

2 Some non-coding DNA certainly has a function at the organismal level, but this does not justify a huge leap from “this bit of non-coding DNA [usually less than 5% of the genome] is functional” to “ergo, all non-coding DNA is functional”.