Non-functional DNA: the burden of proof.

If one studies a genome sequence and comes across a region that has the length and arrangement of a protein-coding gene, with an open reading frame beginning at a start codon and running uninterrupted to a stop codon, then one can reasonably infer that this sequence is likely to be functional, even if no other evidence is yet in hand for what its potential protein product may do. Anyone who doubted that this was actually a protein-coding gene would be expected to provide additional evidence: for example, by showing that it is really an artifact or a pseudogene, or by indicating that it is probably not functional despite the characteristics suggesting that it is.

Many non-genic sequences that are conserved across taxa, or that exhibit characteristics consistent with regulatory regions, binding sites, or structural components, can be treated in the same way as the open reading frame described above. For most non-coding DNA, however, there is no such evidence of probable function. In fact, as the Sandwalk series on “junk DNA” notes, most non-coding DNA is of a type for which there is little reason to expect function. Inactive vestiges of transposable elements and pseudogenes may occasionally be functional, but given what we know about how they form, there is no reason to assume that they all are. Moreover, the massive differences in genome size (i.e., amount of non-genic DNA) among taxa, including species that would be expected to have similar regulatory, mutational buffering, or structural requirements, suggest that much of this DNA lacks any universally applicable function.

The default assumption by those who accept non-adaptive evolutionary outcomes is that much or even most of a larger genome is not functional at the cell or organism level. This is because of what we do know about these sequences, not because of what we don’t know. The burden of providing evidence to the contrary therefore rests on those who argue that all, or even a large percentage, of non-coding DNA has a function. Meeting that burden requires an explanation for the variation in DNA amount among eukaryotes, in addition to empirical evidence of function, at least in general terms if not an indication of what the function probably is.

Dinosaurs made from pseudogenes?

Matt Ridley, author of such books as The Red Queen, Genome, and The Origins of Virtue (and not to be confused with biologist Mark Ridley), asks the question “Will we clone a dinosaur?” in Time Magazine. His answer, at least in terms of the Jurassic Park sense of cloning a dinosaur from ancient DNA, is either “no” or “definitely not”.

Yet, Ridley argues for a different possible revival of dinosaur-like animals, ones built through genetic engineering. He notes three things that he considers encouraging in this regard. The first is that dinosaurs aren’t really extinct, or at least that they did leave a diverse line of descendants — namely birds. Second, important regulatory genes, such as the Hox genes that play a major role in directing development, are generally quite conserved across animal lineages. No doubt, the third will be of particular interest to readers of this blog and indeed Ridley singles it out:

Third, and most exciting, geneticists are finding many “pseudogenes” in human and animal DNA–copies of old, discarded genes. It’s a bit like finding the manual for a typewriter bound into the back of the manual for your latest word-processing software. There may be a lot of interesting obsolete instructions hidden in our genes.

Put these three premises together, and the implication is clear: the dino genes are still out there.

I remember an episode of Star Trek: The Next Generation in which the introns of the crew members’ genomes were “reactivated”, causing them to de-evolve through various stages of their species’ ancestries. Of course, introns include various types of DNA sequence, most of which are probably not something that could be activated in any sense. The writers probably meant to invoke pseudogenes, as Ridley did.

Pseudogenes are duplicates of protein-coding genes that either retain the intron/exon structure of the original gene (classical pseudogenes) or lack introns because they were reverse-transcribed from an RNA transcript and reinserted into the genome (processed pseudogenes). Either way, they are defined by two characteristics: 1) their obvious similarity to, and derivation from, protein-coding genes, and 2) the fact that they no longer function in coding for a protein.

Pseudogenes can form at any time in the ancestry of a lineage, may be derived from a wide variety of genes, and may degrade by mutation or be partially deleted without consequence due to a relaxation of selection given that they no longer fulfill sequence-specific functions. Taken together, this means that it can be difficult to identify something as a pseudogene, let alone what the original sequence encoded and in which ancestor the duplication occurred. In other words, pseudogenes are not like an easily legible manual of a particular obsolete technology. They are a jumble of distorted and half-erased text from a manual that is continually being modified haphazardly.


Hat tip: Evolving Thoughts

Junk at Sandwalk.

Anyone who reads this blog but not Sandwalk (if any such readers exist) should go right now and see Larry’s posts on junk DNA. Although I do not much care for the term “junk DNA”, because it is often employed ambiguously, Larry is careful to define it explicitly as sequences for which the evidence indicates non-function. The posts on the distinct components of the genome that are considered junk under this definition are:

Junk in your genome: LINEs

Junk in your genome: SINEs

Junk in your genome: pseudogenes

Junk in your genome: protein-encoding genes

A collection of related posts is compiled under Theme: genomes and junk DNA.


Incidental DNA revisited.

Note – this post has been updated since originally posted.

In the recent exchange regarding my post about genome size and code bloat, one of the authors of the study in question made the following claim:

In its conclusion prof. Gregory suggests that we claim that “Non-coding DNA does accumulate “so that” it will result in longer-term evolutionary advantage”.
We ABSOLUTELY NEVER stated such a non-sense. It is curious that the same accuse was moved by prof. Gregory in its article “Coincidence, coevolution, or causation? DNA content, cell size, and the C-value enigma”, that we cite in our paper, to an article by Jain that we also cite in our paper. So, either prof. Gregory has a very poor opinion of our intelligence, or he thinks that we do not read the articles that we cite. Let us state, unambiguously, what we and Jain really say: “IF does exist a mechanism for genome size increase, THEN maybe the resulting long-term advantage can overcome the short-term disadvantage” (Jain was referring to the selfish dna as the genome increasing mechanism while we do not give any preference). Prof. Gregory reverts the implication: “IF there is a long-term advantage THEN the mechanism of genome increase is the product of selection”, and then explains us that it can’t be true. Incidentally, in the case of Jain, I think that what he was really intending can be clearly understood just by the title: “Incidental DNA”.

When someone suggests that one has misinterpreted the claims of an author, the appropriate thing to do is to consult the original article to be sure. So, I looked up the Jain (1980) letter, some quotes from which are given here (with emphasis):

Natural selection is concerned not only with the existing variability but even more so with mechanisms which ensure its continued availability. If there is intragenomic selection leading to rapid build-up of some of the DNA sequences (the selfish DNA of Doolittle and Sapienza and Orgel and Crick) we must treat this part of DNA as incidental to the fundamental process of mutability so vital for ensuring continued supply of raw material for the production of new genes. It does not follow that all of the DNA produced in this manner will, in fact, acquire a function. A large part of it (or even all of it) may not do so and may be eliminated only on an evolutionary time scale. Meanwhile, new DNA of the same and similar kind may continue to be produced so that at a given point of time there will always be large amounts of non-specific DNA. This fraction is best described as ‘incidental’ rather than ‘selfish’ DNA. We may call it incidental because it is a byproduct of the inherent property of mutability of the genome, a characteristic to which natural selection attaches great importance even if it leads to the production of repeated sequences and a wasteful deployment of energy. Viewed in this light, non-functional DNA is very much a product of natural selection — a selection operating for mutability per se. Its relative abundance is probably a function of its nonfunctional nature for any other DNA which carries information of one kind or another would create genetic imbalance and would be quickly rejected.

Nature places considerable premium on playing safe so that it will not run short of raw material even if this means indiscriminate production leading to sequences which are destined to remain functionless.

Now, Dr. Musso may interpret this very differently, but I take it to mean that Jain argued that non-coding DNA was preserved by natural selection specifically because it may become useful as a source of new genes. Moreover, this would have to be non-coding DNA that was preserved in this way because adding coding regions for future use would create complications in genic function. I have discussed in various posts (e.g. here, here) why this notion is untenable.

UPDATE: My interpretation of Jain (1980) was that he was arguing that non-coding DNA is preserved by selection because it contributes to mutability. Further discussion with Jonathan Badger, and another re-read of Jain (1980) in light of an alternative interpretation, has bolstered the conclusion that he was in fact suggesting something different from what I said. The much more reasonable interpretation, and what I now think he was actually arguing, is that the genome is inherently unstable for reasons unrelated to non-coding DNA, that this instability is maintained by selection (though, it must be said, not selection in the usual sense, but at the interlineage level), and that the accumulation of non-coding DNA is a byproduct of this. I will accept that the authors of the paper that began these discussions saw it this way, though their phrasing, “IF does exist a mechanism for genome size increase, THEN maybe the resulting long-term advantage can overcome the short-term disadvantage”, is easily confused with arguing that non-coding DNA generates some long-term advantage that overcomes its immediate disadvantage (rather than representing a side-effect of some other process with a long-term advantage). And then there is still the issue of what the original article stated:

From this point of view, we can think of TMs in our simulations as organisms trying to increase their gene pools adding new genes assembled from junk DNA. If the organisms possess more junk DNA it is possible to test more “potential genes” until a good one is found.

Though I doubt he will read this post, I do apologize to Dr. Jain if indeed I misinterpreted his argument. That said, I do think his phrasing regarding selection is imprecise and that this probably contributed to the confusion. In my original citation, written 8 years ago, I cited Jain as an example of the “noncoding DNA is there because it might be useful” line of thinking; while he may have been an inappropriate example, this notion is still around and still needs to be addressed. In any case, I have not changed my opinion that the article that started this discussion drew undue links between a model and biological genome evolution, and that its results have little bearing on the genome size question.


Update, part two

I hate to keep updating this post (though I have preserved the original form with strikeouts), but I just knew I was not the only person to have interpreted Jain (1980) as suggesting that noncoding DNA was preserved because of its potential long-term benefits. It seems W.F. Doolittle (an originator of the “selfish DNA” idea, and whose paper Jain was commenting on) got the same impression. I will quote at length from Doolittle (1982), in which he discussed the varying reactions to the notion of selfish DNA shortly after it was proposed (italics in original, most in-line references omitted).

(c) The long-term evolutionary advantage of genomic rearrangements. Transposable elements promote genetic rearrangements, and the kinds of rearrangements (transpositions, deletions and inversions) seem similar in both prokaryotes and eukaryotes. This (and the occasional turning on and off of genes adjacent to the site of insertion) appears to be all that many, perhaps most, transposable elements actually do for the organism which bears them and it does not seem to be a good thing. Selection operating on individuals should eliminate such elements. Thus many have claimed that transposable elements are maintained because they play important “evolutionary roles”. This is not a straw man which Carmen Sapienza and I set up in order to have a hypothesis against which to pit the notion of selfish DNA. I can only document this with quotations not, I hope, taken out of context:

“Whether they (insertion sequences) exert functions at these positions or are simply kept in reserve as prefabricated units for the evolution of new control circuits remains unclear.”

“It is possible that the sole function of these elements is to promote genetic variability…”

“A tenable hypothesis regarding the function of transposition is that it allows adaptation of a particular cell to a new environment.”

“All these alterations could lead to changes in structural gene function and in the control of gene expression and could provide organisms with a means of rapid adaptation to environmental change.”

Evolutionary roles have similarly been invoked for heterochromatic highly repetitive DNAs, whose presence does affect recombination in neighbouring and distant regions and whose characteristics may (although the experimental evidence is not strong) affect chromosome pairing.
Neither we [Doolittle and Sapienza] nor Drs Orgel and Crick denied that transposable elements or heterochromatic highly repetitive DNAs have such evolutionary effects, nor that these effects might not be important, perhaps even as the basis for macroevolutionary change. What we were arguing against was the assumption that these elements arose through and are maintained by natural selection because of these effects.
This assumption is often only implicit in the writings of many who suggest that the only roles of mobile dispersed and tandemly reiterated DNAs are evolutionary ones. Thus we have been accused by some of these of misrepresenting their positions and thus indeed of attacking straw men after all. I apologize to those who feel we have put words in their mouths. But I do not see how statements that the only “functions” of transposable elements or highly repetitive DNAs are to generate or modulate genetic variability can mean anything other than that natural selection maintains, and probably even gave rise to, such elements through selection for such “functions”. Shapiro (1980) has been brave enough to articulate this view outright:

“Why, then, are insertion elements not removed from the genome? I think the answer must be that there is a selective advantage in the ability to generate new chromosome primary structure.”

Those who speculate on the function of excess DNA have formulated this position in a more extreme way. For instance, Jain (1980) states

“at a given point of time there will always be large amounts of non-specific DNA. This fraction is best described as ‘incidental’ rather than ‘selfish’ DNA. We may call it incidental because it is a byproduct of the inherent property of mutability of the genome, a characteristic to which natural selection attaches great importance even if it leads to the production of repeated sequences and a wasteful deployment of energy. Viewed in this light, non-functional DNA is very much a product of natural selection — a selection operating for mutability per se.

The question of whether natural selection operates in this way, that is of whether the evolutionary process itself evolves under the direct influence of natural selection, lies at the root of the real controversy over whether self-maintaining, structured, genomic components without phenotypic function can properly be called “selfish”. This may seem like a small and metascientific quibble. In fact it is not; it is one of the most troublesome questions in evolutionary biology today. It manifests itself in debates over the origin and maintenance of mechanisms involved in the optimization of mutation rates, recombination, sexual reproduction, altruistic behaviours of all sorts and even speciation. Such mechanisms are not clearly advantageous to, and can be detrimental to, the fitness of the individual. Yet they may increase the long-term survival properties of the group to which the individual belongs, thus seeming to be the product of what has been called “group selection”.


Doolittle, W.F. (1982). Selfish DNA after fourteen months. In: Genome Evolution (G.A. Dover and R.B. Flavell, eds.), Academic Press, New York, pp.3-28.

Jain, H.K. (1980). Incidental DNA. Nature 288: 647-648.

Signs of function in non-coding RNAs in mouse brain.

Over on his blog, Greg Laden points to some new work by John Mattick’s group on non-coding RNA expression in mouse brains. It’s interesting stuff, and worth a look. Please bear in mind as you do, however, that non-protein-coding but functional RNA is nothing new. Ribosomes are made of non-coding RNA, for one thing. Sadly, Greg seems to have bought into the distortions (several promoted by Mattick) about what people have said about non-coding DNA.

The “Junk DNA” story is largely a myth, as you probably already know. DNA does not have to code for one of the few tens of thousands of proteins or enzymes known for any given animal, for example, to have a function. We know that. But we actually don’t know a lot more than that, or more exactly, there is not a widely accepted dogma for the role of “non-coding DNA.” It does really seem that scientists assumed for too long that there was no function in the DNA.

As I have noted, people have been proposing functions for non-coding DNA since the beginning. As I noted in one of my first Genomicron posts,

Those who complain about a supposed unilateral neglect of potential functions for non-coding DNA simply have been reading the wrong literature. In fact, quite a lengthy list of proposed functions for non-coding DNA could be compiled (for an early version, see Bostock 1971). Examples include buffering against mutations (e.g., Comings 1972; Patrushev and Minkevich 2006) or retroviruses (e.g., Bremmerman 1987) or fluctuations in intracellular solute concentrations (Vinogradov 1998), serving as binding sites for regulatory molecules (Zuckerkandl 1981), facilitating recombination (e.g., Comings 1972; Gall 1981; Comeron 2001), inhibiting recombination (Zuckerkandl and Hennig 1995), influencing gene expression (Britten and Davidson 1969; Georgiev 1969; Nowak 1994; Zuckerkandl and Hennig 1995; Zuckerkandl 1997), increasing evolutionary flexibility (e.g., Britten and Davidson 1969, 1971; Jain 1980; reviewed critically in Doolittle 1982), maintaining chromosome structure and behaviour (e.g., Walker et al. 1969; Yunis and Yasmineh 1971; Bennett 1982; Zuckerkandl and Hennig 1995), coordinating genome function (Shapiro and von Sternberg 2005), and providing multiple copies of genes to be recruited when needed (Roels 1966).

I am not about to claim that the study hasn’t shown evidence of function for these non-coding regions. I think it’s quite interesting, and it wouldn’t surprise me if lots of non-coding RNA turned out to have a regulatory function. But let’s be realistic with this. The authors consider a “long” non-coding RNA transcript to be >200bp. So let’s just round up and say 1,000bp for convenience. They identified around 850 potentially functional sequences (and ~500 that do not show evidence of functional expression, at least in the brain) and estimate that there are 20,000 of them all told. 1,000bp x 20,000 = 20Mb. The mouse genome is about 3Gb. In other words, this study, even read generously, has identified possible function for 0.7% of the mouse genome.
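That back-of-the-envelope calculation is easy to verify. A quick sketch using the same rounded figures quoted above (1,000 bp per transcript, 20,000 transcripts, a ~3 Gb mouse genome):

```python
# Back-of-the-envelope estimate of how much of the mouse genome
# the study's long non-coding RNAs could account for.
# All figures are the rounded values quoted in the text above.

transcript_length_bp = 1_000        # generous rounding up from ">200 bp"
estimated_transcripts = 20_000      # the study's extrapolated total
genome_size_bp = 3_000_000_000      # mouse genome, roughly 3 Gb

covered_bp = transcript_length_bp * estimated_transcripts
fraction = covered_bp / genome_size_bp

print(f"{covered_bp / 1e6:.0f} Mb covered")   # 20 Mb
print(f"{fraction:.1%} of the genome")        # 0.7%
```

Even doubling both the transcript length and the transcript count would still leave the total under 3% of the genome, which is the point: a real and interesting result, but not a license to declare the whole genome functional.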

In summary, cool research. Important question, neat result. But let’s not start the usual extrapolationfest that normally accompanies such publications.

Junk DNA and ID redux.

Just a reminder, these are the important points under discussion:

* Proponents of ID themselves clearly suggest that “junk DNA” will mostly or all be functional.

* No unambiguous explanation has been given for why ID must assume that non-coding DNA is functional, especially since they say nothing can be known about the designer or the mechanism.

* The existence of much non-functional DNA would not necessarily refute the idea of design, as many human-designed structures have redundant, non-functional, or even counterproductive characteristics. It would, however, challenge certain assumptions about the designer and the mechanism, which again is why these must be made explicit if the junk DNA argument is to be invoked. Therefore, this is only a useful prediction if one includes details about the mechanism of design.

* The demonstration that all or most non-coding DNA is functional would not support ID to the exclusion of evolution, because a strictly adaptationist interpretation of Darwinian processes has always been taken to predict function as well.

* The demonstration that all or most non-coding DNA in the human genome is functional would still leave the question unanswered as to why the designer put five times more in onion genomes.

* Many functions that have been proposed or demonstrated are dependent on the process of co-option, the same process that is involved in the evolution of complex features.

* Evidence for function in non-coding DNA comes from analyses using evolutionary methods. Other approaches, such as deleting portions of it, have not supported the hypothesis that it is functional.

* The current evidence for function, and other details about how non-coding DNA forms, both suggest that most non-coding DNA is non-functional, or at least that this is the most plausible condition pending much more evidence.

Feel free to comment, but please address these points directly.

Is most of the human genome functional?

I first became interested in genome size because of its tie-ins with important evolutionary questions in which I was (and still am) interested, such as punctuated vs. gradual patterns, levels of selection, and adaptive vs. non-adaptive processes. What I didn’t realize was that one component of the question, the quantity of DNA that is non-functional (but not necessarily inconsequential) with regard to the phenotype of the organism, is such a hot-button issue. I had vague inklings at first that young-earth creationists would object to the idea of non-functional DNA, because God, as they say, don’t make no junk. (Why intelligent design proponents, who purport to take a strictly scientific view of the question, also assume that non-coding DNA cannot be non-functional remains unstated.) And of course there has always been a persistent undertone in biology that non-coding DNA must be doing something or it would have been deleted. This latter view, which derives directly from a hardcore adaptationist approach, undermines the creationist claim that “Darwinism” has prevented researchers from considering functions for non-coding DNA. Indeed, the main motivation for the early papers on “selfish DNA” was to counter this very adaptationist assumption (Doolittle and Sapienza 1980).

Creationist nonsense about DNA does not surprise me. What has intrigued me much more is the debate among biologists about this, and the rather questionable claims, suppositions, and extrapolations that get made not just by the media but by various scientists themselves.

Take Francis Collins. He’s a major player in genome biology and led the charge by the public Human Genome Project. And yet, he makes claims that non-coding DNA may be present in the genome “just in case” it needs to be put to use in the future. This makes no sense from an evolutionary perspective. It would be tempting to attribute this to Collins’s adherence to the notion of theistic evolution, but in fact one can find this sort of fuzzy foresight argument being brought up by lots of authors. I suppose it’s just disappointing that there is not better communication between genome biology and evolutionary biology.

The case that frustrates me most is that of John Mattick. He of the worst figure ever is one of the primary promulgators of the view that scientists have overlooked possible function for non-coding DNA and that this is “one of the biggest mistakes in the history of molecular biology” that can only be corrected by a “new paradigm”, and so on. Basically, the argument seems to be that much of the non-coding portion of a given genome is involved in regulation and such. In the past, Mattick has refrained from pinning down an estimate of how much non-coding DNA he believes is functional, but his presentation of (extremely selective) data left little doubt that he considers more non-coding DNA to be correlated with greater complexity. But now we’re starting to get some more explicit and increasingly bold claims.

As Check (2007) pointed out in a news article in Nature,

Mattick thinks scientists are vastly underestimating how much of the genome is functional. He and Birney have placed a bet on the question. Mattick thinks at least 20% of possible functional elements in our genome will eventually be proven useful. Birney thinks fewer are functional.

Now consider this quote by Comings (1972), who was the first person to use the term “junk DNA” extensively (even before Ohno’s (1972) coinage appeared in print):

These considerations suggest that up to 20% of the genome is actively used and the remaining 80+% is junk. But being junk doesn’t mean it is entirely useless. Common sense suggests that anything that is completely useless would be discarded. There are several possible functions for junk DNA.

So, even if Mattick is right about 20% of the human genome being functional (which is considered a rather high estimate on the basis of available data), he would still merely be agreeing with the author of the first major discussion of junk DNA.

Now, I should point out that I do not have a vested interest in how much of the human genome is functional. 5%? Fine. 20%? Fine. 50%? Ok. I will go where the data indicate. My reason for rejecting the notion of “more complexity means more DNA” is comparative: I refer you to the “onion test” for a simple illustration. However, as readers of Genomicron already know, I find it rather irksome when people take any new finding about (potential) function in some part of the human genome and extrapolate this to mean that all DNA in every genome must be serving some role.

Anyway, back to what Mattick suggests. As noted, for the most part he has gone about arguing for large-scale function more by hint than by direct claim. However, he has finally said the following (Pheasant and Mattick 2007):

Thus, although admittedly on the basis of as yet limited evidence, it is quite plausible that many, if not the majority, of the expressed transcripts are functional and that a major component of genomic information is rapidly evolving regulatory DNA and RNA. Consequently, it is possible that much if not most of the human genome may be functional. This possibility cannot be ruled out on the available evidence, either from conservation analysis or from genetic studies, but does challenge current conceptions of the extent of functionality of the human genome and the nature of the genetic programming of humans and other complex organisms. [Emphasis added]

It seems to me that “we can’t rule this out” is not a reason to think that something is plausible, let alone true. In fact, the existence of mechanisms such as transposable element spread and the pseudogenization of duplicate genes suggests that there is good reason to expect much (probably most) of the genome to be non-functional unless data show otherwise. Some TEs have taken on a function, some cause disease, some are merely benign or only slightly detrimental. The proportions of non-coding elements in each of these categories remain to be determined, but they are not all equally likely by default.

The question of which sequences are functional, and in what way, is one of the more contentious and therefore interesting ones in genome biology. On the one hand, new information from various sources, including the ENCODE project, indicates that much non-coding DNA is transcribed, though it remains an open question whether this reflects function or noise. On the other hand, a recent analysis has suggested that as many as 4,000 sequences within the human genome initially thought to be genes are not really genes after all (Clamp et al. 2007), bringing the total gene count down to around 20,000.

Some people, mostly creationists and strict adaptationists (strange bedfellows, I agree) desperately want the vast non-coding majority of eukaryote DNA to have a function. They latch onto any new discovery of function in some segment of the genome or another (or indeed, any mere restatement of what many authors have been saying since the 1970s) and consider their position supported. The rest of us will just have to wait and see.



Check, E. (2007). Genome project turns up evolutionary surprises. Nature 447: 760-761.

Clamp, M., B. Fry, M. Kamal, X. Xie, J. Cuff, M.F. Lin, M. Kellis, K. Lindblad-Toh, and E.S. Lander (2007). Distinguishing protein-coding and noncoding genes in the human genome. Proceedings of the National Academy of Sciences USA 104: 19428-19433.

Comings, D.E. 1972. The structure and function of chromatin. Advances in Human Genetics 3: 237-431.

Doolittle, W.F. and C. Sapienza. 1980. Selfish genes, the phenotype paradigm and genome evolution. Nature 284: 601-603.

Ohno, S. 1972. So much “junk” DNA in our genome. In Evolution of Genetic Systems (ed. H.H. Smith), pp. 366-370. Gordon and Breach, New York.

Pheasant, M. and J.S. Mattick (2007). Raising the estimate of functional human sequences. Genome Research 17: 1245-1253.

Genome size, code bloat, and proof-by-analogy.

I recently did an interview with New Scientist for what, I am happy to say, was one of the most reasonable popular reviews of “junk DNA” that has appeared in recent times (Pearson 2007). My small section appeared in a box entitled “Survival of the fattest”, in which most of the discussion related to diversity in genome size and its causes and consequences. It even included mention of “the onion test“, which I proposed as a tonic for anyone who thinks they have discovered “the” functional explanation for the existence of vast amounts of non-coding DNA within eukaryotic genomes. Also thrown in, though not because I said anything about it, was a brief analogy to computer code: “Computer scientists who use a technique called genetic programming to ‘evolve’ software also find their pieces of code grow ever larger — a phenomenon called code bloat or ‘survival of the fattest'”.

I do not follow the literature of computer science, though I am aware that “genetic algorithms” (i.e., program evolution by mutation and selection) are a useful approach to solving complex problems. When I read the line about code bloat, my impression was that it probably gave other readers an interesting, though obviously tangential, analogy by which to understand that streamlined efficiency of any coding system, genetic or computational, is not a given when the system is the product of a messy process like evolution.

More recently, I have been made aware of an electronic article published in the (non-peer-reviewed) online repository known as arXiv (pr. “archive”; the “X” is really “chi”) that takes this analogy to an entirely different level. Indeed, the authors of the paper (Feverati and Musso 2007) claim to use a computer model to provide insights into how some eukaryotic genomes become so bloated. That is, instead of applying biological observations (i.e., naturally evolving genomes can become large) to a computational phenomenon (i.e., programs evolved in silico can become large, too), the authors flipped the situation around and decided that a computer model could provide substantive information about how genomes evolve in nature.

I will state up front that I am rarely (read: never) convinced by proof-by-analogy studies. Yes, modeling can be helpful if it provides a simplified way to test the influence of individual parameters in complex systems, but only insofar as the conclusions are then compared against reality. When it comes to something like genome size evolution, which applies to millions of species (billions if you consider that every species that has ever lived, about 99% of which are extinct, had a genome) and billions of years, one should be very skeptical of a model that involves only a handful of simplified parameters. This is especially true if no effort is made to test the model in the one way that counts: by asking if it conforms to known facts about the real world.

The abstract of the Feverati and Musso (2007) article says the following:

The development of a large non-coding fraction in eukaryotic DNA and the phenomenon of the code-bloat in the field of evolutionary computations show a striking similarity. This seems to suggest that (in the presence of mechanisms of code growth) the evolution of a complex code can’t be attained without maintaining a large inactive fraction. To test this hypothesis we performed computer simulations of an evolutionary toy model for Turing machines, studying the relations among fitness and coding/non-coding ratio while varying mutation and code growth rates. The results suggest that, in our model, having a large reservoir of non-coding states constitutes a great (long term) evolutionary advantage.

I will not embarrass myself by trying to address the validity of the computer model itself — I am but a layman in this area, and I am happy to assume for the sake of argument that it is the single greatest evolutionary toy model for Turing machines ever developed. It does not follow, however, that the authors are correct in their assertion that they “have developed an abstract model mimicking biological evolution”.

As I understand it, the simulation is based on devising a pre-defined “goal” sequence, similarity to which forms the basis of selecting among randomly varying algorithms. As algorithms undergo evolution by selection, they tend to accumulate more non-coding elements, and the ones that reach the goal most effectively turn out to be those with an “optimal coding/non-coding ratio” which, in this case, was less than 2%. The implication, not surprisingly, is that genomes evolve to become larger because this improves long-term evolvability by providing fodder for the emergence of new genes.
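To make the setup concrete, here is a caricature of that kind of model — entirely my own construction, with the parameter names merely echoing the paper's pm and pi; everything else is invented for illustration. Machines carry coding and non-coding states; non-coding states are appended at a constant rate with no deletion (the paper's premise 1); point mutation can flip a state's value or recruit a non-coding state into a coding one; and fitness is simply similarity to a pre-defined goal tape.

```python
import random

random.seed(1)

TARGET = [1, 0, 1, 1, 0, 0, 1, 0]   # the pre-defined "goal tape"
P_MUT = 0.02     # point-mutation rate per state (cf. pm)
P_GROW = 0.1     # state-increase rate (cf. pi)
NONCODING = None # an inactive state, ignored by the fitness function

def fitness(machine):
    coding = [s for s in machine if s is not NONCODING]
    return sum(a == b for a, b in zip(coding, TARGET))

def step(machine):
    out = []
    for s in machine:
        # Point mutation may flip a value, silence a coding state,
        # or recruit a non-coding state into a coding one.
        if random.random() < P_MUT:
            s = random.choice([0, 1, NONCODING])
        out.append(s)
    # Constant-rate growth with no deletion mechanism (premise 1).
    if random.random() < P_GROW:
        out.append(NONCODING)
    return out

pop = [[random.choice([0, 1])] for _ in range(100)]
for _ in range(500):
    pop.sort(key=fitness, reverse=True)
    pop = [step(m) for m in pop[:50] for _ in range(2)]

best = max(pop, key=fitness)
n_coding = sum(s is not NONCODING for s in best)
print("fitness:", fitness(best), "states:", len(best), "coding:", n_coding)
```

Run this and the machines grow well beyond the length of the goal tape, carrying a reservoir of inactive states from which new coding states are occasionally recruited — which is, as far as I can tell, the entire mechanism behind the paper's conclusion. Note how many biological complications the caricature (like the model) simply omits.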

Before discussing this conclusion, it is worth considering the assumptions that were built into the model. The authors note that:

For the sake of simplicity, we imposed various restrictions on our model that can be relinquished to make the model more realistic from a biological point of view. In particular we decided that:

  1. non-coding states accumulate at a constant rate (determined by the state-increase rate pi) without any deletion mechanism [this is actually two distinct claims rolled into one],
  2. there is no selective disadvantage associated with the accumulation of both coding and non-coding states,
  3. the only mutation mechanism is given by point mutation and it also occurs at a constant rate (determined by the mutation rate pm),
  4. there is a unique ecological niche (defined by the target tape),
  5. population is constant,
  6. reproduction is asexual.

As noted, I am fine with considering this a fantastic computer simulation — it just isn’t a simulation that has any resemblance to the biological systems that it purports to mimic. Consider the following:

  • Although some authors have suggested that non-coding DNA accumulates at a constant rate (e.g., Martin and Gordon 1995), this is clearly not generally true. All extant lineages can trace their ancestries back to a single common ancestor, and thus all living lineages (though not necessarily all taxonomic groups) have existed for exactly the same amount of time. And yet the amount of non-coding DNA varies dramatically among lineages, even among closely related ones. Ergo, the rate of accumulation of non-coding DNA differs among lineages. Premise 1 is rejected.
  • The insertion of non-coding elements can be selectively relevant not only in terms of effects on protein-coding genes (many transposable elements are, after all, disease-causing mutagens), but also in terms of bulk effects on cell division, cell size, and associated organism-level traits (Gregory 2005). Premise 2 is rejected.
  • The accumulation of non-coding DNA in eukaryotes does not occur by point mutation, except in the sense that genes that are duplicated may become pseudogenized by this mechanism. Indeed, the model seems only to involve a switch between coding and non-coding elements without the addition of new “nucleotides”, which makes it even more distant from true genomes. Moreover, the primary mechanisms of DNA insertion, including gene duplication and inactivation, transposable element insertion, and replication and recombination errors, do not occur at a constant rate. In fact, the presence of some non-coding DNA can have a feedback effect in which the likelihood of additional change is increased, be it by insertions (e.g., into non-coding regions, such that mutational consequences are minimized) or deletions (e.g., illegitimate recombination among LTR elements) or both (e.g., unequal crossing over or replication slippage enhanced by the presence of repetitive sequences). Premise 3 is rejected.
  • Evolution does not have a pre-defined goal. Evolutionary change occurs along trajectories that are channeled by constraints and history, but not by foresight. As long as a given combination of features allows an organism to fill some niche better than alternatives, it will persist. Not only this, but models like the one being discussed are inherently limited in that they include only one evolutionary process: adaptation. Evolution in the biological world also occurs by non-adaptive processes, and this is perhaps particularly true of the evolution of non-coding DNA. It is on these points that the analogy between evolutionary computation and biological evolution fundamentally breaks down. Premise 4 is rejected in the strongest possible terms.
  • Real populations of organisms are not constant in size, though one could argue that in some cases they are held close to the carrying capacity of an available niche. However, this assumes the existence of only one conceivable niche. Real populations can evolve to exploit different niches. Premise 5 is rejected.
  • With a few exceptions (e.g., DNA transposons), transposable elements are sexually transmitted parasites of the genome, and these elements make up the single largest portion of eukaryotic genomes (roughly half of the human genome, for example). Ignoring this fact makes the model inapplicable to the very question it seeks to address. Premise 6 is rejected.

The main problem with proofs-by-analogy such as this is that they disregard most of the characteristics that make biological questions complex in the first place. Non-coding DNA evolves not as part of a simple, goal-directed, constant-rate process, but as part of one typified by the influence of non-adaptive processes (e.g., gene duplication and pseudogenization), selection at multiple levels (e.g., both intragenomic and organismal), and open-ended trajectories. An “evolutionary” simulation this may be, but a model of biological evolution it is not.

Finally, it is essential to note that “non-coding elements make future evolution possible” explanations, though invoked by an alarming number of genome biologists, contradict basic evolutionary principles. Natural selection cannot favour a feature, especially a potentially costly one such as the presence of large amounts of non-coding DNA, because it may be useful down the line. Selection occurs in the here and now, and is based on reproductive success relative to competing alternatives. Long-term consequences are not part of the equation except in artificial situations where there is a pre-determined finish line to which variants are made to race.

That said, there can be long-term consequences in which inter-lineage sorting plays a role. In terms of processes such as alternative splicing and exon shuffling, which rely on the existence of non-coding introns, an effect on evolvability is plausible and may help to explain why lineages of eukaryotes with introns are so common (Doolittle 1987; Patthy 1999; Carroll 2002). However, this is not necessarily linked to total non-coding DNA amount. For a process of inter-lineage sorting to affect genome size more generally, large amounts of non-coding DNA would have to be insufficiently detrimental in the short term to be removed by organism-level selection, and would have to improve lineage survival and/or enhance speciation rates, such that over time one would observe a world dominated by lineages with huge genomes. In principle, this would be compatible with the conclusions of the model under discussion, at least in broad outline. In practice, however, this is undone by evidence that lineages with exorbitant genomes are restricted to narrower habitats (e.g., Knight et al. 2005), are less speciose (e.g., Olmo 2006), and may be more prone to extinction (e.g., Vinogradov 2003) than those with smaller genomes.

Non-coding DNA does not accumulate “so that” it will result in longer-term evolutionary advantage. And even if this explanation made sense from an evolutionary standpoint, it is not the effect that is observed in any case. No computer simulation changes this.



Carroll, R.L. 2002. Evolution of the capacity to evolve. Journal of Evolutionary Biology 15: 911-921.

Doolittle, W.F. 1987. What introns have to tell us: hierarchy in genome evolution. Cold Spring Harbor Symposia on Quantitative Biology 52: 907-913.

Feverati, G. and F. Musso. 2007. An evolutionary model with Turing machines. arXiv:0711.3580v1.

Gregory, T.R. 2005. Genome size evolution in animals. In: The Evolution of the Genome (edited by T.R. Gregory). Elsevier, San Diego, pp. 3-87.

Knight, C.A., N.A. Molinari, and D.A. Petrov. 2005. The large genome constraint hypothesis: evolution, ecology and phenotype. Annals of Botany 95: 177-190.

Martin, C.C. and R. Gordon. 1995. Differentiation trees, a junk DNA molecular clock, and the evolution of neoteny in salamanders. Journal of Evolutionary Biology 8: 339-354.

Olmo, E. 2006. Genome size and evolutionary diversification in vertebrates. Italian Journal of Zoology 73: 167-171.

Patthy, L. 1999. Genome evolution and the evolution of exon shuffling — a review. Gene 238: 103-114.

Pearson, A. 2007. Junking the genome. New Scientist 14 July: 42-45.

Vinogradov, A.E. 2003. Selfish DNA is maladaptive: evidence from the plant Red List. Trends in Genetics 19: 609-614.


Update: The author’s responses are posted and addressed here.

Help requested: Who said non-coding DNA was all non-functional?

I have a request that I hope some readers can help me with. I am looking for examples from the literature (rather than any “general sense”) of people who claimed that “junk DNA” or “selfish DNA” was totally non-functional. I am particularly interested in peer-reviewed primary articles, but media reports and textbooks are of interest too. Anything from the 1970s to the present would be useful, especially pre-2000 publications. I suspect that the assumption that junk = totally functionless arose sometime in the 1990s, and hardly qualifies as a long-held view. I would also be interested to see references in which people suggest that the term “junk” was meant only to reflect our ignorance about what non-coding DNA is doing in the genome. Post your examples (with a quote and full reference info) in the comments. (Please don’t list Ohno 1972).

Quotes of interest — junk DNA and selfish DNA.

There has been a lot of discussion regarding discoveries in genomics, in terms of both genes (especially their number) and non-coding DNA (in particular whether any of it is functional and how much of it is transcribed). All of this supposedly contradicts long-held assumptions about genomes, especially those attributed to the early proponents of “junk DNA” or “selfish DNA” such as that all non-coding elements must be totally non-functional.

I thought I would share some quotes about this topic that I found interesting.

The observation that up to 25% of the genome of fetal mice is transcribed into rapidly labeled RNA, despite the fact that probably less than half this much of the genome serves a useful function, indicates that much of the junk DNA must be transcribed. It is thus not too surprising that much of this is rapidly broken down within the nucleus. There are several possible reasons why it is transcribed: (1) it may serve some unknown, obscure purpose; (2) it may play a role in gene regulation; or (3) the promoters which allow its transcription may remain sufficiently intact to allow RNA transcription long after the structural genes have become degenerate. [1]

These considerations suggest that up to 20% of the genome is actively used and the remaining 80+% is junk. But being junk doesn’t mean it is entirely useless. Common sense suggests that anything that is completely useless would be discarded. There are several possible functions for junk DNA. [1]

The observations on a number of structural gene loci of man, mice and other organisms revealed that each locus has a 10⁻⁵ per generation probability of sustaining a deleterious mutation. It then follows that the moment we acquire 10⁵ gene loci, the overall deleterious mutation rate per generation becomes 1.0 which appears to represent an unbearably heavy genetic load. Taking into consideration the fact that deleterious mutations can be dominant or recessive, the total number of gene loci of man has been estimated to be about 3 × 10⁴. [2]

The creation of every new gene must have been accompanied by many other redundant copies joining the ranks of silent DNA base sequences, and these silent DNA base sequences may now be serving the useful but negative function of spacing those which have succeeded. [2]

It would be surprising if the host genome did not occasionally find some use for particular selfish DNA sequences, especially if there were many different sequences widely distributed over the chromosomes. One obvious use … would be for control purposes at one level or another. [3]

It seems that what the new publications do, rather than overturning any previous claims, is indicate that some authors don’t read the literature that they cite.


Part of the Quotes of interest series.


1. Comings, D.E. 1972. The structure and function of chromatin. Advances in Human Genetics 3: 237-431.

2. Ohno, S. 1972. So much “junk” DNA in our genome. In Evolution of Genetic Systems (ed. H.H. Smith), pp. 366-370. Gordon and Breach, New York.

3. Orgel, L.E. and F.H.C. Crick. 1980. Selfish DNA: the ultimate parasite. Nature 284: 604-607.