Genome size, code bloat, and proof-by-analogy.

I recently did an interview with New Scientist for what, I am happy to say, was one of the most reasonable popular reviews of “junk DNA” that has appeared in recent times (Pearson 2007). My small section appeared in a box entitled “Survival of the fattest”, in which most of the discussion related to diversity in genome size and its causes and consequences. It even included mention of “the onion test”, which I proposed as a tonic for anyone who thinks they have discovered “the” functional explanation for the existence of vast amounts of non-coding DNA within eukaryotic genomes. Also thrown in, though not because I said anything about it, was a brief analogy to computer code: “Computer scientists who use a technique called genetic programming to ‘evolve’ software also find their pieces of code grow ever larger — a phenomenon called code bloat or ‘survival of the fattest’”.

I do not follow the literature of computer science, though I am aware that “genetic algorithms” (i.e., program evolution by mutation and selection) are a useful approach to solving complex puzzles. When I read the line about code bloat, my impression was that it probably gave other readers an interesting, though obviously tangential, analogy by which to understand the fact that streamlined efficiency of any coding system, genetic or computational, is not a given when it is the product of a messy process like evolution.
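
Code bloat of this kind is easy to reproduce in a toy setting. Below is a minimal sketch of my own devising (not taken from any of the papers discussed, with every parameter invented for illustration): “programs” are lists of numeric ops, selection sees only their summed effect, insertions are free, and there is no deletion, so equally fit but longer programs can take over the population.

```python
import random

random.seed(0)

TARGET = 10  # selection favours programs whose ops sum to this value

def fitness(genome):
    # Selection "sees" only the summed effect of the ops, not program length.
    return -abs(sum(genome) - TARGET)

def mutate(genome, p_insert=0.3, p_point=0.1):
    genome = list(genome)
    if random.random() < p_insert:  # code growth: insert one op, no deletion
        genome.insert(random.randrange(len(genome) + 1),
                      random.choice([-1, 0, 1]))
    if random.random() < p_point:   # point mutation of one existing op
        i = random.randrange(len(genome))
        genome[i] = random.choice([-1, 0, 1])
    return genome

def evolve(generations=500, pop=50):
    parent = [1]
    for _ in range(generations):
        # Ties go to the first maximum, i.e. to an offspring rather than the
        # parent, so equally fit but longer programs can replace shorter ones.
        candidates = [mutate(parent) for _ in range(pop)] + [parent]
        parent = max(candidates, key=fitness)
    return parent

best = evolve()
print(len(best), sum(best), best.count(0))
```

Running it typically ends with a program whose ops sum to the target but that carries far more instructions, many of them neutral zeros, than the ten it strictly needs — "survival of the fattest" in miniature.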

More recently, I have been made aware of an electronic article published in the (non-peer-reviewed) online repository known as arXiv (pronounced “archive”; the “X” is really the Greek letter chi) that takes this analogy to an entirely different level. Indeed, the authors of the paper (Feverati and Musso 2007) claim to use a computer model to provide insights into how some eukaryotic genomes become so bloated. That is, instead of applying biological observations (i.e., naturally evolving genomes can become large) to a computational phenomenon (i.e., programs evolved in silico can become large, too), the authors flipped the situation around and decided that a computer model could provide substantive information about how genomes evolve in nature.

I will state up front that I am rarely (read: never) convinced by proof-by-analogy studies. Yes, modeling can be helpful if it provides a simplified way to test the influence of individual parameters in complex systems, but only insofar as the conclusions are then compared against reality. When it comes to something like genome size evolution, which applies to millions of species (billions if you consider that every species that has ever lived, about 99% of which are extinct, had a genome) and billions of years, one should be very skeptical of a model that involves only a handful of simplified parameters. This is especially true if no effort is made to test the model in the one way that counts: by asking if it conforms to known facts about the real world.

The abstract of the Feverati and Musso (2007) article says the following:

The development of a large non-coding fraction in eukaryotic DNA and the phenomenon of the code-bloat in the field of evolutionary computations show a striking similarity. This seems to suggest that (in the presence of mechanisms of code growth) the evolution of a complex code can’t be attained without maintaining a large inactive fraction. To test this hypothesis we performed computer simulations of an evolutionary toy model for Turing machines, studying the relations among fitness and coding/non-coding ratio while varying mutation and code growth rates. The results suggest that, in our model, having a large reservoir of non-coding states constitutes a great (long term) evolutionary advantage.

I will not embarrass myself by trying to address the validity of the computer model itself — I am but a layman in this area, and I am happy to assume for the sake of argument that it is the single greatest evolutionary toy model for Turing machines ever developed. It does not follow, however, that the authors are correct in their assertion that they “have developed an abstract model mimicking biological evolution”.

As I understand it, the simulation is based on devising a pre-defined “goal” sequence, similarity to which forms the basis of selecting among randomly varying algorithms. As algorithms undergo evolution by selection, they tend to accumulate more non-coding elements, and the ones that reach the goal most effectively turn out to be those with an “optimal coding/non-coding ratio” which, in this case, was less than 2%. The implication, not surprisingly, is that genomes evolve to become larger because this improves long-term evolvability by providing fodder for the emergence of new genes.
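
The setup, as I have described it, can be caricatured in a few lines of Python. This sketch is my own re-imagining, not the authors’ Turing-machine model, and every name and parameter in it is invented: a genome is a list of (base, active) states, only active states are expressed as the phenotype, growth adds inactive states, and point mutations can later recruit those inactive states into the “coding” fraction.

```python
import random

random.seed(3)

TARGET = "GATTACA"   # the pre-defined "goal" sequence driving selection
ALPHABET = "ACGT"

def phenotype(genome):
    # Only "coding" (active) states are expressed.
    return "".join(base for base, active in genome if active)

def fitness(genome):
    p = phenotype(genome)
    matches = sum(a == b for a, b in zip(p, TARGET))
    return matches - abs(len(p) - len(TARGET))

def mutate(genome, p_grow=0.5, p_point=0.2):
    genome = list(genome)
    if random.random() < p_grow:    # growth adds an *inactive* state
        genome.append((random.choice(ALPHABET), False))
    if random.random() < p_point:   # point mutation: flip a flag or a base
        i = random.randrange(len(genome))
        base, active = genome[i]
        if random.random() < 0.5:
            genome[i] = (base, not active)
        else:
            genome[i] = (random.choice(ALPHABET), active)
    return genome

def evolve(generations=1500, pop=60):
    parent = [(random.choice(ALPHABET), True)]
    for _ in range(generations):
        candidates = [mutate(parent) for _ in range(pop)] + [parent]
        parent = max(candidates, key=fitness)
    return parent

best = evolve()
coding = sum(active for _, active in best)
print(phenotype(best), round(coding / len(best), 3))
```

In runs of this toy, the coding fraction of the evolved genome ends up small, because inactive states accrue for free while selection only polices the expressed phenotype — which is exactly why such a model builds in, rather than discovers, its conclusion.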

Before discussing this conclusion, it is worth considering the assumptions that were built into the model. The authors note that:

For the sake of simplicity, we imposed various restrictions on our model that can be relinquished to make the model more realistic from a biological point of view. In particular we decided that:

  1. non-coding states accumulate at a constant rate (determined by the state-increase rate pi) without any deletion mechanism [this is actually two distinct claims rolled into one],
  2. there is no selective disadvantage associated with the accumulation of both coding and non-coding states,
  3. the only mutation mechanism is given by point mutation and it also occurs at a constant rate (determined by the mutation rate pm),
  4. there is a unique ecological niche (defined by the target tape),
  5. population is constant,
  6. reproduction is asexual.
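
For concreteness, these six restrictions amount to fixing a handful of parameters once and for all. Here is one way they might be written down (pi and pm are the paper’s symbols; every other name is my own invention, not the authors’):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    p_state_increase: float       # pi: constant rate of adding non-coding states
    p_mutation: float             # pm: constant point-mutation rate
    deletions: bool = False       # restriction 1: no deletion mechanism
    size_cost: float = 0.0        # restriction 2: no cost to accumulating states
    n_niches: int = 1             # restriction 4: a single target tape
    population_size: int = 100    # restriction 5: constant population
    sexual: bool = False          # restriction 6: asexual reproduction

# Example values, chosen arbitrarily for illustration.
cfg = ModelConfig(p_state_increase=0.01, p_mutation=0.05)
print(cfg)
```

Writing the restrictions out this way makes the point of the critique below plain: each of these is a frozen constant in the simulation, whereas in real genomes each one is a variable.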

As noted, I am fine with considering this a fantastic computer simulation — it just isn’t a simulation that has any resemblance to the biological systems that it purports to mimic. Consider the following:

  • Although some authors have suggested that non-coding DNA accumulates at a constant rate (e.g., Martin and Gordon 1995), this is clearly not generally true. All extant lineages can trace their ancestries back to a single common ancestor, and thus all living lineages (though not necessarily all taxonomic groups) have existed for exactly the same amount of time. And yet the amount of non-coding DNA varies dramatically among lineages, even among closely related ones. Ergo, the rate of accumulation of non-coding DNA differs among lineages. Premise 1 is rejected.
  • The insertion of non-coding elements can be selectively relevant not only in terms of effects on protein-coding genes (many transposable elements are, after all, disease-causing mutagens), but also in terms of bulk effects on cell division, cell size, and associated organism-level traits (Gregory 2005). Premise 2 is rejected.
  • The accumulation of non-coding DNA in eukaryotes does not occur by point mutation, except in the sense that genes that are duplicated may become pseudogenized by this mechanism. Indeed, the model seems only to involve a switch between coding and non-coding elements without the addition of new “nucleotides”, which makes it even more distant from true genomes. Moreover, the primary mechanisms of DNA insertion, including gene duplication and inactivation, transposable element insertion, and replication and recombination errors, do not occur at a constant rate. In fact, the presence of some non-coding DNA can have a feedback effect in which the likelihood of additional change is increased, be it by insertions (e.g., into non-coding regions, such that mutational consequences are minimized) or deletions (e.g., illegitimate recombination among LTR elements) or both (e.g., unequal crossing over or replication slippage enhanced by the presence of repetitive sequences). Premise 3 is rejected.
  • Evolution does not have a pre-defined goal. Evolutionary change occurs along trajectories that are channeled by constraints and history, but not by foresight. As long as a given combination of features allows an organism to fill some niche better than alternatives, it will persist. Not only this, but models like the one being discussed are inherently limited in that they include only one evolutionary process: adaptation. Evolution in the biological world also occurs by non-adaptive processes, and this is perhaps particularly true of the evolution of non-coding DNA. It is on these points that the analogy between evolutionary computation and biological evolution fundamentally breaks down. Premise 4 is rejected in the strongest possible terms.
  • Real populations of organisms are not constant in size, though one could argue that in some cases they are held close to the carrying capacity of an available niche. However, this assumes the existence of only one conceivable niche. Real populations can evolve to exploit different niches. Premise 5 is rejected.
  • With a few exceptions (e.g., DNA transposons), transposable elements are sexually transmitted parasites of the genome, and these elements make up the single largest portion of eukaryotic genomes (roughly half of the human genome, for example). Ignoring this fact makes the model inapplicable to the very question it seeks to address. Premise 6 is rejected.

The main problem with proofs-by-analogy such as this is that they disregard most of the characteristics that make biological questions complex in the first place. Non-coding DNA evolves not as part of a simple, goal-directed, constant-rate process, but as part of one typified by the influence of non-adaptive processes (e.g., gene duplication and pseudogenization), selection at multiple levels (e.g., both intragenomic and organismal), and open-ended trajectories. An “evolutionary” simulation this may be, but a model of biological evolution it is not.

Finally, it is essential to note that “non-coding elements make future evolution possible” explanations, though invoked by an alarming number of genome biologists, contradict basic evolutionary principles. Natural selection cannot favour a feature, especially a potentially costly one such as the presence of large amounts of non-coding DNA, because it may be useful down the line. Selection occurs in the here and now, and is based on reproductive success relative to competing alternatives. Long-term consequences are not part of the equation except in artificial situations where there is a pre-determined finish line to which variants are made to race.

That said, there can be long-term consequences in which inter-lineage sorting plays a role. In terms of processes such as alternative splicing and exon shuffling, which rely on the existence of non-coding introns, an effect on evolvability is plausible and may help to explain why lineages of eukaryotes with introns are so common (Doolittle 1987; Patthy 1999; Carroll 2002). However, this is not necessarily linked to total non-coding DNA amount. For a process of inter-lineage sorting to affect genome size more generally, large amounts of non-coding DNA would have to be insufficiently detrimental in the short term to be removed by organism-level selection, and would have to improve lineage survival and/or enhance speciation rates, such that over time one would observe a world dominated by lineages with huge genomes. In principle, this would be compatible with the conclusions of the model under discussion, at least in broad outline. In practice, however, this is undone by evidence that lineages with exorbitant genomes are restricted to narrower habitats (e.g., Knight et al. 2005), are less speciose (e.g., Olmo 2006), and may be more prone to extinction (e.g., Vinogradov 2003) than those with smaller genomes.

Non-coding DNA does not accumulate “so that” it will result in longer-term evolutionary advantage. And even if this explanation made sense from an evolutionary standpoint, it is not the effect that is observed in any case. No computer simulation changes this.



Carroll, R.L. 2002. Evolution of the capacity to evolve. Journal of Evolutionary Biology 15: 911-921.

Doolittle, W.F. 1987. What introns have to tell us: hierarchy in genome evolution. Cold Spring Harbor Symposia on Quantitative Biology 52: 907-913.

Feverati, G. and F. Musso. 2007. An evolutionary model with Turing machines. arXiv:0711.3580v1.

Gregory, T.R. 2005. Genome size evolution in animals. In: The Evolution of the Genome (edited by T.R. Gregory). Elsevier, San Diego, pp. 3-87.

Knight, C.A., N.A. Molinari, and D.A. Petrov. 2005. The large genome constraint hypothesis: evolution, ecology and phenotype. Annals of Botany 95: 177-190.

Martin, C.C. and R. Gordon. 1995. Differentiation trees, a junk DNA molecular clock, and the evolution of neoteny in salamanders. Journal of Evolutionary Biology 8: 339-354.

Olmo, E. 2006. Genome size and evolutionary diversification in vertebrates. Italian Journal of Zoology 73: 167-171.

Patthy, L. 1999. Genome evolution and the evolution of exon shuffling — a review. Gene 238: 103-114.

Pearson, A. 2007. Junking the genome. New Scientist 14 July: 42-45.

Vinogradov, A.E. 2003. Selfish DNA is maladaptive: evidence from the plant Red List. Trends in Genetics 19: 609-614.


Update: The author’s responses are posted and addressed here.

8 thoughts on “Genome size, code bloat, and proof-by-analogy.”

  1. Although some authors have suggested that non-coding DNA accumulates at a constant rate (e.g., Martin and Gordon 1995), this is clearly not generally true.

    When you say “accumulates”, are you referring to mutation or fixation? I’ll agree that the fixation rate of non-coding DNA differs among lineages, but how different is the mutation rate? Granted, the mutation rate is dependent on the amount of selfish elements in the genome (giving a sort of snowball effect), but I’m not sure differences in mutation rates are as striking as differences in fixation rates. From what I can gather, the authors are referring to mutation rates, not fixation rates of non-coding DNA.

    But, yeah, their model does seem kind of bunk.

  2. Obviously I mean fixation rate, as we’re talking about observable differences in DNA amount between species.

    They set their entire paper up as a discussion about the evolution of genome size, which is an issue of actual differences, not simple mutation rate.

    If they meant something else, then they need a new introduction and conclusion.

  3. Let’s start with complete agreement with Ryan:

    “I am happy to assume for the sake of argument that it is the single greatest evolutionary toy model for Turing machines ever developed”

    For the sake of argument, let me agree. Ryan continues:

    “It does not follow, however, that the authors are correct in their assertion that they “have developed an abstract model mimicking biological evolution””.

    Here, I disagree with Ryan that the mathematical model is a “non sequitur”.

    Per definitionem, “Turing machines” are virtually indistinguishable from the modeled system.

    Therefore, if we agree (for the sake of argument) that “it is the single greatest evolutionary …model for Turing machines ever developed” then it follows that an algorithmic mathematical model, as a Turing machine, is virtually indistinguishable from the modeled (evolutionary) system.

    It looks like the question is whether mathematical models are “toys” or something more precious.

    Comments, anyone?

  4. Andras, Thank you for the comment – it is helpful to know that this is how you see the evolutionary process.

  5. This was both interesting and singularly confusing, the possible different meanings of toy models in different areas aside. (For example, physicists are very relaxed about toy models. Sometimes they are only vaguely reminiscent of the target problem, permitting a superficial check of a method, for instance.) Perhaps biologists immediately recognize and accept what, in my eyes, the paper skirts, but I have some trouble with it.

    First, it uses Turing machines as they are more “convenient”, later to be identified with two possible biological models. But suggesting that the tapes (single-tape Turing), in both cases identified with phenotypes, affect the genotypes in a read operation suggests AFAIU that Lamarckian mechanisms are considered instead of Darwinian. I can’t find any such discussion in the paper, and it is only in the post where some regulatory or mutational mechanisms with possible feedback are mentioned. Fine, but can they be clearly identified with the paper’s read operation?

    Second, there is a difference between using a Turing machine and showing that it is universal and obeys the possibly true Church–Turing thesis. The paper may perhaps have no reason to discuss it, as they don’t rely on the latter (and they more or less concede that it isn’t a necessity), but I do think Andras is relying on it here. “an algorithmic mathematical model, as a Turing machine, is virtually indistinguishable from the modeled (evolutionary) system” suggests some sort of specified property, which isn’t assured as far as I can see.

    Sure, I can imagine that DNA may be souped up to be Turing complete in some biochemical process, and perhaps some have shown it to be so. But has anyone shown that plain vanilla evolutionary mechanisms are TC? It seems to me evolution is capable of what it does without needing to have such universal properties.

    it follows that an algorithmic mathematical model, as a Turing machine, is virtually indistinguishable from the modeled (evolutionary) system.

    Even if we were discussing Turing complete systems, that wouldn’t follow on the metrics of interest here, code size and bloat. IIRC someone has shown that Conway’s Game of Life is TC, but that doesn’t mean that I would want to program a Windows system in it. Neither code size/execution speed nor ease of code bloating is coupled to a TC machine’s capabilities as an algorithmic solver.

  6. hi Torbjörn,

    I defer to your understanding of the computer science side — my point was just that no matter how superb the model is in that sense (and maybe it isn’t, I don’t know), it has little resemblance to evolution.

  7. Leroy Hood, who defines his main interests as “immunity, evolution, genomics”, went on record that his biggest surprise has been that these subjects have turned into “information science” (not something you can do without mathematical algorithms, models, and computer science methodologies). Does anyone think that Dr. Hood might be wrong and evolution a priori is an exception and could for some reason not be modeled mathematically? If not, what would make evolution a singularity in natural sciences?

    For the “not-so-mathematically-minded” let me note that “random number generators” (or “random mutation generators”) entail a mathematical model, albeit not a terribly sophisticated kind.

    Am I the only one thinking in terms of mathematics about basic tenets of natural sciences, including biology? (I don’t think so – Schrodinger, Szilard, von Neumann, Turing, Wiener already did it several generations ago and biology, especially genomics, is increasingly mathematical these days)

  8. As an author of the article discussed in this blog I would like to reply to Prof. Gregory’s
    criticisms. First of all, I think that after writing such a harsh comment on an article, it would
    be a matter of good taste to inform the authors, just to give them the opportunity to reply (that
    does not cost a great effort, since the email addresses are in the paper). I stumbled on this
    review by chance and only recently, and so my answer comes a bit late.

    Even if Prof. Gregory introduces our article saying that: “the authors ….decided that a computer
    model could provide substantive information about how genomes evolve in nature”, actually we never
    said that. We have a brief subsection in the conclusions (less than half a page long) where we
    comment on the biological relevance of our results. That subsection begins with the following
    words: “In this section we put forward some biological speculations inspired by our model”.
    It seems to me that “biological speculations” is quite different from “substantive information”;
    moreover, we speak only of possible advantages in terms of “evolvability”, and that is also very
    different from saying “how genomes evolve in nature”.

    Prof. Gregory next discusses the validity of our assumptions. First of all, I would like to note
    that since we wrote: “For the sake of simplicity, we imposed various restrictions on our model
    that can be relinquished to make the model more realistic from a biological point of view”,
    we are fully aware that our assumptions are NOT realistic. So I cannot understand the point of
    placing such emphasis on explaining the reasons why they are not. A much briefer comment would
    have been: “as the authors candidly admit, their assumptions are unrealistic”.
    I would like to stress that a “model” is a simplified version of reality, while a “toy model” is
    oversimplified to the point that the model is just a caricature of reality. Still, toy models
    are precious instruments in the investigation of complex systems, and can give some hints and
    help comprehension of the modeled phenomenon. The first example that comes to my mind is the
    “HPP lattice gas model” for hydrodynamics. Imposing the level of detail requested by Prof. Gregory
    would result not in a toy model, nor even in a model, but in an accurate description of
    reality (admitting that by now we have a perfect understanding of all biological phenomena).
    Moreover, with such a level of detail it would have been impossible to reach our aim (measuring the
    optimal coding/non-coding ratio in our model), partly for the computational time required and
    partly for the impossibility of interpreting the results unambiguously. I would like to
    stress that since, in our model, adding a new state has a NEUTRAL impact on the fitness,
    the process of state increase is, by definition, NON-adaptive. I agree with Prof. Gregory
    that it would have been better to use “mimic Darwinian evolution” instead of
    “mimic biological evolution”, but I also have a provocative question:
    was Darwin’s theory to be rejected as a theory of biological evolution because he did not specify
    the exact mechanisms of mutation?

    In his conclusion, Prof. Gregory suggests that we claim that “Non-coding DNA does
    accumulate ‘so that’ it will result in longer-term evolutionary advantage”.
    We ABSOLUTELY NEVER stated such nonsense. It is curious that the same accusation was made by
    Prof. Gregory, in his article “Coincidence, coevolution, or causation? DNA content, cell size, and
    the C-value enigma” (which we cite in our paper), against an article by Jain that we also cite in
    our paper. So, either Prof. Gregory has a very poor opinion of our intelligence, or he thinks that
    we do not read the articles that we cite. Let us state, unambiguously, what we and Jain really
    say: “IF there exists a mechanism for genome size increase, THEN maybe the resulting long-term
    advantage can overcome the short-term disadvantage” (Jain was referring to
    selfish DNA as the genome-increasing mechanism, while we do not express any preference).
    Prof. Gregory reverses the implication: “IF there is a long-term advantage THEN the mechanism of
    genome increase is the product of selection”, and then explains to us that it cannot be true.
    Incidentally, in the case of Jain, I think that what he really intended can be clearly
    understood just from the title: “Incidental DNA”.

    Finally, let us state, very briefly, what we really did in our paper. We built an abstract
    evolutionary model with mechanisms of mutation and genome increase, in such a way that we could
    exactly measure the coding/non-coding ratio in our model, and we found that it cannot be more
    than 2%. We thought that such a result could also be interesting for biologists;
    maybe we were wrong.
