Kudos on the placozoan genome!

Trichoplax adhaerens is a bizarre little animal with a decidedly simple morphology. There has been some question as to the relationship between this critter and other animal groups, but mitochondrial sequences (Dellaporta et al. 2006) and, as of this week, a complete nuclear genome sequence (Srivastava et al. 2008), suggest that it is a modern representative of the earliest branch to split from the rest of the animal lineages (for more detail, check out John Timmer’s discussion). The term “basal” is usually applied to lineages like this, often with the assumption that basal means primitive. Sometimes genome sequencing articles exhibit misunderstanding of what “early branching” actually means, but I must give kudos to Srivastava et al. (2008) for their refreshingly apt conclusions:

Trichoplax’s apparent genomic primitiveness, however, is separate from the question of whether placozoan morphology or life history is a relict of the eumetazoan ancestor. For example, the flat form and gutless feeding could be a ‘primitive’ ancestral feature, with the cnidarian–bilaterian gut arising secondarily by the invention of a developmental process for producing an internal body cavity (as in Bütschli’s ‘plakula’ theory), or it could be a ‘derived’, uniquely placozoan feature that resulted from the loss of an ancestral eumetazoan gut. Unfortunately, the genome sequence alone cannot answer these questions, but it does provide a platform for further studies.

Non-functional DNA: non-functional vs. inconsequential.

Each copy of the human genome consists of about 3,200,000,000 base pairs, and includes about 500,000 repeats of the LINE-1 transposable element (a LINE) and twice as many copies of Alu (a SINE), as compared to around 20,000 protein-coding genes. Whereas protein-coding regions represent about 1.5% of the genome, about half is made up of LINE-1, Alu, and other transposable element sequences. These begin as parasites, and some continue to behave as detrimental mutagens implicated in disease. However, most of those in the human genome are no longer mobile, and it is possible that many of these persist as commensal freeloaders. Finally, it has long been expected that a significant subset of non-coding elements would be co-opted by the host and take on functional roles at the organism level, and there is increasing evidence to support this.
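These proportions can be sanity-checked with back-of-the-envelope arithmetic. The copy numbers below follow the figures above, but the average element lengths are illustrative assumptions only (full-length Alu is roughly 300 bp; full-length LINE-1 is about 6 kb, but most genomic copies are 5′-truncated, so an average of roughly 1 kb is assumed):

```python
# Back-of-the-envelope check of human genome composition.
# Copy numbers follow the text; average element lengths are
# illustrative assumptions, not measured values.
GENOME_BP = 3_200_000_000

line1_copies = 500_000
alu_copies = 2 * line1_copies   # "twice as many copies of Alu"
line1_avg_bp = 1_000            # assumed: most LINE-1 copies are truncated
alu_avg_bp = 300                # assumed: full-length Alu is ~300 bp

line1_fraction = line1_copies * line1_avg_bp / GENOME_BP
alu_fraction = alu_copies * alu_avg_bp / GENOME_BP

print(f"LINE-1: ~{line1_fraction:.0%} of the genome")  # ~16%
print(f"Alu:    ~{alu_fraction:.0%} of the genome")    # ~9%
```

Under these assumptions, the two most abundant families alone account for roughly a quarter of the genome; the remainder of the "about half" comes from other transposable element types.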

A notable fraction of the non-genic portion of human DNA is undoubtedly involved in regulation, chromosomal function, and other important processes, but based on what we know about non-coding DNA sequences, it remains a reasonable default assumption — though one that should continue to be tested empirically — that much or perhaps most of it is not functional at the organism level. This does not mean that a search for the functional segments is futile or irrelevant — far from it, as many non-genic regions are critical for normal genomic operation and some have played an important role in many evolutionary transitions. It simply means that one must not extrapolate without warrant from discoveries involving a small fraction of sequences to the genome as a whole.

More generally, it has been known for more than 50 years that the total quantity of DNA in the genome is linked to nucleus size, cell size, cell division rate, and a wide range of organism-level characteristics that derive from these cytological features. Thus, large amounts of DNA tend to be found in large, slowly dividing cells, which in turn typically make up the bodies of organisms with low metabolisms, slow development, or other such traits. On this basis alone, one would expect to see consequences for the organism if a large quantity of non-coding DNA were eliminated from or added to the genome, even if most of the particular elements in question were neutral or detrimental under normal circumstances. Non-functional is not equivalent to inconsequential. This is especially true when there are factors operating at different levels, for example when an abundant and diverse collective of entities includes components that are variously neutral, beneficial, and detrimental to a host.

Though they cannot prove an argument, analogies are often useful for understanding an issue. In this capacity, consider the following:

  • There are roughly 10^13 to 10^14 individual microorganisms living in your digestive tract (Gill et al. 2006), which is on par with, or perhaps even 10x larger than, the number of cells making up your own body. It is also two or three orders of magnitude larger than the number of humans who have ever lived, and than the number of stars in the Milky Way galaxy.
  • The assemblage of microorganisms in your intestines comprises some 500 species, most of which have never been cultured in the lab or studied in detail (Gilmore and Ferretti 2003). To put this diversity in perspective, there are only about 5,000 species of mammals on Earth today.
  • The combined “metagenome” of the microorganisms in your gut contains at least 100 times as many genes as your own genome (Gill et al. 2006).
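The orders-of-magnitude comparisons in the first bullet can be made explicit. The round figures below are approximations for illustration (the counts of humans who have ever lived and of stars in the Milky Way are commonly cited ballpark estimates, not precise values):

```python
import math

# Approximate counts, orders of magnitude only.
gut_microbes = 1e13      # lower bound; may be as high as 1e14
human_cells = 1e13       # roughly on par with gut microbes
humans_ever = 1e11       # ~100 billion, a ballpark estimate
stars_milky_way = 1e11   # ~100-400 billion, a ballpark estimate

def orders_larger(a, b):
    """How many orders of magnitude larger a is than b."""
    return math.log10(a / b)

print(orders_larger(gut_microbes, human_cells))     # 0.0: on par
print(orders_larger(gut_microbes, humans_ever))     # 2.0: two orders larger
print(orders_larger(gut_microbes, stars_milky_way)) # 2.0: two orders larger
```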

We do not know the specific characteristics of many of the microorganisms in the gut. However, we do know that at least some of them are essential, or at least highly beneficial, for human health. Several of the species found in the gut are important mutualists, assisting with digestion and in return drawing nutrients from the food that we consume. In this sense, it is hard not to agree with Gill et al. (2006), who argue that “humans are superorganisms whose metabolism represents an amalgamation of microbial and human attributes”.

The question is, are all 10,000,000,000,000+ microbial cells that we carry with us functional for our well-being? Some certainly are. But many, maybe even most, are probably commensal freeloaders who neither harm nor benefit us, though of course their total abundance is limited to what can be carried by the host without deleterious consequences. By contrast, some gut bacteria are implicated in gastrointestinal disorders. A few are actively parasitic, but their numbers may be kept in check by our own immune system or through competition with non-pathogenic species, or because they kill the host or are killed by antibiotics. Some, such as the well known Escherichia coli, can be harmless or deadly depending on the presence of particular genes. Thus, the total number of microorganisms, and the relative diversity of species that this encompasses, is influenced by a complex interaction of factors internal to the gut (e.g., who invades, which microorganisms are already present, how efficiently they reproduce) and higher-level conditions (e.g., human immune response, dietary effects on which nutrients are present, positive or negative effects on the host).

What we know about bacteria and other microorganisms makes for a reasonable default assumption that much or even most of what is found in the gut is not there because it provides a direct benefit to humans. On the flipside, we have good reason to expect that some, perhaps even a large fraction, of these organisms are beneficial. Therefore, we require evidence to show that any particular species is functional from the human point of view, and that its abundance is determined on this basis. The search for such evidence is important, but it occurs against a backdrop of realizing that bacteria could be there for their own benefit only, whether or not that has any adverse effects on our well-being as hosts. Establishing that a specific strain of bacteria in the digestive tract is beneficial does not justify the conclusion that all bacteria in the gut are mutualistic. It does not even imply that all individuals of the helpful strain are essential, because the optimal abundance for the host and the pressures for reproduction of the microorganisms may not converge on the same quantity.

If one were to remove the microorganisms from the gut, or to significantly alter their species composition or abundance, one would expect to see consequences for host health. This would be true even if most of the particular organisms in question were neutral or detrimental in normal circumstances. As with non-genic elements in the genome, this means that even if many organisms in the gut are non-functional from the host’s perspective, their presence is not inconsequential for the biology of an animal carrying them.

Non-functional DNA: quantity.

In my previous post, I noted that because of what we understand about the nature, origins, and cross-taxon quantitative diversity of the various sorts of non-genic DNA in large eukaryote genomes, the default assumption is that much or even most of it is not functional at the cell and organism levels. Thus, the burden of proof rests with authors who claim that a large fraction, or indeed most or all, of this DNA is functional for the organisms in which it occurs.

This should not be construed as claiming that all non-genic DNA is assumed to be non-functional. I have pointed out in various preceding posts that even those who postulated non-adaptive explanations for its existence did not rule out — and indeed, explicitly favoured — the notion that a significant portion would turn out to serve a function. You need not take my word for this, as it is not difficult to find unambiguous statements from the original authors themselves.

For example, here are Orgel and Crick (1980) who, along with Doolittle and Sapienza (1980), first proposed the concept of “selfish DNA” in detail:

It would be surprising if the host genome did not occasionally find some use for particular selfish DNA sequences, especially if there were many different sequences widely distributed over the chromosomes. One obvious use … would be for control purposes at one level or another.

Here, too, is Comings (1972), the first person to use the term “junk DNA” in print and the first to provide a substantive discussion of the topic. (The term was coined by Ohno in 1972, but Comings’s paper appeared in print first, citing Ohno as ‘in press’, and Ohno used the term only in the title).

These considerations suggest that up to 20% of the genome is actively used and the remaining 80+% is junk. But being junk doesn’t mean it is entirely useless. Common sense suggests that anything that is completely useless would be discarded. There are several possible functions for junk DNA.

The use of the terms “selfish DNA” or “junk DNA” has changed over time, and both are now often applied to all non-genic DNA, rather than to the sequences to which they originally referred (i.e., transposable elements and pseudogenes, respectively). Moreover, it seems that many authors — at least those whose studies focus primarily on protein-coding genes and DNA sequencing — believe that the assumption has been that all non-genic DNA is “junk” in the sense of totally non-functional. However, amidst any such assumptions there has always been a diversity of views on the subject, ranging from assuming that most non-genic DNA is non-functional (as in the quotes above) to expecting it all to be functional — the latter being a position held by strict adaptationists, and a large part of the motivation for proposing the alternative view of selfish DNA in the first place.

As with many issues in evolution, this is a matter of relative quantity, not an exclusive dichotomy. We may reasonably expect a significant fraction of non-genic DNA to show evidence of function, and the pursuit of such evidence is a valid and important endeavour. It does not follow, however, that the pendulum must be perceived to swing from entirely functional to entirely non-functional and back again. We will undoubtedly refine our estimates of the amount of non-genic DNA that is mutualistic at the organism level, how much is commensal, and how much is best characterized as parasitic in nature.

As it stands, the evidence suggests that about 5% of the human genome is functional at the organism level. The total may be higher — as noted, Comings suggested 20% is actively utilized. It is conceivable that 50% or more of the genome is functional, perhaps in structural roles or some other higher-order capacity. It would require evidence to support this contention, however, and the question would remain as to why an onion requires 5x more of this structural or otherwise essential DNA, and why some of its close relatives can get by with half as much while others have twice the onion amount. There is nothing remarkable about onions in this sense, by the way — animal genome sizes alone cover a more than 7,000-fold range, and even among vertebrates there is a 350-fold difference. The range among single-celled protozoa is at least 30,000-fold, though even higher estimates have been presented.

The take-home message is simply this. What we know about eukaryote genomes suggests that there are many mechanisms that can add non-coding DNA that do not require it to be functional. This does not in any way preclude the possibility of, or invalidate the search for, function in some, many, or possibly even most of those non-coding components. How much proves to be functional is an empirical question, and at present the indication seems to be that most non-genic DNA is non-functional. That said, non-functional is not the same as inconsequential.


Comings, D.E. 1972. The structure and function of chromatin. Advances in Human Genetics 3: 237-431.

Doolittle, W.F. and C. Sapienza. 1980. Selfish genes, the phenotype paradigm and genome evolution. Nature 284: 601-603.

Ohno, S. 1972. So much “junk” DNA in our genome. In Evolution of Genetic Systems (ed. H.H. Smith), pp. 366-370. Gordon and Breach, New York.

Orgel, L.E. and F.H.C. Crick. 1980. Selfish DNA: the ultimate parasite. Nature 284: 604-607.

Dinosaurs made from pseudogenes?

Matt Ridley, author of such books as The Red Queen, Genome, and The Origins of Virtue (and not to be confused with biologist Mark Ridley), asks the question “Will we clone a dinosaur?” in Time Magazine. His answer, at least in terms of the Jurassic Park sense of cloning a dinosaur from ancient DNA, is either “no” or “definitely not”.

Yet, Ridley argues for a different possible revival of dinosaur-like animals, ones built through genetic engineering. He notes three things that he considers encouraging in this regard. The first is that dinosaurs aren’t really extinct, or at least that they did leave a diverse line of descendants — namely birds. Second, important regulatory genes, such as the Hox genes that play a major role in directing development, are generally quite conserved across animal lineages. No doubt, the third will be of particular interest to readers of this blog and indeed Ridley singles it out:

Third, and most exciting, geneticists are finding many “pseudogenes” in human and animal DNA–copies of old, discarded genes. It’s a bit like finding the manual for a typewriter bound into the back of the manual for your latest word-processing software. There may be a lot of interesting obsolete instructions hidden in our genes.

Put these three premises together, and the implication is clear: the dino genes are still out there.

I remember an episode of Star Trek: The Next Generation in which the introns of the crew members’ genomes were “reactivated”, and this caused them to de-evolve through various stages in their species’ ancestries. Of course, introns include various types of DNA sequence, most of which are probably not something that could be activated in any sense. The writers probably meant to focus on pseudogenes, as Ridley did.

Pseudogenes are duplicates of protein-coding genes that either maintain the intron/exon structure of the original gene (classical pseudogenes) or lack introns because they were reverse-transcribed from an RNA transcript and reinserted into the genome (processed pseudogenes) — either way, they are defined by two characteristics: 1) their obvious similarity to and derivation from protein-coding genes, and 2) the fact that they no longer function in coding for a protein.

Pseudogenes can form at any time in the ancestry of a lineage, may be derived from a wide variety of genes, and may degrade by mutation or be partially deleted without consequence due to a relaxation of selection given that they no longer fulfill sequence-specific functions. Taken together, this means that it can be difficult to identify something as a pseudogene, let alone what the original sequence encoded and in which ancestor the duplication occurred. In other words, pseudogenes are not like an easily legible manual of a particular obsolete technology. They are a jumble of distorted and half-erased text from a manual that is continually being modified haphazardly.


Hat tip: Evolving Thoughts

Genome size, code bloat, and proof-by-analogy — a response.

Some of you may remember the post from Dec. 1, 2007, on Genome size, code bloat, and proof-by-analogy (which was posted on DNA and Diversity also). This post referred to a computer study published in the online, non-peer-reviewed arXiv database by Feverati and Musso (2007). Recently, Dr. Musso has been kind enough to provide some responses to my post, though of course very few people will notice because they are located within the comments section of a post that is more than two months old. So, I reprint them here in full, with my responses interspersed throughout.

As an author of the article discussed in this blog I would like to reply to Prof. Gregory criticisms. First of all I think that after writing such an harsh comment on an article, it would be a matter of good taste to inform the authors just to give them the opportunity to reply (that does not cost a great effort since email addresses are in the paper). I stumbled in this review by chance and only recently, and so my answer comes a bit late.

Fair enough (and my apologies if this caused frustration), though it was not my intent to enter into a discussion about the paper, only to post my thoughts and move on. In particular, I had been asked by a reporter for my thoughts about this paper — in the context of understanding genome size — and instead of sending an email I decided to post them.

Even if Prof. Gregory introduces our article saying that: “the authors ….decided that a computer model could provide substantive information about how genomes evolve in nature”, actually we never said that. We have a brief subsection in the conclusions (less than half a page long) where we comment on the biological relevance of our results. Such subsection begins with the following words: “In this section we put forward some biological speculations inspired by our model”. It seems to me that “biological speculations” is quite different from “substantive information”; moreover we speak only of possible advantages in terms of “evolvability”, and that’s also very different from saying “how genomes evolve in nature”.

Allow me to insert the abstract of the paper:

The development of a large non-coding fraction in eukaryotic DNA and the phenomenon of the code-bloat in the field of evolutionary computations show a striking similarity. This seems to suggest that (in the presence of mechanisms of code growth) the evolution of a complex code can’t be attained without maintaining a large inactive fraction. To test this hypothesis we performed computer simulations of an evolutionary toy model for Turing machines, studying the relations among fitness and coding/non-coding ratio while varying mutation and code growth rates. The results suggest that, in our model, having a large reservoir of non-coding states constitutes a great (long term) evolutionary advantage.

Furthermore, the first two paragraphs of the paper, and the last two (about 1/4 or more of the entire discussion and conclusion), are about genome size, and I believe that one could be forgiven for interpreting this as indicating that the authors saw a strong connection between their study and genome size evolution.

Prof. Gregory next discusses the validity of our assumptions. First of all I would like to notice that since we wrote:”For the sake of simplicity, we imposed various restrictions on our model that can be relinquished to make the model more realistic from a biological point of view”, it means that we are fully aware that our assumptions are NOT realistic. So I can’t understand what’s the point in putting such emphasis in explaining the reasons why they are not. A much briefer comment would have been: “as the authors candidly admit, their assumptions are unrealistic”.

I am glad we are in agreement that the assumptions are unrealistic. The reason I emphasized this so strongly is that this is a blog about genomes and evolution that is meant to provide information to readers with a diversity of educational backgrounds. Dr. Musso and I may know that these assumptions are very unrealistic, but many readers would not. More than a critique of this paper, I was providing details about how evolution actually operates in nature. Incidentally, these criticisms regarding the unrealistic assumptions are the same ones I would have made had I been reviewing this article for a peer-reviewed journal — at least, if any connection was attempted between this model and genome size in eukaryotes.

I would like to stress that a “model” is a simplified version of reality, while a “toy model” is oversimplified to the point that the model is just a caricature of the reality. Still toy models are precious instruments in the investigation of complex systems, and can give some hints and help comprehension on the modelized phenomenon. First example that comes to my mind is the “HPP lattice gas model” for hydrodynamics. Imposing the level of detail requested by prof. Gregory would result not in a toy model and neither in a model but in an accurate description of reality (admitting that by now we have a perfect understanding of all biological phenomena). Moreover with such level of detail it would have been impossible to reach our aim (measuring the optimal coding/non-coding ratio in our model), partly for the computational time required and partly for the impossibility to interpret unambiguously the results obtained.

I think we are in agreement on this, though my conclusion is that if a model has to be too simple to reflect reality then it is not useful, whereas Dr. Musso seems to be saying that because only simplified models can be used, they are justified. The notion that biological evolution is similar to hydrodynamics, and indeed this view of models generally, is the reason for my original post. I noted in the original post that their model may have been the greatest of its sort ever developed, but that it has no bearing on biological evolution — if we agree that it is unrealistic, then why begin and end a paper with a multi-paragraph discussion of a biological phenomenon?

I would like to stress that since, in our model, adding a new state has a NEUTRAL impact on the fitness, the process of state-increasing is, by definition, NON-adaptative. I agree with prof. Gregory that it would have been better to use “mimic Darwinian evolution” instead of “mimic biological evolution”, but I have also a provocative question: was Darwin’s theory to be rejected as a theory of biological evolution since he did not specify the exact mechanisms of mutation?

As a matter of fact, Darwin’s theory of natural selection (but not the fact of evolution) was not widely accepted in his own time in part because he lacked a basis for inheritance, and it was largely rejected in the early 1900s, in part because new knowledge about heredity (namely the rediscovery of Mendelian inheritance) seemed to contradict its assumptions. In any case, I don’t really see what the relevance of this well known history is to the discussion of these models.

In its conclusion prof. Gregory suggests that we claim that “Non-coding DNA does accumulate “so that” it will result in longer-term evolutionary advantage”. We ABSOLUTELY NEVER stated such a non-sense.

If I may quote once more from the article:

In this section we put forward some biological speculations inspired by our model. There are two way [sic] of identifying TMs [Turing machines] with biological entities and they suggest two ways up to which the accumulation of non-coding free to mutate DNA can play a role for “evolvability”. In the first one we identify TMs with organisms and coding-states with genes. We have to stress that the mechanism of transcription is different in the two contexts. For TMs transcription is serial, so that states must be transcribed, one at a time, in prescribed order, while in biological organisms transcription of genes can happen in parallel. We can interpret TMs states as genes accomplishing both a structural and regulatory function, since a coding state both affects the output tape and specifies which state has to be successively transcribed. From this point of view, we can think of TMs in our simulations as organisms trying to increase their gene pools adding new genes assembled from junk DNA. If the organisms possess more junk DNA it is possible to test more “potential genes” until a good one is found.

I may have misinterpreted what the authors meant by this, but it seems to imply that junk DNA serves as a reservoir of potential genes and that this increases evolvability. The implication drawn by many authors, including some biologists like Collins, is that this is why junk is there (“It is not the sort of clutter that you get rid of without consequences because you might need it. Evolution may need it,” [Collins] said.”). Either way, this served as a useful launching pad to reiterate the important point that this makes no sense evolutionarily if framed as a cause of junk DNA rather than as a potential consequence.

It is curious that the same accuse was moved by prof. Gregory in its article “Coincidence, coevolution, or causation? DNA content, cell size, and the C-value enigma”, that we cite in our paper, to an article by Jain that we also cite in our paper. So, either prof. Gregory has a very poor opinion of our intelligence, or he thinks that we do not read the articles that we cite.

I reject the dichotomy presented there. Some other possibilities, inter alia, are that the authors did not interpret the papers the same way as I did, or they read mine but disagreed with my argument, or they partially misunderstand how evolution occurs. Given that even some biologists who work on real-life genomes make this mistake, I hardly think this implies a lack of intelligence, only a lack of background.

Let us state, unambiguously, what we and Jain really say: “IF does exist a mechanism for genome size increase, THEN maybe the resulting long-term advantage can overcome the short-term disadvantage” (Jain was referring to the selfish dna as the genome increasing mechanism while we do not give any preference). Prof. Gregory reverts the implication: “IF there is a long-term advantage THEN the mechanism of genome increase is the product of selection”, and then explains us that it can’t be true. Incidentally, in the case of Jain, I think that what he was really intending can be clearly understood just by the title: “Incidental DNA”.

“Long-term advantage” and “short-term disadvantage” imply selection, and there does not seem to be much difference between the two ways of stating this. Moreover, as I noted in my original post and in more detail in an earlier post, long-term inter-lineage selection can potentially overcome short-term disadvantage, but this is not why non-coding DNA exists in the first place. If Dr. Musso and others understand it that way, then so much the better. But many people do not, and so taking an opportunity to clarify the issue once more was worthwhile.

Finally, let us state, very very briefly what in our paper we really did. We built up an abstract evolutionary model with mechanisms of mutation and genome increase, in such a way that we could exactly measure what is, in our model, the coding/non-coding ratio, and we found that it can’t be more than 2%. We were thinking that such result could be interesting also for biologists, maybe we were wrong.

Once again, this strongly indicates that Dr. Musso sees his “evolutionary model with mechanisms of mutation and genome increase” as a way of studying real biological genome size evolution, which was the entire reason for the post in the first place.

Biologists may indeed have an interest — I suggest that the paper be submitted to a peer-reviewed biological journal.

Junk DNA and ID redux.

Just a reminder, these are the important points under discussion:

* Proponents of ID themselves clearly suggest that “junk DNA” will mostly or all be functional.

* No unambiguous explanation has been given for why ID must assume that non-coding DNA is functional, especially since they say nothing can be known about the designer or the mechanism.

* The existence of much non-functional DNA would not necessarily refute the idea of design, as many human-designed structures have redundant, non-functional, or even counterproductive characteristics. It would, however, challenge certain assumptions about the designer and the mechanism, which again is why these must be made explicit if the junk DNA argument is to be invoked. Therefore, this is only a useful prediction if one includes details about the mechanism of design.

* The demonstration that all or most non-coding DNA is functional would not support ID to the exclusion of evolution, because a strict interpretation of Darwinian processes has always been taken to propose function as well.

* The demonstration that all or most non-coding DNA in the human genome is functional would still leave the question unanswered as to why the designer put five times more in onion genomes.

* Many functions that have been proposed or demonstrated are dependent on the process of co-option, the same process that is involved in the evolution of complex features.

* Evidence for function in non-coding DNA comes from analyses using evolutionary methods. Other approaches, such as deleting large non-coding segments, have not supported the hypothesis that it is functional.

* The current evidence for function, and other details about how non-coding DNA forms, both suggest that most non-coding DNA is non-functional, or at least that this is the most plausible condition pending much more evidence.

Feel free to comment, but please address these points directly.

Bacterial genomes and evolution.

The seminar that I give most often when I am invited to speak at other universities begins with a brief introduction to genomes, sets up some comparisons between bacteria and eukaryotes, and then moves into a short overview of bacterial genome size evolution before spending the remainder of the time on genome size diversity and its importance among animals.

The main things that I have to say about bacterial genomes are:

1) Unlike in eukaryotes, bacterial genome size shows a strong positive relationship with gene number (in other words, bacterial genomes contain little non-coding DNA).

Genome size and gene number in bacteria and archaea. From Gregory and DeSalle (2005).

2) Bacterial genome sizes do not vary anywhere near as much as those of animals do (on the order of 20-fold versus 7,000-fold).

The diversity of archaeal, bacterial, and eukaryotic genome sizes as currently known from more than 10,000 species. From Gregory (2005).

3) The major pattern in bacteria is that, on average, free-living species have larger genomes than parasitic species which in turn have larger genomes than obligate endosymbionts (Mira et al. 2001; Gregory and DeSalle 2005; Ochman and Davalos 2006).

Genome sizes among bacteria with differing lifestyles. Because genome size is primarily determined by the number of genes in bacteria, the question to be addressed is why symbionts have fewer genes in their genomes. From Gregory and DeSalle (2005).

In order to explain these patterns, it was sometimes argued that some bacteria have small genomes because there is selection for rapid cell division, with larger DNA contents taking longer to replicate and thereby slowing down the cell cycle. However, when Mira et al. (2001) compared doubling time and genome size in bacteria that could be cultured in the lab, they found no significant relationship between them. In other words, selection for small genome size is probably not responsible for the highly compact genomes of some bacteria, even though it seems plausible that, more generally, selection does prevent the accumulation of non-coding DNA to eukaryote levels in bacterial cells.

Mira et al. (2001) suggested a different interpretation that is based on two other major processes in evolution — mutation and genetic drift. In terms of mutation, they pointed out that on the level of individual changes that add or subtract relatively small quantities of DNA — i.e., insertions or deletions, or “indels” — deletions tend to be somewhat larger than insertions. The insertions in this case are separate from the addition of whole genes, which happens often in bacteria through sharing of genes among individuals or even across species (“horizontal gene transfer” or “lateral gene transfer”) or gene duplication.

In bacteria (and eukaryotes) small-scale deletions tend
to involve more base pairs than insertions, creating a
“deletion bias”. Of course, larger insertions such as of
transposable elements or gene duplicates are not part
of this calculation as they add much more DNA at once.
From Mira et al. (2001).

So, on the one hand, there are processes that can add genes (duplication and lateral gene transfer); on the other, in the absence of these processes, and if there are no adverse consequences to losing DNA (i.e., no selective constraint), genomes should tend to get smaller as a result of this deletion bias. In free-living bacteria, there are many opportunities for gene exchange, with lateral gene transfer adding DNA at an appreciable frequency. Moreover, free-living bacteria tend to occur in astronomical numbers, and elementary population genetics shows that selection operates very effectively under such conditions (so that even a mildly deleterious mutation, such as a deletion or disruptive insertion, will probably be lost from the population over time). Finally, free-living bacteria must produce their own protein products, and therefore tend to make use of all their genes, which places selective constraints on changes (including indels) in those sequences.

Endosymbiotic bacteria, especially those that live within the cells of eukaryote hosts, are different in multiple relevant respects. First, they do not regularly encounter other bacteria from whom they can receive genes. Second, they occur in drastically smaller numbers — indeed, they experience a population bottleneck severe enough to shift the balance from selection to drift. Third, they come to rely on some metabolites provided by the host and no longer make use of all their own genes. These factors in combination mean that the selective constraints on many endosymbiont genes are relaxed, and the dominant processes become deletion bias and random drift. Over many generations, endosymbiotic bacteria lose the genes they are not using (and some that are only mildly constrained by selection, such is the strength of drift under such conditions) due to deletion bias, and the end result is highly compact genomes.
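The dynamic just described (gene influx shut off, constraint relaxed, and a net deletion bias left to operate) is easy to caricature in a few lines of code. The following is purely my own toy illustration, not a model from Mira et al. (2001); all parameter values are arbitrary, and only the inequality between mean deletion and mean insertion size matters for the qualitative outcome.

```python
import random

# Toy model: a genome accumulates small insertions and deletions, with
# deletions removing somewhat more DNA on average ("deletion bias").
# An "essential" core stands in for selective constraint and cannot
# be deleted. All numbers are arbitrary illustrations.

random.seed(1)

def evolve(genome_bp, essential_bp, events=5000,
           mean_ins=50, mean_del=60):
    """Return genome size after `events` random small indels."""
    for _ in range(events):
        if random.random() < 0.5:
            genome_bp += int(random.expovariate(1 / mean_ins))
        else:
            loss = int(random.expovariate(1 / mean_del))
            # selective constraint: the essential core cannot shrink
            genome_bp = max(essential_bp, genome_bp - loss)
    return genome_bp

start = 2_000_000                       # hypothetical endosymbiont ancestor
final = evolve(start, essential_bp=500_000)
print(final < start)                    # net deletion bias shrinks the genome
```

With no influx of new genes and only the dispensable fraction free to change, the modest per-event bias compounds into a steadily shrinking genome, which is the qualitative point of the argument above.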

The compaction of genomes in endosymbionts can be extreme. The smallest genome known in any cellular organism (except, perhaps, one in Craig Venter’s lab) is found in the bacterial genus Carsonella, a symbiont that lives within the cells of psyllid insects. It contains only 159,662 base pairs of DNA and 182 genes, some of which overlap (Nakabachi et al. 2006).
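Those figures imply a remarkable gene density. A quick back-of-envelope check (my own arithmetic, using only the Carsonella numbers just quoted and the rough human figures given earlier in the post):

```python
# Gene density implied by the Carsonella figures quoted above
# (Nakabachi et al. 2006).
carsonella_bp = 159_662
carsonella_genes = 182
bp_per_gene = carsonella_bp / carsonella_genes
print(round(bp_per_gene))           # roughly 877 bp per gene

# For comparison, rough human values (~3.2 Gb, ~20,000 genes):
human_bp_per_gene = 3_200_000_000 / 20_000
print(round(human_bp_per_gene))     # roughly 160,000 bp per gene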
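```

On this crude reckoning Carsonella packs its genes more than a hundred times more densely than the human genome does, which is exactly what one expects when deletion bias has stripped away nearly all non-essential DNA.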

Carsonella (dark blue) living within the cells and
around the nucleus (light blue) of a psyllid insect.
From Nakabachi et al. (2006).

In some other bacteria, genes that are not used (including non-functional duplicates) may not be lost for some time and may persist as pseudogenes, just as is observed in large numbers in eukaryote genomes. These tend to undergo additional mutations and to degrade over time, but they can still be recognized as copies of existing genes. In Mycobacterium leprae, the pathogen that causes leprosy, for example, there are more than 1,100 pseudogenes alongside roughly 1,600 functional genes (Cole et al. 2001). Its genome is about 1 million base pairs smaller than that of its relative M. tuberculosis, but clearly many of the inactive genes have not (yet) been deleted.

The two major influences on bacterial genomes: insertion of
genes by duplication and lateral gene transfer, and the loss
of non-functional sequences by deletion.
From Mira et al. (2001).

It would be nice if this post could end there, having delivered a brief overview of an interesting issue in comparative genomics. Sadly, there is more to say because some anti-evolutionists apparently have begun using the topic in a confused attempt to challenge evolutionary science. In particular, though I note that I have become aware of this only second hand, some creationists apparently have suggested that all bacterial genomes are degrading and therefore that bacteria today are simpler than they were in the past, such that complex structures like flagella could not have evolved from less complicated antecedents.

It should be obvious that not all genomes are necessarily “degrading” just because there is a net deletion bias. For starters, selective constraints prevent essential genes from being lost by this mechanism in most bacteria. Furthermore, there exist well established mechanisms that can add new genes to bacterial genomes, including lateral gene transfer and gene duplication. In fact, the rate of gene duplication seems to be related to genome size in bacteria (Gevers et al. 2004). Also, as Nancy Moran noted in an email, “The most primitive bacteria were certainly simple, but they are not around or at least are not easily identified. Many modern bacteria have large genomes and are very complex.” Finally, the compact genomes of endosymbionts, such as in the aphid symbiont Buchnera aphidicola, tend to be more stable than the genomes of free-living bacteria in terms of larger-scale perturbations such as chromosomal rearrangements (Silva et al. 2003).

Some bacteria, in particular those that have shifted to a
parasitic or endosymbiotic dependence on a eukaryote host,
have undergone genome reductions (green, red) as compared
to inferred ancestral conditions. Nevertheless, many other
species continue to display large genomes (blue).
However, the very earliest bacteria probably began
with small genomes and simple cellular features.
From Ochman (2005).

As with eukaryotes, the genomes of bacteria provide exceptional confirmation of the fact of common descent. Not only do comparative gene sequence analyses shed light on the relatedness of different bacterial lineages and the evolution of features like flagella, but the presence — and loss to varying degrees — of non-functional DNA highlights a strong historical signal.

Given that it is her work that is being misused by anti-evolutionists, it is fitting that Dr. Moran be given the last word:

“It seems to me that the widespread occurrence of degrading genes, which are present in most genomes including those of animals, plants, and bacteria, argues pretty strongly in favor of evolution. They are the molecular equivalent of vestigial organs.”

Quite right.



Cole, S.T., K. Eiglmeier, J. Parkhill, K.D. James, N.R. Thomson, P.R. Wheeler, et al. 2001. Massive gene decay in the leprosy bacillus. Nature 409: 1007-1011.

Gevers, D., K. Vandepoele, C. Simillion, and Y. Van de Peer. 2004. Gene duplication and biased functional retention of paralogs in bacterial genomes. Trends in Microbiology 12: 148-154.

Gregory, T.R. 2005. Synergy between sequence and size in large-scale genomics. Nature Reviews Genetics 6: 699-708.

Gregory, T.R. and R. DeSalle. 2005. Comparative genomics in prokaryotes. In The Evolution of the Genome, ed. T.R. Gregory. Elsevier, San Diego, pp. 585-675.

Mira, A., H. Ochman, and N.A. Moran. 2001. Deletional bias and the evolution of bacterial genomes. Trends in Genetics 17: 589-596.

Nakabachi, A., A. Yamashita, H. Toh, H. Ishikawa, H.E. Dunbar, N.A. Moran, and M. Hattori. 2006. The 160-kilobase genome of the bacterial endosymbiont Carsonella. Science 314: 267.

Ochman, H. 2005. Genomes on the shrink. Proceedings of the National Academy of Sciences of the USA 102: 11959-11960.

Ochman, H. and L.M. Davalos. 2006. The nature and dynamics of bacterial genomes. Science 311: 1730-1733.

Silva, F.J., A. Latorre, and A. Moya. 2003. Why are the genomes of endosymbiotic bacteria so stable? Trends in Genetics 19: 176-180.

Endogenous retroviruses and human transcriptional networks.

The human genome, like that of most eukaryotes, is dominated by non-coding DNA sequences. In humans, protein-coding exons constitute only about 1.5% of the total DNA sequence. The rest is made up of non-coding elements of various types, including pseudogenes (both classical and processed), introns, simple sequence repeats (microsatellites), and especially transposable elements — sequences capable of autonomous or semi-autonomous movement around, and in most cases duplication within, the genome. Endogenous retroviruses (ERVs), which are very similar to or indeed are classified as long terminal repeat (LTR) retrotransposons, represent one type of transposable element within Class I (elements that use an RNA intermediate during transposition; Class II elements transpose directly from DNA to DNA by cut-and-paste mechanisms). Roughly 8% of the human genome is represented by ERVs, which are descendants of former exogenous retroviruses that became incorporated into the germline genome.
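To make the fractions above concrete, here is a quick restatement in absolute base pairs (my own arithmetic, assuming the ~3.2 billion bp human genome size quoted at the start of the post):

```python
# Rough bp totals implied by the fractions quoted above, assuming a
# ~3,200,000,000 bp human genome (figure from earlier in the post).
genome_bp = 3_200_000_000

exon_bp = 0.015 * genome_bp   # protein-coding exons, ~1.5%
erv_bp = 0.08 * genome_bp     # endogenous retroviruses, ~8%

print(round(exon_bp))   # ~48 million bp of coding exons
print(round(erv_bp))    # ~256 million bp of ERV sequence
```

In other words, ERV-derived sequence alone outweighs all protein-coding exons by roughly five to one, which is worth keeping in mind for the discussion that follows.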

It seems that no discussion about non-coding DNA is complete without stating that until recently it was all dismissed as useless junk. This claim is demonstrably false, but that does not render it uncommon. Some scientists did indeed characterize non-coding DNA as mostly useless, but finding references to this effect that do not also make explicit allowances for potential functions in some non-coding regions is challenging. Even authors such as Ohno and Comings, who first used the term “junk DNA”, noted that this did not imply a total lack of function. In fact, for much of the early period following the discovery of non-coding DNA, there was plentiful speculation about what this non-coding DNA must be doing — and it must be doing something, many authors argued, or else it would have been eliminated by natural selection. (Hence the fallacy involved in claiming that “Darwinism” prevented people from considering functions for non-coding regions within the genome).

Some authors rejected this automatic assumption of function, and argued instead that mechanisms of non-coding DNA accumulation — such as the accretion of pseudogenes following duplication (“junk DNA” sensu stricto) or insertions of transposable elements (“selfish DNA”) — could account for the presence of so much non-coding material without appeals to organism-level functions. However, the originators of such ideas often were careful to note that this did not preclude some portions of non-coding DNA from taking on functions, especially in gene regulation [Function, non-function, some function: a brief history of junk DNA].

There are lots of examples of particular transposable elements, which probably began as parasitic sequences, becoming co-opted into integral roles within the host genome. This process has played an important part in several major transitions during the macroevolutionary history of lineages such as our own. There is a large and growing literature on this topic, but reviewing it is beyond the scope of this post (see chapter 11 in The Evolution of the Genome for some examples). The present post will consider just one recent case, published this month in the Proceedings of the National Academy of Sciences of the USA by Ting Wang, David Haussler, and colleagues, which examines the role of ERVs in the evolution of a key human gene regulatory system.

Here is the abstract from their paper (which is open access and is available here):

Species-specific endogenous retroviruses shape the transcriptional network of the human tumor suppressor protein p53

The evolutionary forces that establish and hone target gene networks of transcription factors are largely unknown. Transposition of retroelements may play a role, but its global importance, beyond a few well described examples for isolated genes, is not clear. We report that LTR class I endogenous retrovirus (ERV) retroelements impact considerably the transcriptional network of human tumor suppressor protein p53. A total of 1,509 of ≈319,000 human ERV LTR regions have a near-perfect p53 DNA binding site. The LTR10 and MER61 families are particularly enriched for copies with a p53 site. These ERV families are primate-specific and transposed actively near the time when the New World and Old World monkey lineages split. Other mammalian species lack these p53 response elements. Analysis of published genomewide ChIP data for p53 indicates that more than one-third of identified p53 binding sites are accounted for by ERV copies with a p53 site. ChIP and expression studies for individual genes indicate that human ERV p53 sites are likely part of the p53 transcriptional program and direct regulation of p53 target genes. These results demonstrate how retroelements can significantly shape the regulatory network of a transcription factor in a species-specific manner.

The TP53 gene is a “master control gene” — a sequence whose product (“protein 53”, or “p53”) is a transcription factor that binds to DNA and regulates the expression of other genes, including ones involved in DNA repair, cell cycle regulation, and programmed cell death (apoptosis). It is so important that it has been dubbed “the guardian of the genome”. Mutations in this gene can be highly detrimental: the “T” in TP53 stands for tumor, and mutations in this gene are often associated with cancers. This includes many smoking-related cancers.

The authors of this study report that particular ERVs contain sites to which the p53 protein binds. As a result of past retrotransposition, these ERVs tend to be distributed in various locations in the genome. This makes it possible for the p53 protein to bind not just at one site, but at sites dispersed in different regions, and therefore in proximity to a variety of other genes. It is this distributed network of binding sites that allows p53 to regulate so many other genes in its role as genome guardian. And this is only possible because an ERV with a site to which the p53 protein is capable of binding inserted into the genome of an early primate ancestor some 40 million years ago, made copies of itself throughout the genome, and then became useful as a source of binding sites. This is classic co-option (exaptation) at the genomic level, and represents the very same kind of explanation that Darwin himself offered for the evolution of complex structures at the organismal scale.

While this is a truly interesting discovery that sheds even more light on the complex history of the genome, it also highlights some important points that I have tried to make on this blog. First, this applies to only a fraction of non-coding DNA. Only about 8% of the genome is made up of ERVs, and, of these, only 1,509 of 319,000 copies (0.5%) include the relevant binding site. About 90% of the ERVs are represented only by “solo LTRs”, the long terminal repeats that remain after the rest of the element has been deleted. Moreover, several ERVs have been implicated in autoimmune diseases. Thus, only a small fraction is likely to be involved in gene regulatory networks such as that of TP53, and some ERVs are clearly maladaptive from the perspective of the host genome.
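The proportions quoted above are easy to verify (my own arithmetic, using only the numbers as cited in this post):

```python
# Fractions implied by the numbers quoted above (Wang et al.).
with_p53_site = 1_509
total_erv_ltrs = 319_000

frac_of_erv_copies = with_p53_site / total_erv_ltrs
print(round(100 * frac_of_erv_copies, 1))    # ~0.5 (% of ERV copies)

# ERVs make up ~8% of the genome, so the p53-site copies are a tiny
# sliver of the genome as a whole:
frac_of_genome = 0.08 * frac_of_erv_copies
print(round(100 * frac_of_genome, 2))        # ~0.04 (% of the genome)
```

So even a celebrated co-option event like this one accounts for only a few hundredths of a percent of the total genome, which is why it cannot be extrapolated into a claim that non-coding DNA is functional wholesale.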

The evolution of the genome is a complex process involving multiple types of elements and interactions at several levels of organization. While very few authors ever claimed that all non-coding DNA was totally without function, it is certainly the case that non-coding sequences are worthy of the new-found attention that they have received from the genomics community. Let us hope that this will include more integration with evolutionary biology than has been evident in the past, as it clearly requires an appreciation of both complexity and history.


ps: The press release from UC Santa Cruz by Karen Schmidt is quite good (notwithstanding the mandatory “it was dismissed as junk” line).

Function, non-function, some function: a brief history of junk DNA.

It is commonly suggested by anti-evolutionists that recent discoveries of function in non-coding DNA support intelligent design and refute “Darwinism”. This misrepresents both the history and the science of this issue. I would like to provide some clarification of both aspects.

When people began estimating genome sizes (amounts of DNA per genome) in the late 1940s and early 1950s, they noticed that this is largely a constant trait within organisms and species. In other words, if you look at nuclei in different tissues within an organism or in different organisms from the same species, the amount of DNA per chromosome set is constant. (There are some interesting exceptions to this, but they were not really known at the time). This observed constancy in DNA amount was taken as evidence that DNA, rather than proteins, is the substance of inheritance.

These early researchers also noted that some “less complex” organisms (e.g., salamanders) possess far more DNA in their nuclei than “more complex” ones (e.g., mammals). This rendered the issue quite complex, because on the one hand DNA was thought to be constant because it’s what genes are made of, and yet the amount of DNA (“C-value”, for “constant”) did not correspond to assumptions about how many genes an organism should have. This (apparently) self-contradictory set of findings became known as the “C-value paradox” in 1971.

This “paradox” was solved with the discovery of non-coding DNA. Because most DNA in eukaryotes does not encode a protein, there is no longer a reason to expect C-value and gene number to be related. Not surprisingly, there was speculation about what role the “extra” DNA might be playing.

In 1972, Susumu Ohno coined the term “junk DNA”. The idea did not come from throwing his hands up and saying “we don’t know what it does so let’s just assume it is useless and call it junk”. He developed the idea based on knowledge about a mechanism by which non-coding DNA accumulates: the duplication and inactivation of genes. “Junk DNA,” as formulated by Ohno, referred to what we now call pseudogenes, which are non-functional from a protein-coding standpoint by definition. Nevertheless, a long list of possible functions for non-coding DNA continued to be proposed in the scientific literature.

In 1979, Gould and Lewontin published their classic “spandrels” paper (Proc. R. Soc. Lond. B 205: 581-598) in which they railed against the apparent tendency of biologists to attribute function to every feature of organisms. In the same vein, Doolittle and Sapienza published a paper in 1980 entitled “Selfish genes, the phenotype paradigm and genome evolution” (Nature 284: 601-603). In it, they argued that there was far too much emphasis on function at the organism level in explanations for the presence of so much non-coding DNA. Instead, they argued, self-replicating sequences (transposable elements) may be there simply because they are good at being there, independent of effects (let alone functions) at the organism level. Many biologists took their point seriously and began thinking about selection at two levels, within the genome and on organismal phenotypes. Meanwhile, functions for non-coding DNA continued to be postulated by other authors.

As the tools of molecular genetics grew increasingly powerful, there was a shift in some circles toward close examination of protein-coding genes, and something of a divide emerged between researchers interested in particular sequences and others focusing on genome size and other large-scale features. This became apparent when technological advances made it possible to contemplate sequencing the entire human genome: a question asked in all seriousness was whether the project should bother with the “junk”.

Of course, there is now a much greater link between genome sequencing and genome size research. For one, you need to know how much DNA is there just to get funding. More importantly, sequence analysis is shedding light on the types of non-coding DNA responsible for the differences in genome size, and non-coding DNA is proving to be at least as interesting as the genic portions.

To summarize,

  • Since the first discussions about DNA amount there have been scientists who argued that most non-coding DNA is functional, others who focused on mechanisms that could lead to more DNA in the absence of function, and yet others who took a position somewhere in the middle. This is still the situation now.
  • Lots of mechanisms are known that can increase the amount of DNA in a genome: gene duplication and pseudogenization, duplicative transposition, replication slippage, unequal crossing-over, aneuploidy, and polyploidy. By themselves, these could lead to increases in DNA content independent of benefits for the organism, or even despite small detrimental impacts, which is why non-function is a reasonable null hypothesis.
  • Evidence currently available suggests that about 5% of the human genome is functional. The least conservative guesses put the possible total at about 20%. The human genome is mid-sized for an animal, so in species with larger genomes the functional percentage is most likely even smaller. None of the discoveries suggests that all (or even more than a minor percentage) of non-coding DNA is functional, and the corollary is that there is indirect evidence that most of it is not.
  • Identification of function is done by evolutionary biologists and genome researchers using an explicit evolutionary framework. One of the best indications of function that we have for non-coding DNA is to find parts of it conserved among species. This suggests that changes to the sequence have been selected against over long stretches of time because those regions play a significant role. Obviously you cannot talk about evolutionarily conserved DNA without evolutionary change.
  • Examples of transposable elements acquiring function represent co-option. This is the same phenomenon that is involved in the evolution of complex features like eyes and flagella. In particular, co-option of TEs appears to have happened in the evolution of the vertebrate immune system. Again, this makes no sense in the absence of an evolutionary scenario.
  • Most transposable elements do not appear to be functional at the organism level. In humans, most are inactive molecular fossils. Some are active, however, and can cause all manner of diseases through their insertions. To repeat: some transposons are functional, some are clearly deleterious, and most probably remain more or less neutral.
  • Any suggestions that all non-coding DNA is functional must explain why an onion needs five times more of it than you do. So far, none of the proposed universal functions has done this. It therefore remains most reasonable to take a pluralistic approach in which only some non-coding elements are functional for organisms.

I realize that this will have no effect on the arguments made by anti-evolutionists, but I hope it at least clarifies the issue for readers who are interested in the actual science involved and its historical development.

More about ENCODE from Scientific American.

It is probably just coincidence, but two articles for which I gave interviews appeared online today. The first, which I discussed in an earlier post, was online in Wired, One Scientist’s Junk Is a Creationist’s Treasure by Catherine Shaffer. The second appeared in the online edition of Scientific American, The 1 Percent Genome Solution by JR Minkel. Both deal with non-coding DNA, though from rather different perspectives. The first is about creationists invoking the discovery (by evolutionary biologists and other scientists) of (indirect indication of) function in (small sections of) non-coding DNA. The second is about the search for those functions through detailed, rigorous scientific analysis.

I know that science writers have a tough job. And I know that we scientists grumble about a lot of what they generate. But this time I want to do something a little different. I want to give readers some idea of what science writers are faced with when they interview a scientist. This is possible because the interview for Scientific American was conducted by email rather than by phone (which I actually prefer). Have a look at the article, and then see how the interview actually proceeded, and think about the challenge of summarizing my answers, which were admittedly somewhat long-winded (some might say carefully worded so as to avoid confusion and to not overlook important points). Note also the kinds of questions that a writer has to develop.

Here are the pertinent sections from the article:

The consortium found that 5 percent of the studied sequence has been conserved among 23 mammals, suggesting that it plays an important enough role for evolution to preserve while species have evolved. But of all the new ENCODE sequences identified as potentially important, only half fall into the conserved group.

These unconserved sequences may be “bystanders,” Birney says—consequences of the genome’s other functions—that neither help nor hurt cells and may have provided fodder for past evolution.

They could also simply maintain a useful DNA structure or spacing between pieces of DNA regardless of their particular sequence, says genomics researcher T. Ryan Gregory of the University of Guelph in Ontario, who was not part of the consortium.

“The biological insights are mainly incremental at this point,” says genome biologist George Weinstock of the Baylor College of Medicine in Houston, which he says is to be expected of such a pilot study. “This is a ‘community resource’ project, like a genome project, that makes lots of new data available to the community, who then dig into it and mine it for discoveries.”

Gregory says the results, although still cryptic, do hint at new functions and a more complicated genome. “This study shows us how far we are from a comprehensive understanding of the human genome.”

And here are my answers to Minkel’s questions reproduced in full:

How much of what the consortium found is new?

– What is new about this study is the fine focus being applied to the search for functional elements. By way of analogy, this study is like a group of 35 treasure hunters with metal detectors and sifters combing the same 35m of a 3.5km long beach. (In fact, the 35m are broken up into 44 discrete stretches of beach, half of them chosen because they are known to contain lots of interesting objects and the other half selected to include areas with varying properties. The plan is eventually to comb the entire beach this way, but this first pass should be taken more as a proof-of-principle than a conclusive assessment).

– Some of the conclusions reinforce ideas that have already been in the literature for several years, for example that the majority of the human genome is transcribed (see, e.g., Wong et al. 2000; Wong et al. 2001). The identification of non-protein-coding transcripts, particularly in areas where this was not thought to occur, is novel. But, again, this particular study is based on only 1% of the genome and one should exercise caution in extrapolating it to the entire human genome.

– Other ideas, such that chromatin structure is important in regulation, are also not entirely new, but these data provide interesting new evidence for them.

How much of what was identified is likely to be functional?

– 5% of the genome sequence is conserved across mammals, and for about 60% of this (i.e., 3% of the genome) there is additional evidence of function. This includes the protein-coding exons as well as regulatory elements and other functional sequences. So, at this stage, we have increasingly convincing evidence of function for about 3% of the genome, with another 2% likely to fall into this category as it becomes more thoroughly characterized.
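Spelled out as simple arithmetic (my own restatement of the figures in the answer above):

```python
# Restating the ENCODE-related figures quoted above.
conserved = 0.05    # fraction of the genome conserved across mammals
evidence = 0.60     # fraction of the conserved part with added evidence

supported_now = conserved * evidence
print(round(100 * supported_now))                 # ~3 (% of the genome)
print(round(100 * (conserved - supported_now)))   # ~2 (% likely to follow)
```

That is, convincing evidence of function currently covers roughly 3% of the genome, with the remaining conserved 2% the most likely candidate pool as characterization continues.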

– The authors report the presence of sequences that are not conserved but show experimental (in the genomics sense) evidence of function. There need not be constraint on the base-pair sequence if the mere presence of non-coding DNA fills the role, independent of what that DNA is. For example, if it is simply a matter of physical spacing or structural arrangement, then it may not matter what the actual sequence of bases is. On the other hand, the authors argue that these elements “may serve as a ‘warehouse’ for natural selection, potentially acting as the source of lineage-specific elements and functionally conserved but non-orthologous elements between species”. Of course, this would be an effect, not a function, because natural selection does not have foresight and cannot maintain elements because they may someday be useful. Also, they suggest that these regions are “neutral”, meaning that they are “biochemically active” but “do not confer a selective advantage or disadvantage to the organism”. If they have no fitness effects then they cannot have a function in the usual sense of the term; however, it could be that their absence would be detrimental, in which case there would be convincing evidence of function of some sort.

– A large fraction of the sequences analyzed, both in introns and intergenic regions, appears to be transcribed. However, most of this DNA is not conserved and there is no clear indication of function. It could be that the transcripts themselves play a functional role, or that the process of transcription, but not the transcripts per se, contributes an important effect. It could be that the regions they examined, which were typically gene-dense, included transcribed introns (no surprise) plus longer-than-expected regulatory regions such as promoters near but outside of genes (e.g., Cooper et al. 2007), but that on the whole the long stretches of non-coding DNA between genes are not actually transcribed. Or, it could be that transcription in the human genome is simply very inefficient. For example, the data in this study suggest that 19% of pseudogenes in their sample are transcribed, even though by definition they cannot encode a protein and are unlikely to play a regulatory role. It also appears that in other groups, e.g., plants (Wong et al. 2000), there is lots of intergenic DNA that is not transcribed, which may indicate that pervasive transcription is peculiar to mammals and not typical of eukaryotic genomes.

– Looking at a broader scale, we must bear in mind that about half the human genome consists of transposable elements. Some of these clearly do have functions (e.g., in gene regulation), but others persist as disease-causing mutagens. It could be that a large portion of these have taken on functions, but this remains to be shown. We are also left with the question of why a pufferfish would require only 10% as much non-coding DNA as a human whereas an average salamander needs 10 times more than we do. The well known patterns of genome size diversity make it difficult to explain the presence of all non-coding DNA in functional terms, even as there is growing evidence that a significant portion of non-coding DNA is indeed functionally important.

What does this tell us about the genome’s organization and evolution?

– This work follows the growing trend in which simplistic assumptions about genome form and function are being overturned. Previous examples include the assumption that each gene encodes one protein product and the associated expectation that there would be a relatively large number of genes in our genome. This study deals a blow to the notion that the human genome is organized and regulated in a simple way, and further suggests that our definition of “gene” may need to be expanded.

– This study shows us how far we are from a comprehensive understanding of the human genome, but it also provides some of the tools that will be needed to achieve this goal.

– The authors begin their paper with a conclusion (p. 799): “The human genome is an elegant but cryptic store of information.” Elegant, in the scientific sense, means “concise, simple, succinct”. This does not strike me as an accurate descriptor for such a complex, redundant evolutionary patchwork.

– This study reinforces the notion that the genome is a legitimate level of biological organization with its own complex evolutionary history.

You advise caution in extrapolating the results. Do you think it more likely that the study over- or under-represents the amount of complexity or underappreciated function in the genome? Why, or what other biases would you expect?

The concern is that it may overestimate the level of function in the genome, given that they specifically selected regions rich in well-characterized genes for at least half the dataset. Of course, the objective of the study is to identify functional elements, so an aggressive approach to the question is warranted in that context. However, they probably considered few sequences that were not associated with genes in some way, such as long stretches of short repeats or transposable elements. The study does suggest that regulation is more complex than we thought, shows some evidence of function for some noncoding DNA, and indicates that lots of noncoding DNA is transcribed, but beyond that it hasn’t really clarified these issues, nor should it be expected to, as this was a pilot project only.

You say the study “deals a blow to the notion that the human genome is organized and regulated in a simple way, and further suggests that our definition of ‘gene’ may need to be expanded.” Do most biologists believe the genome is simply organized and regulated? What’s the dominant view?

I would say that, for obvious pragmatic reasons, people assume that a system is simple until it is shown to be otherwise. At first, it was surprising that genome size is decoupled from organismal complexity. Then it was surprising that genes are split into coding exons and noncoding introns. Then it was surprising that half the human genome is transposable elements. Then it was surprising that there are only 25,000 genes. Now it is surprising that a significant portion of the noncoding DNA is transcribed and that gene regulation is not a simple on-off system but involves interactions – perhaps even networks – of coding and noncoding segments. I wouldn’t want to speak for “most biologists”, but I think overall we are coming to appreciate that less has been figured out about genome function than we first thought. And that is what makes the future of genomic science exciting.

And what further evidence would tell us whether we should redefine “gene”? (E.g., would we need to find disease mutations associated with these chimeric transcripts?)

It depends on what you want the term “gene” to represent. In its original definition, it did not specify “protein-coding exons” (because these were not discovered until decades later), and instead referred to a generalized notion of a genetic “determiner” (according to Johannsen 1909, “The word gene is completely free from any hypothesis; it expresses only the evident fact that, in any case, many characteristics of the organism are specified in the germ cells by means of special conditions, foundations, and determiners which are present in unique, separate, and thereby independent ways”). After the rise of molecular genetics in the ‘50s, the focus shifted to individual protein-coding sequences (hence, “one gene, one protein”), though this was expanded to include the intron-exon arrangement after it was described by Gilbert in 1978. Now we see that “units of genetic specification”, or what we might want the term “gene” to describe, can include exons, introns (especially as they play a role in alternative splicing to generate several proteins from one “gene”), regulatory regions, promoters, noncoding RNAs, and other elements. Maybe we need a word to mean “an associated unit of protein-coding exons that specifies a particular set of protein products” and one for “all sequences that are involved in generating a particular set of protein products, including coding, regulation, and associated processes”. One of these could be “gene” but we’d need another term to refer to the other. It may be rendered more complex if some regulatory elements affect multiple coding regions (hence discussion regarding relative contributions of cis vs. trans mechanisms). I think it is already clear enough that there is more to gene expression than simply transcribing a stretch of DNA and splicing out the introns, even without linking changes in non-exonic elements to deleterious effects.
So, it’s not so much a requirement of more experimental work to identify disease mutations as a conceptual decision about what we want the word to mean based on the more fundamental discoveries about regulation, protein-coding, and non-protein-coding function.

Overall, I think Minkel did a very good job with this piece — especially given the complex issues being discussed and the input offered by several scientists.