Inter-lineage selection versus "just in case".

I still want to grant the benefit of the doubt to my fellow biologists who recently have made statements about non-coding DNA being potentially useful in the future. Natural selection does not work this way, because it is simply the differential survival and reproduction of entities based on heritable differences. In the most common case, this means individual organisms within populations leaving more or fewer offspring and/or surviving or dying under given conditions in a non-random manner due to heritable trait differences. However, the general principle of natural selection is not restricted to this level, and is a logical consequence in any circumstance in which there is differential survival and reproduction based on inherited variation. There can be selection within the genome among transposons, for example, and some authors also argue that selection can take place among species (as differential speciation and extinction).

The most straightforward way of thinking about natural selection is to imagine that a certain genetic trait is either beneficial or detrimental to an organism, such that it is passed on either more or less commonly to subsequent generations. However, there can be higher-order selection as well, in which some lineages persist longer or branch off to form additional daughter lineages more often than others for non-random reasons. This is not why those traits originated nor why they are maintained from one generation to the next, but it could explain why lineages with those traits are more common or last longer than others.

As an example, consider sex. Sexual reproduction involves the recombination of genes, which has two important effects: 1) it allows beneficial mutations to spread more easily in a population, and 2) it prevents the ratchet-like accumulation of deleterious mutations at multiple loci. What this means is that sexual lineages can be expected to evolve more quickly and to last longer than asexual lineages. So, when we look around, we expect to see more sexual lineages than asexual ones, and indeed that is what we see (at least in animals). Sex did not evolve so that lineages would have greater evolutionary potential or would survive for a longer time, but that is nevertheless a significant effect when considering the distribution of biological diversity. However, there is still the issue that sexual reproduction is costly: you only pass on half your genes, you produce “wasteful” males, you have to find a mate, and so on. So we also need to consider immediate benefits that keep the trait around long enough for us to even notice the higher-order effects.

Now back to “junk DNA”. It may be that over the long term, lineages with more non-coding DNA are more flexible and can diverge more often, or that they are more resilient to environmental change and will last longer than those with less DNA. If this is so, then this might explain why we see lineages with lots of non-coding DNA — because those lineages persisted while others disappeared. We would still have to explain the origin of the non-coding DNA and the reason it persists over the shorter term though. There are several possibilities. One, non-coding DNA is beneficial to the organism in some way. Lots of ideas have been proposed for this over the last half century. Two, non-coding DNA could be neutral and is simply not eliminated by selection. Three, non-coding DNA is slightly detrimental, but selection has been too weak (e.g., if populations are small) or mutation too strong (e.g., continual transposable element insertions) for it to be deleted. In any of these situations, it could be possible for non-coding DNA to persist long enough to be co-opted (by chance mutations and subsequent selection) or to have impacts on lineage diversification and/or lifespan.

The problem with this is that species with small genomes are much more common than ones with large genomes and large-genomed species seem to be more sensitive to environmental challenges. So, the most likely scenario is that mutational mechanisms affect DNA amount from the bottom up, while selection comes into play from the top down in terms of effects on cell size and also selection against disruptions of genes. On balance, some lineages end up with large amounts of non-coding DNA, and in some cases this is co-opted into functions like regulation or structure.

It certainly could be that some people are thinking about this from a reasonable perspective based on multiple levels of selection and time scales and are just being sloppy in their descriptions of the net processes. Or maybe they really do think that “junk DNA” is kept because it might become useful. Either way, we need to steer clear of simplified soundbites that obfuscate more than enlighten.


Function, non-function, some function: a brief history of junk DNA.

It is commonly suggested by anti-evolutionists that recent discoveries of function in non-coding DNA support intelligent design and refute “Darwinism”. This misrepresents both the history and the science of this issue. I would like to provide some clarification of both aspects.

When people began estimating genome sizes (amounts of DNA per genome) in the late 1940s and early 1950s, they noticed that this is largely a constant trait within organisms and species. In other words, if you look at nuclei in different tissues within an organism or in different organisms from the same species, the amount of DNA per chromosome set is constant. (There are some interesting exceptions to this, but they were not really known at the time.) This observed constancy in DNA amount was taken as evidence that DNA, rather than proteins, is the substance of inheritance.

These early researchers also noted that some “less complex” organisms (e.g., salamanders) possess far more DNA in their nuclei than “more complex” ones (e.g., mammals). This rendered the issue quite complex, because on the one hand DNA was thought to be constant because it’s what genes are made of, and yet the amount of DNA (“C-value”, for “constant”) did not correspond to assumptions about how many genes an organism should have. This (apparently) self-contradictory set of findings became known as the “C-value paradox” in 1971.

This “paradox” was solved with the discovery of non-coding DNA. Because most DNA in eukaryotes does not encode a protein, there is no longer a reason to expect C-value and gene number to be related. Not surprisingly, there was speculation about what role the “extra” DNA might be playing.

In 1972, Susumu Ohno coined the term “junk DNA”. The idea did not come from throwing his hands up and saying “we don’t know what it does so let’s just assume it is useless and call it junk”. He developed the idea based on knowledge about a mechanism by which non-coding DNA accumulates: the duplication and inactivation of genes. “Junk DNA,” as formulated by Ohno, referred to what we now call pseudogenes, which are non-functional from a protein-coding standpoint by definition. Nevertheless, a long list of possible functions for non-coding DNA continued to be proposed in the scientific literature.

In 1979, Gould and Lewontin published their classic “spandrels” paper (Proc. R. Soc. Lond. B 205: 581-598) in which they railed against the apparent tendency of biologists to attribute function to every feature of organisms. In the same vein, Doolittle and Sapienza published a paper in 1980 entitled “Selfish genes, the phenotype paradigm and genome evolution” (Nature 284: 601-603). In it, they argued that there was far too much emphasis on function at the organism level in explanations for the presence of so much non-coding DNA. Instead, they argued, self-replicating sequences (transposable elements) may be there simply because they are good at being there, independent of effects (let alone functions) at the organism level. Many biologists took their point seriously and began thinking about selection at two levels, within the genome and on organismal phenotypes. Meanwhile, functions for non-coding DNA continued to be postulated by other authors.

As the tools of molecular genetics grew increasingly powerful, there was a shift toward close examinations of protein-coding genes in some circles, and something of a divide emerged between researchers interested in particular sequences and others focusing on genome size and other large-scale features. This became apparent when technological advances allowed thoughts of sequencing the entire human genome: a question asked in all seriousness was whether the project should bother with the “junk”.

Of course, there is now a much greater link between genome sequencing and genome size research. For one, you need to know how much DNA is there just to get funding. More importantly, sequence analysis is shedding light on the types of non-coding DNA responsible for the differences in genome size, and non-coding DNA is proving to be at least as interesting as the genic portions.

To summarize,

  • Since the first discussions about DNA amount there have been scientists who argued that most non-coding DNA is functional, others who focused on mechanisms that could lead to more DNA in the absence of function, and yet others who took a position somewhere in the middle. This is still the situation now.
  • Lots of mechanisms are known that can increase the amount of DNA in a genome: gene duplication and pseudogenization, duplicative transposition, replication slippage, unequal crossing-over, aneuploidy, and polyploidy. By themselves, these could lead to increases in DNA content independent of benefits for the organism, or even despite small detrimental impacts, which is why non-function is a reasonable null hypothesis.
  • Evidence currently available suggests that about 5% of the human genome is functional. The least conservative guesses put the possible total at about 20%. The human genome is mid-sized for an animal, which means that most likely a smaller percentage than this is functional in other genomes. None of the discoveries suggest that all (or even more than a minor percentage) of non-coding DNA is functional, and the corollary is that there is indirect evidence that most of it is not.
  • Identification of function is done by evolutionary biologists and genome researchers using an explicit evolutionary framework. One of the best indications of function that we have for non-coding DNA is to find parts of it conserved among species. This suggests that changes to the sequence have been selected against over long stretches of time because those regions play a significant role. Obviously you cannot talk about evolutionarily conserved DNA without evolutionary change.
  • Examples of transposable elements acquiring function represent co-option. This is the same phenomenon that is involved in the evolution of complex features like eyes and flagella. In particular, co-option of TEs appears to have happened in the evolution of the vertebrate immune system. Again, this makes no sense in the absence of an evolutionary scenario.
  • Most transposable elements do not appear to be functional at the organism level. In humans, most are inactive molecular fossils. Some are active, however, and can cause all manner of diseases through their insertions. To repeat: some transposons are functional, some are clearly deleterious, and most probably remain more or less neutral.
  • Any suggestion that all non-coding DNA is functional must explain why an onion needs five times more of it than you do. So far, none of the proposed universal functions has done this. It therefore remains most reasonable to take a pluralistic approach in which only some non-coding elements are functional for organisms.
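The arithmetic behind the onion comparison can be made explicit with a short sketch. It uses only the figures quoted in the summary above (about 5% of the human genome functional on current evidence, about 20% as the least conservative guess, and an onion with roughly five times as much DNA). Two simplifications are assumptions of this sketch rather than claims from the text: the fivefold ratio is applied to the whole genome instead of to non-coding DNA alone, and the onion is assumed to need the same absolute amount of functional DNA as a human. The function name is hypothetical.

```python
def implied_functional_fraction(human_fraction, size_ratio):
    """Fraction of a larger genome that would be functional if it needed the
    same absolute amount of functional DNA as the human genome.

    human_fraction: functional fraction of the human genome (0 to 1)
    size_ratio: size of the other genome relative to human (assumed > 0)
    """
    if size_ratio <= 0:
        raise ValueError("size_ratio must be positive")
    return human_fraction / size_ratio


# "Five times more" is the figure from the text; applying it to total
# genome size is a simplifying assumption of this sketch.
ONION_TO_HUMAN = 5.0

for label, fraction in (("current evidence", 0.05),
                        ("least conservative guess", 0.20)):
    onion = implied_functional_fraction(fraction, ONION_TO_HUMAN)
    print(f"{label}: human {fraction:.0%} functional, onion {onion:.0%}")
```

Under these assumptions the onion's functional fraction comes out at about 1% to 4%, so nearly all of its extra DNA would still need some other explanation. That is the sense in which a universal function has to account for the onion.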

I realize that this will have no effect on the arguments made by anti-evolutionists, but I hope it at least clarifies the issue for readers who are interested in the actual science involved and its historical development.


ENCODE links.

The ENCODE paper and related commentaries in Nature (June 14):

List of stories about the ENCODE study:

I think the project is very interesting and important, but as I have said before, one study by itself is rarely revolutionary. ENCODE is adding evidence in favour of a revised understanding of genome function. It, along with many other studies, may require us to re-think a few concepts like “regulatory sequences” or “gene”, but this one paper alone is not engaged in battle against some stubborn establishment that steadfastly refuses to consider new possibilities.


More about ENCODE from Scientific American.

It is probably just coincidence, but two articles for which I gave interviews appeared online today. The first, which I discussed in an earlier post, was online in Wired, One Scientist’s Junk Is a Creationist’s Treasure by Catherine Shaffer. The second appeared in the online edition of Scientific American, The 1 Percent Genome Solution by JR Minkel. Both deal with non-coding DNA, though from rather different perspectives. The first is about creationists invoking the discovery (by evolutionary biologists and other scientists) of (indirect indication of) function in (small sections of) non-coding DNA. The second is about the search for those functions through detailed, rigorous scientific analysis.

I know that science writers have a tough job. And I know that we scientists grumble about a lot of what they generate. But this time I want to do something a little different. I want to give readers some idea of what science writers are faced with when they interview a scientist. This is possible because the interview for Scientific American was conducted by email rather than by phone (which I actually prefer). Have a look at the article, and then see how the interview actually proceeded, and think about the challenge of summarizing my answers, which were admittedly somewhat long-winded (some might say carefully worded so as to avoid confusion and to not overlook important points). Note also the kinds of questions that a writer has to develop.

Here are the pertinent sections from the article:

The consortium found that 5 percent of the studied sequence has been conserved among 23 mammals, suggesting that it plays an important enough role for evolution to preserve while species have evolved. But of all the new ENCODE sequences identified as potentially important, only half fall into the conserved group.

These unconserved sequences may be “bystanders,” Birney says (consequences of the genome’s other functions) that neither help nor hurt cells and may have provided fodder for past evolution.

They could also simply maintain a useful DNA structure or spacing between pieces of DNA regardless of their particular sequence, says genomics researcher T. Ryan Gregory of the University of Guelph in Ontario, who was not part of the consortium.

“The biological insights are mainly incremental at this point,” says genome biologist George Weinstock of the Baylor College of Medicine in Houston, which he says is to be expected of such a pilot study. “This is a ‘community resource’ project, like a genome project, that makes lots of new data available to the community, who then dig into it and mine it for discoveries.”

Gregory says the results, although still cryptic, do hint at new functions and a more complicated genome. “This study shows us how far we are from a comprehensive understanding of the human genome.”

And here are my answers to Minkel’s questions reproduced in full:

How much of what the consortium found is new?

– What is new about this study is the fine focus being applied to the search for functional elements. By way of analogy, this study is like a group of 35 treasure hunters with metal detectors and sifters combing the same 35m of a 3.5km long beach. (In fact, the 35m are broken up into 44 discrete stretches of beach, half of them chosen because they are known to contain lots of interesting objects and the other half selected to include areas with varying properties. The plan is eventually to comb the entire beach this way, but this first pass should be taken more as a proof-of-principle than a conclusive assessment).

– Some of the conclusions reinforce ideas that have already been in the literature for several years, for example that the majority of the human genome is transcribed (see, e.g., Wong et al. 2000; Wong et al. 2001). The identification of non-protein-coding transcripts, particularly in areas where this was not thought to occur, is novel. But, again, this particular study is based on only 1% of the genome and one should exercise caution in extrapolating it to the entire human genome.

– Other ideas, such as the idea that chromatin structure is important in regulation, are also not entirely new, but these data provide interesting new evidence for them.

How much of what was identified is likely to be functional?

– 5% of the genome sequence is conserved across mammals, and for about 60% of this (i.e., 3% of the genome) there is additional evidence of function. This includes the protein-coding exons as well as regulatory elements and other functional sequences. So, at this stage, we have increasingly convincing evidence of function for about 3% of the genome, with another 2% likely to fall into this category as it becomes more thoroughly characterized.

– The authors report the presence of sequences that are not conserved but show experimental (in the genomics sense) evidence of function. There need not be constraint on base pair sequences if merely the presence of non-coding DNA would fill the role independent of what that DNA is. For example, if it is simply a matter of physical spacing or structural arrangement, then it may not matter what the actual sequence of bases is. On the other hand, the authors argue that these elements “may serve as a ‘warehouse’ for natural selection, potentially acting as the source of lineage-specific elements and functionally conserved but non-orthologous elements between species”. Of course, this would be an effect, not a function, because natural selection does not have foresight and cannot maintain elements because they may someday be useful. Also, they suggest that these regions are “neutral”, meaning that they are “biochemically active” but “do not confer a selective advantage or disadvantage to the organism”. If they have no fitness effects then they cannot have a function in the usual sense of the term; however, it could be that their absence would be detrimental, in which case there would be convincing evidence of function of some sort.

– A large fraction of the sequences analyzed, both in introns and intergenic regions, appears to be transcribed. However, most of this DNA is not conserved and there is no clear indication of function. It could be that the transcripts themselves play a functional role or that the process of transcription but not the transcripts per se contributes an important effect. It could be that the regions they examined, which were typically gene-dense, included transcribed introns (no surprise) plus longer-than-expected regulatory regions such as promoters near but outside of genes (e.g., Cooper et al. 2007), but that on the whole the long stretches of non-coding DNA in between genes are not actually transcribed. Or, it could be that transcription in the human genome simply is very inefficient. For example, the data in this study suggest that 19% of pseudogenes in their sample are transcribed, even though by definition they cannot encode a protein and are unlikely to play a regulatory role. It also appears that in other groups, e.g., plants (Wong et al. 2000), there is lots of intergenic DNA that is not transcribed, which may indicate that this is a process unique to mammals and is not typical of eukaryotic genomes.

– Looking at a broader scale, we must bear in mind that about half the human genome consists of transposable elements. Some of these clearly do have functions (e.g., in gene regulation), but others persist as disease-causing mutagens. It could be that a large portion of these have taken on functions, but this remains to be shown. We are also left with the question of why a pufferfish would require only 10% as much non-coding DNA as a human whereas an average salamander needs 10 times more than we do. The well known patterns of genome size diversity make it difficult to explain the presence of all non-coding DNA in functional terms, even as there is growing evidence that a significant portion of non-coding DNA is indeed functionally important.
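For readers who want to check the numbers in the answer above, here is a minimal sketch of the two calculations it relies on: the split of the conserved 5% into a better-supported 3% and a remaining 2%, and the spread in non-coding DNA between a pufferfish and an average salamander. All figures are taken from the answer itself; the function and variable names are hypothetical, and the non-coding amounts are expressed relative to human (human = 1.0) rather than in base pairs.

```python
def conserved_breakdown(conserved, share_with_evidence):
    """Split the conserved fraction of the genome into the part with
    additional evidence of function and the part still to be characterized."""
    supported = conserved * share_with_evidence
    return supported, conserved - supported


# Figures quoted in the answer: 5% conserved, about 60% of that supported.
supported, pending = conserved_breakdown(0.05, 0.60)
print(f"supported: {supported:.0%}, still to be characterized: {pending:.0%}")

# Non-coding DNA relative to human, as described in the answer:
# pufferfish about 10% as much, an average salamander about 10 times more.
RELATIVE_NONCODING = {
    "pufferfish": 0.1,
    "human": 1.0,
    "average salamander": 10.0,
}
spread = max(RELATIVE_NONCODING.values()) / min(RELATIVE_NONCODING.values())
print(f"salamander vs. pufferfish: about {spread:.0f}-fold difference")
```

The second calculation gives a difference of roughly two orders of magnitude between the two vertebrates, which is the pattern any purely functional account of non-coding DNA would have to accommodate.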

What does this tell us about the genome’s organization and evolution?

– This work follows the growing trend in which simplistic assumptions about genome form and function are being overturned. Previous examples include the assumption that each gene encodes one protein product and the associated expectation that there would be a relatively large number of genes in our genome. This study deals a blow to the notion that the human genome is organized and regulated in a simple way, and further suggests that our definition of “gene” may need to be expanded.

– This study shows us how far we are from a comprehensive understanding of the human genome, but it also provides some of the tools that will be needed to achieve this goal.

– The authors begin their paper with a conclusion (p. 799): “The human genome is an elegant but cryptic store of information.” Elegant, in the scientific sense, means “concise, simple, succinct”. This does not strike me as an accurate descriptor for such a complex, redundant evolutionary patchwork.

– This study reinforces the notion that the genome is a legitimate level of biological organization with its own complex evolutionary history.

You advise caution in extrapolating the results. Do you think it more likely that the study over- or under-represents the amount of complexity or underappreciated function in the genome? Why, or what other biases would you expect?

The concern is that it may over-estimate the level of function in the genome, given that they specifically selected regions rich in well-characterized genes for at least half the dataset. Of course, the objective of the study is to identify functional elements, so an aggressive approach to the question is warranted in that context. However, they probably considered few sequences that were not associated with genes in some way, such as long stretches of short repeats or transposable elements. The study does suggest that regulation is more complex than we thought, shows some evidence of function for some noncoding DNA, and indicates that lots of noncoding DNA is transcribed, but beyond that it hasn’t really clarified these issues — nor should it be expected to, as this was a pilot project only.

You say the study “deals a blow to the notion that the human genome is organized and regulated in a simple way, and further suggests that our definition of “gene” may need to be expanded.” Do most biologists believe the genome is simply organized and regulated? What’s the dominant view?

I would say that, for obvious pragmatic reasons, people assume that a system is simple until it is shown to be otherwise. At first, it was surprising that genome size is decoupled from organismal complexity. Then it was surprising that genes are split into coding exons and noncoding introns. Then it was surprising that half the human genome is transposable elements. Then it was surprising that there are only 25,000 genes. Now it is surprising that a significant portion of the noncoding DNA is transcribed and that gene regulation is not a simple on-off system but involves interactions – perhaps even networks – of coding and noncoding segments. I wouldn’t want to speak for “most biologists”, but I think overall we are coming to appreciate that less has been figured out about genome function than we first thought. And that is what makes the future of genomic science exciting.

And what further evidence would tell us whether we should redefine “gene”? (E.g., would we need to find disease mutations associated with these chimeric transcripts?)

It depends on what you want the term “gene” to represent. In its original definition, it did not specify “protein-coding exons” (because these were not discovered until decades later), and instead referred to a generalized notion of a genetic “determiner” (according to Johannsen 1909: “The word gene is completely free from any hypothesis; it expresses only the evident fact that, in any case, many characteristics of the organism are specified in the germ cells by means of special conditions, foundations, and determiners which are present in unique, separate, and thereby independent ways”). After the rise of molecular genetics in the ’50s, the focus shifted to individual protein-coding sequences (hence, “one gene, one protein”), though this was expanded to include the intron-exon arrangement after it was described by Gilbert in 1978. Now we see that “units of genetic specification”, or what we might want the term “gene” to describe, can include exons, introns (especially as they play a role in alternative splicing to generate several proteins from one “gene”), regulatory regions, promoters, noncoding RNAs, and other elements. Maybe we need a word to mean “an associated unit of protein-coding exons that specifies a particular set of protein products” and one for “all sequences that are involved in generating a particular set of protein products, including coding, regulation, and associated processes”. One of these could be “gene” but we’d need another term to refer to the other. It may be rendered more complex if some regulatory elements affect multiple coding regions (hence discussion regarding relative contributions of cis vs. trans mechanisms). I think it is becoming clear enough that there is more to it than simply transcribing the stretch of DNA and splicing out the introns without linking changes in non-exonic elements to deleterious effects.

So, it’s not so much a requirement of more experimental work to identify disease mutations as a conceptual decision about what we want the word to mean based on the more fundamental discoveries about regulation, protein-coding, and non-protein-coding function.

Overall, I think Minkel did a very good job with this piece — especially given the complex issues being discussed and the input offered by several scientists.


"Because" versus "so that".

I want to make a quick point about how evolution works and how it does not. The reason is that two stories about non-coding DNA posted today include a major misconception about evolution. Unfortunately, this is a misconception attributed in the articles to biologists, so I can only imagine what the state of comprehension is among non-scientists.

The distinction is between “because” and “so that”. In evolution, things evolve “because,” meaning that there are causes and effects that can be identified. Why are some strains of bacteria resistant to antibiotics? Because a mutation occurred that happened to be beneficial under the conditions of antibiotic treatment, and it became common in the population over the course of several generations. By contrast, things do not evolve “so that”. Bacteria do not experience mutations so that they will become resistant to antibiotic agents.
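The “because” half of this can be illustrated with a toy calculation. The sketch below is a textbook-style deterministic model of haploid selection, not anything specific to a real bacterium or drug: it assumes a resistance mutation is already present at a low starting frequency (it arose by chance, not in response to need), that resistant cells leave 1 + s offspring for every one left by sensitive cells while the antibiotic is present, and that there is no drift or further mutation. The starting frequency, the value of s, and the function names are all hypothetical choices made for illustration.

```python
def next_frequency(p, s):
    """One generation of haploid selection.

    p: current frequency of the resistant type (0 to 1)
    s: selective advantage of the resistant type under antibiotic treatment
    """
    resistant = p * (1.0 + s)   # resistant cells, weighted by relative fitness
    sensitive = 1.0 - p         # sensitive cells have relative fitness 1
    return resistant / (resistant + sensitive)


def trajectory(p0, s, generations):
    """Frequency of the resistant type at generation 0, 1, ..., generations."""
    freqs = [p0]
    for _ in range(generations):
        freqs.append(next_frequency(freqs[-1], s))
    return freqs


# Hypothetical values: one resistant cell per million, 50% advantage.
freqs = trajectory(p0=1e-6, s=0.5, generations=40)
for generation in (0, 10, 20, 30, 40):
    print(f"generation {generation:2d}: frequency {freqs[generation]:.6f}")
```

Nothing in the model looks ahead. The mutation becomes common only because, once it exists, its carriers out-reproduce the others under the prevailing conditions. Setting s to zero (no antibiotic, in this toy setup) leaves the frequency where it started.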

Why is there so much non-coding DNA? Because transposable elements spread, or because there are accidental duplications that are not eliminated by selection, or because of the interaction of some other mutational processes and their consequences (or lack thereof). So much non-coding DNA did not evolve so that it might someday be useful, or so that it could be coopted when needed, or so that evolution would have more potential in the form of genetic raw materials.

So why, then, do we see quotes like these?

Wired One Scientist’s Junk Is a Creationist’s Treasure:

“I’ve stopped using the term [‘junk’],” Collins said. “Think about it the way you think about stuff you keep in your basement. Stuff you might need some time. Go down, rummage around, pull it out if you might need it.”

Reuters Human instruction book not so simple: studies:

“It is not the sort of clutter that you get rid of without consequences because you might need it. Evolution may need it,” [Collins] said.

That little extra padding might be just what an animal needs to adapt to some unforeseen circumstance, the researchers said. “They may become useful in the future,” Birney said.

The latter quote by Ewan Birney illustrates the problem that can arise when a detailed, nuanced discussion is summarized into a short soundbite. I know this from experience, and I suspect that this is what has happened here, given how his very reasonable interpretation is paraphrased in New Scientist ‘Junk’ DNA makes compulsive reading:

Birney says that the additional switches may be mutations that appear by accident and then generate new slugs of RNA, but because they are produced randomly, most are evolutionarily neutral ‘passengers’ in the genome. There might be rare occasions, however, when a new RNA does confer an advantage.

Collins, on the other hand, seems to have said his bit to two different reporters, so I strain to give him the benefit of the doubt on this one. When I began this blog, I did not think I would be pointing out obvious misconceptions about evolution, genomes, and DNA as propagated by the likes of Collins or Nature. But here we are.


Junk DNA gets Wired.

There is a new article on the Wired website about junk DNA [One Scientist’s Junk Is a Creationist’s Treasure]. I make a very brief appearance in it, and I just want to clarify what I meant by the statement cited (I’m still learning that even an hour-long interview might result in only a short blurb).

My quote is “Function at the organism level is something that requires evidence”. I make this statement because there are several different sorts of DNA sequences in the genome whose presence can be explained even if they do not benefit (and indeed, even if they slightly harm) the organism carrying them. Pseudogenes, satellite DNA, transposable elements (45% of our genome), and other non-coding sequences may or may not be functional — that requires evidence — and some may exist as a result of accidental duplication or even due to selection at the level of the elements themselves (by “intragenomic selection”). The old assumption that all non-coding DNA must be beneficial to the organism or it would have been deleted by now ignores genome-specific processes by which non-coding DNA evolves.

As I have discussed previously, both hardcore adaptationists (if any exist anymore) and creationists have a vested interest in having all non-coding DNA be functional. I believe that real-world variability in genome size argues strongly against such a prospect, but of course it is possible, and this is the point that people like Ohno, Doolittle, Orgel, and Crick made in the 1980s. The important point is that yes, some non-coding DNA is functional at the organism level (as opposed to existing for its own sake or because there is no strong selection against it). And certainly, non-coding DNA has effects at the organism level. But current evidence suggests that about 5% of the human genome is functional, and even the least conservative ENCODE participants (whose primary, and important, objective is to identify the functional elements and their features) are betting that 20% is functional.

In the end, it is obvious that non-coding DNA is the product of evolution whether it all turns out to be functional or not. The cases in which former parasites (transposons) have taken on function at the organism level are a perfect illustration of cooption, which is the same basic process that allows explanations for the evolution of complex structures like eyes or flagella. The research into function of non-coding DNA, which the creationists are eager to cite, can be carried out only under an evolutionary framework — it is meaningless to talk about “conserved non-coding DNA sequences” otherwise.

Finally, let me say one thing about Francis Collins’s quote: “Think about it the way you think about stuff you keep in your basement. Stuff you might need some time. Go down, rummage around, pull it out if you might need it.” With all due respect (which is considerable, given his contribution to the Human Genome Project), it makes no sense to explain the existence of non-coding DNA on the grounds that it might someday prove useful. Evolution does not work that way. Elements might be coopted, but maintaining this option explains neither the origin nor the persistence of non-coding sequences.

As to what the creationists have to say, well, I leave that to others with more (or less?) patience to attend to.


Decoding the blueprint. Sigh.

The results of the proof-of-principle phase of ENCODE, the Encyclopedia of DNA Elements Project, appear in the June 14 issue of Nature. It’s a very interesting project, and it has revealed a few more surprises (or at least, added evidence in favour of previously surprising observations). I will probably post more about it soon, but for the time being let me just offer a brief apology to the science writers out there whom I have given a hard time about invoking sloppy language to describe non-coding DNA, sequencing, and genomes (recent example, but one I will leave alone, ‘Junk’ DNA makes compulsive reading online at New Scientist).

The reason I am sorry is that I simply cannot hold you to a higher standard than is maintained by one of the most prestigious journals on planet Earth. You see, Nature has decided to depict the ENCODE project on the cover as “Decoding the Blueprint”. Needless to say (again), genomes are not blueprints (as the ENCODE project shows!) and no one is decoding anything at this point.

I have said all this before, and even I am getting tired of my complaints about it. Thus, I will focus only on the interesting science in a later post.

Sigh.


Am I a MacGregor?

The name “Gregory” is used as both a first name and a surname, and I wish I had a nickel for every time someone said “No, your last name” after I told them my name was “Gregory”. Jokes about having two (actually, three) “first” names have been a staple in my life as well.

There have been 16 popes with the name “Gregory”, including Pope Gregory I (“Gregory the Great”, which, had it not been taken, would have been a nickname I would have aspired to myself; he can keep “Saint Gregory”). Think “Gregorian calendar” (Pope Gregory XIII) or “Gregorian chants” (though these are probably not actually a product of Pope Gregory I). Readers with a snarkier side may consider this blog an example of “Gregorian rants” if they so desire.

There are many derivatives of the name “Gregor”, of which “Gregory” is one. It appears to date back to the Latin “Gregorius” and the Greek “Gregorios”, meaning “alert, watchful, or vigilant”. When my father and stepmother were in Greece, they were often told that they had a “very good Greek name”. Other languages have their own versions as well.

When I was living in the west end of London (specifically, the “London Borough of Richmond-Upon-Thames”), I would have my hair cut by a fantastic old-school barber, an ex-merchant marine who lived in a long boat on the Thames and who did the final trim on one’s neck with a straight razor. On my first visit, he remarked that I “must have Scottish blood”. The reason, apparently, had to do with my thick hair and reddish goatee. “What’s your surname?” he asked. “Gregory,” I replied. “Well there you go,” he said.

You see, the other, more circuitous origin of the name “Gregory” is via the Scottish Clan MacGregor (meaning “son of Gregor”, and thus linked back to the Latin/Greek origin). It seems the MacGregors ran afoul of King James VI, who made bearing the MacGregor name a capital offence in 1603. You may be familiar with the subsequent adventures of the “Scottish Robin Hood”, Rob Roy MacGregor, as portrayed on screen by Liam Neeson (who is not a Scottish folk hero at all, but a Northern Irish Jedi).

When given the choice between changing their names or being executed, most MacGregors opted for the former. The resulting names, which numbered more than 100 and of which Gregory was one of the more obvious, became septs of Clan MacGregor. The ban on the name MacGregor was lifted in 1774, but the division into different septs remains.

Today, my fellow DNA Network member Blaine Bettinger of The Genetic Genealogist reports on an effort by the Clan Gregor Society to use DNA to reunite the Clan MacGregor.

The idea of the MacGregor DNA Project is to draw comparisons to a genetic profile from a known descendant of the chief’s line (known only as “kit 2124”). Anyone who shares 31 out of the 37 DNA markers with this individual will be given full membership in the Clan Gregor Society, regardless of current surname. Gregory is one of a few surnames focused on explicitly as part of the project.

The project primarily is making use of Y-chromosome loci, which would mean that only descendants related through their father’s side would register. It appears that some mitochondrial DNA analysis is also being conducted, which would identify individuals related through descent on their mother’s side.

As per the old tradition in our society, I received my surname from my father and, as per the old tradition in biology, I also received my Y chromosome from him. In other words, it would be perfectly feasible for me to take the test and see if my red beard is homologous to that of Rob Roy.

But really, what’s the point? I am not Scottish, I am Canadian, and I am perfectly happy with that identity. Moreover, like a great many North Americans, I represent a mixture of many different families: Gregory, Davis, Sager, MacKenzie, and who knows what else (though I confess that the ingredients here are pretty limited in their variety, coming as they all do from the British Isles). It’s really only because of a quirk of our culture that I associate almost exclusively with Gregory.

Still, it would be pretty cool to wear an official clan tartan…


Genomes large and small.

The past few years have witnessed the discovery of both very large and small genomes in different groups of organisms. Here are some highlights from this research.

The first represents the largest genome so far reported for a crustacean, in the Arctic-dwelling amphipod Ampelisca macrocephala. The genome of this small invertebrate is a whopping 63.2 billion base pairs, or about 20 times larger than the human genome (Rees et al. 2007). Again, this sort of observation should dispel the notion that all non-coding DNA is functional for protecting against mutagens or some such thing.

The second interesting finding is of the largest viral genome so far discovered. The virus, dubbed Mimivirus, was sufficiently odd that it was originally assumed to be a bacterium when first observed, but on closer examination was found to be a virus. Its genome size is estimated as 1.2 million bases, which is larger than the genomes of many bacteria (Raoult et al. 2007). So, now there is overlap in reported genome sizes between viruses and bacteria, which goes along with the known overlap between the genome sizes of bacteria and eukaryotes (Gregory 2005).

And now for some small genomes. The smallest flowering plant genome known is that of Genlisea margaretae, at a mere 63 million base pairs (Greilhuber et al. 2006). That is less than half the size of the previous record holder, Arabidopsis thaliana, at about 157 million base pairs. This increases the range in angiosperm genome sizes to more than 2,000-fold. (In animals the total range is about 3,300-fold; Gregory et al. 2007.)

The smallest insect genome so far estimated was reported fairly recently as well. It belongs to Caenocholax fenyesi, a twisted-wing parasite, and is a mere 108 million base pairs (Johnston et al. 2004). Not to spoil the fun, but my lab has also found genome sizes this small in other groups, though these have not yet been published. The largest insect genome size known is found in the mountain grasshopper Podisma pedestris at 16.6 billion base pairs (Westerman et al. 1987).

The smallest eukaryotic genome known to date is that of the protist Encephalitozoon intestinalis, a parasitic microsporidian with a genome size of only 2.3 million base pairs, which is smaller than that of many bacteria (Vivarès and Méténier 2000). The smallest free-living eukaryote genome size is found in Ostreococcus tauri at 12.6 million base pairs (Derelle et al. 2006). The largest reliable protozoan genome size estimate reported to date is 97.8 billion base pairs in the dinoflagellate Gonyaulax polyedra (Shuter et al. 1983). Taking the two extremes quoted here, that is a more than 42,000-fold range among protists.

It should be pointed out that the largest published eukaryote genome size estimate is 1,400 billion base pairs (400 times larger than human) in the free-living amoeba Chaos chaos (Friz 1968), although the largest genome size is often attributed to Amoeba dubia at 700 billion base pairs based on the same study. These data are not generally considered reliable, for several reasons. First, these values for amoebae were based on rough biochemical measurements of total cellular DNA content, which probably includes a significant fraction of mitochondrial DNA. Second, Friz’s (1968) value of 300 pg for Amoeba proteus is an order of magnitude higher than those reported in subsequent studies (Byers 1986). Third, some amoebae (e.g., A. proteus) contain 500-1000 small chromosomes and are quite possibly highly polyploid (Byers 1986), in which case these values would be inappropriate for a comparison of haploid genome sizes among eukaryotes.
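For readers who want to relate the picogram figure above to the base-pair figures used elsewhere in this post, here is a minimal Python sketch. The conversion factor (1 pg of DNA is roughly 978 million base pairs) is a commonly used approximation that is not given in the post itself, and the function names are hypothetical.

```python
# Approximate conversion between DNA mass (picograms) and base pairs.
# Assumption (not from the post): 1 pg of DNA ~ 0.978 billion base pairs.
BP_PER_PG = 0.978e9


def pg_to_bp(picograms: float) -> float:
    """Convert a DNA content in picograms to an approximate number of base pairs."""
    return picograms * BP_PER_PG


def bp_to_pg(base_pairs: float) -> float:
    """Convert a number of base pairs to an approximate DNA content in picograms."""
    return base_pairs / BP_PER_PG


if __name__ == "__main__":
    # Friz's (1968) value for Amoeba proteus, as quoted in the post.
    print(f"300 pg is about {pg_to_bp(300) / 1e9:.0f} billion bp")
    # The Chaos chaos estimate quoted in the post, expressed as a mass.
    print(f"1,400 billion bp is about {bp_to_pg(1_400e9):.0f} pg")
```

Note that this only changes units. It says nothing about whether the underlying measurements are reliable, which is the real problem with the amoeba values.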

Finally, the smallest genome so far known for any cellular organism was also discovered recently: that of the endosymbiotic bacterium Carsonella ruddii, at a minuscule 159,662 base pairs (Nakabachi et al. 2006). This species resides within specialized cells inside the body of psyllid insect hosts. The genome is so small, and the insect and bacterium so mutually dependent, that this species blurs the lines between bacteria and organelles, and probably is similar in some ways to an intermediate stage in the evolution of other obligate intracellular symbionts turned organelles like mitochondria and chloroplasts.

The old assumption, still often repeated, that viruses have smaller genomes than bacteria, which have smaller genomes than single-celled eukaryotes, which in turn have smaller genomes than multicellular eukaryotes, is beginning to wear thin. The pattern remains in a general sense, but focusing only on such a coarse scale overlooks a significant amount of diversity within, and increasingly apparent overlap between, groups of life.
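The fold ranges and the overlap between groups mentioned in this post are simple ratios and comparisons, and anyone can check them. The Python sketch below does so using only genome sizes quoted above; the dictionary and function names are hypothetical, chosen just for illustration.

```python
# Genome sizes in base pairs, exactly as quoted in this post.
SIZES_BP = {
    "Carsonella ruddii": 159_662,              # bacterium
    "Mimivirus": 1_200_000,                    # virus
    "Encephalitozoon intestinalis": 2_300_000, # protist (smallest quoted)
    "Gonyaulax polyedra": 97_800_000_000,      # protist (largest reliable)
    "Caenocholax fenyesi": 108_000_000,        # insect (smallest quoted)
    "Podisma pedestris": 16_600_000_000,       # insect (largest quoted)
}


def fold_range(smallest_bp: float, largest_bp: float) -> float:
    """Return how many times larger the largest genome is than the smallest."""
    if smallest_bp <= 0:
        raise ValueError("genome size must be positive")
    return largest_bp / smallest_bp


if __name__ == "__main__":
    insects = fold_range(SIZES_BP["Caenocholax fenyesi"], SIZES_BP["Podisma pedestris"])
    protists = fold_range(SIZES_BP["Encephalitozoon intestinalis"], SIZES_BP["Gonyaulax polyedra"])
    print(f"Insect range: about {insects:,.0f}-fold")
    print(f"Protist range: about {protists:,.0f}-fold")

    # Overlap between groups: a virus with a larger genome than a bacterium.
    overlap = SIZES_BP["Mimivirus"] > SIZES_BP["Carsonella ruddii"]
    print(f"Mimivirus genome larger than Carsonella genome: {overlap}")
```

The same function can be applied to any pair of values from the genome size databases.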

___________

Readers interested in exploring genome size data can check out the various online databases for more.

References

Byers, T.J. 1986. Molecular biology of DNA in Acanthamoeba, Amoeba, Entamoeba, and Naegleria. International Review of Cytology 99: 311-341.

Derelle, E., C. Ferraz, S. Rombauts, P. Rouzé, A.Z. Worden, S. Robbens, F. Partensky, S. Degroeve, S. Echeynié, R. Cooke, Y. Saeys, J. Wuyts, K. Jabbari, C. Bowler, O. Panaud, B. Piégu, S.G. Ball, J.P. Ral, F.Y. Bouget, G. Piganeau, B. De Baets, A. Picard, M. Delseny, J. Demaille, Y. Van de Peer, H. Moreau. 2006. Genome analysis of the smallest free-living eukaryote Ostreococcus tauri unveils many unique features. Proceedings of the National Academy of Sciences of the USA 103: 11647-11652.

Friz, C.T. 1968. The biochemical composition of the free-living amoebae Chaos chaos, Amoeba dubia, and Amoeba proteus. Comparative Biochemistry and Physiology 26: 81-90.

Gregory, T.R. 2005. Synergy between sequence and size in large-scale genomics. Nature Reviews Genetics 6: 699-708.

Gregory, T.R., J.A. Nicol, H. Tamm, B. Kullman, K. Kullman, I.J. Leitch, B.G. Murray, D.F. Kapraun, J. Greilhuber, and M.D. Bennett. 2007. Eukaryotic genome size databases. Nucleic Acids Research 35 (Suppl. 1): D332-D338.

Greilhuber, J., T. Borsch, K. Müller, A. Worberg, S. Porembski, and W. Barthlott. 2006. Smallest angiosperm genomes found in Lentibulariaceae, with chromosomes of bacterial size. Plant Biology 8: 770-777.

Johnston, J.S., L.D. Ross, L. Beani, D.P. Hughes, and J. Kathirithamby. 2004. Tiny genomes and endoreduplication in Strepsiptera. Insect Molecular Biology 13: 581-585.

Nakabachi, A., A. Yamashita, H. Toh, H. Ishikawa, H.E. Dunbar, N.A. Moran, and M. Hattori. 2006. The 160-kilobase genome of the bacterial endosymbiont Carsonella. Science 314: 267.

Raoult, D., B. La Scola, and R. Birtles. 2007. The discovery and characterization of Mimivirus, the largest known virus and putative pneumonia agent. Clinical Infectious Diseases 45: 95-102.

Rees, D.J., F. Dufresne, H. Glémet, and C. Belzile. 2007. Amphipod genome sizes: first estimates for Arctic species reveal genomic giants. Genome 50: 151-158.

Vivarès, C.P. and G. Méténier. 2000. Towards the minimal eukaryotic parasitic genome. Current Opinion in Microbiology 3: 463-467.

Westerman, M., N.H. Barton, and G.M. Hewitt. 1987. Differences in DNA content between two chromosomal races of the grasshopper Podisma pedestris. Heredity 58: 221-228.