Genome size is good for you.

I imagine that every practicing scientist has experienced, in one form or another, the tendency of many non-scientists to expect all research to be directly beneficial to human health and well-being. I used to respond facetiously to these kinds of expectations when expressed by friends or family members, with something along the lines of “My work has absolutely no practical applications to human welfare whatsoever”.

Of course, this is not true. Genome size is becoming very relevant to fields of inquiry that are likely to have major significance for medicine. Notably, genome size data provide an important indication of the cost and difficulty of sequencing a given genome, and thus represent a prime criterion in the choice of sequencing targets. As an example, I performed a genome size estimate for Biomphalaria glabrata, a planorbid snail that serves as an intermediate host for the trematode flatworm Schistosoma mansoni, which causes the debilitating disease known as schistosomiasis. The genome of B. glabrata is one of the smallest so far reported for a gastropod, and is now being sequenced (along with that of S. mansoni).

More recently, Jenner and Wills (2007) made explicit mention of genome size as an important factor in deciding on the next set of models for evo-devo studies. Discoveries regarding the fundamental genetic underpinnings of development have obvious implications for medical science and here, too, genome size is becoming increasingly seen as important. As they put it,

Whole-genome sequences are an increasingly important resource for many biological disciplines, including evo–devo [15, 49, 50]. However, financial and technical constraints mean that there is currently a preference for species with small genomes. This compounds the bias that is already introduced by the big six. First, putatively general conclusions about genome evolution might actually be specific to those smaller genomes that have been fully sequenced. For example, when focusing only on sequenced genomes, a close correspondence between genome size and gene number in eukaryotes is observed. The C-value paradox becomes apparent only when genome-size data from non-sequenced genomes is included [51]. Second, there are important genetic, morphological, physiological and ecological correlates of genome size in a range of animals and plants [51, 52]. Some correlates seem ubiquitous in animals and plants, such as those between genome size and cell size, body size and the inverse of developmental rate [52]. Others are group specific: genome size correlates mostly with metabolic rate in homeotherms, but with developmental type and ecology in amphibians [53], and is positively correlated with egg size in copepods, plethodontid salamanders and fishes [51, 52, 54]. Studying these correlated traits in phylogenetically disparate taxa could illuminate the relationships between small genome size and rapid development, as well as the evolution of strongly cell-lineage-dependent development in taxa such as tunicates and nematodes, and the partial fragmentation of their Hox clusters [55, 56].

References 51, 52, and 53 in that paragraph are papers of mine, so again I am forced to admit that my work may have some practical application after all.

My main focus is on genome size diversity in eukaryotes, which mostly means differences among species in the abundance of noncoding DNA. In bacteria, most of the genome is composed of protein-coding genes, so unlike in eukaryotes there is a very strong correlation between genome size and gene number. Genome size is generally small in parasites and endosymbionts and larger in free-living species (probably because population bottlenecks and relaxed selection on gene function result in gene loss by deletion bias in bacteria associated with hosts [Mira et al. 2001]).

But this observation is not the link between genome size and human health that I had in mind for this post. In this month’s issue of Antimicrobial Agents and Chemotherapy, Steven Projan argues that genome size is associated with the evolution of antibiotic resistance in bacteria. In Dr. Projan’s own words,

It is observed here that the ability of a given bacterium to evolve toward a multidrug resistance phenotype is a function of genome size. In Table 1, a number of examples are provided, but even an expanded analysis shows that this observation holds true. That is, the larger the genome the greater the propensity of a bacterium to display multidrug resistance phenotypes and the smaller the genome the less likely it is that antibacterial resistance will emerge and disseminate within that species. What is proposed here is that, just as there is a continuum of genome sizes among bacteria, there is a continuum in the ability or propensity of a bacterium to become “multidrug resistant” and that continuum is reflected in the size of the genome. This is not to say that we do not observe resistance to certain agents even in organisms with the smallest genomes (macrolide resistance appears in virtually every pathogen at some level). There is probably a solid biological reason for this observation; organisms with larger genomes are more adaptable to environmental changes because they have more (genetic) information to draw upon. It appears that organisms with smaller genomes have become more “specialized,” residing in particular environmental niches (Treponema pallidum and the Chlamydiae are cases in point), and their lack of versatility in adapting to different environments is also manifest in an inability to develop mechanisms for coping with antibiotics. Indeed, we have learned that virtually each and every time a bacterium either acquires a novel resistance determinant or a mutant strain arises with decreased susceptibility to an antibacterial drug, the bacterium experiences a “fitness burden.” With time, compensatory mutations are selected in which the bacterium accumulates mutations that allow for something like wild-type growth in a strain that is now phenotypically resistant (e.g., topA mutations in gyrB mutant strains). 
Bacteria with larger genomes simply have a greater opportunity to develop these compensatory mutations. It must be emphasized that it does not matter whether we are discussing the acquisition of a novel resistance gene as opposed to a mutation that alters the target or results in up-regulation of an efflux pump. The accumulating evidence tells us that all require some form of adaptation. Another consequence of this phenomenon is that antibiotic cycling in health care settings is unlikely to result in a reversion of the local microflora to susceptibility as the compensatory mutations “lock in” the resistance phenotype.

He continues by noting, “I and several of those I have discussed this observation with were perplexed that it had not previously been articulated. Although to be fair, others have suggested it is a trivial, if not nonsensical, observation and worthy only of cocktail party conversation… in fact, I believe that this is an important guide as to where and which organisms we actually need novel antibacterial agents for.” Projan blames an overemphasis on individual organisms with small genomes for the overlooking of this potentially important pattern. In other words, it is the sort of thing that can only be applied to human health research if one takes a broad view of genomic diversity.

As much fun as it is to study genome size for purely academic reasons, it seems it actually may be good for us too.


More interest in genome size.

The buzz on a few blogs today is the pending release of new books. Sort of the academic blogger equivalent to summer blockbusters, I suppose. In any case, it’s great to see that two of the eagerly anticipated items, Darwinian Detectives by Norman Johnson and The Origins of Genome Architecture by Michael Lynch, will both include significant space devoted to the topic of genome size. Not having read either book, it would not be prudent for me to recommend them to anyone (and it is no secret that I have problems with Lynch’s model, which is not the first and probably not the last one-dimensional explanation), but I do suggest that eyes be kept open for their arrival in June.

On another practical note, genome size is no longer considered just an important criterion for choosing genome sequencing targets; it has also been mentioned as directly relevant in the selection of the next wave of evo-devo models. It is also an interesting and important subject of investigation in its own right, of course.

So, while I may not subscribe to some of the explanations for genome size diversity that have been put forth of late, I am very glad to see that this is an active area of discussion that is gaining more attention every day.

__________

References

Evans, J.D. and D. Gundersen-Rindal. 2003. Beenomes to Bombyx: future directions in applied insect genomics. Genome Biology 4: 107.101-107.104.

Gregory, T.R. 2005. Synergy between sequence and size in large-scale genomics. Nature Reviews Genetics 6: 699-708.

Jenner, R.A. and M.A. Wills. 2007. The choice of model organisms in evo-devo. Nature Reviews Genetics 8: 311-319.

Pryer, K.M., H. Schneider, E.A. Zimmer, and J.A. Banks. 2002. Deciding among green plants for whole genome studies. Trends in Plant Sciences 7: 550-554.


Genomics, evolution, and health: comparisons of avian flu genomes.

An article by Steven Salzberg and colleagues is set to appear in the May issue of the journal Emerging Infectious Diseases. In it, the authors describe the results of complete genome sequence comparisons for 36 recent isolates of the avian flu virus (influenza H5N1). Their results “clearly depict the lineages now infecting wild and domestic birds in Europe and Africa and show the relationships among these isolates and other strains affecting both birds and humans”. More specifically,

The isolates fall into 3 distinct lineages, 1 of which contains all known non-Asian isolates. This new Euro-African lineage, which was the cause of several recent (2006) fatal human infections in Egypt and Iraq, has been introduced at least 3 times into the European-African region and has split into 3 distinct, independently evolving sublineages.


Figure 1. Phylogenetic tree of hemagglutinin (HA) segments from 36 avian influenza samples. A 2001 strain (A/duck/Anyang/AVL-1/2001) is used as an outgroup at top. Clade V1 comprises the 5 Vietnamese isolates at the bottom of the tree, and clade V2 comprises the 9 Vietnamese isolates near the top of the tree. The European-Middle Eastern-African (EMA) clade contains the remaining 22 isolates sequenced in this study; the 3 subclades are indicated by red, blue, and purple lines. The reassortant strain, A/chicken/Nigeria/1047–62/2006, is highlighted in red.

This is a study in phylogenetics — that is, it reconstructs evolutionary relationships among viral strains using the same tools that many evolutionary biologists use to study the relationships among species. It is well known that viruses evolve very rapidly, and tracking their past changes contributes to the ability to predict future ones. As the authors conclude,

These findings show how whole-genome analysis of influenza (H5N1) viruses is instrumental to the better understanding of the evolution and epidemiology of this infection, which is now present in the 3 continents that contain most of the world’s population. This and related analyses, facilitated by global initiatives on sharing influenza data, will help us understand the dynamics of infection between wild and domesticated bird populations, which in turn should promote the development of control and prevention strategies.

Evolution is not something that only happened to the myriad fossil specimens housed in museum drawers, and evolutionary biology is not merely relevant to academics tucked away in research labs. Evolution is both an ongoing process and an active and exciting area of research. More than ever, an understanding of the processes involved is relevant to the well-being of people from all regions of the world.


Darwin’s death.

Today, April 19th, is the anniversary of Charles Darwin’s death in 1882. I refer you to an excellent post by PZ Myers on Pharyngula about the details of Darwin’s passing [The Death of Darwin].

Darwin is buried at Westminster Abbey in London, within a few yards of Sir Isaac Newton. There is a bronze bust of Darwin as part of a memorial to several scholars near the grave that was installed by his family in 1888. The grave itself is very understated, a simple marble slab in the floor marking his name and the dates of his birth and death.


There is also a memorial to Darwin in Kent, where Down House is located, in the form of a sundial on the side of the local church.


Charles Robert Darwin, 12 February 1809 – 19 April 1882.


Something to ponder.

For those who fear that acknowledging the historical fact of evolution dooms one to a life of bleak insignificance, consider the following.

You are the product of an absolutely unbroken chain of successful ancestors stretching back nearly 4 billion years. In all that time, over billions of generations, not one member of your lineage ever failed to leave viable offspring who, in their turn, left yet more successful descendants. Not a single earthquake, volcanic eruption, meteorite impact, or glacier ever prevented one of your ancestors from contributing to the subsequent generation. Every one of your forebears prevailed in the face of predators, famines, parasites, diseases, and ill fortune. Whether in competition or cooperation, your antecedents triumphed. An untold number of beings have lived and died on this planet, but never — not a single time — did your line falter.

In this, you are not alone. The same is true of every living being on Earth, to whom you are connected directly through converging lines of common ancestry that date back to the very dawn of life. The world did not know you were coming and the machinations of nature did not have you in mind as an end product, yet here you are. As an individual, you and each of your brethren, cousins, and more distant evolutionary relatives represent an exceedingly, remarkably, staggeringly improbable occurrence — and are all the more wonderful for it.


From "Pangenesis" to "Genome".

The term “genetics” has been used in reference to the branch of science dealing with “the physiology of heredity and variation” since 1905. It was coined by the British biologist William Bateson, first in a 1905 letter (see Bateson 1928), and then publicly the following year (Bateson 1906). It was derived directly from the Greek for “birth” (or “origins”).

Straightforward enough. But what about “gene” and “genome”? These terms are interesting because they illustrate the evolution of both concept and language in science and involve both co-option and hybridization.

First, “gene”. Even after the term “genetics” was in use, it was not entirely clear what practitioners of the science were studying. Indeed, the concept of a fundamental physical and functional unit (or “determiner”) of heredity remained very vague. In 1909, Danish biologist Wilhelm Johannsen sought to pin down a term to describe these genetic elements. Although some people attribute the origin of “gene” to the same etymology as “genetics”, there is more to the story. In actuality, “gene” was derived indirectly from Darwin’s (incorrect) theory of heredity known as “pangenesis”. Indirectly, because it morphed through the term “pangens” coined by the Dutch botanist Hugo de Vries in 1889 in reference to genetic units and as an homage to Darwin, even though his theory of heredity differed markedly from pangenesis (de Vries was a Mendelian).

According to Johannsen (1909, p.143), he came up with the term “gene” by choosing to isolate

the last syllable ‘gene’, which alone is of interest to us, from Darwin’s well known word (Pangenesis) and thereby replace the less desirable ambiguous word ‘determiner’. Consequently, we will speak of ‘the gene’ and ‘the genes’ instead of ‘pangen’ and ‘the pangens’. The word gene is completely free from any hypothesis; it expresses only the evident fact that, in any case, many characteristics of the organism are specified in the germ cells by means of special conditions, foundations, and determiners which are present in unique, separate, and thereby independent ways – in short, precisely what we wish to call genes. [Translation as in Portugal and Cohen 1977].

Johannsen (1909) was also responsible for the terms “genotype” and “phenotype”. As he summarized in 1911,

I have proposed the terms ‘gene’ and ‘genotype’ … to be used in the science of genetics. The ‘gene’ is nothing but a very applicable little word, easily combined with others, and hence it may be useful as an expression for the ‘unit-factors’, ‘elements’ or ‘allelomorphs’ in the gametes, demonstrated by modern Mendelian researches. A ‘genotype’ is the sum total of all the ‘genes’ in a gamete or in a zygote.

So, we have an evolution of the term from “pangenesis” (Darwin) to “pangens” (de Vries) to “genes” (Johannsen), passing through an incorrect theory of heredity to a term “completely free from any hypothesis” about inheritance to Mendelian genetics.

What about “genome”?

According to the Oxford English Dictionary, the term “genom(e)” was coined by the German botanist Hans Winkler in 1920 as a portmanteau of gene and chromosome (the latter term having been coined by Wilhelm Waldeyer in 1888). This story has been repeated by many authors (including yours truly; Gregory 2001), but has been challenged by Lederberg and McCray (2001), who suggest that Winkler probably merged gene with the generalized suffix ‘ome (referring to “the entire collectivity of units”), and not ‘some (“body”) from chromosome. In either case, Winkler’s intent was to “propose the expression Genom for the haploid chromosome set, which, together with the pertinent protoplasm, specifies the material foundations of the species” (translation as in Lederberg and McCray 2001).

Based on this initial formulation, “genome” can accurately be taken to mean either the total gene complement (interchangeably with Johannsen’s “genotype”), or the total DNA amount per haploid chromosome set – but not both, as we now know that these are not correlated with one another. This latter issue remains the subject of active study, and I shall have much more to say about it in future postings.

__________

References

Bateson, W. 1906. A text-book of genetics. Nature 74: 146-147.

Bateson, W. 1928. Letter to Sedgwick, April 18, 1905. In William Bateson, F.R.S.: His Essays and Addresses (ed. B. Bateson), p. 93. Cambridge University Press, Cambridge.

De Vries, H. 1889. Intrazelluläre Pangenesis. Fischer, Jena.

Gregory, T.R. 2001. The bigger the C-value, the larger the cell: genome size and red blood cell size in vertebrates. Blood Cells, Molecules, and Diseases 27: 830-843.

Johannsen, W. 1909. Elemente der Exakten Erblichkeitslehre. Fischer, Jena.

Johannsen, W. 1911. The genotype conception of heredity. American Naturalist 45: 129-159.

Lederberg, J. and A.T. McCray. 2001. ‘Ome sweet ‘omics — a genealogical treasury of words. The Scientist 15: 8.

Portugal, F.H. and J.S. Cohen. 1977. A Century of DNA. MIT Press, Cambridge, MA.

Winkler, H. 1920. Verbreitung und Ursache der Parthenogenesis im Pflanzen- und Tierreiche. Verlag Fischer, Jena.


Chimps are not more evolved than humans or anyone else.

I like New Scientist. I even did a short interview with them about a cool genomics story (“How chemicals can speed up evolution”, 6 May 2006, p.16). But this headline from their news service really annoys me: Chimps ‘more evolved’ than humans.

The short news article starts out with “It is time to stop thinking we are the pinnacle of evolutionary success…”, which of course is true except that it was time to stop thinking this 150 years ago, and then continues with “… chimpanzees are the more highly evolved species, according to new research”.

What they mean is that, according to the recent study, the rate at which mutations were fixed by selection appears to have been higher in the lineage leading to chimpanzees than in the lineage leading to humans since the two split from a common ancestor several million years ago. Which lineage experienced the changes can now be inferred by comparison with the macaque genome, which is less closely related to chimps and humans than the latter two are to each other; without such an external comparison, one cannot say which lineage changed, only that one or both of them did. Most likely, this boils down to differences in long-term historical population sizes in the two lineages (selection is stronger in large populations, genetic drift in small populations).
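
The outgroup logic described above can be sketched in a few lines of Python. This is only a toy illustration and not the method used in the study: the three aligned sequences are invented, the function name `assign_changes` is hypothetical, and the rule applied (treat the macaque base as the ancestral state) is a simple parsimony assumption.

```python
def assign_changes(human, chimp, macaque):
    """Assign each human/chimp difference to a lineage, using macaque as outgroup.

    Toy parsimony rule (an assumption for illustration): at a site where human
    and chimp differ, whichever one disagrees with macaque is taken to have changed.
    """
    if not (len(human) == len(chimp) == len(macaque)):
        raise ValueError("sequences must be aligned to the same length")

    counts = {"human": 0, "chimp": 0, "unresolved": 0}
    for h, c, m in zip(human, chimp, macaque):
        if h == c:
            continue  # human and chimp agree, so there is nothing to assign
        if m == c:
            counts["human"] += 1       # chimp matches the outgroup: human changed
        elif m == h:
            counts["chimp"] += 1       # human matches the outgroup: chimp changed
        else:
            counts["unresolved"] += 1  # all three differ: cannot tell who changed
    return counts


# Invented sequences, for illustration only.
human = "ACGTACGTAC"
chimp = "ACGAAGGTTC"
macaque = "ACGTATGTTG"

print(assign_changes(human, chimp, macaque))
```

Without the third sequence, every one of these differences would simply count as "human and chimp differ here", which is the point made in the text: the outgroup is what allows a change to be placed on one branch or the other.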

Couching this interesting finding in terms of who is “more evolved” than whom is not helpful, even with the scare quotes. As someone who teaches evolution at the upper-year undergraduate level, I can tell you that students come into the class with a lot of preconceptions about evolution, one of them being the notion that some extant species can be ranked as “more evolved” than others. It is subtle misinformation like this, compounded over many years, that makes my job harder by the time they arrive in my course.

Please, please, PLEASE stop appealing to common misconceptions about evolution in news stories, even if the headline will catch the attention of (previously misinformed) readers.



Genome sequences reduce the complexity of bacterial flagella.

I am not interested in engaging in debates with anti-evolutionists, though I am well aware of their key arguments. The big one, of course, is “irreducible complexity” — traits or features that supposedly could not have evolved because there is no conceivable function for their parts individually nor for a subset of their parts collectively. The bacterial flagellum apparently is the ultimate example of this, which explains why this microscopic protein “motor” can drive an entire philosophical argument along these lines.

I think Darwin said it best (as he often did) in 1871: “Ignorance more frequently begets confidence than does knowledge; it is those who know little, and not those who know much, who so positively assert that this or that problem will never be solved by science.”

There is little doubt among biologists that the evolution of bacterial flagella will be worked out, just as a tremendous amount of information is now available about the evolution of eyes (the previous Paleyan example of a supposedly un-evolvable structure).

Last year, Pallen and Matzke (2006) presented a discussion of how bacterial flagella may have evolved, based in large part on comparisons of sequences from the various protein components. Many of the proteins that make up a flagellum have homologues that serve non-flagellar functions, strongly suggesting that they were co-opted from pre-existing proteins during the evolution of flagella. (See Matzke’s detailed model of flagellar evolution here and a video based on it here, and Ken Miller talking about flagella here). Specifically, there is ever-mounting evidence that bacterial flagella and the type III secretory system (TTSS) that pathogenic bacteria use to inject toxins into host cells are descended from the same ancestral structure. The fact that the TTSS lacks many of the proteins in flagella but remains functional (for toxin injection rather than locomotion) clearly indicates that not all the parts need to be present for some function to be carried out by the structure.

Pallen and Matzke (2006) noted that further comparisons of complete genome sequences (hence the post on this blog) would reveal additional insights into the evolution of flagella. Enter Liu and Ochman (2007) from the next issue of PNAS.

Liu and Ochman (2007) examined complete genome sequences from 41 species of bacteria with flagella, and were able to identify a core set of 24 proteins common to all of them, which was present in a very early ancestral bacterium. Not only this, but the core genes appear to be the product of multiple rounds of duplication and diversification, perhaps of one original precursor gene.

The gist of the story is that 1) some genes involved in the construction of flagella in modern bacteria are clearly co-opted from pre-existing genes that were doing something else in the cell (Pallen and Matzke 2006) and 2) a core of about two dozen genes common to all flagellated bacteria (and presumably found in their common ancestor) is the product of duplication and divergence whose reconstructed history agrees very well with the presumed evolutionary relationships among bacteria (Liu and Ochman 2007).

This just goes to show the usefulness of genome data for addressing questions that, for the reason outlined by Darwin, seem unanswerable to some. It also opens the door to some exciting future work.

I asked Howard Ochman what he thought the next key steps will be in this line of study. As he put it, “Naturally we would like to know the function of the structures that were specified by the ancestral set of flagellar genes, and how/why these genes remained functional through their successive duplications. We just completed a companion paper on the bacterial flagellar genes that arose later, and we are now branching out into the other domains of life.”

I will positively assert, out of optimism rather than ignorance, that many more important insights will be forthcoming from these investigations.

(Update: Nick Matzke is very critical of the paper. He also has posted an updated critique that focuses more on the data.)

(Another update: See Carl Zimmer’s post about blogging as scientific debate).

(And yet another update: A complex tail, simply told at ScienceNOW)

_________

References

Aizawa, S.-I. 2001. Bacterial flagella and type III secretion systems. FEMS Microbiology Letters 202: 157-164.

Blocker, A., K. Komoriya, and S.-I. Aizawa. 2003. Type III secretion systems and bacterial flagella: insights into their function from structural similarities. Proceedings of the National Academy of Sciences of the USA 100: 3027-3030.

Gophna, U., E.Z. Ron, and D. Graur. 2003. Bacterial type III secretion systems are ancient and evolved by multiple horizontal-transfer events. Gene 312: 151-163.

Liu, R. and H. Ochman. 2007. Stepwise formation of the bacterial flagellar system. Proceedings of the National Academy of Sciences of the USA 104: 7116-7121.

Matzke, N.J. 2003. Evolution in (Brownian) space: a model for the origin of the bacterial flagellum. Talk.Origins.

Miller, K.R. 2004. The flagellum unspun. In Debating Design: From Darwin to DNA, edited by W. Dembski and M. Ruse. Cambridge University Press, New York, pp. 81-97.

Musgrave, I. 2004. Evolution of the bacterial flagellum. In Why Intelligent Design Fails: A Scientific Critique of the New Creationism, edited by M. Young and T. Edis. Rutgers University Press, New Brunswick, NJ.

Nguyen, L., I.T. Paulsen, J. Tchieu, C.J. Hueck, and M.H. Saier. 2000. Phylogenetic analyses of the constituents of Type III protein secretion systems. Journal of Molecular Microbiology and Biotechnology 2: 125–144.

Pallen, M.J., C.W. Penn, and R.R. Chaudhuri. 2005. Bacterial flagellar diversity in the post-genomic era. Trends in Microbiology 13: 143-149.

Pallen, M.J., S.A. Beatson, and C.M. Bailey. 2005. Bioinformatics, genomics and evolution of non-flagellar type-III secretion systems: a Darwinian perspective. FEMS Microbiology Reviews 29: 201-229.

Pallen, M.J. and N.J. Matzke. 2006. From The Origin of Species to the origin of bacterial flagella. Nature Reviews Microbiology 4: 784-790.


DNA barcoding and taxonomy funding.

This may be old news, but it seems worthwhile responding anyway because hey, I have a blog now.

In 2005, Ebach and Holdrege wrote a letter to Nature in which they repeated the common misconception that DNA barcoding steals funding from taxonomic research. I responded by pointing out that DNA barcoding has not drawn support from the taxonomy pool, but rather so far has competed with medicine and genomics for support, or has brought in funds from sources that have not traditionally supported taxonomy. Apparently John Wilkins of Evolving Thoughts was not buying it.


I feel somewhat qualified to speak on this because I co-authored a Genome Canada grant with Paul Hebert that helped to fund the Canadian Barcode of Life Network. Needless to say, Genome Canada does not normally fund taxonomy. Other sources of support have been the Moore Foundation, the Sloan Foundation, the Canada Foundation for Innovation, NSERC, and various other government and industry sources — none of which has been taken from taxonomists (and in fact, a lot of it goes directly to taxonomists).

The “barcoders versus taxonomists” dichotomy, which is taken as a given by Ebach and Holdrege and other opponents of barcoding, is false. DNA barcoding is a collaborative enterprise, requiring people with different expertise. Existing DNA barcoding networks involve a large number of professional taxonomists, and major participants include museums and other taxonomic institutions. It also is a gross caricature to suggest that those who come at the issue from a molecular perspective are not interested in organismal biology. Paul Hebert, considered the father of DNA barcoding, spent most of his career cataloguing the diversity and phylogeography of aquatic microcrustaceans. This included both morphological and molecular work and even extended to development of traditional taxonomic keys. Many of the rest of us are proponents of barcoding because it allows us to access information about organisms that otherwise is inaccessible to non-taxonomists.

As far as I know, Wilkins is not a biologist, which means that his work is not contingent on being able to obtain species identifications. The situation is very different for biodiversity researchers. I have plenty of genome size estimates that remain unpublished because, try as I might, I cannot get the specimens identified. Even assuming that a suitable taxonomic expert exists, he or she may be backlogged by a year or more. This is why DNA barcoding proponents consider it an enabling technology — a lot more scientists need taxonomy than do taxonomy.

As Wilkins notes,
“In computing terms, a DNA barcode is at best a record ID. The details of the record – the rest of the fields describing the name, address, and personal details of the species – still need to be recorded. Otherwise, all you have are empty records…”

I personally don’t know any DNA barcoders who would disagree with this. DNA barcoding is not the endpoint of biodiversity research, it is an access point. It may help to identify “groups” of genetically similar organisms, but the description of those groups, if warranted, will be done according to traditional taxonomic principles. This would not even be an issue, and DNA barcoding could focus only on its primary objective of identification, if not for the fact that most of life has yet to be described. If anything, barcoding should make these new taxonomic descriptions both easier and more accessible, which is the reason that so many taxonomists are working as part of the DNA barcoding initiative.

DNA barcoders are interested in information about biodiversity, and they want it to be accessible to everyone — and so far, they have worked to accomplish this without siphoning funds from the existing taxonomic pot.

__________

References

Ebach, M.C. and C. Holdrege. 2005. DNA barcoding is no substitute for taxonomy. Nature 434: 697.

Gregory, T.R. 2005. DNA barcoding does not compete with taxonomy. Nature 434: 1067.

Hebert, P.D.N. and T.R. Gregory. 2005. The promise of DNA barcoding for taxonomy. Systematic Biology 54: 852-859.

Schindel, D.E. and S.E. Miller. 2005. DNA barcoding a useful tool for taxonomists. Nature 435: 17.



Whose genome?

The term “genome” is oft-heard but seldom defined, and indeed has more than one meaning. Little wonder, then, that discussions about genome sequences and comparisons thereof can leave otherwise interested audiences more frustrated than enlightened. “What is a genome?” and “whose genome was sequenced?” are legitimate questions, and what follows is an attempt at clarification that is, by necessity, as much philosophical as scientific.

Definition #1: In a broad sense, a genome can be considered as the collective set of genes, non-coding DNA sequences, and all their variants that are located within the chromosomes of members of a given species. This definition does not consider variation among individuals within a species, and instead relates to distinctions between species. It is possible to apply such a definition because, for the most part, animal species do not share DNA extensively and hence their respective gene pools remain distinct (in fact, this forms the basis for defining species under some views). Thus, even though humans and chimpanzees are about 98% identical in terms of their DNA sequences, there is still such a thing as a “human genome” and a “chimpanzee genome” rather than a continuum with humans and chimps at two mildly divergent extremes. This is even true of far closer (but now extinct) relatives of humans such as Neanderthals; on average, the sections of Neanderthal DNA that have been recovered and sequenced are 99.5% identical to those of humans, but these, too, are considered to be part of a separate genome.

The genomic similarities described between species are usually based on comparing a few specific regions of DNA from a small number of representative individuals. If other factors are included in the comparison, such as insertions and deletions of DNA, then any two genomes will register a lower level of similarity — say, 95% for chimpanzees and humans rather than 98%. And indeed, no one would ever mistake a chimpanzee genome for a human genome, in part because they differ in DNA amount and chromosome number (human chromosome 2 is a product of fusion of what remain as two separate chromosomes in other great apes).
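
To see why counting insertions and deletions lowers the figure, here is a minimal sketch in Python. The two aligned sequences are invented toy data and `percent_identity` is a hypothetical helper written for this illustration, not a function from any real alignment package; a `-` marks a position where one sequence has an insertion or deletion relative to the other.

```python
def percent_identity(seq_a, seq_b, count_gaps=False):
    """Percent identity between two pre-aligned sequences.

    With count_gaps=False, columns containing a gap ('-') are skipped,
    so only substitutions lower the score. With count_gaps=True, every
    gap column counts as a difference.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(seq_a, seq_b):
        is_gap = a == "-" or b == "-"
        if is_gap and not count_gaps:
            continue  # ignore indel columns entirely
        compared += 1
        if not is_gap and a == b:
            matches += 1
    return 100.0 * matches / compared


# Toy alignment (made up): one substitution and one 2-base indel.
toy_a = "ACGTACGTAC--GTACGTAC"
toy_b = "ACGTACGTACGGGTACCTAC"

substitutions_only = percent_identity(toy_a, toy_b)
with_indels = percent_identity(toy_a, toy_b, count_gaps=True)
print(f"substitutions only: {substitutions_only:.1f}%")
print(f"indels included:    {with_indels:.1f}%")
```

The same pair of sequences scores lower once the indel columns are counted, which is the effect behind the drop from roughly 98% to roughly 95% in the chimpanzee comparison.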

Of course, individuals within species are not genetically identical to one another (monozygotic twins notwithstanding), which leads to definition #2.

Definition #2: Because the DNA sequences of even close family members are not identical, it can also be said that each individual carries a unique genome consisting of the DNA in his or her chromosomes. In this case, the focus is entirely on one species and the important factor is the variability that exists among individuals of that species. In terms of DNA sequences on a large scale, members of the same species are extremely similar: overall, any two human beings are probably about 99.9% the same genetically. Nevertheless, complete genome sequencing, though conducted primarily under definition #1, has revealed two major sources of variation among individuals. The first consists of single nucleotide polymorphisms (SNPs, “snips”), which are as their name implies: differences at the level of single base pairs that are present in at least 1% of the population. It is estimated that there are some 10 million SNPs in the human genome (definition #1), with one occurring about every 100-300 base pairs along the more than 3 billion base pair sequence. The second major source of variation, first described in 2006, consists of copy number variants (CNVs). These involve differences among individuals in the insertion and deletion of larger DNA segments. CNVs have proved to be far more common than anyone would have imagined, and can result in differences not just in sequences but in the sizes of genomes among individuals (up to 20 million base pairs in humans, or about 0.5%).
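
The SNP figures are easier to keep straight with the arithmetic written out. The following is a rough back-of-the-envelope sketch in Python; the genome length is rounded down to an even 3 billion base pairs, and the function names are my own for illustration.

```python
GENOME_BP = 3_000_000_000  # assumption: "more than 3 billion" rounded to 3e9


def sites_at_spacing(genome_bp, spacing_bp):
    """Number of variable sites if one occurs every `spacing_bp` bases."""
    return genome_bp // spacing_bp


def pairwise_differences(genome_bp, identity):
    """Expected differing positions between two genomes at a given identity."""
    return round(genome_bp * (1 - identity))


# One SNP every 100-300 bp across the species-level genome (definition #1).
snps_dense = sites_at_spacing(GENOME_BP, 100)
snps_sparse = sites_at_spacing(GENOME_BP, 300)

# Two individuals that are 99.9% identical (definition #2).
between_two_people = pairwise_differences(GENOME_BP, 0.999)

print(f"SNP sites at 1 per 100 bp: {snps_dense:,}")
print(f"SNP sites at 1 per 300 bp: {snps_sparse:,}")
print(f"Differences between two people at 99.9% identity: {between_two_people:,}")
```

Note the distinction between the two quantities: a spacing of 100-300 base pairs implies tens of millions of known variable sites across the whole species, whereas 99.9% identity implies that any two particular people differ at only a few million of them.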

The human genome, definition #1

Two independent research groups reported draft sequences of the human genome in February 2001: the publicly funded and internationally collaborative Human Genome Project and the private company headed by J. Craig Venter known as Celera Genomics. The interaction of these two initiatives – typically branded as competitive, but also mutually informative – has been discussed many times. The question of how and why the two groups sequenced human DNA is not the subject of interest here – the question at present is whose DNA they analyzed.

The Human Genome Project, being a public effort, had an official policy of releasing all sequence data to public databases within 24 hours of completion, thereby making the information freely available to anyone who carried a copy of the “human genome” in their cells. In keeping with this outlook, the HGP implemented procedures intended to circumvent the focus on individuality and to keep their results in line with definition #1. Thus, they instituted a policy of voluntary donation by dozens of men and women from various ethnic backgrounds, labeled the samples with random numbers, shipped the samples to processing laboratories where they were re-labeled with new randomized codes, destroyed all records of previous labels, and then selected randomly from among the samples. Five to ten samples were collected for every one that was actually assayed, with the source of the samples used unknown to both researchers and donors. In other words, the intent was to focus on definition #1 as much as possible and to provide a mixture, or at least a mystery, when it came to the source of the genome sequence of Homo sapiens.
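
The logic of that procedure (label, re-label, discard the old labels, then pick at random) can be sketched in a few lines of Python. This is only an illustration of the steps as described above, not the HGP's actual protocol; the function names, label ranges, and donor names are all invented.

```python
import random


def collect(donors, rng):
    """Step 1: give each donated sample a random numeric label.
    The donor's name is deliberately not stored with the sample."""
    labels = rng.sample(range(100_000, 1_000_000), len(donors))
    return [{"label": label, "material": object()} for label in labels]


def relabel(samples, rng):
    """Step 2: the processing lab assigns fresh random codes.
    The old labels are dropped, so no old-to-new mapping survives."""
    new_codes = rng.sample(range(100_000, 1_000_000), len(samples))
    relabeled = [
        {"label": code, "material": s["material"]}
        for code, s in zip(new_codes, samples)
    ]
    rng.shuffle(relabeled)  # order no longer hints at the original labels
    return relabeled


def select(samples, n_selected, rng):
    """Step 3: choose the samples to assay at random."""
    return rng.sample(samples, n_selected)


rng = random.Random()  # unseeded: the outcome is not reproducible by design
donors = [f"donor-{i}" for i in range(10)]  # hypothetical volunteers
chosen = select(relabel(collect(donors, rng), rng), 2, rng)
print([s["label"] for s in chosen])
```

With ten volunteers and two samples assayed, neither a donor nor a researcher looking at the final codes could say whose DNA ended up in the sequence.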

Human nature being what it is, it is likely that most people would find this answer disappointing. Deep down, we want to know whose genome it was. The only information that has been available in this regard is that the largest portion of the source DNA came from a male donor in Buffalo, New York, code named “RPCI-11” (for Roswell Park Cancer Institute, where the genomic library was generated). No name, no other information, and yet somehow it seems satisfying to know that there really is an individual human – a real person in a specific part of the world who walked into a lab, stuck out his arm, and donated his blood – corresponding to all those A’s, T’s, G’s, and C’s.

The situation at Celera was quite different in terms of both data sharing and DNA sampling policies. Celera’s data were not made publicly available during the course of sequencing, and their sampling involved 20 donors, five of whom were selected for analysis, though evidently not entirely at random. In fact, it was later revealed that Celera’s president and lead investigator, J. Craig Venter, was the primary source of DNA for the sequence. Venter argued that revealing this fact would dispel the myth of a single “human genome” (i.e., an excessive emphasis on definition #1 that ignores the individual uniqueness inherent in definition #2). Others may have felt that sequencing his own genome made the resulting sequence the property of one individual rather than of humanity at large (i.e., adopting definition #2 exclusively at the expense of the broadly shared definition #1).

The human genome, definition #2

While the Human Genome Project and Celera’s efforts generated single (partially composite) genome sequences, another major initiative is underway which focuses on variation among individuals at the genomic level; i.e., on definition #2. The International HapMap Project aims to identify associated collections of SNPs known as haplotypes, and currently includes samples from 270 people drawn from four major groups. Thirty sets of “trios” (two parents and a child) have come from the Yoruba people of Ibadan, Nigeria. Forty-five unrelated individuals from Tokyo and 45 from Beijing have provided samples. Thirty trios from residents of the United States with roots in western and northern Europe have also been included. SNP haplotypes may vary among populations and are important in the search for particular genes of medical significance. As with the DNA sequence information of the Human Genome Project, data from the HapMap Project are made freely available. A similar initiative to catalogue human diversity from the perspective of CNVs, the Copy Number Variation Project, has also been launched.

“Whose genome?” and individual identity

The question “whose genome was sequenced?” is predicated on concepts of individuality and personhood, which tend to be applied to the members of only a handful of species. Thus, one might be interested in which strain of fruit fly (Drosophila melanogaster), which population of sea urchins (Strongylocentrotus purpuratus), or which varieties of rice (Oryza sativa) had been sequenced, but it would not make sense to ask “who” the fly, sea urchin, or rice plant was. The situation gains complexity when dealing with vertebrate genomes because humans associate closely and emotionally with members of some species and not with others. The desire (or not) to know “who” was sequenced correlates directly with this. By way of example, consider the fact that a single male pufferfish (Takifugu rubripes), a single female chicken (red jungle fowl, Gallus gallus), two female brown rats (Rattus norvegicus), and a small number of female mice (Mus musculus) of the B6 strain have been sequenced, but that there has not been much interest in “who” these individuals were – nor would many people even think to ask the question.

Now consider “man’s best friend”. Not only is it known that the two reported canine genome sequences were from individual dogs, but it is known who those dogs were: Craig Venter’s poodle, Shadow, and a boxer named Tasha. It was also widely noted that samples for the chimpanzee genome were taken from a captive-born male named Clint who lived at the Yerkes National Primate Research Center in Atlanta, Georgia. Indeed, many a news story reported Clint’s untimely death in 2005 at the young age of 24. One might be tempted to argue that intelligence is the determining factor in this case – dogs and chimps are smart and have personalities, but pufferfish and rats do not. Perhaps. But surely the recently sequenced rhesus macaque (whatever her name was) should qualify under these criteria.

In the end, this post is not meant to be a statement about the apparent arbitrariness of our decisions to grant or deny individuality to members of other species. This is about genomes, and about how definition #1 is applied intuitively and automatically when dealing with a species like mouse or rat, whereas one cannot help but invoke definition #2 when dealing with a dog or human. The fact is that all of these species are composed of variable individuals, each with a unique genome under definition #2. Indeed, it is this variation that makes evolutionary divergence – and thus definition #1 – possible at all.