Assembling, Visualizing, and Analyzing the Tree of Life

Apply for Tree-for-all: a hackathon to access OpenTree resources

Full call for participation and link to application: http://bit.ly/1ioPPMc

A global “tree of life” will transform biological research in a broad range of disciplines from ecology to bioengineering. To help facilitate that transformation, the OpenTree <http://opentreeoflife.org> project [1] now provides online access to >4000 published phylogenies, and a newly generated tree covering more than 2.5 million species.

The next step is to build tools to enable the community to use these resources. To meet this aim, OpenTree <http://www.opentreeoflife.org/>, Arbor <http://www.arborworkflows.com/> [2] and NESCent’s HIP <http://www.evoio.org/wiki/HIP> working groups [3] are staging a week-long hackathon September 15 to 19 at U. Michigan, Ann Arbor. Participants in this “Tree-for-all” will work in small teams to develop tools that use OpenTree’s web services to extract, annotate, or add data in ways useful to the community. Teams also may focus on testing, expanding and documenting the web services.

How could a global phylogeny be useful in your research or teaching?  What other data from OpenTree would be valuable?  How could OpenTree web services be integrated into familiar workflows and analysis tools?   How could we add to the database of published trees, or enrich it with annotations?

If you can imagine using these resources, and you have the skills to work collaboratively to turn those ideas into products (as a coder, or working side-by-side with coders), we invite you to apply for the hackathon.  The full call for participation (http://bit.ly/1ioPPMc) provides instructions for how to apply, and how to share your ideas with potential teammates (strongly encouraged prior to applying).  Applications are due July 8th. Travel support is provided.  Women and underrepresented minorities are especially encouraged to apply.

If you have questions, contact Karen Cranston (karen.cranston@nescent.org, @kcranstn, OpenTree), Arlin Stoltzfus (arlin@umd.edu, HIP), Julie Allen (juliema@illinois.edu, HIP), or Luke Harmon (lukeh@uidaho.edu, Arbor).

[1] http://www.opentreeoflife.org

[2] http://www.arborworkflows.com/

[3] http://www.evoio.org/wiki/HIP (Hackathons, Interoperability, Phylogenies)

PhyloCode names are not useful for phylogenetic synthesis

Ok, the title is intentionally a bit provocative, but bear with me.

A primary aim of the Open Tree project is to synthesize increasingly comprehensive estimates of phylogeny from “source trees” — published phylogenies constructed to resolve relationships in disparate parts of the tree of life. The general idea is to combine these localized efforts into a unified whole, using clever bioinformatic algorithms.

In this context, a basic operational question is: how do we know if a clade in one source tree is the same as a clade in another source tree? This can be difficult to answer, because source trees are typically constructed from carefully selected samples of individual organisms and their characters (usually DNA sequences). If two source trees are inferred from completely non-overlapping samples of individual organisms, as is commonly the case, is it possible for them to have clades in common, or rather, is it possible for us to determine whether they have clades in common?

I would argue that the answer is yes, with a very important condition: that the organisms sampled for each tree are placed into a common taxonomic hierarchy that embodies a working hypothesis of named clades in the tree of life.

Note an important distinction here: a clade in a source tree depicts common ancestry of selected individual organisms, while a clade in the tree of life is a conceptual group defined by common ancestry that effectively divides all organisms, living and dead, into members and non-members. So a taxon in this sense is a name that refers to a particular tree-of-life clade whose membership is formalized by its position in the comprehensive taxonomic hierarchy.

By placing sampled organisms into a common taxonomic hierarchy, one can compute the relationships between source-tree clades and tree-of-life clades in terms of taxa, a process that I refer to as “taxonomic normalization.”

An idea that emerges from this line of thinking is that the central paradigm of systematics is (or should be) the reciprocal illumination of phylogeny and taxonomy. That is, phylogenetic research tests and refines taxonomic concepts, and those taxonomic concepts in turn guide the selection of individual organisms for future research. I would argue that this, in a nutshell, is “phylogenetic synthesis.”

Which brings me to the title of this post. In the PhyloCode, taxonomic names are not hypothetical concepts that can be refuted or refined by data-driven tests. Instead, they are definitions involving specifiers (designated specimens) that are simply applied to source trees that include those specifiers. This is problematic for synthesis because if two source trees differ in topology, and/or they fail to include the appropriate specifiers, it may be impossible to answer the basic question I began with: do the trees share any clades (taxa) in common? If taxa are functions of phylogenetic topology, then there can be no taxonomic basis for meaningfully comparing source trees that either differ in topology, or do not permit the application of taxon definitions.

So phylogenetic synthesis requires taxa that are explicitly not functions of phylogenetic topology. Instead, taxa should exist independently as hypotheses to be tested by phylogenetic evidence, and as systematists we should strive to construct comprehensive taxonomic hierarchies. I think this is going to be the real key to making progress in answering the question, “what do we know about the tree of life, and how do we know it?”

Data sharing, OpenTree and GoLife

NSF has released GoLife, the new solicitation that replaces both AToL and AVAToL.  From the GoLife text:

The goals of the Genealogy of Life (GoLife) program are to resolve the phylogenetic history of life and to integrate this genealogical architecture with underlying organismal data.

Data completeness, open data and data integration are key components of these proposals – inferring well-sampled trees that are linked with other types of data (molecular, morphological, ecological, spatial, etc) and made easily available to scientific and non-scientific users. The solicitation requires that trees published by GoLife projects are published in a way that allows them to be understood and re-used by Open Tree of Life and other projects:

Integration and standardization of data consistent with three AVAToL projects: Open Tree of Life (www.opentreeoflife.org), ARBOR (www.arborworkflows.com), and Next Generation Phenomics (www.avatol.org/ngp) is required. Other data should be made available through broadly accessible community efforts (i.e., specimen data through iDigBio, occurrence data through BISON, etc).

What does it mean to publish data consistent with Open Tree of Life? We have a short page on data sharing with OpenTree, and a publication in PLOS Currents Tree of Life with best practices for sharing phylogenetic data. Our phylogeny curation application allows you to upload and annotate phylogenies consistent with OpenTree synthesis, and you can quickly import trees from TreeBASE.

If you have questions about a GoLife proposal (or any other data sharing / integration issue), feel free to ask on our mailing list or contact Karen Cranston directly.

Which came first? A pivotal position in the plant tree of life

Amborella trichopoda

The question of which extant angiosperm (flowering plant) lineage “came first” (i.e., is basal in the flowering plant tree of life) has long puzzled biologists. This question is fascinating and important in its own right, but the answer also has potentially profound ramifications for understanding plant gene and genome evolution (which, for example, has implications for crop improvement). Such information is also important for understanding habit and habitat evolution and for the inference of ancestral character states in the angiosperms (e.g., the ancestral flower as well as the ancestral angiosperm genome). Although great 20th century plant taxonomists such as Arthur Cronquist, Armen Takhtajan, and Robert Thorne generally agreed that taxa from the subclass Magnoliidae comprised the “basal” angiosperm lineage, there was no way to “prove”, one way or another, which extant angiosperm lineage came first until the advent of molecular systematics towards the end of the 20th century.

With the aid of modern molecular phylogenetic techniques it is now known that the major groups they recognized, such as Magnoliidae sensu Cronquist and Takhtajan, are typically polyphyletic. Most research now indicates instead that Amborellaceae, Nymphaeales (water lilies), and Austrobaileyales are the earliest branching extant angiosperm lineages. However, the relative branching order of these three lineages, particularly with regard to Amborella trichopoda (the sole species within Amborellaceae) and Nymphaeales, was, until recently, somewhat contentious.

While most molecular analyses during the past 20 years have recovered Amborella as the earliest-diverging angiosperm lineage, some studies have suggested a clade comprising Amborella + Nymphaeales, or even Nymphaeales alone, as the root of all angiosperms. Recently, at the University of Florida, Soltis lab postdoc Bryan Drew and colleagues (including AVAToL team member Stephen Smith at the University of Michigan) endeavored to definitively answer the longstanding question of which angiosperm came first, that is, which living angiosperm is sister to all other living angiosperms in the angiosperm tree of life. Using a plastid data set consisting of 236 taxa, 78 genes, and ~58,000 nucleotides, Drew et al. performed a myriad of analyses with the express purpose of discerning the first-diverging angiosperm lineage; this study was just accepted by Systematic Biology and will be viewable online in the coming months. Their results: virtually every analysis conducted found Amborella as the earliest-diverging living angiosperm lineage with high internal support, and every plastid analysis performed using their original datasets recovered a topology in which Amborella alone is sister to all other living angiosperms.

These findings lend strong affirmation to the Amborella sister hypothesis, and should help guide future research regarding angiosperm character (including genomic features) and habitat evolution. Although the “first” angiosperms are long extinct, a better understanding of Amborella will aid in our understanding of angiosperm evolution as a whole. This was the impetus behind the Amborella Genome Project. As a result of this ongoing project, the Amborella nuclear genome has recently been fully sequenced (www.amborella.org; Amborella Genome Project, Science, in press), and this major achievement should lead to unprecedented insights within flowering plants.

 

Doug Soltis is a distinguished professor at the University of Florida.

Bryan Drew is a post-doctoral researcher in the Soltis lab at the University of Florida.

How computer scientists are using map distance to determine phylogeny

What is distance?

Distance is a way to measure the relatedness of two things. It is phrased in terms of similarity or difference relative to a feature. Different features expose different information about how the things are related. For instance, if we compare two cities, we might compute their geographical distance or how far apart they are in terms of miles or kilometers. But, if we are making a car trip, we may want to compute a different distance. Roads rarely directly connect two points, so we may care more about the driving distance or driving time. On the other hand, if we’re looking for somewhere warm to spend the winter, we may care most about the difference between the temperatures of two cities.

Distance is a requirement for comparison. It is fundamental to the assessment of data required by scientific pursuits as well as the value judgments made in our daily lives. Thus, distance is a cornerstone of the human experience.

What does distance tell us about trees?

Consider four phylogenies of the genus Panthera, or big cats, shown below. The trees are from actual phylogenetic analyses performed by different researchers over the years. The fourth tree is the current best estimate of the big cat phylogeny by Davis, Li, and Murphy. (For further details, see their 2010 paper “Supermatrix and species tree methods resolve phylogenetic relationships within the big cats, Panthera (Carnivora: Felidae)” in Molecular Phylogenetics and Evolution.)

There are different trees because researchers use different combinations of phylogenetic reconstruction methods and phylogenetic data. Typically, these discrepancies are resolved by a consensus tree, in which a relationship is included if it appears in either most of the trees (majority consensus) or all of the trees (strict consensus). For our example, the majority consensus tree only retains one relationship as shown below. Most of the information from the trees is lost, which is one disadvantage of summarizing a set of trees with a single consensus tree.

In our example, the consensus shows that there is not much in common among the four trees. But if we look at distance, we can gain more information. For example, which of the trees are most closely related? In phylogenetics, distance is generally defined in terms of bipartitions. Removing an edge separates the tree into two partitions, and the resulting split of the taxa is a bipartition. Assume that C, S, T, J, L, and N represent Clouded Leopard, Snow Leopard, Tiger, Jaguar, Leopard, and Lion, respectively. For tree 1, the bipartitions are C|STJLN, CS|TJLN, CST|JLN, CSTJ|LN, and CSTJL|N. Bipartition C|STJLN means there is an edge that when removed has one partition containing Clouded Leopard and the other partition containing Snow Leopard, Tiger, Jaguar, Leopard, and Lion. We can compute the Robinson-Foulds (RF) distance between two trees Ti and Tj by counting the number of bipartitions in Ti but not in Tj and adding that to the number of bipartitions in Tj but not in Ti. The RF distance is then this sum divided by 2. Based on the RF distance matrix of our big cat trees shown below, Trees 1 and 4 as well as Trees 3 and 4 are the closest trees since they have the smallest RF distance of 1.

[Figure: RF distance matrix for the four big cat trees]
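
The RF calculation described above can be sketched in a few lines. This is a minimal illustration, assuming trees are written as nested tuples of one-letter taxon labels; it is not the HashRF or MrsRF implementation. Here tree_1 follows the bipartitions listed above for tree 1, while tree_x is a hypothetical variant invented for the example. Bipartitions that isolate a single taxon are skipped, since every tree on the same taxa shares them and they cannot change the distance.

```python
def bipartitions(tree, taxa):
    """Collect the non-trivial bipartitions of a tree given as nested tuples."""
    ref = min(taxa)  # reference taxon, so that A|B and B|A are stored identically
    splits = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        # Keep only splits with at least two taxa on each side
        if 1 < len(below) < len(taxa) - 1:
            splits.add(below if ref not in below else taxa - below)
        return below

    leaves(tree)
    return splits


def rf_distance(tree_i, tree_j, taxa):
    """Bipartitions in one tree but not the other, summed and divided by 2."""
    b_i = bipartitions(tree_i, taxa)
    b_j = bipartitions(tree_j, taxa)
    return (len(b_i - b_j) + len(b_j - b_i)) / 2


taxa = frozenset("CSTJLN")
tree_1 = ("C", ("S", ("T", ("J", ("L", "N")))))  # matches the bipartitions in the text
tree_x = ("C", ("S", ("T", ("L", ("J", "N")))))  # hypothetical: Jaguar and Leopard swapped

distance = rf_distance(tree_1, tree_x, taxa)
print(distance)
```

The two trees differ in a single bipartition each (LN versus JN on one side), so the sum of the two differences is 2 and the RF distance is 1.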

What tools exist for computing tree distances?

One of the main focuses in our lab is designing high-performance algorithms for comparing trees. For computing RF distances between thousands of trees, we have designed the algorithms HashRF and MrsRF. Besides bipartitions, quartets are also used for describing the relationships in a tree. Whereas a bipartition shows the relationship between all of the taxa in a tree, a quartet is based on 4 taxa. As with bipartitions, we can use quartets to compare trees. To compute the quartet distance quickly, we have designed the Quick Quartet algorithm. Finally, an interesting consequence of tree distance is that we can use it to compress collections of trees. If trees have much in common, they can be stored in a smaller representation. Our TreeZip algorithm is a first step in the direction of compressing phylogenetic trees.
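
To make the quartet idea concrete, here is a brute-force sketch, not the Quick Quartet algorithm. It assumes each tree is supplied as a list of bipartitions, with each bipartition written as the set of taxa on one side, and it takes the quartet distance to be the number of four-taxon subsets on which two trees disagree. The two example trees are hypothetical and the function names are our own.

```python
from itertools import combinations


def quartet_topology(splits, quartet):
    """Return how a tree pairs up four taxa, or None if it leaves them unresolved."""
    a, b, c, d = quartet
    for pair, rest in (((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))):
        for side in splits:
            pair_in = [t in side for t in pair]
            rest_in = [t in side for t in rest]
            # The split must put one pair entirely on each side
            if (all(pair_in) and not any(rest_in)) or (all(rest_in) and not any(pair_in)):
                return frozenset([frozenset(pair), frozenset(rest)])
    return None


def quartet_distance(splits_i, splits_j, taxa):
    """Count the four-taxon subsets that the two trees resolve differently."""
    return sum(
        quartet_topology(splits_i, q) != quartet_topology(splits_j, q)
        for q in combinations(sorted(taxa), 4)
    )


taxa = "CSTJLN"
# Hypothetical trees that differ only in how J, L, and N are arranged
splits_a = [frozenset("LN"), frozenset("JLN"), frozenset("TJLN")]
splits_b = [frozenset("JN"), frozenset("JLN"), frozenset("TJLN")]

q_distance = quartet_distance(splits_a, splits_b, taxa)
print(q_distance)
```

Checking every four-taxon subset grows very quickly with the number of taxa, which is why specialized algorithms are needed for large trees.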

How can distance measures help us build the Tree of Life?

Distance measures are essential in the synthesis of new trees into the ToL. If for a particular set of taxa the distances are large, this could mean there is significant disagreement on the relationships in that part of the ToL. On the other hand, if the trees are close in terms of distance, there is evidence for substantial agreement within the trees. For trees being added to the ToL, distances can help guide the integration of the new trees. Large distances may require significant manual curation to integrate the trees, whereas small distances indicate substantial agreement with the existing ToL and allow the curator to focus on a smaller set of trees.

Tiffani Williams is an assistant professor in the department of computer science at Texas A&M University.

Ralph Crosby is a graduate teaching assistant at Texas A&M University.

Grant Brammer is a graduate teaching assistant at Texas A&M University.

Mapping the Tree of Life: the ARBOR Project

Open Tree of Life met with ARBOR, a project funded by the National Science Foundation, to talk about recent developments involving the synthetic tree of life. We spoke with Dr. Luke Harmon, an associate professor in the University of Idaho’s Department of Biology. Dr. Harmon has been using comparative biology to determine what the tree of life can tell us about evolution over long time scales.

What is ARBOR working on right now?

Comparative biology is at the heart of the ARBOR project. Using the evolutionary relationships among species, we can learn something about trait evolution and the formation of new species. For example, there really is no basic ‘ladder of life’ stemming from simpler organisms to more complex; instead, evolution varies among groups and through time in complex and interesting ways. It’s hard to do what we do with traditional tools. Instead, we have to use new tools to analyze how species have diversified to generate the tree of life.

How have phylogeny studies changed over time?

A lot of progress has been made in the last twenty years regarding our understanding of the relationships among different species. We now know a lot more about how species are related to one another and how they evolved from their common ancestors. The Open Tree of Life is the best possible example of this sort of synthesis – it’s almost like the human genome project in that it is generating a very good map that will connect all organisms on earth in a single phylogenetic tree. One problem, though, is that there is just so much information contained in large phylogenetic trees, and we don’t always know how to extract information about how organisms evolve. ARBOR is developing tools to read the stories of evolution from these phylogenies.

Taxonomy and the tree of life

What’s in a name?

It is now widely accepted that taxonomy should reflect phylogeny — that the names we use in biological classifications should refer to branches on the tree of life. This was one of Darwin’s most revolutionary ideas, that common ancestry is the fundamental organizing principle for natural classification:

“… community of descent is the hidden bond which naturalists have been unconsciously seeking.”

Charles Darwin, On the Origin of Species

One of the main goals of the Open Tree of Life project is to facilitate phylogenetic “synthesis”. What does this mean? The general idea is to take disparate pieces of information — in this case, phylogenetic trees from the scientific literature, or the data sets on which they are based — and merge them together in ways that yield more comprehensive and (hopefully) more accurate inferences of the tree of life as a whole. Like a jigsaw puzzle, the assembled pieces reveal the big picture.

Taxonomy is central to this exercise, because names are the primary link between the products of phylogenetic research. Without taxonomy, a phylogenetic tree from a typical study would simply depict relationships among individual organisms. This would not, in general, be very useful. Imagine if someone told you: “I know of a red house and a blue house, and the road between them runs north-south for about 100 miles.” Without any additional information, this statement has little if any value. For it to make sense, you would ideally want to know the address of each house, and the name of the road connecting them; but even incomplete information (what cities and states are the houses in?) is better than nothing. Only then could you figure out that the route in question is, for example, Interstate 94 between Chicago, IL and Milwaukee, WI.

Similarly, the organisms used in a particular phylogenetic study must be taxonomically classified in order to establish, like pins on a map, how the branches of the inferred tree represent “real” branches in the tree of life. This allows common relationships across studies to be discovered. To continue the analogy, if you know of a yellow house in Chicago and a green house in Milwaukee, you also know that I-94 connects them just as it does the red and blue houses mentioned above. The phylogenetic tree relating a rose, pumpkin, and oak depicts the same relationships — that is, it traces essentially the same evolutionary history — as the tree relating an apple, cucumber, and walnut. In each case, different organisms were chosen to represent the angiosperm orders Rosales, Cucurbitales, and Fagales, respectively.
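
The rose/apple example can be sketched as a small relabeling exercise. In the illustration below, the order assignments come from the paragraph above, while the branching pattern of the two trees is an assumption made purely for the example (the text does not state one), and the function and variable names are our own.

```python
# Order assignments taken from the examples in the text
ORDER_OF = {
    "rose": "Rosales", "apple": "Rosales",
    "pumpkin": "Cucurbitales", "cucumber": "Cucurbitales",
    "oak": "Fagales", "walnut": "Fagales",
}


def normalize(tree, taxon_of):
    """Replace each tip with its taxon and ignore the order of branches."""
    if isinstance(tree, str):
        return taxon_of[tree]
    return frozenset(normalize(child, taxon_of) for child in tree)


# Assumed topology, for illustration only
tree_a = ("rose", ("pumpkin", "oak"))
tree_b = (("walnut", "cucumber"), "apple")

same = normalize(tree_a, ORDER_OF) == normalize(tree_b, ORDER_OF)
print(same)
```

Once the tips are replaced by the orders they represent, the two trees become identical even though they share no organisms. A fuller treatment would also need to handle trees in which several tips fall in the same taxon, which this sketch does not attempt.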

You might recognize something paradoxical here. I started off by stating that taxonomy should reflect phylogeny. But then, I proceeded to describe how taxonomy is needed to interpret the results of phylogenetic studies. If taxonomy reflects knowledge of phylogeny, and knowledge of phylogeny is derived from studies of organisms chosen for the taxa they represent, isn’t this a chicken-and-egg problem?

The short answer: yes, it is. Systematic biology is a science of reciprocal illumination between, on one hand, what we discover about the tree of life, and on the other, how we reflect and communicate that knowledge through taxonomy. One can view a taxonomic hierarchy — the arrangement of species within genera, genera within families, and so on — as a working hypothesis, subject to revision. Taxonomic names refer to branches on the tree of life that we believe to exist, but we are open to new information that may change our view. For example, we might discover that members of two genera, hypothesized to be exclusive groups based on their morphological differences, are in fact co-mingled on the same branch of the tree of life when DNA evidence is studied. The question then arises: what happens to the names of the original genera? How should we refer to their common branch? These are issues of nomenclature, a topic beyond the scope of this blog post, but the bottom line is that eventually, taxonomy should be updated to reflect this new knowledge.

The tension between taxonomy and phylogeny is at the heart of the basic question, “what do we know about the tree of life, and how do we know it?” While this question is somewhat metaphysical, it also has very practical implications of immediate concern to the Open Tree project. Most importantly, it has been necessary for us to cobble together a comprehensive taxonomic hierarchy that includes all of life, since none existed previously that were reasonably up-to-date. This “Open Tree Taxonomy” serves a critical purpose — basically, it is what allows us to wrangle herds of phylogenetic trees into a common bioinformatic corral. The challenge we face moving forward is how our synthesis efforts can be leveraged to improve and refine our working taxonomy, closing the loop of reciprocal illumination that is central to the discipline of systematics.

Richard Ree is a curator at the Field Museum of Natural History and a faculty member of the Committee on Evolutionary Biology at the University of Chicago.
