Which came first? A pivotal position in the plant tree of life

Amborella trichopoda

The question of which extant angiosperm (flowering plant) lineage “came first” (i.e., is basal in the flowering plant tree of life) has long puzzled biologists. This question is fascinating and important in its own right, but the answer also has potentially profound ramifications for our understanding of plant gene and genome evolution (which, for example, has implications for crop improvement). Such information is also important for understanding habit and habitat evolution and for the inference of ancestral character states in the angiosperms (e.g., the ancestral flower as well as the ancestral angiosperm genome). Although great 20th century plant taxonomists such as Arthur Cronquist, Armen Takhtajan, and Robert Thorne generally agreed that taxa from the subclass Magnoliidae comprised the “basal” angiosperm lineage, there was no way to “prove”, one way or another, which extant angiosperm lineage came first until the advent of molecular systematics towards the end of the 20th century.

With the aid of modern molecular phylogenetic techniques it is now known that the major groups they recognized, such as Magnoliidae sensu Cronquist and Takhtajan, are typically polyphyletic. Most research now indicates instead that Amborellaceae, Nymphaeales (water lilies), and Austrobaileyales are the earliest-branching extant angiosperm lineages. However, the relative branching order of these three lineages, particularly with regard to Amborella trichopoda (the sole species within Amborellaceae) and Nymphaeales, was, until recently, somewhat contentious.

While most molecular analyses during the past 20 years have recovered Amborella as the earliest-diverging angiosperm lineage, some studies have suggested a clade comprising Amborella + Nymphaeales, or even Nymphaeales alone, as the root of all angiosperms. Recently, at the University of Florida, Soltis lab postdoc Bryan Drew and colleagues (including AVAToL team member Stephen Smith at the University of Michigan) endeavored to definitively answer the longstanding question of which angiosperm came first; that is, which living angiosperm is sister to all other living angiosperms in the angiosperm tree of life. Using a plastid data set consisting of 236 taxa, 78 genes, and ~58,000 nucleotides, Drew et al. performed a wide range of analyses with the express purpose of discerning the first-diverging angiosperm lineage; this study was just accepted by Systematic Biology and will be viewable online in the coming months. Their results: virtually every analysis conducted found Amborella as the earliest-diverging living angiosperm lineage with high internal support, and every plastid analysis performed using their original datasets recovered a topology in which Amborella alone is sister to all other living angiosperms.

These findings lend strong support to the Amborella sister hypothesis, and should help guide future research regarding angiosperm character (including genomic features) and habitat evolution. Although the “first” angiosperms are long extinct, a better understanding of Amborella will aid in our understanding of angiosperm evolution as a whole. This was the impetus behind the Amborella Genome Project. As a result of this ongoing project, the Amborella nuclear genome has recently been fully sequenced (www.amborella.org; Amborella Genome Project, Science, in press), and this major achievement should lead to unprecedented insights within flowering plants.

 

Doug Soltis is a distinguished professor at the University of Florida.

Bryan Drew is a post-doctoral researcher in the Soltis lab at the University of Florida.

How computer scientists are using map distance to determine phylogeny

What is distance?

Distance is a way to measure the relatedness of two things. It is phrased in terms of similarity or difference relative to a feature. Different features expose different information about how the things are related. For instance, if we compare two cities, we might compute their geographical distance or how far apart they are in terms of miles or kilometers. But, if we are making a car trip, we may want to compute a different distance. Roads rarely directly connect two points, so we may care more about the driving distance or driving time. On the other hand, if we’re looking for somewhere warm to spend the winter, we may care most about the difference between the temperatures of two cities.

Distance is a requirement for comparison. It is fundamental to the assessment of data required by scientific pursuits as well as to the value judgments made in our daily lives. Thus, distance is a cornerstone of the human experience.

What does distance tell us about trees?

Consider four phylogenies over the genus Panthera, or big cats, shown below. Here, the trees are from actual phylogenetic analyses performed by different researchers over the years. The fourth tree is the current best estimate of the big cats by Davis, Li, and Murphy. (For further details, see their 2010 paper “Supermatrix and species tree methods resolve phylogenetic relationships within the big cats, Panthera (Carnivora: Felidae)” in Molecular Phylogenetics and Evolution.)

There are different trees because researchers use different combinations of phylogenetic reconstruction methods and phylogenetic data. Typically, these discrepancies are resolved by a consensus tree, where relationships are included in the consensus tree if they appear in either most of the trees (majority consensus) or all of the trees (strict consensus). For our example, the majority consensus tree only retains one relationship as shown below. Most of the information from the trees is lost, which is one disadvantage of summarizing a set of trees with a single consensus tree.
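The two consensus rules can be sketched in a few lines of Python. This is only an illustration: each tree is represented as a set of relationship strings, the four example trees are hypothetical (they are not the published big cat trees), and the function names are our own.

```python
from collections import Counter

def relationship_counts(trees):
    """Count in how many trees each relationship (a string here) appears."""
    return Counter(rel for tree in trees for rel in tree)

def majority_consensus(trees):
    """Keep relationships found in more than half of the trees."""
    counts = relationship_counts(trees)
    return {rel for rel, n in counts.items() if n > len(trees) / 2}

def strict_consensus(trees):
    """Keep only relationships found in every tree."""
    counts = relationship_counts(trees)
    return {rel for rel, n in counts.items() if n == len(trees)}

# Hypothetical input: four trees over six taxa, each written as a set of
# splits. Every split is assumed to be written the same way in every tree.
example_trees = [
    {"CS|TJLN", "CST|JLN", "CSTJ|LN"},
    {"CS|TJLN", "CSJ|TLN", "CSTJ|LN"},
    {"CT|SJLN", "CST|JLN", "CSTJ|LN"},
    {"CS|TJLN", "CST|JLN", "CSTL|JN"},
]

majority = majority_consensus(example_trees)
strict = strict_consensus(example_trees)
print(sorted(majority))
print(sorted(strict))
```

In this made-up input no relationship is shared by all four trees, so the strict consensus is empty, which illustrates how much information a single summary tree can discard.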

In our example, the consensus shows that there is not much in common among the four trees. But, if we look at distance, we could gain more information. For example, which of the trees are most closely related? In phylogenetics, distance is generally defined in terms of the relationships expressed by bipartitions. A bipartition corresponds to an edge that, when removed, separates the taxa of the tree into two partitions. Assume that C, S, T, J, L, and N represent Clouded Leopard, Snow Leopard, Tiger, Jaguar, Leopard, and Lion, respectively. For tree 1, the bipartitions are C|STJLN, CS|TJLN, CST|JLN, CSTJ|LN, and CSTJL|N. Bipartition C|STJLN means there is an edge that, when removed, leaves one partition containing Clouded Leopard and the other partition containing Snow Leopard, Tiger, Jaguar, Leopard, and Lion. We can compute the Robinson-Foulds (RF) distance between two trees Ti and Tj by counting the number of bipartitions in Ti but not in Tj and adding that to the number of bipartitions in Tj but not in Ti. The RF distance is then this sum divided by 2. Based on the RF distance matrix of our big cat trees shown below, Trees 1 and 4 as well as Trees 3 and 4 are the closest trees since they have the smallest RF distance of 1.
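As a minimal sketch of the RF calculation described above, the following Python uses the bipartitions listed for tree 1. The second tree is hypothetical (it swaps Jaguar and Leopard) and is not one of the four published trees; the function names are our own.

```python
def parse_bipartition(text):
    """Turn 'CS|TJLN' into an order-independent pair of taxon sets."""
    left, right = text.split("|")
    return frozenset({frozenset(left), frozenset(right)})

def rf_distance(bips_i, bips_j):
    """RF distance: (bipartitions only in Ti + only in Tj) / 2."""
    only_i = bips_i - bips_j
    only_j = bips_j - bips_i
    return (len(only_i) + len(only_j)) / 2

# Tree 1, using the bipartitions given in the text.
tree1 = {parse_bipartition(b) for b in
         ["C|STJLN", "CS|TJLN", "CST|JLN", "CSTJ|LN", "CSTJL|N"]}

# Hypothetical comparison tree that differs from tree 1 in one bipartition.
other_tree = {parse_bipartition(b) for b in
              ["C|STJLN", "CS|TJLN", "CST|JLN", "CSTL|JN", "CSTJL|N"]}

print(rf_distance(tree1, other_tree))  # 1.0
```

Storing each bipartition as a pair of sets means that CSTJ|LN and LN|CSTJ are treated as the same relationship, so the comparison does not depend on how a split happens to be written.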

RFMatrix

What tools exist for computing tree distances?

One of the main focuses in our lab is designing high-performance algorithms for comparing trees. For computing RF distances between thousands of trees, we have designed the algorithms HashRF and MrsRF. Besides bipartitions, quartets are also used for describing the relationships in a tree. Whereas a bipartition shows the relationship between all of the taxa in a tree, a quartet is based on 4 taxa. As with bipartitions, we can then use quartets to compare trees. To compute the quartet distance quickly, we have designed the Quick Quartet algorithm. Finally, an interesting consequence of tree distance is that we can use it to compress collections of trees. If trees have much in common, they can be stored in a smaller representation. Our TreeZip algorithm is a first step in the direction of compressing phylogenetic trees.
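To make the idea of a quartet distance concrete, here is a deliberately naive Python sketch. It is not the Quick Quartet algorithm: it simply checks every set of four taxa, which is far too slow for large trees. The two example trees are hypothetical and the function names are our own.

```python
from itertools import combinations

def parse_bipartition(text):
    """Turn 'CS|TJLN' into a (left, right) pair of taxon sets."""
    left, right = text.split("|")
    return (frozenset(left), frozenset(right))

def quartet_topology(four, bipartitions):
    """Return how a tree groups four taxa (e.g. ab|cd), or None if unresolved."""
    a, b, c, d = four
    for pairing in (((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))):
        p, q = frozenset(pairing[0]), frozenset(pairing[1])
        for left, right in bipartitions:
            # The quartet is displayed if some edge separates the two pairs.
            if (p <= left and q <= right) or (p <= right and q <= left):
                return frozenset({p, q})
    return None

def quartet_distance(taxa, bips_i, bips_j):
    """Count the four-taxon subsets on which the two trees disagree."""
    return sum(
        1
        for four in combinations(sorted(taxa), 4)
        if quartet_topology(four, bips_i) != quartet_topology(four, bips_j)
    )

# Two hypothetical trees over six taxa that differ only in the placement
# of J and L.
tree_a = [parse_bipartition(b) for b in ["CS|TJLN", "CST|JLN", "CSTJ|LN"]]
tree_b = [parse_bipartition(b) for b in ["CS|TJLN", "CST|JLN", "CSTL|JN"]]

print(quartet_distance("CSTJLN", tree_a, tree_b))
```

The two trees disagree only on quartets that contain J, L, and N together, so the distance here is small relative to the 15 possible quartets over six taxa.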

How can distance measures help us build the Tree of Life?

Distance measures are essential in the synthesis of new trees into the ToL. If for a particular set of taxa the distances are large, this could mean there is significant disagreement on the relationships in that part of the ToL. On the other hand, if the trees are close in terms of distance, there is evidence for substantial agreement among the trees. For trees being added to the ToL, distances can help guide the integration of the new trees. Large distances may require significant manual curation to integrate the trees, whereas small distances indicate substantial agreement with the existing ToL and allow the curator to focus on a smaller set of trees.

Tiffani Williams is an assistant professor in the department of computer science at Texas A&M University.

Ralph Crosby is a graduate teaching assistant at Texas A&M University.

Grant Brammer is a graduate teaching assistant at Texas A&M University.

Mapping the Tree of Life: the ARBOR Project

Open Tree of Life met with ARBOR, a program funded by the National Science Foundation, to talk about recent changes involving the synthetic tree of life. We spoke with Dr. Luke Harmon, an associate professor in the University of Idaho’s Department of Biology. Dr. Harmon has been using comparative biology to determine what the tree of life can tell us about evolution over long time scales.

What has ARBOR been working on right now?

Comparative biology is at the heart of the ARBOR project. Using the evolutionary relationships among species, we can learn something about trait evolution and the formation of new species. For example, there really is no basic ‘ladder of life’ stemming from simpler organisms to more complex; instead, evolution varies among groups and through time in complex and interesting ways. It’s hard to do what we do with traditional tools. Instead, we have to use new tools to analyze how species have diversified to generate the tree of life.

How have phylogeny studies changed over time?

A lot of progress has been made in the last twenty years regarding our understanding of the relationships among different species. We now know a lot more about how species are related to one another and how they evolved from their common ancestors. The Open Tree of Life is the best possible example of this sort of synthesis – it’s almost like the human genome project in that it is generating a very good map that will connect all organisms on earth in a single phylogenetic tree. One problem, though, is that there is just so much information contained in large phylogenetic trees, and we don’t always know how to extract information about how organisms evolve. ARBOR is developing tools to read the stories of evolution from these phylogenies.

Taxonomy and the tree of life

What’s in a name?

It is now widely accepted that taxonomy should reflect phylogeny — that the names we use in biological classifications should refer to branches on the tree of life. This was one of Darwin’s most revolutionary ideas, that common ancestry is the fundamental organizing principle for natural classification:

“… community of descent is the hidden bond which naturalists have been unconsciously seeking.”

Charles Darwin, On the Origin of Species

One of the main goals of the Open Tree of Life project is to facilitate phylogenetic “synthesis”. What does this mean? The general idea is to take disparate pieces of information — in this case, phylogenetic trees from the scientific literature, or the data sets on which they are based — and merge them together in ways that yield more comprehensive and (hopefully) more accurate inferences of the tree of life as a whole. Like a jigsaw puzzle, the assembled pieces reveal the big picture.

Taxonomy is central to this exercise, because names are the primary link between the products of phylogenetic research. Without taxonomy, a phylogenetic tree from a typical study would simply depict relationships among individual organisms. This would not, in general, be very useful. Imagine if someone told you: “I know of a red house and a blue house, and the road between them runs north-south for about 100 miles.” Without any additional information, this statement has little if any value. For it to make sense, you would ideally want to know the address of each house, and the name of the road connecting them; but even incomplete information (what cities and states are the houses in?) is better than nothing. Only then could you figure out that the route in question is, for example, Interstate 94 between Chicago, IL and Milwaukee, WI.

Similarly, the organisms used in a particular phylogenetic study must be taxonomically classified in order to establish, like pins on a map, how the branches of the inferred tree represent “real” branches in the tree of life. This allows common relationships across studies to be discovered. To continue the analogy, if you know of a yellow house in Chicago and a green house in Milwaukee, you also know that I-94 connects them just as it does the red and blue houses mentioned above. The phylogenetic tree relating a rose, pumpkin, and oak depicts the same relationships — that is, it traces essentially the same evolutionary history — as the tree relating an apple, cucumber, and walnut. In each case, different organisms were chosen to represent the angiosperm orders Rosales, Cucurbitales, and Fagales, respectively.
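The role of taxonomy as a set of “pins on a map” can be sketched in Python. In this illustration, the tip-to-order assignments follow the examples in the text, but the tree shapes are assumed purely for the sake of the example, and the names `TIP_TO_ORDER` and `relabel` are our own.

```python
# Map each sampled organism to the order it represents.
TIP_TO_ORDER = {
    "rose": "Rosales", "apple": "Rosales",
    "pumpkin": "Cucurbitales", "cucumber": "Cucurbitales",
    "oak": "Fagales", "walnut": "Fagales",
}

def relabel(tree):
    """Replace tips with their orders; child order within a clade is ignored."""
    if isinstance(tree, str):
        return TIP_TO_ORDER[tree]
    return frozenset(relabel(child) for child in tree)

# Two studies with different exemplar species, written as nested tuples.
# The branching pattern shown is an assumption made for this sketch.
study_one = ("rose", ("pumpkin", "oak"))
study_two = (("walnut", "cucumber"), "apple")

print(relabel(study_one) == relabel(study_two))  # True
```

Once the tips are mapped to taxa, the two trees become directly comparable even though they share no organisms.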

You might recognize something paradoxical here. I started off by stating that taxonomy should reflect phylogeny. But then, I proceeded to describe how taxonomy is needed to interpret the results of phylogenetic studies. If taxonomy reflects knowledge of phylogeny, and knowledge of phylogeny is derived from studies of organisms chosen for the taxa they represent, isn’t this a chicken-and-egg problem?

The short answer: yes, it is. Systematic biology is a science of reciprocal illumination between, on one hand, what we discover about the tree of life, and on the other, how we reflect and communicate that knowledge through taxonomy. One can view a taxonomic hierarchy — the arrangement of species within genera, genera within families, and so on — as a working hypothesis, subject to revision. Taxonomic names refer to branches on the tree of life that we believe to exist, but we are open to new information that may change our view. For example, we might discover that members of two genera, hypothesized to be exclusive groups based on their morphological differences, are in fact co-mingled on the same branch of the tree of life when DNA evidence is studied. The question then arises: what happens to the names of the original genera? How should we refer to their common branch? These are issues of nomenclature, a topic beyond the scope of this blog post, but the bottom line is that eventually, taxonomy should be updated to reflect this new knowledge.

The tension between taxonomy and phylogeny is at the heart of the basic question, “what do we know about the tree of life, and how do we know it?” While this question is somewhat metaphysical, it also has very practical implications of immediate concern to the Open Tree project. Most importantly, it has been necessary for us to cobble together a comprehensive taxonomic hierarchy that includes all of life, since none existed previously that were reasonably up-to-date. This “Open Tree Taxonomy” serves a critical purpose — basically, it is what allows us to wrangle herds of phylogenetic trees into a common bioinformatic corral. The challenge we face moving forward is how our synthesis efforts can be leveraged to improve and refine our working taxonomy, closing the loop of reciprocal illumination that is central to the discipline of systematics.

Richard Ree is a curator at the Field Museum of Natural History and a faculty member of the Committee on Evolutionary Biology at the University of Chicago.

Social curation of phylogenetic studies

People associated with the Open Tree of Life effort are busy on several fronts: writing a paper describing the initial draft release of a comprehensive tree of life, continuing their efforts to obtain estimates of different parts of the tree, improving the Open Tree Taxonomy (OTT) used for name matching, experimenting with new methods for building large trees…

In the midst of that activity (and well aware that we missed our initial goal of having the first release in the first year of the grant), we have recently started to redesign the study curation tool. The goal is to build the tool around git and GitHub. This decision could be described using a wide variety of adjectives ranging from “foolish” to “inspired” (and probably including several that are not printable on this family-friendly blog). So, I (Mark Holder is writing this post) thought that I’d explain the rationale behind this decision.

Why do we need to “curate” published trees in the first place?

Unfortunately, even when we can find a phylogenetic estimate in a digital format, some crucial information is often missing. The tasks in the “curation” process typically include:

  • matching the tips of the tree to the appropriate taxon in a taxonomy (OTT in our case);
  • indicating which parts of the tree are rooted with high confidence. Many phylogenetic estimation procedures produce unrooted estimates, and the trees that they emit are often arbitrarily rooted. Properly identifying the “outgroup” is important for the supertree methods that we are using; and
  • describing what the branch lengths and internal node labels on the tree mean.

In our first year of work on the Open Tree of Life project, we’ve also found many cases in which it would be nice if a downstream software tool could annotate the source tree.

For example, if a phylogeny of plants contains a single animal species, this odd sampling of species could be caused by an incorrect matching of names when the study was imported into the Open Tree of Life system. There are valid homonyms in parts of life that are governed by different nomenclatural codes; the Wikipedia page on homonyms has a nice discussion of this topic, including the example of the genus name Erica being used for a jumping spider and a large group of flowering plants known as “heath”. The warning signs of incorrect name matching may not be obvious when a new study is added to the Open Tree of Life system. Ideally, these potential errors would be flagged with comments so that a taxonomic expert could double check the name matching.
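A downstream check of this kind might look something like the following Python sketch. It is only an illustration: the function name, the 5% threshold, and the input (a mapping from tip label to a broad group taken from the taxonomy) are all assumptions, not part of any Open Tree tool.

```python
from collections import Counter

def flag_outlier_tips(tip_to_group, max_fraction=0.05):
    """Return tips whose broad group is rare within the study."""
    counts = Counter(tip_to_group.values())
    total = len(tip_to_group)
    rare_groups = {g for g, n in counts.items() if n / total <= max_fraction}
    return sorted(tip for tip, g in tip_to_group.items() if g in rare_groups)

# Hypothetical study: thirty plant tips plus one tip that name matching
# assigned to an animal genus.
study_tips = {f"plant_tip_{i}": "plants" for i in range(30)}
study_tips["Erica"] = "animals"

flagged = flag_outlier_tips(study_tips)
print(flagged)  # ['Erica']
```

A flagged tip is not necessarily wrong; the point is only to direct a taxonomic expert to the name matches most worth a second look.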

Why not just build a database-driven website with a “page” for each study so that you can update the study information in one place?

This is exactly what we have done. Fortunately for the project, Rick Ree’s lab already had a tool (phylografter) that did many of these tasks. Rick and his group have continued to improve phylografter as a part of the Open Tree of Life project. The fact that we started the project with a nice tool for study curation is a big part of the reason that we were able to get trees from about 2500 studies into the Open Tree of Life system in this first year (the other “big parts” are the herculean efforts of Bryan Drew, Romina Gazis, Jiabin Deng, Chris Owen, Jessica Grant, Laura Katz, and others to import and curate studies).

If it is not broken, why are we trying to “fix” it?

One of the primary goals of the Open Tree of Life project is to enable the community of biologists to collaboratively assemble phylogenetic knowledge. We are trying to build infrastructure for a system that is as inviting as possible to the community of biologists and software developers. Those goals imply that we should track the contribution of users in a fine-grained manner (so people will get the credit that they deserve), and that the system be open to contributions through many avenues (so that developers will not be constrained to work within one tightly integrated code base).

Phylografter is open in many senses: the code is open-source (see its repository), the study data can be exported via web services (this code snippet is an example of using the service), and interested parties can become study curators. However, the fundamental data store used by phylografter is an SQL database. All writing to the core data store has to be done via adding new functionality to the phylografter tool itself. This is certainly not impossible, but it is not very inviting to developers outside the project who want to dabble with the project.

For example, imagine that you wrote a tool that identifies groupings which might be the result of long branch attraction. To integrate that sort of annotation tool into our current architecture, you would need to figure out the SQL tables that would be affected, write an interface for adding this form of annotation, and implement a system for keeping track of the provenance of each change. This is all possible to do, but much more complicated than writing a tool that simply adds an annotation to a file.
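By contrast, a tool that annotates a file can be very small. The Python sketch below appends a note to a JSON study file. The layout of the file (a top-level `annotations` list with `tree`, `message`, `author`, and `date` fields) is invented for illustration and is not the NexSON schema.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

def add_annotation(path, tree_id, message, author):
    """Append an annotation to a JSON study file (hypothetical layout)."""
    with open(path) as handle:
        study = json.load(handle)
    study.setdefault("annotations", []).append({
        "tree": tree_id,
        "message": message,
        "author": author,
        "date": datetime.now(timezone.utc).isoformat(),
    })
    with open(path, "w") as handle:
        json.dump(study, handle, indent=2)

# Demonstration on a throwaway file with made-up identifiers.
demo_path = os.path.join(tempfile.mkdtemp(), "study.json")
with open(demo_path, "w") as handle:
    json.dump({"study_id": "example_study", "trees": ["tree1"]}, handle)

add_annotation(demo_path, "tree1",
               "possible long branch attraction", "lba_checker")

with open(demo_path) as handle:
    print(json.load(handle)["annotations"][0]["message"])
```

If the file lives in a version-controlled repository, the history of who changed what comes from the version control system rather than from custom bookkeeping code.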

Maybe it won’t be too hard to open up the database of phylogenetic studies as versioned text.

Fortunately, the process of adding corrections and annotations to a text file in a collaborative setting is a common problem, and some excellent software tools exist for dealing with this situation. In particular we can use the git content tracker to store the versions of a study in a reliable, secure manner with full history of the file and rich tools that allow many people to collaborate on the same file. GitHub offers some great add-on features (including dealing with authentication of users) and makes it easy to have a core data store that anyone can access. The Open Tree of Life is making heavy use of NexSON already, and that format supports rich annotation (though we do need to iron out the details of a controlled vocabulary). So we should not have to spend much time on designing the format of the files to be managed by git.

We certainly aren’t the first to think of using git as the database for an application (see the gollum project and git-orm, for example). Nor are we the first to think of using GitHub to make data in systematics more open. I love Rutger Vos’ dump of treebase data in https://github.com/rvosa/supertreebase. Ross Mounce has recently started putting many datafiles that he uses in his research on https://github.com/rossmounce/cladistic-data. Rod Page had a nice post a while back titled “Time to put taxonomy into GitHub.” I’m sure there are more examples.

git and GitHub keep coming up in the context of collaboratively editing data, because most software developers who have used the tools recognize how they have transformed collaborative software development. Implementing a social tool is tough, but git seems to have done it right. Everyone gets an entire copy of the data (via git clone). You can make your changes and save them in your own sandbox (via committing to a fork or branch). When you think that you have a set of changes that are of interest to others, you can ask that they get incorporated into the primary version of the database (via a pull request).

Of course, most biologists won’t want to use the git tool itself. Fortunately we have some very talented developers (Jim Allman, Jonathan “Duke” Leto, and Jonathan Rees) working on a web application that will hide the ugly details from most users. We’re also working on allowing phylografter to receive updated NexSON files, so we won’t have to abandon that tool for curating study data.

It is a bit scary to be adding a new tool this late in our timeline. But we’re really excited about the prospect of having a phylogenetic data curation tool built on top of a proven system for collaboration.

Comments, questions and suggestions are certainly welcome. The software dev page on our wiki has links to many of the communication tools that the Open Tree of Life software developers are using to discuss these (and other) ideas in more detail.

Mark Holder is an associate professor at the University of Kansas’s Department of Ecology and Evolutionary Biology.

Minor edits on Sunday, Oct 6 at 1:30 Eastern: links added for OTT and SQL

The Crandall lab explores solutions to incomplete phylogenies

The Crandall Lab is in charge of uploading and curating animal studies for the AVAToL-Open Tree project. Chris Owen, a postdoctoral researcher, has been leading the animal portion of the project since March 2013. To date, the Crandall Lab has contributed over 400 studies and has sent requests to the authors of over 100 studies, asking them to contribute their phylogenies to the Open Tree project.

As with the Soltis Lab group, the Crandall Lab’s success rate for obtaining published phylogenies directly from authors has been rather low. As a result, many animal lineages are currently represented in the Open Tree as taxonomic graphs. One example of a poorly sampled group is the decapods (crabs, crayfish, lobsters, prawns, and shrimp). Dr. Keith Crandall has studied decapods for most of his career and his phylogenies generate a well-sampled backbone, but each higher taxon is represented by few species. Many researchers want to use the tree for some downstream analysis that benefits from sampling all species; therefore, at this stage of the project one must ask, “How can I obtain a phylogeny of all species for my favorite group, if the only thing available in Open Tree is a well-resolved backbone, while lower taxonomic ranks are represented primarily by unresolved taxonomic graphs?”

Recently, a paper was published in the journal Nature that may present a workaround for people who wish to obtain a mostly bifurcating comprehensive phylogeny, even though only a bifurcating backbone is available in Open Tree. The published study by Jetz et al. (2013) aimed to use a phylogeny of birds to explore changes in speciation and extinction rate through time, while also mapping all bird diversity, to gain insight into bird evolution. In order to explore these characteristics of bird evolution, the authors first needed a phylogeny of birds that included all species. However, no such phylogeny had ever been published, and the most comprehensive bird phylogenies available at the time of the study did not contain all species for each crown clade. Their solution to generating a phylogeny of all birds began by assigning each avian genus to a crown clade represented in the backbone phylogenies. Next, sequence data for a set of loci for each species in a crown clade were downloaded from public databases and the phylogeny was estimated using Bayesian inference. Since the crown clades of the backbone tree contain taxa also in the newly estimated crown phylogenies, the newly estimated crown phylogenies were sub-sampled with the backbone phylogenies to generate a pseudo-posterior distribution of complete avian phylogenies, which was used to depict the avian phylogeny with all species for downstream analyses.

As the organismal labs continue to track down studies and wait for requested published phylogenies, a method such as this may be a temporary solution to obtain mostly bifurcating phylogenies for lineages not well represented by source trees. Furthermore, variations on this theme could also be used. For example, one could estimate a single tree for each crown clade and, using Open Tree software, merge each of those trees with an Open Tree phylogeny whose backbone is well resolved but whose recent clades are unresolved, ultimately creating a synthetic tree for your favorite group.
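The merging step in that variation can be sketched, in a much simplified form, as grafting clade trees onto backbone tips. The Python below works on Newick strings without branch lengths; the clade and species labels are placeholders, the function name is our own, and this is not the Open Tree software or the sampling procedure of Jetz et al.

```python
import re

def graft(backbone, clade_trees):
    """Replace each backbone tip that names a crown clade with that clade's subtree."""
    def swap(match):
        label = match.group(0)
        # Tips without an estimated clade tree are left as they are.
        return clade_trees.get(label, label)
    return re.sub(r"[A-Za-z_][A-Za-z0-9_]*", swap, backbone)

# Hypothetical backbone with three crown clades; trees are available for two.
backbone = "((CladeA,CladeB),CladeC);"
clade_trees = {
    "CladeA": "(a1,(a2,a3))",
    "CladeB": "(b1,b2)",
}

combined = graft(backbone, clade_trees)
print(combined)  # (((a1,(a2,a3)),(b1,b2)),CladeC);
```

A real merge would also need to reconcile branch lengths and any taxa shared between the backbone and the clade trees, which this sketch ignores.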

These are a couple of potential methods for using the Open Tree to generate comprehensive phylogenies for poorly resolved lineages represented only by taxonomy, and we look forward to the new ideas other researchers will offer once the tree becomes public.

Keith Crandall is a professor and director at the George Washington University Institute of Computational Biology.

Chris Owen is a post-doctoral researcher for the AVAToL grant at George Washington University.

Recommending CC0 for GBIF data

GBIF (Global Biodiversity Information Facility) recently issued a request for comment on its data licensing policy. While Open Tree of Life does not currently use specimen data, we do use the GBIF classification in order to help resolve names and also as part of the opentree backbone. Jonathan Rees, Karen Cranston, Todd Vision and Hilmar Lapp wrote a response recommending a CC0 waiver for all GBIF data. Here is our summary, and a link to the full response on Figshare.

Summary

As a data aggregator, the goal of GBIF should be to find policies that benefit both its data providers and data reusers. Clearly, a GBIF that has little or no data will have little value, but so will a GBIF full of data that is encumbered with restrictions to an extent that stifles reuse. Our response follows from the proposition that promoting data reuse should be a shared interest of all the parties: data providers, data users, and GBIF itself. We feel the consultation document missed the opportunity to recognize this shared interest, and that furthering the goal of data reuse should in fact be a primary yardstick by which different licensing options are measured.

Tracking the reuse of data is a critically important goal, as it provides a means of reward to data providers, allows scrutiny of derived results, and enables discovery of related research. Initiatives such as DataCite have made considerable progress in recent years in enabling this tracking by addressing the sociotechnical obstacles to it. By contrast, the consultation, in our view, puts undue weight on legal requirements for attribution. Legal instruments such as licenses are unsuitable for, not designed for, and of little if any benefit for this purpose. Moreover, in most of the world, there is little to no formally recognized intellectual property protection for data, and it is on such protection that licenses rest.

In short, our recommendations are (1) that all data in GBIF be released under Creative Commons Zero (CC0), which is a public domain dedication that waives copyright rather than asserting it; (2) GBIF should set clear expectations in the form of community norms for how the data that it serves is to be referenced when reused, and (3) GBIF should work with partner organizations in promoting standards and technologies that enable the effective tracking of data reuse.

We note that our analysis is based on our understanding of the law; we are not legal professionals and this is not legal advice.

Full response

Response to GBIF request for consultation on data licenses. Karen Cranston, Todd Vision, Hilmar Lapp, Jonathan Rees. figshare.
http://dx.doi.org/10.6084/m9.figshare.799766

The Soltis lab fills the gaps in green plant phylogeny for the Open Tree of Life

Phylogenetic tree summarizing relationships among major lineages of green plants (Viridiplantae)


In the Soltis lab at the University of Florida, Bryan Drew and Jiabin Deng have spent much of the past year collecting trees and alignments of green plants (Viridiplantae) as part of an effort to produce a synthetic tree that represents all of the described organisms on Earth. As part of the tree-gathering process, they have gleaned public database archives and contacted corresponding authors directly to request data. Although these methods were not as successful as had been hoped, they recovered trees from over 1000 publications involving green plants.

As might be expected, some areas of the green plant tree are better resolved than others. For example, within gymnosperms and flowering plants we have author-submitted trees that support the monophyly of most major lineages, but for other major lineages of green plants, such as green algae and bryophytes, sampling is not as complete and those parts of the tree are not as well resolved. Fortunately, for green algae at least, help is on the way in the form of the NSF-funded “Assembling the Green Algae Tree of Life” project. Although results from this project will not be incorporated into the upcoming Open Tree of Life “Big Bang Tree”, within a few years the green algae portion of the Open Tree will undoubtedly benefit greatly from the inclusion of trees from the Green Algae Tree of Life project. Other parts of the green plant tree are shaping up nicely, and the Soltis lab is sending out some last-minute requests to authors in an attempt to shore up regions of the tree that are presently underrepresented.

Here we provide a basic summary of what we know about green plant phylogeny, stressing that there is much we still do not know about relationships in this large clade of perhaps 500,000 species. We know from the fossil record that many green plant taxa have gone extinct; these extinctions contribute to “long branches” in the Tree of Life and can make it very difficult to determine relationships between older lineages. In the green plant tree, two main clades have been recovered, the Chlorophyta and the Streptophyta. The chlorophytes contain most of what is traditionally known as green algae, while the streptophytes contain the remaining green algae as well as land plants (Embryophyta). One of the many insights provided by molecular systematics during the past twenty years is that “green algae” as long recognized are not actually a natural group (i.e., they are not monophyletic), and that some traditionally classified “green algae” are actually more closely related to land plants. However, the closest “green algal” relative of land plants remains unclear: some studies suggest Charales, whereas others indicate Zygnematales or Coleochaetales. The land plants (embryophytes) include bryophytes (mosses, hornworts, and liverworts) and vascular plants (tracheophytes). There is still some question as to whether the bryophytes are a natural group or comprise separate evolutionary lineages. The vascular plants comprise lycophytes (clubmosses and quillworts), monilophytes (e.g., ferns and horsetails), gymnosperms (cycads, Ginkgo, gnetophytes, and conifers), and angiosperms (flowering plants).

Though the relationships of some large clades are uncertain, these uncertainties will be shown in the Big Bang tree given that we possess many of the trees that highlight these different clade placements. In other areas of the green plant tree we are sorely lacking data, and the Soltis lab (in close collaboration with Stephen Smith’s lab at the University of Michigan) is still working hard to fill in the tens of thousands of holes in the tree that remain. This is a beautiful part of the Open Tree of Life: as with the organisms that it represents, the tree is ever growing!

Doug Soltis is a distinguished professor at the University of Florida.

What do mycologists think about the tree of life?

David Hibbett screenshot of presentation

Two Open Tree participants, Romina Gazis and David Hibbett, recently attended the annual meeting of the Mycological Society of America in Austin, Texas. Romina gave a presentation about the Open Tree of Life Project, which gave us a chance to hear some thoughts from our community. Questions (paraphrased) included the following:

“When the synthetic tree is available, will we be able to filter on a node-by-node basis, or just tree-by-tree? For example, will we be able to identify the strongly supported nodes in individual trees and then constrain the synthetic tree to include those nodes, but not other, weakly supported nodes?”

Capturing information about individual branches, such as support values and branch lengths, is difficult, and in some cases impossible, because the trees were deposited without such information included. It is possible to make decisions about priority on a node-by-node basis, but this requires decision-making by the individual performing the synthesis.

“Can this synthetic view of the tree be used to guide genome sampling priorities?”

Absolutely! In fact, the ongoing 1000 Fungal Genomes Project is already using taxonomy to guide sampling. Open Tree will be able to help in this effort by providing a comprehensive view of phylogenetic diversity of Fungi that will help identify clades that are poorly sampled. We will also be able to prioritize genome-based studies during synthesis, which should allow us to create trees based on a very robust backbone.

Numerous talks and posters at MSA concerned fungal phylogenetics and taxonomy. So much progress is being made! For example, there were presentations on systematics of chytrids, downy mildews, rusts, earth tongues, lichens, mushrooms, and many more. At the same time, in the course of developing the first synthetic trees for this project, it has become abundantly clear that the major centralized taxonomic resources, like the Global Biodiversity Information Facility (GBIF) and the National Center for Biotechnology Information (NCBI), have a hard time capturing phylogenetic knowledge. To be fair, it is unreasonable to think that any single organization can keep track of all the progress in taxon discovery and phylogenetic inference across the entire tree of life. Sitting in the audience at MSA, I wondered how long it would take for the trees being projected on-screen to be reflected in the taxonomy presented by organizations like GBIF or NCBI (or EOL, CoL, etc.). Perhaps a new, community-based approach is needed for building a taxonomic commons?

For the .pdf file of Open Tree of Life’s Challenges and Progress for Fungi, check out Mycological Society of America 2013.

Dr. David Hibbett is a professor of Biology at Clark University.

Online publication to follow the three AVAToL projects

PLOS Currents: Tree of Life

Peer-reviewed articles about the Open Tree of Life as well as two related projects, Arbor and Phenomics, will be available on PLOS Currents: Tree of Life. The online publication allows the researchers to document their progress in developing software and other tools.

The three research endeavors were developed during an Ideas Lab last year as part of the National Science Foundation’s (NSF) Assembling, Visualizing, and Analyzing the Tree of Life (AVAToL) program. The Open Tree of Life project strives to produce the first draft of a comprehensive tree of life and provides tools for community enhancement and annotation. The Arbor project is developing comparative methods with utility across large sections and the entire tree of life. Finally, the Phenomics project is developing approaches for exploring and documenting phenotypic diversity across the tree of life.

“It’s meant to be a quick outlet for solid phylogenetic studies”

PLOS Currents websites encourage researchers to share their findings with their peers with minimal delay. The Tree of Life section is focused on rapid publication of phylogenetic and systematic studies with novel data and/or analyses. According to Keith Crandall, one of the three editors of the journal and an investigator of the Open Tree of Life, “it’s meant to be a quick outlet for solid phylogenetic studies to get them and their data into the public domain.”
