Building Open Tree

The Crandall lab explores solutions to incomplete phylogenies

The Crandall Lab is in charge of uploading and curating animal studies for the AVAToL-Open Tree project. Chris Owen, a postdoctoral researcher, has led the animal portion of the project since March 2013. To date, the Crandall Lab has contributed over 400 studies and has sent requests to the authors of over 100 studies to contribute their phylogenies to the Open Tree project.

As with the Soltis lab group, the Crandall Lab's success rate for obtaining published phylogenies directly from authors has been rather low. As a result, many animal lineages are currently represented in the Open Tree as taxonomic graphs. One example of a poorly sampled group is the decapods (crabs, crayfish, lobsters, prawns, and shrimp). Dr. Keith Crandall has studied decapods for most of his career, and his phylogenies generate a well-sampled backbone, but each higher taxon is represented by few species. Many researchers want to use the tree for some downstream analysis that benefits from sampling all species; therefore, at this stage of the project one must ask, “How can I obtain a phylogeny of all species for my favorite group, if the only thing available in Open Tree is a well-resolved backbone, while lower taxonomic ranks are represented primarily by unresolved taxonomic graphs?”

Recently, a paper published in the journal Nature presented a possible workaround for people who wish to obtain a mostly bifurcating, comprehensive phylogeny even though only a bifurcating backbone is available in Open Tree. The study, by Jetz et al. (2013), used a phylogeny of birds to explore changes in speciation and extinction rates through time, while also mapping all bird diversity, to gain insight into bird evolution. To explore these characteristics of bird evolution, the authors first needed a phylogeny of birds that included all species. However, no such phylogeny had ever been published, and the most comprehensive bird phylogenies available at the time of the study did not contain all species for each crown clade. Their solution began by assigning each avian genus to a crown clade represented in the backbone phylogenies. Next, sequence data for a set of loci for each species in a crown clade were downloaded from public databases, and the phylogeny was estimated using Bayesian inference. Since the crown clades of the backbone tree contain taxa also in the newly estimated crown phylogenies, the newly estimated crown phylogenies were sub-sampled with the backbone phylogenies to generate a pseudo-posterior distribution of complete avian phylogenies, which was used to depict the avian phylogeny with all species for downstream analyses.

As the organismal labs continue to track down studies and wait for requested published phylogenies, a method such as this may be a temporary solution for obtaining mostly bifurcating phylogenies for lineages not well represented by source trees. Variations on this theme could also be used. For example, one could estimate a single tree for each crown clade and then use Open Tree software to merge each of those trees with the Open Tree phylogeny, which has a well-resolved backbone but unresolved recent clades, ultimately creating a synthetic tree for your favorite group.
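The single-tree variation described above can be pictured with a toy example. This is only a minimal sketch, not Open Tree software: it assumes trees are stored as nested tuples with strings as tips, and the clade and species names (`clade_A`, `sp_a1`, and so on) are made up for illustration.

```python
# Toy representation (an assumption for this sketch): a tree is a nested
# tuple, and a tip is a string.

def graft(backbone, clade_trees):
    """Replace each backbone tip that names a crown clade with that clade's tree."""
    if isinstance(backbone, str):
        # Tips without a resolved clade tree are left unchanged.
        return clade_trees.get(backbone, backbone)
    return tuple(graft(child, clade_trees) for child in backbone)


def to_newick(tree):
    """Write the nested-tuple tree as a Newick string (without the final ';')."""
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(to_newick(child) for child in tree) + ")"


def tips(tree):
    """List the tips of a tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    return [tip for child in tree for tip in tips(child)]


# Hypothetical well-resolved backbone whose tips are crown clades.
backbone = (("clade_A", "clade_B"), "clade_C")

# Hypothetical species-level trees estimated separately for two of the clades.
clade_trees = {
    "clade_A": (("sp_a1", "sp_a2"), "sp_a3"),
    "clade_B": ("sp_b1", "sp_b2"),
}

full = graft(backbone, clade_trees)
print(to_newick(full) + ";")
```

Jetz et al. repeated a step like this many times over sampled trees to build their pseudo-posterior distribution; the sketch performs a single graft and ignores branch lengths.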

These are a couple of potential methods for using the Open Tree to generate comprehensive phylogenies for poorly resolved lineages represented only by taxonomy, and we look forward to the new ideas other researchers will offer once the tree becomes public.

Keith Crandall is a professor and director at the George Washington University Institute of Computational Biology.

Chris Owen is a post-doctoral researcher for the AVAToL grant at George Washington University.

The Soltis lab fills the gaps in green plant phylogeny for the Open Tree of Life

Phylogenetic tree summarizing relationships among major lineages of green plants (Viridiplantae)


In the Soltis lab at the University of Florida, Bryan Drew and Jiabin Deng have spent much of the past year collecting trees and alignments of green plants (Viridiplantae) as part of an effort to produce a synthetic tree that represents all of the described organisms on Earth. As part of the tree-gathering process, they have gleaned public database archives and contacted corresponding authors directly to request data. Although these methods were not as successful as had been hoped, they recovered trees from over 1000 publications involving green plants.

As might be expected, some areas of the green plant tree are better resolved than others. For example, within gymnosperms and flowering plants we have author-submitted trees that support the monophyly of most major lineages, but for other major lineages of green plants, such as green algae and bryophytes, sampling is not as complete and those parts of the tree are not as well resolved. Fortunately, for green algae at least, help is on the way in the form of the NSF-funded “Assembling the Green Algae Tree of Life” project. Although results from this project will not be incorporated into the upcoming Open Tree of Life “Big Bang Tree”, within a few years the green algae portion of the Open Tree will undoubtedly benefit greatly from the inclusion of trees from the Green Algae Tree of Life project. Other parts of the green plant tree are shaping up nicely, and the Soltis lab is sending out some last-minute requests to authors in an attempt to shore up regions of the tree that are presently underrepresented.

Here we provide a basic summary of what we know about green plant phylogeny, stressing that there is much we still do not know about relationships in this large clade of perhaps 500,000 species. We know from the fossil record that many green plant taxa have gone extinct; these extinctions contribute to “long branches” in the Tree of Life and can make it very difficult to determine relationships between older lineages. In the green plant tree, two main clades have been recovered, the Chlorophyta and the Streptophyta. The chlorophytes contain most of what is traditionally known as green algae, while the streptophytes contain the remaining green algae as well as land plants (Embryophyta). One of the many insights provided by molecular systematics during the past twenty years is that “green algae” as long recognized are not actually a natural group (i.e., they are not monophyletic), and that some traditionally classified “green algae” are actually more closely related to land plants. However, the closest “green algal” relative of land plants remains unclear: some studies suggest Charales, whereas others indicate Zygnematales or Coleochaetales. The land plants (embryophytes) include bryophytes (mosses, hornworts, and liverworts) and vascular plants (tracheophytes). There is still some question as to whether the bryophytes are a natural group or comprise separate evolutionary lineages. The vascular plants comprise lycophytes (clubmosses and quillworts), monilophytes (e.g., ferns and horsetails), gymnosperms (cycads, Ginkgo, gnetophytes, and conifers), and angiosperms (flowering plants).

Though the relationships of some large clades are uncertain, these uncertainties will be shown in the Big Bang tree, given that we possess many of the trees that highlight these different clade placements. In other areas of the green plant tree we are sorely lacking data, and the Soltis lab (in close collaboration with Stephen Smith’s lab at the University of Michigan) is still working hard to fill in the tens of thousands of holes that remain in the tree. This is a beautiful part of the Open Tree of Life: as with the organisms that it represents, the tree is ever growing!

Doug Soltis is a distinguished professor at the University of Florida.

What do mycologists think about the tree of life?

Screenshot of David Hibbett’s presentation

Two Open Tree participants, Romina Gazis and David Hibbett, recently attended the annual meeting of the Mycological Society of America in Austin, Texas. Romina gave a presentation about the Open Tree of Life Project, which gave us a chance to hear some thoughts from our community. Questions (paraphrased) included the following:

When the synthetic tree is available, will we be able to filter on a node-by-node basis, or just tree-by-tree? For example, will we be able to identify the strongly supported nodes in individual trees and then constrain the synthetic tree to include those nodes, but not other, weakly supported nodes?

Capturing information about individual branches, such as support values and branch lengths, is difficult, and in some cases impossible, because the trees were deposited without such information included. It is possible to make decisions about priority on a node-by-node basis, but this requires decision-making by the individual performing the synthesis.
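To make the node-by-node idea concrete, here is a minimal sketch of the kind of filtering the question describes. It is not part of any Open Tree tool: the data layout (each clade as a set of tips paired with a support value), the 0.95 cutoff, and the species names are all assumptions chosen for illustration. Clades from trees deposited without support values are marked `None` and cannot pass the filter.

```python
def strongly_supported(clades, threshold=0.95):
    """Keep clades whose support is recorded and meets the threshold."""
    return [
        clade_tips
        for clade_tips, support in clades
        if support is not None and support >= threshold
    ]


# Hypothetical clades from one source tree: (set of tips, support value).
# None marks a clade from a tree deposited without support values.
source_clades = [
    (frozenset({"sp1", "sp2"}), 0.99),
    (frozenset({"sp1", "sp2", "sp3"}), 0.62),
    (frozenset({"sp4", "sp5"}), None),
]

# Clades that a synthesis could be asked to respect as constraints.
constraints = strongly_supported(source_clades)
print([sorted(clade) for clade in constraints])
```

The choice of threshold is exactly the kind of decision that falls to the individual performing the synthesis.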

Can this synthetic view of the tree be used to guide genome sampling priorities?

Absolutely! In fact, the ongoing 1000 Fungal Genomes Project is already using taxonomy to guide sampling. Open Tree will be able to help in this effort by providing a comprehensive view of phylogenetic diversity of Fungi that will help identify clades that are poorly sampled. We will also be able to prioritize genome-based studies during synthesis, which should allow us to create trees based on a very robust backbone.

Numerous talks and posters at MSA concerned fungal phylogenetics and taxonomy. So much progress is being made! For example, there were presentations on the systematics of chytrids, downy mildews, rusts, earth tongues, lichens, mushrooms, and many more. At the same time, in the course of developing the first synthetic trees for this project, it has become abundantly clear that the major centralized taxonomic resources, like the Global Biodiversity Information Facility (GBIF) and the National Center for Biotechnology Information (NCBI), have a hard time capturing phylogenetic knowledge. To be fair, it is unreasonable to think that any single organization can keep track of all the progress in taxon discovery and phylogenetic inference across the entire tree of life. Sitting in the audience at MSA, I wondered how long it would take for the trees being projected on-screen to be reflected in the taxonomy presented by organizations like GBIF or NCBI (or EOL, CoL, etc.). Perhaps a new, community-based approach is needed for building a taxonomic commons?

For the .pdf file of Open Tree of Life’s Challenges and Progress for Fungi, check out Mycological Society of America 2013.

Dr. David Hibbett is a professor of Biology at Clark University.

Online publication to follow the three AVAToL projects

PLOS Currents: Tree of Life

Peer-reviewed articles about the Open Tree of Life as well as two related projects, Arbor and Phenomics, will be available on PLOS Currents: Tree of Life. The online publication allows the researchers to document their progress in developing software and other tools.

The three research endeavors were developed during an Ideas Lab last year as part of the National Science Foundation’s (NSF) Assembling, Visualizing, and Analyzing the Tree of Life (AVAToL) program. The Open Tree of Life project strives to produce the first draft of a comprehensive tree of life and provides tools for community enhancement and annotation. The Arbor project is developing comparative methods with utility across large sections and the entire tree of life. Finally, the Phenomics project is developing approaches for exploring and documenting phenotypic diversity across the tree of life.

“It’s meant to be a quick outlet for solid phylogenetic studies”

PLOS Currents websites encourage researchers to share their findings with their peers with minimal delay. The Tree of Life section is focused on rapid publication of phylogenetic and systematic studies with novel data and/or analyses. According to Keith Crandall, one of the three editors of the journal and an investigator of the Open Tree of Life, “it’s meant to be a quick outlet for solid phylogenetic studies to get them and their data into the public domain.”

Presentation slides from Evolution 2013 available

Open Tree of Life at meetings

The Open Tree of Life project is one of the many phylogeny projects featured at the Evolution 2013 meeting, which is currently taking place in Snowbird, Utah. The presentation slides from Karen Cranston, the principal investigator of Open Tree of Life, are available online (LINK). Presentation slides from other investigators will be added here in the coming days.

Evolution 2013 is the joint annual meeting of the Society for the Study of Evolution (SSE), the Society of Systematic Biologists (SSB), and the American Society of Naturalists (ASN). The conference meets jointly with the iEvoBio conference. Open Tree of Life is represented at both events. About 1400 participants are expected to share their research in evolution, systematics, biodiversity, software, and mathematics.

Free webinar: Putting all species in a graph database

Biology + Technology = OTOL

On Thursday, during a free webinar, one of the developers of the Open Tree of Life will demonstrate how graph databases are used to construct a tree of life. The lecture is organized by Neo Technology, the maker of Neo4j, an open-source database that is used for OTOL.

Stephen Smith, an ecology and evolutionary biology professor at the University of Michigan, is going to explain how Neo4j and other digital technologies are assisting in constructing the tree of life. Starting at 10:00 PDT (19:00 CEST), he will also discuss other aspects of the interface of biology with next generation technologies.

“Our project is building the tools with which scientists in the community can continually improve the tree of life as we gather new information. Neo4j allows us to not only store trees in their native graph form, but also allows us to map trees to the same structure, the graph. So in fact, we are facilitating the construction of the graph of life,” says Smith.
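The idea of mapping several trees onto one shared graph can be sketched in a few lines. This is a toy illustration in plain Python, not the Neo4j API and not the project's actual data model: it assumes every node has a name that is shared across trees, and the node and source names (`A`, `X`, `study1`, and so on) are invented.

```python
from collections import defaultdict


def add_tree(graph, source, parent_of):
    """Record every child -> parent edge of one tree, tagged with its source."""
    for child, parent in parent_of.items():
        graph[(child, parent)].add(source)


def parents(graph, node):
    """Return each parent proposed for `node`, with the sources that support it."""
    return {
        parent: sorted(sources)
        for (child, parent), sources in graph.items()
        if child == node
    }


# The shared graph: (child, parent) edge -> set of sources containing that edge.
graph = defaultdict(set)

# Hypothetical inputs: a taxonomy and one study that disagree on where C attaches.
taxonomy = {"A": "X", "B": "X", "C": "Y", "X": "root", "Y": "root"}
study1 = {"A": "X", "B": "X", "C": "X", "X": "root"}

add_tree(graph, "taxonomy", taxonomy)
add_tree(graph, "study1", study1)

print(parents(graph, "A"))  # both sources agree on a single parent
print(parents(graph, "C"))  # the two sources propose different parents
```

Where the sources agree, the result still looks like a tree; where they conflict, a node acquires more than one parent edge, which is why the combined structure is a graph rather than a tree.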

Neo4j approached the Open Tree of Life team to present a webinar because it is a project that utilizes the Neo4j graph database to represent the interconnectedness of biological data. The company considers the project a great example of how a graph database can better model the natural world.

The online lecture is intended for a broad audience including beginner computer programmers, advanced hackers, data scientists, natural scientists, and anyone interested in the intersection of science and technology, especially data modeling. Over 150 people have already registered online.

The registration form: LINK

Update: The video from this webinar is available on Vimeo.

Building an API for the Open Tree of Life database

Do you want an app for this?

The developers of the Open Tree of Life would like to know from the phylogenetic community what kind of information they want to extract from its database when the first draft is released later this year. With those preferences, it is possible to develop an API that gives scientists the opportunity to build their own websites or software packages that use the data.

An API (application programming interface) is a digital tool that allows one website or software program to “talk” to another website to dig up certain pieces of data. For instance, a lot of people use Tweetdeck to navigate the ongoing bombardment of messages in the Twittersphere. In that case, Tweetdeck is connecting to Twitter, through its API, to receive and order the messages according to the preferences of the user.

In the case of the Open Tree of Life, an API gives researchers advanced access to the data of about two million species, the phylogenies that have been created to illustrate possible relationships between them, and the underlying data and methods of synthesis. “For example, it will be possible to select smaller trees for specific species or find out how many studies there are for a particular node within the database,” says Karen Cranston, the lead investigator of the project.
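The two requests mentioned in the quote can be pictured with a small local sketch. This is not the Open Tree API, which had not been designed at the time of writing: the function names `prune` and `count_studies`, the nested-tuple tree, and the node and study identifiers are all hypothetical stand-ins for whatever the real interface ends up offering.

```python
def prune(tree, keep):
    """Return the smaller tree induced by the tips in `keep`, or None if none remain."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kept = [sub for sub in (prune(child, keep) for child in tree) if sub is not None]
    if not kept:
        return None
    if len(kept) == 1:
        # A node left with a single child is collapsed into that child.
        return kept[0]
    return tuple(kept)


def count_studies(node_to_studies, node):
    """Return how many studies are recorded for a node (0 if the node is unknown)."""
    return len(node_to_studies.get(node, ()))


# Hypothetical tree: nested tuples, with strings as species.
tree = ((("a", "b"), "c"), ("d", "e"))

# Hypothetical index from node identifiers to the studies that include them.
node_to_studies = {
    "node_ab": ["study1", "study2"],
    "node_de": ["study3"],
}

print(prune(tree, {"a", "c", "e"}))
print(count_studies(node_to_studies, "node_ab"))
```

A real API would answer the same questions over the web, so that a website or software package could request them without holding the whole database locally.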