The current Open Tree of Life synthesis methods are based on the Tree Alignment Graphs described by Smith et al 2013. The examples presented in that paper used much simpler datasets than the dataset that is used for draft tree synthesis by the Open Tree of Life (which contains hundreds of original source trees and the entire OTT taxonomy with over 2.3 million terminal taxa). To accommodate the goals of synthesis, some modifications were made to the methods presented in Smith et al 2013. The current version of the draft tree (v2, which is presented at http://tree.opentreeoflife.org as of February 2015 and described in a preprint on bioRxiv), was built using these modified methods. The changes to synthesis that were introduced since Smith et al 2013 are not well-described elsewhere, so we present them below in this document.
We are continually testing and improving the methods we use to develop synthesis trees, and through this process we have recently discovered some methodological properties of the modified TAG procedures that are undesirable for our synthesis goals. We are making progress toward fixing them for the next version of the draft tree, and there are details at the end of this post.
General background on the Open Tree of Life project and the draft tree
The overall goal of OpenTree is to summarize what is known about phylogenetic relationships in a transparent manner with a clear connection to analyses and the published studies that support different clades. Comprehensive coverage of published phylogenetic statements is a very long term goal which would require work from a large community of biologists. The short-term goal for the supertree presented on the tree browser is to summarize a small set of well-curated inputs in a clear manner.
Background on Tree Alignment Graph methods
The current synthesis method uses a Tree Alignment Graph (TAG), described in Smith et al 2013. We have been using TAGs because:
- These graphs can provide a view on conflict and congruence among input trees.
- TAG-based are computationally tractable on the scale which the open tree of life project operates (2.3 million tips on the tree, and hundreds of input trees).
- TAG-based approaches provide a straightforward way to handle inputs in which tips of a tree are assigned to higher taxa (any taxon above the species level). It is fairly common for published phylogenies to have tips mapped at the genus level (or higher).
- When coupled with expert knowledge in the form of ranking of input trees, TAG methods can produce a sensible summary of our (rather limited) input trees. At this point in the project, our data store does not contain a large number of trees sufficiently curated* to be included in the supertree operations.
* Sufficiently curated = 1. tips mapped to taxa in the Open Tree Taxonomy; 2. rooted as described in the publication; 3. ingroup noted. Incorrect rootings and assignments of tips to taxa can introduce a lot of noise in the estimate, so we have opted for careful vetting of input trees rather scraping together every estimate available. We are hopeful that community involvement in the curation will get us to a point of having enough input trees to allow more traditional supertree approaches to work well, so that we can present multiple estimates of the tree of life.
Methods used to produce the v2 draft tree
The open tree of life project has been alternating between phases where we (1) add more trees to our set of curated input trees, and then (2) generate new versions of the “synthetic” draft tree of life. Thus far two versions of the tree have been publicly posted to http://tree.opentreeoflife.org. The process of generating a new public draft tree involves the creation and critical review of many unpublished draft trees in order to detect errors or problems with the process (which could be due to misspecified taxa in input trees, software bugs, etc.).
This process has led to a few modifications of the TAG procedure as it was described in the PLoS Comp. Bio. paper. These modifications have been made to our treemachine software, and they include:
- In the original paper, conflict was assessed by whether there was conflicting overlap among the descendant taxa of the nodes, not the edes. The software that produced the v2 tree assessed conflict between edges of the graph by looking for conflict based on the taxon sets contributed by each tree. This change is referred to as the “relationship taxa” rule in this issue on GitHub).
- The supertree operation moves from root to tips, and occasionally a species attaches to a node via a series of low ranking relationships. When all of these are rejected (due to conflict with higher ranking trees), the species would be absent in the full tree if we followed the original TAG description faithfully. Instead, the treemachine version for v2 tree reattached these taxa based on their taxonomy after sweeping over the full tree.
- The “Partially overlapping taxon sets” section of the paper described a procedure for eliminating order-dependence of the input trees. We have recently discovered a case in which the structure of a TAG built according to those procedures would differ depending on the input order of the trees. We have implemented a new procedure that pre-processes all the input trees, which removes this order-dependence (code for the new procedure can be accessed in the find-mrcas-when-creating-nodes branch of the treemachine repo on github).
- To increase the overlap between different input trees, an additional step was implemented in treemachine that mapped the tips of an input tree to deeper nodes in the taxonomy that they may have represented. This was done by determining the most inclusive taxon that a tip could belong to without including any other tips in the tree, and then mapping the tip to that taxon instead of the taxon actually specified for the tip in the input tree itself. For example if the only primate in a tree was Homo sapiens, but the tree contained other mammals from the taxon sister to Primates (in the taxonomy), then the Homo sapiens tip would be assigned to the taxon Primates.
Undesirable properties of the procedures used to produce v2
- It was possible for edges to exist in the draft tree that were not supported by any of the input trees. There were a very small number (111) of such groups in the v2 tree; this GitHub issue discusses the issue more thoroughly. This is not an unusual property for a supertree method to have – in fact most supertree methods can produce such groups. And under some definitions of support (e.g. induced triples) these groupings would probably have had support in our input trees. However, not being able to link every branch in the supertree to an branch in at least one supporting branch in an input tree made the draft tree more difficult to understand. We are working on modifications to the procedure that do not produce these groupings.
- There were 22 taxonomic groupings mislabeled in the supertree (see issue 154 for details) and the definition of support used to indicate when an input tree “supported” a particular edge in the synthesis could be counterintuitive in some cases. The current view of the tree reports an input tree in the “supported by” panel if the branch in the draft tree passes along an edge that is parallel to an edge contributed by that input tree. Because some of the included taxa may have been culled from the group and reattached in a position closer to the root, the input tree can be in conflict with a grouping but still be listed as supporting it (see issues 155 and 157).
The draft tree contains over 2 million tips and many hundreds of thousands of internal edges. Thus, the undesirable properties mentioned above affected less than 0.0001% of the draft tree v2. Nonetheless, we are in the process of developing fixes for these problems, which should further improve the interpretability as well as the biological accuracy of future versions.
Biology + Technology = OTOL
One of the developers of the Open Tree of Life demonstrates Thursday, during a free webinar, how graph databases are used to construct a tree of life. The lecture is organized by Neo Technology, which is the maker of Neo4j, an open-source database that is used for OTOL.
Stephen Smith, an ecology and evolutionary biology professor at the University of Michigan, is going to explain how Neo4j and other digital technologies are assisting in constructing the tree of life. Starting at 10:00 PDT (19:00 CEST), he will also discuss other aspects of the interface of biology with next generation technologies.
“Our project is building the tools with which scientists in the community can continually improve the tree of life as we gather new information. Neo4j allows us to not only store trees in their native graph form, but also allows us to map trees to the same structure, the graph. So in fact, we are facilitating the construction of the graph of life,” says Smith.
Neo4j approached the Open Tree of Life team to present a webinar because it is a project that utilizes the Neo4j graph database to represent the interconnectedness of biological data. The company considers the project a great example of how a graph database can better model the natural world.
The online lecture is intended for a broad audience including beginner computer programmers, advanced hackers, data scientists, natural scientists, and anyone interested in the cross-section of science and technology, especially data modeling. Over 150 people have already registered online.
The registration form: LINK
Update: The video from this webinar is available on vimeo: http://vimeo.com/67870035
Do you want an app for this?
The developers of the Open Tree of Life would like to know from the phylogenetic community what kind of information they want to extract from its database when the first draft is released later this year. With those preferences, it is possible to develop an API that gives scientists the opportunity to build their own websites or software packages that use the data.
An API (application programming interface) is a digital tool that allows one website or software program to “talk” to another website to dig up certain pieces of data. For instance, a lot of people use Tweetdeck to navigate the ongoing bombardment of messages in the Twittersphere. In that case, Tweetdeck is connecting to Twitter, through its API, to receive and order the messages according to the preferences of the user.
In case of the Open Tree of Life, an API gives researchers advanced access to the data of about two million species, the phylogenies that have been created to illustrate possible relationships between them, and the underlying data and methods of synthesis. “For example, it will be possible to select smaller trees for specific species or find out how many studies there are for a particular node within the database,” says Karen Cranston, the lead investigator of the project. (more…)