The current Open Tree of Life synthesis methods are based on the Tree Alignment Graphs described by Smith et al 2013. The examples presented in that paper used much simpler datasets than the one used for draft tree synthesis by the Open Tree of Life (which contains hundreds of original source trees and the entire OTT taxonomy, with over 2.3 million terminal taxa). To accommodate the goals of synthesis, some modifications were made to the methods presented in Smith et al 2013. The current version of the draft tree (v2, which is presented at http://tree.opentreeoflife.org as of February 2015 and described in a preprint on bioRxiv) was built using these modified methods. The changes to synthesis introduced since Smith et al 2013 are not well described elsewhere, so we present them in this document.
We are continually testing and improving the methods we use to develop synthesis trees, and through this process we have recently discovered some methodological properties of the modified TAG procedures that are undesirable for our synthesis goals. We are making progress toward fixing them for the next version of the draft tree, and there are details at the end of this post.
General background on the Open Tree of Life project and the draft tree
The overall goal of OpenTree is to summarize what is known about phylogenetic relationships in a transparent manner with a clear connection to analyses and the published studies that support different clades. Comprehensive coverage of published phylogenetic statements is a very long term goal which would require work from a large community of biologists. The short-term goal for the supertree presented on the tree browser is to summarize a small set of well-curated inputs in a clear manner.
Background on Tree Alignment Graph methods
The current synthesis method uses a Tree Alignment Graph (TAG), described in Smith et al 2013. We have been using TAGs because:
- These graphs can provide a view on conflict and congruence among input trees.
- TAG-based approaches are computationally tractable at the scale at which the Open Tree of Life project operates (2.3 million tips on the tree, and hundreds of input trees).
- TAG-based approaches provide a straightforward way to handle inputs in which tips of a tree are assigned to higher taxa (any taxon above the species level). It is fairly common for published phylogenies to have tips mapped at the genus level (or higher).
- When coupled with expert knowledge in the form of ranking of input trees, TAG methods can produce a sensible summary of our (rather limited) input trees. At this point in the project, our data store does not contain a large number of trees sufficiently curated* to be included in the supertree operations.
* Sufficiently curated = 1. tips mapped to taxa in the Open Tree Taxonomy; 2. rooted as described in the publication; 3. ingroup noted. Incorrect rootings and assignments of tips to taxa can introduce a lot of noise in the estimate, so we have opted for careful vetting of input trees rather than scraping together every estimate available. We are hopeful that community involvement in the curation will get us to a point of having enough input trees to allow more traditional supertree approaches to work well, so that we can present multiple estimates of the tree of life.
Methods used to produce the v2 draft tree
The open tree of life project has been alternating between phases where we (1) add more trees to our set of curated input trees, and then (2) generate new versions of the “synthetic” draft tree of life. Thus far two versions of the tree have been publicly posted to http://tree.opentreeoflife.org. The process of generating a new public draft tree involves the creation and critical review of many unpublished draft trees in order to detect errors or problems with the process (which could be due to misspecified taxa in input trees, software bugs, etc.).
This process has led to a few modifications of the TAG procedure as it was described in the PLoS Comp. Bio. paper. These modifications have been made to our treemachine software, and they include:
- In the original paper, conflict was assessed by whether there was conflicting overlap among the descendant taxa of the nodes, not the edges. The software that produced the v2 tree assessed conflict between edges of the graph by looking for conflict based on the taxon sets contributed by each tree. This change is referred to as the “relationship taxa” rule in this issue on GitHub.
- The supertree operation moves from root to tips, and occasionally a species attaches to a node via a series of low-ranking relationships. When all of these are rejected (due to conflict with higher-ranking trees), the species would be absent from the full tree if we followed the original TAG description faithfully. Instead, the treemachine version used for the v2 tree reattached these taxa based on their taxonomy after sweeping over the full tree.
- The “Partially overlapping taxon sets” section of the paper described a procedure for eliminating order-dependence of the input trees. We have recently discovered a case in which the structure of a TAG built according to those procedures would differ depending on the input order of the trees. We have implemented a new procedure that pre-processes all the input trees, which removes this order-dependence (code for the new procedure can be accessed in the find-mrcas-when-creating-nodes branch of the treemachine repo on GitHub).
- To increase the overlap between different input trees, an additional step was implemented in treemachine that mapped the tips of an input tree to deeper nodes in the taxonomy that they may have represented. This was done by determining the most inclusive taxon that a tip could belong to without including any other tips in the tree, and then mapping the tip to that taxon instead of the taxon actually specified for the tip in the input tree itself. For example if the only primate in a tree was Homo sapiens, but the tree contained other mammals from the taxon sister to Primates (in the taxonomy), then the Homo sapiens tip would be assigned to the taxon Primates.
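The tip-mapping step in the last item above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the treemachine implementation: the taxonomy is a toy dictionary, all tree tips are assumed to be terminal taxa, and the names `descendant_tips` and `map_tips_to_deepest_taxa` are hypothetical.

```python
def descendant_tips(taxon, children):
    """Return the set of terminal taxa below `taxon` ({taxon} if it is terminal)."""
    kids = children.get(taxon, [])
    if not kids:
        return {taxon}
    tips = set()
    for kid in kids:
        tips |= descendant_tips(kid, children)
    return tips


def map_tips_to_deepest_taxa(tree_tips, parent, children):
    """Map each tip to the most inclusive taxon that contains no other tip of the tree.

    Assumption: every tip is a terminal taxon in the taxonomy.
    """
    tree_tips = set(tree_tips)
    mapping = {}
    for tip in tree_tips:
        others = tree_tips - {tip}
        current = tip
        # Climb toward the root until the next ancestor would swallow another tip.
        while current in parent:
            candidate = parent[current]
            if descendant_tips(candidate, children) & others:
                break
            current = candidate
        mapping[tip] = current
    return mapping


# Toy taxonomy (hypothetical, heavily simplified).
children = {
    "Mammalia": ["Primates", "Rodentia"],
    "Primates": ["Hominidae", "Cercopithecidae"],
    "Hominidae": ["Homo"],
    "Homo": ["Homo sapiens"],
    "Cercopithecidae": ["Macaca mulatta"],
    "Rodentia": ["Mus", "Rattus"],
    "Mus": ["Mus musculus"],
    "Rattus": ["Rattus norvegicus"],
}
parent = {kid: taxon for taxon, kids in children.items() for kid in kids}

mapping = map_tips_to_deepest_taxa(
    ["Homo sapiens", "Mus musculus", "Rattus norvegicus"], parent, children
)
for tip in sorted(mapping):
    print(tip, "->", mapping[tip])
```

In this toy case the only primate, Homo sapiens, is mapped to Primates, while each rodent stays at its genus because moving up to Rodentia would include the other rodent tip.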
Undesirable properties of the procedures used to produce v2
- It was possible for edges to exist in the draft tree that were not supported by any of the input trees. There were a very small number (111) of such groups in the v2 tree; this GitHub issue discusses the problem more thoroughly. This is not an unusual property for a supertree method to have; in fact, most supertree methods can produce such groups. And under some definitions of support (e.g. induced triples) these groupings would probably have had support in our input trees. However, not being able to link every branch in the supertree to a branch in at least one supporting input tree made the draft tree more difficult to understand. We are working on modifications to the procedure that do not produce these groupings.
- There were 22 taxonomic groupings mislabeled in the supertree (see issue 154 for details) and the definition of support used to indicate when an input tree “supported” a particular edge in the synthesis could be counterintuitive in some cases. The current view of the tree reports an input tree in the “supported by” panel if the branch in the draft tree passes along an edge that is parallel to an edge contributed by that input tree. Because some of the included taxa may have been culled from the group and reattached in a position closer to the root, the input tree can be in conflict with a grouping but still be listed as supporting it (see issues 155 and 157).
The draft tree contains over 2 million tips and many hundreds of thousands of internal edges. Thus, the undesirable properties mentioned above affected less than 0.01% of the draft tree v2. Nonetheless, we are in the process of developing fixes for these problems, which should further improve the interpretability as well as the biological accuracy of future versions.
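To make the idea of an "unsupported" grouping concrete, here is a minimal Python sketch of one possible support check. The definition used is an assumption chosen for illustration, not the one treemachine or the tree browser uses: a grouping in the synthetic tree counts as supported by an input tree if, once restricted to that tree's tips, it matches one of the tree's own informative clades. The function names and the toy clades are hypothetical, and tips mapped to higher taxa are ignored.

```python
def is_supported(synth_clade, input_tips, input_clades):
    """True if `synth_clade`, restricted to the input tree's tips, is a clade of that tree.

    Assumed definition of support, for illustration only.
    """
    tips = frozenset(input_tips)
    restricted = frozenset(synth_clade) & tips
    # Fewer than two tips, or all tips, says nothing about relationships.
    if len(restricted) < 2 or restricted == tips:
        return False
    return restricted in input_clades


def unsupported_clades(synth_clades, input_trees):
    """Return the synthetic groupings that no input tree supports."""
    return [
        clade
        for clade in synth_clades
        if not any(is_supported(clade, tips, clades) for tips, clades in input_trees)
    ]


# Two toy input trees, each given as (tip set, set of internal clades):
# ((A,B),C) and ((C,D),E)
input_trees = [
    ({"A", "B", "C"}, {frozenset({"A", "B"})}),
    ({"C", "D", "E"}, {frozenset({"C", "D"})}),
]

# Groupings found in a hypothetical synthetic tree (((A,B),C),(D,E)).
synth_clades = [
    frozenset({"A", "B"}),
    frozenset({"D", "E"}),
    frozenset({"A", "B", "C"}),
]

missing = unsupported_clades(synth_clades, input_trees)
print(len(missing), "grouping(s) without a supporting input tree")
```

Different definitions of support (for example, induced triples) would classify some groupings differently, which is part of why the choice of definition matters for how the "supported by" panel is read.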
We’ve just posted a preprint on bioRxiv of our submitted manuscript on how we are combining taxonomy and phylogeny into a comprehensive tree of life:
You can browse the complete tree at http://tree.opentreeoflife.org
Comments welcome (either here or on bioRxiv). Note that the authorship list is woefully incomplete, because bioRxiv only allows 20 authors in the submission process. Here is the complete list:
Stephen A. Smith, Karen A. Cranston, James F. Allman, Joseph W. Brown, Gordon Burleigh, Ruchi Chaudhary, Lyndon M. Coghill, Keith A. Crandall, Jiabin Deng, Bryan T. Drew, Romina Gazis, Karl Gude, David S. Hibbett, Cody Hinchliff, Laura A. Katz, H. Dail Laughinghouse IV, Emily Jane McTavish, Christopher L. Owen, Richard Ree, Jonathan A. Rees, Douglas E. Soltis, Tiffani Williams
With the soft release of the v 1.0 of the Open Tree of Life (see Karen Cranston’s Evolution talk for details) we also have methods for accessing the data:
* a not-very-pretty but functional page to download the entire 2.5 million tip tree as Newick
* API access to subtrees and source trees as well as taxon name services
* clone the GitHub repository of all input trees
A few folks have started to think about ways to interact with the very large newick file, specifically extracting subtrees. Yan Wong posted a perl solution a few weeks ago:
Michael Elliot has a C++ package called Gulo which seems to be very efficient (see comments on the post):
Thrilled to see people working with the data! I note that, despite having APIs to return a subtree or a pruned subtree, downloading all of the data and working with it remotely is still an easy and flexible option for many users. We will continue to make our datasets available, and that download page should have more options and tree metrics soon!
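For readers who want to try the download-and-extract route, here is a minimal Python sketch of pulling a labeled clade out of a Newick string. It is not a port of either tool mentioned above. It assumes the tree is already in memory as a string, that internal nodes carry labels, and that labels are unquoted and contain no parentheses, commas, colons or semicolons. The function name `extract_subtree` and the example tree are hypothetical.

```python
def extract_subtree(newick, target):
    """Return the Newick text of the clade whose internal node is labeled `target`.

    Scans the string once with a stack of open parentheses, so it does not
    recurse and can cope with very deep trees. Returns None if no internal
    node has that label. Assumes unquoted labels.
    """
    open_positions = []  # indices of '(' that have not been closed yet
    i, n = 0, len(newick)
    while i < n:
        ch = newick[i]
        if ch == "(":
            open_positions.append(i)
            i += 1
        elif ch == ")":
            start = open_positions.pop()
            # The internal node label (if any) runs up to the next delimiter.
            j = i + 1
            while j < n and newick[j] not in "(),:;":
                j += 1
            if newick[i + 1:j] == target:
                return newick[start:j] + ";"
            i = j
        else:
            i += 1
    return None


tree = (
    "((Homo_sapiens:1.0,Pan_troglodytes:1.0)Hominini:2.0,"
    "(Mus_musculus:1.5,Rattus_norvegicus:1.5)Murinae:1.5)Euarchontoglires;"
)
subtree = extract_subtree(tree, "Murinae")
print(subtree)
```

A single pass over the string keeps memory use close to the size of the file itself, which matters for a tree with millions of tips. Real files may contain quoted labels, which this sketch does not handle.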
Full call for participation and link to application: http://bit.ly/1ioPPMc
A global “tree of life” will transform biological research in a broad range of disciplines from ecology to bioengineering. To help facilitate that transformation, the OpenTree <http://opentreeoflife.org> project now provides online access to >4000 published phylogenies, and a newly generated tree covering more than 2.5 million species.
The next step is to build tools to enable the community to use these resources. To meet this aim, the OpenTree <http://www.opentreeoflife.org/>, Arbor <http://www.arborworkflows.com/> and NESCent’s HIP <http://www.evoio.org/wiki/HIP> working groups are staging a week-long hackathon September 15 to 19 at U. Michigan, Ann Arbor. Participants in this “Tree-for-all” will work in small teams to develop tools that use OpenTree’s web services to extract, annotate, or add data in ways useful to the community. Teams also may focus on testing, expanding and documenting the web services.
If you can imagine using these resources, and you have the skills to work collaboratively to turn those ideas into products (as a coder, or working side-by-side with coders), we invite you to apply for the hackathon. The full call for participation (http://bit.ly/1ioPPMc) provides instructions for how to apply, and how to share your ideas with potential teammates (strongly encouraged prior to applying). Applications are due July 8th. Travel support is provided. Women and underrepresented minorities are especially encouraged to apply.
If you have questions, contact Karen Cranston (firstname.lastname@example.org, @kcranstn, OpenTree), Arlin Stoltzfus (email@example.com, HIP), Julie Allen (firstname.lastname@example.org, HIP), or Luke Harmon (email@example.com, Arbor).
 http://www.evoio.org/wiki/HIP (Hackathons, Interoperability, Phylogenies)
NSF has released GoLife, the new solicitation that replaces both AToL and AVAToL. From the GoLife text:
The goals of the Genealogy of Life (GoLife) program are to resolve the phylogenetic history of life and to integrate this genealogical architecture with underlying organismal data.
Data completeness, open data and data integration are key components of these proposals – inferring well-sampled trees that are linked with other types of data (molecular, morphological, ecological, spatial, etc) and made easily available to scientific and non-scientific users. The solicitation requires that trees published by GoLife projects are published in a way that allows them to be understood and re-used by Open Tree of Life and other projects:
Integration and standardization of data consistent with three AVAToL projects: Open Tree of Life (www.opentreeoflife.org), ARBOR (www.arborworkflows.com), and Next Generation Phenomics (www.avatol.org/ngp) is required. Other data should be made available through broadly accessible community efforts (i.e., specimen data through iDigBio, occurrence data through BISON, etc).
What does it mean to publish data consistent with Open Tree of Life? We have a short page on data sharing with OpenTree, and a publication in PLOS Currents Tree of Life with best practices for sharing phylogenetic data. Our phylogeny curation application allows you to upload and annotate phylogenies consistent with OpenTree synthesis, and you can quickly import trees from TreeBASE.
GBIF (Global Biodiversity Information Facility) recently issued a request for comment on its data licensing policy. While Open Tree of Life does not currently use specimen data, we do use the GBIF classification to help resolve names and also as part of the OpenTree backbone. Jonathan Rees, Karen Cranston, Todd Vision and Hilmar Lapp wrote a response recommending a CC0 waiver for all GBIF data. Here is our summary, and a link to the full response on Figshare.
As a data aggregator, the goal of GBIF should be to find policies that benefit both its data providers and data reusers. Clearly, a GBIF that has no or few data will have little value, but so will a GBIF full of data that is encumbered with restrictions to an extent that stifles reuse. Our response follows from the proposition that promoting data reuse should be a shared interest of all the parties: data providers, data users, and GBIF itself. We feel the consultation document missed the opportunity to recognize this shared interest, and that furthering the goal of data reuse should in fact be a primary yardstick by which different licensing options are measured.
Tracking the reuse of data is a critically important goal, as it provides a means of reward to data providers, allows scrutiny of derived results, and enables discovery of related research. Initiatives such as DataCite have made considerable progress in recent years in enabling tracking of data reuse by addressing the sociotechnical obstacles to it. By contrast, the consultation, in our view, puts undue weight on legal requirements for attribution. Legal instruments such as licenses are unsuitable for, not designed for, and of little if any benefit for this purpose. Moreover, in most of the world, there is little to no formally recognized intellectual property protection for data, and it is on such protection that licenses rest.
In short, our recommendations are (1) that all data in GBIF be released under Creative Commons Zero (CC0), which is a public domain dedication that waives copyright rather than asserting it; (2) GBIF should set clear expectations in the form of community norms for how the data that it serves is to be referenced when reused, and (3) GBIF should work with partner organizations in promoting standards and technologies that enable the effective tracking of data reuse.
We note that our analysis is based on our understanding of the law; we are not legal professionals and this is not legal advice.