New version of tree, APIs
We have just released a new version of the synthetic tree, along with new versions of the APIs.
Tree
The biggest change in this version is that we have completely replaced the synthesis method used to produce the tree. We are still using neo4j to serve the tree, but have moved synthesis out of the graph database and into a make-based pipeline that uses a C++ library. This new method improves efficiency and reproducibility, and allows us to more clearly connect input sources with edges in the tree. In addition to support statements, the new pipeline also produces conflict statements about the inputs that do not support a given edge (we are working to get these displayed on the properties panel for each node).
You can view the new version here and read the release notes. We want to particularly highlight the self-documenting nature of the new method. Primary credit for the new method goes to Mark Holder and Ben Redelings – paper coming soon!
APIs
The largest change is the node IDs. We have previously mentioned issues with node stability. In this version of the tree & APIs, we replace the unstable neo4j node IDs with either Open Tree Taxonomy IDs (for taxon nodes) or mrca statements (for non-taxon nodes). These IDs will transfer (or fail gracefully) for new versions of the tree. We have also made input and output parameters more consistent across methods.
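As a rough illustration of what the two kinds of stable IDs mean for client code, here is a minimal Python sketch that tells them apart. The ID shapes used here (`ott123` for a taxon node, `mrcaott123ott456` for a node defined as the MRCA of two taxa) and the function name `classify_node_id` are assumptions made for this example; check the API docs for the authoritative format.

```python
import re

# Assumed ID shapes (not specified in this post; check the API docs):
#   taxon node:     "ott" + OTT taxon number, e.g. "ott123"
#   non-taxon node: "mrca" + two taxon IDs,   e.g. "mrcaott123ott456"
TAXON_ID = re.compile(r"ott(\d+)")
MRCA_ID = re.compile(r"mrcaott(\d+)ott(\d+)")


def classify_node_id(node_id):
    """Return ("taxon", [n]) or ("mrca", [a, b]); raise ValueError otherwise."""
    m = TAXON_ID.fullmatch(node_id)
    if m:
        return "taxon", [int(m.group(1))]
    m = MRCA_ID.fullmatch(node_id)
    if m:
        return "mrca", [int(m.group(1)), int(m.group(2))]
    # A bare integer would be an old-style neo4j ID, which is not stable.
    raise ValueError(f"not a stable node ID: {node_id!r}")
```

A client that stored bare integer neo4j IDs from an earlier version could use a check like this to spot values that need to be looked up again.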
We also make public the verbose subtree format that we use to build the tree browser – rather than simply a newick string, you can obtain the tree with all provenance information, including support and conflict.
All v2 methods should continue to work, but we plan to deprecate the v2 methods in June 2016.
Take a look at the API docs and release notes for more information.
FuturePhy clade workshops
OpenTree, FuturePhy and Arbor jointly held the first round of clade workshops in Gainesville at the end of February. There were three taxon-focused groups taking part, studying barnacles, beetles, and catfish – each with a very diverse set of participants. Expertise in the room included taxonomy, systematics, ecology, phylogenetic methods, bioinformatics, genomes, ontologies, and scientific illustration (to name a few). While each group had different goals for progress in understanding the biology of their taxon of interest, each group required a unified tree merging taxonomic and phylogenetic information for their clade. Sounds like a job for OpenTree!
In advance of the workshop, we created tree collections (ranked lists of published trees in OpenTree) for each clade, and completed the beta version of our new synthesis algorithm. While there was only limited curation of new studies in the lead up to the meeting, during the workshop participants imported more than 40 new published phylogenies into the OpenTree database and curated tree collections. Not only will this burst of skilled curation improve accuracy of the synthetic tree in the future, but we were also able to use our new rapid synthesis method to produce on-the-fly custom synthesis trees for each clade collection during the workshop. By reviewing these synthetic trees and updating the input trees and rankings, participant groups were able to simultaneously achieve a better understanding of the relationships in their clade of interest, and of the OpenTree synthesis procedure. These clade synthetic trees were an efficient and reproducible method for providing a unified view of taxon relationships, which could then be compared to publications and to expert-curated supertrees produced by grafting existing trees.
On the last day, we asked each group to list the top features they want from OpenTree. This list included:
- Better conflict visualization – between published input trees, the synthetic tree of life, and the Open Tree Taxonomy
- Ways to summarize / visualize the annotations file created along with the synthetic tree (this file includes information about sources that support & conflict with each edge).
- Provide a method for proposing new taxa for tips in a phylogeny that cannot be mapped to OTT (and for collecting supporting information about these new taxa)
- Include branch length and / or time information in the synthetic tree
- More fine-grained control over synthesis (be able to mask part of an input tree, suppress poorly-supported branches)
Many thanks to all of the participants, and in particular to Nico Cellinese and Rob Guralnick for local logistics, and to the University of Florida Informatics Institute for hosting. It was an enjoyable and productive meeting for the OpenTree crew, and hopefully for all the attendees!
Open Tree Taxonomy browser
Our three major outputs so far are the synthetic tree, the collection of well-curated input phylogenies (with a graphical interface to the underlying github repository) and the reference taxonomy. Up until now, there hasn’t been a simple way to browse the Open Tree Taxonomy (OTT). You could download the full reference taxonomy, or use the low-level scripting language in the source code, but it wasn’t easy to get an overview of the structure.
We have just released the first version of a browser for OTT. Each taxon page includes information about the input taxonomies that contain the taxon, synonyms, lineage, and children. Here is a sample page for Eukaryota:
To open the taxonomy browser, click on the OTT identifier from any node in the synthetic tree:
We hope this will make it easier to see how the taxonomy influences the synthetic tree. This is only an initial, rough, version of the browser – there is still much to do! The source code is in the opentree repository. If you have feedback or suggestions, please do create an issue or see the list of existing suggestions using the taxonomy label.
Publication of first draft of the tree of life
We are excited to publish the first draft of the Open Tree of Life in PNAS:
http://www.pnas.org/content/early/2015/09/16/1423041112.abstract
Scientists have used gene sequences and morphological data to construct tens of thousands of evolutionary trees that describe the evolutionary history of animals, plants, and microbes. This study is the first, to our knowledge, to apply an efficient and automated process for assembling published trees into a complete tree of life. This tree and the underlying data are available to browse and download from the Internet, facilitating subsequent analyses that require evolutionary trees. The tree can be easily updated with newly published data. Our analysis of coverage not only reveals gaps in sampling and naming biodiversity but also further demonstrates that most published phylogenies are not available in digital formats that can be summarized into a tree of life.
This is only a first draft, and there are plenty of places where the tree does not represent what we know about phylogenetic relationships. We can improve this tree through incorporation of new taxonomic and phylogenetic data. Our data store of trees (which contains many more trees than are included in the draft tree of life) is also a resource for other analyses. If you want to contribute a published tree for synthesis (or for analyses of coverage, conflict, etc), you can upload it through our curation interface.
Other pages and links:
- supplemental doc with details about methods
- Dryad data package where you can download the taxonomy and the tree
- infographics about the tree of life
- interactive tree browser
- roundup of news coverage
Many thanks to all of the people that provided data, discussion, review, curation, and code and of course to NSF Biology for funding this work!
Proposal for OpenTree node stability
Currently, OpenTree has two different types of node IDs. Taxonomy (OTT) IDs are assigned to named nodes when we construct a taxonomy release, and phylogenetic node IDs are assigned by the treemachine neo4j graph database for nodes that do not align to an OTT ID (i.e. nodes added due to phylogenetic resolution). The OTT IDs are fairly stable over time, but the neo4j node IDs are definitely not stable, and the same neo4j ID may point to a completely unrelated node in future versions of the graph.
This system is problematic because we expose both types of IDs in the APIs (and also in URLs for the tree browser). The lack of neo4j node stability therefore affects API calls that use nodeIDs, browser bookmarks to nodes in the synthetic tree, and feedback left by users about specific nodes in the tree (see feedback issue #63 and treemachine issue #183). The OTT IDs are problematic as well: it is not straightforward to document when we reuse an existing OTT ID, mint a new ID, or delete an existing ID, when going from one version of the taxonomy to the next.
At our recent face-to-face meeting, we discussed a proposal for a node identifier registry and are looking for feedback. We don’t intend this system to be a universally-used set of node definitions (i.e. we aren’t trying to make a PhyloCode registry). We want a lightweight system that prevents exposure of unstable nodeIDs through the APIs to clients (including our own web application) and provides some measure of predictability. Feedback on this proposal would be greatly appreciated.
Requirements
- be able to use the same node ID definitions across OTT and the synthetic tree
- transparency about when we re-use a nodeID from a previous version of tree or taxonomy (or not)
- users get an error when using a node ID from a previous version where there is no current node that fits that definition
- fixing errors (such as moving a snail found in a worm taxon to its proper location) should not involve massive numbers of ID changes
- generation of node definitions based on a given taxonomy must be automated and efficient
- application of node definitions to an existing tree / taxonomy must be automated and efficient
Proposal
Develop a lightweight registry of node definitions based on the structure of the OpenTree taxonomy. For each new version of the taxonomy and synthetic tree, use the registry to decide when to re-use existing node IDs and when to register a new definition + ID.
Leaf nodes will be assigned IDs during creation of OTT based on name (together with enough taxonomic context to separate homonyms).
The definition of the ID for a non-leaf node will include a list of IDs for nodes that are descendants of the intended clade, a list that are excluded from being descendants, and (optionally) a taxonomic name.
Definitions would never be deleted from the registry, although not all definitions will be used in any given tree / taxonomy.
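To make the proposal more concrete, here is a minimal Python sketch of what a registered definition and its application to a tree might look like. Everything in it is hypothetical (the `NodeDefinition` class, the `{parent: [children]}` tree representation, the function names); it illustrates the idea and is not an implementation we have settled on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeDefinition:
    node_id: str              # registry ID handed out to API clients
    includes: frozenset       # IDs that must be descendants of the node
    excludes: frozenset       # IDs that must not be descendants
    name: str = ""            # optional taxonomic name


def descendants(children, node):
    """All nodes strictly below `node` in a {parent: [children]} mapping."""
    found = set()
    stack = list(children.get(node, []))
    while stack:
        current = stack.pop()
        found.add(current)
        stack.extend(children.get(current, []))
    return found


def matching_nodes(children, definition):
    """Nodes of the tree that satisfy `definition`."""
    all_nodes = set(children) | {c for kids in children.values() for c in kids}
    matches = []
    for node in all_nodes:
        below = descendants(children, node)
        if definition.includes <= below and not (definition.excludes & below):
            matches.append(node)
    return sorted(matches)
```

With a sketch like this, an empty result corresponds to the requirement that users get an error for a definition with no current node, exactly one match means the ID can be re-used, and more than one match is the ambiguous case raised in the implementation questions.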
Implementation questions
- How many descendant and excluded nodes to include in the definitions: The definition needs some specificity but also can’t assume a complete list due to future addition of new species. Perhaps, for example, four descendants and three exclusions would be a decent compromise between one and thousands?
- How to choose the specific nodes in the lists of descendants and exclusions: Should be ‘popular’ (should occur in as many sources as possible) and informative (if T has children T1 and T2 then at least one definition descendant should be taken from T1, and at least one from T2). Excluded nodes should be ‘near misses’ rather than arbitrarily chosen.
- What to do when >1 node meets the definition: Add an option of adding constraints to the registered definition in order to remove the ambiguity while preserving the ID.
- What to do when >1 definition matches a node: Ambiguous assignments can be resolved either by the addition of constraints, or by the creation of new ids.
- Modification / versioning of definitions: If we add constraints to a definition (for example, to resolve ambiguity), does this mint a new ID or version the existing definition?
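One possible reading of the 'popular' and 'informative' criteria above, as a small Python sketch. The function name, the input dictionaries and the default of four descendants are all assumptions for illustration; choosing 'near miss' exclusions is not covered here.

```python
def pick_descendants(leaves_by_child, source_counts, limit=4):
    """Choose up to `limit` leaf IDs to list as descendants in a definition.

    leaves_by_child: {child ID: [leaf IDs below that child]}
    source_counts:   {leaf ID: number of sources containing the leaf}
    """
    def popularity(leaf):
        # Most popular first; ties broken by ID so the result is reproducible.
        return (-source_counts.get(leaf, 0), leaf)

    # Informative: start with the most popular leaf under each child.
    chosen = [min(leaves, key=popularity)
              for _, leaves in sorted(leaves_by_child.items()) if leaves]
    # Popular: fill any remaining slots from all other leaves.
    rest = sorted(
        (leaf for leaves in leaves_by_child.values() for leaf in leaves
         if leaf not in chosen),
        key=popularity,
    )
    # If there are more children than slots, only the first `limit`
    # children (in ID order) are represented.
    return (chosen + rest)[:limit]
```

Breaking ties by ID is a deliberate choice in this sketch: the same taxonomy should always generate the same definition, which matters for the requirement that generation be automated.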
Workshop: Barriers to assembling phylogeny and data layers across the tree of life
The challenges to completing the Tree of Life and integrating data layers (NSF GoLife goals) are huge and vary across clades. Some groups have a nearly-complete tree but lack publicly available data layers, whereas other groups lack phylogenetic resolution or the resources to support tree / data integration. Partnering with Open Tree of Life and Arbor Workflows, FuturePhy will support a series of clade-based workshops to identify and solve specific challenges in tree of life synthesis and data layer integration.
RFP: 2 page proposals to fund small workshops and/or hackathons on completing the tree of life and integrating data layers for specific clades.
Proposal deadline: Nov. 1, 2015
Meeting dates: Feb 26-28, 2016 (changed from Feb 20-23) *note changed dates!*
Location: Gainesville, University of Florida
Participants per workshop: 10 maximum funded (virtual attendees possible)
Contacts: mwestneat@uchicago.edu (FuturePhy), karen.cranston@gmail.com (OpenTree), lukejharmon@gmail.com (Arbor)
The full call for participation and a link to a proposal template is available at the FuturePhy website.
Have questions about this or future workshops? Attend our webinar Thursday, September 17 at 1 pm EDT. See details on how to connect.
FuturePhy
This is the first in a series of posts about several phylogeny initiatives newly-funded by NSF focused on both technical and community aspects of phylogeny. There is plenty of potential for mutually beneficial work with OpenTree, and we are excited to help.
First up… FuturePhy!
FuturePhy is an NSF-sponsored, three-year program of conferences, workshops and hackathons on the Tree of Life. The project aims to promote novel, integrative data analyses and visualization, interdisciplinary syntheses of phylogenetic sciences, and cross-cutting uses of phylogenetics to develop and address new research questions and applications.
The first phase of this mission is critical: to bring together a broad community of people from diverse backgrounds who are active in phylogenetics research, who use the tree of life in research or education, who will benefit in applied or practical ways from a comprehensive tree of life, or who come from a background that offers new perspectives on defining, addressing or transcending key challenges in phylogenetics.
Help accelerate progress in all aspects of phylogenetics research by joining FuturePhy today. Diverse opportunities will be available to attend FuturePhy sessions in person or virtually, and to link FuturePhy to existing projects and initiatives.
We invite you to participate in the project in several ways:
- Register on futurephy.org. Scientists from all aspects of the phylogenetic sciences, educators, members of the tree-using community, and others interested in phylogenetics are welcome.
- Take the community survey and let FuturePhy know what workshop and hackathon topics they should fund.
- Contribute to the discussion forum on futurephy.org. This is the best way to log your interest and contribute ideas.
- Send email to contact@futurephy.org with ideas or comments
- Tweet to the FuturePhy community: @FuturePhy
- Comment in the FuturePhy phylobabble thread
Update on synthesis methods
The current Open Tree of Life synthesis methods are based on the Tree Alignment Graphs described by Smith et al 2013. The examples presented in that paper used much simpler datasets than the dataset that is used for draft tree synthesis by the Open Tree of Life (which contains hundreds of original source trees and the entire OTT taxonomy with over 2.3 million terminal taxa). To accommodate the goals of synthesis, some modifications were made to the methods presented in Smith et al 2013. The current version of the draft tree (v2, which is presented at http://tree.opentreeoflife.org as of February 2015 and described in a preprint on bioRxiv), was built using these modified methods. The changes to synthesis that were introduced since Smith et al 2013 are not well-described elsewhere, so we present them below in this document.
We are continually testing and improving the methods we use to develop synthesis trees, and through this process we have recently discovered some methodological properties of the modified TAG procedures that are undesirable for our synthesis goals. We are making progress toward fixing them for the next version of the draft tree, and there are details at the end of this post.
General background on the Open Tree of Life project and the draft tree
The overall goal of OpenTree is to summarize what is known about phylogenetic relationships in a transparent manner with a clear connection to analyses and the published studies that support different clades. Comprehensive coverage of published phylogenetic statements is a very long term goal which would require work from a large community of biologists. The short-term goal for the supertree presented on the tree browser is to summarize a small set of well-curated inputs in a clear manner.
Background on Tree Alignment Graph methods
The current synthesis method uses a Tree Alignment Graph (TAG), described in Smith et al 2013. We have been using TAGs because:
- These graphs can provide a view on conflict and congruence among input trees.
- TAG-based approaches are computationally tractable at the scale at which the Open Tree of Life project operates (2.3 million tips on the tree, and hundreds of input trees).
- TAG-based approaches provide a straightforward way to handle inputs in which tips of a tree are assigned to higher taxa (any taxon above the species level). It is fairly common for published phylogenies to have tips mapped at the genus level (or higher).
- When coupled with expert knowledge in the form of ranking of input trees, TAG methods can produce a sensible summary of our (rather limited) input trees. At this point in the project, our data store does not contain a large number of trees sufficiently curated* to be included in the supertree operations.
* Sufficiently curated = 1. tips mapped to taxa in the Open Tree Taxonomy; 2. rooted as described in the publication; 3. ingroup noted. Incorrect rootings and assignments of tips to taxa can introduce a lot of noise in the estimate, so we have opted for careful vetting of input trees rather than scraping together every estimate available. We are hopeful that community involvement in the curation will get us to a point of having enough input trees to allow more traditional supertree approaches to work well, so that we can present multiple estimates of the tree of life.
Methods used to produce the v2 draft tree
The open tree of life project has been alternating between phases where we (1) add more trees to our set of curated input trees, and then (2) generate new versions of the “synthetic” draft tree of life. Thus far two versions of the tree have been publicly posted to http://tree.opentreeoflife.org. The process of generating a new public draft tree involves the creation and critical review of many unpublished draft trees in order to detect errors or problems with the process (which could be due to misspecified taxa in input trees, software bugs, etc.).
This process has led to a few modifications of the TAG procedure as it was described in the PLoS Comp. Bio. paper. These modifications have been made to our treemachine software, and they include:
- In the original paper, conflict was assessed by whether there was conflicting overlap among the descendant taxa of the nodes, not the edges. The software that produced the v2 tree assessed conflict between edges of the graph by looking for conflict based on the taxon sets contributed by each tree. This change is referred to as the “relationship taxa” rule in this issue on GitHub.
- The supertree operation moves from root to tips, and occasionally a species attaches to a node via a series of low ranking relationships. When all of these are rejected (due to conflict with higher ranking trees), the species would be absent in the full tree if we followed the original TAG description faithfully. Instead, the treemachine version for v2 tree reattached these taxa based on their taxonomy after sweeping over the full tree.
- The “Partially overlapping taxon sets” section of the paper described a procedure for eliminating order-dependence of the input trees. We have recently discovered a case in which the structure of a TAG built according to those procedures would differ depending on the input order of the trees. We have implemented a new procedure that pre-processes all the input trees, which removes this order-dependence (code for the new procedure can be accessed in the find-mrcas-when-creating-nodes branch of the treemachine repo on github).
- To increase the overlap between different input trees, an additional step was implemented in treemachine that mapped the tips of an input tree to deeper nodes in the taxonomy that they may have represented. This was done by determining the most inclusive taxon that a tip could belong to without including any other tips in the tree, and then mapping the tip to that taxon instead of the taxon actually specified for the tip in the input tree itself. For example if the only primate in a tree was Homo sapiens, but the tree contained other mammals from the taxon sister to Primates (in the taxonomy), then the Homo sapiens tip would be assigned to the taxon Primates.
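The tip-remapping step in the last item can be sketched in a few lines of Python. This is an illustration of the rule as described here, not the treemachine code: the `{taxon: parent}` taxonomy representation and the function names are assumptions, and the toy taxonomy in the usage note is heavily simplified.

```python
def lineage(parent, taxon):
    """The taxon followed by its ancestors, up to the taxonomy root."""
    path = [taxon]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def remap_tips(parent, tips):
    """Map each tip to the most inclusive taxon containing no other tip.

    parent: {taxon: parent taxon} for the taxonomy
    tips:   taxa that the tips of one input tree are mapped to
    """
    mapping = {}
    for tip in tips:
        # Every taxon that contains at least one of the other tips.
        blocked = set()
        for other in tips:
            if other != tip:
                blocked.update(lineage(parent, other))
        # Climb from the tip until the next ancestor would take in another tip.
        target = tip
        for ancestor in lineage(parent, tip)[1:]:
            if ancestor in blocked:
                break
            target = ancestor
        mapping[tip] = target
    return mapping
```

For a toy taxonomy in which Mammalia contains only Primates and Rodentia, an input tree with the tips Homo sapiens, Mus musculus and Rattus norvegicus would have Homo sapiens remapped to Primates, while the two rodents can only move up to Mus and Rattus because Rodentia contains both of them.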
Undesirable properties of the procedures used to produce v2
- It was possible for edges to exist in the draft tree that were not supported by any of the input trees. There were a very small number (111) of such groups in the v2 tree; this GitHub issue discusses the issue more thoroughly. This is not an unusual property for a supertree method to have – in fact most supertree methods can produce such groups. And under some definitions of support (e.g. induced triples) these groupings would probably have had support in our input trees. However, not being able to link every branch in the supertree to at least one supporting branch in an input tree made the draft tree more difficult to understand. We are working on modifications to the procedure that do not produce these groupings.
- There were 22 taxonomic groupings mislabeled in the supertree (see issue 154 for details) and the definition of support used to indicate when an input tree “supported” a particular edge in the synthesis could be counterintuitive in some cases. The current view of the tree reports an input tree in the “supported by” panel if the branch in the draft tree passes along an edge that is parallel to an edge contributed by that input tree. Because some of the included taxa may have been culled from the group and reattached in a position closer to the root, the input tree can be in conflict with a grouping but still be listed as supporting it (see issues 155 and 157).
The draft tree contains over 2 million tips and many hundreds of thousands of internal edges. Thus, the undesirable properties mentioned above affected less than 0.0001% of the draft tree v2. Nonetheless, we are in the process of developing fixes for these problems, which should further improve the interpretability as well as the biological accuracy of future versions.
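To show what "linking every branch to a supporting branch" can mean in practice, here is a Python sketch of one deliberately simple definition of support: an input tree supports a grouping in the synthetic tree if the grouping, restricted to the taxa sampled in that input tree, is itself a grouping in the input tree. This definition and the function names are assumptions for illustration; it is not the definition used by treemachine, and as noted above other definitions (such as induced triples) would give different answers.

```python
def supported_by(synth_group, input_groups, input_taxa):
    """True if an input tree displays `synth_group` restricted to its taxa."""
    restricted = frozenset(synth_group) & frozenset(input_taxa)
    # A restriction with fewer than two taxa, or with all of the input
    # tree's taxa, says nothing about relationships.
    if len(restricted) < 2 or restricted == frozenset(input_taxa):
        return False
    return restricted in {frozenset(g) for g in input_groups}


def unsupported_groups(synth_groups, input_trees):
    """Synthetic-tree groups that no input tree supports.

    input_trees: list of (groups, taxa) pairs, one per input tree
    """
    return [
        group for group in synth_groups
        if not any(supported_by(group, groups, taxa)
                   for groups, taxa in input_trees)
    ]
```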
Preprint: Synthesizing phylogeny and taxonomy into a comprehensive tree of life
We’ve just posted a preprint on bioRxiv of our submitted manuscript on how we are combining taxonomy and phylogeny into a comprehensive tree of life:
http://www.biorxiv.org/content/early/2014/12/05/012260
You can browse the complete tree at http://tree.opentreeoflife.org
Comments welcome (either here or on bioRxiv). Note that the authorship list is woefully incomplete – bioRxiv only allows 20 authors in the submission process. Here is the complete list:
Stephen A. Smith, Karen A. Cranston, James F. Allman, Joseph W. Brown, Gordon Burleigh, Ruchi Chaudhary, Lyndon M. Coghill, Keith A. Crandall, Jiabin Deng, Bryan T. Drew, Romina Gazis, Karl Gude, David S. Hibbett, Cody Hinchliff, Laura A. Katz, H. Dail Laughinghouse IV, Emily Jane McTavish, Christopher L. Owen, Richard Ree, Jonathan A. Rees, Douglas E. Soltis, Tiffani Williams
Accessing OpenTree data
With the soft release of v1.0 of the Open Tree of Life (see Karen Cranston’s Evolution talk for details) we also have methods for accessing the data:
* a not-very-pretty but functional page to download the entire 2.5 million tip tree as newick
* API access to subtrees and source trees as well as taxon name services
* clone the github repository of all input trees
A few folks have started to think about ways to interact with the very large newick file, specifically extracting subtrees. Yan Wong posted a perl solution a few weeks ago:
http://yanwong.me/?page_id=1090
Michael Elliot has a C++ package called Gulo which seems to be very efficient (see comments on the post):
Thrilled to see people working with the data! I note that, despite having APIs to return a subtree or a pruned subtree, downloading all of the data and working with it locally is still an easy and flexible option for many users. We will continue to make our datasets available, and that download page should have more options and tree metrics soon!
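For anyone who wants to try this on the downloaded file, here is a minimal Python sketch of subtree extraction by label. It is not the approach used by either of the tools mentioned above; the function name is made up, and it assumes that labels are unquoted, unique, and contain no Newick punctuation.

```python
def extract_subtree(newick, label):
    """Return the Newick string for the clade or tip named `label`.

    Assumes labels are unquoted and contain no Newick punctuation.
    """
    delimiters = "(),:;"
    start = 0
    while True:
        pos = newick.find(label, start)
        if pos == -1:
            raise KeyError(label)
        end = pos + len(label)
        before = newick[pos - 1] if pos > 0 else "("
        after = newick[end] if end < len(newick) else ";"
        # Accept only whole-label matches, not substrings of longer labels.
        if before in "()," and after in delimiters:
            break
        start = pos + 1
    if before != ")":
        return label + ";"          # a tip: nothing to its left to collect
    # An internal label follows its clade's ")": walk left to the matching "(".
    depth = 0
    for i in range(pos - 1, -1, -1):
        if newick[i] == ")":
            depth += 1
        elif newick[i] == "(":
            depth -= 1
            if depth == 0:
                return newick[i:end] + ";"
    raise ValueError("unbalanced parentheses")
```

The sketch relies on the fact that in Newick an internal node's label comes directly after the closing parenthesis of its clade, so the subtree is everything back to the matching opening parenthesis. Branch lengths inside the clade are kept; the branch length of the extracted node itself is dropped.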
Apply for Tree-for-all: a hackathon to access OpenTree resources
Full call for participation and link to application: http://bit.ly/1ioPPMc
A global “tree of life” will transform biological research in a broad range of disciplines from ecology to bioengineering. To help facilitate that transformation, the OpenTree <http://opentreeoflife.org> project [1] now provides online access to >4000 published phylogenies, and a newly generated tree covering more than 2.5 million species.
The next step is to build tools to enable the community to use these resources. To meet this aim, OpenTree <http://www.opentreeoflife.org/>, Arbor <http://www.arborworkflows.com/> [2] and NESCent’s HIP <http://www.evoio.org/wiki/HIP> working groups [3] are staging a week-long hackathon September 15 to 19 at U. Michigan, Ann Arbor. Participants in this “Tree-for-all” will work in small teams to develop tools that use OpenTree’s web services to extract, annotate, or add data in ways useful to the community. Teams also may focus on testing, expanding and documenting the web services.
If you can imagine using these resources, and you have the skills to work collaboratively to turn those ideas into products (as a coder, or working side-by-side with coders), we invite you to apply for the hackathon. The full call for participation (http://bit.ly/1ioPPMc) provides instructions for how to apply, and how to share your ideas with potential teammates (strongly encouraged prior to applying). Applications are due July 8th. Travel support is provided. Women and underrepresented minorities are especially encouraged to apply.
If you have questions, contact Karen Cranston (karen.cranston@nescent.org, @kcranstn, OpenTree), Arlin Stoltzfus (arlin@umd.edu, HIP), Julie Allen (juliema@illinois.edu, HIP), or Luke Harmon (lukeh@uidaho.edu, Arbor).
[1] http://www.opentreeoflife.org
[2] http://www.arborworkflows.com/
[3] http://www.evoio.org/wiki/HIP (Hackathons, Interoperability, Phylogenies)
Data sharing, OpenTree and GoLife
NSF has released GoLife, the new solicitation that replaces both AToL and AVAToL. From the GoLife text:
The goals of the Genealogy of Life (GoLife) program are to resolve the phylogenetic history of life and to integrate this genealogical architecture with underlying organismal data.
Data completeness, open data and data integration are key components of these proposals – inferring well-sampled trees that are linked with other types of data (molecular, morphological, ecological, spatial, etc) and made easily available to scientific and non-scientific users. The solicitation requires that trees published by GoLife projects are published in a way that allows them to be understood and re-used by Open Tree of Life and other projects:
Integration and standardization of data consistent with three AVAToL projects: Open Tree of Life (www.opentreeoflife.org), ARBOR (www.arborworkflows.com), and Next Generation Phenomics (www.avatol.org/ngp) is required. Other data should be made available through broadly accessible community efforts (i.e., specimen data through iDigBio, occurrence data through BISON, etc).
What does it mean to publish data consistent with Open Tree of Life? We have a short page on data sharing with OpenTree, and a publication in PLOS Currents Tree of Life with best practices for sharing phylogenetic data. Our phylogeny curation application allows you to upload and annotate phylogenies consistent with OpenTree synthesis, and you can quickly import trees from TreeBASE.
If you have questions about a GoLife proposal (or any other data sharing / integration issue), feel free to ask on our mailing list or contact Karen Cranston directly.
Recommending CC0 for GBIF data
GBIF (Global Biodiversity Information Facility) recently issued a request for comment on its data licensing policy. While Open Tree of Life does not currently use specimen data, we do use the GBIF classification in order to help resolve names and also as part of the opentree backbone. Jonathan Rees, Karen Cranston, Todd Vision and Hilmar Lapp wrote a response recommending a CC0 waiver for all GBIF data. Here is our summary, and a link to the full response on Figshare.
Summary
As a data aggregator, the goal of GBIF should be to find policies that benefit both its data providers and data reusers. Clearly, a GBIF that has no or few data will have little value, but so will a GBIF full of data that is encumbered with restrictions to an extent that stifles reuse. Our response follows from the proposition that promoting data reuse should be a shared interest of all the parties: data providers, data users, and GBIF itself. We feel the consultation document missed the opportunity to recognize this shared interest, and that furthering the goal of data reuse should in fact be a primary yardstick by which different licensing options are measured.
Tracking the reuse of data is a critically important goal, as it provides a means of reward to data providers, allows scrutiny of derived results, and enables discovery of related research. Initiatives such as DataCite have made considerable progress in recent years in enabling tracking of data reuse by addressing the sociotechnical obstacles involved. By contrast, the consultation, in our view, puts undue weight on legal requirements for attribution. Legal instruments such as licenses are unsuitable for, not designed for, and of little if any benefit for this purpose. Moreover, in most of the world, there is little to no formally recognized intellectual property protection for data, and it is on such protection that licenses rest.
In short, our recommendations are (1) that all data in GBIF be released under Creative Commons Zero (CC0), which is a public domain dedication that waives copyright rather than asserting it; (2) GBIF should set clear expectations in the form of community norms for how the data that it serves is to be referenced when reused, and (3) GBIF should work with partner organizations in promoting standards and technologies that enable the effective tracking of data reuse.
We note that our analysis is based on our understanding of the law; we are not legal professionals and this is not legal advice.
Full response
Response to GBIF request for consultation on data licenses. Karen Cranston, Todd Vision, Hilmar Lapp, Jonathan Rees. figshare.
http://dx.doi.org/10.6084/m9.figshare.799766