Open Tree Taxonomy browser

Our three major outputs so far are the synthetic tree, the collection of well-curated input phylogenies (with a graphical interface to the underlying GitHub repository), and the reference taxonomy. Up until now, there hasn’t been a simple way to browse the Open Tree Taxonomy (OTT). You could download the full reference taxonomy, or use the low-level scripting language in the source code, but it wasn’t easy to get an overview of the structure.

We have just released the first version of a browser for OTT. Each taxon page includes information about the input taxonomies that contain the taxon, synonyms, lineage, and children. Here is a sample page for Eukaryota:

[Screenshot: OTT browser page for Eukaryota]

To open the taxonomy browser, click on the OTT identifier from any node in the synthetic tree:

[Screenshot: OTT identifier link on a node in the synthetic tree]

We hope this will make it easier to see how the taxonomy influences the synthetic tree. This is only an initial, rough version of the browser, and there is still much to do! The source code is in the opentree repository. If you have feedback or suggestions, please create an issue or see the list of existing suggestions using the taxonomy label.

Publication of first draft of the tree of life

We are excited to publish the first draft of the Open Tree of Life in PNAS:

http://www.pnas.org/content/early/2015/09/16/1423041112.abstract

Scientists have used gene sequences and morphological data to construct tens of thousands of evolutionary trees that describe the evolutionary history of animals, plants, and microbes. This study is the first, to our knowledge, to apply an efficient and automated process for assembling published trees into a complete tree of life. This tree and the underlying data are available to browse and download from the Internet, facilitating subsequent analyses that require evolutionary trees. The tree can be easily updated with newly published data. Our analysis of coverage not only reveals gaps in sampling and naming biodiversity but also further demonstrates that most published phylogenies are not available in digital formats that can be summarized into a tree of life.

This is only a first draft, and there are plenty of places where the tree does not represent what we know about phylogenetic relationships. We can improve this tree through incorporation of new taxonomic and phylogenetic data. Our data store of trees (which contains many more trees than are included in the draft tree of life) is also a resource for other analyses. If you want to contribute a published tree for synthesis (or for analyses of coverage, conflict, etc), you can upload it through our curation interface.

Many thanks to all of the people who provided data, discussion, review, curation, and code, and of course to NSF Biology for funding this work!

Proposal for OpenTree node stability

Currently, OpenTree has two different types of node IDs. Taxonomy (OTT) IDs are assigned to named nodes when we construct a taxonomy release, and phylogenetic node IDs are assigned by the treemachine neo4j graph database for nodes that do not align to an OTT ID (i.e. nodes added due to phylogenetic resolution). The OTT IDs are fairly stable over time, but the neo4j node IDs are definitely not stable, and the same neo4j ID may point to a completely unrelated node in future versions of the graph.

This system is problematic because we expose both types of IDs in the APIs (and also in URLs for the tree browser). The lack of neo4j node stability therefore affects API calls that use node IDs, browser bookmarks to nodes in the synthetic tree, and feedback left by users about specific nodes in the tree (see feedback issue #63 and treemachine issue #183). The OTT IDs are problematic as well: it is not straightforward to document when we reuse an existing OTT ID, mint a new ID, or delete an existing ID when going from one version of the taxonomy to the next.

At our recent face-to-face meeting, we discussed a proposal for a node identifier registry and are looking for feedback. We don’t intend this system to be a universally used set of node definitions (i.e. we aren’t trying to make a PhyloCode registry). We want a lightweight system that prevents exposure of unstable node IDs through the APIs to clients (including our own web application) and provides some measure of predictability. Feedback on this proposal would be greatly appreciated.

Requirements

  • be able to use the same node ID definitions across OTT and the synthetic tree
  • transparency about when we re-use a nodeID from a previous version of tree or taxonomy (or not)
  • users get an error when using a node ID from a previous version where there is no current node that fits that definition
  • fixing errors (such as moving a snail found in a worm taxon to its proper location) should not involve massive numbers of ID changes
  • generation of node definitions based on a given taxonomy must be automated and efficient
  • application of node definitions to an existing tree / taxonomy must be automated and efficient

Proposal

Develop a lightweight registry of node definitions based on the structure of the OpenTree taxonomy. For each new version of the taxonomy and synthetic tree, use the registry to decide when to re-use existing node IDs and when to register a new definition + ID.

Leaf nodes will be assigned IDs during creation of OTT based on name (together with enough taxonomic context to separate homonyms).

The definition of the ID for a non-leaf node will include a list of IDs for nodes that are descendants of the intended clade, a list of IDs for nodes that are excluded from being descendants, and (optionally) a taxonomic name.

Definitions would never be deleted from the registry, although not all definitions will be used in any given tree / taxonomy.
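As a rough illustration of the proposal above (not code from any OpenTree repository), a registry entry and a matching rule might look like the sketch below. Everything named here is hypothetical: the `NodeDefinition` class, the `descendant_sets`, `matching_nodes`, and `resolve` helpers, and the toy tree with its invented IDs. The sketch assumes that a node fits a definition when every listed descendant lies below it and no excluded node does, and that a definition with no unique match should produce an error, as in the requirements.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class NodeDefinition:
    """Hypothetical registry entry: a stable ID plus include/exclude lists."""
    node_id: str
    includes: FrozenSet[str]
    excludes: FrozenSet[str]
    name: Optional[str] = None  # optional taxonomic name


def descendant_sets(children: Dict[str, List[str]], root: str) -> Dict[str, FrozenSet[str]]:
    """Map every node in the tree to the set of all nodes below it."""
    result: Dict[str, FrozenSet[str]] = {}

    def visit(node: str) -> FrozenSet[str]:
        below = set()
        for kid in children.get(node, []):
            below.add(kid)
            below |= visit(kid)
        result[node] = frozenset(below)
        return result[node]

    visit(root)
    return result


def matching_nodes(defn: NodeDefinition, clades: Dict[str, FrozenSet[str]]) -> List[str]:
    """Nodes whose clade contains all includes and none of the excludes."""
    return sorted(
        node for node, below in clades.items()
        if defn.includes <= below and not (defn.excludes & below)
    )


def resolve(defn: NodeDefinition, clades: Dict[str, FrozenSet[str]]) -> str:
    """Return the single matching node, or raise if there is none or several."""
    matches = matching_nodes(defn, clades)
    if len(matches) != 1:
        raise LookupError(f"{defn.node_id}: {len(matches)} matching nodes")
    return matches[0]


# Toy tree with invented IDs: root -> (n1 -> (n2 -> t1, t2, t3), t4), t5
toy_tree = {"root": ["n1", "t5"], "n1": ["n2", "t4"], "n2": ["t1", "t2", "t3"]}
clades = descendant_sets(toy_tree, "root")

definition = NodeDefinition("reg1", frozenset({"t1", "t2"}), frozenset({"t4"}))
print(resolve(definition, clades))  # prints n2
```

In this toy case the exclusion of t4 is what rules out n1 and root; without it, all three nodes would contain t1 and t2. A real taxonomy is far deeper than this example, so an iterative traversal would be safer than recursion.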

Implementation questions

  • How many descendant and excluded nodes to include in the definitions: The definition needs some specificity but also can’t assume a complete list due to future addition of new species. Perhaps, for example, four descendants and three exclusions would be a decent compromise between one and thousands?
  • How to choose the specific nodes in the lists of descendants and exclusions: Should be ‘popular’  (should occur in as many sources as possible) and informative (if T has children T1 and T2 then at least one definition descendant should be taken from T1, and at least one from T2). Excluded nodes should be ‘near misses’ rather than arbitrarily chosen.
  • What to do when >1 node meets the definition: Add an option of adding constraints to the registered definition in order to remove the ambiguity while preserving the ID.
  • What to do when >1 definition matches a node: Ambiguous assignments can be resolved either by the addition of constraints, or by the creation of new ids.
  • Modification / versioning of definitions: If we add constraints to a definition (for example, to resolve ambiguity), does this mint a new ID or version the existing definition?
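To make the second question above more concrete, here is one possible selection rule, sketched under several assumptions. The function name `choose_specifiers`, the toy taxonomy, and the popularity counts are all invented for illustration. ‘Popular’ is taken to mean a higher count of sources containing the node, ‘informative’ to mean at least one pick from each child of the taxon, and ‘near misses’ to mean leaves in sibling clades. None of these interpretations is settled in the proposal.

```python
from typing import Dict, List, Tuple


def leaves_below(children: Dict[str, List[str]], node: str) -> List[str]:
    """All leaf IDs under a node (a leaf returns itself)."""
    kids = children.get(node, [])
    if not kids:
        return [node]
    found: List[str] = []
    for kid in kids:
        found.extend(leaves_below(children, kid))
    return found


def choose_specifiers(
    children: Dict[str, List[str]],
    parent: Dict[str, str],
    taxon: str,
    popularity: Dict[str, int],
    n_include: int = 4,
    n_exclude: int = 3,
) -> Tuple[List[str], List[str]]:
    """Pick included descendants and excluded 'near misses' for a taxon."""

    def ranked(ids: List[str]) -> List[str]:
        # Most popular first; ties broken by ID so the result is deterministic.
        return sorted(ids, key=lambda i: (-popularity.get(i, 0), i))

    # Informative: take the most popular leaf from each child first.
    per_child = [ranked(leaves_below(children, kid)) for kid in children.get(taxon, [])]
    include = [leaves[0] for leaves in per_child][:n_include]

    # Fill any remaining slots with the most popular leftover leaves.
    leftovers = ranked([leaf for leaves in per_child for leaf in leaves[1:]])
    include += leftovers[: n_include - len(include)]

    # Near misses (assumption): popular leaves from sibling clades.
    siblings = [s for s in children.get(parent.get(taxon, ""), []) if s != taxon]
    near = ranked([leaf for s in siblings for leaf in leaves_below(children, s)])
    return include, near[:n_exclude]


# Toy taxonomy with invented IDs: T has children T1 (a, b) and T2 (c); S is T's sibling.
children = {"root": ["T", "S"], "T": ["T1", "T2"], "T1": ["a", "b"], "T2": ["c"], "S": ["d", "e"]}
parent = {"T": "root", "S": "root", "T1": "T", "T2": "T",
          "a": "T1", "b": "T1", "c": "T2", "d": "S", "e": "S"}
popularity = {"a": 5, "b": 9, "c": 1, "d": 2, "e": 7}  # invented source counts

print(choose_specifiers(children, parent, "T", popularity, n_include=2, n_exclude=1))
# prints (['b', 'c'], ['e'])
```

Note that c is chosen over the more popular a because it is the only representative of T2, which is the kind of trade-off between popularity and informativeness that the question raises.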

 

Workshop: Barriers to assembling phylogeny and data layers across the tree of life

The challenges to completing the Tree of Life and integrating data layers (NSF GoLife goals) are huge and vary across clades. Some groups have a nearly-complete tree but lack publicly available data layers, whereas other groups lack phylogenetic resolution or the resources to support tree / data integration. Partnering with Open Tree of Life and Arbor Workflows, FuturePhy will support a series of clade-based workshops to identify and solve specific challenges in tree of life synthesis and data layer integration.

RFP: 2 page proposals to fund small workshops and/or hackathons on completing the tree of life and integrating data layers for specific clades.
Proposal deadline: Nov. 1, 2015
Meeting dates: Feb 26-28, 2016 (changed from Feb 20-23) *note changed dates!*
Location: Gainesville, University of Florida
Participants per workshop: 10 maximum funded (virtual attendees possible)
Contacts: mwestneat@uchicago.edu (FuturePhy), karen.cranston@gmail.com (OpenTree), lukejharmon@gmail.com (Arbor)

The full call for participation and a link to a proposal template is available at the FuturePhy website.

Have questions about this or future workshops? Attend our webinar Thursday, September 17 at 1 pm EDT. See details on how to connect.

The Open Tree of Life’s education and outreach site

[Screenshot: the Open Tree of Life education and outreach site]

A little-known side element of the Open Tree of Life project is the “Edu Tree of Life,” an interactive educational experience to engage the public. The website is nearing completion, and our goal with it has been to educate young students as well as the general public on topics surrounding evolution and phylogenetic trees. Our approach is to visually inform and engage users with colorful and entertaining animation, interactive features, and contextualization of facts and figures.

Our educational site is composed of three unique, interactive views of the ToL:

1) A “Big Picture” tree provides a zoomed-out timeline perspective of life’s history on Earth and explains key elements of the tree of life using a stylized, graphic visualization. This ‘macro’ view presents the evolutionary history of Earth, starting from the creation of our planet and spanning all the way to the present day. As the user moves up the timeline, the tree ‘grows’ in front of them, revealing historical information; each new screen also offers a detailed explanation of one of several core concepts surrounding evolution. Videos with animations and live narrators explain each of these core concepts. Along with the videos, ‘pop-up’ information boxes offer further information.

Key elements:

  • A macro View
  • Key Concepts
  • Timeline of Life
  • Chaptered Format/Parallax Scrolling
  • Videos and Animation

The core concepts we explore are:

  • The Origin of Life
  • The Three Domains of Life
  • Common Ancestors
  • Extinction
  • Biodiversity
  • Lateral/Horizontal Gene Transfer & Genes

2) The page titled “Categorizing Life on Earth” is a mid-sized view of life, a data-driven interactive tree with a focus on the groupings of species (clades). This tree uses a sampling of data to illustrate hierarchy with a familiar ‘tree’ structure that employs branching lines of evolution. It pulls images from Phylopic and data from EOL for descriptions. A user can expand and contract nodes to view clades they find interesting. Still to come: we are exploring ways to illustrate lateral gene transfer (LGT) and are working on connecting nodes back via their common ancestry, so that clicking any two nodes will show you a visualization of how those species are connected through the whole tree of life.

Key elements:

  • Mid-sized view of major clades
  • Data-driven interactive
  • Shows Common Ancestry, Phylogeny and Clades
  • Species groupings ending in Clades

3) The “Explore Species” page is our ‘micro view’ of species on Earth. This interactive spinning wheel allows a user to select any of about 180 species to learn about. The 180 species were chosen as exemplary based on many factors: some were chosen for their relative familiarity to the general public, but many were chosen due to specific scientific breakthroughs associated with them. Many were the first species within their field of study to be gene-sequenced, some are keystone species with important evolutionary relatives, and others have strange or unique characteristics worthy of mention.

The information offered for each species includes an image (when available), scientific and common names, the major domain within which the species resides, and then a brief description of the species. This was achieved using the Encyclopedia of Life’s online API, which allowed us to pull information and other resources off of their site to show on ours. As a way of opening an educational portal between the two, any species you click on in the Wheel of Life can also be visited on its parent page at the Encyclopedia of Life, where much more information about all species can be found. We hope that this partnership will prove very fruitful for bringing in casual interest and turning it into a burning passion for evolutionary science and history. Even if we only end up with a few more zoologists, we’ll be happy.

Key elements:

  • Micro View
  • Exemplary/representative species
  • Connects to EoL API, a gateway for further learning
  • Catalogue of interesting species
  • Some info on Major Domains
  • Fun, introductory look into species and their connections

We welcome your comments. —John Allison and Karl Gude

FuturePhy

This is the first in a series of posts about several phylogeny initiatives, newly funded by NSF, focused on both technical and community aspects of phylogeny. There is plenty of potential for mutually beneficial work with OpenTree, and we are excited to help.

First up… FuturePhy!

FuturePhy is an NSF-sponsored, three-year program of conferences, workshops and hackathons on the Tree of Life. The project aims to promote novel, integrative data analyses and visualization, interdisciplinary syntheses of phylogenetic sciences, and cross-cutting uses of phylogenetics to develop and address new research questions and applications.

The first phase of this mission is critical: to bring together a broad community of people from diverse backgrounds who are active in phylogenetics research, who use the tree of life in research or education, who will benefit in applied or practical ways from a comprehensive tree of life, or who come from a background that offers new perspectives on defining, addressing or transcending key challenges in phylogenetics.

Help accelerate progress in all aspects of phylogenetics research by joining FuturePhy today. Diverse opportunities will be available to attend FuturePhy sessions in person or virtually, and to link FuturePhy to existing projects and initiatives.

We invite you to participate in the project in several ways:

  1. Register on futurephy.org. Scientists from all aspects of the phylogenetic sciences, educators, members of the tree-using community, and others interested in phylogenetics are welcome.
  2. Take the community survey and let FuturePhy know what workshop and hackathon topics they should fund.
  3. Contribute to the discussion forum on futurephy.org. This is the best way to log your interest and contribute ideas.
  4. Send email to contact@futurephy.org with ideas or comments.
  5. Tweet to the FuturePhy community: @FuturePhy
  6. Comment in the FuturePhy phylobabble thread

Crandall Lab Update: What can we do with synthetic trees?

Currently, the Crandall Lab is examining ways to use the underlying OpenTree taxonomy to gather metadata, associate it with nodes and tips in our synthetic trees, and apply it to evolutionary studies. Below we discuss them in the context of ongoing projects in the lab.

 

Curated taxonomy

One of the major outcomes of the OpenTree project is the underlying taxonomy. A curated taxonomy allows us to search and align names across independent databases to pull out additional information to associate with node and tip names. The Crandall Lab taxonomy curation started with the freshwater crayfish, which are in the infraorder Astacidea and include 711 species spread among 7 families. This was a great group to start with because the number of species is limited and only a few active systematists are revising the alpha-taxonomy, which makes the literature less dense and easier to work with. Initially, our investigations deemed the taxonomy very accurate, but the main issue we had to contend with was spelling errors introduced when sequences were deposited into GenBank. In all, we identified and removed 10 misspelled taxa. Although that seems small, it was a great warm-up for two larger groups we are now working on: the Decapoda (crabs, shrimps, lobsters), which includes ~15,000 species, and the Hemiptera (true bugs), which includes ~50,000-80,000 species.

 

Using a curated taxonomy to obtain additional data

As mentioned above, once the names have been curated we can use them to search across databases. This has been extremely useful in obtaining additional metadata to associate with our synthetic trees. For example, the Crandall Lab recently published a synthetic tree of the crayfish (Fig. 1), which included IUCN Red List values plotted for those taxa with assigned values (Owen et al. 2015, Richman et al. 2015). This is only feasible because we are able to search across IUCN Red Listed crayfish species using the OpenTree curated taxonomy names.

Other applications of using a curated taxonomy to obtain metadata include searching across GenBank to identify whether a particular taxon or rank has molecular data associated with it. This is useful for determining sampling strategies for new and continuing studies. For example, using the OpenTree taxonomy to search GenBank for Hemiptera families and genera, we found that a wealth of sequence data has been generated for most of the higher taxa. The most diverse suborder within Hemiptera is Heteroptera, and our query of names against NCBI GenBank suggests that 70 of the 83 described families within Heteroptera have sequence data for one or more of the traditional eight molecular loci used in Hemiptera systematics (Fig. 2A). As for the Hemiptera genera identified in GenBank, we are currently validating the numbers in Fig. 2B, since Hemiptera alpha-taxonomy is very active: many species are vectors for human pathogens and agricultural pests (e.g., kissing bug, aphids, psyllids).

 

In addition to searching GenBank, we are currently associating geographic, morphological, and ecological metadata with our curated names through GBIF and EOL TraitBank. We believe the curated OpenTree taxonomies of these groups and the accumulation of metadata for taxa will surely add a new dimension to our evolutionary studies and allow us to expand the scope of the questions we can answer.

Figure 1 Synthetic tree of crayfish with 20 source trees. Family names noted on the edge of the synthetic tree. Paraphyly of Cambaridae is not novel and needs to be addressed in a morphological revision. Color blocks note the IUCN Redlist value.

Figure 2 Histograms depicting number of sequences found on GenBank given OTT names. 2A) Hemiptera families within suborders with nucleotide sequence data on NCBI GenBank. 2B) Hemiptera genera within suborders with nucleotide sequence data on NCBI GenBank.

Keith Crandall is a professor and director of the Computational Biology Institute at George Washington University. 

Chris Owen is a post-doctoral researcher for the AVAToL grant.

Literature

Owen, C. L., Bracken-Grissom, H., Stern, D., & Crandall, K. A. (2015). A synthetic phylogeny of freshwater crayfish: insights for conservation. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 370(1662), 20140009.

Richman, N. I., Böhm, M., Adams, S. B., Alvarez, F., Bergey, E. A., Bunn, J. J., … & Collen, B. (2015). Multiple drivers of decline in the global status of freshwater crayfish (Decapoda: Astacidea). Philosophical Transactions of the Royal Society B: Biological Sciences, 370(1662), 20140060.
