Assembling, Visualizing, and Analyzing the Tree of Life

Tree-for-All hackathon series: Taxon sampling, part 1 

Sampling taxa with Python and Perl scripts

This continues a series of posts featuring results from the recent “Tree-for-all” hackathon (Sept 15 to 19, 2014, U. Mich Ann Arbor) aimed at leveraging data resources of the Open Tree of Life project.  To read the whole series, go to the Introduction page.

More specifically, this is the first of two posts addressing the outputs of the “Sampling taxa” team, consisting of Nicky Nicolson (Kew Gardens), Kayce Bell (U. New Mexico), Andréa Matsunaga (U. Florida), Dilrini De Silva (U. Oxford), Jonathan Rees (OpenTree) and Arlin Stoltzfus (NIST).[1]

The “taxon sampling” idea

Although users seeking a tree may have a predetermined set of species in mind, often the user is focused on taxon T without having a prior list of species. For instance, the typical user interested in a tree of mammals does not really want the full tree of > 5000 known species of mammals, but some subset, e.g., a tree with a random subset of 100 species, or a tree of the 94 species with known genomes in NCBI, or a tree with one species for each of ~150 mammal families.

If we think about this more broadly, we can identify a number of different types of sampling, depending on what kinds of information we are using, and how we are using it. First, sampling T by sub-setting is simply getting all the species in T that satisfy some criterion, e.g., being on the IUCN red list of endangered species,  or having a genome entry in NCBI genomes, a species page in EOL, or an image in phylopic.org (organism silhouettes for adorning trees).

Second, we might use a kind of hierarchical taxonomic sampling to get 1 (or more) species from each genus (or family, order, etc.).

Poster from hackathon day 1, making the pitch for sampling taxa as a hackathon target

Third, we could reduce the complexity of a taxon or clade without using any outside information, which we might call down-sampling: for example, get a random sample of N species from taxon T, down-sample nodes according to subnode density, or choose N species to maximize phylogenetic diversity.

Finally, we can imagine a kind of relevance sampling, where we choose (from taxon T) the top N species based on some external measure of importance or relevance, e.g., the number of occurrence records in iDigBio (or GBIF, iNaturalist, etc.), the number of google hits (i.e., popular species), or the number of PubMed hits (i.e., biomedically relevant species).
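The four kinds of sampling can be made concrete with a small Python sketch. This is only an illustration, not code from the hackathon: the records, their values, and the function names (subset, one_per_group, random_sample, top_n) are all invented for the example.

```python
import random

# Toy records; every value here is made up for illustration.
SPECIES = [
    {"name": "Panthera tigris", "family": "Felidae", "has_genome": True, "records": 120},
    {"name": "Felis catus", "family": "Felidae", "has_genome": True, "records": 900},
    {"name": "Canis lupus", "family": "Canidae", "has_genome": True, "records": 700},
    {"name": "Vulpes vulpes", "family": "Canidae", "has_genome": False, "records": 450},
    {"name": "Sorex araneus", "family": "Soricidae", "has_genome": False, "records": 300},
]


def subset(species, predicate):
    """Sub-setting: keep every species that satisfies a criterion."""
    return [s["name"] for s in species if predicate(s)]


def one_per_group(species, key):
    """Hierarchical sampling: keep the first species seen in each group."""
    chosen = {}
    for s in species:
        chosen.setdefault(s[key], s["name"])
    return list(chosen.values())


def random_sample(species, n, seed=None):
    """Down-sampling: draw n species at random (seed makes it repeatable)."""
    rng = random.Random(seed)
    return rng.sample([s["name"] for s in species], n)


def top_n(species, n, score):
    """Relevance sampling: keep the n species with the highest score."""
    ranked = sorted(species, key=score, reverse=True)
    return [s["name"] for s in ranked[:n]]


if __name__ == "__main__":
    print(subset(SPECIES, lambda s: s["has_genome"]))
    print(one_per_group(SPECIES, "family"))
    print(random_sample(SPECIES, 3, seed=1))
    print(top_n(SPECIES, 2, lambda s: s["records"]))
```

In each case the result is just a list of names, which is the input that the name-matching and tree-extraction services need.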

Products of the “taxon sampling” team

At the tree-for-all hackathon, the “taxon sampling” team took on the challenge of demonstrating approaches to sampling from a taxon, making their products available in their github repo. The group focused its effort on creating multiple implementations for 3 specific use-cases:

  • sub-setting: get species in T with entries in NCBI genomes
  • down-sampling: get a random sample of N species from T
  • relevance sampling: get the N species in T with the most records in iDigBio

Each approach relies on 2 key OpenTree web services (described and illustrated in the introduction): the match_names service to match species names to OpenTree taxon identifiers (ottIds), and the induced_subtree service to get a tree for species designated by these identifiers.

Here I’ll describe two projects based on command-line scripts in Python and Perl.  In the next post, I’ll describe how taxon sampling was implemented within existing platforms that have a graphical user interface, including Open Refine (spreadsheets), PhyloJIVE, and Arbor.

Down-sampling in Python

A simple down-sampling approach via random choice is implemented in the random_sample.py script developed by Dilrini De Silva (Oxford) and Jonathan Rees (OpenTree), as in this example:

python random_sample.py -t Mammalia -m random -n 50 -o my_induced_tree.nwk

Here, “Mammalia” can be replaced by another taxon name, “50” may be replaced by another number, and the -o flag specifies an output file. The script interacts with OpenTree via the ‘opentreelib’ Python library (another hackathon product available on github). It retrieves the unique ottId of the higher taxon specified via the -t flag, and queries OpenTree to retrieve the subtree under that node. It parses the subtree to identify the species it contains, selects a random sample of those species, and requests the induced subtree, writing this to a Newick file.
This script also invokes a rendering library to create a graphic image of the tree from the command-line, as in the example (figure) showing a random sample of 10 mammals.
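The parse-and-sample step of this workflow can be sketched in a few lines of Python. This is not the code of random_sample.py; it is a minimal stand-in that assumes a simple Newick string without quoted labels, and the function names tip_labels and sample_tips are invented here.

```python
import random
import re


def tip_labels(newick):
    """Return the leaf labels of a simple Newick string (no quoted labels)."""
    # A tip label directly follows "(" or ","; internal-node labels follow ")"
    # and branch lengths follow ":", so neither is captured.
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


def sample_tips(newick, n, seed=None):
    """Pick n tip labels at random; pass a seed for a repeatable sample."""
    tips = tip_labels(newick)
    if n > len(tips):
        raise ValueError(f"asked for {n} tips but the tree has only {len(tips)}")
    return random.Random(seed).sample(tips, n)


if __name__ == "__main__":
    tree = "((Panthera_tigris:1.0,Felis_catus:1.0)Felidae:2.0,Sorex_araneus:3.0)Mammalia;"
    print(tip_labels(tree))
    print(sample_tips(tree, 2, seed=42))
```

The sampled labels would then be matched to ottIds and passed to the induced-subtree service.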

Sub-setting in Perl

The specific sub-setting challenge that the team picked was to get a tree for those species (in a named taxon) that have a genome entry in NCBI genomes. NCBI offers a programmable web-services interface called “eutils” to access its databases. Because NCBI searches can be limited to a named taxon, it is possible to query the genomes database with the “esearch” service for “Mammalia” (or Reptilia, Carnivora, Felidae, Thermoprotei), cross-link to NCBI’s taxonomy database using the “elink” service, get the species names using the “esummary” service, and then use OpenTree services (as described in the Introduction) to match names and extract the induced tree.

This 5-step workflow, which illustrates the potential for chaining together web services to build useful tools, was implemented by Arlin Stoltzfus (NIST) as a set of Perl scripts. The master script invokes 5 other standalone scripts, one for each step. The last 2 scripts are simply command-line wrappers for OpenTree’s match_names and induced_subtree methods. All the scripts are available in the Perl subdirectory of the team’s github repo. They are demonstrated in the brief (<2 min) screencast below.
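For readers who prefer Python, the three NCBI steps can be sketched as URL builders. This is not a translation of the Perl scripts: the function names are invented, the search term "Mammalia[Organism]" is an assumption about how one might restrict the query to a taxon, and NCBI database names and availability may have changed since the hackathon. The sketch only builds the request URLs and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(taxon, db="genome", retmax=500):
    """Step 1: search a database for records limited to a named taxon."""
    params = {"db": db, "term": f"{taxon}[Organism]", "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def elink_url(ids, dbfrom="genome", db="taxonomy"):
    """Step 2: cross-link the record ids from step 1 to the taxonomy database."""
    params = {"dbfrom": dbfrom, "db": db, "id": ",".join(ids), "retmode": "json"}
    return f"{EUTILS}/elink.fcgi?{urlencode(params)}"


def esummary_url(ids, db="taxonomy"):
    """Step 3: fetch summaries (including species names) for taxonomy ids."""
    params = {"db": db, "id": ",".join(ids), "retmode": "json"}
    return f"{EUTILS}/esummary.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    print(esearch_url("Mammalia"))
    print(elink_url(["1", "2"]))      # placeholder ids, not real records
    print(esummary_url(["1", "2"]))   # placeholder ids, not real records
```

Each URL would be fetched in turn, with the ids parsed from one response feeding the next request, before the names are handed to OpenTree's services.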

Next

The taxon sampling group produced several other products.  In the next post, I’ll describe how taxon sampling was implemented within environments that provide a graphical user interface, including Open Refine (spreadsheets), PhyloJIVE (phylogeographic visualization), and Arbor (phylogeny workflows).


[1] The identification of any specific commercial products is for the purpose of specifying a protocol, and does not imply a recommendation or endorsement by the National Institute of Standards and Technology.

 

Why Do We Need Big Trees, Anyway?

An explicit goal of the Open Tree of Life is to create a single phylogenetic tree that encompasses all living (and some extinct) biodiversity on earth. A question some may have, especially non-scientists, is why do we need a tree like that, and what would we do with it? You can’t even see it all at once, right? The answer to this question, of course, is that with bigger and more resolved trees we can answer evolutionary questions on scales not previously possible.

Currently, postdocs from the labs of Doug Soltis (Univ. of Florida) and Stephen Smith (Univ. of Michigan) are collaborating on several projects within the plant world that leverage the power of big trees. Cody Hinchliff, a postdoc in the Smith lab, recently presented some of these findings during a standing-room-only presentation at the Botanical Society of America conference in Boise, Idaho, employing a tree with almost complete generic-level sampling to unravel evolution and diversification of epiphytes across vascular plants. Perhaps most surprisingly, Hinchliff found that most epiphyte lineages are relatively young, suggesting that either the widespread success that epiphytes currently exhibit is a recent phenomenon, or that epiphytic lineages are relatively short-lived and evolve opportunistically in response to large-scale climate fluctuations. This, and other associated findings, are novel and exciting discoveries, and are examples of the insights that can be gleaned by analyzing character data across a massive data set.

Other collaborative “big tree” projects involving the Soltis and Smith labs concern the evolution of the aquatic habit within land plants and the evolution of floral characters in the order Lamiales. These studies involve Hinchliff and Stephen Smith, Bryan Drew from the University of Nebraska at Kearney (formerly a postdoc with Doug Soltis) and Doug Soltis, and undergraduates from all three institutions. The aquatic evolution project is looking at how the re-colonization of aquatic habitats by plants is linked to lineage diversification and whether an aquatic habit is associated with other character or habitat traits. The Lamiales study is investigating what suites of floral characters may be responsible for the extraordinary evolutionary success of the lineage, which at 23,000 species comprises about 1/12th of all flowering plants.

The fact that studies of this magnitude are not only possible, but ongoing, is a testament to the utility of big trees. Because these trees are nearly complete in terms of genera, we can account for virtually all diversity across these clades. Sparse lineage sampling, and hence unaccounted-for diversity, has previously been a hindrance when analyzing evolutionary trends that span the tree of life, but the time is approaching (or might be here already!) when the size of the phylogenies will not be the limiting factor in studying broad-scale evolutionary questions. This exciting development leaves researchers more time to examine and ponder truly interesting questions that could not be addressed previously. This is the power that big trees give us, and this is one of the reasons we need big trees.

Chronogram showing epiphytic evolution within vascular plants. Epiphytic lineages are shown in orange, and likely branches of epiphytic origin are in red. Root of tree is ~485 million years old.

Doug Soltis is a distinguished professor at the University of Florida.
Bryan Drew was previously a post-doctoral researcher in the Soltis lab and is currently an assistant professor at the University of Nebraska-Kearney.

Tree-for-All hackathon series: Introduction

The Tree-for-All: Introduction

Welcome to the first in a series of posts featuring results from the recent “Tree-for-all” hackathon (Sept 15 to 19, 2014, U. Mich., Ann Arbor) aimed at leveraging data resources of the Open Tree of Life project.   This post is written by Arlin Stoltzfus (NIST)[1], one of the hackathon organizers (but not affiliated with Open Tree in any other way).  Below, I’m going to introduce the rationale and aims of the hackathon, describe the process, and summarize some of the projects.  In subsequent posts, we will discuss products and lessons learned.  The list of forward links will be updated as new posts appear:

  • Sampling taxa, part 1: Python and Perl scripts
  • Sampling taxa, part 2: PhyloJIVE, Arbor and Open Refine

Motivation: bridging the accessibility gap

The Open Tree of Life project aims to provide data resources for the scientific community, including

  • a grand synthetic tree covering millions of species, generated from thousands of source trees
  • a database of the source trees, published species trees used to generate the synthetic tree
  • a reference taxonomy used (among other things) to align names from different sources

The premise of synthesizing a grand Tree of Life, and making it available with source studies and a reference taxonomy, is that these resources are valuable.  To assess the value of these resources right now would be premature— we will  return to that question later.  For now, I will just point out that, until recently, when scientists in the bioinformatics community have needed a tree broadly covering the kingdoms of life, they have used the NCBI taxonomy hierarchy (multiple examples are cited by Stoltzfus, et al., 2012), an approach that causes phylogeneticists and systematists to groan.  Surely we are better off now, but determining how much better off we are probably will require further analysis.

For the present, it is important to understand that the value of a community resource is predicated on accessibility.  Most users would not know how to handle a tree with 3 million species, useful or not.  For the value of OpenTree’s resources to be realized, it is important to anticipate the needs of users, and support them with appropriate tools.

The aim of the recent Tree-for-all hackathon was to begin bridging this accessibility gap.  More specifically, the aim of the hackathon was to build capacity for the community to leverage Open Tree’s resources via their recently announced web services API (Application Programming Interface).   This enhanced capacity may take the form of end-user tools, library code, standards, and designs.

Technology: web services

Web services are a natural choice for accessibility, because they provide programmable access to a resource to anyone with a networked computer.  Most of the time when you use the web, you are sending a request for a specific page, and receiving results in HTML that are rendered by your browser.  But more generally, web services work by a standard protocol that allows you to send data and commands, and receive results.

Some services are so simple that you can access them just by typing in the URL box of your browser.  For instance, TreeBASE has a web-services API that allows you to access data with commands such as

http://purl.org/phylo/treebase/phylows/tree/TB2:Tr2026?format=nexus

which retrieves a particular tree in NEXUS format.  When that isn’t enough, you can use a command-line tool such as  cURL (command-line URL), found on most computer systems.   I’ll give an example using cURL, then explain how to use a Chrome extension called DHC that provides a graphical user interface.

Open Tree’s web API can do many things, but let’s start with something simple: find out what the synthetic tree implies about the relationships of a set of named species, “Panthera tigris”, “Sorex araneus”, “Erinaceus europaeus”.   To get the tree, we need to chain together a workflow based on 2 web services: the match_names service to convert species names to Open Tree taxon identifiers, and the induced_subtree service to get a tree for species designated by identifiers.  In the first step, using cURL, we issue this command:

curl -X POST http://api.opentreeoflife.org/v2/tnrs/match_names \
-H "content-type:application/json" \
-d '{"names":["Panthera tigris","Sorex araneus","Erinaceus europaeus"]}'

This command matches our list of input names with the names in OpenTree’s taxonomy. If a species is in the tree, it will have an id in the taxonomy. The output of this command yields the matching identifiers 633213, 796660, and 42314.  To find them, scroll through the output and look for the “ottId” field, which refers to Open Tree taxonomy ids.  Once we have those ids, the next step is to use them to request the tree:

curl -X POST http://api.opentreeoflife.org/v2/tree_of_life/induced_subtree \
-H "content-type:application/json" \
-d '{"ott_ids":[633213, 796660, 42314]}'

which returns a Newick tree (embedded in JSON). OpenTree’s interface refers to this as the “induced” tree, though perhaps it is more appropriately called the implied tree: for any set of nodes in the synthetic tree, the structure of the larger tree immediately implies a topology for the subset, e.g., the tree of A, C and E implied by (A,(B,(C,(D,E)))) is (A,(C,E)).
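The implied-tree idea can be checked with a few lines of Python. This is a minimal sketch, not OpenTree's implementation: the tree is represented as nested tuples of tip names, and the function names induced and to_newick are invented for illustration.

```python
def induced(tree, keep):
    """Prune a nested-tuple tree down to the tips named in `keep`."""
    if isinstance(tree, str):
        # A tip survives only if it was requested.
        return tree if tree in keep else None
    kept = [sub for sub in (induced(child, keep) for child in tree) if sub is not None]
    if not kept:
        return None        # nothing requested below this node
    if len(kept) == 1:
        return kept[0]     # suppress a node left with a single child
    return tuple(kept)


def to_newick(tree):
    """Write a nested-tuple tree as a Newick string (without the final ';')."""
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(to_newick(child) for child in tree) + ")"


if __name__ == "__main__":
    full = ("A", ("B", ("C", ("D", "E"))))
    print(to_newick(induced(full, {"A", "C", "E"})) + ";")
```

Dropping B and D leaves nodes with a single child; collapsing those nodes is what turns the larger topology into (A,(C,E)).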

To run these commands in DHC, start with the cURL command above, then copy and paste the service (the “http” part) and the body (after the -d), into the appropriate boxes, click on “JSON” below the body window (or set the header to content-type: application/json), choose “POST”, then hit “Send”.  The output will appear below.

DHC allows you to use web services in a one-off manner, interactively, but the real power of web services starts to emerge when they are invoked and processed in an automated way, within another program.
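As a rough illustration of that automation, here is a Python sketch that chains the two calls using only the standard library. The base URL and request bodies are copied from the cURL commands above, and the service may have moved or changed since then. The helper names are invented, and the exact layout of the match_names response is an assumption: rather than rely on it, the sketch walks the JSON and collects any integer stored under a key ending in "ottId".

```python
import json
import urllib.request

# Base URL copied from the cURL examples; it may no longer be current.
API = "http://api.opentreeoflife.org/v2"


def post_request(path, payload):
    """Build (but do not send) a JSON POST request for an OpenTree service."""
    return urllib.request.Request(
        f"{API}/{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def collect_ott_ids(node):
    """Walk decoded JSON and collect integers stored under an ottId-like key."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key.lower().endswith("ottid") and isinstance(value, int):
                found.append(value)
            else:
                found.extend(collect_ott_ids(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_ott_ids(item))
    return found


def call(request):
    """Send a request and decode the JSON reply (needs network access)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def induced_tree_for(names):
    """Chain the two services: names -> ottIds -> induced subtree."""
    matches = call(post_request("tnrs/match_names", {"names": names}))
    # If a name has several candidate matches, every candidate id is collected;
    # a real tool would choose among them.
    ids = collect_ott_ids(matches)
    return call(post_request("tree_of_life/induced_subtree", {"ott_ids": ids}))
```

Calling induced_tree_for(["Panthera tigris", "Sorex araneus", "Erinaceus europaeus"]) would perform both steps in one go, provided the service still answers at that address.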

Process: Hackathon

Open Tree announced version 1 of its web services in May, at the same time we distributed an open call for participation in a “Tree-for-all” hackathon, which took place September 15 to 19 at University of Michigan, Ann Arbor.  The hackathon was organized and funded by Open Tree, the Arbor workflows project and NESCent’s HIP (Hackathons, Interoperability, Phylogenies) working group.

What, exactly, is a hackathon?  A hackathon is an intensive bout of computer programming, usually with a scope that allows for considerable creativity (when the objectives are pre-determined, the event might be called a “code sprint” instead).  Often it involves bringing together people who haven’t worked face-to-face before.

The tree-for-all hackathon followed a plan for a participant-driven 5-day meeting with ~30 people.  The participant pool is seeded with some hand-picked developers, but consists mainly of folks who have responded to an open call.  The people chosen to participate are not all elite super-coders— some are subject-matter experts without advanced coding skills.  On the morning of day 1, these participants hear informational presentations— in this case, about Open Tree’s data and services (above), the Arbor workflow project, and HIP’s vision of an interoperable web of evolutionary resources.  This is followed by open discussion of possible projects, a process that typically begins (via email list) long before the hackathon.

On the afternoon of Day 1 comes the make-or-break moment: pitching and team-formation.  Participants with ideas stand up, make a pitch for a software development target, and post it on the wall using a giant sticky note.  Others move from pitch to pitch, critiquing, suggesting ideas, and trying to find where they could contribute (or learn) the most.  Pitches evolve through this process, and eventually a set of teams emerges.  From this point on— days 2 to 5 of the hackathon— the meeting belongs to the teams.  The hackathon will succeed or fail, depending on the strength of the teams.

Hackathon participants gather to hear a progress report.  Left to right: Matt Yoder, Stephen Smith, Cody Hinchliff (standing), Andréa Matsunaga, Joseph Brown, Zack Galbreath (standing), Chodon Sass, Alex Harkess, Julienne Ng (eyes only),  Katie Lyons, Gaurav Vaidya (standing), Jorrit Poelen, Shan Kothari (facing left), David Winter, Julie Allen (standing), Karolis Ramanauskas, Nicky Nicolson, Josef Uyeda, Miranda Sinnott-Armstrong (standing), Rachel Warnock, François Michonneau, Luke Harmon, Kayce Bell, Jon Hill's right arm.

Outcomes: Hackathon team projects

Over the coming weeks, I’m going to write about hackathon team projects and, ideally, provoke some other hackathon participants to do the same.  Hackathon teams are instructed (and cajoled) to focus on tangible outcomes, and the Tree-for-All hackathon produced a lot of them!  For now, here is a brief synopsis.

Integration of Trees and Traits involved hackathon participants Jeff Cavner (remote), Luke Harmon, Zack Galbreath, Jorrit Poelen, Julienne Ng, Alex Harkess, Chodon Sass, Shan Kothari, and Mark Westneat (remote).   They aimed to develop ways to integrate Open Tree’s resources into workflows for analysis of character data and other data.  They already have a nice presentation on their wiki.

Library wrappers for OT APIs involved Joseph Brown, Mark Holder (remote), Jon Hill, Matt Yoder, François Michonneau, Jeet Sukumaran, David Winter, and Karolis Ramanauskas.  The aim of this group was to develop programmable interfaces to Open Tree’s web services in Python, Ruby and R.  They developed an innovative test scheme in which all the libraries were subjected to the same tests.

Phylogeny visualization style-sheets were the focus of Peter Midford (remote), Jim Allman (remote), Pandurang Kolekar (remote), Daisie Huang, Gaurav Vaidya, Julie Allen, and Mike Rosenberg (remote).  Every year thousands of researchers generate  tree images, import them into a graphics editor, and add the same kinds of adornments (colored branches, numbers on nodes, images at the tips, brackets, etc).   The aim of this group was to develop and implement a scheme to treat graphical markups as styles in a separate document (because most tree formats don’t have room for markup), analogous to stylesheets for web pages.

The taxon sampling team included Andréa Matsunaga, Kayce Bell, Dilrini De Silva, Jonathan Rees, Nicky Nicolson and Arlin Stoltzfus.  This group focused on ways to get a phylogeny that represents a sample from a larger taxon: a sample that integrates some useful data, or is otherwise representative of the taxon.

The branch lengths team, including Lyndon Coghill (remote), Rachel Warnock, Josef Uyeda, Katie Lyons, Miranda Sinnott-Armstrong, Bob Thacker (remote), and Curt Lisle (remote) explored ways to address the challenge of adding branch lengths to the synthetic tree.  Like most supertrees, the synthetic tree lacks branch lengths, which limits its usefulness in many kinds of evolutionary studies.

A major knowledge engineering challenge for the Tree of Life community is to link knowledge to nodes in a comprehensive tree, and then ensure that this knowledge persists (as appropriate) when the tree is updated.  A scheme for addressing this challenge was developed and implemented by the annotation database group, including Cody Hinchliff, Karen Cranston, Stephen Smith, Joseph Brown, Mark Holder (remote), Hilmar Lapp (remote) and Temi Varghese.

Next

Next week, I’ll start to describe the work of the taxon sampling team.  To be sure you hear about future posts, click “Follow” in the WordPress bar above this pane.

 


[1] The identification of any specific commercial products is for the purpose of specifying a protocol, and does not imply a recommendation or endorsement by the National Institute of Standards and Technology.

A push for fungal phylogenies in the Open Tree of Life

The summer of 2014 was a busy one for the mycology group in the Open Tree of Life. Postdoctoral Fellow Romina Gazis gave presentations on the Open Tree of Life at the Annual Meeting of the Mycological Society of America (June 8-12, East Lansing, Michigan) and the International Mycological Congress (Aug. 3-8, Bangkok, Thailand). You can download the IMC presentation here.

Meanwhile, back in Worcester, we continued to compile published phylogenetic trees for incorporation into the Open Tree database. Our goal is to create a synthetic tree that represents, as closely as possible, our current understanding of the broad outlines of fungal phylogenetic relationships, based on molecular studies and taxonomy in Index Fungorum and other sources. We plan to use the tree as the centerpiece of a revision of “higher level” fungal taxonomy, updating a study that we published with seventy coauthors way back in 2007 [1].

Dr. Romina Gazis is a postdoc at Clark University. Dr. Gazis specializes in systematics of endophytes, including symbionts of rubber trees (Hevea brasiliensis) and the newly-described class Xylonomycetes, and also works on phylogenies for the Open Tree of Life project.

To this end, we reviewed the recent and not-so-recent fungal biology literature, emphasizing studies that made a major contribution to understanding of higher-level relationships. We thus identified 314 important studies that are a priority for inclusion in Open Tree of Life. The list of “critical” higher-level studies can be viewed here. Mycologists reading this blog post may wish to check our list of references, and let us know if we have missed anything! Please realize that at this point, we are prioritizing studies that resolve major clades, or that have particularly strong sampling of large groups.

Jiaqi Mei is an undergraduate research assistant at the Katz Lab at Smith College. Jiaqi has been working on gathering information on missing phylogenies for the Open Tree of Life project. Photo: Katz Lab

Having identified the critical higher-level analyses, our next job was to search for the phylogenies in TreeBase and upload them to Open Tree of Life via PhyloGrafter. We were assisted in this time-consuming work by Jiaqi Mei, an undergraduate from Laura Katz’s lab at Smith College who joined us for the summer. Of the 314 “higher level” studies, 119 (38%) had trees available in TreeBase or other sources. In contrast, Drew et al. (2013) [2] found that only about 17% of published phylogenetic studies from all groups have available phylogenies. This evidently demonstrates that mycologists who look at “big picture” phylogenetic relationships are particularly conscientious about data deposition! Nonetheless, there were still many missing phylogenies, so Jiaqi and Romina initiated an e-mail campaign, reaching out to authors of the 195 critical higher-level studies for which we had no trees. We are very grateful to have received responses from almost 50 authors so far. If you are among those who replied to our plea for data, we want to take this opportunity to say Thank You! You should have received a note from us; if not, something may have been lost in transit, so please write again!

Our immediate goal is to compile phylogenies that address higher-level relationships, but we are not neglecting fungal studies at low taxonomic levels. In fact, one of Jiaqi’s major tasks was to update our literature review of all fungal phylogenies, reviewing publications since the 2013 study of Drew et al. [2], which included studies published up to 2012. Overall, we have identified 2314 fungal phylogenetic studies published since 2000 in 17 journals, of which 640 (28%) have associated treefiles.

It is hard to believe that the Open Tree of Life Project is already in its third year. Our major goal by the end of this academic year is to produce a synthetic phylogenetic tree that significantly updates the 2007 “AFTOL Classification” [1] of Fungi, with direct connections to taxonomy and diverse phylogenetic studies. With the continued cooperation of the mycological community we are optimistic that we will reach this goal.

[1] Hibbett, D. S., M. Binder, J. F. Bischoff, M. Blackwell, P. F. Cannon, O. E. Eriksson, S. Huhndorf, T. James, P. M. Kirk, R. Lücking, T. Lumbsch, F. Lutzoni, P. B. Matheny, D. J. Mclaughlin, M. J. Powell, S. Redhead, C. L. Schoch, J. W. Spatafora, J. A. Stalpers, R. Vilgalys, M. C. Aime, A. Aptroot, R. Bauer, D. Begerow, G. L. Benny, L. A. Castlebury, P. W. Crous, Y.-C. Dai, W. Gams, D. M. Geiser, G. W. Griffith, C. Gueidan, D. L. Hawksworth, G. Hestmark, K. Hosaka, R. A. Humber, K. Hyde, J. E. Ironside, U. Kõljalg, C. P. Kurtzman, K.-H. Larsson, R. Lichtwardt, J. Longcore, J. Miądlikowska, A. Miller, J.-M. Moncalvo, S. Mozley-Standridge, F. Oberwinkler, E. Parmasto, V. Reeb, J. D. Rogers, C. Roux, L. Ryvarden, J. P. Sampaio, A. Schüßler, J. Sugiyama, R. G. Thorn, L. Tibell, W. A. Untereiner, C. Walker, Z. Wang, A. Weir, M. Weiß, M. M. White, K. Winka, Y.-J. Yao, N. Zhang. 2007. A higher-level phylogenetic classification of the Fungi. Mycological Research 111: 509-547. <http://www.clarku.edu/faculty/dhibbett/Reprints%20PDFs/Hibbett_et_al_AFTOL_class_2007.pdf>

[2] Drew, B.T., R. Gazis, P. Cabezas, K.S. Swithers, J. Deng, R. Rodriguez, L.A. Katz, K.A. Crandall, D.S. Hibbett, D.E. Soltis. 2013. Lost branches on the tree of life. PLOS Biology 11: e1001636. <http://www.clarku.edu/faculty/dhibbett/Reprints%20PDFs/added_pdfs_Feb_2013/Drew_et_al_2013_LostBranchesOnTheTreeOfLife_PLOSbiology.pdf>

David Hibbett is a professor of biology and PI of the Hibbett lab at Clark University.

Romina Gazis is a postdoc at Clark University. 

 

Accessing OpenTree data

With the soft release of v1.0 of the Open Tree of Life (see Karen Cranston’s Evolution talk for details), we also have methods for accessing the data:

* a not-very-pretty but functional page to download the entire 2.5 million tip tree as newick
* API access to subtrees and source trees as well as taxon name services
* clone the github repository of all input trees

A few folks have started to think about ways to interact with the very large newick file, specifically extracting subtrees. Yan Wong posted a perl solution a few weeks ago:

http://yanwong.me/?page_id=1090

Michael Elliot has a C++ package called Gulo which seems to be very efficient (see comments on the post):

http://www.michaelelliot.net/blog/2013/11/09/the-fastest-possible-phylogenetic-deletion-with-phylogenies-of-spotty-animals/

Thrilled to see people working with the data! I note that, despite having APIs to return a subtree or a pruned subtree, downloading all of the data and working with it locally is still an easy and flexible option for many users. We will continue to make our datasets available, and that download page should have more options and tree metrics soon!
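To give a flavour of what working with the downloaded file can look like, here is a small Python sketch that pulls out the subtree below a labelled internal node by scanning the Newick text directly. It is not taken from either of the tools linked above. It assumes that internal nodes carry labels, that labels are unquoted and contain no parentheses, and that the file fits in memory; the function name extract_subtree is invented.

```python
def extract_subtree(newick, label):
    """Return the Newick text of the clade whose internal node is `label`."""
    # An internal-node label sits right after the ")" that closes its clade
    # and is followed by a delimiter (or the end of the string).
    target = ")" + label
    pos = newick.find(target)
    while pos != -1:
        end = pos + len(target)
        if end == len(newick) or newick[end] in ",);:":
            break
        pos = newick.find(target, pos + 1)
    if pos == -1:
        raise KeyError(label)

    # Walk backwards from the closing parenthesis to its matching opener.
    depth = 0
    for i in range(pos, -1, -1):
        if newick[i] == ")":
            depth += 1
        elif newick[i] == "(":
            depth -= 1
            if depth == 0:
                return newick[i:pos + 1] + label + ";"
    raise ValueError("unbalanced parentheses in Newick string")


if __name__ == "__main__":
    tree = "((a,b)X,(c,(d,e)Z)Y)root;"
    print(extract_subtree(tree, "Y"))
```

Because it never builds a tree object, a scan like this stays simple, although a dedicated parser is the safer choice for files with quoted or unusual labels.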

Apply for Tree-for-all: a hackathon to access OpenTree resources

Full call for participation and link to application: http://bit.ly/1ioPPMc

A global “tree of life” will transform biological research in a broad range of disciplines from ecology to bioengineering. To help facilitate that transformation, the OpenTree <http://opentreeoflife.org> project [1] now provides online access to >4000 published phylogenies, and a newly generated tree covering more than 2.5 million species.

The next step is to build tools to enable the community to use these resources.  To meet this aim, OpenTree <http://www.opentreeoflife.org/>, Arbor <http://www.arborworkflows.com/> [2] and NESCent’s HIP<http://www.evoio.org/wiki/HIP> working groups [3] are staging a week-long hackathon September 15 to 19 at U. Michigan, Ann Arbor.  Participants in this “Tree-for-all” will work in small teams to develop tools that use OpenTree’s web services to extract, annotate, or add data in ways useful to the community.  Teams also may focus on testing, expanding and documenting the web services.

How could a global phylogeny be useful in your research or teaching?  What other data from OpenTree would be valuable?  How could OpenTree web services be integrated into familiar workflows and analysis tools?   How could we add to the database of published trees, or enrich it with annotations?

If you can imagine using these resources, and you have the skills to work collaboratively to turn those ideas into products (as a coder, or working side-by-side with coders), we invite you to apply for the hackathon.  The full call for participation (http://bit.ly/1ioPPMc) provides instructions for how to apply, and how to share your ideas with potential teammates (strongly encouraged prior to applying).  Applications are due July 8th. Travel support is provided.  Women and underrepresented minorities are especially encouraged to apply.

If you have questions, contact Karen Cranston (karen.cranston@nescent.org, @kcranstn, OpenTree), Arlin Stoltzfus (arlin@umd.edu, HIP), Julie Allen (juliema@illinois.edu, HIP), or Luke Harmon (lukeh@uidaho.edu, Arbor).

[1] http://www.opentreeoflife.org

[2] http://www.arborworkflows.com/

[3] http://www.evoio.org/wiki/HIP (Hackathons, Interoperability, Phylogenies)

PhyloCode names are not useful for phylogenetic synthesis

Ok, the title is intentionally a bit provocative, but bear with me.

A primary aim of the Open Tree project is to synthesize increasingly comprehensive estimates of phylogeny from “source trees” — published phylogenies constructed to resolve relationships in disparate parts of the tree of life. The general idea is to combine these localized efforts into a unified whole, using clever bioinformatic algorithms.

In this context, a basic operational question is: how do we know if a clade in one source tree is the same as a clade in another source tree? This can be difficult to answer, because source trees are typically constructed from carefully selected samples of individual organisms and their characters (usually DNA sequences). If two source trees are inferred from completely non-overlapping samples of individual organisms, as is commonly the case, is it possible for them to have clades in common, or rather, is it possible for us to determine whether they have clades in common?

I would argue that the answer is yes, with a very important condition: that the organisms sampled for each tree are placed into a common taxonomic hierarchy that embodies a working hypothesis of named clades in the tree of life.

Note an important distinction here: a clade in a source tree depicts common ancestry of selected individual organisms, while a clade in the tree of life is a conceptual group defined by common ancestry that effectively divides all organisms, living and dead, into members and non-members. So a taxon in this sense is a name that refers to a particular tree-of-life clade whose membership is formalized by its position in the comprehensive taxonomic hierarchy.

By placing sampled organisms into a common taxonomic hierarchy, one can compute the relationships between source-tree clades and tree-of-life clades in terms of taxa, a process that I refer to as “taxonomic normalization.”
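One way to picture this computation is the Python sketch below. It is an interpretation offered for illustration, not the procedure used by Open Tree: each sampled organism is given a simplified lineage in a shared hierarchy (toy data), and a source-tree clade is mapped to the least inclusive taxon that contains all of its tips. The names LINEAGES and least_inclusive_taxon are invented.

```python
# Simplified lineages (root first) in a shared hierarchy; toy data only.
LINEAGES = {
    "Panthera tigris": ["Mammalia", "Carnivora", "Felidae", "Panthera"],
    "Felis catus": ["Mammalia", "Carnivora", "Felidae", "Felis"],
    "Lynx lynx": ["Mammalia", "Carnivora", "Felidae", "Lynx"],
    "Acinonyx jubatus": ["Mammalia", "Carnivora", "Felidae", "Acinonyx"],
    "Canis lupus": ["Mammalia", "Carnivora", "Canidae", "Canis"],
    "Sorex araneus": ["Mammalia", "Eulipotyphla", "Soricidae", "Sorex"],
}


def least_inclusive_taxon(tips, lineages):
    """Return the deepest taxon shared by every tip, or None if there is none."""
    paths = [lineages[tip] for tip in tips]
    shared = None
    for ranks in zip(*paths):
        if len(set(ranks)) != 1:
            break          # lineages diverge below this point
        shared = ranks[0]
    return shared


if __name__ == "__main__":
    # Two clades from different source trees with no organisms in common.
    clade_a = ["Panthera tigris", "Felis catus"]
    clade_b = ["Lynx lynx", "Acinonyx jubatus"]
    print(least_inclusive_taxon(clade_a, LINEAGES))
    print(least_inclusive_taxon(clade_b, LINEAGES))
```

The two clades share no sampled organisms, yet both normalize to the same taxon, which is what makes them comparable.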

An idea that emerges from this line of thinking is that the central paradigm of systematics is (or should be) the reciprocal illumination of phylogeny and taxonomy. That is, phylogenetic research tests and refines taxonomic concepts, and those taxonomic concepts in turn guide the selection of individual organisms for future research. I would argue that this, in a nutshell, is “phylogenetic synthesis.”

Which brings me to the title of this post. In the PhyloCode, taxonomic names are not hypothetical concepts that can be refuted or refined by data-driven tests. Instead, they are definitions involving specifiers (designated specimens) that are simply applied to source trees that include those specifiers. This is problematic for synthesis because if two source trees differ in topology, and/or they fail to include the appropriate specifiers, it may be impossible to answer the basic question I began with: do the trees share any clades (taxa) in common? If taxa are functions of phylogenetic topology, then there can be no taxonomic basis for meaningfully comparing source trees that either differ in topology, or do not permit the application of taxon definitions.

So phylogenetic synthesis requires taxa that are explicitly not functions of phylogenetic topology. Instead, taxa should exist independently as hypotheses to be tested by phylogenetic evidence, and as systematists we should strive to construct comprehensive taxonomic hierarchies. I think this is going to be the real key to making progress in answering the question, “what do we know about the tree of life, and how do we know it?”
