Tree-for-All hackathon series: taxon sampling, part 2
Sampling taxa in PhyloJiVE, Open Refine, and Arbor
This continues a series of posts featuring results from the recent “Tree-for-all” hackathon (Sept 15 to 19, 2014, U. Mich Ann Arbor) aimed at leveraging data resources of the Open Tree of Life project. To read the whole Tree-for-All series, go to the Introduction page.
More specifically, this is the second of two posts on work of the “taxon sampling” team: Nicky Nicolson (Kew Gardens), Kayce Bell (U. New Mexico), Andréa Matsunaga (U. Florida), Dilrini De Silva (U. Oxford), Jonathan Rees (Open Tree) and Arlin Stoltzfus (NIST). The team got significant help from Arbor team members Zack Galbreath (Kitware) and Curt Lisle (KnowledgeVis).
Products of the “taxon sampling” team
At the tree-for-all hackathon, the “taxon sampling” team took on the challenge of demonstrating approaches to sampling from a taxon, making their products available in their github repo. The group focused its effort on creating multiple implementations for 3 specific use-cases of sampling up to N species from a taxon T:
- sub-setting: get species in T with entries in NCBI genomes
- down-sampling: get a random sample of N species from T
- relevance sampling: get the N species in T with the most records in iDigBio
Each approach relies on 2 key OpenTree web services (described and illustrated in the introduction): the match_names service (click to read the docs) to convert species names to OT taxon ids, and the induced_tree service to get a tree for species designated by ids.
In the previous post, I described 2 projects based on command-line scripts in Python and Perl. Below, I’ll describe how taxon sampling was implemented within existing platforms with graphical user interfaces, including Open Refine (spreadsheets), PhyloJiVE (phylogeographic visualizations), and Arbor (phylogeny workflows).
Relevance sampling in PhyloJIVE
Previously we defined “relevance sampling” as finding a subset of species in some taxon that is the most relevant by some external measure, e.g., number of hits in google (popular species). In particular, the taxon-sampling team defined its target relative to iDigBio (Integrated Digitized Biocollections), which makes data and images for millions of biological specimens available in electronic format for the research community, government agencies, students, educators, and the general public. The challenge is to get a tree for the N species in taxon T with the most records in iDigBio. Because iDigBio has its own web-services interface, we can query it automatically using scripts.
A live demo provides access to several pre-configured queries. For instance, choosing the “top 10 Felidae” menu item returns an OpenTree phylogeny for the 10 cat species most frequently implicated by iDigBio records. In the resulting view (above), mousing over the boxes reveals the number of records for each species. Clicking on a species (e.g., Leopardus pardalis above), shows a map of occurrence records.
Sub-setting and relevance-sampling in Open Refine
Open Refine (formerly Google Refine) is an open-source data management tool with an interface like a spreadsheet, but with some of the features of a database. Nicky Nicolson (Kew Gardens) teamed up with Andréa Matsunaga (U. Florida) to explore how Open Refine’s scriptable features can be used to populate a spreadsheet with occurrence data from iDigBio (obtained via iDigBio’s web services), as shown above.
The value of this demonstration, explained more fully on the refine-opentree project wiki, is that the user has considerable flexibility to create and manage a set of data using the Open Refine spreadsheet features, but also has the power to invoke external web services from iDigBio and OpenTree.
Sub-setting and relevance-sampling in Arbor
Arbor (http://arborworkflows.com) provides a framework for constructing and executing workflows used in evolutionary analysis. Andréa Matsunaga (U. Florida) and Kayce Bell (U. New Mexico) worked with Arbor developers Zack Galbreath (Kitware) and Curt Lisle (KnowledgeVis) to implement approaches to sub-setting and relevance-sampling by producing code and workflows in python/Arbor. A live demo of Arbor that includes OpenTree menu items is accessible at arbor.kitware.com. One of the nice things about Arbor is that it provides a graphical workflow editor, allowing you to piece together workflows from modules, by connecting inputs and outputs. The workflow shown below begins by querying iDigBio, and ends with generating an image of a tree.
To view the OpenTree-specific menu items on the public Arbor instance hosted at kitware.com, you must click on the view (eye) icon next to “OpenTree.” Be warned that, at present, menu items are undergoing changes. The menu item currently entitled “Get ranked scoped scientific names from iDigBio” will return a list of species names that can then be used to retrieve a tree from OpenTree. The analysis takes the scientific names of various ranks (or scope) or as a taxonomic search (leave scope at _all), and will return a list of species of the specified size, consisting either of the top-ranked species (most records) or a random set of species that meet the criteria, depending on what you specify. This also has been incorporated into a menu item (“Workflow to get an induced tree from a configurable iDigBio query”) with the specifications for the iDigBio search as the input— this is the workflow shown above. In the screencast below (bottom of page), Kayce Bell explains exactly how to carry out the individual steps in Arbor.
As Arbor’s interface is designed to allow users to execute a variety of analyses on user-supplied data, there are ways to upload your own tabular data for processing. Currently data is expected to be in CSV format. Algorithms exist in Arbor to match species names against the OpenTree TNRS, request a tree matching specific taxa, and perform comparative analysis on trees and tables. Some auto discovery of tabular taxa names is supported, but it is recommended to have a first column entitled “species”, “name”, or “scientific name”. Online documentation for Arbor is currently being developed, and will be available through the Arbor website.
 The identification of any specific commercial products is for the purpose of specifying a protocol, and does not imply a recommendation or endorsement by the National Institute of Standards and Technology.