
Tree-for-All hackathon series: Taxon sampling, part 1 

Sampling taxa with Python and Perl scripts

This continues a series of posts featuring results from the recent “Tree-for-all” hackathon (Sept 15 to 19, 2014, U. Mich Ann Arbor) aimed at leveraging data resources of the Open Tree of Life project.  To read the whole series, go to the Introduction page.

More specifically, this is the first of two posts addressing the outputs of the “Sampling taxa” team, consisting of Nicky Nicolson (Kew Gardens), Kayce Bell (U. New Mexico), Andréa Matsunaga (U. Florida), Dilrini De Silva (U. Oxford), Jonathan Rees (OpenTree) and Arlin Stoltzfus (NIST).[1]

The “taxon sampling” idea

Although users seeking a tree may have a predetermined set of species in mind, often the user is focused on taxon T without having a prior list of species. For instance, the typical user interested in a tree of mammals does not really want the full tree of > 5000 known species of mammals, but some subset, e.g., a tree with a random subset of 100 species, or a tree of the 94 species with known genomes in NCBI, or a tree with one species for each of ~150 mammal families.

If we think about this more broadly, we can identify a number of different types of sampling, depending on what kinds of information we are using, and how we are using it. First, sampling T by sub-setting is simply getting all the species in T that satisfy some criterion, e.g., being on the IUCN red list of endangered species,  or having a genome entry in NCBI genomes, a species page in EOL, or an image in phylopic.org (organism silhouettes for adorning trees).

Second, we might use a kind of hierarchical taxonomic sampling to get 1 (or more) species from each genus (or family, order, etc.).


Poster from hackathon day 1, making the pitch for sampling taxa as a hackathon target

Third, we could reduce the complexity of a taxon or clade without using any outside information (what we might call down-sampling), e.g., get a random sample of N species from taxon T, down-sample nodes according to subnode density, or choose N species to maximize phylogenetic diversity.

Finally, we can imagine a kind of relevance sampling, where we choose (from taxon T) the top N species based on some external measure of importance or relevance, e.g., the number of occurrence records in iDigBio (or GBIF, iNaturalist, etc.), the number of google hits (i.e., popular species), or the number of PubMed hits (i.e., biomedically relevant species).

Products of the “taxon sampling” team

At the tree-for-all hackathon, the “taxon sampling” team took on the challenge of demonstrating approaches to sampling from a taxon, making their products available in their github repo. The group focused its effort on creating multiple implementations for 3 specific use-cases:

  • sub-setting: get species in T with entries in NCBI genomes
  • down-sampling: get a random sample of N species from T
  • relevance sampling: get the N species in T with the most records in iDigBio

Each approach relies on 2 key OpenTree web services (described and illustrated in the introduction): the match_names service (click to read the docs) to match species names to OpenTree taxon identifiers (ottIds), and the induced_tree service to get a tree for species designated by these identifiers.

Here I’ll describe two projects based on command-line scripts in Python and Perl.  In the next post, I’ll describe how taxon sampling was implemented within an existing platform with a graphical user interface, including Open Refine (spreadsheets), PhyloJIVE, and Arbor.

Down-sampling in Python

A simple down-sampling approach via random choice is implemented in the random_sample.py script developed by Dilrini De Silva (Oxford) and Jonathan Rees (OpenTree), as in this example:

python random_sample.py -t Mammalia -m random -n 50 -o my_induced_tree.nwk

Here, “Mammalia” can be replaced by another taxon name, “50” may be replaced by another number, and the -o flag specifies an output file. The script calls OpenTree services via the ‘opentreelib’ python library (another hackathon product available on github). It retrieves the unique ottId of the higher taxon specified via the -t flag, queries OpenTree for the subtree under that node, parses the subtree to identify the species it contains, selects a random sample of those species, and requests the induced tree, writing it to a newick file.
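The sampling step in the middle of this workflow is simple to sketch. The snippet below is not the hackathon script itself, just a minimal Python illustration (function names are mine) of pulling leaf labels out of a Newick string and drawing a random sample of N of them; in the real script, the Newick string comes from OpenTree’s subtree service, and the sampled names are fed back to the induced_subtree service.

```python
import random
import re

def leaf_names(newick):
    """Extract leaf labels from a Newick string.

    A leaf label is any token directly following '(' or ',';
    tokens following ')' are internal-node names and are skipped.
    Any ':branch_length' suffix is excluded by the character class.
    """
    return re.findall(r"(?<=[(,])[^(),;:]+", newick)

def sample_leaves(newick, n, seed=None):
    """Return a random sample of up to n leaf labels."""
    leaves = leaf_names(newick)
    return random.Random(seed).sample(leaves, min(n, len(leaves)))
```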
This script also invokes a rendering library to create a graphic image of the tree from the command-line, as in the example (figure) showing a random sample of 10 mammals.

Sub-setting in Perl

The specific sub-setting challenge that the team picked was to get a tree for those species (in a named taxon) that have a genome entry in NCBI genomes. NCBI offers a programmable web-services interface called “eutils” to access its databases. Because NCBI searches can be limited to a named taxon, it is possible to query the genomes database with the “esearch” service for “Mammalia” (or Carnivora, Reptilia, Felidae, Thermoprotei), cross-link to NCBI’s taxonomy database using the “elink” service, get the species names using the “esummary” service, and then use OpenTree services (as described in the Introduction) to match names and extract the induced tree.

This 5-step workflow, which illustrates the potential for chaining together web services to build useful tools, was implemented by Arlin Stoltzfus (NIST) as a set of Perl scripts. The master script invokes 5 other standalone scripts, one for each step. The last 2 scripts are simply command-line wrappers for OpenTree’s match_names and induced_subtree methods. All the scripts are available in the Perl subdirectory of the team’s github repo. They are demonstrated in the brief (<2 min) screencast below.
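To give a feel for how such a chain looks, here is a Python sketch of the query URLs for the first three steps. This is not the team’s Perl code; the parameter values are examples, and a real run would also parse each response to extract the ids passed to the next step.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(db, term, retmax=500):
    # Step 1: find record ids in a database, e.g. genomes for Mammalia
    return EUTILS + "/esearch.fcgi?" + urlencode(
        {"db": db, "term": term, "retmax": retmax})

def elink_url(dbfrom, db, ids):
    # Step 2: cross-link record ids from one database to another
    return EUTILS + "/elink.fcgi?" + urlencode(
        {"dbfrom": dbfrom, "db": db, "id": ",".join(map(str, ids))})

def esummary_url(db, ids):
    # Step 3: fetch summaries (including species names) for taxonomy ids
    return EUTILS + "/esummary.fcgi?" + urlencode(
        {"db": db, "id": ",".join(map(str, ids))})
```

Steps 4 and 5 (match_names and induced_subtree) then use the OpenTree services exactly as shown in the Introduction.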

Next

The taxon sampling group produced several other products.  In the next post, I’ll describe how taxon sampling was implemented within environments that provide a graphical user interface, including Open Refine (spreadsheets), PhyloJIVE (phylogeographic visualization), and Arbor (phylogeny workflows).


[1] The identification of any specific commercial products is for the purpose of specifying a protocol, and does not imply a recommendation or endorsement by the National Institute of Standards and Technology.

 


Tree-for-All hackathon series: taxon sampling, part 2

Sampling taxa in PhyloJIVE, Open Refine, and Arbor

This continues a series of posts featuring results from the recent “Tree-for-all” hackathon (Sept 15 to 19, 2014, U. Mich Ann Arbor) aimed at leveraging data resources of the Open Tree of Life project.  To read the whole Tree-for-All series, go to the Introduction page.

More specifically, this is the second of two posts on work of the “taxon sampling” team: Nicky Nicolson (Kew Gardens), Kayce Bell (U. New Mexico), Andréa Matsunaga (U. Florida), Dilrini De Silva (U. Oxford), Jonathan Rees (Open Tree) and Arlin Stoltzfus (NIST).[1]  The team got significant help from Arbor team members Zack Galbreath (Kitware) and Curt Lisle (KnowledgeVis).

Products of the “taxon sampling” team

At the tree-for-all hackathon, the “taxon sampling” team took on the challenge of demonstrating approaches to sampling from a taxon, making their products available in their github repo. The group focused its effort on creating multiple implementations for 3 specific use-cases of sampling up to N species from a taxon T:

  • sub-setting: get species in T with entries in NCBI genomes
  • down-sampling: get a random sample of N species from T
  • relevance sampling: get the N species in T with the most records in iDigBio

Each approach relies on 2 key OpenTree web services (described and illustrated in the introduction): the match_names service (click to read the docs) to convert species names to OT taxon ids, and the induced_tree service to get a tree for species designated by ids.

In the previous post, I described 2 projects based on command-line scripts in Python and Perl.  Below, I’ll describe how taxon sampling was implemented within existing platforms with graphical user interfaces, including Open Refine (spreadsheets), PhyloJIVE (phylogeographic visualizations), and Arbor (phylogeny workflows).

Relevance sampling in PhyloJIVE

Previously we defined “relevance sampling” as finding a subset of species in some taxon that is the most relevant by some external measure, e.g., number of hits in google (popular species).  In particular, the taxon-sampling team defined its target relative to iDigBio (Integrated Digitized Biocollections), which makes data and images for millions of biological specimens available in electronic format for the research community, government agencies, students, educators, and the general public.  The challenge is to get a tree for the N species in taxon T with the most records in iDigBio.  Because iDigBio has its own web-services interface, we can query it automatically using scripts.
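The ranking step itself is simple once the counts are in hand. Below is a hedged Python sketch; the species names and counts are made up for illustration, and in practice the counts would come from querying iDigBio’s web-services interface.

```python
def top_n_species(record_counts, n):
    """Rank species by record count; ties broken alphabetically."""
    ranked = sorted(record_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:n]]

# Illustrative numbers, not real iDigBio data
counts = {
    "Leopardus pardalis": 900,
    "Panthera leo": 500,
    "Felis catus": 1200,
    "Lynx rufus": 700,
}
```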

A version of relevance sampling was implemented by Andréa Matsunaga (U. Florida) to show how phylogenies can be integrated into an environment for analyzing biodiversity data. For this demonstration, OpenTree services were invoked from within PhyloJIVE (Phylogeny Javascript Information Visualiser and Explorer), a web-based application that places biodiversity information (aggregated from many sources) onto compact phylogenetic trees.


PhyloJIVE live demo software developed by hackathon participant Andréa Matsunaga.  Choosing the “top 10 Felidae” menu item queries iDigBio for the cats with the most records, then obtains a tree on the fly by querying Open Tree.  Clicking on Leopardus pardalis (ocelot) on the resulting tree opens a map viewer showing the locations associated with records (red dots).

A live demo provides access to several pre-configured queries. For instance, choosing the “top 10 Felidae” menu item returns an OpenTree phylogeny for the 10 cat species most frequently implicated by iDigBio records. In the resulting view (above), mousing over the boxes reveals the number of records for each species. Clicking on a species (e.g., Leopardus pardalis above), shows a map of occurrence records.

Sub-setting and relevance-sampling in Open Refine

OpenRefine spreadsheet populated with counts of occurrence records captured by invoking iDigBio webservices directly


Open Refine (formerly Google Refine) is an open-source data management tool with an interface like a spreadsheet, but with some of the features of a database.  Nicky Nicolson (Kew Gardens) teamed up with Andréa Matsunaga (U. Florida) to explore how Open Refine’s scriptable features can be used to populate a spreadsheet with occurrence data from iDigBio (obtained via iDigBio’s web services), as shown above.

Phylogeny view generated from within OpenRefine by invoking a javascript phylogeny viewer


Further scripting can be used to generate a column of OpenTree taxonomy ids from a column of species names, by invoking the tnrs/match_names service. Finally, one can submit a query for the induced tree for a selected column of species identifiers. The image above shows a custom “OpenTree” item added to the Open Refine menu to retrieve a tree, which is then visualized using a JavaScript viewer.

The value of this demonstration, explained more fully on the refine-opentree project wiki, is that the user has considerable flexibility to create and manage a set of data using the Open Refine spreadsheet features, but also has the power to invoke external web services from iDigBio and OpenTree.

Sub-setting and relevance-sampling in Arbor

Arbor (http://arborworkflows.com) provides a framework for constructing and executing workflows used in evolutionary analysis.   Andréa Matsunaga (U. Florida) and Kayce Bell (U. New Mexico) worked with Arbor developers Zack Galbreath (Kitware) and Curt Lisle (KnowledgeVis) to implement approaches to sub-setting and relevance-sampling by producing code and workflows in python/Arbor.  A live demo of Arbor that includes OpenTree menu items is accessible at arbor.kitware.com.   One of the nice things about Arbor is that it provides a graphical workflow editor, allowing you to piece together workflows from modules, by connecting inputs and outputs.  The workflow shown below begins by querying iDigBio, and ends with generating an image of a tree.

arborworkflow

High-level view of Arbor workflow to capture iDigBio records, and then acquire matching taxon names and the induced tree from OpenTree

To view the OpenTree-specific menu items on the public Arbor instance hosted at kitware.com, you must click on the view (eye) icon next to “OpenTree.”  Be warned that, at present, menu items are undergoing changes. The menu item currently entitled “Get ranked scoped scientific names from iDigBio” returns a list of species names that can then be used to retrieve a tree from OpenTree. The analysis takes scientific names at various ranks (the scope), or performs a full taxonomic search (leave scope at _all), and returns a list of species of the specified size, consisting either of the top-ranked species (most records) or a random set of species that meet the criteria, depending on what you specify. This has also been incorporated into a menu item (“Workflow to get an induced tree from a configurable iDigBio query”) that takes the specifications for the iDigBio search as input; this is the workflow shown above.   In the screencast below (bottom of page), Kayce Bell explains exactly how to carry out the individual steps in Arbor.

As Arbor’s interface is designed to allow users to execute a variety of analyses on user-supplied data, there are ways to upload your own tabular data for processing. Currently data is expected to be in CSV format. Algorithms exist in Arbor to match species names against the OpenTree TNRS, request a tree matching specific taxa, and perform comparative analysis on trees and tables. Some auto discovery of tabular taxa names is supported, but it is recommended to have a first column entitled “species”, “name”, or “scientific name”. Online documentation for Arbor is currently being developed, and will be available through the Arbor website.

 


[1] The identification of any specific commercial products is for the purpose of specifying a protocol, and does not imply a recommendation or endorsement by the National Institute of Standards and Technology.


Tree-for-All hackathon series: Introduction

The Tree-for-All: Introduction

Welcome to the first in a series of posts featuring results from the recent “Tree-for-all” hackathon (Sept 15 to 19, 2014, U. Mich., Ann Arbor) aimed at leveraging data resources of the Open Tree of Life project.   This post is written by Arlin Stoltzfus (NIST)[1], one of the hackathon organizers (but not affiliated with Open Tree in any other way).  Below, I’m going to introduce the rationale and aims of the hackathon, describe the process, and summarize some of the projects.  In subsequent posts, we will discuss products and lessons learned.  The list of forward links on the Introduction page will be updated as new posts appear.

Motivation: bridging the accessibility gap

The Open Tree of Life project aims to provide data resources for the scientific community, including

  • a grand synthetic tree covering millions of species, generated from thousands of source trees
  • a database of the source trees, published species trees used to generate the synthetic tree
  • a reference taxonomy used (among other things) to align names from different sources

The premise of synthesizing a grand Tree of Life, and making it available with source studies and a reference taxonomy, is that these resources are valuable.  To assess the value of these resources right now would be premature— we will  return to that question later.  For now, I will just point out that, until recently, when scientists in the bioinformatics community have needed a tree broadly covering the kingdoms of life, they have used the NCBI taxonomy hierarchy (multiple examples are cited by Stoltzfus, et al., 2012), an approach that causes phylogeneticists and systematists to groan.  Surely we are better off now, but determining how much better off we are probably will require further analysis.

For the present, it is important to understand that the value of a community resource is predicated on accessibility.  Most users would not know how to handle a tree with 3 million species, useful or not.  For the value of OpenTree’s resources to be realized, it is important to anticipate the needs of users, and support them with appropriate tools.

The aim of the recent Tree-for-all hackathon was to begin bridging this accessibility gap.  More specifically, the aim of the hackathon was to build capacity for the community to leverage Open Tree’s resources via their recently announced web services API (Application Programming Interface).   This enhanced capacity may take the form of end-user tools, library code, standards, and designs.

Technology: web services

Web services are a natural choice for accessibility, because they provide programmable access to a resource to anyone with a networked computer.  Most of the time when you use the web, you are sending a request for a specific page, and receiving results in HTML that are rendered by your browser.  But more generally, web services work by a standard protocol that allows you to send data and commands, and receive results.

Some services are so simple that you can access them just by typing in the URL box of your browser.  For instance, TreeBASE has a web-services API that allows you to access data with commands such as

http://purl.org/phylo/treebase/phylows/tree/TB2:Tr2026?format=nexus

which retrieves a particular tree in NEXUS format.  When that isn’t enough, you can use a command-line tool such as  cURL (command-line URL), found on most computer systems.   I’ll give an example using cURL, then explain how to use a Chrome extension called DHC that provides a graphical user interface.

Open Tree’s web API can do many things, but let’s start with something simple: find out what the synthetic tree implies about the relationships of a set of named species, “Panthera tigris”, “Sorex araneus”, “Erinaceus europaeus”.   To get the tree, we need to chain together a workflow based on 2 web services, the match_names service (click to read the docs) to convert species names to OT taxon identifiers, and the induced_tree service to get a tree for species designated by identifiers.  In the first step, using cURL, we issue this command:

curl -X POST http://api.opentreeoflife.org/v2/tnrs/match_names \
-H "content-type:application/json" \
-d '{"names":["Panthera tigris","Sorex araneus","Erinaceus europaeus"]}'

This command matches our list of input names with the names in OpenTree’s taxonomy. If a species is in the tree, it will have an id in the taxonomy. The output of this command yields the matching identifiers 633213, 796660, and 42314.  To find them, scroll through the output and look for the “ottId” field, which refers to Open Tree taxonomy ids.  Once we have those ids, the next step is to use them to request the tree:
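Rather than scrolling by eye, a few lines of Python can pull the ids out of the JSON response. This sketch (my own, not part of the hackathon products) walks the response generically, collecting any integer stored under an ottId-style key; the exact key spelling varies between API versions, so several variants are accepted.

```python
def collect_ott_ids(obj):
    """Recursively gather ottId values from a match_names response."""
    ids = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key in ("ot:ottId", "ottId", "ott_id") and isinstance(value, int):
                ids.append(value)
            else:
                ids.extend(collect_ott_ids(value))
    elif isinstance(obj, list):
        for item in obj:
            ids.extend(collect_ott_ids(item))
    return ids
```

Applied to the response for the three names above, this gathers 633213, 796660, and 42314 (in the order they appear in the response).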

curl -X POST http://api.opentreeoflife.org/v2/tree_of_life/induced_subtree \
-H "content-type:application/json" \
-d '{"ott_ids":[633213, 796660, 42314]}'

which returns a Newick tree (embedded in JSON). OpenTree’s interface refers to this as the “induced” tree, though perhaps it is more appropriately called the implied tree: for any set of nodes in the synthetic tree, the structure of the larger tree immediately implies a topology for the subset, e.g., the tree of A, C and E implied by (A,(B,(C,(D,E)))) is (A,(C,E)).
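The operation is easy to state recursively: keep a node only if some descendant is in the chosen set, and suppress any node left with a single child. A small Python sketch (labels only; it assumes no internal-node labels or branch lengths) reproduces the example:

```python
def parse_newick(s):
    """Parse a Newick string of leaf labels into nested lists."""
    s = s.strip().rstrip(";")
    pos = 0
    def node():
        nonlocal pos
        if s[pos] == "(":
            pos += 1
            children = [node()]
            while s[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1  # consume ')'
            return children
        start = pos
        while pos < len(s) and s[pos] not in "(),":
            pos += 1
        return s[start:pos]
    return node()

def induce(tree, keep):
    """Restrict a tree to the leaves in `keep`, suppressing unary nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [k for k in (induce(c, keep) for c in tree) if k is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else kids

def to_newick(tree):
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(to_newick(c) for c in tree) + ")"
```

For instance, `to_newick(induce(parse_newick("(A,(B,(C,(D,E))))"), {"A", "C", "E"}))` gives `(A,(C,E))`, matching the example above.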

To run these commands in DHC, start with the cURL command above, then copy and paste the service (the “http” part) and the body (after the -d), into the appropriate boxes, click on “JSON” below the body window (or set the header to content-type: application/json), choose “POST”, then hit “Send”.  The output will appear below.


DHC allows you to use web services in a one-off manner, interactively, but the real power of web services starts to emerge when they are invoked and processed in an automated way, within another program.

Process: Hackathon

Open Tree announced version 1 of its web services in May, at the same time we distributed an open call for participation in a “Tree-for-all” hackathon, which took place September 15 to 19 at University of Michigan, Ann Arbor.  The hackathon was organized and funded by Open Tree, the Arbor workflows project and NESCent’s HIP (Hackathons, Interoperability, Phylogenies) working group.

What, exactly, is a hackathon?  A hackathon is an intensive bout of computer programming, usually with a scope that allows for considerable creativity (when the objectives are pre-determined, the event might be called a “code sprint” instead).  Often it involves bringing together people who haven’t worked face-to-face before.

The tree-for-all hackathon followed a plan for a participant-driven 5-day meeting with ~30 people.  The participant pool is seeded with some hand-picked developers, but consists mainly of folks who have responded to an open call.  The people chosen to participate are not all elite super-coders— some are subject-matter experts without advanced coding skills.  On the morning of day 1, these participants hear informational presentations— in this case, about Open Tree’s data and services (above), the Arbor workflow project, and HIP’s vision of an interoperable web of evolutionary resources.  This is followed by open discussion of possible projects, a process that typically begins (via email list) long before the hackathon.

On the afternoon of Day 1 comes the make-or-break moment: pitching and team-formation.  Participants with ideas stand up, make a pitch for a software development target, and post it on the wall using a giant sticky note.  Others move from pitch to pitch, critiquing, suggesting ideas, and trying to find where they could contribute (or learn) the most.  Pitches evolve through this process, and eventually a set of teams emerges.  From this point on— days 2 to 5 of the hackathon— the meeting belongs to the teams.  The hackathon will succeed or fail, depending on the strength of the teams.

Hackathon participants gather to hear a progress report.  Left to right: Matt Yoder, Stephen Smith, Cody Hinchliff (standing), Andréa Matsunaga, Joseph Brown, Zack Galbreath (standing), Chodon Sass, Alex Harkess, Julienne Ng (eyes only),  Katie Lyons, Gaurav Vaidya (standing), Jorrit Poelen, Shan Kothari (facing left), David Winter, Julie Allen (standing), Karolis Ramanauskas, Nicky Nicolson, Josef Uyeda, Miranda Sinnott-Armstrong (standing), Rachel Warnock, François Michonneau, Luke Harmon, Kayce Bell, Jon Hill's right arm.


Outcomes: Hackathon team projects

Over the coming weeks, I’m going to write about hackathon team projects and, ideally, provoke some other hackathon participants to do the same.  Hackathon teams are instructed (and cajoled) to focus on tangible outcomes, and the Tree-for-All hackathon produced a lot of them!  For now, here is a brief synopsis.

Integration of Trees and Traits involved hackathon participants Jeff Cavner (remote), Luke Harmon, Zack Galbreath, Jorrit Poelen, Julienne Ng, Alex Harkess, Chodon Sass, Shan Kothari, and Mark Westneat (remote).   They aimed to develop ways to integrate Open Tree’s resources into workflows for analysis of character data and other data.  They already have a nice presentation on their wiki.

Library wrappers for OT APIs involved Joseph Brown, Mark Holder (remote), Jon Hill, Matt Yoder, François Michonneau, Jeet Sukumaran, David Winter, and Karolis Ramanauskas.  The aim of this group was to develop programmable interfaces to Open Tree’s web services in Python, Ruby and R.  They developed an innovative test scheme in which all the libraries were subjected to the same tests.

Phylogeny visualization style-sheets were the focus of Peter Midford (remote), Jim Allman (remote), Pandurang Kolekar (remote), Daisie Huang, Gaurav Vaidya, Julie Allen, and Mike Rosenberg (remote).  Every year thousands of researchers generate  tree images, import them into a graphics editor, and add the same kinds of adornments (colored branches, numbers on nodes, images at the tips, brackets, etc).   The aim of this group was to develop and implement a scheme to treat graphical markups as styles in a separate document (because most tree formats don’t have room for markup), analogous to stylesheets for web pages.

The taxon sampling team included Andréa Matsunaga, Kayce Bell, Dilrini de Silva, Jonathan Rees, Nicky Nicolson and Arlin Stoltzfus.  This group focused on ways to get a phylogeny that represents a sample from a larger taxon— a sample that integrates some useful data, or is otherwise representative of the taxon.

The branch lengths team, including Lyndon Coghill (remote), Rachel Warnock, Josef Uyeda, Katie Lyons, Miranda Sinnott-Armstrong, Bob Thacker (remote), and Curt Lisle (remote) explored ways to address the challenge of adding branch lengths to the synthetic tree.  Like most supertrees, the synthetic tree lacks branch lengths, which limits its usefulness in many kinds of evolutionary studies.

A major knowledge engineering challenge for the Tree of Life community is to link knowledge to nodes in a comprehensive tree, and then ensure that this knowledge persists (as appropriate) when the tree is updated.  A scheme for addressing this challenge was developed and implemented by the annotation database group, including Cody Hinchliff, Karen Cranston, Stephen Smith, Joseph Brown, Mark Holder (remote), Hilmar Lapp (remote) and Temi Varghese.

Next

Next week, I’ll start to describe the work of the taxon sampling team.  To be sure you hear about future posts, click “Follow” in the WordPress bar above this pane.

 


[1] The identification of any specific commercial products is for the purpose of specifying a protocol, and does not imply a recommendation or endorsement by the National Institute of Standards and Technology.