Sampling taxa with Python and Perl scripts
This continues a series of posts featuring results from the recent “Tree-for-all” hackathon (Sept 15 to 19, 2014, U. Mich Ann Arbor) aimed at leveraging data resources of the Open Tree of Life project. To read the whole series, go to the Introduction page.
More specifically, this is the first of two posts addressing the outputs of the “Sampling taxa” team, consisting of Nicky Nicolson (Kew Gardens), Kayce Bell (U. New Mexico), Andréa Matsunaga (U. Florida), Dilrini De Silva (U. Oxford), Jonathan Rees (OpenTree) and Arlin Stoltzfus (NIST).
The “taxon sampling” idea
Although users seeking a tree may have a predetermined set of species in mind, often the user is focused on taxon T without having a prior list of species. For instance, the typical user interested in a tree of mammals does not really want the full tree of > 5000 known species of mammals, but some subset, e.g., a tree with a random subset of 100 species, or a tree of the 94 species with known genomes in NCBI, or a tree with one species for each of ~150 mammal families.
If we think about this more broadly, we can identify a number of different types of sampling, depending on what kinds of information we are using, and how we are using it. First, sampling T by sub-setting is simply getting all the species in T that satisfy some criterion, e.g., being on the IUCN red list of endangered species, or having a genome entry in NCBI genomes, a species page in EOL, or an image in phylopic.org (organism silhouettes for adorning trees).
Second, we might use a kind of hierarchical taxonomic sampling to get 1 (or more) species from each genus (or family, order, etc.).
Poster from hackathon day 1, making the pitch for sampling taxa as a hackathon target
Third, we could reduce the complexity of a taxon or clade without using any outside information— what we might call down-sampling—, e.g., get a random sample of N species from taxon T, down-sample nodes according to subnode density, or choose N species to maximize phylogenetic diversity.
Finally, we can imagine a kind of relevance sampling, where we choose (from taxon T) the top N species based on some external measure of importance or relevance, e.g., the number of occurrence records in iDigBio (or GBIF, iNaturalist, etc.), the number of google hits (i.e., popular species), or the number of PubMed hits (i.e., biomedically relevant species).
Products of the “taxon sampling” team
At the tree-for-all hackathon, the “taxon sampling” team took on the challenge of demonstrating approaches to sampling from a taxon, making their products available in their github repo. The group focused its effort on creating multiple implementations for 3 specific use-cases:
- sub-setting: get species in T with entries in NCBI genomes
- down-sampling: get a random sample of N species from T
- relevance sampling: get the N species in T with the most records in iDigBio
Each approach relies on 2 key OpenTree web services (described and illustrated in the introduction): the match_names service (click to read the docs) to match species names to OpenTree taxon identifiers (ottIds), and the induced_tree service to get a tree for species designated by these identifiers.
Here I’ll describe two projects based on command-line scripts in Python and Perl. In the next post, I’ll describe how taxon sampling was implemented within an existing platform with a graphical user interface, including Open Refine (spreadsheets), PhyloJIVE, and Arbor.
Down-sampling in Python
A simple down-sampling approach via random choice is implemented in the random_sample.py script developed by Dilrini De Silva (Oxford) and Jonathan Rees (OpenTree), as in this example:
python random_sample.py -t Mammalia -m random -n 50 -o my_induced_tree.nwk
Here, “Mammalia” can be replaced by another taxon name, “50” may be replaced by another number, and the -o flag is used to specify an output file. The script calls on the OpenTree functions via the ‘opentreelib’ python library (another hackathon product available on github) to interact with OpenTree. It retrieves the unique OTTid of a higher taxon specified via the -t flag, and queries OpenTree to retrieve a subtree under that node. It parses the subtree to identify the implicated species, selects a random sample of the species, and requests the induced subtree, writing this to a newick file.
This script also invokes a rendering library to create a graphic image of the tree from the command-line, as in the example (figure) showing a random sample of 10 mammals.
Sub-setting in Perl
The specific sub-setting challenge that the team picked was to get a tree for those species (in a named taxon) that have a genome entry in NCBI genomes. NCBI offers a programmable web-services interface called “eutils” to access its databases. Because NCBI searches can be limited to a named taxon, it is possible to query the genomes database with the “esearch” service for “Mammalia” (or Carnivora, Reptilia, Carnivora, Felidae, Thermoprotei), cross-link to NCBI’s taxonomy database using the “elink” service, get the species names using the “esummary” service, and then use OpenTree services (as described in the Introduction) to match names and extract the induced tree.
This 5-step workflow, which illustrates the potential for chaining together web services to build useful tools, was implemented by Arlin Stoltzfus (NIST) as a set of Perl scripts. The master script invokes 5 other standalone scripts, one for each step. The last 2 scripts are simply command-line wrappers for OpenTree’s match_names and induced_subtree methods. All the scripts are available in the Perl subdirectory of the team’s github repo. They are demonstrated in the brief (<2 min) screencast below.
The taxon sampling group produced several other products. In the next post, I’ll describe how taxon sampling was implemented within environments that provide a graphical user interface, including Open Refine (spreadsheets), PhyloJIVE (phylogeographic visualization), and Arbor (phylogeny workflows).
 The identification of any specific commercial products is for the purpose of specifying a protocol, and does not imply a recommendation or endorsement by the National Institute of Standards and Technology.