FuturePhy

This is the first in a series of posts about several phylogeny initiatives, newly funded by NSF, that focus on both technical and community aspects of phylogeny. There is plenty of potential for mutually beneficial work with OpenTree, and we are excited to help.

First up… FuturePhy!

FuturePhy is an NSF-sponsored, three-year program of conferences, workshops and hackathons on the Tree of Life. The project aims to promote novel, integrative data analyses and visualization, interdisciplinary syntheses of phylogenetic sciences, and cross-cutting uses of phylogenetics to develop and address new research questions and applications.

The first phase of this mission is critical: to bring together a broad community of people from diverse backgrounds who are active in phylogenetics research, who use the tree of life in research or education, who will benefit in applied or practical ways from a comprehensive tree of life, or who come from a background that offers new perspectives on defining, addressing or transcending key challenges in phylogenetics.

Help accelerate progress in all aspects of phylogenetics research by joining FuturePhy today. Diverse opportunities will be available to attend FuturePhy sessions in person or virtually, and to link FuturePhy to existing projects and initiatives.

We invite you to participate in the project in several ways:

  1. Register on futurephy.org. Scientists from all aspects of the phylogenetic sciences, educators, members of the tree-using community, and others interested in phylogenetics are welcome.
  2. Take the community survey and let FuturePhy know what workshop and hackathon topics they should fund.
  3. Contribute to the discussion forum on futurephy.org. This is the best way to log your interest and contribute ideas.
  4. Send email to contact@futurephy.org with ideas or comments.
  5. Tweet to the FuturePhy community: @FuturePhy
  6. Comment in the FuturePhy phylobabble thread

Crandall Lab Update: What can we do with synthetic trees?

Currently, the Crandall Lab is examining ways to use the underlying OpenTree taxonomy to gather metadata, associate it with nodes and tips in our synthetic trees, and apply it to evolutionary studies. Below we discuss them in the context of ongoing projects in the lab.

 

Curated taxonomy

One of the major outcomes of the OpenTree project is the underlying taxonomy. A curated taxonomy allows us to search and align names across independent databases to pull out additional information to associate with node and tip names. The Crandall Lab taxonomy curation started with the freshwater crayfish, which are in the infraorder Astacidea and include 711 species spread among 7 families. This was a great group to start with because the number of species is limited and only a few active systematists are revising the alpha-taxonomy, which makes the literature less dense and easier to work with. Initially, our investigations deemed the taxonomy very accurate; the main issue we had to contend with was spelling errors introduced when sequences were deposited into GenBank. In all, we identified and removed 10 misspelled taxa. Although that seems small, it was a great warm-up for two larger groups we are now working on: the Decapoda (crabs, shrimps, lobsters), which includes ~15,000 species, and the Hemiptera (true bugs), which includes ~50,000-80,000 species.

 

Using a curated taxonomy to obtain additional data

As mentioned above, once the names have been curated we can use them to search across databases. This has been extremely useful in obtaining additional metadata to associate with our synthetic trees. For example, the Crandall Lab recently published a synthetic tree of the crayfish (Fig. 1), which included IUCN Red List values plotted for those taxa with assigned values (Owen et al. 2015, Richman et al. 2015). This is only feasible because we are able to search across IUCN Red Listed crayfish species using the OpenTree curated taxonomy names.

Other applications of using a curated taxonomy to obtain metadata include searching across GenBank to identify whether a particular taxon or rank has molecular data associated with it. This is useful for determining sampling strategies for new and continuing studies. For example, using the OpenTree taxonomy to search GenBank for Hemiptera families and genera, we found that a wealth of sequence data has been generated for most of the higher taxa. The most diverse suborder within Hemiptera is Heteroptera, and our query of names against NCBI GenBank suggests that 70 of the 83 described families within Heteroptera have sequence data for one or more of the traditional eight molecular loci used in Hemiptera systematics (Fig. 2A). As for the Hemiptera genera identified in GenBank, we are currently validating the numbers in Fig. 2B, because Hemiptera alpha-taxonomy is very active: many species are vectors for human pathogens or are agricultural pests (e.g., kissing bugs, aphids, psyllids).

 

In addition to searching GenBank, we are currently associating geographic, morphological, and ecological metadata with our curated names through GBIF and EOL TraitBank. We believe the curated OpenTree taxonomies of these groups and the accumulation of metadata for taxa will add a new dimension to our evolutionary studies and allow us to expand the scope of the questions we can answer.

Figure 1 Synthetic tree of crayfish with 20 source trees.  Family names noted on the edge of the synthetic tree.  Paraphyly of Cambaridae is not novel and needs to be addressed in a morphological revision.  Color blocks note the IUCN Redlist value.

Figure 2 Histograms depicting number of sequences found on GenBank given OTT names. 2A) Hemiptera families within suborders with nucleotide sequence data on NCBI GenBank. 2B) Hemiptera genera within suborders with nucleotide sequence data on NCBI GenBank.

Keith Crandall is a professor and director of the Computational Biology Institute at George Washington University. 

Chris Owen is a post-doctoral researcher for the AVAToL grant.

Literature

Owen, C. L., Bracken-Grissom, H., Stern, D., & Crandall, K. A. (2015). A synthetic phylogeny of freshwater crayfish: insights for conservation. Philosophical Transactions of the Royal Society B: Biological Sciences, 370(1662), 20140009.

Richman, N. I., Böhm, M., Adams, S. B., Alvarez, F., Bergey, E. A., Bunn, J. J., … & Collen, B. (2015). Multiple drivers of decline in the global status of freshwater crayfish (Decapoda: Astacidea). Philosophical Transactions of the Royal Society B: Biological Sciences, 370(1662), 20140060.

Update on synthesis methods

The current Open Tree of Life synthesis methods are based on the Tree Alignment Graphs described by Smith et al 2013. The examples presented in that paper used much simpler datasets than the dataset that is used for draft tree synthesis by the Open Tree of Life (which contains hundreds of original source trees and the entire OTT taxonomy with over 2.3 million terminal taxa). To accommodate the goals of synthesis, some modifications were made to the methods presented in Smith et al 2013. The current version of the draft tree (v2, which is presented at http://tree.opentreeoflife.org as of February 2015 and described in a preprint on bioRxiv), was built using these modified methods. The changes to synthesis that were introduced since Smith et al 2013 are not well-described elsewhere, so we present them below in this document.

We are continually testing and improving the methods we use to develop synthesis trees, and through this process we have recently discovered some methodological properties of the modified TAG procedures that are undesirable for our synthesis goals. We are making progress toward fixing them for the next version of the draft tree, and there are details at the end of this post.

General background on the Open Tree of Life project and the draft tree

The overall goal of OpenTree is to summarize what is known about phylogenetic relationships in a transparent manner with a clear connection to analyses and the published studies that support different clades. Comprehensive coverage of published phylogenetic statements is a very long term goal which would require work from a large community of biologists. The short-term goal for the supertree presented on the tree browser is to summarize a small set of well-curated inputs in a clear manner.

Background on Tree Alignment Graph methods

The current synthesis method uses a Tree Alignment Graph (TAG), described in Smith et al 2013. We have been using TAGs because:

  • These graphs can provide a view on conflict and congruence among input trees.
  • TAG-based approaches are computationally tractable on the scale at which the Open Tree of Life project operates (2.3 million tips on the tree, and hundreds of input trees).
  • TAG-based approaches provide a straightforward way to handle inputs in which tips of a tree are assigned to higher taxa (any taxon above the species level). It is fairly common for published phylogenies to have tips mapped at the genus level (or higher).
  • When coupled with expert knowledge in the form of ranking of input trees, TAG methods can produce a sensible summary of our (rather limited) input trees. At this point in the project, our data store does not contain a large number of trees sufficiently curated* to be included in the supertree operations.

* Sufficiently curated = 1. tips mapped to taxa in the Open Tree Taxonomy; 2. rooted as described in the publication; 3. ingroup noted. Incorrect rootings and assignments of tips to taxa can introduce a lot of noise in the estimate, so we have opted for careful vetting of input trees rather than scraping together every estimate available. We are hopeful that community involvement in the curation will get us to a point of having enough input trees to allow more traditional supertree approaches to work well, so that we can present multiple estimates of the tree of life.

Methods used to produce the v2 draft tree

The open tree of life project has been alternating between phases where we (1) add more trees to our set of curated input trees, and then (2) generate new versions of the “synthetic” draft tree of life. Thus far two versions of the tree have been publicly posted to http://tree.opentreeoflife.org. The process of generating a new public draft tree involves the creation and critical review of many unpublished draft trees in order to detect errors or problems with the process (which could be due to misspecified taxa in input trees, software bugs, etc.).

This process has led to a few modifications of the TAG procedure as it was described in the PLoS Comp. Bio. paper. These modifications have been made to our treemachine software, and they include:

  • In the original paper, conflict was assessed by whether there was conflicting overlap among the descendant taxa of the nodes, not the edges. The software that produced the v2 tree assessed conflict between edges of the graph by looking for conflict based on the taxon sets contributed by each tree. This change is referred to as the “relationship taxa” rule (see this issue on GitHub).
  • The supertree operation moves from root to tips, and occasionally a species attaches to a node via a series of low ranking relationships. When all of these are rejected (due to conflict with higher ranking trees), the species would be absent in the full tree if we followed the original TAG description faithfully. Instead, the treemachine version for v2 tree reattached these taxa based on their taxonomy after sweeping over the full tree.
  • The “Partially overlapping taxon sets” section of the paper described a procedure for eliminating order-dependence of the input trees. We have recently discovered a case in which the structure of a TAG built according to those procedures would differ depending on the input order of the trees. We have implemented a new procedure that pre-processes all the input trees, which removes this order-dependence (code for the new procedure can be accessed in the find-mrcas-when-creating-nodes branch of the treemachine repo on github).
  • To increase the overlap between different input trees, an additional step was implemented in treemachine that mapped the tips of an input tree to deeper nodes in the taxonomy that they may have represented. This was done by determining the most inclusive taxon that a tip could belong to without including any other tips in the tree, and then mapping the tip to that taxon instead of the taxon actually specified for the tip in the input tree itself. For example if the only primate in a tree was Homo sapiens, but the tree contained other mammals from the taxon sister to Primates (in the taxonomy), then the Homo sapiens tip would be assigned to the taxon Primates.
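
The tip-remapping step in the last bullet can be illustrated with a minimal sketch. This is not the treemachine implementation; the function names and the toy, heavily simplified taxonomy (a child-to-parent dictionary) are assumptions made only for illustration. The idea is to walk up from a tip through its ancestors and stop just before reaching a taxon that also contains another tip of the same input tree.

```python
# Toy taxonomy as child -> parent links (hypothetical and heavily simplified).
PARENT = {
    "Homo sapiens": "Homo",
    "Homo": "Hominidae",
    "Hominidae": "Primates",
    "Primates": "Euarchontoglires",
    "Mus musculus": "Mus",
    "Mus": "Rodentia",
    "Rodentia": "Glires",
    "Glires": "Euarchontoglires",
    "Euarchontoglires": "Mammalia",
}


def ancestors(taxon, parent):
    """Yield the taxon itself, then each successive ancestor up to the root."""
    while taxon is not None:
        yield taxon
        taxon = parent.get(taxon)


def most_inclusive_taxon(tip, other_tips, parent):
    """Return the deepest taxon containing `tip` but none of `other_tips`."""
    # Any taxon on the path from another tip to the root contains that tip.
    blocked = set()
    for other in other_tips:
        blocked.update(ancestors(other, parent))

    mapped = tip
    for taxon in ancestors(tip, parent):
        if taxon in blocked:
            break  # this taxon would swallow another tip of the tree
        mapped = taxon
    return mapped
```

With this toy taxonomy, a tree whose only primate is Homo sapiens and whose only other tip is Mus musculus would have the Homo sapiens tip mapped to Primates, matching the example in the text.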

Undesirable properties of the procedures used to produce v2

  • It was possible for edges to exist in the draft tree that were not supported by any of the input trees. There were a very small number (111) of such groups in the v2 tree; this GitHub issue discusses the issue more thoroughly. This is not an unusual property for a supertree method to have – in fact most supertree methods can produce such groups. And under some definitions of support (e.g. induced triples) these groupings would probably have had support in our input trees. However, not being able to link every branch in the supertree to a supporting branch in at least one input tree made the draft tree more difficult to understand. We are working on modifications to the procedure that do not produce these groupings.
  • There were 22 taxonomic groupings mislabeled in the supertree (see issue 154 for details) and the definition of support used to indicate when an input tree “supported” a particular edge in the synthesis could be counterintuitive in some cases. The current view of the tree reports an input tree in the “supported by” panel if the branch in the draft tree passes along an edge that is parallel to an edge contributed by that input tree. Because some of the included taxa may have been culled from the group and reattached in a position closer to the root, the input tree can be in conflict with a grouping but still be listed as supporting it (see issues 155 and 157).

The draft tree contains over 2 million tips and many hundreds of thousands of internal edges. Thus, the undesirable properties mentioned above affected less than 0.0001% of the draft tree v2. Nonetheless, we are in the process of developing fixes for these problems, which should further improve the interpretability as well as the biological accuracy of future versions.

Preprint: Synthesizing phylogeny and taxonomy into a comprehensive tree of life

We’ve just posted a preprint on bioRxiv of our submitted manuscript on how we are combining taxonomy and phylogeny into a comprehensive tree of life:

http://www.biorxiv.org/content/early/2014/12/05/012260

You can browse the complete tree at http://tree.opentreeoflife.org

Comments welcome (either here or on bioRxiv). Note that the authorship list is woefully incomplete – bioRxiv only allows 20 authors in the submission process. Here is the complete list:

Stephen A. Smith, Karen A. Cranston, James F. Allman, Joseph W. Brown, Gordon Burleigh, Ruchi Chaudhary, Lyndon M. Coghill, Keith A. Crandall, Jiabin Deng, Bryan T. Drew, Romina Gazis, Karl Gude, David S. Hibbett, Cody Hinchliff, Laura A. Katz, H. Dail Laughinghouse IV, Emily Jane McTavish, Christopher L. Owen, Richard Ree, Jonathan A. Rees, Douglas E. Soltis, Tiffani Williams

Tree-for-All hackathon series: Taxon sampling, part 1 

Sampling taxa with Python and Perl scripts

This continues a series of posts featuring results from the recent “Tree-for-all” hackathon (Sept 15 to 19, 2014, U. Mich Ann Arbor) aimed at leveraging data resources of the Open Tree of Life project.  To read the whole series, go to the Introduction page.

More specifically, this is the first of two posts addressing the outputs of the “Sampling taxa” team, consisting of Nicky Nicolson (Kew Gardens), Kayce Bell (U. New Mexico), Andréa Matsunaga (U. Florida), Dilrini De Silva (U. Oxford), Jonathan Rees (OpenTree) and Arlin Stoltzfus (NIST).[1]

The “taxon sampling” idea

Although users seeking a tree may have a predetermined set of species in mind, often the user is focused on taxon T without having a prior list of species. For instance, the typical user interested in a tree of mammals does not really want the full tree of > 5000 known species of mammals, but some subset, e.g., a tree with a random subset of 100 species, or a tree of the 94 species with known genomes in NCBI, or a tree with one species for each of ~150 mammal families.

If we think about this more broadly, we can identify a number of different types of sampling, depending on what kinds of information we are using, and how we are using it. First, sampling T by sub-setting is simply getting all the species in T that satisfy some criterion, e.g., being on the IUCN red list of endangered species,  or having a genome entry in NCBI genomes, a species page in EOL, or an image in phylopic.org (organism silhouettes for adorning trees).

Second, we might use a kind of hierarchical taxonomic sampling to get 1 (or more) species from each genus (or family, order, etc.).

Poster from hackathon day 1, making the pitch for sampling taxa as a hackathon target

Third, we could reduce the complexity of a taxon or clade without using any outside information (what we might call down-sampling), e.g., get a random sample of N species from taxon T, down-sample nodes according to subnode density, or choose N species to maximize phylogenetic diversity.

Finally, we can imagine a kind of relevance sampling, where we choose (from taxon T) the top N species based on some external measure of importance or relevance, e.g., the number of occurrence records in iDigBio (or GBIF, iNaturalist, etc.), the number of google hits (i.e., popular species), or the number of PubMed hits (i.e., biomedically relevant species).
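
As a rough illustration of the first two kinds of sampling described above (sub-setting and hierarchical taxonomic sampling), here is a minimal sketch in Python. The function names are hypothetical, and the genus is simply assumed to be the first word of a binomial name, which would not hold for every name in a real taxonomy.

```python
import random


def subset(species, has_property):
    """Sub-setting: keep the species that satisfy some external criterion."""
    return [name for name in species if has_property(name)]


def one_per_genus(species, rng=random):
    """Hierarchical sampling: pick one species at random from each genus.

    Assumes each name is a binomial whose first word is the genus.
    """
    by_genus = {}
    for name in species:
        genus = name.split()[0]
        by_genus.setdefault(genus, []).append(name)
    # Iterate genera in sorted order so the output order is predictable.
    return [rng.choice(members) for _, members in sorted(by_genus.items())]
```

In practice the `has_property` test would be backed by an external source (for example, a set of names retrieved from a database), and the grouping would come from the taxonomy rather than from string splitting.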

Products of the “taxon sampling” team

At the tree-for-all hackathon, the “taxon sampling” team took on the challenge of demonstrating approaches to sampling from a taxon, making their products available in their github repo. The group focused its effort on creating multiple implementations for 3 specific use-cases:

  • sub-setting: get species in T with entries in NCBI genomes
  • down-sampling: get a random sample of N species from T
  • relevance sampling: get the N species in T with the most records in iDigBio

Each approach relies on 2 key OpenTree web services (described and illustrated in the introduction): the match_names service (click to read the docs) to match species names to OpenTree taxon identifiers (ottIds), and the induced_tree service to get a tree for species designated by these identifiers.
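
The two-service pattern can be sketched as follows. This is only an illustration, not the code produced at the hackathon: the base URL, the endpoint paths (`tnrs/match_names`, `tree_of_life/induced_subtree`) and the response fields (`results`, `matches`, `taxon`, `ott_id`, `newick`) are assumptions based on a later (v3) version of the OpenTree API and may differ from the services available at the time of writing.

```python
import json
import urllib.request

# Assumption: v3 of the OpenTree API; check the current docs before relying on it.
API = "https://api.opentreeoflife.org/v3"


def post_json(path, payload):
    """POST a JSON payload to an OpenTree endpoint and decode the JSON reply."""
    request = urllib.request.Request(
        f"{API}/{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def ott_ids_from_matches(match_response):
    """Take the first match for each name; skip names with no match."""
    ids = []
    for result in match_response.get("results", []):
        matches = result.get("matches", [])
        if matches:
            ids.append(matches[0]["taxon"]["ott_id"])  # assumed field layout
    return ids


def induced_tree_for(names):
    """Match names to OTT ids, then request the induced tree as Newick."""
    matched = post_json("tnrs/match_names", {"names": names})
    ids = ott_ids_from_matches(matched)
    tree = post_json("tree_of_life/induced_subtree", {"ott_ids": ids})
    return tree["newick"]  # assumed response field
```

Calling `induced_tree_for(["Homo sapiens", "Mus musculus"])` would issue two network requests. Taking the first match for each name is a simplification; a real tool would need to deal with ambiguous or approximate matches.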

Here I’ll describe two projects based on command-line scripts in Python and Perl.  In the next post, I’ll describe how taxon sampling was implemented within existing platforms with graphical user interfaces, including Open Refine (spreadsheets), PhyloJIVE, and Arbor.

Down-sampling in Python

A simple down-sampling approach via random choice is implemented in the random_sample.py script developed by Dilrini De Silva (Oxford) and Jonathan Rees (OpenTree), as in this example:

python random_sample.py -t Mammalia -m random -n 50 -o my_induced_tree.nwk

Here, “Mammalia” can be replaced by another taxon name, “50” may be replaced by another number, and the -o flag is used to specify an output file. The script calls on the OpenTree functions via the ‘opentreelib’ python library (another hackathon product available on github) to interact with OpenTree. It retrieves the unique OTTid of a higher taxon specified via the -t flag, and queries OpenTree to retrieve a subtree under that node. It parses the subtree to identify the implicated species, selects a random sample of the species, and requests the induced subtree, writing this to a newick file.
This script also invokes a rendering library to create a graphic image of the tree from the command-line, as in the example (figure) showing a random sample of 10 mammals.
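
The "parse the subtree, then sample" step can be sketched without any phylogenetics library. This is not the code of random_sample.py; it is a simplified stand-in with hypothetical function names, and the regular expression assumes plain Newick labels (no quoted labels containing parentheses, commas, or colons).

```python
import random
import re


def tip_labels(newick):
    """Return tip labels from a simple Newick string.

    Tips are the labels that directly follow "(" or ","; internal node
    labels follow ")" and are therefore skipped.
    """
    return [label.strip() for label in re.findall(r"[(,]\s*([^(),:;]+)", newick)]


def random_tips(newick, n, seed=None):
    """Pick up to n tip labels at random, without replacement."""
    tips = tip_labels(newick)
    rng = random.Random(seed)
    return rng.sample(tips, min(n, len(tips)))
```

The sampled labels (or the identifiers embedded in them) would then be passed to the induced-tree service to obtain the final tree.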

Sub-setting in Perl

The specific sub-setting challenge that the team picked was to get a tree for those species (in a named taxon) that have a genome entry in NCBI genomes. NCBI offers a programmable web-services interface called “eutils” to access its databases. Because NCBI searches can be limited to a named taxon, it is possible to query the genomes database with the “esearch” service for “Mammalia” (or Carnivora, Reptilia, Felidae, Thermoprotei), cross-link to NCBI’s taxonomy database using the “elink” service, get the species names using the “esummary” service, and then use OpenTree services (as described in the Introduction) to match names and extract the induced tree.

This 5-step workflow, which illustrates the potential for chaining together web services to build useful tools, was implemented by Arlin Stoltzfus (NIST) as a set of Perl scripts. The master script invokes 5 other standalone scripts, one for each step. The last 2 scripts are simply command-line wrappers for OpenTree’s match_names and induced_subtree methods. All the scripts are available in the Perl subdirectory of the team’s github repo. They are demonstrated in the brief (<2 min) screencast below.
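
The first three steps of the chain can be sketched as URL builders for the NCBI E-utilities. This is a Python illustration rather than a port of the Perl scripts; the function names are hypothetical, and the database names ("genome", "taxonomy") and the `[Organism]` search field are assumptions about how such a query could be phrased.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(taxon, db="genome", retmax=500):
    """Step 1: search a database for records limited to a named taxon."""
    query = {"db": db, "term": f"{taxon}[Organism]", "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?" + urlencode(query)


def elink_url(genome_ids):
    """Step 2: cross-link genome record ids to the taxonomy database."""
    query = {"dbfrom": "genome", "db": "taxonomy", "id": ",".join(genome_ids)}
    return f"{EUTILS}/elink.fcgi?" + urlencode(query)


def esummary_url(taxonomy_ids):
    """Step 3: fetch summaries (including names) for taxonomy ids."""
    query = {"db": "taxonomy", "id": ",".join(taxonomy_ids)}
    return f"{EUTILS}/esummary.fcgi?" + urlencode(query)
```

Each URL returns XML by default, so a complete tool would fetch and parse each response to extract the ids for the next step, then hand the resulting species names to the OpenTree name-matching and induced-tree services for steps 4 and 5.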

Next

The taxon sampling group produced several other products.  In the next post, I’ll describe how taxon sampling was implemented within environments that provide a graphical user interface, including Open Refine (spreadsheets), PhyloJIVE (phylogeographic visualization), and Arbor (phylogeny workflows).


[1] The identification of any specific commercial products is for the purpose of specifying a protocol, and does not imply a recommendation or endorsement by the National Institute of Standards and Technology.

 

Why Do We Need Big Trees, Anyway?

An explicit goal of the Open Tree of Life is to create a single phylogenetic tree that encompasses all living (and some extinct) biodiversity on earth. A question some may have, especially non-scientists, is why do we need a tree like that, and what would we do with it? You can’t even see it all at once, right? The answer to this question, of course, is that with bigger and more resolved trees we can answer evolutionary questions on scales not previously possible.

Currently, postdocs from the labs of Doug Soltis (Univ. of Florida) and Stephen Smith (Univ. of Michigan) are collaborating on several projects within the plant world that leverage the power of big trees. Cody Hinchliff, a postdoc in the Smith lab, recently presented some of these findings during a standing room only presentation at the Botanical Society of America conference in Boise, Idaho, employing a tree with almost complete generic level sampling to unravel evolution and diversification of epiphytes across vascular plants. Perhaps most surprisingly, Hinchliff found that most epiphyte lineages are relatively young, suggesting that either the widespread success that epiphytes currently exhibit is a recent phenomenon, or that epiphytic lineages are relatively short lived and evolve opportunistically in response to large-scale climate fluctuations. This, and other associated findings, are novel and exciting discoveries, and are examples of the insights that can be gleaned by analyzing character data across a massive data set.

Other collaborative “big tree” projects in the Soltis and Smith labs address the evolution of the aquatic habit within land plants and the evolution of floral characters in the order Lamiales. These studies involve Hinchliff and Stephen Smith, Bryan Drew from the University of Nebraska at Kearney (formerly a postdoc with Doug Soltis) and Doug Soltis, and undergraduates from all three institutions. The aquatic evolution project is looking at how the re-colonization of aquatic habitats is linked to lineage diversification and whether an aquatic habit is associated with other character or habitat traits. The Lamiales study is investigating what suites of floral characters may be responsible for the extraordinary evolutionary success of the lineage, which at 23,000 species comprises about 1/12th of all flowering plants.

The fact that studies of this magnitude are not only possible, but ongoing, is a testament to the utility of big trees. Because these trees are nearly complete in terms of genera, we can account for virtually all diversity across these clades. Sparse lineage sampling and hence unaccounted for diversity has previously been a hindrance when analyzing evolutionary trends that span the tree of life, but the time is approaching (or might be here already!) where the size of the phylogenies will not be the limiting factor in studying broad scale evolutionary questions. This exciting development leaves researchers more time to examine and ponder truly interesting questions that could not be addressed previously. This is the power that big trees give us, and this is one of the reasons we need big trees.

Chronogram showing epiphytic evolution within vascular plants. Epiphytic lineages are shown in orange, and likely branches of epiphytic origin are in red. Root of tree is ~485 million years old.

Doug Soltis is a distinguished professor at the University of Florida.
Bryan Drew was previously a post-doctoral researcher in the Soltis lab and is currently an assistant professor at the University of Nebraska-Kearney.

Tree-for-All hackathon series: taxon sampling, part 2

Sampling taxa in PhyloJIVE, Open Refine, and Arbor

This continues a series of posts featuring results from the recent “Tree-for-all” hackathon (Sept 15 to 19, 2014, U. Mich Ann Arbor) aimed at leveraging data resources of the Open Tree of Life project.  To read the whole Tree-for-All series, go to the Introduction page.

More specifically, this is the second of two posts on work of the “taxon sampling” team: Nicky Nicolson (Kew Gardens), Kayce Bell (U. New Mexico), Andréa Matsunaga (U. Florida), Dilrini De Silva (U. Oxford), Jonathan Rees (Open Tree) and Arlin Stoltzfus (NIST).[1]  The team got significant help from Arbor team members Zack Galbreath (Kitware) and Curt Lisle (KnowledgeVis).

Products of the “taxon sampling” team

At the tree-for-all hackathon, the “taxon sampling” team took on the challenge of demonstrating approaches to sampling from a taxon, making their products available in their github repo. The group focused its effort on creating multiple implementations for 3 specific use-cases of sampling up to N species from a taxon T:

  • sub-setting: get species in T with entries in NCBI genomes
  • down-sampling: get a random sample of N species from T
  • relevance sampling: get the N species in T with the most records in iDigBio

Each approach relies on 2 key OpenTree web services (described and illustrated in the introduction): the match_names service (click to read the docs) to convert species names to OT taxon ids, and the induced_tree service to get a tree for species designated by ids.

In the previous post, I described 2 projects based on command-line scripts in Python and Perl.  Below, I’ll describe how taxon sampling was implemented within existing platforms with graphical user interfaces, including Open Refine (spreadsheets), PhyloJIVE (phylogeographic visualizations), and Arbor (phylogeny workflows).

Relevance sampling in PhyloJIVE

Previously we defined “relevance sampling” as finding a subset of species in some taxon that is the most relevant by some external measure, e.g., number of hits in google (popular species).  In particular, the taxon-sampling team defined its target relative to iDigBio (Integrated Digitized Biocollections), which makes data and images for millions of biological specimens available in electronic format for the research community, government agencies, students, educators, and the general public.  The challenge is to get a tree for the N species in taxon T with the most records in iDigBio.  Because iDigBio has its own web-services interface, we can query it automatically using scripts.
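
Once record counts are in hand, the ranking step itself is small. The sketch below assumes the counts have already been retrieved (for example, from a specimen database) into a dictionary mapping species names to numbers of records; the function name is hypothetical, and this is not the PhyloJIVE implementation.

```python
def top_n_by_records(record_counts, n):
    """Return the n species with the most records.

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(record_counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:n]]
```

The returned names would then be matched to OpenTree identifiers and used to request the induced tree, as in the other use-cases.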

A version of relevance sampling was implemented by Andréa Matsunaga (U. Florida) to show how phylogenies can be integrated into an environment for analyzing biodiversity data. For this demonstration, OpenTree services were invoked from within PhyloJIVE (Phylogeny Javascript Information Visualiser and Explorer), a web-based application that places biodiversity information (aggregated from many sources) onto compact phylogenetic trees.

PhyloJIVE live demo software developed by hackathon participant Andréa Matsunaga. Choosing the “top 10 Felidae” menu item queries iDigBio for the cats with the most records, then obtains a tree on the fly by querying Open Tree. Clicking on Leopardus pardalis (ocelot) on the resulting tree opens up a map viewer showing the locations associated with records (red dots).

A live demo provides access to several pre-configured queries. For instance, choosing the “top 10 Felidae” menu item returns an OpenTree phylogeny for the 10 cat species most frequently implicated by iDigBio records. In the resulting view (above), mousing over the boxes reveals the number of records for each species. Clicking on a species (e.g., Leopardus pardalis above), shows a map of occurrence records.

Sub-setting and relevance-sampling in Open Refine

OpenRefine spreadsheet populated with counts of occurrence records captured by invoking iDigBio webservices directly

Open Refine (formerly Google Refine) is an open-source data management tool with an interface like a spreadsheet, but with some of the features of a database.  Nicky Nicolson (Kew Gardens) teamed up with Andréa Matsunaga (U. Florida) to explore how Open Refine’s scriptable features can be used to populate a spreadsheet with occurrence data from iDigBio (obtained via iDigBio’s web services), as shown above.

Phylogeny view generated from within OpenRefine by invoking a javascript phylogeny viewer

Further scripting can be used to generate a column of OpenTree taxonomy ids from a column of species names, by invoking the tnrs/match_names service. Finally, one can submit a query for the induced tree for a selected column of species identifiers. The image above shows a custom “OpenTree” item that has been added to the menu of Open Refine, to retrieve a tree, which is then visualized using a JavaScript viewer here (image at right).

The value of this demonstration, explained more fully on the refine-opentree project wiki, is that the user has considerable flexibility to create and manage a set of data using the Open Refine spreadsheet features, but also has the power to invoke external web services from iDigBio and OpenTree.

Sub-setting and relevance-sampling in Arbor

Arbor (http://arborworkflows.com) provides a framework for constructing and executing workflows used in evolutionary analysis.   Andréa Matsunaga (U. Florida) and Kayce Bell (U. New Mexico) worked with Arbor developers Zack Galbreath (Kitware) and Curt Lisle (KnowledgeVis) to implement approaches to sub-setting and relevance-sampling by producing code and workflows in python/Arbor.  A live demo of Arbor that includes OpenTree menu items is accessible at arbor.kitware.com.   One of the nice things about Arbor is that it provides a graphical workflow editor, allowing you to piece together workflows from modules, by connecting inputs and outputs.  The workflow shown below begins by querying iDigBio, and ends with generating an image of a tree.

High-level view of Arbor workflow to capture iDigBio records, and then acquire matching taxon names and the induced tree from OpenTree

To view the OpenTree-specific menu items on the public Arbor instance hosted at kitware.com, you must click on the view (eye) icon next to “OpenTree.”  Be warned that, at present, menu items are undergoing changes. The menu item currently entitled “Get ranked scoped scientific names from iDigBio” will return a list of species names that can then be used to retrieve a tree from OpenTree. The analysis takes a scientific name at a specified rank (the scope), or runs as a general taxonomic search (leave scope at _all), and returns a list of species of the specified size, consisting either of the top-ranked species (most records) or a random set of species that meet the criteria, depending on what you specify. This has also been incorporated into a menu item (“Workflow to get an induced tree from a configurable iDigBio query”) that takes the specifications for the iDigBio search as its input; this is the workflow shown above.   In the screencast below (bottom of page), Kayce Bell explains exactly how to carry out the individual steps in Arbor.

As Arbor’s interface is designed to allow users to execute a variety of analyses on user-supplied data, there are ways to upload your own tabular data for processing. Currently data is expected to be in CSV format. Algorithms exist in Arbor to match species names against the OpenTree TNRS, request a tree matching specific taxa, and perform comparative analysis on trees and tables. Some auto discovery of tabular taxa names is supported, but it is recommended to have a first column entitled “species”, “name”, or “scientific name”. Online documentation for Arbor is currently being developed, and will be available through the Arbor website.
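
To make the recommended table layout concrete, here is a small sketch that checks a CSV file for one of the suggested first-column headers and extracts the names. It is not Arbor code; the function name and the exact matching rule (case-insensitive comparison of the first header) are assumptions for illustration.

```python
import csv
import io

# Headers recommended in the text for the first column.
NAME_HEADERS = ("species", "name", "scientific name")


def read_species_column(csv_text):
    """Return the names in the first column of a CSV table.

    Raises ValueError if the first header is not one of NAME_HEADERS.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames[0] if reader.fieldnames else None
    if header is None or header.strip().lower() not in NAME_HEADERS:
        raise ValueError("first column should be 'species', 'name', or 'scientific name'")
    # Skip rows where the name cell is missing or blank.
    return [row[header].strip() for row in reader if row[header] and row[header].strip()]
```

The resulting list of names is the kind of input that a name-matching step would then resolve against the OpenTree taxonomy.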

 


[1] The identification of any specific commercial products is for the purpose of specifying a protocol, and does not imply a recommendation or endorsement by the National Institute of Standards and Technology.
