Building an API for the Open Tree of Life database

Do you want an app for this?

The developers of the Open Tree of Life would like to know from the phylogenetic community what kind of information they want to extract from its database when the first draft is released later this year. With those preferences, it is possible to develop an API that gives scientists the opportunity to build their own websites or software packages that use the data.

An API (application programming interface) is a digital tool that allows one website or software program to “talk” to another website to dig up certain pieces of data. For instance, a lot of people use Tweetdeck to navigate the ongoing bombardment of messages in the Twittersphere. In that case, Tweetdeck is connecting to Twitter, through its API, to receive and order the messages according to the preferences of the user.

In the case of the Open Tree of Life, an API gives researchers advanced access to the data of about two million species, the phylogenies that have been created to illustrate possible relationships between them, and the underlying data and methods of synthesis. “For example, it will be possible to select smaller trees for specific species or find out how many studies there are for a particular node within the database,” says Karen Cranston, the lead investigator of the project.
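
One of the requests Cranston mentions, selecting a smaller tree for specific species, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Node` class, the `prune` and `to_newick` helpers, and the toy primate tree are hypothetical and do not come from the Open Tree of Life code or data.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    """A tree node; leaves carry a species name, internal nodes may be unnamed."""
    name: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def prune(node: Node, keep: Set[str]) -> Optional[Node]:
    """Return the subtree induced by the leaves named in `keep`.

    Internal nodes left with a single child are collapsed, and branches
    with no requested leaves are dropped entirely.
    """
    if not node.children:  # leaf
        return Node(node.name) if node.name in keep else None
    kept = [p for p in (prune(c, keep) for c in node.children) if p is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]
    return Node(node.name, kept)


def to_newick(node: Node) -> str:
    """Serialize topology and names only (no branch lengths, no semicolon)."""
    label = node.name or ""
    if not node.children:
        return label
    return "(" + ",".join(to_newick(c) for c in node.children) + ")" + label


# Toy tree for illustration, not real Open Tree of Life data.
primates = Node("Hominidae", [
    Node("Pongo_abelii"),
    Node("Homininae", [
        Node("Gorilla_gorilla"),
        Node(None, [Node("Homo_sapiens"), Node("Pan_troglodytes")]),
    ]),
])

small = prune(primates, {"Homo_sapiens", "Gorilla_gorilla", "Pongo_abelii"})
print(to_newick(small) + ";")
# (Pongo_abelii,(Gorilla_gorilla,Homo_sapiens)Homininae)Hominidae;
```

A real service would presumably do this pruning on the server, so that a client only downloads the small tree it asked for.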

To do so, scholars need to access the information they are looking for, while ignoring all other data they are not interested in. They can perform those detailed searches efficiently when such an API is readily available. Because scientists have many questions about the evolutionary relationships between millions of species, a whole range of functions needs to be developed for the API. Moreover, the Open Tree of Life team would like to collaborate with other scientists and computer programmers to design and write those protocols.

But before the programming interface gets developed, it is necessary to learn more about the preferences of evolutionary biologists and other scholars with an interest in phylogeny. “What would be most useful for researchers? What should be ‘Number 1’ on our list to develop? It would be very helpful to get some input from other scientists, because then we know what to work on first,” Cranston explains.

Suggestions can be posted as a comment on this story, as well as on the Open Tree of Life’s Facebook page and Twitter account.

5 responses

  1. David B. Hedrick

    Several suggestions, that you have probably already considered:
    1. A unique integer taxonomic ID and a unique name for each node. It would be simpler if you could use the NCBI tax_id for the ID. It would be nice if the names were more descriptive than a serial number.
    2. Integration with other databases. NCBI is the example of this, with all the different types of data available from each page.
    3. Flexible format. People building the tree need to display branch lengths proportional to some evolutionary distance. Most people using the tree need it plotted against geologic time; they need to see the relationships of the organisms, not evolutionary rates. In other words, the distance from the root to each extant species is the same.
    4. Fossils should be entered between the root and extant species.
    5. This will be difficult, but eventually the tree should be plotted against geologic time, or log(time), so that each node is associated with the date of divergence.
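
The property described in point 3, equal distance from the root to every extant species, is usually called ultrametricity. A minimal sketch of how one might check it, assuming a hypothetical dictionary representation of a tree and toy branch lengths invented for illustration:

```python
from typing import Dict, List, Tuple

# Each node maps to a list of (child, branch_length) pairs.
# Nodes that do not appear as keys are treated as tips.
Tree = Dict[str, List[Tuple[str, float]]]


def tip_depths(tree: Tree, root: str) -> Dict[str, float]:
    """Return the summed branch length from `root` to every tip."""
    depths: Dict[str, float] = {}
    stack = [(root, 0.0)]
    while stack:
        node, depth = stack.pop()
        children = tree.get(node, [])
        if not children:
            depths[node] = depth
        for child, length in children:
            stack.append((child, depth + length))
    return depths


def is_ultrametric(tree: Tree, root: str, tol: float = 1e-9) -> bool:
    """True if all tips are equally far from the root, within `tol`."""
    depths = tip_depths(tree, root).values()
    return max(depths) - min(depths) <= tol


# Toy trees with the same topology: one scaled to time, one to
# substitution rates. The numbers are made up for this example.
time_tree: Tree = {"root": [("A", 10.0), ("n1", 4.0)],
                   "n1": [("B", 6.0), ("C", 6.0)]}
rate_tree: Tree = {"root": [("A", 0.3), ("n1", 0.1)],
                   "n1": [("B", 0.5), ("C", 0.1)]}

print(is_ultrametric(time_tree, "root"))  # True
print(is_ultrametric(rate_tree, "root"))  # False
```

Fossil tips, as in point 4, would end before the present, so a check like this would apply to extant species only.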


    March 26, 2013 at 5:15 pm

  2. I would like to make 2 suggestions. One of them is to develop systems for annotation, particularly including 3rd-party annotation of source trees.

    My colleagues and I published a review last year with a lot of information on barriers to re-use of phylogenies:

    OToL represents a testbed for evaluating what makes phylogenies re-usable, how to make them more re-usable, etc. From what I can see, the important factors are going to be (1) externally meaningful identifiers that allow for semantic integration (same thing that David Hedrick mentions in his comment) and (2) methods annotations.

    A somewhat separate problem is (3) integrating expressive well-defined formats into downstream user workflows. Most users can’t use your product if it isn’t Newick, but your product becomes meaningless if you have to distribute it in Newick format.

    People who are doing Real Science are very concerned about quality, and as there is no accessible external standard of truth for phylogenies, this means that real scientists are very concerned about whether the methods used to generate a phylogeny appear to represent the best available methods, chosen for accuracy, or appear to be chosen for convenience. That cannot be discerned without annotations of methods.

    If the original authors have not supplied you with externally meaningful identifiers and computer-readable methods annotations (and in most cases they don’t), this is a problem. Trying to get this stuff retroactively from the original authors is not a scalable strategy for a project like OToL.

    So this means either wait for the time (in the future) when the trees generated by experts are sufficiently well described to satisfy the demands of users, or use the trees that are available now but create some way for 3rd parties to annotate them.


    April 9, 2013 at 7:15 pm

  3. astoltzfus

    My second suggestion is to address what Karen Cranston suggests in the blog (giving users a way to get the subtree that they want) by partnering with the Phylotastic project:

    This is a scheme for a distributed system of components that, together, provide convenient access to the “tree of life”. The simplest version of the idea is that the user submits a query consisting of a list of species, and the response is a phylogeny for those species based on expert knowledge embodied in available trees.

    Much of this is hypothetical but there are demos that show how the ideas work, e.g., this form:

    There is a “Taxonomic Name Resolution Service” component that takes user-supplied names and cleans them up. There is a “DateLife” service that calibrates the tree by integrating fossil information (using a super-efficient computational scheme). It’s like TimeTree, except that it is an open project with a web-services interface so that it can be integrated into automated tools.

    The Phylotastic project is in an early stage, still gathering partners and putting together a proposal for major funding. I’m part of the project and I would be happy to make a pitch to the OToL PIs for why they should partner with this project. The brief pitch is this: 99% of researchers will not have the skills and know-how to integrate OToL into their research if you offer them one giant tree. But if this is part of a system that offers name-resolution, pruning, calibration, and metadata services, then OToL can rule the world.


    April 9, 2013 at 7:27 pm

  4. Scott Chamberlain

    Thanks for asking the community what they think is important!

    From a developer perspective, a few things would be nice to have:

    1. A really well documented API. Of course no one likes writing documentation, but it makes developing tools that consume the API so much easier.

    2. A RESTful API would be much easier to develop for than SOAP.

    3. If possible, it would be nice to be able to query not only by tip names, but by node names so that someone can ask, “Give me the tree downstream from X taxon”

    4. If you use authentication, IMHO it would be easier for developing desktop apps to avoid OAuth – just a simple authentication (user/pswd, or username/key, etc.) is easier to develop around (my opinion of course).

    5. Maybe you could have some API write methods for approved users to annotate trees?
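
Points 2 and 3 can be sketched together: a REST-style request that names a node, and the lookup that would answer it. Everything here is an assumption made for illustration: the base URL, the `/subtree/` path, the `format` parameter, and the toy tree are hypothetical, since no such API had been published at the time of this discussion.

```python
from typing import Iterator, Optional
from urllib.parse import quote, urlencode

# Hypothetical base URL, used only to show the shape of a REST request.
BASE_URL = "https://api.example.org/v1"


def subtree_url(node_name: str, fmt: str = "newick") -> str:
    """Build a GET URL asking for the clade rooted at `node_name`."""
    return f"{BASE_URL}/subtree/{quote(node_name)}?{urlencode({'format': fmt})}"


# Toy tree as nested dicts; not real Open Tree of Life data.
TREE = {"name": "Hominidae", "children": [
    {"name": "Pongo abelii", "children": []},
    {"name": "Homininae", "children": [
        {"name": "Gorilla gorilla", "children": []},
        {"name": "Homo sapiens", "children": []},
    ]},
]}


def find_clade(node: dict, name: str) -> Optional[dict]:
    """Depth-first search for the node called `name` (tip or internal)."""
    if node["name"] == name:
        return node
    for child in node["children"]:
        found = find_clade(child, name)
        if found is not None:
            return found
    return None


def tips(node: dict) -> Iterator[str]:
    """Yield the names of all tips at or below `node`."""
    if not node["children"]:
        yield node["name"]
    for child in node["children"]:
        yield from tips(child)


clade = find_clade(TREE, "Homininae")
print(subtree_url("Homininae"))
print(sorted(tips(clade)))
```

Because the node name is part of the URL path, names containing spaces need percent-encoding, which `quote` handles here.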

    Thanks! Scott


    April 9, 2013 at 8:28 pm

  5. Excited to play with this data when it becomes available!

    RDF / linked data endpoint?


    May 11, 2013 at 2:49 pm