Social curation of phylogenetic studies

People associated with the Open Tree of Life effort are busy on several fronts: writing a paper describing the initial draft release of a comprehensive tree of life, continuing their efforts to obtain estimates of different parts of the tree, improving the Open Tree Taxonomy (OTT) used for name matching, experimenting with new methods for building large trees…

In the midst of that activity (and well aware that we missed our initial goal of having the first release in the first year of the grant), we have recently started to redesign the study curation tool. The goal is a tool built around git and GitHub. This decision could be described using a wide variety of adjectives ranging from “foolish” to “inspired” (and probably including several that are not printable on this family-friendly blog). So, I (Mark Holder is writing this post) thought that I’d explain the rationale behind this decision.

Why do we need to “curate” published trees in the first place?

Unfortunately, even when we can find a phylogenetic estimate in a digital format, some crucial information is often missing. The tasks in the “curation” process typically include:

  • matching the tips of the tree to the appropriate taxon in a taxonomy (OTT in our case);
  • indicating which parts of the tree are rooted with high confidence. Many phylogenetic estimation procedures produce unrooted estimates, and the trees that they emit are often arbitrarily rooted. Properly identifying the “outgroup” is important for the supertree methods that we are using; and
  • describing what the branch lengths and internal node labels on the tree mean.
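As a rough illustration of the checklist above, here is a minimal Python sketch of a record that tracks which curation tasks are still open for one tree. The class name, field names, and the OTT ids are all hypothetical placeholders chosen for this example; this is not the schema that phylografter or NexSON uses.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CuratedTree:
    """Hypothetical record of the curation state of a single tree."""
    # Tip label -> OTT id, or None if the name has not been matched yet.
    tip_to_ott_id: Dict[str, Optional[int]] = field(default_factory=dict)
    # Id of the node that roots the confidently rooted ingroup.
    ingroup_node: Optional[str] = None
    # Free-text descriptions, e.g. "substitutions per site", "bootstrap support".
    branch_length_meaning: Optional[str] = None
    node_label_meaning: Optional[str] = None


def missing_curation(tree: CuratedTree) -> List[str]:
    """Return a human-readable list of curation tasks still to be done."""
    problems = []
    unmatched = sorted(t for t, ott in tree.tip_to_ott_id.items() if ott is None)
    if unmatched:
        problems.append("unmatched tips: " + ", ".join(unmatched))
    if tree.ingroup_node is None:
        problems.append("no confidently rooted ingroup specified")
    if tree.branch_length_meaning is None:
        problems.append("branch length meaning not described")
    if tree.node_label_meaning is None:
        problems.append("internal node label meaning not described")
    return problems


if __name__ == "__main__":
    # The id 1 below is a placeholder, not a real OTT identifier.
    tree = CuratedTree(tip_to_ott_id={"Erica carnea": 1, "Unknown sp.": None})
    for problem in missing_curation(tree):
        print(problem)
```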

In our first year of work on the Open Tree of Life project, we’ve also found many cases in which it would be nice if a downstream software tool could annotate the source tree.

For example, if a phylogeny of plants contains a single animal species, this odd sampling of species could be caused by an incorrect matching of names when the study was imported into the Open Tree of Life system. There are valid homonyms in parts of life that are governed by different nomenclatural codes; the Wikipedia page on homonyms has a nice discussion of this topic, including the example of the genus name Erica being used for both a jumping spider and a large group of flowering plants known as “heath”. The warning signs of incorrect name matching may not be obvious when a new study is added to the Open Tree of Life system. Ideally, these potential errors would be flagged with comments so that a taxonomic expert could double-check the name matching.
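To make the idea concrete, here is a small Python sketch of such a downstream check. It assumes that each tip has already been assigned to some coarse group (a kingdom, say) by the name matching, and it flags tips whose group is rare within the study. The function name, the 5% threshold, and the shape of the output are assumptions made for this example, not part of any existing Open Tree of Life tool.

```python
from collections import Counter


def flag_suspect_tips(tip_groups, max_fraction=0.05):
    """Flag tips whose coarse group is rare among the tips of one study.

    tip_groups maps a tip label to a coarse group name (e.g. a kingdom).
    A group is treated as suspicious when it accounts for at most
    max_fraction of the tips; the threshold is an arbitrary choice.
    """
    counts = Counter(tip_groups.values())
    total = len(tip_groups)
    flagged = []
    for tip, group in sorted(tip_groups.items()):
        if counts[group] / total <= max_fraction:
            flagged.append({
                "tip": tip,
                "group": group,
                "comment": (f"only {counts[group]} of {total} tips map to "
                            f"{group}; please check the name matching"),
            })
    return flagged


if __name__ == "__main__":
    # Thirty plant tips plus one tip that was matched to an animal genus.
    tips = {f"plant_{i}": "Viridiplantae" for i in range(30)}
    tips["Erica"] = "Metazoa"
    for flag in flag_suspect_tips(tips):
        print(flag["tip"], "->", flag["comment"])
```

A check like this cannot tell a mismatched homonym from a deliberately chosen distant outgroup, which is why the output is a comment for a human expert rather than an automatic correction.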

Why not just build a database-driven website with a “page” for each study so that you can update the study information in one place?

This is exactly what we have done. Fortunately for the project, Rick Ree’s lab already had a tool (phylografter) that did many of these tasks. Rick and his group have continued to improve phylografter as a part of the Open Tree of Life project. The fact that we started the project with a nice tool for study curation is a big part of the reason that we were able to get trees from about 2500 studies into the Open Tree of Life system in this first year (the other “big parts” are the herculean efforts of Bryan Drew, Romina Gazis, Jiabin Deng, Chris Owen, Jessica Grant, Laura Katz, and others to import and curate studies).

If it is not broken, why are we trying to “fix” it?

One of the primary goals of the Open Tree of Life project is to enable the community of biologists to collaboratively assemble phylogenetic knowledge. We are trying to build infrastructure for a system that is as inviting as possible to the community of biologists and software developers. Those goals imply that we should track the contribution of users in a fine-grained manner (so people will get the credit that they deserve), and that the system be open to contributions through many avenues (so that developers will not be constrained to work within one tightly integrated code base).

Phylografter is open in many senses: the code is open-source (see its repository), the study data can be exported via web services (this code snippet is an example of using the service), and interested parties can become study curators. However, the fundamental data store used by phylografter is an SQL database. Any new kind of write to the core data store requires adding functionality to the phylografter tool itself. This is certainly not impossible, but it is not very inviting to developers outside the project who want to dabble with the project.

For example, imagine that you wrote a tool that identifies groupings which might be the result of long branch attraction. To integrate that sort of annotation tool into our current architecture, you would need to figure out the SQL tables that would be affected, write an interface for adding this form of annotation, and implement a system for keeping track of the provenance of each change. This is all possible to do, but much more complicated than writing a tool that simply adds an annotation to a file.
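By contrast, here is roughly what “a tool that simply adds an annotation to a file” could look like if each study lived in its own JSON file. This is only a sketch: the "annotations" key and the fields of each record are hypothetical, since the controlled vocabulary for annotations has not been settled, and the function names are made up for this example.

```python
import json
from datetime import datetime, timezone


def add_annotation(study, target_id, message, author):
    """Append a hypothetical annotation record to a study dict."""
    record = {
        "target": target_id,   # id of the node or tree being annotated
        "message": message,
        "author": author,      # name of the person or tool
        "date": datetime.now(timezone.utc).isoformat(),
    }
    study.setdefault("annotations", []).append(record)
    return record


def annotate_file(path, target_id, message, author):
    """Read a JSON study file, add one annotation, and write it back."""
    with open(path, encoding="utf-8") as handle:
        study = json.load(handle)
    add_annotation(study, target_id, message, author)
    with open(path, "w", encoding="utf-8") as handle:
        # Stable key order and indentation keep version-control diffs small.
        json.dump(study, handle, indent=2, sort_keys=True)
        handle.write("\n")
```

A long-branch-attraction checker would then only need to call something like annotate_file("study.json", "node42", "possible long branch attraction", "lba-checker") for each suspicious grouping; the file name and node id here are placeholders. Who changed what, and when, would come from the version history of the file rather than from custom provenance code.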

Maybe it won’t be too hard to open up the database of phylogenetic studies as versioned text.

Fortunately, the process of adding corrections and annotations to a text file in a collaborative setting is a common problem, and some excellent software tools exist for dealing with this situation. In particular, we can use the git content tracker to store the versions of a study in a reliable, secure manner, with the full history of each file and rich tools that allow many people to collaborate on the same file. GitHub offers some great add-on features (including dealing with authentication of users) and makes it easy to have a core data store that anyone can access. The Open Tree of Life is making heavy use of NexSON already, and that format supports rich annotation (though we do need to iron out the details of a controlled vocabulary). So we should not have to spend much time on designing the format of the files to be managed by git.

We certainly aren’t the first to think of using git as the database for an application (see the gollum project and git-orm, for example). Nor are we the first to think of using GitHub to make data in systematics more open. I love Rutger Vos’ dump of treebase data in https://github.com/rvosa/supertreebase. Ross Mounce has recently started putting many datafiles that he uses in his research on https://github.com/rossmounce/cladistic-data. Rod Page had a nice post a while back titled “Time to put taxonomy into GitHub.” I’m sure there are more examples.

git and GitHub keep coming up in the context of collaboratively editing data because most software developers who have used the tools recognize how much they have transformed collaborative software development. Implementing a social tool is tough, but git seems to have done it right. Everyone gets an entire copy of the data (via git clone). You can make your changes and save them in your own sandbox (via committing to a fork or branch). When you think that you have a set of changes that are of interest to others, you can ask that they get incorporated into the primary version of the database (via a pull request).
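For readers who have not used git, the clone, commit, and pull request cycle can be sketched as the sequence of git commands that a script (or the curation web application) might run on a curator's behalf. The Python below only builds the command lines, plus a small helper to execute them; the repository URL, directory, and branch name in the usage note are placeholders, and the pull request itself is opened on GitHub rather than through git.

```python
import subprocess


def clone_and_branch_commands(repo_url, workdir, branch):
    """Commands to get a full copy of the data and start a sandbox branch."""
    return [
        ["git", "clone", repo_url, workdir],
        ["git", "-C", workdir, "checkout", "-b", branch],
    ]


def publish_commands(workdir, branch, message):
    """Commands to record edits to tracked files and share the branch."""
    return [
        # -a stages changes to files git already tracks; new files need "git add".
        ["git", "-C", workdir, "commit", "-a", "-m", message],
        ["git", "-C", workdir, "push", "origin", branch],
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

A curator's session would then amount to run_all(clone_and_branch_commands(url, "studies", "fix-erica")), editing a study file, and run_all(publish_commands("studies", "fix-erica", "Correct name matching for Erica")), followed by opening a pull request against the primary repository.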

Of course, most biologists won’t want to use the git tool itself. Fortunately we have some very talented developers (Jim Allman, Jonathan “Duke” Leto, and Jonathan Rees) working on a web application that will hide the ugly details from most users. We’re also working on allowing phylografter to receive updated NexSON files, so we won’t have to abandon that tool for curating study data.

It is a bit scary to be adding a new tool this late in our timeline. But we’re really excited about the prospect of having a phylogenetic data curation tool built on top of a proven system for collaboration.

Comments, questions and suggestions are certainly welcome. The software dev page on our wiki has links to many of the communication tools that the Open Tree of Life software developers are using to discuss these (and other) ideas in more detail.

Mark Holder is an associate professor at the University of Kansas’s Department of Ecology and Evolutionary Biology.

Minor edits on Sunday, Oct 6 at 1:30 Eastern: links added for OTT and SQL
