Small portion of phylogenetic data is stored publicly

‘The glass is still pretty empty’

Sometimes you wonder whether the glass is half full or half empty.

But when the glass is only four percent full, and the other 96 percent is just air, there is only one conclusion: it is time for more.

At least that is what some scientists in the phylogenetic community argue, because only about four percent of all published phylogenies are stored in places such as TreeBASE or Dryad. Their message is quite simple: it is time to bring more estimates of how species are related to each other together in public databases.

Several journals in the evolutionary biology field recently adopted policies that encourage or require contributors to make their data publicly available online. Yet this only leads to the storage of a very small percentage of the tens of thousands of phylogenies that have been constructed in the past few decades.

Of course, there are other ways to obtain data that are not stored on the Internet, but those alternatives are usually not the most efficient routes. For instance, it is possible to send an email to a scientist who published a phylogenetic tree and “sometimes wait for six months to maybe get a response, either with or without the data,” says Keith Crandall, one of the Open Tree of Life investigators and the founding director of the Computational Biology Institute at George Washington University.

Why even bother?

Even though sharing large amounts of data in a virtual space is a noble idea, some people wonder why they need to spend many hours on those projects by uploading files one by one to a usually slow computer system. Quite frankly, why would (or should) they even bother?

Crandall is not surprised that academics hesitate to spend some of their valuable time adding their research results to third-party databases. “One of the problems is that the systematics community has been hit by a bunch of these initiatives, such as Tree of Life Web and Encyclopedia of Life. Those are all really great ideas, but everyone wants scientists to put up their favorite species or trees. Every time it feels like: really, a new one?”

Ironically, as Crandall acknowledges, the Open Tree of Life project has already invited the phylogenetic community to add their favorite trees (with data) on a Mendeley page and, next year, everyone will be able to add information to the Open Tree of Life database when a first draft of the project is released in August. The success of the project partly depends on the number of phylogenies that are added to the overall tree to connect the roughly two million known species on Earth. Only an overwhelming number of trees gives scientists the opportunity to efficiently explore where prior studies agree on how species are related, and where conflicts still need to be resolved.

The investigators are very much aware that they have to convince researchers that they will benefit from contributing to the project, Crandall says. “This project has to be something different and enticing for people to put their trees and branches in the system. We really need to offer a functionality that empiricists would like to see. It has to become a resource for information. So it is not just putting a tree in, but also to get a tree out of it,” he explains. “It is not a one-way street. They will get something in return.”

Appropriately cautious

A group of scientists from the Netherlands, the United Kingdom, and the United States recently published an article about current practices for storing datasets with tree estimates. They concluded that “most phylogenetic knowledge is not easily re-used due to a lack of archiving, lack of awareness of best practices, and lack of community-wide standards for formatting data, naming entities, and annotating data.” As a result, “[m]ost attempts at data re-use seem to end in disappointment.”

Some members of the Open Tree of Life team are currently conducting their own survey to find out under what circumstances scientists would be more inclined to share data and what tools they would like to use to do so efficiently. “We need to figure out how to create the best place that people actually use,” says Crandall. “I hope that the Open Tree eventually becomes a repository for this kind of information. Hopefully, we have a GenBank for phylogenetics after three years. Maybe it is not going to be complete yet, but it has to be the best of what you can get compared to other places.”

Crandall senses that a considerable number of scientists are waiting for some positive signs that the Open Tree of Life project will be successful before contributing large chunks of information. And he thinks that the release of the first draft tree will do exactly that. “I think the community is optimistic, but there have been many failed projects that have tried the same thing before. So I think everyone is appropriately cautious as well,” he concludes. “We have a group of very talented people, so I’m confident we will convince everyone.”

Photo credit: Shirley Hirst

2 responses

  1. astoltzfus

    Allow me to clarify something from the cited report (http://www.biomedcentral.com/1756-0500/5/574) that I helped to author. The Joint Data Archiving Policy (JDAP) came into effect in Jan 2011, whereas we estimated the frequency of archiving for publication year 2010, before the JDAP went into effect.

    However, the JDAP effort is unlikely to turn things around, because it only covers a handful of journals, whereas trees are published in hundreds of different journals each year (the ranked distribution of trees published per journal has a very long tail of journals that publish very few trees each year). JDAP would have to cover the top 100 tree-publishing journals in order to increase archiving to 50 %. I’m afraid that the evolution community doesn’t have the mojo to make that happen. We’re too small.

    So, archiving-as-a-condition-of-publication is not a promising strategy if the goal is to achieve high frequencies of archiving per published tree. However, I don’t think this should be the goal. Just like most scientific papers only get cited a few times, most trees aren’t very useful, i.e., they have a low re-use value. The really valuable trees (a) cover large numbers of species, (b) are made using the most advanced methods and (c) are very well annotated (e.g., fully qualified species names, metadata about provenance, etc). Those are the trees most valuable for re-use, and the ones that OpenTreeOfLife is probably targeting.

    Arlin

    December 24, 2012 at 10:51 pm

  2. Pingback: Replication of Experimental Data « Torah Explorer