Making valuable research data available to others in the scientific community is at the heart of open science, an idea central to the Open Tree of Life project. Through collaboration and the sharing of information, the Open Tree of Life aims to take what has been discovered about the phylogenetics of all life and make it easily accessible to everyone.
With 1.9 million species described, and thousands more being discovered and named each year, there is no shortage of new research in phylogenetics, the study of the evolutionary relationships between species, genera, and families. What there is a shortage of, however, is digital data published alongside these findings: data that can be used in projects like the Open Tree of Life.
Why is this? Despite decades of funding for this type of research, a huge amount of our knowledge isn’t available in a reusable form. This lack of availability is due to several factors: the data used to construct the Tree of Life have not always been provided when scholarly articles were published, or they have been stored in ways that aren’t easily accessed, manipulated, or maintained.
Our group faces the challenge of gathering scattered research on some two million known species, placing them and their associated data on a single evolutionary tree of life, and then providing a way for researchers around the world to add new species (hence the “Open” in our name). Traditional data storage and software are simply not up to the task. With any number of scientists and researchers contributing their work to the Open Tree, standardization is key. The largest existing evolutionary trees (for example, the trees in Price et al., 2010) contain around 100,000 species. Building a system that holds twenty times that number means rethinking how the data are stored and which types of storage are used.
This is no simple task. For instance, what criteria should be used to distinguish species? Is it genetic material? Is it variation in certain features? For species that are extinct, like early mammals or dinosaurs, we don’t have enough genetic material to say for certain where they fit in the tree. Does morphology (an organism’s form or shape) then come into play? Terminology must be standardized as well: one research team might store their data under one scientific name, while another might use a completely different name for the same species.
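To make the naming problem concrete, here is a minimal sketch of how records filed under different scientific names might be grouped under one accepted name. It is only an illustration, not a description of how the Open Tree of Life handles names: the `SYNONYMS` table, the function names, and the team labels are all hypothetical, and the single real-world pair used (the cougar, formerly classified as Felis concolor and now as Puma concolor) stands in for what would be a far larger, curated taxonomy.

```python
# Hypothetical synonym table (an assumption for illustration): maps lowercase
# names as they might appear in submitted data to one accepted name.
SYNONYMS = {
    "felis concolor": "Puma concolor",
    "puma concolor": "Puma concolor",
}


def normalize(name):
    """Collapse runs of whitespace and lowercase the name for lookup."""
    return " ".join(name.split()).lower()


def canonical_name(name, synonyms=SYNONYMS):
    """Return the accepted name, or None if the name is not in the table."""
    return synonyms.get(normalize(name))


def group_by_canonical(records, synonyms=SYNONYMS):
    """Group (name, source) pairs by accepted name.

    Returns a dict of accepted name -> list of sources, plus a list of
    records whose names could not be resolved and need human review.
    """
    grouped = {}
    unresolved = []
    for name, source in records:
        accepted = canonical_name(name, synonyms)
        if accepted is None:
            unresolved.append((name, source))
        else:
            grouped.setdefault(accepted, []).append(source)
    return grouped, unresolved


if __name__ == "__main__":
    # Hypothetical submissions from three teams.
    records = [
        ("Felis concolor", "team_a"),
        ("  Puma   concolor ", "team_b"),
        ("Unknownus examplei", "team_c"),  # made-up name, not in the table
    ]
    grouped, unresolved = group_by_canonical(records)
    print(grouped)
    print(unresolved)
```

Names that cannot be resolved are kept in a separate list rather than dropped or guessed at, since deciding whether two names refer to the same species is exactly the kind of judgment that needs an expert.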