“We need a sense of ownership of phylogenetic trees”
Where are the fungi datasets?
A couple thousand fungi phylogeny studies have been published in the past twelve years. Clark University postdoctoral researcher Romina Gazis has gone through all of them. Now she is working on a bigger challenge: finding all the trees and datasets on which those studies were based.
Ideally, all scientists who publish a phylogenetic tree would also deposit the datasets they used to create it in a publicly available online database. That would allow other researchers to synthesize data from different sources and advance knowledge about the relationships between species and their evolutionary history.
Unfortunately, most of those datasets are not publicly available. Gazis found datasets for only about a quarter of the two thousand fungi articles she surveyed. “Around 600 studies had tree files available, but not necessarily complete,” she concluded. “Some scientists deposited one but not all the trees.”
“We need to encourage scientists to store tree datasets online”
Not all is lost, though, because most of the scientists have stored their sequence data. It is the other information that helped the researchers build a phylogenetic tree that is not always publicly retrievable. Some of those data were lost after technological breakdowns, or when researchers saw no benefit in keeping the files once a picture of the tree had been printed in a journal article or book chapter.
The Open Tree of Life project is specifically looking for those tree datasets to create an overarching tree of life with more than two million species, including all fungi species that have been identified. A simple drawing of those smaller trees is, therefore, not enough.
Even though the sequence data may be available, it is often impossible for scientists to reconstruct the exact phylogenetic trees that their peers generated. The results would likely differ because of different methods, procedures, and statistical assumptions. According to Gazis, it would take a lot of time to reanalyze all the sequences, because at least some knowledge of each taxonomic group is needed to rebuild the datasets and reanalyze the data under “correct” parameters. “Most important, we already have lost a lot of expertise invested in those fungi trees studies for which the data do not exist anymore.”
Some of the ecology journals surveyed have adopted policies requiring contributors to deposit their datasets online after publication, but many of those files are, surprisingly, nowhere to be found. “That is really shocking and unfortunate,” says David Hibbett, a biology professor at Clark University and one of the Open Tree of Life investigators. “It would be great if we could find a way to encourage scientists to store those datasets at a public site. That is really important for future research.”
Hibbett expects that the fungi community will soon make changes to be sure that vital information about fungi species and their relationships will not be lost forever. “We need a sense of ownership of the trees. Eventually that allows us to create synthetic trees that are the result of all those smaller trees that researchers create,” he explains. “I’m sure we’ll head in the right direction on this. We have a great community. It is a cohesive, cooperative group.”
Gazis now faces the task of tracking down about 1500 datasets that are not stored on the Internet. She recently contacted the authors of a hundred journal articles and received thirty tree files. All the other scientists will be contacted in the coming weeks. “It’s going to be a lot of work to manage all the data that will be coming in,” she says. “But in this case, that will be a good thing. We really want as much data as possible.”
Photo: Harald Matern