Connecting millions of data points in a graph database
Creating ‘Facebook’ for species
The Open Tree of Life database is not just a list of about two million species. It also records their distinguishing characteristics and their possible relationships with other species. “It may become tens or hundreds of millions of pieces of data when we are all done.”
Stephen Smith, an evolutionary biology professor at the University of Michigan, is working together with the other researchers of the Open Tree of Life project to develop the programs and tools that will be used to construct the full tree of life. Scientists from all over the world can then synthesize all the information in the database.
“We are currently building the back-end of the Open Tree of Life. We need to create software that allows us to put all our information in a graph network, so that we can easily retrieve the information that researchers are specifically looking for.”
Social network for species
That graph database is constructed in the same way popular social media networks are, such as Facebook and Twitter, where many millions of users are linked to each other. Instead of connecting personal accounts to the ones of other “Facebook friends” or “Twitter followers,” the Open Tree of Life network links species based on their evolutionary relationships.
Additionally, Facebook launched its Graph Search a few weeks ago, which uses algorithms similar to those of search engines such as Google and Bing. The results are based on entire phrases and not just individual keywords, which makes it a lot more challenging to promptly present what the users are really looking for.
“With such a large amount of data it is important for social media companies to learn how those networks can retrieve information swiftly,” explains Smith. “We are basically doing the same thing, trying to connect the dots in a meaningful way and as efficiently as possible.”
To build a network of species, the Open Tree of Life team has adopted Neo4j, an open-source database that stores information structured in graphs rather than in tables. It is developed by Neo Technology, which has offices in Silicon Valley, England, Germany, and Sweden.
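To make the difference between tables and graphs concrete, here is a minimal in-memory sketch in Python. It does not use Neo4j or the project's own code; the taxa, the `rows` list, and the `lineage` helper are illustrative assumptions. The same parent-child facts are first written as table-style rows and then loaded into a graph-style structure in which each node points directly at its neighbours.

```python
from collections import defaultdict

# Table-style storage: one (taxon, parent) row per taxon.
# The taxa here are a small illustrative sample, not project data.
rows = [
    ("Homo sapiens", "Homo"),
    ("Homo", "Hominidae"),
    ("Pan", "Hominidae"),
    ("Hominidae", "Primates"),
]

# Graph-style storage: every node keeps direct links to its neighbours,
# so following a relationship is a lookup rather than a table scan.
children = defaultdict(list)
parent = {}
for child, par in rows:
    children[par].append(child)
    parent[child] = par


def lineage(taxon):
    """Follow parent links from a taxon up to the root of this sample."""
    path = [taxon]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


print(lineage("Homo sapiens"))
print(children["Hominidae"])
```

Walking from a species to its ancestors, or from a group to its members, only touches the nodes along the way. That is the kind of access pattern a graph database is built around.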
“I contacted some big database vendors and discussed the needs for a project like this. And it became clear that Neo4j had an advantage in some things we can use for our purposes,” Smith explains. “We are constantly trying to optimize the data. If I want to look at all branches of bacteria, the system needs to process all the trees that are relevant. We can do that with Neo4j.”
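The bacteria query Smith mentions can be sketched roughly as follows. This is a simplified illustration, not the project's algorithm and not Neo4j's API. It assumes each relationship is tagged with the source tree it came from; the `tree_A` to `tree_D` labels, the `edges` list, and the `clade` function are hypothetical, and the taxonomy skips intermediate ranks for brevity.

```python
from collections import defaultdict

# Each edge is (parent, child, source_tree). All source labels are made up.
edges = [
    ("Bacteria", "Proteobacteria", "tree_A"),
    ("Bacteria", "Firmicutes", "tree_A"),
    ("Proteobacteria", "Escherichia", "tree_B"),
    ("Firmicutes", "Bacillus", "tree_C"),
    ("Archaea", "Euryarchaeota", "tree_D"),
]

child_edges = defaultdict(list)
for parent, child, source in edges:
    child_edges[parent].append((child, source))


def clade(root):
    """Return the taxa below `root` and the source trees that mention them."""
    taxa, sources = [], set()
    stack, seen = [root], {root}
    while stack:
        node = stack.pop()
        for child, source in child_edges.get(node, []):
            sources.add(source)
            if child not in seen:  # guard against visiting a taxon twice
                seen.add(child)
                taxa.append(child)
                stack.append(child)
    return taxa, sources


taxa, sources = clade("Bacteria")
print(sorted(taxa))
print(sorted(sources))
```

Starting from the `Bacteria` node, the traversal only visits that branch, so the unrelated `tree_D` is never processed.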
The first version of the graph software was released in early 2010. “It is relatively new. We are early adopters, but there is also an active community of users. That’s really important. When we have questions, we have readily available answers for them. So far it has been working out really well for us.”
“We hope to intrigue people with our efforts”
Mark Holder, an evolutionary biology professor at the University of Kansas who is also involved with the Open Tree of Life project, is content with the functionality of the graph database as well. “It is very flexible and gives us high-performance solutions. It will solve a big set of our problems. However, a lot of code is written from scratch. We need to come up with the right algorithms for piecing together trees. It is a lot of work. But that’s OK,” he says.
“We hope to intrigue people with our efforts and that scientists become enthusiastic about the system to contribute to the tree. And hopefully it takes off from there to build the whole tree with help from the entire community.”
But before all those interested in the project can enjoy the new tools, large pieces of software still need to be written. For Smith, each new step in software development starts with a brainstorming session with two postdoctoral researchers. Arrows connect a bunch of information boxes on a whiteboard, and potential algorithms are added to the remaining space.
“Neo4j is perfectly designed for sketching out the graphs and the relationships. So we put our ideas on the board to work out the algorithms and then go back to our computers to create functions in Java for visualization, synthesis, and analysis of the tree.”
Those brainstorming sessions with his postdocs and Google Hangout conversations with other investigators of the Open Tree of Life project have led to an abundance of ideas for extra features for the software they are creating. “There are a lot of exciting ways to extend the model. For example, we can add nodes with user information, so we can see to which branches they have contributed. Or maybe we can add information about what parts of the tree they are interested in, so they can be updated if any relevant data are added,” he says.
“But those are things for down the road. We need to first concentrate on nodes for species and all their relationships. That is the main goal.”
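The user-node idea Smith describes could look roughly like the sketch below. It is a hypothetical illustration only: the relationship types `CONTRIBUTED_TO`, `FOLLOWS`, and `CHILD_OF`, the user names, and both helper functions are assumptions, and nothing here reflects a planned design. It also assumes that a user following a group should hear about changes anywhere beneath it.

```python
from collections import defaultdict

# Typed relationships: (start node, relationship type, end node).
# Users and relationship names are invented for illustration.
relationships = [
    ("user:alice", "CONTRIBUTED_TO", "Hominidae"),
    ("user:alice", "FOLLOWS", "Primates"),
    ("user:bob", "FOLLOWS", "Hominidae"),
    ("Hominidae", "CHILD_OF", "Primates"),
]

outgoing = defaultdict(list)
incoming = defaultdict(list)
for start, rel, end in relationships:
    outgoing[start].append((rel, end))
    incoming[end].append((rel, start))


def contributions(user):
    """Branches the given user has contributed to."""
    return [end for rel, end in outgoing.get(user, []) if rel == "CONTRIBUTED_TO"]


def followers_to_notify(taxon):
    """Users following the taxon itself or any of its ancestors."""
    notified = []
    node = taxon
    while node is not None:
        for rel, start in incoming.get(node, []):
            if rel == "FOLLOWS" and start not in notified:
                notified.append(start)
        parents = [end for rel, end in outgoing.get(node, []) if rel == "CHILD_OF"]
        node = parents[0] if parents else None  # assumes at most one parent
    return notified


print(contributions("user:alice"))
print(followers_to_notify("Hominidae"))
```

Because users and taxa live in the same graph, both questions reduce to following typed links from a starting node.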
This entry was posted on February 26, 2013 by robinblom.