Connecting millions of data points in a graph database

Creating ‘Facebook’ for species

The Open Tree of Life database is not just a list of about two million species. Information about their special characteristics and possible relationships with other species is added as well. “It may become tens or hundreds of millions of pieces of data when we are all done.”

Stephen Smith, an evolutionary biology professor at the University of Michigan, is working together with the other researchers of the Open Tree of Life project to develop the programs and tools that will be used to construct the full tree of life. Scientists from all over the world can then synthesize all the information in the database.

“We are currently building the back-end of the Open Tree of Life. We need to create software that allows us to put all our information in a graph network, so that we can easily retrieve the information that researchers are specifically looking for.”

Social network for species

That graph database is constructed in the same way as popular social media networks such as Facebook and Twitter, where many millions of users are linked to each other. Instead of connecting personal accounts to those of other “Facebook friends” or “Twitter followers,” the Open Tree of Life network links species based on their evolutionary relationships.

Additionally, Facebook launched its Graph Search a few weeks ago, which operates with algorithms similar to those of search engines such as Google and Bing. The results are based on entire phrases and not just individual keywords, which makes it a lot more challenging to promptly present what users are really looking for.

“With such a large amount of data it is important for social media companies to learn how those networks can retrieve information swiftly,” explains Smith. “We are basically doing the same thing, trying to connect the dots in a meaningful way and as efficiently as possible.”


To build a network of species, the Open Tree of Life team has adopted Neo4j, an open-source database that stores information structured in graphs rather than in tables. It is developed by Neo Technology, which has offices in Silicon Valley, England, Germany, and Sweden.

“I contacted some big database vendors and discussed the needs for a project like this. And it became clear that Neo4j had an advantage in some things we can use for our purposes,” Smith explains. “We are constantly trying to optimize the data. If I want to look at all branches of bacteria, the system needs to process all the trees that are relevant. We can do that with Neo4j.”
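The kind of query Smith describes, collecting every branch below a group such as bacteria, can be pictured as a graph traversal. The following is only a minimal sketch in plain Python, not the project's Java code and not the Neo4j API. The `CHILDREN` mapping, the `branches_below` function, and the handful of taxa are hypothetical, and the sketch assumes a single hierarchy, whereas the real system has to process many source trees.

```python
from collections import deque

# Hypothetical toy data: each key is a taxon, each value lists its child taxa.
# The selection is illustrative and is not taken from the Open Tree of Life database.
CHILDREN = {
    "cellular organisms": ["Bacteria", "Archaea"],
    "Bacteria": ["Proteobacteria", "Firmicutes"],
    "Proteobacteria": ["Escherichia coli"],
    "Firmicutes": ["Bacillus subtilis"],
    "Archaea": ["Halobacterium salinarum"],
}


def branches_below(root, children=CHILDREN):
    """Return every taxon below `root`, visited breadth-first."""
    found = []
    queue = deque(children.get(root, []))
    while queue:
        taxon = queue.popleft()
        found.append(taxon)
        # Follow the child links of this taxon, if it has any.
        queue.extend(children.get(taxon, []))
    return found


if __name__ == "__main__":
    print(branches_below("Bacteria"))
```

Because only the links leading out of the starting node are followed, the work grows with the size of the branch that was asked for rather than with the size of the whole data set, which is the general appeal of storing this kind of data as a graph.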

The first version of the graph software was released in early 2010. “It is relatively new. We are early adopters, but there is also an active community of users. That’s really important. When we have questions, we have readily available answers for them. So far it has been working out really well for us.”

“We hope to intrigue people with our efforts”

Mark Holder, an evolutionary biology professor at the University of Kansas who is also involved with the Open Tree of Life project, is content with the functionality of the graph database as well. “It is very flexible and gives us high-performance solutions. It will solve a big set of our problems. However, a lot of code is written from scratch. We need to come up with the right algorithms for piecing together trees. It is a lot of work. But that’s ok,” he says.

“We hope to intrigue people with our efforts and that scientists become enthusiastic about the system to contribute to the tree. And hopefully it takes off from there to build the whole tree with help from the entire community.”

Model extension

But before all those interested in the project can enjoy the new tools, large pieces of software still need to be written. For Smith, each new step in software development starts with a brainstorming session together with two postdoc researchers. Arrows connect a bunch of information boxes on a whiteboard, and potential algorithms are added to the remaining space.

“Neo4j is perfectly designed for sketching out the graphs and the relationships. So we put our ideas on the board to work out the algorithms and then go back to our computers to create functions in Java for visualization, synthesis, and analysis of the tree.”

Those brainstorming sessions with his postdocs and Google Hangout conversations with other investigators of the Open Tree of Life project have led to an abundance of ideas for extra features for the software they are creating. “There are a lot of exciting ways to extend the model. For example, we can add nodes with user information, so we can see to which branches they have contributed. Or maybe we can add information about what parts of the tree they are interested in, so they can be updated if any relevant data are added,” he says.

“But those are things for down the road. We need to first concentrate on nodes for species and all their relationships. That is the main goal.”
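To make the proposed extension concrete, here is a rough sketch of how user nodes could sit in the same graph as species nodes. It is written in plain Python rather than against Neo4j, and everything in it is an assumption made for illustration: the `Graph` class, the `Taxon` and `User` labels, the relationship names `CHILD_OF`, `CONTRIBUTED_TO`, and `INTERESTED_IN`, and the made-up users.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """A tiny in-memory graph: labelled nodes plus typed, directed edges."""
    nodes: dict = field(default_factory=dict)  # node name -> label
    edges: list = field(default_factory=list)  # (source, relationship type, target)

    def add_node(self, name, label):
        self.nodes[name] = label

    def relate(self, source, rel_type, target):
        # Refuse edges to unknown nodes so typos surface immediately.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.append((source, rel_type, target))

    def sources(self, rel_type, target):
        """Names of nodes that have a `rel_type` edge pointing at `target`."""
        return [s for s, r, t in self.edges if r == rel_type and t == target]


graph = Graph()
graph.add_node("Bacteria", "Taxon")
graph.add_node("Firmicutes", "Taxon")
graph.relate("Firmicutes", "CHILD_OF", "Bacteria")

# The extension: users become nodes in the same graph (hypothetical names).
graph.add_node("alice", "User")
graph.add_node("bob", "User")
graph.relate("alice", "CONTRIBUTED_TO", "Firmicutes")
graph.relate("bob", "INTERESTED_IN", "Firmicutes")

print(graph.sources("CONTRIBUTED_TO", "Firmicutes"))  # who worked on this branch
print(graph.sources("INTERESTED_IN", "Firmicutes"))   # who to update on changes
```

In this sketch the species part of the graph does not change at all. Users and their relationships are simply added alongside it, which is one reason such features can wait until the core nodes and relationships are in place.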

3 responses

  1. What you need to consider in ranking species, etc., is the nature of the intelligence that each species has been formed by; in other words that version of intelligence that in effect has needed that physical form or set of forms to use and gather more of the knowledge that its individual sets of intelligence can represent and use to, in effect, survive and evolve – if not to also evolve to survive. In short it’s the intelligence that evolves its forms, even though the forms by what can seem only accidental will “cause” intelligence to evolve its strategies.
    If you’re a natural selection advocate you won’t take this comment seriously, but if you’re a follower of self engineering theorists, you should take the intelligent aspects of all living creatures very seriously.


    February 26, 2013 at 7:36 pm
