Assembling, Visualizing, and Analyzing the Tree of Life


Is it a plant? Or is it a monkey?

It should not be hard to recognize the differences between furry night monkeys and the bright yellow flowers of golden peas. But they have something peculiar in common that leads to some confusion once in a while: their name. Both genera are officially known as Aotus.

There are about two million known species on the planet, so it should not come as a surprise that scientists have accidentally given certain species, or groups of species, the same names. For instance, Proboscidea is considered an order of elephants, but it is also the name for the genus of devil’s claws. Other examples include Myrmecia pyriformis (insect and green algae), Ficus elegans (mollusc and plant), Ormosia nobilis (insect and plant), and Trigonidium grande (orchid and katydid).

“That has historically been a problem,” says Laura Katz, a professor of biological sciences at Smith College, who is leading an effort to create a single list with all species for the Open Tree of Life database. “Someone in Europe discovers a new insect species and gives it a name, while someone else in the United States wants to label a group of bacteria and gives it a similar name. That is often hard to avoid, especially in times when the Internet was not around yet.”

Functioning taxonomy

Those overlapping names generally do not cause any confusion in phylogenetic research, because night monkey experts would not mistake those animals for plants. However, they become an issue when you create a database with all known species, because the computer systems must be programmed in such a way that they point users to the proper records when they look for Anthrax (the bombyliid flies) and not Anthrax (the bacterium), or for any other genus.
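To make the lookup problem concrete, here is a minimal Python sketch of a name index that keeps homonyms apart. It is not the Open Tree of Life software. The `Taxon` fields, the `context` labels ("Animalia", "Plantae"), and the numeric ids are all assumptions made up for illustration. Only the names themselves come from the examples above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Taxon:
    taxon_id: int   # hypothetical internal id, not a real database id
    name: str       # scientific name as published
    rank: str       # e.g. "genus" or "order"
    context: str    # hypothetical broad group label used to tell homonyms apart


class NameIndex:
    """Maps a name to every taxon that carries it, instead of assuming one."""

    def __init__(self):
        self._by_name = defaultdict(list)

    def add(self, taxon):
        self._by_name[taxon.name.lower()].append(taxon)

    def lookup(self, name, context=None):
        """Return all taxa with this name, optionally restricted to one context."""
        matches = self._by_name.get(name.lower(), [])
        if context is not None:
            matches = [t for t in matches if t.context == context]
        return list(matches)

    def homonyms(self):
        """Return every name shared by more than one taxon."""
        return {ts[0].name: ts for ts in self._by_name.values() if len(ts) > 1}


index = NameIndex()
index.add(Taxon(1, "Aotus", "genus", "Animalia"))        # night monkeys
index.add(Taxon(2, "Aotus", "genus", "Plantae"))         # golden peas
index.add(Taxon(3, "Proboscidea", "order", "Animalia"))  # elephants
index.add(Taxon(4, "Proboscidea", "genus", "Plantae"))   # devil's claws

# A bare name is ambiguous; adding a context narrows it to one record.
ambiguous = index.lookup("Aotus")
plant_only = index.lookup("Aotus", context="Plantae")
```

The design choice is that a name alone never identifies a record. The index returns every candidate and leaves it to the caller, or to an extra qualifier such as the context, to pick the right one.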

The software that is currently being developed should enable users to find information about one or more species instantly and, equally important, to leave the millions of other species out of the search results. Stephen Smith, an assistant professor of evolutionary biology at the University of Michigan, is one of the technology designers for the Open Tree of Life project. Multiple species with the same name and species that have evolved partly through lateral gene transfer are among the many difficulties in developing an efficient search engine for the tree of life.

“We really need a functioning taxonomy. It may sound trivial, but you actually want all of the individual trees consistent with the terms that are being used. It needs to be clear what are considered bacteria and what are not, especially when you are dealing with about two million species. We really try to avoid entering data with multiple meanings, because that eventually leads to lots of problems.”

Making a list…

The goal is to produce a tree structure that can eventually encompass all life forms. This includes the life forms characterized thoroughly with both morphological and genetic data, species for which there are only genetic data, and species from the early years of systematics for which the only information is a physical description or a drawing. Complete molecular data exist for fewer than a quarter of a million known species. And 90 percent of all known species have not been sampled with more modern techniques at all.

“The value of our effort is to put a list together that allows for phylogenetic synthesis of all the kinds of data that are available, whether it is a description from 1880 about a microorganism that was studied with a crappy microscope or high-tech molecular sequencing performed today. Right now, there is no resource to get all these data in one computer-readable format,” explains Katz.

Besides the homonyms, there are other big challenges in creating such a comprehensive list. For instance, scientists around the world do not use one uniform naming system for all newly discovered species. Naming codes are different for plants, animals, microbes, and other forms of life. This creates some confusion, as the codes have been applied in different ways.

Naming all organisms in a consistent way that can be understood by all users depends on a number of factors: what family the organism comes from, what naming system is used, and what data are available about the organism, to name just a few. One research team might store their data under one scientific name, while another might use a completely different one.

“We really have to deal with all the chaos. Otherwise, we could get the synthesis wrong at the end. Our aim is to allow for a plurality of approaches. We want to try and put all species in context. So that means that we need to disambiguate names,” she maintains.
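The other half of disambiguation, different names for the same organism, can be sketched in a similar way. The Python below is a minimal illustration under stated assumptions, not the project's method. The `synonym_of` table and the names in it ("Exemplum antiquum" and so on) are invented placeholders, not real taxa, and the functions `resolve` and `merge_records` are hypothetical.

```python
def resolve(name, synonym_of):
    """Follow synonym links until an accepted name is reached."""
    seen = set()
    current = name
    while current in synonym_of:
        if current in seen:
            # Guard against bad data such as A -> B -> A.
            raise ValueError(f"synonym cycle involving {current!r}")
        seen.add(current)
        current = synonym_of[current]
    return current


def merge_records(records, synonym_of):
    """Group (name, source) records under the accepted name for each."""
    merged = {}
    for name, source in records:
        merged.setdefault(resolve(name, synonym_of), []).append(source)
    return merged


# Placeholder data: two older names that both lead to one accepted name.
synonym_of = {
    "Exemplum antiquum": "Exemplum vetus",
    "Exemplum vetus": "Exemplum novum",
}
records = [
    ("Exemplum vetus", "team A"),
    ("Exemplum novum", "team B"),
    ("Exemplum antiquum", "1880 description"),
]
merged = merge_records(records, synonym_of)
```

With this table, all three records end up under the single accepted name, so data stored by different teams under different names can be synthesized together.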

… and checking it twice (or more often)

The Open Tree of Life team is making considerable progress generating a complete list, according to Katz. “We have captured roughly 1.9 million species and we are adding another 200,000 species, right now. More will follow soon. It is now the only list available with this many species. That is the good news.”

That does not mean that she is satisfied yet with the quality of the list, as many names still need to be standardized. Much additional work lies ahead to create order in the colossal taxonomy maze caused by the different naming customs and practices that have evolved over hundreds of years. “Actually, right now, the list is bad. It is awful,” says Katz, laughing, poking fun at the long way she and her colleagues still have to go to eventually create their envisioned tree of life. “We are talking about millions of species, not just a few hundred. The scale of this project is massive. So it is only a start and we have a whole lot to do before we present a draft.”

That means much work not only for the eleven Open Tree of Life investigators, but also for the many scientists with an interest in taxonomy and phylogeny. Participation by researchers from all over the world is critical for the success of the project.

Currently, scientists can submit publications of their favorite phylogenetic trees on the Open Tree of Life page on Mendeley, and they will be able to help with taxonomic issues as well when the first draft of the database is released next year. “We are trying really hard to continue with some refinement in the upcoming months, but then other researchers can help us clean up the data in their individual areas of expertise. We are creating a mechanism for the community to make the tree better, enabling anyone to contribute. That is our overall objective.”

(Rosemary Keane contributed to this article)

Photo sources

Aotus ericoides (Australia) by “Melburnian”

Aotus lemurinus zonalis (Panama) by “dsasso”

5 responses

  1. I’m trying to reconcile the statement that “We have captured roughly 1.9 million species and we are adding another 200,000 species, right now…It is now the only list available with this many species.” with the fact that the latest GBIF classification has more species (2,498,063), and can be downloaded (I don’t see a link to the OpenTree list of species). Given that GBIF have assembled a bigger list, is there a reason why OpenTree is making its own? Does it have data GBIF doesn’t have?

    October 11, 2012 at 7:30 pm

    • Hi Rod,
      The link to the very-much-a-work-in-progress classification is
      I’ll ask the folks who have been working on the taxonomy about checking out the GBIF list.

      thanks, Mark

      October 11, 2012 at 8:41 pm

      • rdmpage

        Hi Mark,

        If you look at the list of databases GBIF aggregates it’s pretty much got all there are. There will, naturally, be issues, but it might make sense to focus on cleaning that list, rather than recreate the same process. Another advantage is that it gives you GBIF ids, which make it straightforward to link to their distributional data.


        October 11, 2012 at 9:14 pm

  2. Rod,

    Thanks for the comment — we’ll look to see if there are non-synonym genera and species in GBIF that are not yet in pre-OTToL, and then capture these.

    Unfortunately, we could not use GBIF as a starting point as the backbone taxonomy (from CoL) is inconsistent with most views of the tree of life (e.g. has 9 kingdoms (instead of, perhaps, 3 domains); diatoms within Plantae while the rest of the stramenopiles are in the not-well-supported group Chromista; a non-monophyletic group called ‘Protozoa’, etc., etc.). Also, GBIF has parent and offspring taxa with identical names (e.g. look up Thermotoga), which we do not allow in pre-OTToL, and there is no clear treatment of homonyms. Although we surely have more work to do, I think the difference in species number is in part because: 1) we have not included viruses (for now); and 2) we only count unique species, not synonyms.


    October 15, 2012 at 6:43 pm

    • rdmpage


      I agree that the GBIF classification isn’t particularly phylogenetic; most biodiversity databases have pretty much ignored phylogeny. In a way this is possibly a good thing, as traditional classification and evolutionary history (with all its complexity) are not terribly compatible – put another way, I’m a fan of keeping the two quite distinct.

      Having parent and offspring taxa with the same name seems unavoidable given taxa such as Acanthocephala (animal genus and phylum) and subgenera (NCBI classification has Rana at both generic and subgeneric rank, albeit qualified in the “unique_name” field).

      The GBIF classification species count doesn’t count synonyms (the latest version I downloaded has 3234553 species-level names, of which 2395339 are regarded as valid). Homonyms are always a problem, but GBIF is trying to disambiguate these (with the inevitable errors). They seem to be getting better and better at cleaning up their classification.

      October 16, 2012 at 12:12 pm

