Building Open Tree

Crandall Lab Update: What can we do with synthetic trees?

Currently, the Crandall Lab is examining ways to use the underlying OpenTree taxonomy to gather metadata, associate it with nodes and tips in our synthetic trees, and apply it to evolutionary studies. Below we discuss them in the context of ongoing projects in the lab.


Curated taxonomy

One of the major outcomes of the OpenTree project is the underlying taxonomy. A curated taxonomy allows us to search and align names across independent databases and pull out additional information to associate with node and tip names. The Crandall Lab taxonomy curation started with the freshwater crayfish, which belong to the Infraorder Astacidea and include 711 species spread among 7 families. This was a great group to start with because the number of species is limited and only a few active systematists are revising the alpha-taxonomy, which makes the literature less dense and easier to work with. Our initial investigations found the taxonomy to be very accurate; the main issue we had to contend with was spelling errors introduced when sequences were deposited in GenBank. In all, we identified and removed 10 misspelled taxa. Although that seems small, it was a great warm-up for two larger groups we are now working on: the Decapoda (crabs, shrimps, lobsters), which includes ~15,000 species, and the Hemiptera (true bugs), which includes ~50,000-80,000 species.


Using a curated taxonomy to obtain additional data

As mentioned above, once the names have been curated we can use them to search across databases. This has been extremely useful in obtaining additional metadata to associate with our synthetic trees. For example, the Crandall Lab recently published a synthetic tree of the crayfish (Fig. 1), which included IUCN Red List values plotted for those taxa with assigned values (Owen et al. 2015, Richman et al. 2015). This is only feasible because we are able to search across IUCN Red Listed crayfish species using the OpenTree curated taxonomy names.

Other applications of a curated taxonomy include searching GenBank to identify whether a particular taxon or rank has molecular data associated with it. This is useful for determining sampling strategies for new and continuing studies. For example, using the OpenTree taxonomy to search GenBank for Hemiptera families and genera, we found that a wealth of sequence data has been generated for most of the higher taxa. The most diverse suborder within Hemiptera is Heteroptera, and our query of names against NCBI GenBank suggests that 70 of the 83 described families within Heteroptera have sequence data for one or more of the traditional eight molecular loci used in Hemiptera systematics (Fig. 2A). As for the Hemiptera genera identified in GenBank, we are still validating the numbers in Fig. 2B: Hemiptera alpha-taxonomy is very active, because many species are vectors of human pathogens or agricultural pests (e.g., kissing bugs, aphids, psyllids).


In addition to searching GenBank, we are currently associating geographic, morphological, and ecological metadata with our curated names through GBIF and EOL TraitBank. We believe the curated OpenTree taxonomies of these groups, together with the accumulated metadata for their taxa, will add a new dimension to our evolutionary studies and allow us to expand the scope of the questions we can answer.

Figure 1. Synthetic tree of crayfish with 20 source trees. Family names are noted on the edge of the synthetic tree. Paraphyly of Cambaridae is not novel and needs to be addressed in a morphological revision. Color blocks note the IUCN Red List value.


Figure 2 Histograms depicting number of sequences found on GenBank given OTT names. 2A) Hemiptera families within suborders with nucleotide sequence data on NCBI GenBank. 2B) Hemiptera genera within suborders with nucleotide sequence data on NCBI GenBank.


Keith Crandall is a professor and director of the Computational Biology Institute at George Washington University. 

Chris Owen is a post-doctoral researcher for the AVAToL grant.


Owen, C. L., Bracken-Grissom, H., Stern, D., & Crandall, K. A. (2015). A synthetic phylogeny of freshwater crayfish: insights for conservation. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 370(1662), 20140009.

Richman, N. I., Böhm, M., Adams, S. B., Alvarez, F., Bergey, E. A., Bunn, J. J., … & Collen, B. (2015). Multiple drivers of decline in the global status of freshwater crayfish (Decapoda: Astacidea). Philosophical Transactions of the Royal Society B: Biological Sciences, 370(1662), 20140060.

Why Do We Need Big Trees, Anyway?

An explicit goal of the Open Tree of Life is to create a single phylogenetic tree that encompasses all living (and some extinct) biodiversity on earth. A question some may have, especially non-scientists, is why do we need a tree like that, and what would we do with it? You can’t even see it all at once, right? The answer to this question, of course, is that with bigger and more resolved trees we can answer evolutionary questions on scales not previously possible.

Currently, postdocs from the labs of Doug Soltis (Univ. of Florida) and Stephen Smith (Univ. of Michigan) are collaborating on several projects within the plant world that leverage the power of big trees. Cody Hinchliff, a postdoc in the Smith lab, recently presented some of these findings during a standing room only presentation at the Botanical Society of America conference in Boise, Idaho, employing a tree with almost complete generic level sampling to unravel evolution and diversification of epiphytes across vascular plants. Perhaps most surprisingly, Hinchliff found that most epiphyte lineages are relatively young, suggesting that either the widespread success that epiphytes currently exhibit is a recent phenomenon, or that epiphytic lineages are relatively short lived and evolve opportunistically in response to large-scale climate fluctuations. This, and other associated findings, are novel and exciting discoveries, and are examples of the insights that can be gleaned by analyzing character data across a massive data set.

Other collaborative “big tree” projects involving the Soltis and Smith labs address the evolution of the aquatic habit within land plants and the evolution of floral characters in the order Lamiales. These studies involve Hinchliff and Stephen Smith, Bryan Drew from the University of Nebraska at Kearney (formerly a postdoc with Doug Soltis) and Doug Soltis, and undergraduates from all three institutions. The aquatic evolution project is looking at how the re-colonization of aquatic habitats by plants is linked to lineage diversification and whether an aquatic habit is associated with other character or habitat traits. The Lamiales study is investigating which suites of floral characters may be responsible for the extraordinary evolutionary success of the lineage, which at 23,000 species comprises about one-twelfth of all flowering plants.

The fact that studies of this magnitude are not only possible, but ongoing, is a testament to the utility of big trees. Because these trees are nearly complete in terms of genera, we can account for virtually all diversity across these clades. Sparse lineage sampling and hence unaccounted for diversity has previously been a hindrance when analyzing evolutionary trends that span the tree of life, but the time is approaching (or might be here already!) where the size of the phylogenies will not be the limiting factor in studying broad scale evolutionary questions. This exciting development leaves researchers more time to examine and ponder truly interesting questions that could not be addressed previously. This is the power that big trees give us, and this is one of the reasons we need big trees.

Chronogram showing epiphytic evolution within vascular plants. Epiphytic lineages are shown in orange, and likely branches of epiphytic origin are in red. Root of tree is ~485 million years old.


Doug Soltis is a distinguished professor at the University of Florida.
Bryan Drew was previously a post-doctoral researcher in the Soltis lab and is currently an assistant professor at the University of Nebraska-Kearney.

A push for fungal phylogenies in the Open Tree of Life

The summer of 2014 was a busy one for the mycology group in the Open Tree of Life. Postdoctoral Fellow Romina Gazis gave presentations on the Open Tree of Life at the Annual Meeting of the Mycological Society of America (June 8-12, East Lansing, Michigan) and the International Mycological Congress (Aug. 3-8, Bangkok, Thailand). You can download the IMC presentation here.

Meanwhile, back in Worcester, we continued to compile published phylogenetic trees for incorporation into the Open Tree database. Our goal is to create a synthetic tree that represents, as closely as possible, our current understanding of the broad outlines of fungal phylogenetic relationships, based on molecular studies and taxonomy in Index Fungorum and other sources. We plan to use the tree as the centerpiece of a revision of “higher level” fungal taxonomy, updating a study that we published with seventy coauthors way back in 2007 [1].

Dr. Romina Gazis is a postdoc at Clark University. Dr. Gazis specializes in systematics of endophytes, including symbionts of rubber trees (Hevea brasiliensis) and the newly-described class Xylonomycetes, and also works on phylogenies for the Open Tree of Life project.


To this end, we reviewed the recent and not-so-recent fungal biology literature, emphasizing studies that made a major contribution to understanding of higher-level relationships. We thus identified 314 important studies that are a priority for inclusion in Open Tree of Life. The list of “critical” higher-level studies can be viewed here. Mycologists reading this blog post may wish to check our list of references, and let us know if we have missed anything! Please realize that at this point, we are prioritizing studies that resolve major clades, or that have particularly strong sampling of large groups.

Jiaqi Mei is an undergraduate research assistant at the Katz Lab at Smith College. Jiaqi has been working on gathering information on missing phylogenies for the Open Tree of Life project. Photo: Katz Lab

Having identified the critical higher-level analyses, our next job was to search for the phylogenies in TreeBase and upload them to Open Tree of Life via PhyloGrafter. We were assisted in this time-consuming work by Jiaqi Mei, an undergraduate from Laura Katz’s lab at Smith College who joined us for the summer. Of the 314 “higher level” studies, 119 (38%) had phylogenies available in TreeBase or other sources. In contrast, Drew et al. (2013) [2] found that only about 17% of published phylogenetic studies from all groups have available phylogenies. This suggests that mycologists who look at “big picture” phylogenetic relationships are particularly conscientious about data deposition! Nonetheless, there were still many missing phylogenies, so Jiaqi and Romina initiated an e-mail campaign, reaching out to authors of the 195 critical higher-level studies for which we had no trees. We are very grateful to have received responses from almost 50 authors so far. If you are among those who replied to our plea for data, we want to take this opportunity to say Thank You! You should have received a note from us. If not, something may have been lost in transit, so please write again!

Our immediate goal is to compile phylogenies that address higher-level relationships, but we are not neglecting fungal studies at low taxonomic levels. In fact, one of Jiaqi’s major tasks was to update our literature review of all fungal phylogenies, reviewing publications since the 2013 study of Drew et al. [2], which included studies published up to 2012. Overall, we have identified 2314 fungal phylogenetic studies published since 2000 in 17 journals, of which 640 (28%) have associated treefiles.

It is hard to believe that the Open Tree of Life Project is already in its third year. Our major goal by the end of this academic year is to produce a synthetic phylogenetic tree that significantly updates the 2007 “AFTOL Classification” [1] of Fungi, with direct connections to taxonomy and diverse phylogenetic studies. With the continued cooperation of the mycological community we are optimistic that we will reach this goal.

[1] Hibbett, D. S., M. Binder, J. F. Bischoff, M. Blackwell, P. F. Cannon, O. E. Eriksson, S. Huhndorf, T. James, P. M. Kirk, R. Lücking, T. Lumbsch, F. Lutzoni, P. B. Matheny, D. J. Mclaughlin, M. J. Powell, S. Redhead, C. L. Schoch, J. W. Spatafora, J. A. Stalpers, R. Vilgalys, M. C. Aime, A. Aptroot, R. Bauer, D. Begerow, G. L. Benny, L. A. Castlebury, P. W. Crous, Y.-C. Dai, W. Gams, D. M. Geiser, G. W. Griffith, C. Gueidan, D. L. Hawksworth, G. Hestmark, K. Hosaka, R. A. Humber, K. Hyde, J. E. Ironside, U. Kõljalg, C. P. Kurtzman, K.-H. Larsson, R. Lichtwardt, J. Longcore, J. Miądlikowska, A. Miller, J.-M. Moncalvo, S. Mozley-Standridge, F. Oberwinkler, E. Parmasto, V. Reeb, J. D. Rogers, C. Roux, L. Ryvarden, J. P. Sampaio, A. Schüßler, J. Sugiyama, R. G. Thorn, L. Tibell, W. A. Untereiner, C. Walker, Z. Wang, A. Weir, M. Weiß, M. M. White, K. Winka, Y.-J. Yao, N. Zhang. 2007. A higher-level phylogenetic classification of the Fungi. Mycological Research 111: 509-547.

[2] Drew, B.T., R. Gazis, P. Cabezas, K.S. Swithers, J. Deng, R. Rodriguez, L.A. Katz, K.A. Crandall, D.S. Hibbett, D.E. Soltis. 2013. Lost branches on the tree of life. PLOS Biology 11:e1001636.

David Hibbett is a professor of biology and PI of the Hibbett lab at Clark University.

Romina Gazis is a postdoc at Clark University. 


Which came first? A pivotal position in the plant tree of life

Amborella trichopoda


The question of which extant angiosperm (flowering plant) lineage “came first” (i.e., is basal in the flowering plant tree of life) has long puzzled biologists. This question is fascinating and important in its own right, but the answer also has potentially profound ramifications for our understanding of plant gene and genome evolution (which, for example, has implications for crop improvement). Such information is also important for understanding habit and habitat evolution and for inferring ancestral character states in the angiosperms (e.g., the ancestral flower as well as the ancestral angiosperm genome). Although great 20th century plant taxonomists such as Arthur Cronquist, Armen Takhtajan, and Robert Thorne generally agreed that taxa from the subclass Magnoliidae comprised the “basal” angiosperm lineage, there was no way to “prove”, one way or another, which extant angiosperm lineage came first until the advent of molecular systematics towards the end of the 20th century.

With the aid of modern molecular phylogenetic techniques it is now known that the major groups they recognized, such as Magnoliidae sensu Cronquist and Takhtajan, are typically polyphyletic. Most research now indicates instead that Amborellaceae, Nymphaeales (water lilies), and Austrobaileyales are the earliest branching extant angiosperm lineages. However, the relative branching order of these three lineages, particularly with regard to Amborella trichopoda (the sole species within Amborellaceae) and Nymphaeales, was, until recently, somewhat contentious.

While most molecular analyses during the past 20 years have recovered Amborella as the earliest-diverging angiosperm lineage, some studies have suggested a clade comprising Amborella + Nymphaeales, or even Nymphaeales alone, as sister to all other angiosperms. Recently, at the University of Florida, Soltis lab postdoc Bryan Drew and colleagues (including AVAToL team member Stephen Smith at the University of Michigan) endeavored to definitively answer the longstanding question of which angiosperm came first: that is, which living angiosperm is sister to all other living angiosperms in the angiosperm tree of life. Using a plastid data set consisting of 236 taxa, 78 genes, and ~58,000 nucleotides, Drew et al. performed a myriad of analyses with the express purpose of discerning the first-diverging angiosperm lineage; this study was just accepted by Systematic Biology and will be viewable online in the coming months. Their results: virtually every analysis conducted found Amborella as the earliest-diverging living angiosperm lineage with high internal support, and every plastid analysis performed using their original datasets recovered a topology in which Amborella alone is sister to all other living angiosperms.

These findings lend strong affirmation to the Amborella sister hypothesis, and should help guide future research regarding angiosperm character (including genomic features) and habitat evolution. Although the “first” angiosperms are long extinct, a better understanding of Amborella will aid in our understanding of angiosperm evolution as a whole. This was the impetus behind the Amborella Genome Project. As a result of this ongoing project, the Amborella nuclear genome has recently been fully sequenced (Amborella Genome Project, Science, in press), and this major achievement should lead to unprecedented insights within flowering plants.


Doug Soltis is a distinguished professor at the University of Florida.

Bryan Drew is a post-doctoral researcher in the Soltis lab at the University of Florida.

How computer scientists are using map distance to determine phylogeny

What is distance?

Distance is a way to measure the relatedness of two things. It is phrased in terms of similarity or difference relative to a feature. Different features expose different information about how the things are related. For instance, if we compare two cities, we might compute their geographical distance or how far apart they are in terms of miles or kilometers. But, if we are making a car trip, we may want to compute a different distance. Roads rarely directly connect two points, so we may care more about the driving distance or driving time. On the other hand, if we’re looking for somewhere warm to spend the winter, we may care most about the difference between the temperatures of two cities.

Distance is a requirement for comparison. It is fundamental to the assessment of data required by scientific pursuits as well as to the value judgments made in our daily lives. Thus, distance is a cornerstone of the human experience.

What does distance tell us about trees?

Consider the four phylogenies of the genus Panthera (the big cats) shown below. The trees are from actual phylogenetic analyses performed by different researchers over the years. The fourth tree is the current best estimate of the big cat phylogeny, by Davis, Li, and Murphy. (For further details, see their 2010 paper “Supermatrix and species tree methods resolve phylogenetic relationships within the big cats, Panthera (Carnivora: Felidae)” in Molecular Phylogenetics and Evolution.)

There are different trees because researchers use different combinations of phylogenetic reconstruction methods and phylogenetic data. Typically, these discrepancies are resolved with a consensus tree, in which a relationship is included if it appears in either most of the trees (majority consensus) or all of the trees (strict consensus). For our example, the majority consensus tree retains only one relationship, as shown below. Most of the information from the trees is lost, which is one disadvantage of summarizing a set of trees with a single consensus tree.
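The two consensus rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each tree is represented by its set of non-trivial bipartitions written as strings such as "CS|TJLN" (one letter per taxon), and the four trees below are hypothetical examples made up for this sketch, not the four published Panthera trees from the figure. The helper names (parse_split, majority_consensus, strict_consensus) are likewise our own.

```python
from collections import Counter
from typing import FrozenSet, List, Set

Split = FrozenSet[FrozenSet[str]]

def parse_split(text: str) -> Split:
    """Turn 'CS|TJLN' into an unordered pair of taxon sets (one letter per taxon)."""
    left, right = text.split("|")
    return frozenset([frozenset(left), frozenset(right)])

def majority_consensus(trees: List[Set[Split]]) -> Set[Split]:
    """Keep the splits found in strictly more than half of the trees."""
    counts = Counter(split for tree in trees for split in tree)
    return {split for split, n in counts.items() if n > len(trees) / 2}

def strict_consensus(trees: List[Set[Split]]) -> Set[Split]:
    """Keep only the splits found in every tree."""
    return set.intersection(*trees)

# Hypothetical trees over C, S, T, J, L, N (NOT the trees in the figure).
raw_trees = [
    ["CS|TJLN", "CST|JLN", "CSTJ|LN"],
    ["CS|TJLN", "CSL|TJN", "CSLT|JN"],
    ["CT|SJLN", "CTJ|SLN", "CTJS|LN"],
    ["CS|TJLN", "CST|JLN", "CSTL|JN"],
]
trees = [{parse_split(s) for s in tree} for tree in raw_trees]

majority = majority_consensus(trees)
strict = strict_consensus(trees)
print(len(majority), len(strict))
```

In this made-up set, only the split CS|TJLN occurs in more than half of the trees and no split occurs in all four, so the majority consensus keeps a single relationship and the strict consensus keeps none.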

In our example, the consensus shows that there is not much in common among the four trees. But if we look at distance, we can gain more information. For example, which of the trees are most closely related? In phylogenetics, distance is generally defined in terms of the relationships expressed by bipartitions. A bipartition is the split induced by an edge: removing the edge separates the tree into two partitions. Assume that C, S, T, J, L, and N represent Clouded Leopard, Snow Leopard, Tiger, Jaguar, Leopard, and Lion, respectively. For tree 1, the bipartitions are C|STJLN, CS|TJLN, CST|JLN, CSTJ|LN, and CSTJL|N. Bipartition C|STJLN means there is an edge that, when removed, leaves one partition containing Clouded Leopard and the other partition containing Snow Leopard, Tiger, Jaguar, Leopard, and Lion. We can compute the Robinson-Foulds (RF) distance between two trees Ti and Tj by counting the number of bipartitions in Ti but not in Tj and adding that to the number of bipartitions in Tj but not in Ti. The RF distance is then this sum divided by 2. Based on the RF distance matrix of our big cat trees shown below, Trees 1 and 4 as well as Trees 3 and 4 are the closest trees, since they have the smallest RF distance of 1.
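As a rough sketch of the RF calculation described above, the Python below represents a tree as nested tuples, collects its bipartitions, and applies the "sum divided by 2" rule. Several things here are assumptions for illustration: the nested-tuple encoding, the function names, and the second tree (tree_x), which is a made-up topology rather than one of the published big cat trees. Tree 1 is written as the ladder-shaped tree implied by the bipartitions listed in the text. Only non-trivial bipartitions are collected, because bipartitions that cut off a single taxon are shared by every tree on the same taxa and so never change the distance. This is a naive version, not the HashRF or MrsRF algorithms.

```python
from typing import FrozenSet, Set, Tuple, Union

Tree = Union[str, Tuple["Tree", ...]]
Split = FrozenSet[FrozenSet[str]]

def leaves(tree: Tree) -> FrozenSet[str]:
    """Return the set of taxon labels below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    found: FrozenSet[str] = frozenset()
    for child in tree:
        found |= leaves(child)
    return found

def bipartitions(tree: Tree) -> Set[Split]:
    """Collect the non-trivial bipartitions (at least two taxa on each side)."""
    all_taxa = leaves(tree)
    splits: Set[Split] = set()

    def visit(node: Tree) -> None:
        if isinstance(node, str):
            return
        side = leaves(node)
        other = all_taxa - side
        if len(side) > 1 and len(other) > 1:
            splits.add(frozenset([side, other]))
        for child in node:
            visit(child)

    visit(tree)
    return splits

def rf_distance(t1: Tree, t2: Tree) -> float:
    """(splits only in t1 + splits only in t2) / 2, as defined in the text."""
    b1, b2 = bipartitions(t1), bipartitions(t2)
    return (len(b1 - b2) + len(b2 - b1)) / 2

# Tree 1 from the text: CS|TJLN, CST|JLN, CSTJ|LN are its non-trivial splits.
tree1 = ("C", ("S", ("T", ("J", ("L", "N")))))
# Hypothetical comparison tree (an assumption): Jaguar and Leopard swapped.
tree_x = ("C", ("S", ("T", ("L", ("J", "N")))))

print(rf_distance(tree1, tree_x))
```

The two trees differ in exactly one bipartition each (CSTJ|LN versus CSTL|JN), which is the kind of near-agreement a small RF distance is meant to capture.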



What tools exist for computing tree distances?

One of the main focuses in our lab is designing high-performance algorithms for comparing trees. For computing RF distances between thousands of trees, we have designed the algorithms HashRF and MrsRF. Besides bipartitions, quartets are also used for describing the relationships in a tree. Whereas a bipartition shows the relationship between all of the taxa in a tree, a quartet is based on only four taxa. As with bipartitions, we can use quartets to compare trees. To compute the quartet distance quickly, we have designed the Quick Quartet algorithm. Finally, an interesting consequence of tree distance is that we can use it to compress collections of trees. If trees have much in common, they can be stored in a smaller representation. Our TreeZip algorithm is a first step in the direction of compressing phylogenetic trees.
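To make the quartet idea concrete, here is a brute-force Python sketch of a quartet distance: for every set of four taxa, it checks which pairing each tree supports and counts the sets on which the two trees disagree. This is emphatically not the Quick Quartet algorithm, only a naive illustration. The input format (bipartitions written as strings with one letter per taxon), the function names, and the second tree are all assumptions made for the example.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List, Optional, Tuple

Split = Tuple[FrozenSet[str], FrozenSet[str]]
Quartet = FrozenSet[FrozenSet[str]]

def parse_split(text: str) -> Split:
    """Turn 'CS|TJLN' into a pair of taxon sets (one letter per taxon)."""
    left, right = text.split("|")
    return frozenset(left), frozenset(right)

def quartet_topology(splits: List[Split], four: Tuple[str, ...]) -> Optional[Quartet]:
    """Return the pairing ab|cd supported by some split, or None if unresolved."""
    a, b, c, d = four
    for first, second in (({a, b}, {c, d}), ({a, c}, {b, d}), ({a, d}, {b, c})):
        for left, right in splits:
            if (first <= left and second <= right) or (first <= right and second <= left):
                return frozenset([frozenset(first), frozenset(second)])
    return None

def quartet_distance(s1: List[Split], s2: List[Split], taxa: Iterable[str]) -> int:
    """Count the four-taxon subsets on which the two trees disagree."""
    return sum(
        1
        for four in combinations(sorted(taxa), 4)
        if quartet_topology(s1, four) != quartet_topology(s2, four)
    )

taxa = "CSTJLN"
# Tree 1 from the text, given by its non-trivial bipartitions.
splits1 = [parse_split(s) for s in ("CS|TJLN", "CST|JLN", "CSTJ|LN")]
# Hypothetical second tree (an assumption): Jaguar and Leopard swapped.
splits2 = [parse_split(s) for s in ("CS|TJLN", "CST|JLN", "CSTL|JN")]

print(quartet_distance(splits1, splits2, taxa))
```

With six taxa there are 15 possible quartets, and the two trees above disagree only on the three that contain Jaguar, Leopard, and Lion together. Enumerating every quartet grows very quickly with the number of taxa, which is why a dedicated fast algorithm is needed for large trees.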

How can distance measures help us build the Tree of Life?

Distance measures are essential in the synthesis of new trees into the ToL. If the distances for a particular set of taxa are large, this could mean there is significant disagreement on the relationships in that part of the ToL. On the other hand, if the trees are close in terms of distance, there is evidence for substantial agreement among the trees. For trees being added to the ToL, distances can help guide their integration. Large distances may require significant manual curation to integrate the trees, whereas small distances indicate substantial agreement with the existing ToL and allow the curator to focus on a smaller set of trees.

Tiffani Williams is an assistant professor in the department of computer science at Texas A&M University.

Ralph Crosby is a graduate teaching assistant at Texas A&M University.

Grant Brammer is a graduate teaching assistant at Texas A&M University.

Mapping the Tree of Life: the ARBOR Project


Open Tree of Life met with ARBOR, a program funded by the National Science Foundation, to talk about recent developments involving the synthetic tree of life. We spoke with Dr. Luke Harmon, an associate professor in the University of Idaho’s department of Biology. Dr. Harmon has been using comparative biology to determine what the tree of life can tell us about evolution over long time scales.

What is ARBOR working on right now?

Comparative biology is at the heart of the ARBOR project. Using the evolutionary relationships among species, we can learn something about trait evolution and the formation of new species. For example, there really is no basic ‘ladder of life’ leading from simpler organisms to more complex ones; instead, evolution varies among groups and through time in complex and interesting ways. It’s hard to do what we do with traditional tools. Instead, we have to use new tools to analyze how species have diversified to generate the tree of life.

How have phylogeny studies changed over time?

A lot of progress has been made in the last twenty years regarding our understanding of the relationships among different species. We now know a lot more about how species are related to one another and how they evolved from their common ancestors. The Open Tree of Life is the best possible example of this sort of synthesis – it’s almost like the human genome project in that it is generating a very good map that will connect all organisms on earth in a single phylogenetic tree. One problem, though, is that there is just so much information contained in large phylogenetic trees, and we don’t always know how to extract information about how organisms evolve. ARBOR is developing tools to read the stories of evolution from these phylogenies.

Taxonomy and the tree of life

What’s in a name?

It is now widely accepted that taxonomy should reflect phylogeny — that the names we use in biological classifications should refer to branches on the tree of life. This was one of Darwin’s most revolutionary ideas, that common ancestry is the fundamental organizing principle for natural classification:

“… community of descent is the hidden bond which naturalists have been unconsciously seeking.”

Charles Darwin, On the Origin of Species

One of the main goals of the Open Tree of Life project is to facilitate phylogenetic “synthesis”. What does this mean? The general idea is to take disparate pieces of information — in this case, phylogenetic trees from the scientific literature, or the data sets on which they are based — and merge them together in ways that yield more comprehensive and (hopefully) more accurate inferences of the tree of life as a whole. Like a jigsaw puzzle, the assembled pieces reveal the big picture.

Taxonomy is central to this exercise, because names are the primary link between the products of phylogenetic research. Without taxonomy, a phylogenetic tree from a typical study would simply depict relationships among individual organisms. This would not, in general, be very useful. Imagine if someone told you: “I know of a red house and a blue house, and the road between them runs north-south for about 100 miles.” Without any additional information, this statement has little if any value. For it to make sense, you would ideally want to know the address of each house, and the name of the road connecting them; but even incomplete information (what cities and states are the houses in?) is better than nothing. Only then could you figure out that the route in question is, for example, Interstate 94 between Chicago, IL and Milwaukee, WI.

Similarly, the organisms used in a particular phylogenetic study must be taxonomically classified in order to establish, like pins on a map, how the branches of the inferred tree represent “real” branches in the tree of life. This allows common relationships across studies to be discovered. To continue the analogy, if you know of a yellow house in Chicago and a green house in Milwaukee, you also know that I-94 connects them just as it does the red and blue houses mentioned above. The phylogenetic tree relating a rose, pumpkin, and oak depicts the same relationships — that is, it traces essentially the same evolutionary history — as the tree relating an apple, cucumber, and walnut. In each case, different organisms were chosen to represent the angiosperm orders Rosales, Cucurbitales, and Fagales, respectively.

You might recognize something paradoxical here. I started off by stating that taxonomy should reflect phylogeny. But then, I proceeded to describe how taxonomy is needed to interpret the results of phylogenetic studies. If taxonomy reflects knowledge of phylogeny, and knowledge of phylogeny is derived from studies of organisms chosen for the taxa they represent, isn’t this a chicken-and-egg problem?

The short answer: yes, it is. Systematic biology is a science of reciprocal illumination between, on one hand, what we discover about the tree of life, and on the other, how we reflect and communicate that knowledge through taxonomy. One can view a taxonomic hierarchy — the arrangement of species within genera, genera within families, and so on — as a working hypothesis, subject to revision. Taxonomic names refer to branches on the tree of life that we believe to exist, but we are open to new information that may change our view. For example, we might discover that members of two genera, hypothesized to be exclusive groups based on their morphological differences, are in fact co-mingled on the same branch of the tree of life when DNA evidence is studied. The question then arises: what happens to the names of the original genera? How should we refer to their common branch? These are issues of nomenclature, a topic beyond the scope of this blog post, but the bottom line is that eventually, taxonomy should be updated to reflect this new knowledge.

The tension between taxonomy and phylogeny is at the heart of the basic question, “what do we know about the tree of life, and how do we know it?” While this question is somewhat metaphysical, it also has very practical implications of immediate concern to the Open Tree project. Most importantly, it has been necessary for us to cobble together a comprehensive taxonomic hierarchy that includes all of life, since none existed previously that were reasonably up-to-date. This “Open Tree Taxonomy” serves a critical purpose — basically, it is what allows us to wrangle herds of phylogenetic trees into a common bioinformatic corral. The challenge we face moving forward is how our synthesis efforts can be leveraged to improve and refine our working taxonomy, closing the loop of reciprocal illumination that is central to the discipline of systematics.

Richard Ree is a curator at the Field Museum of Natural History and a faculty member of the Committee on Evolutionary Biology at the University of Chicago.

The Crandall lab explores solutions to incomplete phylogenies

The Crandall Lab is in charge of uploading and curating animal studies for the AVAToL-Open Tree project. Chris Owen, a postdoctoral researcher, has led the animal portion of the project since March 2013. To date, the Crandall Lab has contributed over 400 studies and has sent requests to the authors of over 100 studies to contribute their phylogenies to the Open Tree project.

Similar to the Soltis Lab group, the Crandall Lab success rate for obtaining published phylogenies directly from authors has been rather low. As a result, many animal lineages are currently represented in the Open Tree as taxonomic graphs. One example of a poorly sampled group is the decapods (crabs, crayfish, lobsters, prawns, and shrimp). Dr. Keith Crandall has studied decapods most of his career and his phylogenies generate a well-sampled backbone, but each higher taxon is represented by few species. Many researchers want to use the tree for some downstream analysis that benefits from sampling all species; therefore, at this stage of the project one must ask, “How can I obtain a phylogeny of all species for my favorite group, if the only thing available in Open Tree is a well-resolved backbone, while lower taxonomic ranks are represented primarily by unresolved taxonomic graphs?”

Recently, a paper was published in the journal Nature that may present a workaround for people who wish to obtain a mostly bifurcating comprehensive phylogeny, although only a bifurcating backbone is available on OpenTree. The published study by Jetz et al. (2013) aimed to use a phylogeny of birds to explore changes in speciation and extinction rates through time, while also mapping all bird diversity, to gain insight into bird evolution. In order to explore these characteristics of bird evolution, the authors first needed a phylogeny of birds that included all species. However, no such phylogeny had ever been published, and the most comprehensive bird phylogenies available at the time of the study did not contain all species for each crown clade. Their solution to generating a phylogeny of all birds began by assigning each avian genus to a crown clade represented in the backbone phylogenies. Next, sequence data for a set of loci for each species in a crown clade were downloaded from public databases and the phylogeny was estimated using Bayesian inference. Since the crown clades of the backbone tree contain taxa that are also in the newly estimated crown phylogenies, the newly estimated crown phylogenies were sub-sampled with the backbone phylogenies to generate a pseudo-posterior distribution of complete avian phylogenies, which was used to depict the avian phylogeny with all species for downstream analyses.

As the organismal labs continue to track down studies and wait for requested published phylogenies, a method such as this may be a temporary solution to obtain mostly bifurcating phylogenies for lineages not well represented by source trees. Variations on this theme could also be used. For example, one could estimate a single tree for each crown clade and then use Open Tree software to merge each of those trees with the Open Tree phylogeny, which has a well-resolved backbone but unresolved recent clades, ultimately creating a synthetic tree for your favorite group.
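
The grafting idea can be sketched in a few lines of Python. This is only an illustration of the general approach, not the Open Tree software or the method of Jetz et al.: the `Node` class, the `graft` and `to_newick` functions, and the clade and tip labels (`CladeA`, `a1`, and so on) are all hypothetical. It assumes the backbone has one placeholder tip per crown clade and that a separately estimated tree exists for some of those clades.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """A tree node; a node without children is a tip."""
    name: str = ""
    children: List["Node"] = field(default_factory=list)

    def is_tip(self) -> bool:
        return not self.children


def graft(backbone: Node, clade_trees: Dict[str, Node]) -> Node:
    """Build a new tree in which each backbone tip named after a clade
    is replaced by that clade's tree. Tips without a clade tree are kept
    as unresolved placeholders. The backbone itself is not modified."""
    if backbone.is_tip():
        return clade_trees.get(backbone.name, Node(backbone.name))
    return Node(backbone.name, [graft(c, clade_trees) for c in backbone.children])


def to_newick(node: Node) -> str:
    """Serialize a tree as a Newick string (without the trailing semicolon)."""
    if node.is_tip():
        return node.name
    return "(" + ",".join(to_newick(c) for c in node.children) + ")" + node.name


# Hypothetical backbone: three crown clades, each a single placeholder tip.
backbone = Node("", [Node("", [Node("CladeA"), Node("CladeB")]), Node("CladeC")])

# Hypothetical clade trees; CladeC has none, so it stays a taxonomy-only tip.
clade_trees = {
    "CladeA": Node("CladeA", [Node("a1"), Node("", [Node("a2"), Node("a3")])]),
    "CladeB": Node("CladeB", [Node("b1"), Node("b2")]),
}

merged = graft(backbone, clade_trees)
print(to_newick(merged) + ";")
```

A real analysis would also have to reconcile branch lengths and handle taxa that appear in both the backbone and a clade tree, which this sketch ignores.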

These are a couple of potential methods to generate comprehensive phylogenies using the Open Tree for poorly resolved lineages represented only by taxonomy and we look forward to new ideas other researchers offer once the tree becomes public.

Keith Crandall is a professor and director at the George Washington University Institute of Computational Biology.

Chris Owen is a post-doctoral researcher for the AVAToL grant at George Washington University.

The Soltis lab fills the gaps in green plant phylogeny for the Open Tree of Life

Phylogenetic tree summarizing relationships among major lineages of green plants (Viridiplantae)

In the Soltis lab at the University of Florida, Bryan Drew and Jiabin Deng have spent much of the past year collecting trees and alignments of green plants (Viridiplantae) as part of an effort to produce a synthetic tree that represents all of the described organisms on Earth. As part of the tree-gathering process, they have gleaned public database archives and contacted corresponding authors directly to request data. Although these methods were not as successful as had been hoped, they recovered trees from over 1000 publications involving green plants.

As might be expected, some areas of the green plant tree are better resolved than others. For example, within gymnosperms and flowering plants we have author-submitted trees that support the monophyly of most major lineages, but for other major lineages of green plants, such as green algae and bryophytes, sampling is not as complete and those parts of the tree are not as well resolved. Fortunately, for green algae at least, help is on the way in the form of the NSF-funded “Assembling the Green Algae Tree of Life” project. Although results from this project will not be incorporated into the upcoming Open Tree of Life “Big Bang Tree”, within a few years the green algae portion of the Open Tree will undoubtedly benefit greatly from the inclusion of trees from the Green Algae Tree of Life project. Other parts of the green plant tree are shaping up nicely, and the Soltis lab is sending out some last-minute requests to authors in an attempt to shore up regions of the tree that are presently underrepresented.

Here we provide a basic summary of what we know about green plant phylogeny, stressing that there is much we still do not know about relationships in this large clade of perhaps 500,000 species. We know from the fossil record that many green plant taxa have gone extinct; these extinctions contribute to “long branches” in the Tree of Life and can make it very difficult to determine relationships between older lineages. In the green plant tree, two main clades have been recovered, the Chlorophyta and the Streptophyta. The chlorophytes contain most of what is traditionally known as green algae, while the streptophytes contain the remaining green algae as well as land plants (Embryophyta). One of the many insights provided by molecular systematics during the past twenty years is that “green algae” as long recognized are not actually a natural group (i.e., they are not monophyletic), and that some traditionally classified “green algae” are actually more closely related to land plants. However, the closest “green algal” relative of land plants remains unclear; some studies suggest Charales whereas others indicate Zygnematales or Coleochaetales. The land plants (embryophytes) include bryophytes (mosses, hornworts, and liverworts) and vascular plants (tracheophytes). There is still some question as to whether the bryophytes are a natural group or comprise separate evolutionary lineages. The vascular plants comprise lycophytes (clubmosses and quillworts), monilophytes (e.g., ferns and horsetails), gymnosperms (cycads, Ginkgo, gnetophytes, and conifers), and angiosperms (flowering plants).

Though the relationships of some large clades are uncertain, these uncertainties will be shown in the Big Bang tree given that we possess many of the trees that highlight these different clade placements. In other areas of the green plant tree we are sorely lacking data, and the Soltis lab (in close collaboration with Stephen Smith’s lab at the University of Michigan) is still working hard to fill in the tens of thousands of holes in the tree that remain. This is a beautiful part of the Open Tree of Life: as with the organisms that it represents, the tree is ever growing!

Doug Soltis is a distinguished professor at the University of Florida.

What do mycologists think about the tree of life?

David Hibbett screenshot of presentation

Two Open Tree participants, Romina Gazis and David Hibbett, recently attended the annual meeting of the Mycological Society of America in Austin, Texas. Romina gave a presentation about the Open Tree of Life Project, which gave us a chance to hear some thoughts from our community. Questions (paraphrased) included the following:

“When the synthetic tree is available, will we be able to filter on a node-by-node basis, or just tree-by-tree? For example, will we be able to identify the strongly supported nodes in individual trees and then constrain the synthetic tree to include those nodes, but not other, weakly supported nodes?”

Capturing information about individual branches, such as support values and branch lengths, is difficult, and in some cases impossible, because the trees were deposited without such information included. It is possible to make decisions about priority on a node-by-node basis, but this requires decision-making by the individual performing the synthesis.

“Can this synthetic view of the tree be used to guide genome sampling priorities?”

Absolutely! In fact, the ongoing 1000 Fungal Genomes Project is already using taxonomy to guide sampling. Open Tree will be able to help in this effort by providing a comprehensive view of phylogenetic diversity of Fungi that will help identify clades that are poorly sampled. We will also be able to prioritize genome-based studies during synthesis, which should allow us to create trees based on a very robust backbone.

Numerous talks and posters at MSA concerned fungal phylogenetics and taxonomy. So much progress is being made! For example, there were presentations on systematics of chytrids, downy mildews, rusts, earth tongues, lichens, mushrooms, and many more. At the same time, in the course of developing the first synthetic trees for this project, it has become abundantly clear that the major centralized taxonomic resources, like the Global Biodiversity Information Facility (GBIF) and the National Center for Biotechnology Information (NCBI), have a hard time capturing phylogenetic knowledge. To be fair, it is unreasonable to think that any single organization can keep track of all the progress in taxon discovery and phylogenetic inference across the entire tree of life. Sitting in the audience at MSA, I wondered how long it would take for the trees being projected on-screen to be reflected in the taxonomy presented by organizations like GBIF or NCBI (or EoL, CoL, etc.). Perhaps a new, community-based approach is needed for building a taxonomic commons?

For the .pdf file of Open Tree of Life’s Challenges and Progress for Fungi, check out Mycological Society of America 2013.

Dr. David Hibbett is a professor of Biology at Clark University.

Online publication to follow the three AVAToL projects

PLOS Currents: Tree of Life

Peer-reviewed articles about the Open Tree of Life as well as two related projects, Arbor and Phenomics, will be available on PLOS Currents: Tree of Life. The online publication allows the researchers to document their progress in developing software and other tools.

The three research endeavors were developed during an Ideas Lab last year as part of the National Science Foundation’s (NSF) Assembling, Visualizing, and Analyzing the Tree of Life (AVAToL) program. The Open Tree of Life project strives to produce the first draft of a comprehensive tree of life and provides tools for community enhancement and annotation. The Arbor project is developing comparative methods with utility across large sections and the entire tree of life. Finally, the Phenomics project is developing approaches for exploring and documenting phenotypic diversity across the tree of life.

“It’s meant to be a quick outlet for solid phylogenetic studies”

PLOS Currents websites encourage researchers to share their findings with their peers with minimal delay. The Tree of Life section is focused on rapid publication of phylogenetic and systematic studies with novel data and/or analyses. According to Keith Crandall, one of the three editors of the journal and an investigator of the Open Tree of Life, “it’s meant to be a quick outlet for solid phylogenetic studies to get them and their data into the public domain.”

Presentation slides from Evolution 2013 available

Open Tree of Life at meetings

The Open Tree of Life project is one of the many phylogeny projects featured at the Evolution 2013 meeting, currently taking place in Snowbird (UT). The presentation slides from Karen Cranston, the principal investigator of Open Tree of Life, are available online (LINK). Presentation slides from other investigators will be added here in the coming days.

Evolution 2013 is the joint annual meeting of the Society for the Study of Evolution (SSE), the Society of Systematic Biologists (SSB), and the American Society of Naturalists (ASN). The conference meets jointly with the iEvoBio conference. Open Tree of Life is represented at both events. About 1400 participants are expected to share their research in evolution, systematics, biodiversity, software, and mathematics.

Free webinar: Putting all species in a graph database

Biology + Technology = OTOL

One of the developers of the Open Tree of Life will demonstrate on Thursday, during a free webinar, how graph databases are used to construct a tree of life. The lecture is organized by Neo Technology, the maker of Neo4j, an open-source database that is used for OTOL.

Stephen Smith, an ecology and evolutionary biology professor at the University of Michigan, is going to explain how Neo4j and other digital technologies are assisting in constructing the tree of life. Starting at 10:00 PDT (19:00 CEST), he will also discuss other aspects of the interface of biology with next generation technologies.

“Our project is building the tools with which scientists in the community can continually improve the tree of life as we gather new information. Neo4j allows us to not only store trees in their native graph form, but also allows us to map trees to the same structure, the graph. So in fact, we are facilitating the construction of the graph of life,” says Smith.
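
To make the idea of mapping several trees onto one graph concrete, here is a minimal Python sketch. It does not use Neo4j or any Open Tree code; the function names `add_tree` and `studies_supporting`, the nested-tuple tree format, and the taxa and study labels are assumptions made purely for illustration. Each node is identified by the set of tips below it, so trees that agree on a grouping share an edge and trees that disagree add alternative edges.

```python
from collections import defaultdict


def add_tree(graph, tree, source):
    """Add one tree's parent -> child relationships to the shared graph.

    `tree` is a tip name (str) or a tuple of subtrees. Each node is
    identified by the frozenset of tip names below it. Returns that
    frozenset for the root of `tree`.
    """
    if isinstance(tree, str):
        return frozenset([tree])
    child_sets = [add_tree(graph, child, source) for child in tree]
    node = frozenset().union(*child_sets)
    for child in child_sets:
        # Record which source tree proposes this relationship.
        graph[(node, child)].add(source)
    return node


def studies_supporting(graph, tip_names):
    """Return the sources that contain a grouping with exactly these tips."""
    target = frozenset(tip_names)
    supporting = set()
    for (parent, child), sources in graph.items():
        if child == target:
            supporting |= sources
    return supporting


graph = defaultdict(set)
add_tree(graph, (("A", "B"), "C"), "study1")  # A and B are closest relatives
add_tree(graph, ("A", ("B", "C")), "study2")  # conflicting: B and C instead
add_tree(graph, (("A", "B"), "C"), "study3")  # agrees with study1

print(sorted(studies_supporting(graph, {"A", "B"})))
print(sorted(studies_supporting(graph, {"B", "C"})))
```

Because conflicting groupings sit side by side in the same structure, a synthesis step can later decide which edges to follow, for example by preferring the sources a curator ranks highest.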

Neo4j approached the Open Tree of Life team to present a webinar because it is a project that utilizes the Neo4j graph database to represent the interconnectedness of biological data. The company considers the project a great example of how a graph database can better model the natural world.

The online lecture is intended for a broad audience including beginner computer programmers, advanced hackers, data scientists, natural scientists, and anyone interested in the cross-section of science and technology, especially data modeling. Over 150 people have already registered online.

The registration form: LINK

Update: The video from this webinar is available on vimeo:

Building an API for the Open Tree of Life database

Do you want an app for this?

The developers of the Open Tree of Life would like to know from the phylogenetic community what kind of information they want to extract from its database when the first draft is released later this year. With those preferences, it is possible to develop an API that gives scientists the opportunity to build their own websites or software packages that use the data.

An API (application programming interface) is a digital tool that allows one website or software program to “talk” to another website to dig up certain pieces of data. For instance, a lot of people use Tweetdeck to navigate the ongoing bombardment of messages in the Twittersphere. In that case, Tweetdeck is connecting to Twitter, through its API, to receive and order the messages according to the preferences of the user.

In the case of the Open Tree of Life, an API gives researchers advanced access to the data of about two million species, the phylogenies that have been created to illustrate possible relationships between them, and the underlying data and methods of synthesis. “For example, it will be possible to select smaller trees for specific species or find out how many studies there are for a particular node within the database,” says Karen Cranston, the lead investigator of the project.

Connecting millions of data points in a graph database

Creating ‘Facebook’ for species

The Open Tree of Life database is not just a list with about two million species. Information is added about their special characteristics and possible relationships with others as well. “It may become tens or hundreds of million pieces of data when we are all done.”

Stephen Smith, an evolutionary biology professor at the University of Michigan, is working together with the other researchers of the Open Tree of Life project to develop the programs and tools that will be used to construct the full tree of life. Scientists from all over the world can then synthesize all the information in the database.

“We are currently building the back-end of the Open Tree of Life. We need to create software that allows us to put all our information in a graph network, so that we can easily retrieve the information that researchers are specifically looking for.”

“We need a sense of ownership of phylogenetic trees”

Where are the fungi datasets?

A couple thousand fungi phylogeny studies have been published in the past twelve years. Clark University postdoc researcher Romina Gazis has gone through all of them. Now she is working on a bigger challenge: finding all the trees and datasets that were the foundation of those studies.

Ideally, all scientists who publish a phylogenetic tree would also deposit the datasets they used to create such trees in a publicly available online database. That allows other researchers to synthesize data from different sources to advance the knowledge about relationships between certain species and their evolutionary history.

Unfortunately, most of those datasets are not publicly available. Gazis only found datasets for about a quarter of the two thousand fungi articles she surveyed. “Around 600 studies had tree files available, but not necessarily complete,” she concluded. “Some scientists deposited one but not all the trees.”

Tree of Life: Are big changes looming on the horizon?

All species like some gadgets

Photo by PublicDomainPictures (Creative Commons Deed CC0)

While movie hero James Bond gets his spy gadgets from his loyal developer Q, almost every other species on Earth has to put a little more effort into armoring itself. But that does not mean they cannot rely on some good ol’ friends to do so. In fact, the acquisition of genes from two or more species through lateral gene transfer can lead to innovations that at times can be painful, sometimes even deadly, to others.

One of those evolutionary novelties is found in certain types of jellyfish, which developed the ability to sting after their ancestors acquired a gene from a bacterium and incorporated that material in their own DNA. This gene transmission helped jellyfish to create an innovative defense tool to fend off other species that could endanger them. The result is quite frightening: more humans get killed by jellyfish than by sharks.

Small portion of phylogenetic data is stored publicly

‘The glass is still pretty empty’

Sometimes you wonder whether the glass is half full or half empty.

But when it is only filled for four percent—the other 96 percent is just air—there is only one conclusion: it is time for more.

At least that is what some scientists in the phylogenetic community argue, because only about four percent of all published phylogenies are stored in places such as TreeBASE or Dryad. Their message is quite simple: it is time to bring together more databases with estimations on how species are possibly related to each other.

Several journals in the evolutionary biology field recently adopted policies that encourage or require contributors to make their data publicly available online. Yet, this only leads to the storage of a very small percentage of the tens of thousands of phylogenies that have been constructed in the past few decades.

Of course, there are also other ways to receive data that are not stored on the Internet, but those alternatives are commonly not the most efficient routes. For instance, it is possible to send an email to a scientist who published a phylogenetic tree and “sometimes wait for six months to maybe get a response—either with or without the data,” says Keith Crandall, one of the Open Tree of Life investigators and the founding director of the Computational Biology Institute at George Washington University.


You don’t want to build a new tree from scratch?

‘Let the computer do the work’

Creating a phylogenetic tree is no easy task. It usually involves a complex synthesis of multiple datasets, but it leads to much satisfaction when all work is done—until new data come in.

Then, the process typically starts all over again: building a new tree from scratch.

Mark Holder, a professor of statistical phylogenetics at the University of Kansas and one of the investigators of the Open Tree of Life project, maintains that there is a real need for scientists to have access to digital tools that save them from doing quite a few labor-intensive procedures.

“In the past, researchers combined information from different trees and then analyzed the data. But they never made good computer systems that allowed for continuous updating. They would not be able to see how an entire tree would look like when they added more data or another individual tree. In that case, they had to start over.”



Connecting millions of pieces

Creating the entire tree of life is like completing a jigsaw puzzle with more than two million pieces. And to make it even harder, the picture of what the solved puzzle should look like is missing.

No one knows precisely how all pieces are related.

This uncertainty is unmistakably demonstrated by disagreements between evolutionary biologists about how certain species and branches are linked together. Throughout the years they have created a large variety of trees with specific groups of species that contradict each other. For example, one researcher maintains that species A is the closest living relative of species B, but another scientist thinks that species C is actually most closely related to B.


Is it a plant? Or is it a monkey?

It should not be hard to recognize the differences between furry night monkeys and the bright yellow flowers of golden peas. But they have something peculiar in common that leads to some confusion once in a while: their name. Both genera are officially known as Aotus.

There are about two million known species on the planet, so it should not come as a surprise that scientists have accidentally given certain species, or groups of species, the same name. For instance, Proboscidea is considered an order of elephants, but it is also the name for the genus of devil’s claws. Other examples include Myrmecia pyriformis (insect and green alga), Ficus elegans (mollusc and plant), Ormosia nobilis (insect and plant), and Trigonidium grande (orchid and katydid).
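
A small Python sketch shows why such shared names matter for a database: looking a taxon up by name alone is ambiguous, so some extra context is needed. The record layout, the `animal`/`plant` context labels, and the `resolve` function are hypothetical and chosen only for illustration; they are not how the Open Tree taxonomy actually stores or disambiguates names.

```python
from collections import defaultdict

# (name, context, description) records taken from the examples above.
records = [
    ("Aotus", "animal", "night monkeys"),
    ("Aotus", "plant", "golden peas"),
    ("Proboscidea", "animal", "order of elephants"),
    ("Proboscidea", "plant", "genus of devil's claws"),
]

# Index by name; a shared name maps to more than one entry.
by_name = defaultdict(list)
for name, context, description in records:
    by_name[name].append((context, description))


def resolve(name, context=None):
    """Return the description for `name`, using `context` to break ties.

    Raises LookupError if the name is unknown or still ambiguous.
    """
    matches = by_name.get(name, [])
    if context is not None:
        matches = [m for m in matches if m[0] == context]
    if len(matches) != 1:
        raise LookupError(f"{name!r} matches {len(matches)} taxa; add context")
    return matches[0][1]


print(resolve("Aotus", "plant"))
```

Calling `resolve("Aotus")` without a context raises `LookupError`, which is the behavior one would want before attaching data to the wrong organism.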



Across disciplinary boundaries

The interdisciplinary team of the Open Tree of Life project

What do a fungal evolutionary biologist and a computer scientist have in common?

It is usually easier to name a long list of differences, but that does not mean that those scholars are investigating different issues all the time. They may be very much interested in the same problems, yet apply different perspectives and methods in search for answers.

Those scientists could continuously work on their individual research projects for many years. However, in some cases only an interdisciplinary collaboration leads to a solution. The investigators of the Open Tree of Life project hope this will be the case for them as well. Their goal: creating a tree of life that includes all 1.9 million known species.


Wanted: All your favorite trees

With eleven investigators, the Open Tree of Life project is already a large-scale research endeavor. But that does not mean that they can add all 1.9 million known species to a database by themselves. In fact, they are looking for help.

A lot of help.

The main goal of the project is to merge all existing phylogenetic trees in one overarching tree of life. In the past few months, the researchers have been working on software applications to make it possible to store all known species and, more important, to specify how they are all linked to each other in evolutionary terms.


Open Data in the Open Tree of Life

Making valuable research data available to others in the scientific community is at the heart of open science, an idea very central to the Open Tree of Life project. Through collaboration and the sharing of information, the goal of the Open Tree of Life is to take the discoveries about the phylogenetics of all life and make them easily accessible to everyone.

With 1.9 million species described, and with thousands more being discovered and named each year, there is no shortage of new research being done in phylogenetics, or the relationships between species, genera, and families. What there is a shortage of, however, is digital data that is provided with these findings – data that can be used in projects like the Open Tree of Life.

Why is this? Despite decades of funding towards this type of research, a huge amount of our knowledge isn’t available in ways that are reusable. This lack of data availability is due to several factors: the data used to construct the Tree of Life have not always been provided when scholarly articles have been published, or they have been stored in a way that isn’t easily accessed, manipulated, or maintained.