PhyloCode names are not useful for phylogenetic synthesis

Ok, the title is intentionally a bit provocative, but bear with me.

A primary aim of the Open Tree project is to synthesize increasingly comprehensive estimates of phylogeny from “source trees” — published phylogenies constructed to resolve relationships in disparate parts of the tree of life. The general idea is to combine these localized efforts into a unified whole, using clever bioinformatic algorithms.

In this context, a basic operational question is: how do we know if a clade in one source tree is the same as a clade in another source tree? This can be difficult to answer, because source trees are typically constructed from carefully selected samples of individual organisms and their characters (usually DNA sequences). If two source trees are inferred from completely non-overlapping samples of individual organisms, as is commonly the case, is it possible for them to have clades in common, or rather, is it possible for us to determine whether they have clades in common?

I would argue that the answer is yes, with a very important condition: that the organisms sampled for each tree are placed into a common taxonomic hierarchy that embodies a working hypothesis of named clades in the tree of life.

Note an important distinction here: a clade in a source tree depicts common ancestry of selected individual organisms, while a clade in the tree of life is a conceptual group defined by common ancestry that effectively divides all organisms, living and dead, into members and non-members. So a taxon in this sense is a name that refers to a particular tree-of-life clade whose membership is formalized by its position in the comprehensive taxonomic hierarchy.

By placing sampled organisms into a common taxonomic hierarchy, one can compute the relationships between source-tree clades and tree-of-life clades in terms of taxa, a process that I refer to as “taxonomic normalization.”
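To make this concrete, here is a minimal sketch of one way taxonomic normalization could work: map each source-tree clade to the least inclusive taxon that contains all of its sampled organisms. The `PARENT` table and the function names are hypothetical, and the tiny bird taxonomy is toy data for illustration only, not an authoritative classification or the Open Tree implementation.

```python
# Toy taxonomic hierarchy (illustrative only): child taxon -> parent taxon.
PARENT = {
    "Passer domesticus": "Passer",
    "Passer montanus": "Passer",
    "Passer": "Passeridae",
    "Montifringilla nivalis": "Montifringilla",
    "Montifringilla adamsi": "Montifringilla",
    "Montifringilla": "Passeridae",
    "Passeridae": "Passeriformes",
    "Corvus corax": "Corvus",
    "Corvus": "Corvidae",
    "Corvidae": "Passeriformes",
}


def lineage(taxon):
    """Return the taxon followed by all of its ancestors, nearest first."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def normalize(clade_tips):
    """Map a source-tree clade (a set of sampled taxa) to the least
    inclusive taxon in the hierarchy that contains every sample."""
    tips = list(clade_tips)
    common = set(lineage(tips[0]))
    for tip in tips[1:]:
        common &= set(lineage(tip))
    # Walk up from the first tip; the first shared ancestor is the answer.
    for taxon in lineage(tips[0]):
        if taxon in common:
            return taxon
    return None  # the samples share no taxon in this hierarchy


# Two hypothetical source-tree clades with no sampled organisms in common.
clade_in_tree_a = {"Passer domesticus", "Montifringilla nivalis"}
clade_in_tree_b = {"Passer montanus", "Montifringilla adamsi"}
```

Under these assumptions, both clades normalize to `Passeridae` even though the two trees share no samples, which is what lets us ask whether they agree or conflict about that taxon.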

An idea that emerges from this line of thinking is that the central paradigm of systematics is (or should be) the reciprocal illumination of phylogeny and taxonomy. That is, phylogenetic research tests and refines taxonomic concepts, and those taxonomic concepts in turn guide the selection of individual organisms for future research. I would argue that this, in a nutshell, is “phylogenetic synthesis.”

Which brings me to the title of this post. In the PhyloCode, taxonomic names are not hypothetical concepts that can be refuted or refined by data-driven tests. Instead, they are definitions involving specifiers (designated specimens) that are simply applied to source trees that include those specifiers. This is problematic for synthesis because if two source trees differ in topology, and/or they fail to include the appropriate specifiers, it may be impossible to answer the basic question I began with: do the trees share any clades (taxa) in common? If taxa are functions of phylogenetic topology, then there can be no taxonomic basis for meaningfully comparing source trees that either differ in topology, or do not permit the application of taxon definitions.
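A small sketch may help illustrate the problem. It assumes a simplified node-based definition ("the least inclusive clade containing specifiers A and B") and represents source trees as nested tuples of tip labels. The function names and the trees are hypothetical, and real PhyloCode definitions are richer than this.

```python
def tips(tree):
    """Return the set of tip labels in a tree given as nested tuples."""
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= tips(child)
    return result


def node_based(tree, specifiers):
    """Apply a simplified node-based definition: return the tips of the
    least inclusive clade containing all specifiers, or None if the tree
    lacks any specifier and the definition cannot be applied."""
    specifiers = set(specifiers)
    if not specifiers <= tips(tree):
        return None
    if not isinstance(tree, str):
        for child in tree:
            # Descend while a single child still holds every specifier.
            if specifiers <= tips(child):
                return node_based(child, specifiers)
    return tips(tree)


# Three hypothetical source trees.
tree1 = ((("A", "B"), "C"), "D")
tree2 = ((("A", "C"), "B"), "D")  # different topology
tree3 = (("C", "E"), "D")         # neither specifier was sampled

name_in_tree1 = node_based(tree1, ["A", "B"])
name_in_tree2 = node_based(tree2, ["A", "B"])
name_in_tree3 = node_based(tree3, ["A", "B"])
```

Here the same name picks out {A, B} in the first tree and {A, B, C} in the second, and it cannot be applied to the third at all, so the name by itself gives no fixed basis for deciding whether the trees share that clade.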

So phylogenetic synthesis requires taxa that are explicitly not functions of phylogenetic topology. Instead, taxa should exist independently as hypotheses to be tested by phylogenetic evidence, and as systematists we should strive to construct comprehensive taxonomic hierarchies. I think this is going to be the real key to making progress in answering the question, “what do we know about the tree of life, and how do we know it?”

5 responses

  1. Yep, this is the point I was making in a post back in 2007: PhyloCode names aren’t terribly useful, except as rules for sticking tags on trees.


    March 7, 2014 at 8:09 am

  2. I guess I’d say “names (in and of themselves) are not useful for phylogenetic synthesis.” This critique does not seem specific to the PhyloCode.

    I’d say that in the current type-based name system, the names themselves are almost content-free. When you get right down to it, the codes just make statements like: “If you have a family and the oldest type genus in that family is the genus ‘Passer’, then you should use the family name ‘Passeridae’ (rather than some other name or inventing a new name).”

    You would not be able to do much phylogenetic synthesis with just a bunch of statements like that.

    Now given a classification, you can do a lot. But I think the information is in the classification not the names.

    [updated comment: typo fixed 15:38 March 7 2014]


    March 7, 2014 at 3:33 pm

    • rickree

      Yes, the point is to build classifications (comprehensive taxonomic hierarchies). But classifications are made up of names, and names have to come from somewhere.


      March 7, 2014 at 3:58 pm

  3. “In the PhyloCode, taxonomic names are not hypothetical concepts”

    But they do refer to hypotheses! Minimally something like “A and B share common ancestry” (node-based), “A has ancestors that B does not have” (branch-based), or “A inherited trait M from its ancestors” (apomorphy-based). Of course, we might regard these hypotheses as trivial, but more complex ones can be built into definitions as well, such as “A and B share ancestry that C doesn’t” (branch-based, multiple internal specifiers), or “A and B inherited trait M from a common ancestor” (apomorphy-based, multiple representative specifiers). Qualifying clauses allow quite a few extra permutations, too; have a look at Article 11.9.

    There are always conditions under which a clade definition can yield an empty set. So, yes, they can be tested.


    March 7, 2014 at 6:54 pm

    • (Gah, this isn’t allowing me to log in from my personal Twitter account, @tmkeesey, for some reason.)


      March 7, 2014 at 6:56 pm