## How computer scientists are using map distance to determine phylogeny

### What is distance?

Distance is a way to measure the relatedness of two things. It is phrased in terms of similarity or difference relative to a feature. Different features expose different information about how the things are related. For instance, if we compare two cities, we might compute their geographical distance or how far apart they are in terms of miles or kilometers. But, if we are making a car trip, we may want to compute a different distance. Roads rarely directly connect two points, so we may care more about the driving distance or driving time. On the other hand, if we’re looking for somewhere warm to spend the winter, we may care most about the difference between the temperatures of two cities.

Distance is a requirement for comparison. It fundamental to the assessment data required by scientific pursuits as well as the value judgments made in our daily lives. Thus, distance is a cornerstone of the human experience.

### What does distance tell us about trees?

Consider four phylogenies over the genus Panthera or big cats shown below. Here, the trees are from actual phylogenetic analyses performed by different researchers over the years. The fourth tree is the current best estimate of the big cats by Davis, Li, and Murphy. (For further details, see their 2010 paper “Supermatrix and species tree methods resolve phylogenetic relationships within the big cats, panthera (carnivora: Felidae)” in Molecular Phylogenetics and Evolution.)

There are different trees because researchers use different combinations of phylogenetic reconstruction methods and phylogenetic data. Typically, these discrepencies are resolved by a consensus tree where relationships are included in the consensus tree if they appear in either most of the trees (majority consensus) or all of the trees (strict consensus). For our example, the majority consensus tree only retains one relationship as shown below. Most of the information from the trees is lost, which is one disadvantage of summarizing a set of trees with a single consensus tree.

In our example, the consensus shows that there is not much in common among the four trees. But, if we look at distance, we could gain more information. For example, which of the trees are most closely related? In phylogenetics, distance is generally defined by relationships defined by bipartitions. A bipartition is an edge that when removed separates the tree into two partitions. Assume that C, S, T, J, L, and N represent Clouded Leopard, Snow Leopard, Tiger, Jaquar, Leopard, and Lion, respectively. For tree 1, the bipartitions are C|STJLN, CS|TJLN, CST|JLN, CSTJ|LN, and CSTJL|N. Bipartiton C|STJLN means there is an edge that when removed has one partition containing Clouded Leopard and the other partition containing Snow Leopard, Tiger, Jaquar, Leopard, and Lion. We can compute the Robinson-Foulds (RF) distance between two trees T_{i} and T_{j} by counting the number of bipartions in T_{i} but not in T_{j} and adding that to the number of bipartions in T_{j} but not in T_{i}. The RF distance is then this sum divided by 2. Based on the RF distance matrix of our big cat trees shown below, Trees 1 and 4 as well as Trees 3 and 4 are the closest trees since they have the smallest RF distance of 1.

In our example, the consensus shows that there is not much in common among the four trees. But, if we look at distance, we could gain more information. For example, which of the trees are most closely related? In phylogenetics, distance is generally defined by relationships defined by bipartitions. A bipartition is an edge that when removed separates the tree into two partitions. Assume that C, S, T, J, L, and N represent Clouded Leopard, Snow Leopard, Tiger, Jaquar, Leopard, and Lion, respectively. For tree 1, the bipartitions are C|STJLN, CS|TJLN, CST|JLN, CSTJ|LN, and CSTJL|N. Bipartiton C|STJLN means there is an edge that when removed has one partition containing Clouded Leopard and the other partition containing Snow Leopard, Tiger, Jaquar, Leopard, and Lion. We can compute the Robinson-Foulds (RF) distance between two trees T_{i} and T_{j} by counting the number of bipartions in T_{i} but not in T_{j} and adding that to the number of bipartions in T_{j} but not in T_{i}. The RF distance is then this sum divided by 2. Based on the RF distance matrix of our big cat trees shown below, Trees 1 and 4 as well as Trees 3 and 4 are the closest trees since they have the smallest RF distance of 1.

### What tools exist for computing tree distances?

One of the main focuses in our lab is designing high-performance algorithms for comparing trees. For computing RF distances between thousands of trees, we have designed the algorithms HashRF and MrsRF. Besides bipartions, quartets are also used for describing the relationships in a tree. Whereas a bipartition shows the relationship between all of the taxa in a tree, a quartet is based on 4 taxa. Similarly to bipartitons, we can then use quartets to compare trees. To compute the quartet distance quickly, we have designed the Quick Quartet algorithm. Finally, an interesting consequence of tree distance is that we can use it to compress collections of trees. If trees have much in common, they can be stored in a smaller representation. Our TreeZip algorithm is a first step in the direction of compressing phylogenetic trees.

### How can distance measures help us build the Tree of Life?

Distance measures are essential in the synthesis of new trees into the ToL. If for a particular set of taxa the distances are large, this could mean there is significant disagreement on the relationships in that part of the ToL. On the other hand, if the trees are close in terms of distance, there is evidence for substantial agreement within the trees. For trees being added to the ToL, distances can help guide the integration of the new trees. Large distances may require significant manual curation to integrate the trees whereas small distance indicate substantial agreement with the existing ToL and allow the curator to focus on a smaller set of trees.

*Tiffani Williams is an assistant professor in the department of computer science at Texas A&M University.*

*Ralph Crosby is a graduate teaching assistant at Texas A&M University.*

*Grant Brammer is a graduate teaching assistant at Texas A&M University.*