Proposal for OpenTree node stability

Currently, OpenTree has two different types of node IDs. Taxonomy (OTT) IDs are assigned to named nodes when we construct a taxonomy release, and phylogenetic node IDs are assigned by the treemachine neo4j graph database for nodes that do not align to an OTT ID (i.e. nodes added due to phylogenetic resolution). The OTT IDs are fairly stable over time, but the neo4j node IDs are definitely not stable, and the same neo4j ID may point to a completely unrelated node in future versions of the graph.

This system is problematic because we expose both types of IDs in the APIs (and also in URLs for the tree browser). The lack of neo4j node stability therefore affects API calls that use node IDs, browser bookmarks to nodes in the synthetic tree, and feedback left by users about specific nodes in the tree (see feedback issue #63 and treemachine issue #183). The OTT IDs are problematic as well: it is not straightforward to document when we reuse an existing OTT ID, mint a new ID, or delete an existing ID when going from one version of the taxonomy to the next.

At our recent face-to-face meeting, we discussed a proposal for a node identifier registry. We don’t intend this system to be a universally used set of node definitions (i.e. we aren’t trying to make a PhyloCode registry). We want a lightweight system that prevents exposure of unstable node IDs through the APIs to clients (including our own web application) and provides some measure of predictability. Feedback on this proposal would be greatly appreciated.

Requirements

  • be able to use the same node ID definitions across OTT and the synthetic tree
  • transparency about when we reuse a node ID from a previous version of the tree or taxonomy (or not)
  • users get an error when using a node ID from a previous version where there is no current node that fits that definition
  • fixing errors (such as moving a snail found in a worm taxon to its proper location) should not involve massive numbers of ID changes
  • generation of node definitions based on a given taxonomy must be automated and efficient
  • application of node definitions to an existing tree / taxonomy must be automated and efficient

Proposal

Develop a lightweight registry of node definitions based on the structure of the OpenTree taxonomy. For each new version of the taxonomy and synthetic tree, use the registry to decide when to reuse existing node IDs and when to register a new definition and ID.

Leaf nodes will be assigned IDs during creation of OTT based on name (together with enough taxonomic context to separate homonyms).

The definition of the ID for a non-leaf node will include a list of IDs for nodes that are descendants of the intended clade, a list of IDs for nodes that are excluded from being descendants, and (optionally) a taxonomic name.
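
As a rough illustration of what such a definition might look like, here is a minimal Python sketch. The class name `NodeDefinition`, the function `satisfies`, the `reg:1` ID format, and the use of common names as leaf IDs are all assumptions made for this example and are not part of the proposal.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class NodeDefinition:
    """Hypothetical registry entry for a non-leaf node."""
    node_id: str                # registry ID (format is an assumption)
    includes: FrozenSet[str]    # IDs that must be descendants of the node
    excludes: FrozenSet[str]    # IDs that must not be descendants of the node
    name: Optional[str] = None  # optional taxonomic name


def satisfies(defn: NodeDefinition, descendant_ids: Iterable[str]) -> bool:
    """Return True if a node with these descendants fits the definition."""
    descendants = set(descendant_ids)
    return defn.includes <= descendants and defn.excludes.isdisjoint(descendants)


# Illustrative definition; the leaf IDs are placeholders, not real OTT IDs.
archosauria = NodeDefinition(
    node_id="reg:1",
    includes=frozenset({"chicken", "crocodile"}),
    excludes=frozenset({"lizard", "mouse"}),
    name="Archosauria",
)

print(satisfies(archosauria, {"chicken", "crocodile", "ostrich"}))  # True
print(satisfies(archosauria, {"chicken", "crocodile", "lizard"}))   # False
```

Adding a new species (the ostrich here) does not break the match, while a node that also contains an excluded ID no longer fits.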

Definitions would never be deleted from the registry, although not all definitions will be used in any given tree / taxonomy.
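
One way to picture an append-only registry is the Python sketch below. Everything in it is hypothetical: the `Registry` class, its method names, the `reg:N` ID format, and the choice to raise an error on ambiguous matches are assumptions for illustration, since the proposal leaves those details open.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Definition = Tuple[FrozenSet[str], FrozenSet[str]]  # (includes, excludes)


class NoMatchingNodeError(LookupError):
    """Raised when a registered ID has no node in the current tree."""


class Registry:
    """Append-only store of definitions; there is deliberately no delete."""

    def __init__(self) -> None:
        self._defs: Dict[str, Definition] = {}

    def register(self, includes: Iterable[str], excludes: Iterable[str]) -> str:
        node_id = f"reg:{len(self._defs) + 1}"  # hypothetical ID format
        self._defs[node_id] = (frozenset(includes), frozenset(excludes))
        return node_id

    def matching_ids(self, descendants: Set[str]) -> List[str]:
        """IDs of all definitions that a node with these descendants fits."""
        return [nid for nid, (inc, exc) in self._defs.items()
                if inc <= descendants and exc.isdisjoint(descendants)]

    def id_for(self, descendants, includes, excludes) -> Tuple[str, str]:
        """Reuse the ID of the single matching definition, else mint one."""
        hits = self.matching_ids(set(descendants))
        if len(hits) == 1:
            return hits[0], "reused"
        if not hits:
            return self.register(includes, excludes), "minted"
        raise ValueError(f"ambiguous, matches {hits}")  # policy left open

    def resolve(self, node_id: str, current_nodes: Dict[str, Set[str]]) -> List[str]:
        """Find nodes in the current tree that fit a registered definition."""
        inc, exc = self._defs[node_id]
        hits = [n for n, desc in current_nodes.items()
                if inc <= desc and exc.isdisjoint(desc)]
        if not hits:
            raise NoMatchingNodeError(f"{node_id} has no node in this version")
        return hits


reg = Registry()
print(reg.id_for({"chicken", "crocodile"}, {"chicken", "crocodile"}, {"lizard"}))
print(reg.id_for({"chicken", "crocodile", "ostrich"}, {"chicken", "ostrich"}, {"lizard"}))
```

The second call reuses the first ID because the larger clade still fits the registered definition. A definition that matches nothing in a given release simply stays in the registry unused, and `resolve` reports that to the caller as an error.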

Implementation questions

  • How many descendant and excluded nodes to include in the definitions: The definition needs some specificity but also can’t assume a complete list due to future addition of new species. Perhaps, for example, four descendants and three exclusions would be a decent compromise between one and thousands?
  • How to choose the specific nodes in the lists of descendants and exclusions: They should be ‘popular’ (occurring in as many sources as possible) and informative (if T has children T1 and T2, then at least one definition descendant should be taken from T1 and at least one from T2). Excluded nodes should be ‘near misses’ rather than arbitrarily chosen.
  • What to do when >1 node meets the definition: Add an option of adding constraints to the registered definition in order to remove the ambiguity while preserving the ID.
  • What to do when >1 definition matches a node: Ambiguous assignments can be resolved either by the addition of constraints or by the creation of new IDs.
  • Modification / versioning of definitions: If we add constraints to a definition (for example, to resolve ambiguity), does this mint a new ID or version the existing definition?
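
To make the selection question more concrete, here is one possible heuristic in Python. It is only a sketch under several assumptions: the tree is a plain dict of child lists, only leaves are used as descendants and exclusions, ‘popularity’ is a made-up count of sources per leaf, and ‘near misses’ are taken to be leaves under sibling nodes. The function names and the toy data are hypothetical.

```python
from typing import Dict, List, Tuple


def leaves_under(tree: Dict[str, List[str]], node: str) -> List[str]:
    """All leaves below a node (a node with no children is its own leaf)."""
    children = tree.get(node, [])
    if not children:
        return [node]
    return [leaf for child in children for leaf in leaves_under(tree, child)]


def choose_definition(tree, node, popularity, n_desc=4, n_excl=3) -> Tuple[List[str], List[str]]:
    def rank(leaf):
        # Most popular first; ties broken by name for determinism.
        return (-popularity.get(leaf, 0), leaf)

    # Informative: take the most popular leaf from each child first.
    per_child = [sorted(leaves_under(tree, c), key=rank) for c in tree.get(node, [])]
    includes = [leaves[0] for leaves in per_child][:n_desc]
    # Then fill any remaining slots with the most popular leftover leaves.
    rest = sorted((l for leaves in per_child for l in leaves[1:]), key=rank)
    includes += rest[: n_desc - len(includes)]

    # Near misses: popular leaves under the siblings of the node.
    parent = next((p for p, kids in tree.items() if node in kids), None)
    siblings = [k for k in tree.get(parent, []) if k != node] if parent else []
    near = sorted((l for s in siblings for l in leaves_under(tree, s)), key=rank)
    return includes, near[:n_excl]


tree = {
    "root": ["Archosauria", "Lepidosauria"],
    "Archosauria": ["Aves", "Crocodylia"],
    "Aves": ["chicken", "ostrich"],
    "Crocodylia": ["alligator", "crocodile"],
    "Lepidosauria": ["lizard", "snake"],
}
popularity = {"chicken": 9, "ostrich": 2, "alligator": 5,
              "crocodile": 4, "lizard": 7, "snake": 3}

print(choose_definition(tree, "Archosauria", popularity, n_desc=3, n_excl=1))
# (['chicken', 'alligator', 'crocodile'], ['lizard'])
```

With three descendants and one exclusion, each child of Archosauria contributes at least one descendant, and the exclusion comes from the sister group rather than from an arbitrary part of the tree.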

 

2 responses

  1. I might suggest having at least one descendant node from each edge connected to the focal node (could be more than two for polytomies). Could use the IDs of the immediate descendant nodes and perhaps, in addition, for each descendant node, the nearest node in the clade rooted at that descendant node that has a recognized taxonomic name (which could be the descendant node itself). That could make it easier to recognize on the tree which node is being talked about (“ah, it’s something below Aves and Crocs”).

    For the exclusion, do the same thing, but starting at the node below the focal node and its descendants (clearly excluding the focal node and its set of descendants from the set of descendants).

    September 17, 2015 at 4:43 pm

  2. I believe it is unnecessarily difficult to agree on identifier granularity without also developing strategies for linking. Computationally, a granular system (specify all descendants, change the ID when anything changes) is, I assume, feasible. If that appears “scary” (millions of identifiers that nobody can make sense of), maybe that is because there is a tendency to talk about granularity levels without also talking about new ways of linking identifiers, so that a human user can feel at home with the bundled-up and newly (re-)linked identifier sets. The system could handle it all, but expose things in more reduced ways to human users. So maybe a helpful next question is: given a certain solution, what (new?) methods for identifier linking might the registry use? This reminds me of this issue as well: https://github.com/OpenTreeOfLife/muriqui/issues/15

    October 4, 2015 at 5:22 am