ete compare: calculate distances and compare trees


Overview

ete-compare computes topological distances between trees using three types of measures:

Robinson-Foulds symmetric difference
Percentage of edge similarity (number of branches in one tree that are present in another)
Duplication-aware distances (TreeKO method), which provide a distance between trees containing duplicated attributes

Compared with other tree distance implementations, ETE can:

compare trees of different sizes (ignoring non-shared items)
compute distances based on any annotated feature (i.e. not limited to leaf names)
ignore branches under a custom branch support limit (i.e. compute distances discarding poorly supported branches)
compare unrooted and rooted trees

Auto parse node names

Discarding lowly supported branches
Comparing trees with duplications

Examples

Basic tree comparison

ete-compare requires one or more target trees as input (-t), and one or more reference trees to compare with (-r). Trees can be passed as newick files or as text strings.

$ ete3 compare -t Tree1.nw Tree2.nw -r RefTree.nw

Target trees can also be piped using Unix syntax. If newick strings are stored in the second column of a tab-delimited file, you could use:

$ cut -f2|ete3 compare -r RefTree.nw

Two identical trees should produce a distance of 0 and 100% similarity, regardless of the order of their edges. Items that are not shared are automatically discarded from the computations:

$ ete3 compare -t '((A,B), ((C,D),E));' '((E,(C,D)), (B,A));' '((A,B,X), (Z,((C,D),E)));'  -r '((A, B), ((C, D),E));'  

source                    | ref                       | eff.size | nRF      | RF       | maxRF    | %src_br+ | %ref_br+ | subtrees | treekoD
========================+ | ========================+ | =======+ | =======+ | =======+ | =======+ | =======+ | =======+ | =======+ | =======+
((A,B), ((C,D),E));       | ((A,B), ((C,D),E));       | 5        | 0.00     | 0.00     | 6.00     | 1.00     | 1.00     | 1        | -1
((E,(C,D)), (B,A));       | ((A,B), ((C,D),E));       | 5        | 0.00     | 0.00     | 6.00     | 1.00     | 1.00     | 1        | -1
((A,B,X), (Z,((C,D),E))); | ((A,B), ((C,D),E));       | 5        | 0.00     | 0.00     | 6.00     | 1.00     | 1.00     | 1        | -1

Parsing-friendly, tab-delimited output can be obtained using the --taboutput flag.

The reported values are:

source: target tree used
ref: reference tree used for the comparison
eff.size: effective tree size used for the comparison (after pruning items that are not shared)
nRF: normalized Robinson-Foulds distance (RF/maxRF)
RF: Robinson-Foulds symmetric distance
maxRF: maximum Robinson-Foulds value for this comparison
%src_br: frequency of edges in the target tree found in the reference (1.00 = 100% of branches are found)
%ref_br: frequency of edges in the reference tree found in the target (1.00 = 100% of branches are found)
subtrees: number of subtrees used for the comparison (applies only when duplicated items are used to decompose target trees)
treekoD: average distance from all possible subtrees in the original target tree to the reference tree (TreeKO speciation distance)
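
For small rooted trees, the reported values can be checked by hand. The following is a minimal pure-Python sketch, not ETE's implementation: trees are written as nested tuples instead of newick (an assumption made here to avoid a parser), and `leaves`, `clades` and `compare` are hypothetical helper names. It reproduces the third row of the table above, where the items X and Z are not shared and are therefore ignored.

```python
def leaves(tree):
    """Return the set of leaf names in a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def clades(tree, shared):
    """Return the non-trivial clades of `tree`, restricted to `shared` leaves."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node]) & shared
        members = frozenset().union(*(walk(child) for child in node))
        # Skip single leaves and the clade containing every shared leaf.
        if 1 < len(members) < len(shared):
            found.add(members)
        return members

    walk(tree)
    return found


def compare(src, ref):
    """Compute RF-style values for two rooted trees (illustrative only)."""
    shared = frozenset(leaves(src) & leaves(ref))
    src_clades, ref_clades = clades(src, shared), clades(ref, shared)
    rf = len(src_clades ^ ref_clades)          # clades found in only one tree
    max_rf = len(src_clades) + len(ref_clades)
    common = len(src_clades & ref_clades)
    return {
        "eff.size": len(shared),
        "RF": rf,
        "maxRF": max_rf,
        "nRF": rf / max_rf if max_rf else 0.0,
        "src_br": common / len(src_clades) if src_clades else 0.0,
        "ref_br": common / len(ref_clades) if ref_clades else 0.0,
    }


# '((A,B,X),(Z,((C,D),E)));' compared with '((A,B),((C,D),E));'
src = (("A", "B", "X"), ("Z", (("C", "D"), "E")))
ref = (("A", "B"), (("C", "D"), "E"))
result = compare(src, ref)
print(result)
```

Restricting each clade to the shared leaves has the same effect as pruning the non-shared items before comparing, which is why the effective size is 5 although the target tree has 7 tips.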

Dumping detailed data

ete-compare can dump all edges in both target and reference trees for further inspection.

$ ete3 compare -t '((A,B),((C,D),E));' '((E,(C,D)),(B,A));' '((A,B,X),(Z,((C,D),E)));'  -r '((A,B),((C,D),E));' --show_edges
src: ((A,B),((C,D),E));        A,B,C,D,E A,B C,D,E C,D
ref: ((A,B),((C,D),E));        A,B,C,D,E A,B C,D,E C,D
src: ((E,(C,D)),(B,A));        A,B,C,D,E C,D C,D,E A,B
ref: ((A,B),((C,D),E));        A,B,C,D,E A,B C,D,E C,D
src: ((A,B,X),(Z,((C,D),E)));  A,B,C,D,E A,B C,D,E C,D
ref: ((A,B),((C,D),E));        A,B,C,D,E C,D C,D,E A,B

Matches and mismatches can be shown as well:

$ ete3 compare -t '((A,B), ((C,E),D));' '((D,(C,E)), (B,A));' '((A,B,X), (Z,((C,D),E)));'  -r '((A, B), ((C, D),E));' --show_mismatches
src: ((A,B),((C,E),D));            C,E
ref: ((A,B),((C,D),E));            C,D
src: ((D,(C,E)),(B,A));            C,E
ref: ((A,B),((C,D),E));            C,D
src: ((A,B,X),(Z,((C,D),E)));
ref: ((A,B),((C,D),E));

$ ete3 compare -t '((A,B), ((C,E),D));' '((D,(C,E)), (B,A));' '((A,B,X), (Z,((C,D),E)));'  -r '((A, B), ((C, D),E));' --show_matches
src: ((A,B),((C,E),D));          A,B,C,D,E A,B C,D,E
ref: ((A,B),((C,D),E));          A,B,C,D,E A,B C,D,E
src: ((D,(C,E)),(B,A));          A,B,C,D,E A,B C,D,E
ref: ((A,B),((C,D),E));          A,B,C,D,E A,B C,D,E
src: ((A,B,X),(Z,((C,D),E)));    A,B,C,D,E C,D C,D,E A,B
ref: ((A,B),((C,D),E));          A,B,C,D,E A,B C,D,E C,D
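
The edge listings above can be thought of as set operations on the leaf sets under each internal node. The sketch below illustrates that idea in plain Python for the first target tree of the example; it is an assumption-laden illustration rather than ETE's code, with trees written as nested tuples and `clades` and `fmt` as hypothetical helpers.

```python
def clades(tree):
    """Return the leaf set under every internal node, root included."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(walk(child) for child in node))
        found.add(members)
        return members

    walk(tree)
    return found


def fmt(edges):
    """Render edges as sorted comma-separated strings for display."""
    return sorted(",".join(sorted(edge)) for edge in edges)


# '((A,B),((C,E),D));' compared with '((A,B),((C,D),E));'
src = (("A", "B"), (("C", "E"), "D"))
ref = (("A", "B"), (("C", "D"), "E"))

src_edges, ref_edges = clades(src), clades(ref)
matches = fmt(src_edges & ref_edges)       # edges present in both trees
src_only = fmt(src_edges - ref_edges)      # mismatches on the target side
ref_only = fmt(ref_edges - src_edges)      # mismatches on the reference side

print("matches:", matches)
print("src mismatches:", src_only)
print("ref mismatches:", ref_only)
```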

You can take a look at the example trees above by running:

$ ete3 view --text -t '((A,B),((C,D),E));' '((E,(C,D)),(B,A));' '((A,B,X),(Z,((C,D),E)));' '((A,B),((C,D),E));'

Comparing unrooted trees

When comparing unrooted trees, edges are interpreted as unpolarized splits rather than clades. Use the --unrooted flag to enable this type of comparison:

$ ete3 compare -t '((A,B), ((C,D),E));'  '((A, (B, ((C,D),E))));'   -r '((C, D), (E, (A, B)));' --unrooted

source                    | ref                       | eff.size | nRF      | RF       | maxRF    | %src_br+ | %ref_br+ | subtrees | treekoD
========================+ | ========================+ | =======+ | =======+ | =======+ | =======+ | =======+ | =======+ | =======+ | =======+
((A,B), ((C,D),E));       | ((C, D), (E, (A, B)));    | 5        | 0.00     | 0.00     | 4.00     | 1.00     | 1.00     | 1        | -1
((A, (B, ((C,D),E))));    | ((C, D), (E, (A, B)));    | 5        | 0.00     | 0.00     | 4.00     | 1.00     | 1.00     | 1        | -1
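
To see why the rooted and unrooted comparisons can differ, the sketch below compares the first target tree with the reference both ways. It is a simplified pure-Python illustration, not ETE's implementation: trees are nested tuples, and `clades` and `splits` are hypothetical helpers. A split is stored as an unordered pair of leaf sets, so the two sides of a branch are interchangeable.

```python
def clades(tree):
    """Return the leaf set under every internal node, root included."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(walk(child) for child in node))
        found.add(members)
        return members

    walk(tree)
    return found


def splits(tree):
    """Return the non-trivial unpolarized splits (bipartitions) of a tree."""
    all_clades = clades(tree)
    everything = max(all_clades, key=len)   # the root clade holds all leaves
    result = set()
    for clade in all_clades:
        other = everything - clade
        if len(clade) > 1 and len(other) > 1:
            # An unordered pair: {side, complement} equals {complement, side}.
            result.add(frozenset([clade, other]))
    return result


# '((A,B),((C,D),E));' compared with '((C,D),(E,(A,B)));'
src = (("A", "B"), (("C", "D"), "E"))
ref = (("C", "D"), ("E", ("A", "B")))

# The root clade is shared by both trees, so it never adds to the difference.
rooted_rf = len(clades(src) ^ clades(ref))
unrooted_rf = len(splits(src) ^ splits(ref))
unrooted_max_rf = len(splits(src)) + len(splits(ref))

print(rooted_rf, unrooted_rf, unrooted_max_rf)
```

As rooted trees the two topologies disagree on one clade each (C,D,E versus A,B,E), but those two clades describe the same split, so the unrooted distance is 0 with a maximum of 4, as in the table.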

Comparing based on custom attributes

Let's assume an annotated tree in which the tip nodes carry data other than their names.

A biological example could be a gene family tree where tips represent molecular sequences from a given species. Using the extended newick format, an annotated tree would look like this:

$ ete3 view --text -t '(((Chimp_Seq1[&&NHX:species=Chimp], Human_Seq1[&&NHX:species=Human]), Mouse_Seq1[&&NHX:species=Mouse]), Fly_Seq1[&&NHX:species=Fly]);' --attr name species

         /-Chimp_Seq1, Chimp
      /-|
   /-|   \-Human_Seq1, Human
  |  |
--|   \-Mouse_Seq1, Mouse
  |
   \-Fly_Seq1, Fly

To compare several gene family trees with a reference species tree, we will use the --src_tree_attr option.

$ ete3 compare -t '(((Chimp_Seq1[&&NHX:species=Chimp], Human_Seq1[&&NHX:species=Human]), Mouse_Seq1[&&NHX:species=Mouse]), Fly_Seq1[&&NHX:species=Fly]);' -r '(((Human, Chimp), Mouse), Fly);' --src_tree_attr species

source                    | ref                       | eff.size | nRF      | RF       | maxRF    | %src_br+ | %ref_br+ | subtrees | treekoD
========================+ | ========================+ | =======+ | =======+ | =======+ | =======+ | =======+ | =======+ | =======+ | =======+
(((Chimp_Seq1[&&NHX:(...) | (((Human, Chimp), Mo(...) | 4        | 0.00     | 0.00     | 4.00     | 1.00     | 1.00     | 1        | -1

# Note that both trees are reported as identical, even though their tip names are different

The option --ref_tree_attr is also available.

Auto extract attributes

It is quite common for features to be encoded as part of the tip names. ete-compare provides a handy way to extract such information and use it directly in the comparisons. This is done by providing a Perl regular expression through --src_attr_parser or --ref_attr_parser, making sure that the relevant part of the text is captured within parentheses.

$ ete3 compare -t '(((Chimp_Seq1, Human_Seq1), Mouse_Seq1), Fly_Seq1);'    -r '(((Human, Chimp), Mouse), Fly);' --src_attr_parser '(.+)_'

source                    | ref                       | eff.size | nRF      | RF       | maxRF    | %src_br+ | %ref_br+ | subtrees | treekoD
========================+ | ========================+ | =======+ | =======+ | =======+ | =======+ | =======+ | =======+ | =======+ | =======+
(((Chimp_Seq1, Human(...) | (((Human, Chimp), Mo(...) | 4        | 0.00     | 0.00     | 4.00     | 1.00     | 1.00     | 1        | -1

# Note that the trees are reported as identical even though tip names are different and no further annotations were available
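
The effect of the '(.+)_' expression can be previewed on the tip names before running a comparison. The sketch below uses Python's `re` module as a stand-in; how ete-compare applies the pattern internally (search versus match, and what happens to names that do not match) is an assumption here, and `parse_attr` is a hypothetical helper.

```python
import re


def parse_attr(name, pattern):
    """Return the first captured group, or the name unchanged if no match."""
    match = re.search(pattern, name)
    return match.group(1) if match else name


tips = ["Chimp_Seq1", "Human_Seq1", "Mouse_Seq1", "Fly_Seq1"]
species = [parse_attr(tip, r"(.+)_") for tip in tips]
print(species)

# '.+' is greedy, so everything up to the LAST underscore is captured.
greedy = parse_attr("Homo_sapiens_Seq1", r"(.+)_")
print(greedy)
```

Because the capture is greedy, names containing several underscores keep everything before the last one; a stricter pattern such as '([^_]+)_' would stop at the first underscore instead.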

Discarding unsupported branches

$ ete3 compare -t '((A,B)1.0,(((C,D)100,F)0.75,E)1.0);' -r '((A,B),(((C,D),E),F));'

source                    | ref                       | eff.size | nRF      | RF       | maxRF    | %src_br+ | %ref_br+ | subtrees | treekoD
========================+ | ========================+ | =======+ | =======+ | =======+ | =======+ | =======+ | =======+ | =======+ | =======+
((A,B)1.0,(((C,D)100(...) | ((A,B),(((C,D),E),F));    | 6        | 0.25     | 2.00     | 8.00     | 0.75     | 0.75     | 1        | -1

$ ete3 compare -t '((A,B)1.0,(((C,D)100,F)0.75,E)1.0);' -r '((A,B),(((C,D),E),F));'  --min_support_src 0.9

source                    | ref                       | eff.size | nRF      | RF       | maxRF    | %src_br+ | %ref_br+ | subtrees | treekoD
========================+ | ========================+ | =======+ | =======+ | =======+ | =======+ | =======+ | =======+ | =======+ | =======+
((A,B)1.0,(((C,D)100(...) | ((A,B),(((C,D),E),F));    | 6        | 0.14     | 1.00     | 7.00     | 1.00     | 0.75     | 1        | -1
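
The change between the two tables can be traced by hand. In the sketch below, each clade of the target tree is paired with its support value and clades below the threshold are dropped before counting differences. This is an illustration in plain Python under stated assumptions, not ETE's code: the clades are typed in manually, `filter_by_support` and `summary` are hypothetical helpers, and treating the threshold as inclusive is a guess that does not affect this example.

```python
def filter_by_support(clade_support, min_support):
    """Keep only clades whose support reaches the threshold."""
    return {c for c, support in clade_support.items() if support >= min_support}


def summary(src_clades, ref_clades):
    """Return (RF, maxRF, nRF, src_br, ref_br) for two sets of clades."""
    rf = len(src_clades ^ ref_clades)
    max_rf = len(src_clades) + len(ref_clades)
    common = len(src_clades & ref_clades)
    return (rf, max_rf, round(rf / max_rf, 2),
            common / len(src_clades), common / len(ref_clades))


# Non-trivial clades of '((A,B)1.0,(((C,D)100,F)0.75,E)1.0);' with supports.
src_support = {
    frozenset("AB"): 1.0,
    frozenset("CD"): 100,
    frozenset("CDF"): 0.75,
    frozenset("CDEF"): 1.0,
}
# Non-trivial clades of the reference '((A,B),(((C,D),E),F));'
ref_clades = {frozenset("AB"), frozenset("CD"),
              frozenset("CDE"), frozenset("CDEF")}

unfiltered = summary(set(src_support), ref_clades)
filtered = summary(filter_by_support(src_support, 0.9), ref_clades)
print(unfiltered)
print(filtered)
```

Dropping the (C,D,F) clade, whose support is 0.75, removes one conflicting branch from the target: RF falls from 2 to 1 and maxRF from 8 to 7, while the reference still has one branch with no counterpart.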

Comparing trees with duplicated items

Trees with duplicated items (i.e. tip names) cannot be easily compared. This is a common problem when comparing phylogenetic gene family trees (which may contain duplication events) with species trees.

The TreeKO approach splits target trees at the duplication events (any branch splitting the tree into two groups of overlapping items) and reassembles all combinations of duplication-free topologies. The resulting trees can be compared with a reference tree and their distances averaged. This is referred to as the TreeKO speciation distance.

ETE's TreeKO implementation is orders of magnitude faster than the original algorithm, which makes it practical even for large trees. The option --treeko can be used to enable the duplication-aware algorithm.

Note how in the following example both RF and TreeKO distances are 0, as the target tree is actually composed of two identical partitions split by a duplication event.

$ ete3 view --text -t '((((Chimp_Seq1, Human_Seq1), Mouse_Seq1), Fly_Seq1), (((Chimp_Seq2, Human_Seq2), Mouse_Seq2), Fly_Seq2));'  '(((Human, Chimp), Mouse), Fly);'

TARGET:
            /-Chimp_Seq1
         /-|
      /-|   \-Human_Seq1
     |  |
   /-|   \-Mouse_Seq1
  |  |
  |   \-Fly_Seq1
  |
--|         /-Chimp_Seq2
  |      /-|
  |   /-|   \-Human_Seq2
  |  |  |
   \-|   \-Mouse_Seq2
     |
      \-Fly_Seq2
REF:
         /-Human
      /-|
   /-|   \-Chimp
  |  |
--|   \-Mouse
  |
   \-Fly

$ ete3 compare -t '((((Chimp_Seq1, Human_Seq1), Mouse_Seq1), Fly_Seq1), (((Chimp_Seq2, Human_Seq2), Mouse_Seq2), Fly_Seq2));' -r '(((Human, Chimp), Mouse), Fly);' --src_attr_parser '(.+)_'  --treeko

source                    | ref                       | eff.size | nRF      | RF       | maxRF    | %src_br+ | %ref_br+ | subtrees | treekoD
========================+ | ========================+ | =======+ | =======+ | =======+ | =======+ | =======+ | =======+ | =======+ | =======+
((((Chimp_Seq1, Huma(...) | (((Human, Chimp), Mo(...) | 4.00     | 0.00     | 0.00     | 4        | 1.00     | 1.00     | 2        | 0.00

If only one side of the duplication contains differences, the RF values and similarity percentages will be averaged over all the generated duplication-free subtrees. A normalized TreeKO distance will also be provided:

$ ete3 compare -t '((((Chimp_Seq1, Mouse_Seq1), Human_Seq1), Fly_Seq1), (((Chimp_Seq2, Human_Seq2), Mouse_Seq2), Fly_Seq2));' -r '(((Human, Chimp), Mouse), Fly);' --src_attr_parser '(.+)_'  --treeko

source                    | ref                       | eff.size | nRF      | RF       | maxRF    | %src_br+ | %ref_br+ | subtrees | treekoD
========================+ | ========================+ | =======+ | =======+ | =======+ | =======+ | =======+ | =======+ | =======+ | =======+
((((Chimp_Seq1, Mous(...) | (((Human, Chimp), Mo(...) | 4.00     | 0.25     | 1.00     | 4        | 0.75     | 0.75     | 2        | 0.25

# Note that Mouse_Seq1 and Human_Seq1 were swapped in this example!
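
The averaging in the last table can be followed with a small sketch. It is a simplified reading of the idea described above (split at nodes whose children share species, combine the alternatives elsewhere, average the distances), written in plain Python; it is neither ETE's implementation nor the exact published algorithm. Trees are nested tuples, species are taken as the text before the first underscore, and all function names are hypothetical.

```python
from itertools import product


def species(leaf_name):
    """Species code of a tip, e.g. 'Chimp_Seq1' -> 'Chimp' (assumed naming)."""
    return leaf_name.split("_")[0]


def species_under(node):
    if isinstance(node, str):
        return {species(node)}
    return set().union(*(species_under(child) for child in node))


def duplication_free(node):
    """Return every duplication-free topology contained in `node`."""
    if isinstance(node, str):
        return [species(node)]
    child_species = [species_under(child) for child in node]
    child_options = [duplication_free(child) for child in node]
    overlapping = any(
        a & b for i, a in enumerate(child_species) for b in child_species[i + 1:]
    )
    if overlapping:
        # Duplication: each child lineage is kept as a separate alternative.
        return [tree for options in child_options for tree in options]
    # Speciation: combine one alternative from every child.
    return [tuple(combo) for combo in product(*child_options)]


def clades(tree):
    """Leaf sets under internal nodes, excluding the root clade."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(walk(child) for child in node))
        found.add(members)
        return members

    found.discard(walk(tree))
    return found


def nrf(src, ref):
    a, b = clades(src), clades(ref)
    max_rf = len(a) + len(b)
    return len(a ^ b) / max_rf if max_rf else 0.0


target = (
    ((("Chimp_Seq1", "Mouse_Seq1"), "Human_Seq1"), "Fly_Seq1"),
    ((("Chimp_Seq2", "Human_Seq2"), "Mouse_Seq2"), "Fly_Seq2"),
)
ref = ((("Human", "Chimp"), "Mouse"), "Fly")

subtrees = duplication_free(target)
distances = [nrf(tree, ref) for tree in subtrees]
treeko = sum(distances) / len(distances)
print(len(subtrees), distances, treeko)
```

The root is the only duplication, so two subtrees are produced. The first differs from the reference in one clade on each side (normalized distance 0.5), the second is identical (0.0), and their average is the 0.25 reported in the treekoD column.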

Obviously, real life situations will involve much more complex scenarios and duplication patterns. Further information on how TreeKO speciation distances are calculated can be found in Marcet and Gabaldon (2011). Do not forget to cite TreeKO!