ete-compare
ete-compare computes topological distances between trees using three types of measures:
Compared with other tree distance implementations, ETE allows you to:
ete-compare requires one or more target trees as input (-t), and one or more reference trees to compare with (-r). Trees can be passed as Newick files or as text strings.
$ ete3 compare -t Tree1.nw Tree2.nw -r RefTree.nw
Target trees can also be piped using Unix syntax. If Newick strings are stored in the second column of a tab-delimited file, you could use:
$ cut -f2|ete3 compare -r RefTree.nw
Two identical trees should produce a distance of 0 and 100% similarity, regardless of the order of their edges. Items that are not shared between the trees are automatically discarded from the computations:
$ ete3 compare -t '((A,B), ((C,D),E));' '((E,(C,D)), (B,A));' '((A,B,X), (Z,((C,D),E)));' -r '((A, B), ((C, D),E));'
source                    | ref                       | eff.size | nRF  | RF   | maxRF | %src_br+ | %ref_br+ | subtrees | treekoD
==========================+===========================+==========+======+======+=======+==========+==========+==========+========
((A,B), ((C,D),E));       | ((A,B), ((C,D),E));       | 5        | 0.00 | 0.00 | 6.00  | 1.00     | 1.00     | 1        | -1
((E,(C,D)), (B,A));       | ((A,B), ((C,D),E));       | 5        | 0.00 | 0.00 | 6.00  | 1.00     | 1.00     | 1        | -1
((A,B,X), (Z,((C,D),E))); | ((A,B), ((C,D),E));       | 5        | 0.00 | 0.00 | 6.00  | 1.00     | 1.00     | 1        | -1
Parsing-friendly, tab-delimited output can be obtained using the --taboutput flag.
The reported values are:
source
    Target tree used
ref
    Reference tree used for the comparison
eff.size
    Effective tree size used for comparisons (after pruning items that are not shared)
nRF
    Normalized Robinson-Foulds distance (RF/maxRF)
RF
    Robinson-Foulds symmetric distance
maxRF
    Maximum Robinson-Foulds value for this comparison
%src_br
    Frequency of edges in the target tree found in the reference (1.00 = 100% of branches are found)
%ref_br
    Frequency of edges in the reference tree found in the target (1.00 = 100% of branches are found)
subtrees
    Number of subtrees used for the comparison (applies only when duplicated items are used to decompose target trees)
treekoD
    Average distance from all possible subtrees in the original target trees to the reference tree (TreeKO speciation distance)

ete-compare can also dump all edges in both target and reference trees for further inspection:
$ ete3 compare -t '((A,B),((C,D),E));' '((E,(C,D)),(B,A));' '((A,B,X),(Z,((C,D),E)));' -r '((A,B),((C,D),E));' --show_edges
src: ((A,B),((C,D),E));        A,B,C,D,E  A,B    C,D,E  C,D
ref: ((A,B),((C,D),E));        A,B,C,D,E  A,B    C,D,E  C,D

src: ((E,(C,D)),(B,A));        A,B,C,D,E  C,D    C,D,E  A,B
ref: ((A,B),((C,D),E));        A,B,C,D,E  A,B    C,D,E  C,D

src: ((A,B,X),(Z,((C,D),E)));  A,B,C,D,E  A,B    C,D,E  C,D
ref: ((A,B),((C,D),E));        A,B,C,D,E  C,D    C,D,E  A,B
as well as matches and mismatches:
$ ete3 compare -t '((A,B), ((C,E),D));' '((D,(C,E)), (B,A));' '((A,B,X), (Z,((C,D),E)));' -r '((A, B), ((C, D),E));' --show_mismatches
src: ((A,B),((C,E),D));        C,E
ref: ((A,B),((C,D),E));        C,D

src: ((D,(C,E)),(B,A));        C,E
ref: ((A,B),((C,D),E));        C,D

src: ((A,B,X),(Z,((C,D),E)));
ref: ((A,B),((C,D),E));

$ ete3 compare -t '((A,B), ((C,E),D));' '((D,(C,E)), (B,A));' '((A,B,X), (Z,((C,D),E)));' -r '((A, B), ((C, D),E));' --show_matches
src: ((A,B),((C,E),D));        A,B,C,D,E  A,B    C,D,E
ref: ((A,B),((C,D),E));        A,B,C,D,E  A,B    C,D,E

src: ((D,(C,E)),(B,A));        A,B,C,D,E  A,B    C,D,E
ref: ((A,B),((C,D),E));        A,B,C,D,E  A,B    C,D,E

src: ((A,B,X),(Z,((C,D),E)));  A,B,C,D,E  C,D    C,D,E  A,B
ref: ((A,B),((C,D),E));        A,B,C,D,E  A,B    C,D,E  C,D
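To make the relationship between edges, mismatches and the reported distances concrete, here is a minimal Python sketch for rooted trees. It is not ETE's implementation: trees are written as nested tuples (an assumption made for this example), both trees are assumed to share the same tips, and the root edge is left out of the counts, which reproduces the maxRF of 6 reported above for five-tip trees. The function names are hypothetical.

```python
def leaves(tree):
    """Return the set of tip names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree, is_root=True):
    """Collect the tip set of every internal node except the root."""
    if isinstance(tree, str):
        return set()
    found = set() if is_root else {leaves(tree)}
    for child in tree:
        found |= clades(child, is_root=False)
    return found


def compare(src, ref):
    """Return RF, maxRF, nRF, edge-similarity fractions and mismatches."""
    src_edges, ref_edges = clades(src), clades(ref)
    shared = src_edges & ref_edges
    rf = len(src_edges ^ ref_edges)            # edges found in only one tree
    max_rf = len(src_edges) + len(ref_edges)   # value if no edge were shared
    return {
        "RF": rf,
        "maxRF": max_rf,
        "nRF": rf / max_rf,
        "src_br": len(shared) / len(src_edges),
        "ref_br": len(shared) / len(ref_edges),
        "mismatches": (src_edges - ref_edges, ref_edges - src_edges),
    }


# First target tree and reference tree from the --show_mismatches example.
src = (("A", "B"), (("C", "E"), "D"))
ref = (("A", "B"), (("C", "D"), "E"))
result = compare(src, ref)
print(result["RF"], result["maxRF"], round(result["nRF"], 2))  # 2 6 0.33
```

The only mismatching edges are C,E in the target and C,D in the reference, as in the --show_mismatches output.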
You can view the example trees above by running:
$ ete3 view --text -t '((A,B),((C,D),E));' '((E,(C,D)),(B,A));' '((A,B,X),(Z,((C,D),E)));' '((A,B),((C,D),E));'
When comparing unrooted trees, edges are inferred as unpolarized splits rather than clades. Use the --unrooted flag to enable this type of comparison:
$ ete3 compare -t '((A,B), ((C,D),E));' '((A, (B, ((C,D),E))));' -r '((C, D), (E, (A, B)));' --unrooted
source                    | ref                       | eff.size | nRF  | RF   | maxRF | %src_br+ | %ref_br+ | subtrees | treekoD
==========================+===========================+==========+======+======+=======+==========+==========+==========+========
((A,B), ((C,D),E));       | ((C, D), (E, (A, B)));    | 5        | 0.00 | 0.00 | 4.00  | 1.00     | 1.00     | 1        | -1
((A, (B, ((C,D),E))));    | ((C, D), (E, (A, B)));    | 5        | 0.00 | 0.00 | 4.00  | 1.00     | 1.00     | 1        | -1
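The difference between clades and unpolarized splits can be illustrated with a short Python sketch. It is a simplified model, not ETE's code: trees are nested tuples, and each split is stored as the side that does not contain an arbitrary anchor tip, so that a clade and its complement map to the same split. All function names are hypothetical.

```python
def leaves(tree):
    """Return the set of tip names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def rooted_clades(tree, is_root=True):
    """Collect the tip set of every internal node except the root."""
    if isinstance(tree, str):
        return set()
    found = set() if is_root else {leaves(tree)}
    for child in tree:
        found |= rooted_clades(child, is_root=False)
    return found


def unrooted_splits(tree):
    """Turn clades into splits that do not depend on the root position."""
    all_tips = leaves(tree)
    anchor = min(all_tips)  # arbitrary tip used to pick one side of a split
    splits = set()
    for clade in rooted_clades(tree):
        side = all_tips - clade if anchor in clade else clade
        # Keep only informative splits: at least two tips on each side.
        if len(side) >= 2 and len(all_tips - side) >= 2:
            splits.add(side)
    return splits


# First target tree and reference tree from the --unrooted example.
src = (("A", "B"), (("C", "D"), "E"))
ref = (("C", "D"), ("E", ("A", "B")))

rooted_rf = len(rooted_clades(src) ^ rooted_clades(ref))
unrooted_rf = len(unrooted_splits(src) ^ unrooted_splits(ref))
unrooted_max = len(unrooted_splits(src)) + len(unrooted_splits(ref))
print(rooted_rf, unrooted_rf, unrooted_max)  # 2 0 4
```

Read as rooted trees, the two topologies differ (C,D,E versus A,B,E), but as unrooted trees they contain the same two splits, which agrees with the RF of 0 and maxRF of 4 in the table.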
Let's assume an annotated tree in which tip nodes carry data other than their names.
A biological example could be a gene family tree where tips represent molecular sequences from a given species. Using the extended Newick format (NHX), an annotated tree would look like this:
$ ete3 view --text -t '(((Chimp_Seq1[&&NHX:species=Chimp], Human_Seq1[&&NHX:species=Human]), Mouse_Seq1[&&NHX:species=Mouse]), Fly_Seq1[&&NHX:species=Fly]);' --attr name species
         /-Chimp_Seq1, Chimp
      /-|
   /-|   \-Human_Seq1, Human
  |  |
--|   \-Mouse_Seq1, Mouse
  |
   \-Fly_Seq1, Fly
To compare several gene family trees with a reference species tree, we will use the --src_tree_attr option.
$ ete3 compare -t '(((Chimp_Seq1[&&NHX:species=Chimp], Human_Seq1[&&NHX:species=Human]), Mouse_Seq1[&&NHX:species=Mouse]), Fly_Seq1[&&NHX:species=Fly]);' -r '(((Human, Chimp), Mouse), Fly);' --src_tree_attr species
source                    | ref                       | eff.size | nRF  | RF   | maxRF | %src_br+ | %ref_br+ | subtrees | treekoD
==========================+===========================+==========+======+======+=======+==========+==========+==========+========
(((ChimpSeq1[&&NHX:s(...) | (((Human, Chimp), Mo(...) | 4        | 0.00 | 0.00 | 4.00  | 1.00     | 1.00     | 1        | -1
# Note that both trees are reported as identical, even though their tip names differ
The option --ref_tree_attr is also available.
It is quite common for features to be encoded as part of the tip names. ete-compare provides a handy way to extract such information and use it directly in the comparisons. This is done by providing a Perl regular expression with --src_attr_parser or --ref_attr_parser, and ensuring that the relevant part of the text string is captured within the parentheses.
$ ete3 compare -t '(((Chimp_Seq1, Human_Seq1), Mouse_Seq1), Fly_Seq1);' -r '(((Human, Chimp), Mouse), Fly);' --src_attr_parser '(.+)_'
source                    | ref                       | eff.size | nRF  | RF   | maxRF | %src_br+ | %ref_br+ | subtrees | treekoD
==========================+===========================+==========+======+======+=======+==========+==========+==========+========
(((Chimp_Seq1, Human(...) | (((Human, Chimp), Mo(...) | 4        | 0.00 | 0.00 | 4.00  | 1.00     | 1.00     | 1        | -1
# Note that the trees are reported as identical even though tip names differ and no further annotations were available
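The effect of the '(.+)_' pattern on the tip names can be checked with Python's re module. This is only an illustration of what the capture group extracts. How ete-compare applies the expression internally is not shown here, and the helper name and the fallback to the full tip name when nothing matches are assumptions made for this sketch.

```python
import re


def parse_attr(tip_name, pattern):
    """Return the text captured by the first group, or the name itself."""
    match = re.search(pattern, tip_name)
    # Assumption for this sketch: unmatched names are left unchanged.
    return match.group(1) if match else tip_name


tip_names = ["Chimp_Seq1", "Human_Seq1", "Mouse_Seq1", "Fly_Seq1"]
species = [parse_attr(name, r"(.+)_") for name in tip_names]
print(species)  # ['Chimp', 'Human', 'Mouse', 'Fly']
```

Because (.+) is greedy, a name with several underscores would be captured up to the last one, so a stricter pattern may be needed for such names.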
$ ete3 compare -t '((A,B)1.0,(((C,D)100,F)0.75,E)1.0);' -r '((A,B),(((C,D),E),F));'
source                    | ref                       | eff.size | nRF  | RF   | maxRF | %src_br+ | %ref_br+ | subtrees | treekoD
==========================+===========================+==========+======+======+=======+==========+==========+==========+========
((A,B)1.0,(((C,D)100(...) | ((A,B),(((C,D),E),F));    | 6        | 0.25 | 2.00 | 8.00  | 0.75     | 0.75     | 1        | -1

$ ete3 compare -t '((A,B)1.0,(((C,D)100,F)0.75,E)1.0);' -r '((A,B),(((C,D),E),F));' --min_support_src 0.9
source                    | ref                       | eff.size | nRF  | RF   | maxRF | %src_br+ | %ref_br+ | subtrees | treekoD
==========================+===========================+==========+======+======+=======+==========+==========+==========+========
((A,B)1.0,(((C,D)100(...) | ((A,B),(((C,D),E),F));    | 6        | 0.14 | 1.00 | 7.00  | 1.00     | 0.75     | 1        | -1
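One reading that is consistent with the two outputs above is that target branches whose support falls below the --min_support_src threshold are dropped before the comparison. The Python sketch below follows that assumption. It is not ETE's code: the edges and their support values are entered by hand from the example trees (root edge excluded), and the function name is hypothetical.

```python
def compare_with_min_support(src_support, ref_edges, min_support=None):
    """Compare edge sets after dropping poorly supported target edges."""
    src_edges = {
        clade
        for clade, support in src_support.items()
        if min_support is None or support >= min_support
    }
    shared = src_edges & ref_edges
    rf = len(src_edges ^ ref_edges)
    max_rf = len(src_edges) + len(ref_edges)
    return rf, max_rf, len(shared) / len(src_edges), len(shared) / len(ref_edges)


# Target edges with the support values written in the Newick string.
src_support = {
    frozenset("AB"): 1.0,
    frozenset("CD"): 100,
    frozenset("CDF"): 0.75,
    frozenset("CDEF"): 1.0,
}
# Reference edges of ((A,B),(((C,D),E),F)).
ref_edges = {frozenset("AB"), frozenset("CD"), frozenset("CDE"), frozenset("CDEF")}

plain = compare_with_min_support(src_support, ref_edges)
filtered = compare_with_min_support(src_support, ref_edges, min_support=0.9)
print(plain)     # (2, 8, 0.75, 0.75)
print(filtered)  # (1, 7, 1.0, 0.75)
```

With the threshold at 0.9, the C,D,F edge (support 0.75) no longer counts as a mismatch, so RF drops from 2 to 1 and maxRF from 8 to 7, matching the figures in the second table.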
Trees with duplicated items (i.e. tip names) cannot be easily compared. This is a common problem when comparing phylogenetic gene family trees (which may contain duplication events) with species trees.
The TreeKO approach splits target trees at duplication events (any branch splitting the tree into two groups of overlapping items) and reassembles all combinations of duplication-free topologies. The resulting trees can be compared with a reference tree and their distances averaged. This is referred to as the TreeKO speciation distance.
ETE's TreeKO implementation is orders of magnitude faster than the original algorithm, which makes it practical to run even on large trees. The --treeko option enables the duplication-aware algorithm.
Note how in the following example both RF and TreeKO distances are 0, as the target tree is actually composed of two identical partitions split by a duplication event.
$ ete3 view --text -t '((((Chimp_Seq1, Human_Seq1), Mouse_Seq1), Fly_Seq1), (((Chimp_Seq2, Human_Seq2), Mouse_Seq2), Fly_Seq2));' '(((Human, Chimp), Mouse), Fly);'
TARGET:
            /-Chimp_Seq1
         /-|
      /-|   \-Human_Seq1
     |  |
   /-|   \-Mouse_Seq1
  |  |
  |   \-Fly_Seq1
  |
--|         /-Chimp_Seq2
  |      /-|
  |   /-|   \-Human_Seq2
  |  |  |
   \-|   \-Mouse_Seq2
     |
      \-Fly_Seq2
REF:
         /-Human
      /-|
   /-|   \-Chimp
  |  |
--|   \-Mouse
  |
   \-Fly

$ ete3 compare -t '((((Chimp_Seq1, Human_Seq1), Mouse_Seq1), Fly_Seq1), (((Chimp_Seq2, Human_Seq2), Mouse_Seq2), Fly_Seq2));' -r '(((Human, Chimp), Mouse), Fly);' --src_attr_parser '(.+)_' --treeko
source                    | ref                       | eff.size | nRF  | RF   | maxRF | %src_br+ | %ref_br+ | subtrees | treekoD
==========================+===========================+==========+======+======+=======+==========+==========+==========+========
((((Chimp_Seq1, Huma(...) | (((Human, Chimp), Mo(...) | 4.00     | 0.00 | 0.00 | 4     | 1.00     | 1.00     | 2        | 0.00
If only one side of the duplication contains differences, the RF distance and the similarity percentages are averaged over all the generated duplication-free subtrees. A normalized TreeKO distance is also reported:
$ ete3 compare -t '((((Chimp_Seq1, Mouse_Seq1), Human_Seq1), Fly_Seq1), (((Chimp_Seq2, Human_Seq2), Mouse_Seq2), Fly_Seq2));' -r '(((Human, Chimp), Mouse), Fly);' --src_attr_parser '(.+)_' --treeko
source                    | ref                       | eff.size | nRF  | RF   | maxRF | %src_br+ | %ref_br+ | subtrees | treekoD
==========================+===========================+==========+======+======+=======+==========+==========+==========+========
((((Chimp_Seq1, Mous(...) | (((Human, Chimp), Mo(...) | 4.00     | 0.25 | 1.00 | 4     | 0.75     | 0.75     | 2        | 0.25
# Note that Mouse_Seq1 and Human_Seq1 were swapped in this example!
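The averaging in this example can be reproduced with a simplified Python sketch of the decomposition. It is an illustration under stated assumptions, not the TreeKO algorithm as published or as implemented in ETE: trees are nested tuples, a node counts as a duplication when its children share at least one species, and every resulting subtree is assumed to contain the same species as the reference. All function names are hypothetical.

```python
import re
from itertools import product


def rename(tree, pattern=r"(.+)_"):
    """Replace each tip name by the species captured by the pattern."""
    if isinstance(tree, str):
        return re.search(pattern, tree).group(1)
    return tuple(rename(child, pattern) for child in tree)


def tips(tree):
    """Return the set of tip names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(tips(child) for child in tree))


def speciation_trees(tree):
    """List the duplication-free subtrees of a species-labelled tree."""
    if isinstance(tree, str):
        return [tree]
    child_sets = [tips(child) for child in tree]
    overlapping = sum(len(s) for s in child_sets) > len(frozenset().union(*child_sets))
    options = [speciation_trees(child) for child in tree]
    if overlapping:
        # Duplication node: each child lineage yields its own subtrees.
        return [subtree for group in options for subtree in group]
    # Speciation node: combine every choice of subtree from each child.
    return [tuple(combo) for combo in product(*options)]


def clades(tree, is_root=True):
    """Collect the tip set of every internal node except the root."""
    if isinstance(tree, str):
        return set()
    found = set() if is_root else {tips(tree)}
    for child in tree:
        found |= clades(child, is_root=False)
    return found


def normalized_rf(src, ref):
    """Normalized Robinson-Foulds distance between two rooted trees."""
    a, b = clades(src), clades(ref)
    return len(a ^ b) / (len(a) + len(b))


target = (
    ((("Chimp_Seq1", "Mouse_Seq1"), "Human_Seq1"), "Fly_Seq1"),
    ((("Chimp_Seq2", "Human_Seq2"), "Mouse_Seq2"), "Fly_Seq2"),
)
ref = ((("Human", "Chimp"), "Mouse"), "Fly")

subtrees = speciation_trees(rename(target))
distances = [normalized_rf(subtree, ref) for subtree in subtrees]
treeko_distance = sum(distances) / len(distances)
print(len(subtrees), distances, treeko_distance)  # 2 [0.5, 0.0] 0.25
```

The root is the only duplication, so the target splits into two subtrees. One matches the reference and the other differs in one edge on each side, giving the averaged distance of 0.25 shown in the table.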
Obviously, real life situations will involve much more complex scenarios and duplication patterns. Further information on how TreeKO speciation distances are calculated can be found in Marcet and Gabaldon (2011). Do not forget to cite TreeKO!