New in version 2.3.

Dealing with the NCBI Taxonomy database

ETE’s ncbi_taxonomy module provides utilities to efficiently query a local copy of the NCBI Taxonomy database. The class NCBITaxa offers methods to convert taxids to names (and vice versa), to fetch pruned topologies connecting a given set of species, and to retrieve rank, name and lineage track information.

It is also fully integrated with PhyloTree instances through the PhyloNode.annotate_ncbi_taxa() method.

Setting up a local copy of the NCBI taxonomy database

The first time you attempt to use NCBITaxa, ETE will detect that your local database is empty, download the latest NCBI taxonomy database (~300MB), and store a parsed version of it in your home directory: ~/.etetoolkit/taxa.sqlite. Later uses of NCBITaxa will detect the local database and skip this step.

from ete2 import NCBITaxa
ncbi = NCBITaxa()
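
To check whether the first-time download has already happened, you can look for the database file directly. The following is a minimal sketch using only the standard library. The helper names are hypothetical (they are not part of ETE), and the path is simply the default location mentioned above:

```python
import os

# Default location described above; adjust it if your setup differs.
DEFAULT_DB = os.path.join("~", ".etetoolkit", "taxa.sqlite")

def local_taxonomy_db_path(path=DEFAULT_DB):
    """Expand '~' and return the absolute path of the local database file."""
    return os.path.abspath(os.path.expanduser(path))

def has_local_taxonomy_db(path=DEFAULT_DB):
    """Return True if a non-empty database file exists at the given path."""
    full_path = local_taxonomy_db_path(path)
    return os.path.isfile(full_path) and os.path.getsize(full_path) > 0
```

If has_local_taxonomy_db() returns False, expect the next NCBITaxa() call to trigger the download.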

Upgrading the local database

Use the NCBITaxa.update_taxonomy_database() method to download and parse the latest database from the NCBI ftp site. Your current local database will be overwritten.

from ete2 import NCBITaxa
ncbi = NCBITaxa()
ncbi.update_taxonomy_database()

Getting taxid information

You can fetch species names, ranks and lineage track information for your taxids using the following methods:

The so-called get-translator functions return a dictionary converting between taxids and species names. Either species or lineage names/taxids are accepted as input.

from ete2 import NCBITaxa
ncbi = NCBITaxa()
taxid2name = ncbi.get_taxid_translator([9606, 9443])
print taxid2name
# {9443: u'Primates', 9606: u'Homo sapiens'}

name2taxid = ncbi.get_name_translator(['Homo sapiens', 'primates'])
print name2taxid
# {'Homo sapiens': [9606], 'primates': [9443]}

# when the same name points to several taxa, all taxids are returned
name2taxid = ncbi.get_name_translator(['Bacteria'])
print name2taxid
# {'Bacteria': [2, 629395]}
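
Because a name can map to more than one taxid, it is often useful to separate unambiguous names from ambiguous ones before going further. A small sketch, assuming the dictionary shape shown above (name mapped to a list of taxids); split_by_ambiguity is a hypothetical helper, not an ETE function:

```python
def split_by_ambiguity(name2taxid):
    """Separate names with exactly one taxid from names with several."""
    unique, ambiguous = {}, {}
    for name, taxids in name2taxid.items():
        if len(taxids) == 1:
            unique[name] = taxids[0]
        else:
            ambiguous[name] = list(taxids)
    return unique, ambiguous

# Sample input copied from the outputs shown above.
name2taxid = {'Homo sapiens': [9606], 'primates': [9443], 'Bacteria': [2, 629395]}
unique, ambiguous = split_by_ambiguity(name2taxid)
# unique    -> {'Homo sapiens': 9606, 'primates': 9443}
# ambiguous -> {'Bacteria': [2, 629395]}
```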

Other functions allow you to extract further information using taxid numbers as a query.

from ete2 import NCBITaxa
ncbi = NCBITaxa()

print ncbi.get_rank([9606, 9443])
# {9443: u'order', 9606: u'species'}

print ncbi.get_lineage(9606)

# [1, 131567, 2759, 33154, 33208, 6072, 33213, 33511, 7711, 89593, 7742,
# 7776, 117570, 117571, 8287, 1338369, 32523, 32524, 40674, 32525, 9347,
# 1437010, 314146, 9443, 376913, 314293, 9526, 314295, 9604, 207598, 9605,
# 9606]
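
Since a lineage is an ordered list running from the root down to the query taxid, testing whether a taxon belongs to a clade reduces to a membership check. A minimal sketch, using the lineage printed above as sample data; belongs_to is a hypothetical helper name:

```python
def belongs_to(lineage, ancestor_taxid):
    """Return True if ancestor_taxid appears in the root-to-taxon lineage."""
    return ancestor_taxid in lineage

# Lineage of Homo sapiens (9606), copied from the output above.
human_lineage = [
    1, 131567, 2759, 33154, 33208, 6072, 33213, 33511, 7711, 89593, 7742,
    7776, 117570, 117571, 8287, 1338369, 32523, 32524, 40674, 32525, 9347,
    1437010, 314146, 9443, 376913, 314293, 9526, 314295, 9604, 207598, 9605,
    9606,
]

is_primate = belongs_to(human_lineage, 9443)   # 9443 is Primates
is_bacterium = belongs_to(human_lineage, 2)    # 2 is Bacteria
# is_primate -> True, is_bacterium -> False
```

Note that the lineage includes the query taxid itself, so belongs_to(human_lineage, 9606) is also True.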

And you can combine them all at once:

from ete2 import NCBITaxa
ncbi = NCBITaxa()

lineage = ncbi.get_lineage(9606)
print lineage

# [1, 131567, 2759, 33154, 33208, 6072, 33213, 33511, 7711, 89593, 7742,
# 7776, 117570, 117571, 8287, 1338369, 32523, 32524, 40674, 32525, 9347,
# 1437010, 314146, 9443, 376913, 314293, 9526, 314295, 9604, 207598, 9605,
# 9606]

names = ncbi.get_taxid_translator(lineage)
print [names[taxid] for taxid in lineage]

# [u'root', u'cellular organisms', u'Eukaryota', u'Opisthokonta', u'Metazoa',
# u'Eumetazoa', u'Bilateria', u'Deuterostomia', u'Chordata', u'Craniata',
# u'Vertebrata', u'Gnathostomata', u'Teleostomi', u'Euteleostomi',
# u'Sarcopterygii', u'Dipnotetrapodomorpha', u'Tetrapoda', u'Amniota',
# u'Mammalia', u'Theria', u'Eutheria', u'Boreoeutheria', u'Euarchontoglires',
# u'Primates', u'Haplorrhini', u'Simiiformes', u'Catarrhini', u'Hominoidea',
# u'Hominidae', u'Homininae', u'Homo', u'Homo sapiens']
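
The same combination works with ranks: given a lineage plus the dictionaries returned by get_taxid_translator() and get_rank(), you can keep only the ranked levels. The sketch below runs on plain dictionaries, so it does not need the database. ranked_names is a hypothetical helper, and the sample values are an excerpt of taxids, names and ranks that appear in the outputs in this document:

```python
def ranked_names(lineage, taxid2name, taxid2rank, skip=('no rank',)):
    """List (rank, name) pairs along a lineage, skipping unranked or unknown taxids."""
    pairs = []
    for taxid in lineage:
        rank = taxid2rank.get(taxid)
        if rank is None or rank in skip:
            continue
        pairs.append((rank, taxid2name[taxid]))
    return pairs

# Excerpt of the human lineage (order preserved, intermediate levels omitted).
lineage = [32524, 314146, 9443, 207598, 9606]
taxid2name = {32524: 'Amniota', 314146: 'Euarchontoglires', 9443: 'Primates',
              207598: 'Homininae', 9606: 'Homo sapiens'}
taxid2rank = {32524: 'no rank', 314146: 'superorder', 9443: 'order',
              207598: 'subfamily', 9606: 'species'}

ranked = ranked_names(lineage, taxid2name, taxid2rank)
# ranked -> [('superorder', 'Euarchontoglires'), ('order', 'Primates'),
#            ('subfamily', 'Homininae'), ('species', 'Homo sapiens')]
```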

Getting descendant taxa

Given a taxid or a taxon name from an internal node in the NCBI taxonomy tree, its descendants can be retrieved as follows:

from ete2 import NCBITaxa
ncbi = NCBITaxa()

descendants = ncbi.get_descendant_taxa('Homo')
print ncbi.translate_to_names(descendants)

# [u'Homo heidelbergensis', u'Homo sapiens ssp. Denisova', u'Homo sapiens neanderthalensis']

# you can easily ignore subspecies, so only taxa labeled as "species" will be reported:
descendants = ncbi.get_descendant_taxa('Homo', collapse_subspecies=True)
print ncbi.translate_to_names(descendants)

# [u'Homo sapiens', u'Homo heidelbergensis']

# or even returned as an annotated tree
tree = ncbi.get_descendant_taxa('Homo', collapse_subspecies=True, return_tree=True)
print tree.get_ascii(attributes=['sci_name', 'taxid'])

#           /-Homo sapiens, 9606
# -Homo, 9605
#           \-Homo heidelbergensis, 1425170
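
Conceptually, retrieving descendants means walking down from an internal node until the leaves are reached. The following toy sketch illustrates that idea on a plain parent-to-children dictionary. It is not how ETE implements get_descendant_taxa(); the helper name and the tiny mapping (built only from taxids shown in this document) are assumptions for illustration:

```python
def leaf_descendants(children, taxid):
    """Collect the leaves below taxid in a {parent: [children]} mapping."""
    kids = children.get(taxid, [])
    if not kids:
        # A node without recorded children is treated as a leaf.
        return [taxid]
    leaves = []
    for kid in kids:
        leaves.extend(leaf_descendants(children, kid))
    return leaves

# Toy slice of the taxonomy: Hominidae (9604) > Homininae (207598) > Homo (9605)
# > Homo sapiens (9606) and Homo heidelbergensis (1425170).
# The real tree has many more branches under each of these nodes.
toy_children = {9604: [207598], 207598: [9605], 9605: [9606, 1425170]}

under_homo = leaf_descendants(toy_children, 9605)
under_hominidae = leaf_descendants(toy_children, 9604)
# both -> [9606, 1425170] in this toy mapping
```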

Getting NCBI species tree topology

Getting the NCBI taxonomy tree for a given set of species is one of the most useful ways to get all information at once. The method NCBITaxa.get_topology() allows you to query your local NCBI database and extract the smallest tree that connects all your query taxids. It returns a normal ETE tree in which all nodes, internal or leaves, are annotated for lineage, scientific names, ranks, and so on.

from ete2 import NCBITaxa
ncbi = NCBITaxa()

tree = ncbi.get_topology([9606, 9598, 10090, 7707, 8782])
print tree.get_ascii(attributes=["sci_name", "rank"])

#                     /-Dendrochirotida, order
#                    |
#                    |                                                                /-Pan troglodytes, species
# -Deuterostomia, no rank                                           /Homininae, subfamily
#                    |                /Euarchontoglires, superorder                   \-Homo sapiens, species
#                    |               |                           |
#                     \Amniota, no rank                           \-Mus musculus, species
#                                    |
#                                     \-Aves, class

If needed, all intermediate nodes connecting the species can also be kept in the tree:

from ete2 import NCBITaxa
ncbi = NCBITaxa()

tree = ncbi.get_topology([2, 33208], intermediate_nodes=True)
print tree.get_ascii(attributes=["sci_name"])

#                  /Eukaryota - Opisthokonta - Metazoa
# -cellular organisms
#                  \-Bacteria
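
To make "the smallest tree that connects all your query taxids" concrete, here is a sketch that derives such a tree from lineages alone: find the last taxid shared by every lineage, then either keep or collapse the single-child nodes below it. This is an illustration under stated assumptions, not ETE's implementation. The function names are hypothetical, and the two lineages are taken from the outputs above (the Bacteria lineage is the one implied by the printed tree):

```python
def common_ancestor(lineages):
    """Return the last taxid shared, position by position, by every lineage."""
    shared = None
    for column in zip(*lineages):
        if len(set(column)) != 1:
            break
        shared = column[0]
    return shared

def connecting_tree(lineages, intermediate_nodes=False):
    """Return (root, {parent: [children]}) joining the last taxid of each lineage."""
    root = common_ancestor(lineages)
    tips = set(lineage[-1] for lineage in lineages)

    # Record every parent -> child step from the common ancestor down to each tip.
    full = {}
    for lineage in lineages:
        path = lineage[lineage.index(root):]
        for parent, child in zip(path, path[1:]):
            if child not in full.setdefault(parent, []):
                full[parent].append(child)
    if intermediate_nodes:
        return root, full

    def skip_unary(node):
        # Move down through single-child nodes that are not query taxa.
        while node not in tips and len(full.get(node, [])) == 1:
            node = full[node][0]
        return node

    collapsed, pending = {}, [root]
    while pending:
        node = pending.pop()
        kids = [skip_unary(kid) for kid in full.get(node, [])]
        if kids:
            collapsed[node] = kids
            pending.extend(kids)
    return root, collapsed

bacteria = [1, 131567, 2]                    # root, cellular organisms, Bacteria
metazoa = [1, 131567, 2759, 33154, 33208]    # ..., Eukaryota, Opisthokonta, Metazoa

pruned = connecting_tree([bacteria, metazoa])
# pruned -> (131567, {131567: [2, 33208]})
detailed = connecting_tree([bacteria, metazoa], intermediate_nodes=True)
# detailed -> (131567, {131567: [2, 2759], 2759: [33154], 33154: [33208]})
```

In both cases the root is cellular organisms (131567), matching the tree printed above; the flag only decides whether the Eukaryota and Opisthokonta steps are kept.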

Automatic tree annotation using NCBI taxonomy

NCBI taxonomy annotation consists of adding information to any internal or leaf node in a given user tree. The only requirement is that each node in the query tree has an attribute containing its associated taxid. The annotation process will add the following features to the nodes:

  • sci_name
  • taxid
  • named_lineage
  • lineage
  • rank

Note that, for internal nodes, the taxid can be automatically inferred from their descendant nodes. The easiest way to annotate a tree is to use a PhyloTree instance, where the species name attribute is transparently used as the taxid attribute. Note that the PhyloNode.annotate_ncbi_taxa() method also returns the name, lineage and rank translators it used.

Remember that species names in PhyloTree instances are automatically extracted from leaf names. The parsing method can be easily adapted to any formatting:

from ete2 import PhyloTree

# load the whole leaf name as species taxid
tree = PhyloTree('((9606, 9598), 10090);', sp_naming_function=lambda name: name)
tax2names, tax2lineages, tax2rank = tree.annotate_ncbi_taxa()

# split names by '|' and return the first part as the species taxid
tree = PhyloTree('((9606|protA, 9598|protA), 10090|protB);', sp_naming_function=lambda name: name.split('|')[0])
tax2names, tax2lineages, tax2rank = tree.annotate_ncbi_taxa()

print tree.get_ascii(attributes=["name", "sci_name", "taxid"])


#                                             /-9606|protA, Homo sapiens, 9606
#                          /, Homininae, 207598
#-, Euarchontoglires, 314146                  \-9598|protA, Pan troglodytes, 9598
#                         |
#                          \-10090|protB, Mus musculus, 10090
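
When leaf names are less regular, a named function can replace the lambda and fail loudly on names that carry no taxid. A minimal sketch; taxid_from_leaf_name is a hypothetical helper, and the '|' separator follows the example above:

```python
def taxid_from_leaf_name(name, sep='|'):
    """Return the text before the first separator, stripped of whitespace."""
    taxid = name.split(sep)[0].strip()
    if not taxid.isdigit():
        raise ValueError("no numeric taxid found in leaf name %r" % name)
    return taxid

first = taxid_from_leaf_name('9606|protA')
second = taxid_from_leaf_name(' 10090|protB')
# first -> '9606', second -> '10090'
```

Such a function can be passed as sp_naming_function in place of the lambda used above.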

Alternatively, you can also use the NCBITaxa.annotate_tree() function to annotate a custom tree instance.

from ete2 import Tree, NCBITaxa
ncbi = NCBITaxa()
tree = Tree("((9606,9598),10090);")  # example tree; leaf names are the taxids
ncbi.annotate_tree(tree, taxid_attr="name")