.. module:: ete3 :synopsis: Converts evolutionary events into OrthoXML format .. moduleauthor:: Jaime Huerta-Cepas .. currentmodule:: ete3 .. _etree2orthoxml: SCRIPTS: orthoXML ************************************ .. contents:: OrthoXML parser ============================ :attr:`etree2orthoxml` is a python script distributed as a part of the ETE toolkit package. It uses an automatic python parser generated on the basis of the OrthoXML schema to convert the evolutionary events in phylogenetic tree topologies into the orthoXML format. ETE OrthoXML parser is a low level python module that allows to operate with the orthoXML structure using python objects. Every element defined in the orthoXML schema has its akin in the parser module, so a complete orthoXML structure can be generated from scratch within a python script. In other words, low level access to the orthoXML parser allows to create orthoxml documents in a programmatic way. The following example will create a basic orthoXML document :: from ete3 import orthoxml # Creates an empty orthoXML object oxml = orthoxml.orthoXML() # Add an ortho group container to the orthoXML document ortho_groups = orthoxml.groups() oxml.set_groups(ortho_groups) # Add an orthology group including two sequences orthologs = orthoxml.group() orthologs.add_geneRef(orthoxml.geneRef(1)) orthologs.add_geneRef(orthoxml.geneRef(2)) ortho_groups.add_orthologGroup(orthologs) oxml_file = open("test_orthoxml.xml", "w") oxml.export(oxml_file, level=0) oxml_file.close() # producing the following output # # # # # # # # The etree2orthoxml script ================================ :attr:`etree2orthoxml` is a standalone python script that allows to read a phylogenetic tree in newick format and export their evolutionary events (duplication and speciation events) as an orthoXML document. The program is installed along with ETE, so it should be found in your path. Alternatively you can found it in the script folder of the latest ETE package release (http://etetoolkit.org/releases/ete3/). To work, :attr:`etree2orthoxml` requires only one argument containing the newick representation of a tree or the name of the file that contains it. By default, automatic detection of speciation and duplication events will be carried out using the built-in :ref:`species overlap algorithm `, although this behavior can be easily disabled when event information is provided along with the newick tree. In the following sections you will find some use case examples. Also, consider reading the source code of the script. It is documented and it can be used as a template for more specific applications. Note that :attr:`etree2orthoxml` is a work in progress, so feel free to use the `etetoolkit mailing list `_ to report any feedback or improvement to the code. Usage ---------------- :: usage: etree2orthoxml [-h] [--sp_delimiter SPECIES_DELIMITER] [--sp_field SPECIES_FIELD] [--root [ROOT [ROOT ...]]] [--skip_ortholog_detection] [--evoltype_attr EVOLTYPE_ATTR] [--database DATABASE] [--show] [--ascii] [--newick] tree_file etree2orthoxml is a python script that extracts evolutionary events (speciation and duplication) from a newick tree and exports them as a OrthoXML file. positional arguments: tree_file A tree file (or text string) in newick format. optional arguments: -h, --help show this help message and exit --sp_delimiter SPECIES_DELIMITER When species names are guessed from node names, this argument specifies how to split node name to guess the species code --sp_field SPECIES_FIELD When species names are guessed from node names, this argument specifies the position of the species name code relative to the name splitting delimiter --root [ROOT [ROOT ...]] Roots the tree to the node grouping the list of node names provided (space separated). In example:'--root human rat mouse' --skip_ortholog_detection Skip automatic detection of speciation and duplication events, thus relying in the correct annotation of the provided tree using the extended newick format (i.e. '((A, A)[&&NHX:evoltype=D], B)[&&NHX:evoltype=S];') --evoltype_attr EVOLTYPE_ATTR When orthology detection is disabled, the attribute name provided here will be expected to exist in all internal nodes and read from the extended newick format --database DATABASE Database name --show Show the tree and its evolutionary events before orthoXML export --ascii Show the tree using ASCII representation and all its evolutionary events before orthoXML export --newick print the extended newick format for provided tree using ASCII representation and all its evolutionary events before orthoXML export Example: Using custom evolutionary annotation ------------------------------------------------------ If all internal nodes in the provided tree are correctly label as duplication or speciation nodes, automatic detection of events can be disabled using the :attr:`--skip_ortholog_detection` flag. Node labeling should be provided using the extended newick format. Duplication nodes should contain the label :attr:`evoltype` set to :attr:`D`, while speciation nodes should be set to :attr:`evoltype=S`. If tag names is different, the option :attr:`evoltype_attr` can be used as convenient. In the following example, we force the HUMAN clade to be considered a speciation node. :: # etree2orthoxml --skip_ortholog_detection '((HUMAN_A, HUMAN_B)[&&NHX:evoltype=S], MOUSE_B)[&&NHX:evoltype=S];' You can avoid tree reformatting when node labels are slightly different by using the :attr:`evoltype_attr`: :: # etree2orthoxml --evoltype_attr E --skip_ortholog_detection '((HUMAN_A, HUMAN_B)[&&NHX:E=S], MOUSE_B)[&&NHX:E=S];' However, more complex modifications on raw trees can be easily performed using the core methods of the ETE library, so they match the requirements of the :attr:`etree2orthoxml` script. :: from ete3 import Tree # Having the followin tree t = Tree('((HUMAN_A, HUMAN_B)[&&NHX:speciation=N], MOUSE_B)[&&NHX:speciation=Y];') # We read the speciation tag from nodes and convert it into a vaild evoltree label for node in t.traverse(): if not node.is_leaf(): etype = "D" if node.speciation == "N" else "S" node.add_features(evoltype=etype) # We the export a newick string that is compatible with etree2orthoxml script t.write(features=["evoltype"], format_root_node=True) # converted newick: # '((HUMAN_A:1,HUMAN_B:1)1:1[&&NHX:evoltype=D],MOUSE_B:1)1:1[&&NHX:evoltype=S];' Example: Automatic detection of species names -------------------------------------------------- As different databases and software may produce slightly different newick tree formats, the script provides several customization options. In gene family trees, species names are usually encoded as a part of leaf names (i.e. P53_HUMAN). If such codification follows a simple rule, :attr:`etree2orthoxml` can automatically detect species name and used to populate the relevant sections within the orthoXML document. For this, the :attr:`sp_delimiter` and :attr:`sp_field` arguments can be used. Note how species are correctly detected in the following example: :: # etree2orthoxml --database TestDB --evoltype_attr E --skip_ortholog_detection --sp_delimiter '_' --sp_field 0 '((HUMAN_A, HUMAN_B)[&&NHX:E=S], MOUSE_B)[&&NHX:E=S];' Example: Tree rooting --------------------------- When evolutionary events are expected to be automatically inferred from tree topology, outgroup information can be passed to the program to root the tree before performing the detection. :: # etree2orthoxml --ascii --root FLY_1 FLY_2 --sp_delimiter '_' --sp_field 0 '((HUMAN_A, HUMAN_B), MOUSE_B, (FLY_1, FLY_2));' /-FLY_1 /D, NoName | \-FLY_2 -S, NoName | /-HUMAN_A | /D, NoName \S, NoName \-HUMAN_B | \-MOUSE_B