Automatic switching from amino-acid to codon-based alignments

This recipe shows how to use ete-build to build nucleotide-based gene trees starting from amino-acid alignments.

When the average sequence identity in the amino-acid alignment reaches a given threshold, ete-build automatically converts the amino-acid alignment into a codon-based alignment and infers the tree using a nucleotide model (codon models are not supported).

This approach is useful when protein alignments do not provide enough phylogenetic resolution (e.g. synonymous mutations are masked), but you still want to align at the amino-acid level, which is usually more accurate than aligning nucleotide sequences directly.

Requirements

Recipe

1. Prepare amino-acid and nucleotide multi-sequence FASTA files

Both FASTA files should contain exactly the same set of sequence names. Each nucleotide sequence should be the coding sequence of the amino-acid sequence with the same name (i.e. their lengths should correspond, with three nucleotides per amino acid).

In [6]:
%%bash 
head -n5 data/NUP62.aa.fa data/NUP62.nt.fa
==> data/NUP62.aa.fa <==
>Phy003I7ZJ_CHICK
TMSQFNFSSAPAGGGFSFSTPKTAASTTAATGFSFTPAPSSGFTFGGAAPTPASSQPVTP
FSFSTPASSALPTAFSFGTPATATTAAPAASVFPLGGNAPKLNFGGTSTTQATGITGGFG
FGTSAPTSVPSSQAAAPSGFMFGTAATTTTTTTAAQPGTTGGFTFSSGTTTQAGTTGFNI
GATSTAAPQAVPTGLTFGAAPAAAATTTASLGSTTQPAATPFSLGGQSSATLTASTSQGP

==> data/NUP62.nt.fa <==
>Phy003I7ZJ_CHICK
ACCATGAGCCAGTTCAACTTCAGCTCGGCCCCGGCGGGAGGCGGCTTCTCCTTCAGCACGCCGAAAACGGCCGCCAGCAC
CACCGCGGCCACCGGCTTCTCCTTCACGCCCGCTCCCTCCTCGGGATTCACGTTCGGCGGCGCTGCTCCGACACCCGCCA
GCAGCCAGCCCGTCACGCCCTTCTCCTTCAGCACGCCGGCCAGCAGCGCGCTGCCCACCGCCTTCAGCTTCGGGACGCCC
GCAACAGCCACCACCGCCGCCCCGGCTGCCAGCGTGTTCCCGTTAGGGGGAAACGCACCAAAGCTCAACTTTGGAGGCAC
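
Before running the workflow, it can help to check that the two files really satisfy these requirements. The following is a minimal sketch, not part of ete-build: the helper names `read_fasta` and `check_pair` are hypothetical, and tolerating one trailing stop codon in the coding sequence is an assumption made here for illustration.

```python
def read_fasta(lines):
    """Return {name: sequence} from FASTA lines (name = first word of the header)."""
    chunks, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {n: "".join(parts) for n, parts in chunks.items()}


def check_pair(aa_seqs, nt_seqs):
    """List problems that would prevent pairing each protein with its CDS."""
    problems = []
    # Names found in only one of the two files.
    for name in sorted(set(aa_seqs) ^ set(nt_seqs)):
        problems.append(f"{name}: present in only one file")
    # Length check: three nucleotides per amino acid.
    for name in sorted(set(aa_seqs) & set(nt_seqs)):
        n_aa, n_nt = len(aa_seqs[name]), len(nt_seqs[name])
        # Assumption: one extra trailing stop codon is tolerated.
        if n_nt not in (3 * n_aa, 3 * n_aa + 3):
            problems.append(f"{name}: {n_nt} nt does not match {n_aa} aa")
    return problems


# Toy in-memory example; with real files, pass open file handles instead,
# e.g. read_fasta(open("data/NUP62.aa.fa")).
aa = read_fasta([">seqA", "TMS", ">seqB", "MK"])
nt = read_fasta([">seqA", "ACCATG", "AGC", ">seqB", "ATGAAATAA"])
print(check_pair(aa, nt))
```

An empty list means every protein has a coding sequence of a compatible length under the same name.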

2. Enable mixed mode in ete-build workflows

To do this:

  • pass both FASTA files as arguments to ete-build (-a for proteins, -n for nucleotide sequences)
  • specify a threshold for the aa->nt switch with --nt-switch-threshold. That is, the maximum protein sequence identity allowed for building protein-based trees.

If the average sequence identity in the protein alignment reaches the threshold provided, ete-build will convert the alignment into a codon-based alignment and continue to infer the tree using a nucleotide model.

In the following example, we configure the workflow to switch to nucleotide alignments if the average protein sequence identity is 90% or higher.
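
To make the switching rule concrete, here is a minimal sketch of the decision. It assumes one common definition of average identity (mean pairwise identity over columns where neither sequence has a gap); ete-build's internal calculation may differ, and the function names are hypothetical.

```python
from itertools import combinations


def pairwise_identity(a, b, gap="-"):
    """Fraction of identical residues over columns where neither sequence is gapped."""
    compared = identical = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue
        compared += 1
        identical += x == y
    return identical / compared if compared else 0.0


def average_identity(aligned):
    """Mean pairwise identity over all pairs of aligned sequences."""
    pairs = list(combinations(aligned, 2))
    if not pairs:
        return 0.0
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


def choose_alignment_type(aligned, threshold=0.9):
    """Return "nt" when the protein alignment is too conserved, else "aa"."""
    return "nt" if average_identity(aligned) >= threshold else "aa"


# Toy alignments (not the NUP62 data).
conserved = ["MKTA", "MKTA", "MK-A"]
divergent = ["MKTA", "MRSA"]
print(choose_alignment_type(conserved), choose_alignment_type(divergent))
```

The comparison uses `>=` to mirror the message ete-build prints when it switches.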

In [19]:
%%bash 
ete3 build -a data/NUP62.aa.fa -n data/NUP62.nt.fa -o mixed_types/ -w standard_fasttree --clearall --nt-switch-threshold 0.9
Toolchain path: /Users/jhc/anaconda/bin/ete3_apps 
Toolchain version: 2.0.3


      --------------------------------------------------------------------------------
                  ETE build - reproducible phylogenetic workflows 
                                    unknown, unknown.

      If you use ETE in a published work, please cite:

        Jaime Huerta-Cepas, Joaquín Dopazo and Toni Gabaldón. ETE: a python
        Environment for Tree Exploration. BMC Bioinformatics 2010,
        11:24. doi:10.1186/1471-2105-11-24

      (Note that a list of the external programs used to complete all necessary
      computations will be also shown after execution. Those programs should
      also be cited.)
      --------------------------------------------------------------------------------

    
INFO -  Testing x86-64  portable applications...
       clustalo: OK - 1.2.1
Dialign-tx not supported in OS X
       fasttree: OK - FastTree Version 2.1.8 Double precision (No SSE3), OpenMP (1 threads)
         kalign: OK - Kalign version 2.04, Copyright (C) 2004, 2005, 2006 Timo Lassmann
          mafft: OK - MAFFT v6.861b (2011/09/24)
         muscle: OK - MUSCLE v3.8.31 by Robert C. Edgar
          phyml: OK - . This is PhyML version 20160115.
     pmodeltest: OK - pmodeltest.py v1.4
          prank: OK - prank v.100802. Minimal usage: 'prank sequence_file'
       probcons: OK - PROBCONS version 1.12 - align multiple protein sequences and print to standard output
          raxml: OK - This is RAxML version 8.1.20 released by Alexandros Stamatakis on April 18 2015.
 raxml-pthreads: OK - This is RAxML version 8.1.20 released by Alexandros Stamatakis on April 18 2015.
         readal: OK - readAl v1.4.rev6 build[2012-02-02]
         statal: OK - statAl v1.4.rev6 build[2012-02-02]
        tcoffee: OK - PROGRAM: T-COFFEE Version_11.00.8cbe486 (2014-08-12 22:05:29 - Revision 8cbe486 - Build 477)
         trimal: OK - trimAl v1.4.rev6 build[2012-02-02]
INFO -  Starting ETE-build execution at Mon Feb  8 14:05:16 2016
INFO -  Output directory /Users/jhc/_Devel/cookbook/recipes/mixed_types
INFO -  Erasing all existing npr data...
WRNG -  Using existing dir: /Users/jhc/_Devel/cookbook/recipes/mixed_types/db
INFO -  Reading aa sequences from data/NUP62.aa.fa...
INFO -  Reading nt sequences from data/NUP62.nt.fa...
WRNG -  25 target sequences
INFO -  ETE build starts now!
INFO -   Updating tasks status: (Mon Feb  8 14:05:16 2016)
INFO -  Thread clustalo_default-none-none-fasttree_full: pending tasks: 1 of sizes: 25
INFO -   (W) MultiSeqTask (25 aa seqs, MSF, /clustalo_d...ttree_full)
INFO -   (D) MultiSeqTask (25 aa seqs, MSF, /clustalo_d...ttree_full)
INFO -  Waiting 2 seconds
INFO -   Updating tasks status: (Mon Feb  8 14:05:18 2016)
INFO -  Thread clustalo_default-none-none-fasttree_full: pending tasks: 1 of sizes: 25
INFO -   (W) AlgTask (25 aa seqs, Clustal-Omega, /clustalo_d...ttree_full)
INFO -  Waiting 2 seconds
INFO -   Updating tasks status: (Mon Feb  8 14:05:20 2016)
INFO -  Thread clustalo_default-none-none-fasttree_full: pending tasks: 1 of sizes: 25
INFO -   (W) AlgTask (25 aa seqs, Clustal-Omega, /clustalo_d...ttree_full)
INFO -   (D) AlgTask (25 aa seqs, Clustal-Omega, /clustalo_d...ttree_full)
INFO -          Switching to codon alignment! amino-acid sequence similarity: 0.91 >= 0.90
INFO -  Waiting 2 seconds
INFO -   Updating tasks status: (Mon Feb  8 14:05:22 2016)
INFO -  Thread clustalo_default-none-none-fasttree_full: pending tasks: 1 of sizes: 25
INFO -   (W) TreeTask (25 nt seqs, FastTree, /clustalo_d...ttree_full)
INFO -  Waiting 2 seconds
INFO -   Updating tasks status: (Mon Feb  8 14:05:24 2016)
INFO -  Thread clustalo_default-none-none-fasttree_full: pending tasks: 1 of sizes: 25
INFO -   (W) TreeTask (25 nt seqs, FastTree, /clustalo_d...ttree_full)
INFO -  Waiting 2 seconds
INFO -   Updating tasks status: (Mon Feb  8 14:05:26 2016)
INFO -  Thread clustalo_default-none-none-fasttree_full: pending tasks: 1 of sizes: 25
INFO -   (R) TreeTask (25 nt seqs, FastTree, /clustalo_d...ttree_full)
INFO -   (D) TreeTask (25 nt seqs, FastTree, /clustalo_d...ttree_full)
INFO -  Waiting 2 seconds
INFO -   Updating tasks status: (Mon Feb  8 14:05:28 2016)
INFO -  Thread clustalo_default-none-none-fasttree_full: pending tasks: 1 of sizes: 25
INFO -   (W) TreeMergeTask (25 nt seqs, TreeMerger, /clustalo_d...ttree_full)
INFO -   (D) TreeMergeTask (25 nt seqs, TreeMerger, /clustalo_d...ttree_full)
INFO -  Waiting 2 seconds
INFO -  Assembling final tree...
INFO -  Done thread clustalo_default-none-none-fasttree_full in 1 iteration(s)
INFO -  Writing final tree for clustalo_default-none-none-fasttree_full
   /Users/jhc/_Devel/cookbook/recipes/mixed_types/clustalo_default-none-none-fasttree_full/NUP62.aa.fa.final_tree.nw
   /Users/jhc/_Devel/cookbook/recipes/mixed_types/clustalo_default-none-none-fasttree_full/NUP62.aa.fa.final_tree.nwx (newick extended)
INFO -  Writing final tree alignment clustalo_default-none-none-fasttree_full
   /Users/jhc/_Devel/cookbook/recipes/mixed_types/clustalo_default-none-none-fasttree_full/NUP62.aa.fa.final_tree.used_alg.fa
INFO -  Writing root node alignment clustalo_default-none-none-fasttree_full
   /Users/jhc/_Devel/cookbook/recipes/mixed_types/clustalo_default-none-none-fasttree_full/NUP62.aa.fa.final_tree.fa
INFO -  Generating tree image for clustalo_default-none-none-fasttree_full
   /Users/jhc/_Devel/cookbook/recipes/mixed_types/clustalo_default-none-none-fasttree_full/NUP62.aa.fa.final_tree.png
INFO -  Done
INFO -  Deleting temporal data...
   ========================================================================
         The following published software and/or methods were used.        
               *** Please, do not forget to cite them! ***                 
   ========================================================================
   Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R,
      McWilliam H, Remmert M, Söding J, Thompson JD, Higgins DG. Fast,
      scalable generation of high-quality protein multiple sequence
      alignments using Clustal Omega. Mol Syst Biol. 2011 Oct 11;7:539.
      doi: 10.1038/msb.2011.75.
   Huerta-Cepas J, Dopazo J, Gabaldón T. ETE: a python Environment for Tree
      Exploration. BMC Bioinformatics. 2010 Jan 13;11:24.
   Price MN, Dehal PS, Arkin AP. FastTree 2 - approximately maximum-
      likelihood trees for large alignments. PLoS One. 2010 Mar
      10;5(3):e9490.

During execution, a message like this was logged:

Switching to codon alignment! amino-acid sequence similarity: 0.91 >= 0.90

In the results folder, both the amino-acid and the nucleotide alignments are reported. The amino-acid alignment is written to "*.final_tree.fa", and the codon-based alignment is named "*.used_alg.fa", as this is the alignment actually used to build the reported tree.

In [17]:
%%bash
head -n2 mixed_types/clustalo_default-none-none-fasttree_full/NUP62.aa.fa.final_tree.fa
>Phy00535AU_PYGAD
------------------------------------------------AAQTPA-SSQPAGLFSFSTPGAAA-QPASFSFGTPATAA-AAPAANVFPLGANAPKLNFGGSAATQATGITGGFGFGSSVPTSVPSSQAAAPSGFVFGCAGTTTTTT---TTSAQSGTTGTFTFSSGTATQAGTPSFNIGAAA---PQAAPTGLTFGTAPAAA-ATTAATLGAATQS-TTPFCLGGQSA-------ATLTTSTSQGPTLSFGAKLGGRNTAPAAPPAAATTTTSILGSAGPTLFASIASSSAPTSA-TTTGLSLGAP---STGTASLGTLGFGLKVPGTTAAAT-STATSTT--SASGFALNLKPLTTTGAIGAGTSTAAITTATTA-SAPPVMTYAQLESLINKWSLELEDQEKHFLHQATQVNAWDRTLIENGEKITSLHREVEKVKLDQKRLDQELDFILSQQKELEDLLTPLEESVKEQSGTIYLQHADEERERT---------------------------------------------------------------------------------------------
In [18]:
%%bash
head -n2 mixed_types/clustalo_default-none-none-fasttree_full/NUP62.aa.fa.final_tree.used_alg.fa
>Phy00535AU_PYGAD
------------------------------------------------------------------------------------------------------------------------------------------------GCGGCCCAGACGCCTGCC---AGCAGCCAGCCCGCCGGGCTCTTCTCCTTCAGCACGCCGGGCGCTGCCGCG---CAGCCTGCCAGCTTCAGCTTCGGGACGCCGGCCACGGCCGCC---GCGGCTCCGGCAGCAAACGTGTTCCCGCTGGGGGCAAATGCACCAAAATTAAACTTTGGAGGCAGCGCTGCAACTCAAGCTACTGGAATCACAGGGGGCTTTGGATTTGGTAGCTCTGTACCGACCAGCGTGCCCTCAAGTCAAGCAGCAGCCCCTTCTGGCTTTGTGTTTGGATGTGCTGGCACCACCACCACCACCACC---------ACCACCTCCGCTCAGTCTGGGACAACTGGAACGTTTACTTTCTCCAGTGGTACCGCAACTCAGGCCGGAACGCCCAGCTTCAACATTGGCGCTGCAGCT---------CCGCAGGCAGCGCCCACCGGGTTGACCTTTGGAACAGCACCTGCAGCTGCT---GCCACCACTGCTGCCACCTTAGGGGCCGCAACCCAGTCG---ACAACCCCCTTCTGCCTTGGGGGGCAGTCTGCC---------------------GCAACGCTGACCACTAGTACCAGCCAGGGACCCACTCTGTCCTTTGGAGCCAAACTTGGAGGTAGGAACACCGCACCCGCCGCTCCCCCGGCTGCCGCTACCACCACAACCTCCATTCTTGGTTCAGCGGGGCCTACGTTGTTTGCATCTATAGCGAGTTCTTCAGCACCGACGTCGGCT---ACCACCACGGGCCTCTCACTTGGTGCCCCT---------TCCACTGGGACAGCAAGTCTTGGAACGCTTGGGTTTGGATTAAAAGTTCCTGGAACAACAGCGGCTGCAACA---AGCACTGCCACTAGCACTACT------TCTGCTTCTGGCTTTGCTTTGAATCTTAAGCCATTAACTACGACTGGTGCCATTGGAGCTGGGACTTCTACAGCTGCCATAACCACAGCAACCACAGCC---AGTGCACCTCCAGTGATGACGTATGCCCAGCTGGAGAGTTTGATAAACAAGTGGAGTCTGGAACTGGAAGACCAAGAGAAACACTTTCTCCATCAAGCTACGCAAGTTAATGCCTGGGATCGGACGCTGATAGAGAATGGAGAGAAGATTACTTCATTACACAGAGAAGTAGAGAAAGTGAAGCTTGATCAGAAGAGACTGGATCAGGAGCTGGACTTCATTCTGTCACAGCAGAAAGAGCTTGAAGACTTGTTGACCCCTCTGGAGGAGTCTGTGAAGGAGCAGAGCGGAACTATCTACTTGCAGCATGCAGATGAAGAACGGGAGAGGACG---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
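
The two outputs above show the relationship between the alignments: each gap in the protein alignment becomes "---" in the codon alignment, and each residue is replaced by its codon (for example, AAQTPA-SSQ corresponds to GCGGCCCAGACGCCTGCC---AGCAGCCAG). The following is a minimal sketch of that conversion for a single sequence. It is an illustration only, not ete-build's own code, and `thread_codons` is a hypothetical name.

```python
def thread_codons(aligned_protein, cds, gap="-"):
    """Thread an ungapped coding sequence onto a gapped protein alignment row."""
    # Split the coding sequence into consecutive codons.
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    out, used = [], 0
    for residue in aligned_protein:
        if residue == gap:
            # One protein gap column becomes three nucleotide gap columns.
            out.append(gap * 3)
            continue
        if used >= len(codons) or len(codons[used]) != 3:
            raise ValueError("coding sequence is too short for the protein")
        out.append(codons[used])
        used += 1
    return "".join(out)


# The first three residues of Phy003I7ZJ_CHICK (T, M, S) and their codons,
# placed in a made-up gapped row for illustration.
print(thread_codons("-TM-S", "ACCATGAGC"))
```

Because gaps are copied from the protein alignment, the codon alignment keeps the reading frame and is always three times as long as the protein alignment.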