
Building species trees using concatenated alignments

This recipe shows how to automate the reconstruction of species trees from concatenated multiple sequence alignments.

Requirements

Recipe

Reconstructing a species tree based on several concatenated alignments requires the use of two types of workflows in ete-build:

  • a gene-tree workflow used to align the sequences of each Orthologous Group (OG) (-w)
  • a workflow to concatenate all alignments and build a tree based on it (-m)

Typically, all sequences used in a concatenated alignment are grouped by Orthologous Groups (OGs). Within OGs, each species is expected to be represented by one and only one sequence.

1. Prepare data: sequences and orthologous groups

  • The COGs file must be a text file containing the same sequence IDs as the input FASTA file. Each line is treated as one COG, with its sequence IDs separated by TABs.

    For instance, the following example would define 3 COGs of size 3, 2 and 4 sequences respectively:

    sp1_seqA   sp2_seqA    sp3_seqA
    sp1_seqB   sp2_seqB    
    sp1_seqC   sp3_seqC    sp4_seqC    sp5_seqC
  • By default, the expected format for sequence names/identifiers is SpeciesCode_SequenceName. The species code must always precede the sequence name, but the default underscore delimiter can be changed with --spname-delimiter.

  • All sequences must be provided in a single FASTA file.
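
Before running ete-build, it can help to sanity-check the COGs file. The following is a minimal sketch, not part of ete-build: the helper names (read_cogs, species_of, duplicated_species) are hypothetical, and it assumes the default SpeciesCode_SequenceName format described above. It parses the three-COG example and reports the size of each COG.

```python
import csv
import io
from collections import Counter

def read_cogs(handle):
    """Parse a TAB-delimited COGs file: one COG per line, one sequence ID per field."""
    cogs = []
    for row in csv.reader(handle, delimiter="\t"):
        # Drop empty fields produced by trailing TABs or padding.
        ids = [field.strip() for field in row if field.strip()]
        if ids:
            cogs.append(ids)
    return cogs

def species_of(seq_id, delimiter="_"):
    """Return the species code, i.e. the text before the first delimiter."""
    return seq_id.split(delimiter, 1)[0]

def duplicated_species(cog, delimiter="_"):
    """List species that contribute more than one sequence to a COG."""
    counts = Counter(species_of(seq_id, delimiter) for seq_id in cog)
    return sorted(sp for sp, n in counts.items() if n > 1)

# The example COGs file from this recipe, held in memory for illustration.
example = (
    "sp1_seqA\tsp2_seqA\tsp3_seqA\n"
    "sp1_seqB\tsp2_seqB\t\n"
    "sp1_seqC\tsp3_seqC\tsp4_seqC\tsp5_seqC\n"
)
cogs = read_cogs(io.StringIO(example))
sizes = [len(cog) for cog in cogs]
print(sizes)
for cog in cogs:
    print(cog, "duplicated species:", duplicated_species(cog))
```

For a real file, pass an open file handle instead of the in-memory example. A non-empty duplicate list flags a COG that breaks the one-sequence-per-species expectation.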

2. Choose a gene-tree workflow for aligning sequences within each OG

Sequences belonging to each OG must be aligned prior to concatenation. Any ete-build gene-tree workflow can be used for this step; pass it with the -w option.

If an alignment trimming step is present in the gene-tree workflow, the trimmed version of the alignment will be used for concatenation.

3. Choose a workflow to select OGs and infer final tree

Supermatrix (concatenated) workflow names are defined in much the same way as gene-tree workflow names.
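
As an illustration, the two workflow names used in the ete3 build command of this recipe can be split on the dash character. This is only a sketch: the field labels below are descriptive assumptions chosen for this example, not identifiers used by ete-build, and the describe helper is hypothetical. The three supermatrix fields line up with the three tasks described in this section.

```python
# Workflow names copied from the ete3 build command used in this recipe.
GENETREE_WORKFLOW = "clustalo_default-trimal01-none-none"
SUPERMATRIX_WORKFLOW = "cog_100-alg_concat_default-fasttree_default"

# Descriptive labels chosen for this illustration (assumptions, not ete-build names).
GENETREE_FIELDS = ("aligner", "trimmer", "model_tester", "tree_builder")
SUPERMATRIX_FIELDS = ("og_selection", "concatenation", "tree_inference")

def describe(workflow_name, field_labels):
    """Pair each dash-separated field of a workflow name with a label."""
    parts = workflow_name.split("-")
    if len(parts) != len(field_labels):
        raise ValueError(f"expected {len(field_labels)} fields, got {len(parts)}")
    return dict(zip(field_labels, parts))

genetree = describe(GENETREE_WORKFLOW, GENETREE_FIELDS)
supermatrix = describe(SUPERMATRIX_WORKFLOW, SUPERMATRIX_FIELDS)
print(genetree)
print(supermatrix)
```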

There are three master tasks in a supermatrix workflow:

  • OG selection: Used to define the set of OGs that will be used to build the concatenated alignment. Although this can be done manually, ete-build offers several automatic options to discard OGs that are missing a given percentage of species.

In most cases, the OG selection step will default to cog_all, meaning that all the Orthologous Groups provided are used. However, other options are possible:

  • cog_100: Use only OGs containing sequences from all (100%) species
  • cog_90: Use only OGs containing sequences from at least 90% of the target species

  • Alignment concatenation: This task accepts no options at the moment. It calls the gene-tree workflow selected with the '-w' option, retrieves the results, and concatenates the alignments.

  • Tree inference: Used to infer the tree based on the concatenated alignment.
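
The OG selection step can be pictured as a species-coverage filter. The sketch below is an approximation written for this recipe, not ete-build's implementation: the function names and the min_percent parameter are hypothetical, and it ignores details such as how OGs with several sequences from the same species are handled. It applies thresholds in the spirit of cog_100 and lower cut-offs to the example COGs from step 1.

```python
def species_in(cog, delimiter="_"):
    """Return the set of species codes found in one OG (a list of sequence IDs)."""
    return {seq_id.split(delimiter, 1)[0] for seq_id in cog}

def select_cogs(cogs, min_percent, target_species=None, delimiter="_"):
    """Keep the OGs that cover at least min_percent of the target species."""
    if target_species is None:
        # Default: every species seen in any OG counts as a target.
        target_species = set()
        for cog in cogs:
            target_species |= species_in(cog, delimiter)
    target_species = set(target_species)
    if not target_species:
        return []
    selected = []
    for cog in cogs:
        present = species_in(cog, delimiter) & target_species
        # Integer comparison avoids floating-point rounding at the threshold.
        if len(present) * 100 >= min_percent * len(target_species):
            selected.append(cog)
    return selected

# Example COGs from step 1: five species overall (sp1 to sp5).
cogs = [
    ["sp1_seqA", "sp2_seqA", "sp3_seqA"],
    ["sp1_seqB", "sp2_seqB"],
    ["sp1_seqC", "sp3_seqC", "sp4_seqC", "sp5_seqC"],
]
for threshold in (100, 80, 60):
    kept = select_cogs(cogs, threshold)
    print(f"min {threshold}% of species: {len(kept)} OG(s) kept")
```

With only three small OGs, none covers all five species, so a 100% threshold keeps nothing. In real datasets the trade-off is the same: stricter thresholds give a more complete but shorter concatenated alignment.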

Other considerations

  • The set of target species is determined automatically from the sequences provided (all species detected are considered). However, the set of target species can be limited by providing a custom list with the --target-species argument.
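
Conceptually, species detection amounts to collecting the species codes that prefix each sequence ID. The sketch below illustrates the idea under the naming convention from step 1. The helper names and the sample IDs are hypothetical, and the code does not reproduce how ete-build itself parses names or applies --target-species.

```python
def detect_species(seq_ids, delimiter="_"):
    """Collect the species codes that prefix the given sequence IDs."""
    return sorted({seq_id.split(delimiter, 1)[0] for seq_id in seq_ids})

def restrict_to_targets(seq_ids, targets, delimiter="_"):
    """Keep only the sequences whose species code is in the target list."""
    targets = set(targets)
    return [s for s in seq_ids if s.split(delimiter, 1)[0] in targets]

# Hypothetical sequence IDs following SpeciesCode_SequenceName.
seq_ids = ["sp1_seqA", "sp2_seqA", "sp3_seqA", "sp1_seqB", "sp2_seqB"]
all_species = detect_species(seq_ids)
kept = restrict_to_targets(seq_ids, ["sp1", "sp3"])
print(all_species)
print(kept)
```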
In [5]:
%%bash 
ete3 build --cpu 4 -w clustalo_default-trimal01-none-none -m cog_100-alg_concat_default-fasttree_default -o basic_sptree/ \
    --clearall -a data/proteome_seqs.fa.gz --cogs data/cogs.txt
Toolchain path: /Users/jhc/anaconda/bin/ete3_apps 
Toolchain version: 2.0.3


      --------------------------------------------------------------------------------
                  ETE build - reproducible phylogenetic workflows 
                                    unknown, unknown.

      If you use ETE in a published work, please cite:

        Jaime Huerta-Cepas, Joaquín Dopazo and Toni Gabaldón. ETE: a python
        Environment for Tree Exploration. BMC Bioinformatics 2010,
        11:24. doi:10.1186/1471-2105-11-24

      (Note that a list of the external programs used to complete all necessary
      computations will be also shown after execution. Those programs should
      also be cited.)
      --------------------------------------------------------------------------------

    
INFO -  Testing x86-64  portable applications...
       clustalo: OK - 1.2.1
Dialign-tx not supported in OS X
       fasttree: OK - FastTree Version 2.1.8 Double precision (No SSE3), OpenMP (2 threads)
         kalign: OK - Kalign version 2.04, Copyright (C) 2004, 2005, 2006 Timo Lassmann
          mafft: OK - MAFFT v6.861b (2011/09/24)
         muscle: OK - MUSCLE v3.8.31 by Robert C. Edgar
          phyml: OK - . This is PhyML version 20160115.
     pmodeltest: OK - pmodeltest.py v1.4
          prank: OK - prank v.100802. Minimal usage: 'prank sequence_file'
       probcons: OK - PROBCONS version 1.12 - align multiple protein sequences and print to standard output
          raxml: OK - This is RAxML version 8.1.20 released by Alexandros Stamatakis on April 18 2015.
 raxml-pthreads: OK - This is RAxML version 8.1.20 released by Alexandros Stamatakis on April 18 2015.
         readal: OK - readAl v1.4.rev6 build[2012-02-02]
         statal: OK - statAl v1.4.rev6 build[2012-02-02]
        tcoffee: OK - PROGRAM: T-COFFEE Version_11.00.8cbe486 (2014-08-12 22:05:29 - Revision 8cbe486 - Build 477)
         trimal: OK - trimAl v1.4.rev6 build[2012-02-02]
INFO -  Starting ETE-build execution at Tue Feb  9 11:28:26 2016
INFO -  Output directory /Users/jhc/_Devel/cookbook/recipes/basic_sptree
INFO -  Erasing all existing npr data...
WRNG -  Using existing dir: /Users/jhc/_Devel/cookbook/recipes/basic_sptree/db
WRNG -  COG file restriction: 67820 sequences from 8 species 
INFO -  Reading aa sequences from data/proteome_seqs.fa.gz...
loaded:0010000 skipped:0000000 scanned:0010000 - Approx. time to finish: 6.1secs
loaded:0020000 skipped:0000000 scanned:0020000 - Approx. time to finish: 5.3secs
loaded:0030000 skipped:0000000 scanned:0030000 - Approx. time to finish: 4.2secs
loaded:0040000 skipped:0000000 scanned:0040000 - Approx. time to finish: 3.1secs
loaded:0050000 skipped:0000000 scanned:0050000 - Approx. time to finish: 2.0secs
loaded:0060000 skipped:0000000 scanned:0060000 - Approx. time to finish: 0.9secs
WRNG -  67820 target sequences
INFO -  ETE build starts now!
INFO -   Updating tasks status: (Tue Feb  9 11:28:35 2016)
INFO -  Thread cog_100-alg_concat_default-fasttree_default: pending tasks: 1 of sizes: 8
INFO -   (W) CogSelectorTask (8 species, MCL-COGs, /cog_100-al...ee_default)
INFO -                       9606  found in single copy in   20123 (75.1%) COGs 
INFO -                      10090  found in single copy in   19167 (71.5%) COGs 
INFO -                      31033  found in single copy in   14422 (53.8%) COGs 
INFO -                       7227  found in single copy in    5448 (20.3%) COGs 
INFO -                       3702  found in single copy in    4397 (16.4%) COGs 
INFO -                     294381  found in single copy in    1735 (6.5%) COGs 
INFO -                     507601  found in single copy in    1735 (6.5%) COGs 
INFO -                     578460  found in single copy in     756 (2.8%) COGs 
INFO -       Largest cog size: 8. Smallest cog size: 1
INFO -       Analysis of current COG selection:
INFO -                                 10090 species present in     72 COGs (100.0%)
INFO -                                  9606 species present in     72 COGs (100.0%)
INFO -                                578460 species present in     72 COGs (100.0%)
INFO -                                294381 species present in     72 COGs (100.0%)
INFO -                                  7227 species present in     72 COGs (100.0%)
INFO -                                 31033 species present in     72 COGs (100.0%)
INFO -                                  3702 species present in     72 COGs (100.0%)
INFO -                                507601 species present in     72 COGs (100.0%)
INFO -        72 COGs selected with at least 8 species out of 8
INFO -        Average COG size 8.0/8.0 +- 0.0
INFO -   (D) CogSelectorTask (8 species, MCL-COGs, /cog_100-al...ee_default)
INFO -  Waiting 2 seconds
INFO -   Updating tasks status: (Tue Feb  9 11:28:37 2016)
INFO -  Thread cog_100-alg_concat_default-fasttree_default: pending tasks: 1 of sizes: 8
INFO -   (W) ConcatAlgTask (8 species, 72 COGs, ConcatAlg, /cog_100-al...ee_default)
INFO -  Waiting 2 seconds
INFO -   Updating tasks status: (Tue Feb  9 11:28:39 2016)
INFO -  Thread cog_100-alg_concat_default-fasttree_default: pending tasks: 1 of sizes: 8
INFO -   (W) ConcatAlgTask (8 species, 72 COGs, ConcatAlg, /cog_100-al...ee_default)
INFO -  Waiting 2 seconds
INFO -   Updating tasks status: (Tue Feb  9 11:28:42 2016)
INFO -  Thread cog_100-alg_concat_default-fasttree_default: pending tasks: 1 of sizes: 8
INFO -   (R) ConcatAlgTask (8 species, 72 COGs, ConcatAlg, /cog_100-al...ee_default)
INFO -  Waiting 2 seconds
INFO -   Updating tasks status: (Tue Feb  9 11:28:45 2016)
INFO -  Thread cog_100-alg_concat_default-fasttree_default: pending tasks: 1 of sizes: 8
INFO -   (R) ConcatAlgTask (8 species, 72 COGs, ConcatAlg, /cog_100-al...ee_default)
INFO -  Waiting 2 seconds
INFO -   Updating tasks status: (Tue Feb  9 11:28:48 2016)
INFO -  Thread cog_100-alg_concat_default-fasttree_default: pending tasks: 1 of sizes: 8
INFO -   (W) ConcatAlgTask (8 species, 72 COGs, ConcatAlg, /cog_100-al...ee_default)
WRNG -       Concatenating trimmed alignments
WRNG -       Using aa concatenated alignment
INFO -   (D) ConcatAlgTask (8 species, 72 COGs, ConcatAlg, /cog_100-al...ee_default)
INFO -  Waiting 2 seconds
INFO -   Updating tasks status: (Tue Feb  9 11:28:50 2016)
INFO -  Thread cog_100-alg_concat_default-fasttree_default: pending tasks: 1 of sizes: 8
INFO -   (W) TreeTask (8 aa seqs, FastTree, /cog_100-al...ee_default)
INFO -  Waiting 2 seconds
INFO -   Updating tasks status: (Tue Feb  9 11:28:52 2016)
INFO -  Thread cog_100-alg_concat_default-fasttree_default: pending tasks: 1 of sizes: 8
INFO -   (W) TreeTask (8 aa seqs, FastTree, /cog_100-al...ee_default)
INFO -  Waiting 2 seconds
INFO -   Updating tasks status: (Tue Feb  9 11:28:54 2016)
INFO -  Thread cog_100-alg_concat_default-fasttree_default: pending tasks: 1 of sizes: 8
INFO -   (Q) TreeTask (8 aa seqs, FastTree, /cog_100-al...ee_default)
INFO -  Waiting 2 seconds
INFO -   Updating tasks status: (Tue Feb  9 11:28:56 2016)
INFO -  Thread cog_100-alg_concat_default-fasttree_default: pending tasks: 1 of sizes: 8
INFO -   (R) TreeTask (8 aa seqs, FastTree, /cog_100-al...ee_default)
INFO -  Waiting 2 seconds
INFO -   Updating tasks status: (Tue Feb  9 11:28:58 2016)
INFO -  Thread cog_100-alg_concat_default-fasttree_default: pending tasks: 1 of sizes: 8
INFO -   (R) TreeTask (8 aa seqs, FastTree, /cog_100-al...ee_default)
INFO -  Waiting 2 seconds
INFO -   Updating tasks status: (Tue Feb  9 11:29:00 2016)
INFO -  Thread cog_100-alg_concat_default-fasttree_default: pending tasks: 1 of sizes: 8
INFO -   (R) TreeTask (8 aa seqs, FastTree, /cog_100-al...ee_default)
INFO -  Waiting 2 seconds
INFO -   Updating tasks status: (Tue Feb  9 11:29:02 2016)
INFO -  Thread cog_100-alg_concat_default-fasttree_default: pending tasks: 1 of sizes: 8
INFO -   (R) TreeTask (8 aa seqs, FastTree, /cog_100-al...ee_default)
INFO -  Waiting 2 seconds
INFO -   Updating tasks status: (Tue Feb  9 11:29:04 2016)
INFO -  Thread cog_100-alg_concat_default-fasttree_default: pending tasks: 1 of sizes: 8
INFO -   (R) TreeTask (8 aa seqs, FastTree, /cog_100-al...ee_default)
INFO -  Waiting 2 seconds
INFO -   Updating tasks status: (Tue Feb  9 11:29:06 2016)
INFO -  Thread cog_100-alg_concat_default-fasttree_default: pending tasks: 1 of sizes: 8
INFO -   (R) TreeTask (8 aa seqs, FastTree, /cog_100-al...ee_default)
INFO -  Waiting 2 seconds
INFO -   Updating tasks status: (Tue Feb  9 11:29:08 2016)
INFO -  Thread cog_100-alg_concat_default-fasttree_default: pending tasks: 1 of sizes: 8
INFO -   (R) TreeTask (8 aa seqs, FastTree, /cog_100-al...ee_default)
INFO -  Waiting 2 seconds
INFO -   Updating tasks status: (Tue Feb  9 11:29:10 2016)
INFO -  Thread cog_100-alg_concat_default-fasttree_default: pending tasks: 1 of sizes: 8
INFO -   (R) TreeTask (8 aa seqs, FastTree, /cog_100-al...ee_default)
INFO -   (D) TreeTask (8 aa seqs, FastTree, /cog_100-al...ee_default)
INFO -  Waiting 2 seconds
INFO -   Updating tasks status: (Tue Feb  9 11:29:12 2016)
INFO -  Thread cog_100-alg_concat_default-fasttree_default: pending tasks: 1 of sizes: 8
INFO -   (W) TreeMergeTask (8 aa seqs, TreeMerger, /cog_100-al...ee_default)
INFO -   (D) TreeMergeTask (8 aa seqs, TreeMerger, /cog_100-al...ee_default)
INFO -  Waiting 2 seconds
INFO -  Assembling final tree...
INFO -  Done thread cog_100-alg_concat_default-fasttree_default in 1 iteration(s)
INFO -  Writing final tree for cog_100-alg_concat_default-fasttree_default
   /Users/jhc/_Devel/cookbook/recipes/basic_sptree/cog_100-alg_concat_default-fasttree_default/proteome_seqs.fa.gz.final_tree.nw
   /Users/jhc/_Devel/cookbook/recipes/basic_sptree/cog_100-alg_concat_default-fasttree_default/proteome_seqs.fa.gz.final_tree.nwx (newick extended)
INFO -  Writing root node alignment cog_100-alg_concat_default-fasttree_default
   /Users/jhc/_Devel/cookbook/recipes/basic_sptree/cog_100-alg_concat_default-fasttree_default/proteome_seqs.fa.gz.final_tree.fa
INFO -  Generating tree image for cog_100-alg_concat_default-fasttree_default
   /Users/jhc/_Devel/cookbook/recipes/basic_sptree/cog_100-alg_concat_default-fasttree_default/proteome_seqs.fa.gz.final_tree.png
INFO -  Done
INFO -  Deleting temporal data...
   ========================================================================
         The following published software and/or methods were used.        
               *** Please, do not forget to cite them! ***                 
   ========================================================================
   Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R,
      McWilliam H, Remmert M, Söding J, Thompson JD, Higgins DG. Fast,
      scalable generation of high-quality protein multiple sequence
      alignments using Clustal Omega. Mol Syst Biol. 2011 Oct 11;7:539.
      doi: 10.1038/msb.2011.75.
   Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. trimAl: a tool for
      automated alignment trimming in large-scale phylogenetic analyses.
      Bioinformatics. 2009 Aug 1;25(15):1972-3.
   Huerta-Cepas J, Dopazo J, Gabaldón T. ETE: a python Environment for Tree
      Exploration. BMC Bioinformatics. 2010 Jan 13;11:24.
   Price MN, Dehal PS, Arkin AP. FastTree 2 - approximately maximum-
      likelihood trees for large alignments. PLoS One. 2010 Mar
      10;5(3):e9490.

The resulting tree image will contain an overview of the concatenated alignment (note that this can be huge in some cases).

In [32]:
from IPython.display import Image
Image(filename='basic_sptree/cog_100-alg_concat_default-fasttree_default/proteome_seqs.fa.gz.final_tree.png')
Out[32]:

Note that leaf names correspond to the species codes embedded in the sequence names. Since the species codes in this dataset are NCBI taxids, the tree can be annotated with NCBI taxonomy information to get a more readable result:

In [33]:
%%bash
cp basic_sptree/cog_100-alg_concat_default-fasttree_default/proteome_seqs.fa.gz.final_tree.nw sptree.nw
ete3 annotate --ncbi -t sptree.nw | ete3 view --ncbi --image sptree.png --Iw 700
In [30]:
from IPython.display import Image
Image(filename='sptree.png')
Out[30]: