The ETE toolkit
A Python Environment for Tree Exploration

ETE-NPR - HowTo


How can I customize an existing workflow or design my own?

The best way to create your own workflow is to dump an existing configuration and modify its parameters. To dump a specific workflow or meta-workflow, use:

$ npr dump standard_fasttree > workflow.cfg

You can now edit any option in the workflow.cfg file and use it when running an npr analysis:

$ npr -a sequences.fa -c workflow.cfg -w MyWorkflow

Note that the workflow name passed with -w can now refer to any valid block name in workflow.cfg.
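
For instance, assuming the dumped file starts with a genetree workflow block, renaming its header line is enough to make it selectable under a new name (block body abbreviated here):

[MyWorkflow]
_app = genetree
...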

Need help understanding the format of the configuration file? Check the next section.

Configuration file format

The configuration file is organized in blocks, each setting up a different task. The block name (within brackets) is used to cross-link from other sections. Every block contains a number of parameters, some of which are mandatory. For instance, every block must have an _app parameter, which defines the type of task or application associated with that block. Available _app types include:

# workflow types
_app = supermatrix
_app = genetree

# only for supermatrix workflows
_app = cogselector
_app = concatalg

# software bindings
_app = clustalo
_app = dialigntx
_app = fasttree
_app = mafft
_app = metaligner
_app = muscle
_app = phyml
_app = prottest
_app = raxml
_app = trimal

Each _app type may have its own mandatory parameters. The best way to create your own config block is to dump an existing application config block and modify it. A list of available apps can be obtained with npr apps. Choose your favorite and dump its config using its name. For instance:

$ npr dump raxml_default
[raxml_default]
      _app = raxml
_bootstrap = alrt
   _method = GAMMA
 _aa_model = JTT
        -f = d
        -p = 31416

Some tasks are only in charge of running internal code in ETE-NPR, while others may also require launching third-party software. As a general rule, any configuration parameter starting with an underscore (_) is considered an internal config parameter; all the rest are treated as command line arguments for the external software. In the case of raxml_default, all options are internal except for -f = d and -p = 31416, which will be passed directly to the raxml binary. Any extra argument can be added in the same way.
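
For instance, a hypothetical block (its name and the extra -N flag, which in RAxML sets the number of alternative runs, are chosen here only for illustration) that forwards one additional argument to the raxml binary could look like this:

[my_raxml_multi]
      _app = raxml
_bootstrap = alrt
   _method = GAMMA
 _aa_model = JTT
        -f = d
        -p = 31416
        -N = 10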

Workflows, by contrast, contain only internal parameters. For instance, a genetree workflow configuration looks like this:

[CustomWorkflow]
            _app = genetree
        
     _aa_aligner = @meta_aligner_trimmed
 _aa_alg_cleaner = @trimal01
_aa_model_tester = @prottest_default
_aa_tree_builder = @phyml_default
     _nt_aligner = @meta_aligner_trimmed
 _nt_alg_cleaner = @trimal01
_nt_model_tester = none
_nt_tree_builder = @phyml_default
         _appset = @builtin_apps
      

Each task required by a gene tree workflow points to its corresponding block section in the config file, which specifies how to run such a task. Note that cross-linking is done using the name of the pointed block preceded by the @ symbol. Simple, right? You can also disable certain tasks by using the word none.

Let's say that we want to build protein trees (_aa_tree_builder) using a specific RAxML setup. Your config should look something like this:

[CustomWorkflow]
...
_aa_tree_builder = @my_raxml_bootstrapped
...
        
[my_raxml_bootstrapped]
      _app = raxml
_bootstrap = 100   #alrt or number of bootstrap replicates
   _method = CAT
 _aa_model = JTT
        -f = d
        -p = 31416
      

The same logic applies to the whole configuration file. You can experiment by dumping some of the different workflows available (npr dump workflow_name) and changing their parameters. To dump the full configuration file available in ete-npr, use npr dump > allconfig.cfg

Configuring external software and multi-threading

ETE-NPR comes with a set of precompiled external software. All parameters have been tested against those bundled binaries, which allows for full reproducibility across systems. However, you can specify a custom path for any external software by modifying the following config block:
[builtin_apps]
        muscle = built-in, 1,
         mafft = built-in, 2,
      clustalo = built-in, 1,
        trimal = built-in, 1,
        readal = built-in, 1,
       tcoffee = built-in, 1,
         phyml = built-in, 1,
raxml-pthreads = built-in, 48,
         raxml = built-in, 1,
     dialigntx = built-in, 1,
      fasttree = built-in, 2,
        statal = built-in, 1,
      

Replace the word 'built-in' with the corresponding path to the application binary. The maximum number of CPUs each application is allowed to use is also defined in this block (the number after the comma). You can also modify those values if necessary.
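
For instance, assuming RAxML is installed at /usr/local/bin/raxmlHPC (a hypothetical path) and you want to cap it at 4 CPUs, the corresponding entry would become:

raxml = /usr/local/bin/raxmlHPC, 4,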

Performance tips (for cluster and large-scale usage)

Network filesystem impact

When running ETE-NPR on a shared filesystem (i.e. NFS), consider using the --scratch_dir option. ETE-NPR uses sqlite databases and writes several temporary files to disk to monitor all the tasks. Network filesystems are usually a bottleneck for this type of task. With the --scratch_dir option, ETE-NPR will attempt to run everything in the specified local scratch directory and move the results to the appropriate output directory when ready.

The following command will run all computations using /tmp as the scratch directory and move the results to /shared/myresults/ when ready. The /tmp directory will be cleaned up after both successful and failed runs.

$ npr -a seq.fa -w linsi_raxml -o /shared/myresults --scratch_dir /tmp

Of note, this option has produced significant improvements in the performance of large-scale computations with thousands of ETE-NPR instances running in parallel on a big cluster.

Compress intermediate files

Use the --compress option to automatically compress all intermediate and log files when the analysis is successfully finished.
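
For instance (workflow and file names as in the earlier examples):

$ npr -a seq.fa -w linsi_raxml -o results/ --compress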

Skip checks

You can skip application checks when the program starts by using the --nochecks option.

If you also trust your input sequence data, you can skip basic sequence checks by using the --no-seq-checks option.
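
Both options can be combined in a single call, for instance:

$ npr -a seq.fa -w linsi_raxml -o results/ --nochecks --no-seq-checks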

Changing scheduling time

By default, ETE-NPR will check the status of running tasks every 2 seconds and launch pending processes every 5 seconds. For long analyses, this should not create any performance problem other than a lot of unnecessary logging, but it may affect the running time of very fast executions.

In any case, both parameters can be changed through the -t and --launch_time options. For instance, for very small datasets, where tasks usually finish in a few seconds, these options may accelerate the execution:

$ npr -a seq.fa -w test -o results/ -t 0.5 --launch_time 1.5

Please note that extremely low scheduling values may lead to unnecessary overloading.

Tips about input data

Sequences are expected to be provided as an unaligned fasta file. However, you can also provide sequential and interleaved phylip formats, even with aligned sequences. Check the following options:

--seqformat {fasta,phylip,iphylip,phylip_relaxed,iphylip_relaxed}
--dealign   When used, gaps in the original sequence file will be
            removed, thus allowing alignment files to be used as
            input.
          

phylip and iphylip expect strict phylip format (sequence names of 10 characters maximum), while the _relaxed versions allow for longer sequence names.
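
For instance, to load an interleaved phylip file with relaxed names, or to reuse an already aligned fasta file as unaligned input (file names are illustrative):

$ npr -a seqs.phy --seqformat iphylip_relaxed -w linsi_raxml -o results/
$ npr -a aligned.fa --dealign -w linsi_raxml -o results/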

About sequence names

Many programs require sequence names to be unique and no longer than 10 characters. To circumvent these issues, ETE-NPR generates a unique hash name for every sequence in your input file and uses it for all the internal computations and tasks. Once the results are ready, the original names are restored. This is the most convenient mode for most users, but you may notice that intermediate files in the tasks/ directory contain the hashed version of the sequence names. If, for any reason, you prefer to use the original sequence names during the whole process, you can disable this behavior by using the --no-seq-rename option.
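
For instance:

$ npr -a seq.fa -w linsi_raxml -o results/ --no-seq-rename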

Recycling tasks versus full analysis reset

ETE-NPR allows you to stop and resume analyses by recycling the tasks available from previous runs. Let's explain the logic behind the method used:

When a task is finished, the raw resulting data (stored in the tasks/ directory) is processed and stored in a local database within the main results directory. Since each task has a unique identifier based on its source data and configuration parameters, the next time the same task is needed it will be fetched directly from the database. This is the default behavior when re-running an analysis (or starting a new workflow) using the same output directory. That is why re-running a previously finished analysis requires no computation: you will only see the tasks finishing very quickly.

For a complete reset of the result directory (i.e. running fresh new analysis from scratch), you will need to specify the --clearall option.
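
For instance, to discard all previous results under results/ and start over:

$ npr -a seq.fa -w linsi_raxml -o results/ --clearall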

There is also an intermediate option (although mostly for debugging purposes) that will clear the database but attempt to re-process the raw data from the tasks/ directory: the --softclear option.

Meta-workflows: Can different workflows be executed under the same output directory?

Yes, both at the same time or independently, but make sure they run on exactly the same input file. When doing so, the workflow will run normally and reuse any task that is present in the local database. For instance, if you run a Mafft-Raxml workflow and afterwards want to generate Mafft-Phyml results, the second run will reuse the Mafft alignment and compute only the Phyml tree.
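
For instance, using two hypothetical workflow names that share the Mafft alignment step, the second run would only need to compute the Phyml tree:

$ npr -a seq.fa -w mafft_raxml -o results/
$ npr -a seq.fa -w mafft_phyml -o results/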

The same idea is behind the preconfigured meta-workflows, which allow you to test multiple workflows on the same dataset at once while reusing common tasks. For example, the following command line will produce 8 trees, each reconstructed using Raxml under a different alignment strategy:

$ npr -w test_aligners_raxml -a seq.fa -o results/

Can I change the configuration file while recycling task results?

Yes, but you should specifically say so by using the --override option.
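
For instance, to rerun a workflow with a modified configuration file (modified.cfg is an illustrative name) while recycling compatible tasks:

$ npr -a seq.fa -c modified.cfg -w MyWorkflow -o results/ --override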

ETE is developed as an academic free software tool. If you find ETE useful for your work, please cite:

Jaime Huerta-Cepas, Joaquín Dopazo and Toni Gabaldón. ETE: a python Environment for Tree Exploration. BMC Bioinformatics 2010, 11:24. doi:10.1186/1471-2105-11-24

Support mailing list: etetoolkit@googlegroups.com
Contact: huerta@embl.de


The ETE Toolkit was originally developed at the bioinformatics department of CIPF and greatly improved at the comparative genomics unit of CRG. At present, ETE is maintained by Jaime Huerta-Cepas at the Structural and Computational Biology unit of EMBL (Heidelberg, Germany).