The best way to create your own workflow is to dump an existing configuration and modify its parameters. To dump a specific workflow or meta-workflow, use:
$ npr dump standard_fasttree > workflow.cfg
You can now edit any option in the workflow.cfg file and use it when running an npr analysis.
$ npr -a sequences.fa -c workflow.cfg -w MyWorkflow
Note that the workflow name ('MyWorkflow' in the example above) can refer to any valid block name in workflow.cfg.
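For instance, after dumping you could rename the block header so that -w MyWorkflow resolves to it; a minimal sketch (the parameters shown are illustrative, not the actual contents of standard_fasttree):

```ini
[MyWorkflow]
_app = genetree
_aa_aligner = @mafft_default
_aa_tree_builder = @fasttree_default
...
```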
Need help understanding the format of the configuration file? The following sections explain it.
The configuration file is organized in blocks, each setting up a different task. The block name (within brackets) is used to cross-link between sections. Every block contains a number of parameters, some of which are mandatory. For instance, every block should have an _app parameter, which defines the type of task or application associated with that block.
Available _app types include:
# workflow types
_app = supermatrix
_app = genetree

# only for supermatrix workflows
_app = cogselector
_app = concatalg

# software bindings
_app = clustalo
_app = dialigntx
_app = fasttree
_app = mafft
_app = metaligner
_app = muscle
_app = phyml
_app = prottest
_app = raxml
_app = trimal
Each _app type may have its own mandatory parameters. The best way to create your own config block is to dump an existing application config block and modify it. A list of available apps can be obtained with $ npr apps. Choose your favorite and dump its config using its name. For instance:
$ npr dump raxml_default

[raxml_default]
_app = raxml
_bootstrap = alrt
# _method = GAMMA
_aa_model = JTT
-f = d
-p = 31416
Some tasks are only in charge of running internal code in ETE-NPR, while others may also require launching third-party software. As a general rule, any configuration parameter starting with an underscore (_) is considered an internal config parameter; all the rest are treated as command-line arguments for the external software. In the case of raxml_default, all options are internal except for -f = d and -p = 31416, which will be passed directly to the raxml binary. Any extra argument can be added in the same way.
Workflows, by contrast, contain only internal parameters. For instance, a genetree workflow configuration looks like this:
[CustomWorkflow]
_app = genetree
_aa_aligner = @meta_aligner_trimmed
_aa_alg_cleaner = @trimal01
_aa_model_tester = @prottest_default
_aa_tree_builder = @phyml_default
_nt_aligner = @meta_aligner_trimmed
_nt_alg_cleaner = @trimal01
_nt_model_tester = none
_nt_tree_builder = @phyml_default
_appset = @builtin_apps
Each task required by a gene tree workflow points to its corresponding block section in the config file, which specifies how to run that task. Note that cross-linking is done using the name of the target block preceded by the @ symbol. Simple, right? You can also disable certain tasks by using the word none.
Let's say that we want to build protein trees (_aa_tree_builder) using a specific RAxML setup. Your config should look something like this:
[CustomWorkflow]
...
_aa_tree_builder = @my_raxml_bootstrapped
...

[my_raxml_bootstrapped]
_app = raxml
_bootstrap = 100  # alrt or number of bootstrap replicates
_method = CAT
_aa_model = JTT
-f = d
-p = 31416
The same logic applies to the whole configuration file. You can experiment by dumping the different available workflows (npr dump workflow_name) and changing their parameters.
To dump the full configuration available in ete-npr, use:

$ npr dump > allconfig.cfg
[builtin_apps]
muscle = built-in, 1
mafft = built-in, 2
clustalo = built-in, 1
trimal = built-in, 1
readal = built-in, 1
tcoffee = built-in, 1
phyml = built-in, 1
raxml-pthreads = built-in, 48
raxml = built-in, 1
dialigntx = built-in, 1
fasttree = built-in, 2
statal = built-in, 1
Replace the word 'built-in' with the corresponding path of the application. The maximum number of CPUs each application is allowed to use is also defined in this block (the number after the comma). You can also modify those values if necessary.
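For example, to point mafft to a local binary and raise the CPU limit for raxml-pthreads, the edited lines might look like this (the path and CPU counts are illustrative):

```ini
[builtin_apps]
mafft = /usr/local/bin/mafft, 2
raxml-pthreads = built-in, 16
...
```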
When running ETE-NPR on a shared filesystem (i.e. NFS), consider using the --scratch_dir option. ETE-NPR uses sqlite databases and writes several temporary files to disk to monitor all the tasks. Network filesystems are usually a bottleneck for this type of task. When the --scratch_dir option is used, ETE-NPR will attempt to run everything in the specified local scratch directory and move the results to the appropriate output directory when ready.
The following command will run all computations using /tmp as the scratch directory and move the results to /shared/myresults/ when ready. The /tmp data will be cleaned after both successful and failed runs.
$ npr -a seq.fa -w linsi_raxml -o /shared/myresults --scratch_dir /tmp
Of note, this option has produced significant performance improvements in large-scale computations with thousands of ETE-NPR instances running in parallel on a big cluster.
Use the --compress option to automatically compress all intermediate and log files when the analysis finishes successfully.
You can skip application checks when the program starts by using the --nochecks option. If you also trust your input sequence data, you can skip basic sequence checks by using the --no-seq-checks option.
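Both checks can be skipped in the same run; a sketch with illustrative file and workflow names:

```shell
$ npr -a seq.fa -w standard_fasttree -o results/ --nochecks --no-seq-checks
```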
By default, ETE-NPR will check the status of running tasks every 2 seconds and launch pending processes every 5 seconds. For long analyses, this should not create any performance problem other than a lot of unnecessary logging, but it may affect the running time of very fast executions.
In any case, both parameters can be changed through the -t and --launch_time options. For instance, for very small datasets, where tasks usually finish in a few seconds, lowering these values may accelerate the execution:
$ npr -a seq.fa -w test -o results/ -t 0.5 --launch_time 1.5
Please note that extremely low scheduling values may lead to unnecessary overloading.
Sequences are expected to be provided as an unaligned FASTA file. However, you can also provide sequences in sequential or interleaved phylip format, even when they are aligned. Check the following options:
--seqformat {fasta,phylip,iphylip,phylip_relaxed,iphylip_relaxed}

--dealign    When used, gaps in the original sequence file will be
             removed, thus allowing alignment files to be used as input.
phylip and iphylip expect strict phylip format (sequence names of 10 characters maximum), while the _relaxed versions allow longer sequence names.
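For instance, an already-aligned file in relaxed phylip format could be used as input like this (the file name is illustrative):

```shell
$ npr -a aligned_seqs.phy --seqformat phylip_relaxed --dealign -w standard_fasttree -o results/
```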
Many programs require sequence names to be unique and no longer than 10 characters. To circumvent these issues, ETE-NPR generates a unique hash name for every sequence in your input file and uses it for all the internal computations and tasks. Once the results are ready, the original names are restored. This is the most convenient mode for most users, but you may notice that intermediate files in the tasks/ directory contain the hashed version of the sequence names. If, for any reason, you prefer to use the original sequence names during the whole process, you can disable this behavior by using the --no-seq-rename option.
ETE-NPR allows you to stop and resume analyses by recycling the tasks available from previous runs. Let's explain the logic behind the method used:
When a task is finished, the raw resulting data (stored in the tasks/ directory) is processed and stored in a local database within the main results directory. Since each task has a unique identifier based on its source data and configuration parameters, the next time the same task is needed it will be fetched directly from the database. This is the default behavior when re-running an analysis (or starting a new workflow) using the same output directory. That is why re-running a previously finished analysis requires no computation: you will only see the tasks finishing very quickly.
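For instance, re-running the exact same command against the same output directory should finish almost immediately, with every task fetched from the database (file and workflow names are illustrative):

```shell
$ npr -a seq.fa -w standard_fasttree -o results/   # first run: computes everything
$ npr -a seq.fa -w standard_fasttree -o results/   # second run: tasks reused from the database
```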
For a complete reset of the results directory (i.e. running a fresh new analysis from scratch), you will need to specify the --clearall option.
There is also an intermediate option (although mostly for debugging purposes) that will clear the database but attempt to re-process the raw data from the tasks/ directory: the --softclear option.
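A sketch of the three reset levels (file and workflow names are illustrative):

```shell
# default: reuse everything already stored in the database
$ npr -a seq.fa -w standard_fasttree -o results/

# --softclear: clear the database, re-process raw data from tasks/
$ npr -a seq.fa -w standard_fasttree -o results/ --softclear

# --clearall: wipe the results directory and start from scratch
$ npr -a seq.fa -w standard_fasttree -o results/ --clearall
```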
Yes, both at the same time or independently, but make sure they run on exactly the same input file. When doing so, the workflow will run normally and reuse any task present in the local database. For instance, if you run a Mafft-Raxml workflow and afterwards want to generate Mafft-Phyml results, the second run will reuse the Mafft alignment and compute only the Phyml tree.
The same idea is behind the preconfigured meta-workflows, which allow testing multiple workflows on the same dataset at once while reusing common tasks. For example, the following command line will produce 8 trees, each reconstructed with Raxml under a different alignment strategy:
$ npr -w test_aligners_raxml -a seq.fa -o results/
Yes, but you must explicitly say so by using the --override option.
ETE is developed as an academic free software tool. If you find ETE useful for your work, please cite:
Jaime Huerta-Cepas, Joaquín Dopazo and Toni Gabaldón. ETE: a python Environment for Tree Exploration. BMC Bioinformatics 2010, 11:24. doi:10.1186/1471-2105-11-24
The ETE Toolkit was originally developed at the bioinformatics department of CIPF and greatly improved at the comparative genomics unit of CRG. At present, ETE is maintained by Jaime Huerta-Cepas at the Structural and Computational Biology unit of EMBL (Heidelberg, Germany).