Here we describe how intermediate steps are handled by ete-build, and how to access the intermediate files and data generated by the different tasks and jobs.
ete-build is a wrapper tool that pipelines the execution of multiple programs. Although a final tree and alignment are reported as the main result of a workflow, many other operations take place in the background.
In order to understand how ete-build works, consider the following facts:
- A workflow is composed of many Tasks.
- Each Task is composed of zero or more Jobs, plus some post-analysis operations (e.g. parsing, cleaning, etc.).
- Each Job represents a call to an external program.
For instance, when running a workflow using the `mafft_linsi` aligner, this is translated into a Mafft task that will call the `mafft` binary with the arguments and options under the `mafft_linsi` configuration block.
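The workflow/task/job hierarchy can be sketched as plain data structures. This is a minimal illustration only; the class names and fields below are assumptions for clarity, not ete-build's internal classes:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Job:
    # A Job is a single call to an external program.
    binary: str
    args: List[str]

@dataclass
class Task:
    # A Task groups zero or more Jobs plus post-analysis operations.
    name: str
    jobs: List[Job] = field(default_factory=list)

# The mafft_linsi example from above: one Mafft task wrapping one call
# to the mafft binary with the options of that configuration block.
align = Task("Mafft", [Job("mafft", ["--localpair", "--maxiterate", "1000", "input.fa"])])
print(align.jobs[0].binary)
```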
- Each Job in a task has a unique hash id built from its input data, program type and program arguments. A minimal change in one of the options generates a different job id.
- Similarly, each Task is assigned a unique hash id based on the configuration of the task and the ids of its sibling jobs.
- All task and job ids, as well as the resulting data, are stored in a SQLite database. A unique id is also assigned to each piece of data generated (e.g. a multiple sequence alignment, a tree or a trimmed alignment).
- Altogether, this system is what permits reusing previous results when resuming an analysis. If a new task or job is registered whose id is already present in the database, the stored output will be reused.
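The id scheme can be sketched roughly as follows. The helper names and the use of MD5 here are illustrative assumptions, not ete-build's actual implementation; the point is only that the id is a deterministic function of inputs, program and options:

```python
import hashlib

def job_id(input_data_id: str, program: str, args: dict) -> str:
    # Hypothetical sketch: a job id derives from input data, program
    # type and program arguments, so changing any option yields a new id.
    source = input_data_id + program + str(sorted(args.items()))
    return hashlib.md5(source.encode()).hexdigest()

def task_id(task_config: dict, job_ids: list) -> str:
    # A task id derives from the task configuration plus its jobs' ids.
    source = str(sorted(task_config.items())) + "".join(sorted(job_ids))
    return hashlib.md5(source.encode()).hexdigest()

jid_a = job_id("msa_001", "mafft", {"--maxiterate": 1000, "--localpair": True})
jid_b = job_id("msa_001", "mafft", {"--maxiterate": 500, "--localpair": True})
# A minimal change in one option produces a different job id.
print(jid_a != jid_b)
```

Because the ids are deterministic, re-registering the same job with the same options reproduces the same id, which is what allows stored results to be reused on resume.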
There are several ways to access the intermediate files and data:
After execution, a file called `commands.log` will be present in the output directory. It has the following tab-delimited format:
| TaskType | TaskId | JobName | JobId | command line used (if relevant) |
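Given that tab-delimited format, the log can be parsed in a few lines. This is a hypothetical sketch; the sample row and the ids in it are made up for illustration:

```python
import csv
import io

# Hypothetical sample row of a tab-delimited commands.log file.
sample = "Mafft\t1a2b3c\tmafft-job\t4d5e6f\tmafft --localpair input.fa\n"

# Each row: TaskType, TaskId, JobName, JobId, command line (if relevant).
rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))
for task_type, task_id, job_name, job_id, cmd in rows:
    print(task_type, job_id, cmd)
```

In a real run you would open the `commands.log` file from the output directory instead of the in-memory sample.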
All intermediate operations occur in the `tasks/` directory. Within `tasks/`, each Job and some Tasks store and process intermediate data; the names of the subdirectories correspond to Job or Task ids. The `input/` folder is used to dump previously generated data which is stored in the database and is required as input for other tasks. If a Job requires data files generated in previous tasks, those files are referred to using their corresponding data ids and are dumped into the `input/` directory when necessary.
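The layout described above can be explored programmatically. The sketch below builds a toy `tasks/` tree with hypothetical ids and file names (not taken from a real run) and lists its entries:

```python
import os
import tempfile

# Build a toy tasks/ layout like the one described above (ids and the
# dumped file name are hypothetical, for illustration only).
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "tasks", "4d5e6f"))
os.makedirs(os.path.join(root, "tasks", "input"))
open(os.path.join(root, "tasks", "input", "data_1a2b3c.fa"), "w").close()

# Subdirectory names under tasks/ correspond to Job or Task ids;
# input/ holds data dumped from the database for use by other tasks.
entries = sorted(os.listdir(os.path.join(root, "tasks")))
print(entries)
```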
All job directories follow the same basic structure. They contain:

- `__cmd__`: the command line used to launch the job.
- `__stderr__` files capturing the job output.
- `__time__`: a file recording the start and finish time of the job.
- `__status__`: a file reporting a single-letter status indicating whether the job is (D)one, (R)unning or has (E)rrors.
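These per-job files make it easy to monitor a run. A minimal sketch (the job directory names are simulated; only the `__status__` file name and the one-letter codes come from the description above) that collects the status of every job:

```python
import os
import tempfile

# Simulate two job directories, each with a one-letter __status__ file:
# (D)one, (R)unning or (E)rrors.
tasks = tempfile.mkdtemp()
for job, status in (("job_aaa", "D"), ("job_bbb", "R")):
    os.makedirs(os.path.join(tasks, job))
    with open(os.path.join(tasks, job, "__status__"), "w") as fh:
        fh.write(status)

# Collect job statuses by reading each __status__ file.
statuses = {}
for job in os.listdir(tasks):
    with open(os.path.join(tasks, job, "__status__")) as fh:
        statuses[job] = fh.read().strip()

print(statuses)
```

Pointing the same loop at a real `tasks/` directory would report the state of every job in the run.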