About controlling task execution
Execution control is a powerful tool, but keep in mind that if used incorrectly, it may render your workflows inoperable or cause errors that are hard to debug. If you plan on using execution control facilities, please make sure you understand the concepts outlined here.
The publicly available workflows on the Platform have been fine-tuned so when you copy a public workflow to your project, you can be sure that the Seven Bridges bioinformaticians have tightened its bolts to make the most of the provisioned compute time.
The Seven Bridges Platform understands and implements tools, workflows, jobs and execution as prescribed by the Common Workflow Language (CWL).
The need for optimization
Different tools often require different computation instances to operate efficiently.
When you build your own workflow or bring your own tools to the Platform, there are parameters you can tune to affect how jobs are executed.
The Platform scheduling algorithm seeks the best combination of instances for a given task. However, experience shows there is no silver bullet. An algorithm will never be able to find the most efficient solution for every combination of inputs and tools.
The trick is to match the available compute resources with the resource requirements of your analysis.
Execution control can help you optimize your apps in a number of scenarios:
- Reduce task execution time: Parallelizing multiple executions of the same tool can save you precious time if applied to critical tools in your workflow.
- Lower task cost: Having a good match between tool resource requirements and provisioned computational instances means that you will only be paying for compute resources that are being used.
- Ensure sufficient resources: When developing a tool using the Platform SDK, one of the key steps is specifying the computational resources (CPU cores and RAM) that the tool needs for proper operation. Execution control lets you make sure the provisioned instances actually meet those requirements.
Controlling task execution
The resource requirements of an analysis are given by the resource requirements (CPU and RAM) of the tools involved. The Scheduling Algorithm uses these parameters when deciding how to distribute jobs across available instances.
You can control the compute resources available for your tool executions by specifying the type of instance to be used and/or the maximum number of such instances that can be run in parallel. These settings are called execution hints. See this blog post for examples on how to make efficient use of computational resources.
Beware of mismatches between tool resource requirements and the resources available on the provisioned computation instance. Your task will fail if a tool requires more resources than the provisioned computation instance.
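The mismatch described above can be checked before a task is submitted. The following is a minimal sketch in Python, not a Platform API: the `Resources` type, the `oversized_tools` helper, and the tool names and requirements are hypothetical. Only the c4.2xlarge figures (8 vCPUs, 15 GB RAM) come from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpus: int
    ram_gb: float


def oversized_tools(requirements, instance):
    """Return the names of tools that need more CPU or RAM than the instance offers."""
    return sorted(
        name
        for name, req in requirements.items()
        if req.cpus > instance.cpus or req.ram_gb > instance.ram_gb
    )


# c4.2xlarge figures as stated on this page; the tool requirements are made up.
C4_2XLARGE = Resources(cpus=8, ram_gb=15)
requirements = {
    "aligner": Resources(cpus=8, ram_gb=14),
    "variant_caller": Resources(cpus=4, ram_gb=32),
}

# Any tool listed here would cause the task to fail on this instance.
print(oversized_tools(requirements, C4_2XLARGE))
```

In this made-up example, the second tool asks for more RAM than the instance provides, so it would need a larger instance type.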
Controlling task execution via instance type
You generally want to pick an instance that leaves the least CPU and memory to waste, given the jobs predicted to run as part of your task. This blog post provides a detailed example of how to determine the most suitable instance type.
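One way to reason about "least waste" is to pack as many identical jobs as fit on each candidate instance and compare what is left idle. The sketch below assumes a single job shape and ignores instance price, so treat it as an illustration rather than the Scheduling Algorithm itself. The `Instance` type, the helper functions, the short catalogue, and the job requirements are all hypothetical choices for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str
    cpus: int
    ram_gb: float


def wasted_fraction(instance, job_cpus, job_ram_gb):
    """Average idle share of CPU and RAM after packing identical jobs.

    Returns None when not even one job fits on the instance.
    """
    slots = min(instance.cpus // job_cpus, int(instance.ram_gb // job_ram_gb))
    if slots == 0:
        return None
    cpu_waste = 1 - slots * job_cpus / instance.cpus
    ram_waste = 1 - slots * job_ram_gb / instance.ram_gb
    return (cpu_waste + ram_waste) / 2


def least_wasteful(instances, job_cpus, job_ram_gb):
    """Return the name of the instance with the smallest wasted fraction, or None."""
    best_name, best_waste = None, None
    for instance in instances:
        waste = wasted_fraction(instance, job_cpus, job_ram_gb)
        if waste is not None and (best_waste is None or waste < best_waste):
            best_name, best_waste = instance.name, waste
    return best_name


# Illustrative catalogue; check the figures for the instances you actually use.
CATALOGUE = [
    Instance("c4.2xlarge", cpus=8, ram_gb=15),
    Instance("m4.2xlarge", cpus=8, ram_gb=32),
    Instance("c4.4xlarge", cpus=16, ram_gb=30),
]

# A hypothetical job that needs 4 cores and 14 GB of RAM.
print(least_wasteful(CATALOGUE, job_cpus=4, job_ram_gb=14))
```

For this job shape the memory-heavier instance wins because two jobs fit side by side and use all of its cores, whereas the other two candidates leave half of their cores idle.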
To set the instance type, set the sbg:AWSInstanceType hint at workflow level.
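As a rough illustration of what setting the hint could look like, the sketch below adds an entry to the `hints` list of a workflow description held as a Python dictionary. It assumes hints are stored as objects with `class` and `value` fields; verify the exact shape against an app exported from your own project. The `with_hint` helper and the stub workflow are hypothetical.

```python
import json


def with_hint(app, hint_class, value):
    """Return a copy of the app with the hint set, replacing any entry of the same class."""
    hints = [h for h in app.get("hints", []) if h.get("class") != hint_class]
    hints.append({"class": hint_class, "value": value})
    return {**app, "hints": hints}


# A stub, not a complete CWL document: only the parts relevant to hints are shown.
workflow = {"class": "Workflow", "steps": []}
workflow = with_hint(workflow, "sbg:AWSInstanceType", "c4.2xlarge")

print(json.dumps(workflow["hints"], indent=2))
```

Replacing an existing entry of the same class, rather than appending a second one, keeps the workflow from carrying two conflicting instance type hints.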
The hierarchy of instance type hints
While instance type hints can be set at several levels (workflow, node, tool and task), there is a hierarchy that determines which of the settings takes priority during execution.
An instance type specified at workflow level will override any instance type hints set for any of its nodes or any of the tools within that workflow. However, if an instance type is specified per task, it will override all hints set at workflow level. The resulting priority is: task > workflow > node > tool.
An instance type specified for a tool will only be considered when the tool is run on its own. When the tool becomes part of a workflow, its instance hint gets overridden by instance hints set at node level, which in turn get overridden by the workflow-level instance hints.
The current default instance the Scheduling Algorithm will attempt to use for execution is c4.2xlarge, which has 8 vCPUs and 15 GB RAM.
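The hierarchy and the default can be summarized as a simple lookup in priority order. This is a minimal sketch of the rules as described on this page, not Platform code: the `resolve_instance_type` function is hypothetical, and it assumes that a lower-level hint is used whenever no higher-level hint is set.

```python
DEFAULT_INSTANCE = "c4.2xlarge"  # default stated on this page


def resolve_instance_type(task=None, workflow=None, node=None, tool=None):
    """Return the instance type that wins under the priority task > workflow > node > tool."""
    for hint in (task, workflow, node, tool):
        if hint is not None:
            return hint
    # No hint at any level: fall back to the default instance.
    return DEFAULT_INSTANCE


# A workflow-level hint overrides a node-level one.
print(resolve_instance_type(workflow="c4.8xlarge", node="m4.xlarge"))
# A task-level hint overrides the workflow-level one.
print(resolve_instance_type(task="r4.xlarge", workflow="c4.8xlarge"))
# With no hints at all, the default is used.
print(resolve_instance_type())
```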
Controlling task execution via maximum number of parallel instances
Configuring the number of instances running at once for a single task is handy when parallelizing (scattering) several executions of the same tool.
The Seven Bridges bioinformaticians exploit this technique when tuning workflows in Public Apps.
To set the maximum number of instances, set the sbg:maxNumberOfParallelInstances hint at workflow level.
The default value for sbg:maxNumberOfParallelInstances is 1.
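To get a feel for what raising this value can buy, the sketch below estimates how many rounds ("waves") of scattered jobs are needed under a deliberately simplified model: all jobs take the same time, every instance runs the same number of jobs at once, and instances are always filled. The `scatter_waves` function and the job counts are hypothetical, and the real scheduler may behave differently.

```python
import math


def scatter_waves(n_jobs, jobs_per_instance, max_parallel_instances=1):
    """Estimate the number of sequential rounds needed to run all scattered jobs."""
    if min(n_jobs, jobs_per_instance, max_parallel_instances) < 1:
        raise ValueError("all arguments must be positive integers")
    concurrent_jobs = jobs_per_instance * max_parallel_instances
    return math.ceil(n_jobs / concurrent_jobs)


# 24 scattered jobs, 2 of which fit on one instance at a time.
print(scatter_waves(24, jobs_per_instance=2))                            # default of 1 instance
print(scatter_waves(24, jobs_per_instance=2, max_parallel_instances=4))  # up to 4 instances
```

Under these assumptions the number of rounds shrinks in proportion to the number of instances, while the total instance time consumed stays roughly the same.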