About parallelizing tool executions

To run several executions of the same tool in parallel, the Seven Bridges Platform implements a Common Workflow Language (CWL) feature called scattering.

Scattering is a mechanism that applies to a particular tool and one of its input ports. If a tool is passed a list of inputs on one port and that port is marked as "scattered", then one job will be created for each input in the list.
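As a minimal sketch, a scattered step in a CWL v1.0 workflow might look like the following. The tool file name (my-tool.cwl) and port names are hypothetical; only the scatter mechanism itself is the point here.

```yaml
# Sketch of a CWL v1.0 workflow that scatters a step over one input port.
class: Workflow
cwlVersion: v1.0
requirements:
  - class: ScatterFeatureRequirement
inputs:
  input_files: File[]          # a list of inputs to fan out over
outputs:
  results:
    type: File[]
    outputSource: process/output
steps:
  process:
    run: my-tool.cwl           # hypothetical tool description
    scatter: input_file        # one job is created per element of input_files
    in:
      input_file: input_files
    out: [output]
```

Because the step is scattered, passing a list of N files to input_files produces N independent jobs, and the step's output is collected back into a list of N results.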

The scheduling algorithm runs these jobs in parallel, as far as the available compute resources allow. Jobs that cannot be executed in parallel are queued and run as soon as resources become available.

Scattering on a critical tool in your workflow may shorten the workflow's run time significantly. For an example of how this can be achieved, see this blog post explaining how a whole genome analysis workflow uses scattering.

Note that scattering is different from performing batch analyses. Batching launches multiple tasks, whereas scattering happens within a single task.

Keeping scattering under control

The power of scattering to reduce analysis time lies in making full use of the available compute resources. You can control the resources available for the execution of an app by specifying instance type and the number of instances to be used in parallel.

While scattering is a powerful tool to shorten your analysis run time, it may well increase the overall cost of your analysis if used in combination with certain other settings.

There are two ways in which you can fine-tune how the scattering works on a tool:

  • Configuring computational instances on the tool.
  • Setting the maximum number of parallel instances.

Controlling via instance type

Based on the scattered tool's resource requirements, you may want to pick an instance type that wastes the least CPU and memory for a given number of scattered jobs and a given maximum number of parallel instances. This blog post explains how to choose an instance suitable for your analysis.
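The scheduler packs jobs onto instances according to the resource requirements declared on the tool. As a sketch, a tool whose jobs each need 4 CPU cores and 8 GB of RAM could declare this (the numbers are illustrative):

```yaml
# Sketch: per-job resource requirements on the scattered tool's description.
# The scheduler uses these values to decide how many jobs fit on an instance.
requirements:
  - class: ResourceRequirement
    coresMin: 4
    ramMin: 8000     # in MB
```

With these requirements, a hypothetical instance with 8 vCPUs and 16 GB of RAM would fit exactly two such jobs with nothing left idle, whereas an 8 vCPU / 32 GB instance would still fit only two jobs (CPU-bound) and leave half the memory unused.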

To set the instance type, set the sbg:AWSInstanceType hint at workflow level.
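A sketch of the hint, assuming the instance type value shown ("c4.2xlarge") as an illustrative choice:

```yaml
# Sketch: pinning the instance type with a workflow-level hint.
hints:
  - class: sbg:AWSInstanceType
    value: c4.2xlarge     # illustrative; pick a type that fits your jobs
```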

Controlling via maximum number of parallel instances

If you anticipate that the execution of the tool you are scattering is time-critical for the entire workflow, you can configure the maximum number of instances that the scheduling algorithm is allowed to have running at any one time.

If the jobs started as a result of scattering cannot all fit onto the provisioned instances given the tool's resource requirements, the remaining jobs are queued for execution. As soon as enough resources are freed by the completion of other jobs, queued jobs are run. This reduces idle time across the entire task.

The Seven Bridges bioinformaticians use this technique when tuning workflows in Public Apps.

To set the maximum number of instances, set the sbg:maxNumberOfParallelInstances hint at workflow level.
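A sketch of the hint, with an illustrative value of 4 parallel instances:

```yaml
# Sketch: capping how many instances may run in parallel for this workflow.
hints:
  - class: sbg:maxNumberOfParallelInstances
    value: 4     # illustrative; raise or lower to trade run time against cost
```

Combined with the instance type hint, this bounds the peak cost of a scattered step: at most 4 instances of the chosen type run at once, and any further scattered jobs wait in the queue.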