Automating the Seven Bridges Platform
Prerequisite
Before continuing with this part of the tutorial, please make sure you have advance_access enabled for the Seven Bridges API. Please refer to Installation and configuration to see how this is accomplished.
Running an app on the Seven Bridges Platform
This example works as written only for Enterprise users. If you are not an Enterprise user, you need to provide the 'billing_group_name' argument when creating the project.
The following automation script creates a project on the Seven Bridges Platform, copies a public FASTQ test file into the new project, and runs FastQC on it. After FastQC finishes, the script prints the generated outputs to the console.
from freyja import Step, Automation, Input
from hephaestus import SBApi
from hephaestus.steps import (
    FindOrCreateProject,
    FindOrCopyFilesByName,
    FindOrCopyApp,
    FindOrCreateAndRunTask
)

class Main(Step):
    project_name = Input(str)

    def execute(self):
        public_data_project = SBApi().projects.get("admin/sbg-public-data")
        my_project = FindOrCreateProject(name=self.project_name).project
        fastqc_app = FindOrCopyApp(
            app_id="admin/sbg-public-data/fastqc-0-11-4",
            to_project=my_project,
        ).app
        fastq_file = FindOrCopyFilesByName(
            names=["example_human_Illumina.pe_1.fastq"],
            from_project=public_data_project,
            to_project=my_project,
        ).copied_files[0]
        task = FindOrCreateAndRunTask(
            inputs={"input_fastq": [fastq_file]},
            app=fastqc_app,
            in_project=my_project,
        ).finished_task
        print(task.outputs)

if __name__ == "__main__":
    Automation(Main).run()
This script uses FindOr... versions of Hephaestus steps, which provide the added benefit of memoization via external state discovery.
That is, when this script is run again with the same project name, it finds the existing project instead of creating a new one with the same name, uses the input file and the app that are already in the project instead of copying them again, and returns the outputs of the previously run task instead of rerunning it with the same inputs.
This behavior is very useful for quick reruns in case of errors or changed inputs.
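The find-or-create pattern described above can be illustrated with a minimal sketch in plain Python. This is not how Hephaestus implements it: FakePlatform, find_project, create_project, and find_or_create_project are hypothetical names, and an in-memory dictionary stands in for the state that a real step would discover on the platform.

```python
from dataclasses import dataclass, field


@dataclass
class FakePlatform:
    """Hypothetical in-memory stand-in for platform state (not a real API)."""
    projects: dict = field(default_factory=dict)
    create_calls: int = 0

    def find_project(self, name):
        # Discover existing state by looking it up on the "platform".
        return self.projects.get(name)

    def create_project(self, name):
        # Count creations so the effect of a rerun is visible.
        self.create_calls += 1
        project = {"name": name}
        self.projects[name] = project
        return project


def find_or_create_project(platform, name):
    """Return the project with this name, creating it only if it is absent."""
    existing = platform.find_project(name)
    if existing is not None:
        return existing
    return platform.create_project(name)


platform = FakePlatform()
first = find_or_create_project(platform, "fastqc-demo")
second = find_or_create_project(platform, "fastqc-demo")  # rerun: found, not created
```

Because the lookup goes to the platform state rather than to a local cache, the second call returns the same project even if it happens in a completely separate run of the script.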
Tasks executed from an automation run
When a task is executed by an automation run, it preserves information about the run that started it. You can get this information using the API (Get details of a task). The automation run ID is available in the origin_id property.
This information is not available for tasks started before October 1st 2021, when this feature was first introduced.
In order to connect the tasks with automation runs, please make sure that an override authentication token is not provided through the automation or automation run settings and configurations.
If a task does not contain the origin_id property, the possible reasons are:
- the task was not started by an automation run
- an override token was used
- the task was executed before October 1st 2021
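As a rough sketch of reading this property, the snippet below fetches task details over HTTP and extracts the automation run ID. The base URL and the X-SBG-Auth-Token header are assumptions about a standard Seven Bridges API setup and may differ for your deployment; the helper names fetch_task_details and automation_run_id are hypothetical. Depending on your setup, advance access may also need to be enabled, as noted in the prerequisite.

```python
import json
import urllib.request
from typing import Optional

# Assumed API endpoint; replace it with the one for your deployment.
API_BASE = "https://api.sbgenomics.com/v2"


def fetch_task_details(task_id: str, token: str) -> dict:
    """Fetch the details of a task as a dictionary (assumed endpoint and header)."""
    request = urllib.request.Request(
        f"{API_BASE}/tasks/{task_id}",
        headers={"X-SBG-Auth-Token": token},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def automation_run_id(task_details: dict) -> Optional[str]:
    """Return the automation run ID, or None if origin_id is missing or empty."""
    return task_details.get("origin_id") or None
```

A result of None corresponds to one of the reasons listed above; the task details alone do not tell you which one applies.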
Direct access to the Seven Bridges public API
The underlying Seven Bridges Python API object can be accessed directly via the SBApi() singleton.
This singleton gives access to all the functionality of the Seven Bridges Python API in case there is no Hephaestus step available or necessary to perform a specific platform operation.
In the above example, we used the SBApi singleton to get the project for public reference files via its project ID.
Parallelization: automation loops vs. CWL scatter
If multiple files need to be processed, this can be accomplished with a simple loop inside the automation:
fastq_files = FindOrCopyFilesByName(
    names=[
        "example_human_Illumina.pe_1.fastq",
        "example_human_Illumina.pe_2.fastq",
    ],
    from_project=public_data_project,
    to_project=my_project,
).copied_files

finished_tasks = []
for fastq_file in fastq_files:
    task = FindOrCreateAndRunTask(
        f"RunFastQC-{fastq_file.name}",
        inputs={"input_fastq": [fastq_file]},
        app=fastqc_app,
        in_project=my_project,
    ).finished_task
    finished_tasks.append(task)
When you choose this approach, however, keep in mind that each iteration of this loop creates a separate task on the Seven Bridges Platform.
This is not always what you want. In particular, if you have hundreds or even thousands of files to process, this type of fine-grained flow control inside the automation can lead to inefficient execution on the Seven Bridges Platform.
Keep in mind that each task runs on a separate instance. So even if an instance could, based on resource requirements, process multiple files concurrently, this approach processes each file on a separate instance.
CWL scatter is more efficient than loops inside the automation
The better approach is to take advantage of CWL's built-in parallelization mechanisms (scatter) instead of parallelizing at the automation level.
For our specific example, the FastQC CWL app is already wrapped so that it accepts a list of files as input and creates a list of files as output. So instead of processing in a loop, we can simply pass a list of FASTQ files as input:
task = FindOrCreateAndRunTask(
    inputs={"input_fastq": fastq_files},
    app=fastqc_app,
    in_project=my_project,
).finished_task
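To make the difference concrete, here is a back-of-the-envelope sketch comparing the number of instances used by the two approaches. It is a deliberately simplified model, not a description of the actual scheduler: the function names are hypothetical, and files_per_instance is an assumed figure for how many files one instance can process concurrently given the app's resource requirements.

```python
import math


def instances_with_loop(n_files: int) -> int:
    # Automation loop: one task per file, and each task runs on its own instance.
    return n_files


def instances_with_single_task(n_files: int, files_per_instance: int) -> int:
    # Simplified model: one task whose jobs are packed several to an instance.
    return math.ceil(n_files / files_per_instance)


loop_count = instances_with_loop(1000)
single_task_count = instances_with_single_task(1000, files_per_instance=8)
```

Under these assumptions, 1000 files need 1000 instances with the loop but only 125 when a single task can place 8 files on each instance.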
Please keep this optimization strategy in mind when developing your own automations. Try to push parallelization down into the CWL layer whenever it is an option.
For automations that should run at scale, we strongly recommend using CWL scatter over automation loops to achieve parallelization.
Automation loops should remain reserved for top-level concurrency (e.g. at the sample level) and shouldn't be used to parallelize execution at lower levels (e.g. at chromosome, genomic region, or gene level).