Job retry

Overview

Jobs in a running task can fail for a number of reasons. In some cases, the Platform can automatically retry failed jobs and so avoid task failure. This page describes the cases in which the Platform retries jobs automatically.

Mechanisms for job retry

Jobs that fail for one of the reasons described below are retried automatically, in a way that depends on the nature of the interruption:

Instance stops responding. There are two reasons why this type of interruption might occur and they are indistinguishable to the Platform at runtime:
Memory or disk space exhaustion: The tool demands so many resources that the instance is unable to respond, even to a direct HTTP request.
Cloud provider issue: A network and/or hardware malfunction has occurred on the cloud provider's side.

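Following this interruption, unfinished jobs are retried once. A single retry is used because the two causes cannot be told apart: if the tool exhausted the instance's resources, the error is unlikely to be resolved without manually editing the tool, so further retries would not help; if the cloud provider was at fault, it is reasonable to expect that the same error will not happen again.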

Spot Instance is terminated. Because of high demand, the market price of the Spot Instance has risen above the bid price. Unfinished jobs are restarted on On-Demand instance(s), and the rest of the task execution continues on On-Demand instance(s). Learn more about Spot Instance termination.
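
The two retry rules above can be summarized as a small decision function. This is only an illustrative sketch, not the Platform's actual implementation or API: the names Interruption, RetryDecision and decide_retry are hypothetical, as is the assumption that a retry after an unresponsive instance keeps the job's current pricing model and that failures not described on this page are not retried.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Interruption(Enum):
    # Resource exhaustion or cloud provider issue (indistinguishable at runtime).
    INSTANCE_UNRESPONSIVE = "instance_unresponsive"
    # Market price of the Spot Instance rose above the bid price.
    SPOT_TERMINATED = "spot_terminated"
    # Any failure not described on this page (assumed not to be retried).
    OTHER = "other"


@dataclass(frozen=True)
class RetryDecision:
    retry: bool
    pricing: Optional[str] = None  # pricing model for the retried job, if any


def decide_retry(interruption, current_pricing, unresponsive_retries_used):
    """Hypothetical summary of the retry rules described above."""
    if interruption is Interruption.SPOT_TERMINATED:
        # Unfinished jobs are restarted on On-Demand instances.
        return RetryDecision(retry=True, pricing="on_demand")
    if interruption is Interruption.INSTANCE_UNRESPONSIVE:
        # Retried once only: a second failure suggests the tool itself
        # needs to be edited.
        if unresponsive_retries_used < 1:
            # Assumption: the pricing model is left unchanged.
            return RetryDecision(retry=True, pricing=current_pricing)
        return RetryDecision(retry=False)
    return RetryDecision(retry=False)
```

For example, decide_retry(Interruption.INSTANCE_UNRESPONSIVE, "spot", 1) returns a decision with retry set to False, because the single allowed retry has already been used.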

Job retry example

The image below is a screenshot from the task stats page, showing how the job retry mechanism steps in when a job is interrupted.

[ 1 ] The execution of GATK_IndelRealigner is interrupted, causing this job to fail.
[ 2 ] The job retry mechanism is triggered and a new instance is being initialized.
[ 3 ] The job inputs are being downloaded.
[ 4 ] The Docker image is being pulled.
[ 5 ] GATK_IndelRealigner is being run from the beginning on a new instance.
[ 6 ] The job outputs are being uploaded.
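
The sequence above shows that a retried job does not resume where it stopped: every phase is repeated on the new instance. A minimal sketch of that behavior follows, assuming a hypothetical run_phase callback and InstanceInterrupted exception; the phase names are illustrative labels for steps 2 to 6 and are not Platform identifiers.

```python
class InstanceInterrupted(Exception):
    """Hypothetical signal that the instance running the job was lost."""


# Illustrative labels for the phases a job goes through on an instance.
PHASES = (
    "initialize_instance",
    "download_inputs",
    "pull_docker_image",
    "run_tool",
    "upload_outputs",
)


def run_job(run_phase, max_retries=1):
    """Run all phases; on interruption, start again from the first phase.

    run_phase(phase, attempt) is a caller-supplied callback that raises
    InstanceInterrupted when the instance is lost.
    Returns a log of (attempt, event) tuples.
    """
    log = []
    for attempt in range(max_retries + 1):
        try:
            for phase in PHASES:
                log.append((attempt, phase))
                run_phase(phase, attempt)
            return log
        except InstanceInterrupted:
            log.append((attempt, "interrupted"))
    raise RuntimeError("job failed after all retries")


def flaky(phase, attempt):
    # Simulate the example: the tool is interrupted on the first attempt only.
    if attempt == 0 and phase == "run_tool":
        raise InstanceInterrupted()


log = run_job(flaky)
```

In this simulation the log contains the four phases of the first attempt, an "interrupted" entry, and then all five phases of the second attempt, which mirrors the tool being run from the beginning on a new instance.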
