Updating file metadata

Metadata for multiple files can be set with a single bulk API call instead of one call per file. Files typically need their metadata set before they can be provided as inputs to a CWL workflow.

In the examples below, we assume there are lists of forward- and reverse-read FASTQ files for a single sample, and we want to set both the sample_id and paired_end metadata fields on all of them.
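For illustration, the forward_reads and reverse_reads lists might be built by splitting one list of files on a filename convention. The _R1/_R2 suffix convention and the plain strings below are assumptions for this sketch; with real API File objects you would test f.name instead.

```python
# Hypothetical sketch: split files into forward/reverse reads using a
# _R1/_R2 filename convention. Plain strings stand in for File objects;
# with the API you would filter on f.name instead.
all_files = [
    's1_L001_R1.fastq', 's1_L001_R2.fastq',
    's1_L002_R1.fastq', 's1_L002_R2.fastq',
]
forward_reads = [f for f in all_files if '_R1' in f]
reverse_reads = [f for f in all_files if '_R2' in f]
```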

Not optimized for rate limit

The following example is not optimized for the rate limit: it iterates over all FASTQ files and sets metadata for each file with an individual API call.

# set metadata for forward read files
for fastq in forward_reads:
    fastq.metadata['sample_id'] = 'my-sample'
    fastq.metadata['paired_end'] = '1'
    fastq.save()
 
# set metadata for reverse read files
for fastq in reverse_reads:
    fastq.metadata['sample_id'] = 'my-sample'
    fastq.metadata['paired_end'] = '2'
    fastq.save()

Optimized for rate limit

An optimal way to update metadata for multiple files is to use a bulk API call and update metadata for up to 100 files per call.
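The batching itself is plain Python slicing, independent of the API. As a standalone sketch (the chunks helper below is not part of the API client):

```python
def chunks(items, size=100):
    """Yield successive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# 250 files split into three batches of 100, 100, and 50,
# i.e. three bulk API calls instead of 250 individual ones
batch_sizes = [len(batch) for batch in chunks(list(range(250)))]
```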

def bulk_update_metadata(files, metadata, replace=False):
    """Updates metadata for a list of files in bulk. Inputs must be parallel
    lists: the first element of 'files' corresponds to the first element of
    'metadata', and so on."""
     
    final_responses = []
     
    # process in batches of 100
    for i in range(0, len(files), 100):
 
        files_chunk = files[i:i + 100]
        md_chunk = metadata[i:i + 100]
 
        # make sure metadata attribute is set for all files before update;
        # avoids lazy fetching of metadata for each file in subsequent loop
        md_missing = any(not f.field('metadata') for f in files_chunk)
        if md_missing:
            files_chunk = [r.resource for r in api.files.bulk_get(files_chunk)]
 
        # set metadata for each file
        for f, md in zip(files_chunk, md_chunk):
            f.metadata = md
 
        # update or replace existing metadata
        if replace:
            responses = api.files.bulk_update(files_chunk)
        else:
            responses = api.files.bulk_edit(files_chunk)
 
        # check for errors
        for r in responses:
            if not r.valid:
                raise Exception(
                    '\n'.join([str(r.error), r.error.message, r.error.more_info])
                )
 
        final_responses.extend(responses)
         
    return final_responses
 
metadata = []
for fastq in forward_reads:
    metadata.append({"sample_id" : "my-sample", "paired_end" : "1"})
for fastq in reverse_reads:
    metadata.append({"sample_id" : "my-sample", "paired_end" : "2"})
   
responses = bulk_update_metadata(forward_reads + reverse_reads, metadata)
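The metadata list above can also be built without explicit loops. This equivalent sketch uses list comprehensions; the short placeholder lists stand in for the File objects used above:

```python
# Placeholders stand in for the forward/reverse File objects.
forward_reads = ['f1', 'f2']
reverse_reads = ['r1', 'r2']

# One metadata dict per file, in the same order as the file list
metadata = (
    [{"sample_id": "my-sample", "paired_end": "1"} for _ in forward_reads]
    + [{"sample_id": "my-sample", "paired_end": "2"} for _ in reverse_reads]
)
```

Because bulk_update_metadata pairs the lists positionally, keeping the metadata list in the same order as forward_reads + reverse_reads is what matters, not how the list is built.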