Updating file metadata
Metadata for multiple files can be set using a bulk API call instead of one call per file. Setting metadata for the files is typically required before they can be provided as input to a CWL workflow.
In the examples below, we assume that there is a list of FASTQ files for a specific sample and that we want to set both the sample_id and paired_end metadata fields for all of them.
Not optimized for rate limit
The example below is not optimized for the rate limit: it iterates over all FASTQ files and sets the metadata for each file individually, which costs one API call per file.
# set metadata for forward read files
for fastq in forward_reads:
    fastq.metadata['sample_id'] = 'my-sample'
    fastq.metadata['paired_end'] = '1'
    fastq.save()

# set metadata for reverse read files
for fastq in reverse_reads:
    fastq.metadata['sample_id'] = 'my-sample'
    fastq.metadata['paired_end'] = '2'
    fastq.save()
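To make the cost of the loop above concrete, the following is a minimal sketch that only counts requests. It assumes one request per individually saved file and a limit of 100 files per bulk call; the function names are hypothetical and nothing here talks to the API.

```python
import math

# Assumed batch limit: a bulk call handles up to 100 files.
BATCH_SIZE = 100


def requests_per_file(n_files):
    """Requests issued when every file is saved individually (one per file)."""
    return n_files


def requests_bulk(n_files, batch_size=BATCH_SIZE):
    """Requests issued when files are updated in batches of `batch_size`."""
    return math.ceil(n_files / batch_size)


if __name__ == "__main__":
    n = 250  # hypothetical number of FASTQ files
    print("per-file requests:", requests_per_file(n))
    print("bulk requests:", requests_bulk(n))
```

For a hypothetical 250 files, the per-file loop would need 250 requests, while batching would need 3.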
Optimized for rate limit
A more efficient way to update metadata for multiple files is to use a bulk API call, which updates metadata for up to 100 files per call.
def bulk_update_metadata(files, metadata, replace=False):
    """Update metadata for a list of files in bulk.

    The input lists must be ordered pairs, i.e. the first element of 'files'
    corresponds to the first element of 'metadata', and so on.
    """
    final_responses = []

    # process in batches of 100
    for i in range(0, len(files), 100):
        files_chunk = list(files[i:i + 100])
        md_chunk = list(metadata[i:i + 100])

        # make sure metadata attribute is set for all files before update;
        # avoids lazy fetching of metadata for each file in subsequent loop
        md_missing = any(not f.field('metadata') for f in files_chunk)
        if md_missing:
            files_chunk = api.files.bulk_get(files_chunk)

        # set metadata for each file
        for file_, md in zip(files_chunk, md_chunk):
            file_.metadata = md

        # update or replace existing metadata
        if replace:
            responses = api.files.bulk_update(files_chunk)
        else:
            responses = api.files.bulk_edit(files_chunk)

        # check for errors
        for r in responses:
            if not r.valid:
                raise Exception(
                    '\n'.join([str(r.error), r.error.message, r.error.more_info])
                )

        final_responses.extend(responses)

    return final_responses
metadata = []
for fastq in forward_reads:
    metadata.append({"sample_id": "my-sample", "paired_end": "1"})
for fastq in reverse_reads:
    metadata.append({"sample_id": "my-sample", "paired_end": "2"})

responses = bulk_update_metadata(forward_reads + reverse_reads, metadata)
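Because the bulk function pairs files and metadata by position, a mismatch in list lengths would go unnoticed when the two lists are zipped. The sketch below shows one possible way to build the two lists together and guard against that. The helper names build_paired_metadata and check_aligned are hypothetical, and plain strings stand in for file objects so the example runs without API access.

```python
def build_paired_metadata(forward_reads, reverse_reads, sample_id):
    """Return (files, metadata) as two lists in matching order."""
    forward_reads = list(forward_reads)
    reverse_reads = list(reverse_reads)
    files = forward_reads + reverse_reads
    # One fresh dict per file, so later edits to one entry do not leak
    # into the others.
    metadata = [
        {"sample_id": sample_id, "paired_end": "1"} for _ in forward_reads
    ]
    metadata += [
        {"sample_id": sample_id, "paired_end": "2"} for _ in reverse_reads
    ]
    return files, metadata


def check_aligned(files, metadata):
    """Raise ValueError if the two lists cannot be paired one-to-one."""
    if len(files) != len(metadata):
        raise ValueError(
            "got %d files but %d metadata entries" % (len(files), len(metadata))
        )


if __name__ == "__main__":
    # Placeholder names standing in for file objects.
    files, metadata = build_paired_metadata(
        ["s_R1_a.fastq", "s_R1_b.fastq"], ["s_R2_a.fastq"], "my-sample"
    )
    check_aligned(files, metadata)
    for name, md in zip(files, metadata):
        print(name, md)
```

With real file objects, the resulting pair of lists could then be passed to bulk_update_metadata in place of the manually built lists.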