Metadata schema


Metadata fields are subdivided into three categories (File, Sample, and General). We recommend entering as much metadata as possible when you first upload files to the Platform. For instance, for raw sequencing files, you should enter Platform (sequencing platform) and Sample ID. Across all categories, there are seven metadata fields that we highly suggest you set for your data. While your tasks may run correctly without them, these metadata fields will help optimize your analyses. These fields are labeled in the tables below with a suggested tag in the Name column.

Please keep in mind that the fields have to be specified exactly as listed in the tables below under the Name column. If a field is not written exactly as in the table, the Platform will interpret it as a custom metadata field (see below).

File

In the following table, you will find the name, description, and values of metadata fields for File. The second column, API key, allows you to access the specified metadata field through the API. Learn more about accessing metadata via the API.

There are six metadata fields that we highly suggest you set for your data. While your tasks may run correctly without them, these metadata fields will help optimize your analyses. These fields are labeled in the table below with a red suggested tag in the Name column.

Name	API key	Description	Values
Reference genome	`reference_genome`	The reference assembly (such as HG19 or GRCh37) to which the nucleotide sequence of a case can be aligned.	string. Suggested values: human_g1k_v37, human_g1k_v37_decoy, ucsc.hg19, Homo_sapiens.Ensembl.GRCh37, Homo_sapiens.GRCh38.dna.primary_assembly, ion_torrent.hg19, mouse_mm9_ucsc, ens_mouse_mm9_genome, mouse_mm10_ucsc
Quality scale suggested	`quality_scale`	For raw reads, this value denotes the sequencing technology and quality format. For BAM and SAM files, this value should always be ‘Sanger’. Enter this value for all FASTQ files, unless they are used in a workflow with a FASTQ quality scale detector wrapper.	Choose one of the following options: sanger, illumina13, illumina15, illumina18, solexa. Or, enter no value.
Platform suggested	`platform`	Only some tools and workflows require a value for the Platform field. However, it is recommended that you set it whenever possible, unless you are certain that your workflow will work without it.	string. Suggested values: Illumina HiSeq, Illumina GA, ABI capillary sequencer, Illumina MiSeq, ABI SOLiD, Ion Torrent PGM, LS 454, Illumina HiSeq X Ten, Illumina, Helicos, PacBio, Not available
Platform unit ID suggested	`platform_unit_id`	This is an identifier for lanes (Illumina) or slides (SOLiD), for cases where a library was split and run over multiple lanes on the flow cell or over multiple slides. The platform unit ID refers to the lane ID or the slide ID. The value supplied in the Platform unit ID field will be written to the read group tag (@RG:PU) in SAM or BAM files. All aligner apps add read group fields to the aligned BAM file on the basis of Platform unit ID metadata.	string
Paired end suggested	`paired_end`	For paired-end sequencing, this value determines the end of the fragment sequenced. For paired-end read files, this field indicates whether the read file is left end or right end. Set ‘1’ for left end and ‘2’ for right end reads. This is used to group pairs. If the FASTQ file is a single-end read this field should be left as ‘-’. Note: It is important for two members of paired-end reads to have identical Sample ID, Library ID, Platform unit ID, and File segment number.	This takes a value of 1 or 2. Note: For single-end sequencing no value is needed.
Library ID suggested	`library_id`	This is an identifier for the sequencing library preparation. The value set in this field does not affect whether or not the workflow runs successfully. However, all files that come from the same sequencing library must have the same value. The Library ID will be written to the read group tag (@RG:LB) in SAM or BAM files. All aligner apps are programmed to add RG fields to the aligned BAM according to the Library ID.	string
File segment number suggested	`file_segment_number`	If the sequencing reads for a single library, sample and lane are divided into multiple (smaller) files, the File segment number is used to enumerate these. Otherwise, this field can be left blank. This information can be used for batching when processing files with a workflow.	Integer.
Experimental strategy	`experimental_strategy`	This is the method or protocol used to perform the laboratory analysis.	string. Suggested values: DNA-Seq, WXS, WGS, Amplicon, Bisulfite-Seq, RNA-Seq, miRNA-Seq, Total RNA-Seq, Not available
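
The pairing rule in the Paired end row can be illustrated with a short Python sketch. It is not the Platform's own grouping logic: the `group_pairs` helper, the file names, and the plain-dictionary representation of metadata are assumptions made for illustration. Only the API keys come from the table above.

```python
from collections import defaultdict

# API keys that, per the Paired end note above, must be identical
# for the two files of a paired-end read pair.
PAIR_KEYS = ("sample_id", "library_id", "platform_unit_id", "file_segment_number")


def group_pairs(files):
    """Group files by the shared pairing keys.

    `files` maps a file name to its metadata dictionary (hypothetical layout).
    Returns {pairing key tuple: {"1": left-end file, "2": right-end file}}.
    """
    groups = defaultdict(dict)
    for name, meta in files.items():
        end = str(meta.get("paired_end", ""))
        if end not in ("1", "2"):
            continue  # single-end file or paired_end not set
        key = tuple(meta.get(k) for k in PAIR_KEYS)
        groups[key][end] = name
    return dict(groups)


# Hypothetical example files.
files = {
    "reads_1.fastq": {"sample_id": "S1", "library_id": "LIB1",
                      "platform_unit_id": "L001", "paired_end": "1"},
    "reads_2.fastq": {"sample_id": "S1", "library_id": "LIB1",
                      "platform_unit_id": "L001", "paired_end": "2"},
    "single.fastq": {"sample_id": "S2"},
}
pairs = group_pairs(files)
```

In this sketch the two `S1` files end up under the same key because all four pairing fields match, while `single.fastq` is skipped because it has no Paired end value.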

Sample

In the table below, you will find the name, description, and values of metadata fields for Sample. The second column, API key, allows you to access the specified metadata field through the API. Learn more about accessing metadata via the API.

There is one metadata field below that we highly suggest you set for your data. While your tasks may run correctly without this field, it will help optimize your analyses. This field is labeled in the table below with a suggested tag in the Name column.

Name	API key	Description	Value
Sample ID suggested	`sample_id`	A human readable identifier for a sample or specimen, which could contain some metadata information. A sample or specimen is material taken from a biological entity for testing, diagnosis, propagation, treatment, or research purposes, including but not limited to tissues, body fluids, cells, organs, embryos, body excretory products, etc. Tools use Sample ID to separate files that come from different samples. For SAM and BAM files, the value supplied in the Sample ID field is written to the read group tag (@RG:SM). All aligners add read group fields to the aligned BAM file using the file’s Sample ID metadata.	string
Sample type	`sample_type`	The type of material taken from a biological entity for testing, diagnosis, propagation, treatment, or research purposes. This includes tissues, body fluids, cells, organs, embryos, body excretory products, etc.	string Suggested values: Blood normal Tumor tissue Normal tissue Primary cells Stem cells Embryo Cell Line Saliva Control Not available
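
The descriptions above state that Sample ID, Library ID, and Platform unit ID are written to the `@RG` tags `SM`, `LB`, and `PU`. The following Python sketch shows what that mapping could look like for a SAM header line. The `read_group_line` function and the `rg_id` argument are hypothetical: the SAM format requires an `ID` tag in every `@RG` line, but this page does not say how the aligner apps choose it.

```python
# Mapping from metadata API key to read group tag, as described in the tables above.
RG_TAGS = {"sample_id": "SM", "library_id": "LB", "platform_unit_id": "PU"}


def read_group_line(rg_id, metadata):
    """Build a tab-separated @RG header line from a metadata dictionary.

    `rg_id` is an assumed, caller-supplied read group identifier.
    Metadata fields that are missing or empty are left out of the line.
    """
    fields = ["@RG", f"ID:{rg_id}"]
    for key, tag in RG_TAGS.items():
        value = metadata.get(key)
        if value:
            fields.append(f"{tag}:{value}")
    return "\t".join(fields)


line = read_group_line(
    "rg1",
    {"sample_id": "S1", "library_id": "LIB1", "platform_unit_id": "L001"},
)
```

This also shows why files from the same library or sample should carry identical values: tools that read the resulting BAM file rely on these tags to tell read groups apart.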

General

In the following table, you will find the name, description, and values of metadata fields for General. The second column, API key, allows you to access the specified metadata field through the API. Learn more about accessing metadata via the API.

Name	API key	Description	Value
Species	`species`	A group of organisms having some common characteristic or qualities, that differ from all other groups of organisms and that are capable of breeding and producing fertile offspring.	String. Suggested values: Homo sapiens, Mus musculus
Investigation	`investigation`	A value denoting the project or study that generated the data.	String
Case ID	`case_id`	This is a human-readable identifier, such as a number or a string for a subject who has taken part in the investigation or study.	String
Batch number	`batch_number`	This is an assigned distinctive alpha-numeric identification code that signifies grouping.	Integer

Apart from the standard set of metadata fields that can be seen through the visual interface, custom metadata fields can be added via the command line uploader or via the API.
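
Because any field that does not exactly match a standard name is treated as custom, it can help to check keys before uploading. Below is a minimal Python sketch that splits a metadata dictionary into standard and custom fields using the API keys from the tables above. The `split_metadata` helper is hypothetical, and the exact, case-sensitive comparison of API keys is an assumption about how matching works.

```python
# Standard API keys, copied from the File, Sample, and General tables above.
STANDARD_KEYS = frozenset({
    "reference_genome", "quality_scale", "platform", "platform_unit_id",
    "paired_end", "library_id", "file_segment_number", "experimental_strategy",
    "sample_id", "sample_type",
    "species", "investigation", "case_id", "batch_number",
})


def split_metadata(metadata):
    """Return (standard, custom) dictionaries based on exact key matches."""
    standard = {k: v for k, v in metadata.items() if k in STANDARD_KEYS}
    custom = {k: v for k, v in metadata.items() if k not in STANDARD_KEYS}
    return standard, custom


# "Sample_ID" differs from "sample_id" only in case, so this sketch
# classifies it as custom (assumed case-sensitive matching).
standard, custom = split_metadata(
    {"sample_id": "S1", "Sample_ID": "S1", "tissue_site": "liver"}
)
```

A misspelled standard key shows up in `custom`, which makes accidental custom fields easy to spot before they reach the Platform.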

Updated about 2 years ago