The Platform hosts large genomics datasets along with the tools to query, filter, and browse them. The resulting data can be added to your projects and analyzed with your private data to address your research questions. Below, learn more about datasets on the Platform and access resources describing each dataset's data and metadata. Note that these datasets are not available if you are using AWS EU as your cloud provider.
Consistent with terminology used by the Genomic Data Commons (GDC), datasets on the Platform are divided into two categories: "harmonized" and "legacy". In 2016, the GDC started hosting and distributing previously generated data from The Cancer Genome Atlas (TCGA). Additionally, for all submitted sequence data (FASTQs and BAM alignment files), the GDC generated new alignments (BAM files) to the latest human reference genome, GRCh38, using standard workflows. Using these alignments, the GDC generated derived data, including normal and tumor variant and mutation calls, gene and miRNA expression profiles, and splice junction quantification data. The GDC refers to this process of data generation through standard workflows as data harmonization.
Datasets on the Platform that are aligned to GRCh38 or that use a data model similar to that of the GRCh38 datasets from the GDC are labeled "harmonized". Datasets on the Platform that are not aligned to GRCh38 or that use a different data model are labeled "legacy". "Legacy" datasets remain fully supported.
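The labeling rule above can be written as a small function. This is only an illustrative sketch, not Platform code: the function name and both flags are hypothetical, and it reads the first sentence literally, so either condition alone is enough for "harmonized". The text does not say how a dataset that meets one condition but not the other is labeled, so that choice is an assumption.

```python
def dataset_label(aligned_to_grch38: bool, gdc_like_data_model: bool) -> str:
    """Return "harmonized" or "legacy" for a dataset.

    Assumption: either condition alone is sufficient for "harmonized".
    """
    if aligned_to_grch38 or gdc_like_data_model:
        return "harmonized"
    return "legacy"


# Flag values below are illustrative assumptions, not Platform metadata.
print(dataset_label(aligned_to_grch38=True, gdc_like_data_model=True))    # harmonized
print(dataset_label(aligned_to_grch38=False, gdc_like_data_model=False))  # legacy
```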
TCGA is one of the world’s largest cancer genomics data collections, including more than eleven thousand patients, representing 33 cancers, and over half a million total files. Data collection for TCGA began in 2006 as a joint effort by the National Cancer Institute (NCI), the National Human Genome Research Institute (NHGRI), the National Institutes of Health (NIH), and the U.S. Department of Health and Human Services. The Platform provides powerful methods to query and reproducibly analyze TCGA data by itself or in conjunction with your own data.
TCGA data is made available on the Seven Bridges Platform through an integration with the Seven Bridges Cancer Genomics Cloud (CGC). TCGA on the Platform includes both Open and Controlled Data. While all data in TCGA is stripped of direct identifiers, DNA information is inherently unique to an individual. Two data access tiers (Open Data and Controlled Data) have been put in place to make TCGA data as widely available as possible while ensuring that the rights of study participants are well protected. You can access TCGA Open Data on the Platform after you are authenticated and agree to data use policies. In addition, you can obtain access to Controlled Data through the NIH via the Database of Genotypes and Phenotypes (dbGaP) site.
There are two iterations of the TCGA dataset on the Platform:
- TCGA
- TCGA GRCh38
Learn more about their differences below.
TCGA is a "legacy" dataset that contains TCGA data from the original genome build produced by CGHub. This dataset was imported before the GDC completed their harmonized data model. In addition, the Platform hosts the harmonized version of TCGA, TCGA GRCh38, as discussed below.
The TCGA dataset is termed "legacy" in accordance with the GDC labeling convention because its sequence data was not aligned to GRCh38. Note that the metadata fields available for the legacy TCGA dataset are different from those available for TCGA GRCh38.
TCGA GRCh38 is a "harmonized" dataset that contains BAM files derived from TCGA FASTQs that have been re-aligned to GRCh38. Note that TCGA GRCh38 does not contain the FASTQs themselves. We've aligned our TCGA data to the GDC's harmonized data model so users can access the same data using similar search terms.
The CGC also hosts a non-harmonized ("legacy") version of TCGA, named TCGA, as discussed above. Note that the metadata fields available for TCGA GRCh38 are different from those available for the TCGA dataset available on the CGC.
The Clinical Proteomic Tumor Analysis Consortium (CPTAC) is a comprehensive and coordinated effort to accelerate understanding of the molecular basis of cancer through the application of robust, quantitative, proteomic technologies and workflows.
The CPTAC analyzes cancer biospecimens from genomics initiatives such as The Cancer Genome Atlas (TCGA) by mass spectrometry to characterize and quantify their constituent proteins or “proteome”. These mass spectrometry data are present in four different file formats including raw mass spectrometry spectra in vendor-specific file formats and processed peptide spectrum match (PSM) data.
The Cancer Imaging Archive (TCIA) contains radiological imaging data generated as part of The Cancer Genome Atlas (TCGA) with the aim of connecting cancer phenotypes to genotypes by providing matched clinical imaging and genomic analysis data.
TCIA includes Open Access radiological images that represent 21 types of cancer detailed in TCGA. These images are stored in the standard DICOM format.
The Cancer Cell Line Encyclopedia (CCLE) is a project performing detailed genetic and pharmacologic characterization of a large number of immortalized human cancer cell lines. The CCLE is the result of a collaboration between the Broad Institute, the Novartis Institutes for Biomedical Research, and the Genomics Institute of the Novartis Research Foundation.
CCLE is referred to as a "legacy" dataset on the CGC in accordance with the GDC labeling convention for datasets not aligned to GRCh38. It contains Open Access sequencing data (in the form of reads aligned to the hg19 reference genome) for nearly 1000 cancer cell line samples. The data was obtained from CGHub on May 11, 2016.
The Simons Genome Diversity Project (SGDP) dataset contains complete genome sequences from more than one hundred diverse human populations. It is the largest dataset of diverse, high quality human genome sequences ever reported. To represent as much anthropological, linguistic, and cultural diversity as possible, the dataset includes many deeply divergent human populations that are not well-represented in other datasets.
SGDP is available on the CGC as a read-only public project that contains Open Access whole genome sequencing data for 279 samples. Note that SGDP data is available for use in your analyses but is not currently accessible via the Data Browser.
Seven Bridges is committed to providing users with up-to-date versions of the datasets that are available from the NCI Genomic Data Commons (GDC). Therefore, we have a clearly formulated set of rules that apply to updates of GDC datasets that are available through the Platform:
- We aim to align the datasets available through the Platform with the current GDC data release within 30 days of that release.
- If a GDC data release includes redaction of files from a dataset, the affected files will be available on the Platform for an additional 30 days. After that, you will need to contact the GDC for information on how to retain access to redacted files.
- Re-running queries executed in the past may return slightly different results due to updates in the datasets from the GDC. This is expected: datasets are dynamic, version updates can introduce file updates or redactions, and queries always return the most up-to-date version of files. This applies both to queries made through the Data Browser and to queries made through the Datasets API.
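Because a re-run query can return a different set of files, it can help to save the file identifiers from each run and compare them later. The following is a minimal sketch in plain Python that assumes you have already exported those identifiers yourself; it does not call the Data Browser or the Datasets API. The function names and the file IDs are hypothetical. The rules above do not state which date the additional 30 days for redacted files is counted from, so the second helper takes that start date as a parameter instead of guessing.

```python
from datetime import date, timedelta

GRACE_DAYS = 30  # the 30-day figure from the redaction rule above


def diff_query_results(saved_ids, fresh_ids):
    """Compare file IDs saved from an earlier query with those from a fresh run."""
    saved, fresh = set(saved_ids), set(fresh_ids)
    return {
        "added": sorted(fresh - saved),      # present only in the fresh run
        "missing": sorted(saved - fresh),    # no longer returned (updated or redacted)
        "unchanged": sorted(saved & fresh),  # returned by both runs
    }


def grace_period_end(start: date) -> date:
    """Return the date GRACE_DAYS after `start`; the caller chooses the start date."""
    return start + timedelta(days=GRACE_DAYS)


# Hypothetical file IDs, for illustration only.
saved = ["file-a", "file-b", "file-c"]
fresh = ["file-b", "file-c", "file-d"]

changes = diff_query_results(saved, fresh)
print(changes["added"])                      # ['file-d']
print(changes["missing"])                    # ['file-a']
print(grace_period_end(date(2024, 1, 15)))   # 2024-02-14
```

Files listed under "missing" are the ones to check against the GDC release notes, since they may have been replaced by a newer version or redacted.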