The Seven Bridges Knowledge Center

About datasets

Overview

The Platform hosts large genomics datasets along with the tools to query, filter, and browse them. The resulting data can be added to your projects and analyzed with your private data to address your research questions. Below, learn more about datasets on the Platform and access resources describing each dataset's data and metadata.

Note that these datasets are not available if you are using AWS EU or GCP as your cloud provider.

Dataset access depends on your cloud infrastructure provider (not available for AWS EU or GCP)

To browse and access datasets, you need to know which cloud infrastructure provider you run the Seven Bridges Platform on: Amazon Web Services in the US (AWS US East); Amazon Web Services in Frankfurt, Germany (AWS EU); or the Google Cloud Platform (GCP).

If you didn't choose a cloud provider when you signed up for the Platform, you are using AWS US East. If you signed up in early 2016 or later, you had the option to select between AWS US East and GCP.

If you signed up in early 2017 or later, you had the additional option to select AWS EU as your cloud provider.
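The rules above can be sketched as a small Python helper. This is only an illustration, not part of the Platform or its API: the function names are hypothetical, and the year-level cutoffs are a simplifying assumption, since the text says only "early 2016" and "early 2017".

```python
from typing import List, Optional

# Provider labels as used on this page.
AWS_US = "AWS US East"
AWS_EU = "AWS EU"
GCP = "GCP"


def provider_options(signup_year: int) -> List[str]:
    """Providers a user could have chosen at signup.

    Assumption: cutoffs are applied at year granularity, which only
    approximates "early 2016" and "early 2017".
    """
    options = [AWS_US]
    if signup_year >= 2016:
        options.append(GCP)
    if signup_year >= 2017:
        options.append(AWS_EU)
    return options


def datasets_available(provider: Optional[str]) -> bool:
    """Whether the hosted datasets are available for a given provider.

    A user who never chose a provider (None) is on AWS US East.
    The datasets are not available on AWS EU or GCP.
    """
    effective = provider or AWS_US
    return effective == AWS_US
```

For example, `datasets_available(None)` returns `True` because an account with no explicit choice runs on AWS US East, while `datasets_available("GCP")` returns `False`.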

The Cancer Genome Atlas (TCGA)

TCGA is one of the world’s largest cancer genomics data collections, covering more than eleven thousand patients across 33 cancer types and comprising over half a million files. Data collection for TCGA began in 2006 as a joint effort by the National Cancer Institute (NCI), the National Human Genome Research Institute (NHGRI), the National Institutes of Health (NIH), and the U.S. Department of Health and Human Services. The Platform provides powerful methods to query and reproducibly analyze TCGA data by itself or in conjunction with your own data.

TCGA data is made available on the Seven Bridges Platform through an integration with the Seven Bridges Cancer Genomics Cloud (CGC). TCGA on the Platform includes both Open and Controlled Data. While all data in TCGA is stripped of direct identifiers, DNA information is inherently unique to an individual. Two data access tiers (Open Data and Controlled Data) have been put in place to balance the desire to make TCGA data as widely available as possible with the need to protect the rights of study participants. You can access TCGA Open Data on the Platform after you are authenticated and agree to the data use policies. In addition, you can obtain access to Controlled Data through the NIH via the Database of Genotypes and Phenotypes (dbGaP) site.

TCGA Resources

Cancer Cell Line Encyclopedia (CCLE)

The Cancer Cell Line Encyclopedia (CCLE) is a project performing detailed genetic and pharmacologic characterization of a large number of human cancer cell lines. Cell lines are permanently established cell cultures derived from patients that will proliferate indefinitely given appropriate fresh medium and space. The CCLE is the result of a collaboration between the Broad Institute, the Novartis Institutes for Biomedical Research, and the Genomics Institute of the Novartis Research Foundation.

CCLE contains Open Access sequencing data (in the form of reads aligned to the hg19 reference genome) for nearly 1,000 cancer cell line samples. The Platform hosts the CCLE dataset as a read-only public project containing the cell line samples that were available from CGHub on May 11, 2016. You have automatic access to all CCLE data on the Platform.

CCLE Resources

Simons Genome Diversity Project (SGDP) dataset

The Simons Genome Diversity Project (SGDP) dataset contains complete genome sequences from more than one hundred diverse human populations. It is the largest dataset of diverse, high-quality human genome sequences ever reported. To represent as much anthropological, linguistic, and cultural diversity as possible, the dataset includes many deeply divergent human populations that are not well represented in other datasets.

SGDP is available on the CGC as a read-only public project containing Open Access whole genome sequencing data for 279 samples. You have automatic access to all SGDP data on the CGC. Note that SGDP data is available for use in your analyses but is not currently accessible via the Data Browser.

SGDP Resources

Get started

  1. Refine your results with a query issued on the visual interface or programmatically.
  2. Access data for further analysis in your Platform project.