About Graph WGS

Description

The GRAF Germline Variant Detection Workflow enables accurate alignment and variant calling by utilizing a genome graph reference that can address the bias and other limitations inherent in linear genome references.

Seven Bridges has constructed a comprehensive pan-genome graph that incorporates the genetic diversity of populations around the world. By using this pan-genome graph, the GRAF Germline Variant Detection Workflow makes graph technology applicable at the whole-genome level, enabling fast and highly accurate read alignment and variant calling.

The current version of the workflow is 1.0. It supports both the GRCh37 and GRCh38 versions of the pan-genome graph.

Methods and Algorithms

Graph reference

The graph reference augments the linear representation of the human genome (GRCh37 or GRCh38) with additional information on the genetic diversity of various human populations. The pan-genome graphs provided by Seven Bridges contain single nucleotide polymorphisms, insertions, deletions and other structural variations observed with significant frequency in a large number of populations.

Alignment

The GRAF Aligner is a fast and accurate short-read aligner that aligns sequencing reads to the genome graph reference. It is designed to process single-end and paired-end reads from NGS technologies.

Variant calling

The GRAF Variant Caller is designed to work in tandem with the GRAF Aligner. It detects both small variants and structural variants, drawing on the available data on population-level genome variability.

Filtering

The following hard filtering criteria are applied to the variants detected by the GRAF Variant Caller. Variants that meet these criteria are marked as false positives (FP in the FILTER column) but are not removed from the VCF file.

  • SNPs: AD_Ratio[1] < 0.20, MBQ[1] < 15, QD < 1, MQRankSum < -8, FS > 50
  • Indels: AD_Ratio[1] < 0.15
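The thresholds above can be sketched in code to make the comparison directions explicit. This is only an illustration, not the workflow's implementation. It assumes that `[1]` refers to the second element (index 1) of a per-allele array, that matching any single criterion is enough to flag a variant, and that a missing annotation never triggers a filter. The function and dictionary names are hypothetical.

```python
from typing import Any, Callable, Dict, List, Mapping

# Thresholds are copied from the list above; each check returns True
# when the variant matches a hard-filter criterion.
SNP_CHECKS: Dict[str, Callable[[Mapping[str, Any]], bool]] = {
    "AD_Ratio[1] < 0.20": lambda a: a["AD_Ratio"][1] < 0.20,
    "MBQ[1] < 15": lambda a: a["MBQ"][1] < 15,
    "QD < 1": lambda a: a["QD"] < 1,
    "MQRankSum < -8": lambda a: a["MQRankSum"] < -8,
    "FS > 50": lambda a: a["FS"] > 50,
}
INDEL_CHECKS: Dict[str, Callable[[Mapping[str, Any]], bool]] = {
    "AD_Ratio[1] < 0.15": lambda a: a["AD_Ratio"][1] < 0.15,
}


def matched_criteria(annotations: Mapping[str, Any], is_snp: bool) -> List[str]:
    """Return the names of the hard-filter criteria that this variant matches."""
    checks = SNP_CHECKS if is_snp else INDEL_CHECKS
    matched = []
    for name, check in checks.items():
        try:
            if check(annotations):
                matched.append(name)
        except (KeyError, IndexError, TypeError):
            # Assumption: an absent or malformed annotation cannot trigger a filter.
            continue
    return matched


def is_flagged(annotations: Mapping[str, Any], is_snp: bool) -> bool:
    # Assumption: one matched criterion is sufficient to mark the record as FP.
    return bool(matched_criteria(annotations, is_snp))
```

Returning the list of matched criteria, rather than only a boolean, makes it easier to see why a given record was flagged.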

Inputs

Reads

NGS sequencing reads. Several formats are supported:

  • A single FASTQ or FASTQ.GZ file with single-end (unpaired) reads.
  • A pair of FASTQ or FASTQ.GZ files with paired-end reads. Before task execution, the Paired-end metadata field must be set to 1 on one file and 2 on the other to identify the two files of the pair.
  • A single BAM or CRAM file with either single-end or paired-end reads. Whether the reads are paired is determined from the 0x1 bit of the FLAG field (see the SAM specification). When the input is a CRAM file encoded relative to a reference, the indexed reference file should be provided on the CRAM reference input.
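As an illustration of the 0x1 check mentioned for BAM and CRAM inputs, the sketch below tests that bit on a SAM-formatted text record. In the SAM specification, FLAG is the second mandatory column and bit 0x1 indicates that the template has multiple segments. The function name is hypothetical, and how the workflow itself inspects the input is not described here.

```python
PAIRED_FLAG = 0x1  # SAM FLAG bit: template has multiple segments in sequencing


def is_paired_record(sam_line: str) -> bool:
    """Return True if a tab-separated SAM alignment line has FLAG bit 0x1 set."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # FLAG is the second mandatory SAM column
    return bool(flag & PAIRED_FLAG)
```

For example, a record with FLAG 99 (0x1 + 0x2 + 0x20 + 0x40) is reported as paired, while a record with FLAG 0 or 16 is not.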

Linear reference

A FASTA file representing the linear reference used as the basis for genome graph construction. This file must be indexed, with its .fai index located alongside the FASTA file. Valid reference files (GRCh38.GRAF.Linear_Reference.v1.fa and GRCh37.GRAF.Linear_Reference.v1.fa) and their indices are available in the public reference files.

Graph reference

A VCF.GZ file containing the variants used to construct the genome graph reference. This file must be indexed, with its .tbi index located alongside the VCF.GZ file. The variants in the file must be represented relative to the FASTA file passed to the Linear reference input. Valid pan-genome graphs (GRCh38.GRAF.Pan_Genome_Reference.v1.vcf.gz and GRCh37.GRAF.Pan_Genome_Reference.v1.vcf.gz) and their indices are available in the public reference files. The current pipeline does not accept custom graphs.
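Before launching a task, it can be useful to confirm that both index files sit next to their data files. The sketch below assumes the usual naming convention in which the index name is the data file name with .fai or .tbi appended (as produced by samtools faidx and tabix). The function name is hypothetical and the check is local to a file system, not tied to any platform API.

```python
from pathlib import Path
from typing import List


def missing_index_files(fasta: str, graph_vcf_gz: str) -> List[str]:
    """Return the expected index paths that do not exist next to the inputs."""
    expected = [
        Path(fasta + ".fai"),         # FASTA index for the linear reference
        Path(graph_vcf_gz + ".tbi"),  # tabix index for the graph VCF.GZ
    ]
    return [str(path) for path in expected if not path.is_file()]
```

An empty list means both indices were found.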

Intervals

The target regions for variant calling, in BED format. This BED file is also used to parallelize variant calling in a multithreaded environment. Suggested files for the whole genome (GRCh38.GRAF.Genome_Intervals.v1.bed and GRCh37.GRAF.Genome_Intervals.v1.bed) are available in the public reference files.
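To illustrate how a BED file can drive parallel work, the sketch below parses BED intervals (0-based, half-open coordinates) and distributes them across a fixed number of workers with a simple greedy balancing rule. How the workflow actually splits the intervals is not described here, so the partitioning strategy and all function names are assumptions made for illustration only.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end); BED is 0-based, half-open


def parse_bed(lines: Iterable[str]) -> List[Interval]:
    """Read the first three BED columns, skipping blank, comment and header lines."""
    intervals = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        intervals.append((chrom, int(start), int(end)))
    return intervals


def partition(intervals: List[Interval], n_workers: int) -> List[List[Interval]]:
    """Greedy balancing: longest interval first, always to the least-loaded worker."""
    bins: List[List[Interval]] = [[] for _ in range(n_workers)]
    loads = [0] * n_workers
    for interval in sorted(intervals, key=lambda iv: iv[2] - iv[1], reverse=True):
        target = loads.index(min(loads))
        bins[target].append(interval)
        loads[target] += interval[2] - interval[1]
    return bins
```

Balancing by total interval length, rather than by interval count, keeps one worker from receiving all of the large regions.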

Outputs

Alignments

Read alignments produced by the GRAF Aligner. The alignment file is coordinate-sorted and indexed, with the .bai or .crai index attached to it as a secondary file.

Variants

A VCF file containing the final list of variants detected by the GRAF Variant Caller. Variants flagged by the hard filters described under Filtering are marked with FP in the FILTER column.
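Because flagged variants remain in the file, downstream steps may want to separate them. The sketch below splits VCF data lines by whether FP appears in the FILTER column, which is the seventh tab-separated column in the VCF format and may hold several semicolon-separated values. The function name is hypothetical, and plain-text parsing is used here only to keep the example dependency-free.

```python
from typing import Iterable, List, Tuple


def split_by_fp(vcf_lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (unflagged, flagged) VCF data lines based on FP in the FILTER column."""
    unflagged: List[str] = []
    flagged: List[str] = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        filters = line.rstrip("\n").split("\t")[6].split(";")
        (flagged if "FP" in filters else unflagged).append(line)
    return unflagged, flagged
```

For a compressed file, the lines can be read with `gzip.open(path, "rt")` and passed to the function unchanged.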