Variant calling

Sample & Assay Technologies
Welcome!
Next generation sequencing for cancer research
Ravi Vijaya Satya, Ph.D.
Contact Technical Support
North America:
• [email protected] or [email protected]
Phone: 1-800-362-7737
International customers:
[email protected]
Webinar related questions: [email protected]
Four part webinar series on NGS
Webinar 1: Next-generation sequencing, an introduction to technology and applications
Speaker: Quan Peng, Scientist, Genomics Assay Development – QIAGEN
Webinar 2: Next-generation sequencing for cancer research
Speaker: Raed Samara, Global Product Manager for NGS – QIAGEN
Webinar 3: Next-generation sequencing data analysis for genetic profiling
Speaker: Ravi Vijaya Satya, Senior Scientist, Bioinformatics – QIAGEN
Webinar 4: Advancing Biological Analysis and Interpretation in Cancer Studies
Speaker: Nathan Pearson, Principal Genome Scientist – Ingenuity, a QIAGEN company
Welcome to the four-part webinar series
Next Generation Sequencing and its role in cancer biology
Webinar 1: Next-generation sequencing, an introduction to technology and applications
Date: February 4, 2014
Speaker: Quan Peng, Scientist, Genomics Assay Development – QIAGEN
Webinar 2: Next-generation sequencing for cancer research
Date: February 11, 2014
Speaker: Raed Samara, Global Product Manager for NGS – QIAGEN
Webinar 3: Next-generation sequencing data analysis for genetic profiling
Date: February 18, 2014
Speaker: Ravi Vijaya Satya, Senior Scientist, Bioinformatics – QIAGEN
Webinar 4: Advancing Biological Analysis and Interpretation in Cancer Studies
Date: February 25, 2014
Speaker: Nathan Pearson, Principal Genome Scientist – Ingenuity, a QIAGEN company
Legal Disclaimer
QIAGEN products shown here are intended for molecular biology applications.
These products are not intended for the diagnosis, prevention, or treatment of
a disease.
For up-to-date licensing information and product-specific disclaimers, see the
respective QIAGEN kit handbook or user manual. QIAGEN kit handbooks and
user manuals are available at www.QIAGEN.com or can be requested from
QIAGEN Technical Services or your local distributor.
Agenda
NGS Data Analysis
Read Mapping
Variant Calling
Variant Annotation
Targeted Enrichment
GeneRead Gene Panels
GeneRead Data Analysis Portal
Workflow
Interface
Data Interpretation
Read Mapping
Reads mapped to a reference genome
Millions of reads from a single run
Alignment
Mapping quality
Programs for read mapping
Hash-based: MAQ, ELAND, SOAP, Novoalign
Suffix array/Burrows-Wheeler transform-based: BWA, Bowtie, Bowtie2, SOAP2
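To give a feel for the hash-based idea, here is a minimal sketch that indexes every k-mer of a reference and looks up the first k bases of a read as a seed. It is a toy illustration only, not how MAQ, ELAND, SOAP or Novoalign are implemented; the function names `build_kmer_index` and `seed_positions` and the seed length are assumptions made for this example.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the 0-based positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def seed_positions(read, index, k):
    """Candidate mapping positions: wherever the read's first k bases occur exactly."""
    return index.get(read[:k], [])

reference = "ACAGTTAAGCCTGAACTAGACTAGGATCGTCCTAGATAGTCTCGATAGCTCGATATC"
K = 8  # seed length chosen only for this toy example
index = build_kmer_index(reference, K)
candidates = seed_positions("GATCGTCCTAGATAGTCTCGATAGCTCGAT", index, K)
print(candidates)
```

A real mapper would then extend each candidate into a full alignment, tolerate mismatches in the seed, and report a mapping quality for the best hit.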
Variant Calling
Determine if there is enough statistical support to call a variant
Reference sequence
ACAGTTAAGCCTGAACTAGACTAGGATCGTCCTAGATAGTCTCGATAGCTCGATATC
Aligned reads
             AACTAGACTAGGATCGTCCTAGATAGTCTCG
             AACTAGACTAGGATCGTCCTACATAGTCTCG
             AACTAGACTAGGATCGTCCTACATAGTCTCG
                        GATCGTCCTAGATAGTCTCGATAGCTCGAT
                        GATCGTCCTAGATAGTCTCGATAGCTCGAT
                        GATCGTCCTAGATAGTCTCGATAGCTCGAT
Multiple factors are considered in calling variants
No. of reads with the variant
Mapping qualities of the reads
Base qualities at the variant position
Strand bias (variant is seen in only one of the strands)
Variant Calling Software
GATK Unified Genotyper, Torrent Variant Caller, SAMtools, MuTect, …
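The read pile-up on this slide can be tallied with a short sketch. It only counts bases per position and reports mismatches against the reference; it does not weigh mapping quality, base quality or strand bias as the callers listed above do. The read start offsets (13 and 24, 0-based) are assumptions inferred by lining each read up against the reference, and the function names are hypothetical.

```python
from collections import Counter

reference = "ACAGTTAAGCCTGAACTAGACTAGGATCGTCCTAGATAGTCTCGATAGCTCGATATC"

# (0-based start on the reference, read sequence). The starts are assumed here;
# in practice they come from the read mapper.
aligned_reads = [
    (13, "AACTAGACTAGGATCGTCCTAGATAGTCTCG"),
    (13, "AACTAGACTAGGATCGTCCTACATAGTCTCG"),
    (13, "AACTAGACTAGGATCGTCCTACATAGTCTCG"),
    (24, "GATCGTCCTAGATAGTCTCGATAGCTCGAT"),
    (24, "GATCGTCCTAGATAGTCTCGATAGCTCGAT"),
    (24, "GATCGTCCTAGATAGTCTCGATAGCTCGAT"),
]

def pileup(reads, ref_length):
    """Count the bases observed at every reference position."""
    counts = [Counter() for _ in range(ref_length)]
    for start, seq in reads:
        for offset, base in enumerate(seq):
            counts[start + offset][base] += 1
    return counts

def candidate_variants(reference, counts):
    """List (1-based position, ref base, alt base, alt read count, depth) for mismatches."""
    found = []
    for pos, column in enumerate(counts):
        depth = sum(column.values())
        for base, n in column.items():
            if base != reference[pos]:
                found.append((pos + 1, reference[pos], base, n, depth))
    return found

variants = candidate_variants(reference, pileup(aligned_reads, len(reference)))
for pos, ref, alt, n, depth in variants:
    print(f"pos {pos}: {ref}>{alt} in {n}/{depth} reads")
```

Whether a candidate supported by 2 of 6 reads is a real variant or a sequencing error is exactly the statistical question the factors listed above are meant to answer.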
Variant Representation
VCF – Variant Call Format
http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
Header lines
##fileformat=VCFv4.1
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=OND,Number=1,Type=Float,Description="Overall non-diploid ratio (alleles/(alleles+nonalleles))">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##contig=<ID=chrM,length=16571,assembly=hg19>
##contig=<ID=chr1,length=249250621,assembly=hg19>
##contig=<ID=chr2,length=243199373,assembly=hg19>
##contig=<ID=chr3,length=198022430,assembly=hg19>
##contig=<ID=chr4,length=191154276,assembly=hg19>
Column labels
#CHROM  POS       ID          REF  ALT  QUAL   FILTER  INFO              FORMAT    Sample
Variant calls
chr1    11181327  rs11121691  C    T    100.0  PASS    DP=1000;MQ=87.67  GT:AD:DP  0/1:146,45:191
chr1    11190646  rs2275527   G    A    100.0  PASS    DP=1000;MQ=67.38  GT:AD:DP  0/1:462,121:583
chr1    11205058  rs1057079   C    T    100.0  PASS    DP=1000;MQ=79.57  GT:AD:DP  0/1:49,143:192
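As a rough sketch of how one of these variant calls can be read programmatically, the function below splits a data line into its columns and derives the alternate-allele fraction from the AD (allelic depth) field. The function name `parse_vcf_record` is hypothetical, and the line is split on any whitespace because the slide text carries spaces; real VCF files are tab-delimited, and a dedicated VCF library is the safer choice for production use.

```python
def parse_vcf_record(line, sample_column=9):
    """Parse one VCF data line with a single sample into a small dictionary."""
    fields = line.split()
    chrom, pos, variant_id, ref, alt, qual, filt, info, fmt = fields[:9]
    # INFO is a semicolon-separated list of key=value pairs (flags without '=' are skipped).
    info_dict = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    # FORMAT names the colon-separated values in the sample column.
    sample = dict(zip(fmt.split(":"), fields[sample_column].split(":")))
    ref_depth, alt_depth = (int(x) for x in sample["AD"].split(","))
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,
        "alt": alt,
        "filter": filt,
        "info": info_dict,
        "genotype": sample["GT"],
        "alt_fraction": alt_depth / (ref_depth + alt_depth),
    }

line = "chr1 11181327 rs11121691 C T 100.0 PASS DP=1000;MQ=87.67 GT:AD:DP 0/1:146,45:191"
record = parse_vcf_record(line)
print(record["id"], record["genotype"], round(record["alt_fraction"], 3))
```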
Variant Annotation
Chrom  Pos        ID           Ref  Alt  Gene   Mutation type  Codon change  AA change  Allele frequency  snpEff effect          Filtered coverage  Variant frequency
chr1   11181327   rs11121691   C    T    MTOR   SNP            c.6909C>T     p.L2303    C=0.761 T=0.239   SYNONYMOUS_CODING      1,924              0.239
chr1   11190646   rs2275527    G    A    MTOR   SNP            c.5553G>A     p.S1851    G=0.791 A=0.208   SYNONYMOUS_CODING      5,842              0.208
chr1   11205058   rs1057079    C    T    MTOR   SNP            c.4731C>T     p.A1577    C=0.254 T=0.746   SYNONYMOUS_CODING      1,928              0.746
chr1   11288758   rs1064261    G    A    MTOR   SNP            c.2997G>A     p.N999     G=0.212 A=0.788   SYNONYMOUS_CODING      5,186              0.788
chr1   11300344   rs191073707  C    T    MTOR   SNP            -             -          C=0.924 T=0.076   INTRON                 210                0.076
chr1   11301714   rs1135172    A    G    MTOR   SNP            c.1437A>G     p.D479     A=0.248 G=0.752   SYNONYMOUS_CODING      3,965              0.752
chr1   11322628   rs2295080    G    T    MTOR   SNP            -             -          G=0.239 T=0.755   UPSTREAM               339                0.755
chr1   186641626  rs2853805    G    A    PTGS2  SNP            -             -          G=0.0 A=1.0       UTR_3_PRIME            97                 1
chr1   186642429  rs2206593    A    G    PTGS2  SNP            -             -          A=0.167 G=0.833   UTR_3_PRIME            3,552              0.833
chr1   186643058  rs5275       A    G    PTGS2  SNP            -             -          A=0.759 G=0.241   UTR_3_PRIME            237                0.241
chr1   186645927  rs2066826    C    T    PTGS2  SNP            -             -          C=0.88 T=0.12     INTRON                 209                0.12
chr2   29415792   rs1728828    G    A    ALK    SNP            -             -          G=0.0 A=1.0       UTR_3_PRIME            2,520              1
chr2   29416366   rs1881421    G    C    ALK    SNP            c.4587G>C     p.D1529E   G=0.907 C=0.093   NON_SYNONYMOUS_CODING  4,361              0.093
chr2   29416481   rs1881420    T    C    ALK    SNP            c.4472T>C     p.K1491R   T=0.954 C=0.045   NON_SYNONYMOUS_CODING  3,061              0.045
chr2   29416572   rs1670283    T    C    ALK    SNP            c.4381T>C     p.I1461V   T=0.0 C=0.999     NON_SYNONYMOUS_CODING  5,834              0.999
chr2   29419591   rs1670284    G    T    ALK    SNP            -             -          G=0.093 T=0.907   INTRON                 739                0.907
chr2   29445458   rs3795850    G    T    ALK    SNP            c.3375G>T     p.G1125    G=0.917 T=0.082   SYNONYMOUS_CODING      1,776              0.082
chr2   29446184   rs2276550    C    G    ALK    SNP            -             -          C=0.895 G=0.105   INTRON                 475                0.105
(- = no value shown on the slide)
ID: dbSNP/COSMIC ID
Codon change, AA change: actual change and position within the codon or amino acid sequence
snpEff effect: effect of the variant on protein coding
SIFT score
Predicts the deleterious effect of an amino acid change based on how conserved the sequence is among related species
PolyPhen score
Predicts the impact of the variant on protein structure
GeneRead DNAseq Gene Panel: Targeted Sequencing
What is targeted sequencing?
Sequencing a subset of regions of the whole genome
Why do we need targeted sequencing?
Not all regions in the genome are of interest or relevant to a specific study
Exome Sequencing: sequencing most of the exonic regions of the genome (exome).
Protein-coding regions constitute less than 2% of the entire genome
Focused panel/hot spot sequencing: focused on the genes or regions of interest
What are the advantages of focused panel sequencing?
More coverage per sample, more sensitive mutation detection
More samples per run, lower cost per sample
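The trade-off between target size, coverage and samples per run comes down to simple arithmetic. The sketch below assumes that sequenced bases are spread evenly over the target; the function name `mean_coverage` and all the numbers in the example (read count, read length, panel size) are hypothetical and are not taken from the slides.

```python
def mean_coverage(total_reads, read_length, target_size_bp,
                  on_target_fraction=1.0, samples=1):
    """Average depth per sample if sequenced bases were spread evenly over the target."""
    sequenced_bases = total_reads * read_length * on_target_fraction
    return sequenced_bases / (target_size_bp * samples)

# Hypothetical run: 1 million reads of 150 bp.
run = dict(total_reads=1_000_000, read_length=150)

whole_genome = mean_coverage(target_size_bp=3_100_000_000, **run)   # roughly a human genome
focused_panel = mean_coverage(target_size_bp=100_000, **run)        # hypothetical 100 kb panel
ten_samples = mean_coverage(target_size_bp=100_000, samples=10, **run)

print(f"whole genome:  {whole_genome:.3f}x")
print(f"focused panel: {focused_panel:.0f}x")
print(f"focused panel, 10 samples per run: {ten_samples:.0f}x per sample")
```

The same run that barely touches a whole genome gives deep coverage on a small panel, and that depth can be traded for more samples per run.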
Target Enrichment - Methodology
Multiplex PCR
Small DNA input (< 100 ng)
Short processing time (several hours)
Relatively small throughput (kb to Mb region)
Workflow: Sample preparation (DNA isolation) → PCR target enrichment (2 hours) → Library construction → Sequencing → Data analysis
Variants Identifiable through Multiplex PCR
SNPs – single nucleotide polymorphisms
Indels < 20 bp in length
Variants callable with the help of a reference
– Copy number variants (CNVs)
Variants not callable
Structural variants
– Large indels
– Inversions
GeneRead DNAseq Gene Panel
Multiplex PCR-based target enrichment for DNA sequencing
Covers all human exons (coding region + UTR)
Gene primer sets are divided into 4 tubes, with up to 1,200-plex in each tube
GeneRead DNAseq Gene Panel
Focus on your Disease of Interest
Comprehensive Cancer Panel (124 genes)
Disease Focused Gene Panels (20 genes)
Genes Involved in Disease
Breast cancer
Colon cancer
Gastric cancer
Leukemia
Liver cancer
Lung cancer
Ovarian cancer
Prostate cancer
Genes with High Relevance
GeneRead DNAseq Custom Panel
Data Analysis for Targeted Sequencing
GeneRead data analysis workflow
Read Mapping → Primer Trimming → Variant Calling → Variant Annotation
Read mapping
Identify the possible position of the read within the reference
Align the read sequence to reference sequences
Primer trimming
Remove primer sequences from the reads
Variant calling
Identify differences between the reference and reads
Variant filtering and annotation
Functional information about the variant
Reads from Targeted Sequencing
Typical NGS raw read from targeted sequencing
5’- Adapter | Barcode | Primer | Insert sequence | Primer | Adapter -3’
Removal of adapters and de-multiplexing
5’- Primer | Insert sequence | Primer -3’
Read length can vary: only part of the insert or the 3’ primer may be present
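The adapter removal and de-multiplexing step can be sketched as follows. This is only an illustration of the read layout shown above: the adapter sequence, the barcodes, the sample names and the function name `demultiplex` are all hypothetical, only exact matches are accepted, and a possible 3’ adapter is ignored.

```python
ADAPTER = "ACGTACGT"                                   # hypothetical 5' adapter
BARCODES = {"AAGG": "sample_1", "CCTT": "sample_2"}    # hypothetical barcodes
BARCODE_LENGTH = 4

def demultiplex(raw_reads):
    """Strip the 5' adapter and barcode, and group the remaining reads by sample."""
    by_sample = {}
    for read in raw_reads:
        if not read.startswith(ADAPTER):
            continue  # adapter not found: skip the read in this simple sketch
        rest = read[len(ADAPTER):]
        sample = BARCODES.get(rest[:BARCODE_LENGTH])
        if sample is None:
            continue  # unknown barcode
        # What is left starts with the amplicon primer, followed by the insert.
        by_sample.setdefault(sample, []).append(rest[BARCODE_LENGTH:])
    return by_sample

raw_reads = [
    "ACGTACGT" + "AAGG" + "TTTTGGGG",
    "ACGTACGT" + "CCTT" + "ACACACAC",
    "ACGTACGT" + "GGGG" + "ACACACAC",   # barcode not in the table
    "TTTTTTTT" + "AAGG" + "ACACACAC",   # no adapter
]
groups = demultiplex(raw_reads)
print(groups)
```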
Read Mapping
Align reads to the reference genome
Figure: reads from Amplicon 1 and Amplicon 2 aligned to the reference sequence.
Primer Trimming
Primer sequences must be trimmed for accurate variant calling
Figure: reads from Amplicon 1 and Amplicon 2 aligned to the reference sequence; four reads carry a `C` at the same position.
Frequency of `C` without primer trimming = 4/13 = 31%
Frequency of `C` after primer trimming = 4/7 = 57%
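The two frequencies on this slide can be reproduced with a small calculation. The sketch assumes that 13 reads cover the position, that 4 of them carry `C`, and that 6 of the 9 reference-carrying reads cover the position with primer-derived bases (6 = 13 - 7, inferred from the slide's totals). Primer bases reflect the synthetic primer rather than the sample, so they dilute the variant until they are trimmed. The function name is hypothetical.

```python
# Each observation: (allele seen at the position, True if that base lies inside a primer).
observations = (
    [("C", False)] * 4      # variant allele, read from insert sequence
    + [("ref", False)] * 3  # reference allele, read from insert sequence
    + [("ref", True)] * 6   # reference allele, but contributed by a primer
)

def allele_frequency(observations, allele, trim_primers):
    """Fraction of observations showing the allele, optionally ignoring primer bases."""
    kept = [a for a, in_primer in observations if not (trim_primers and in_primer)]
    return kept.count(allele) / len(kept)

before = allele_frequency(observations, "C", trim_primers=False)
after = allele_frequency(observations, "C", trim_primers=True)
print(f"without primer trimming: {before:.0%}")
print(f"after primer trimming:   {after:.0%}")
```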
GeneRead Variant Calling Overview
Flowchart (layout lost in extraction); components by stage:
Inputs: raw reads from the sequencer (de-multiplexed), GeneRead panel ID, analysis mode, sequencing platform (MiSeq/HiSeq or IonTorrent)
Read mapping and post-processing (BAM files between steps): BowTie2/BWA (MiSeq/HiSeq) or TMAP (IonTorrent), GATK Indel Realigner, GATK Base Quality Score Recalibrator, Primer Trimming, BAQ Computation
Variant calling and filtering (VCF files between steps): GATK Unified Genotyper, GATK Variant Annotator and GATK Variant Filtration, or Torrent Variant Caller (TVC); additional filtering (based on frequency and coverage)
Annotation: snpEff (basic annotation); SnpSift (links to dbSNP, Cosmic and computation of Sift scores, etc.); databases: dbSNP, Cosmic, ClinVar, dbNSFP
Outputs: variants in Excel format; variants in Ingenuity® Variant Analysis™ (separate interface, free preview available)
Indel Realignment
DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May; 43(5):491-8. PMID: 21478889
Eliminates some false-positive variant calls around indels
Read aligners cannot eliminate these alignment errors since they align reads independently
Multiple sequence alignment can identify these errors and correct them
Base Quality Recalibration
DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May; 43(5):491-8. PMID: 21478889
Eliminates sequencer-specific biases
Lane-specific/sample-specific biases
Instrument-specific under-reporting/over-reporting of quality scores
Systematic errors based on read position
Di-nucleotide-specific sequencing errors
Recalibration leads to improved variant calls
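The core idea behind recalibration can be sketched as comparing the quality the instrument reported with the error rate actually observed at known non-variant positions. The sketch below is a simplified illustration, not GATK's implementation: the grouping by reported quality and read position, the add-one smoothing and all counts are assumptions made for this example.

```python
import math

def phred(error_probability):
    """Phred scale: Q = -10 * log10(P(error))."""
    return -10 * math.log10(error_probability)

def empirical_quality(mismatches, observations):
    """Observed quality of a bin, with add-one smoothing so empty bins stay finite."""
    return phred((mismatches + 1) / (observations + 2))

# Hypothetical bins: (reported quality, position in read) -> (mismatches, observations)
bins = {
    (30, "read start"): (9, 9_998),
    (30, "read end"): (99, 9_998),
}

for (reported_q, position), (mismatches, observations) in bins.items():
    observed_q = empirical_quality(mismatches, observations)
    print(f"reported Q{reported_q} at {position}: observed Q{observed_q:.1f}")
```

In this made-up example the bases near the read end were reported as Q30 but behave like Q20, so a recalibrated file would carry the lower score for that bin.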
Variant Filtration
Variant Frequency
Somatic mode
– SNPs with frequency < 4% and indels with frequency < 20%
Germline mode
– SNPs with frequency < 20% and indels with frequency < 25%
Strand Bias
SNPs with FS ≤ 60
Indels with FS ≤ 200
Mapping Quality
SNPs with MQ ≤ 40.0
Haplotype Score
SNPs with HaplotypeScore ≤ 13.0
Not applicable for pooled samples
Strand Bias: variants that are present in reads from only one of the two strands
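The thresholds on this slide can be applied with a short sketch. The slide does not state the direction of each comparison, so the code makes these assumptions: calls below the frequency thresholds are removed, and the remaining cut-offs follow the commonly cited GATK hard-filter directions (fail when FS is above its limit, when MQ is below 40, or when HaplotypeScore is above 13). The filter labels, the dictionary keys and the function name are hypothetical.

```python
# Thresholds from the slide; comparison directions are assumptions (see lead-in).
MIN_FREQUENCY = {
    "somatic": {"SNP": 0.04, "indel": 0.20},
    "germline": {"SNP": 0.20, "indel": 0.25},
}
MAX_FS = {"SNP": 60.0, "indel": 200.0}
MIN_MQ_SNP = 40.0
MAX_HAPLOTYPE_SCORE_SNP = 13.0

def failed_filters(variant, mode="somatic", pooled=False):
    """Return the names of the filters a variant fails (empty list = passes)."""
    kind = variant["type"]
    failed = []
    if variant["frequency"] < MIN_FREQUENCY[mode][kind]:
        failed.append("LowFrequency")
    if variant["FS"] > MAX_FS[kind]:
        failed.append("StrandBias")
    if kind == "SNP":
        if variant["MQ"] < MIN_MQ_SNP:
            failed.append("LowMappingQuality")
        # HaplotypeScore is not applicable for pooled samples.
        if not pooled and variant["HaplotypeScore"] > MAX_HAPLOTYPE_SCORE_SNP:
            failed.append("HaplotypeScore")
    return failed

# Hypothetical SNP with a 7.6% variant frequency.
snp = {"type": "SNP", "frequency": 0.076, "FS": 3.2, "MQ": 55.0, "HaplotypeScore": 1.5}
print("somatic mode: ", failed_filters(snp, mode="somatic"))
print("germline mode:", failed_filters(snp, mode="germline"))
```

The same call passes in somatic mode and is dropped in germline mode, which is the practical effect of the two frequency thresholds.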
Haplotype Score
Helps remove false positives due to errors in sequencing or alignment
Specificity Analysis
Specificity: the percentage of sequences that map to the intended target regions of interest (ROI)
Specificity = number of on-target reads / total number of reads
Figure: NGS reads aligned to a reference sequence with ROI 1 and ROI 2, labeled as on-target or off-target reads.
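The specificity ratio can be computed from read and ROI coordinates with a sketch like the one below. It assumes that a read counts as on-target when it overlaps any ROI by at least one base; the slides do not define the overlap rule, and the coordinates and function names are hypothetical.

```python
def overlaps(read, roi):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return read[0] == roi[0] and read[1] < roi[2] and roi[1] < read[2]

def specificity(reads, rois):
    """Number of on-target reads divided by the total number of reads."""
    if not reads:
        return 0.0
    on_target = sum(1 for read in reads if any(overlaps(read, roi) for roi in rois))
    return on_target / len(reads)

# Hypothetical coordinates.
rois = [("chr1", 100, 200), ("chr1", 500, 600)]
reads = [
    ("chr1", 90, 190),    # on-target (ROI 1)
    ("chr1", 150, 250),   # on-target (ROI 1)
    ("chr1", 300, 400),   # off-target
    ("chr1", 510, 590),   # on-target (ROI 2)
    ("chr1", 550, 650),   # on-target (ROI 2)
]
result = specificity(reads, rois)
print(f"specificity = {result:.0%}")
```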
NGS Data Analysis: Coverage Depth and Uniformity
Coverage depth (or depth of coverage): how many times each base has been sequenced
Coverage uniformity: evenness of the coverage depth along the target region
Figure: NGS reads aligned to a reference sequence, with positions annotated coverage depth = 10, coverage depth = 3, and coverage depth = 2.
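Both quantities can be computed from read coordinates with a short sketch. Depth is counted per base. Uniformity is expressed here as the fraction of target bases whose depth reaches a chosen fraction of the mean depth; that particular metric and its 0.2 default are assumptions for this example, since the slides do not say how uniformity is scored. Coordinates and function names are hypothetical.

```python
def depth_per_base(reads, region_start, region_end):
    """Count how many reads cover each base of a half-open target region."""
    depth = [0] * (region_end - region_start)
    for start, end in reads:
        for pos in range(max(start, region_start), min(end, region_end)):
            depth[pos - region_start] += 1
    return depth

def uniformity(depth, fraction_of_mean=0.2):
    """Fraction of bases covered at or above fraction_of_mean times the mean depth."""
    mean_depth = sum(depth) / len(depth)
    threshold = fraction_of_mean * mean_depth
    return sum(1 for d in depth if d >= threshold) / len(depth)

# Hypothetical 10-base target region and three reads as (start, end) half-open intervals.
reads = [(0, 10), (0, 5), (2, 5)]
depth = depth_per_base(reads, 0, 10)
print("depth per base:", depth)
print("mean depth:", sum(depth) / len(depth))
print("uniformity at 0.2x mean:", uniformity(depth))
```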
GeneRead Data Analysis Web Portal
Free, complete and easy-to-use data analysis with web-based software
• .bam file (Ion Torrent)
• .fastq or .fastq.gz (MiSeq/HiSeq)
GeneRead Data Analysis Web Portal
Job Submission
Retrieving Results
Results
Summary
Run Summary
Specificity
Coverage
Uniformity
Numbers of SNPs and Indels
Summary By Gene
Specificity
Coverage
Uniformity
# of SNPs and Indels
Features of Variant Report
SNP detection
Indel detection
Biological Interpretation Using IVA
VCF to Ingenuity Variant Analysis
Built-in optimized support for uploading called variant files in VCF format
Create an analysis by answering a few questions about the study
Screenshot labels: Study Type, Repeat workflows
Features of Ingenuity Variant Analysis
Annotate samples with Ingenuity Knowledge Base
Compare what’s genetically distinctive
Identify and share the most promising variants
Figure labels: Interactive filter cascade, Share / Publish, Ingenuity Annotation
QIAGEN’s GeneRead DNAseq Gene Panel System
FOCUS ON YOUR RELEVANT GENES
Focused: Biologically relevant content selection enables deep sequencing on relevant genes and identification of rare mutations
Flexible: Mix and match any gene of interest
NGS platform independent: Functionally validated for PGM, MiSeq/HiSeq
Integrated controls: Enabling quality control of prepared library before sequencing
Free, complete and easy to use data analysis tool
Thank you for attending
Are you ready to try?
Contact Technical Support
North America:
• [email protected] or [email protected]
Phone: 1-800-362-7737
International customers:
[email protected]
Webinar related questions: [email protected]