Heidelberg 2014 training deCODE troubleshooting

deCODE’s Clinical Sequence Analyzer
Neuromics workshop - Heidelberg
Sigurjón Axel Guðjónsson
Bioinformatics group
February 2014
How does it work
• deCODE issues sample aliases for each PI to use for their samples
• DNA samples are labeled with the deCODE aliases and shipped to Iceland
• Samples are prepared and sequenced (2-4 weeks)
• Sequencing data is processed through the secondary analysis pipeline
  – Create fastq files
  – BWA aligner -> bam files
  – Variant calling, using GATK (UnifiedGenotyper)
  – Coverage calculation
    ● Full coverage - all bases in the genome
    ● Gene/transcript coverage
  – Variant annotation
    ● VEP
    ● Frequencies
    ● Other
• deCODE employees create a project within the Clinical Sequence Analyzer
• A deCODE employee creates user accounts for the PI and the PI's researchers from a list of email addresses
SM and CSA important components
A few important components of the web access (CSA) and the more powerful Java client (SM, Sequence Miner).
Both use the following:
● dcref – reference data (ref)
● damp – source data (anno, bam, cov, var)
● gor_aliases (button in “Data Query”)
● deCODE's Genome Browser (bam, ucsc, gor)
● “Data Query” (gorpipe, drill in, transpose, perspectives)
● “Report Builder” (mendel, recessive, LOH regions, varstat, denovofilter)
● Local comments knowledgebase
Damp pipeline
dcref Key features
• dcref is a reference data repository in gor format; its purpose is to help identify clinically important variants in exome and genome sequence data.
• dcref uses the same Ensembl database release as dcvep, a wrapper around VEP (Variant Effect Predictor, from Ensembl); currently release 73.
• Central gor files include clinical_variants.gorz and clinical_genes.gorz. These two integrate several clinical variant databases, currently OMIM, ClinVar and HGMD.
• Uses rp, a utility to submit and manage jobs on the cluster, for PBS job dependency submission (dcref currently runs in about 10 hours on the cluster, after all reference data is downloaded)
(currently DCREF_1-1-0)
dcref future additions and fixes
● More expression datasets, and an integrated tissue expression map
● CNV public tracks
● Improve all.info (column info) used as metadata
● Add weblink document
● Better statistics
● Sanity checks
● Detailed map from biosystems
● Better standardized column names
● Coding exon maps for ensgenes and entrezgenes from the Ensembl database
dcvep Key features
• dcvep is a set of bash scripts that run the “Variant Effect Predictor” (VEP) on gor-varjoin/vcf format files (reformatted to Ensembl native format)
• VEP only works for hg19 variants; Ensembl moves to hg20 in 6 months.
• dcvep uses cache files from the Ensembl 73 database. It also adds information not available in VEP (transcript biotype, protein_domain) and keeps track of the id (e.g. when annotating hg18 variants)
• Uses rp for PBS job dependency submission. Annotates ~40 million variants in 800 jobs (refgene and ensgene) in ~1 hour on a free PBS cluster (~40 vars/sec for each job).
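A rough cross-check of the throughput figures on this slide, as a minimal sketch. The function names are hypothetical; only the three numbers (40 million variants, 800 jobs, ~40 vars/sec per job) come from the slide. Pure annotation time per job works out to about 21 minutes, so the quoted ~1 hour presumably also covers queueing, job startup, or separate refgene and ensgene passes; that explanation is an assumption, not something the slide states.

```python
def variants_per_job(total_variants: int, jobs: int) -> float:
    """Average number of variants each job has to annotate."""
    return total_variants / jobs


def minutes_per_job(total_variants: int, jobs: int, vars_per_sec: float) -> float:
    """Pure annotation time per job in minutes, ignoring any scheduling overhead."""
    return variants_per_job(total_variants, jobs) / vars_per_sec / 60


def implied_rate(total_variants: int, jobs: int, wall_seconds: float) -> float:
    """Per-job rate (vars/sec) implied if the whole wall time were spent annotating."""
    return variants_per_job(total_variants, jobs) / wall_seconds


if __name__ == "__main__":
    print(variants_per_job(40_000_000, 800))        # variants per job
    print(minutes_per_job(40_000_000, 800, 40))     # minutes at ~40 vars/sec
    print(implied_rate(40_000_000, 800, 3600))      # rate implied by a 1 hour wall time
```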
Gor aliases - gorpipe
Central access to important annotations
• Text file with short aliases for important gor files in the project filesystem
• Access from the SM “Query Tool”
• Example query: gor #egenes# | where gene_symbol = 'MYC'
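To illustrate the idea of the alias file, here is a minimal sketch of how a `#name#` alias could be expanded to a file path before a query runs. The tab-separated file layout, the function names, and the example path are all assumptions made for illustration; the real gor_aliases format and the expansion done inside SM are not described on the slide.

```python
import re

# Matches aliases written as #name# inside a query string.
ALIAS_PATTERN = re.compile(r"#(\w+)#")


def load_aliases(text: str) -> dict:
    """Parse 'alias<TAB>path' lines (assumed layout) into a dictionary."""
    aliases = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, path = line.split("\t", 1)
        aliases[name] = path
    return aliases


def expand_aliases(query: str, aliases: dict) -> str:
    """Replace every #name# in the query with the path it stands for."""
    def replace(match):
        name = match.group(1)
        if name not in aliases:
            raise KeyError(f"unknown gor alias: #{name}#")
        return aliases[name]
    return ALIAS_PATTERN.sub(replace, query)


if __name__ == "__main__":
    # Hypothetical alias file content; the path is invented for this example.
    aliases = load_aliases("egenes\tref/ensgenes.gorz\n")
    print(expand_aliases("gor #egenes# | where gene_symbol = 'MYC'", aliases))
```

Keeping the aliases in one text file means queries stay short and keep working when the underlying files move; only the alias file has to change.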
GOR – varjoin
Just one example of a handy function
• Variations can be represented differently in two data sources
• It is important to be able to join with other genomic information without worrying about this
• Normalized view
  – Present the variation with the least information possible
  – If more than one position is possible, always choose the one with the lowest position
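The two normalization rules (least information, lowest position) can be sketched with the common VCF-style left-alignment procedure. This is a swapped-in, generic technique and not necessarily what varjoin does internally; in particular it keeps one anchor base for indels, which GOR may represent differently. The function name, the 1-based coordinates, and the toy reference sequence are assumptions.

```python
def normalize_variant(pos: int, ref: str, alt: str, reference: str):
    """Trim shared bases and shift an indel to its lowest possible position.

    pos is 1-based; reference is the full reference sequence as a string.
    """
    if ref == alt:
        raise ValueError("ref and alt are identical, not a variant")
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base (least information).
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # An empty allele needs an anchor base; take it from the left (lowest position).
        if not ref or not alt:
            if pos <= 1:
                raise ValueError("cannot extend left of the sequence start")
            pos -= 1
            base = reference[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Drop shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


if __name__ == "__main__":
    toy_reference = "GGGCACACACAGGG"  # invented sequence with a CA repeat
    # The same CA deletion written two different ways by two data sources:
    print(normalize_variant(8, "CAC", "C", toy_reference))
    print(normalize_variant(10, "CA", "", toy_reference))
```

Both calls describe the same deletion inside the repeat, and after normalization they carry the same position and alleles, which is what makes a join between the two sources possible.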
GOR - varmerge
Genome browser
Integrates Decode and public annotations
ucsc_tracks: /katla/groups/bioinfo/users/sigurjon/clinseq/oa_hand/trackfile ;
session1: assembly-hg18,server-genome.ucsc.edu,hgsid-349215861;
session2: assembly-hg19,server-genome-test.cse.ucsc.edu,hgsid-5068388,offset-227395
Clinical Sequence Analyzer - Phenotips
Study setup
• Phenotypes, using Phenotips
• Suggest Diagnosis
• Candidate Gene list
SM - Report Builder - Mendel
SM - Report Builder - denovo filter
Analysis
De Novo
Advanced filter
• Father and Mother exact coverage from BAM files
Sequence Miner - Query Tool - denovo filter
• Add hard filters on variant observed in reads
• Reduces output from 31 to 8
Sequence Miner - denovo filter - BAM files
• GJB2 gene with a convincing de novo variant and evidence suggesting it to be the causative mutation
Erosive dermatosis
De novo mutation
– Heterozygote in offspring
– Absent in both parents, and absent/very rare in all SNP/indel datasets
9 potential variants: 8 have no disease link or are low quality due to a difficult region
GJB2 p.29_29del
● Depth 36 (including 21 reads with the alternative allele)
● Non-frameshift deletion (3 bp), high-quality genotype
● Highly conserved position
● Amino acid deletion in the middle of the first transmembrane helix
GJB2, alias connexin 26, was previously reported with
● Dominant mutations: missense reported in skin disorders
● Also recessive: loss of function reported in deafness
Best hypothesis for the causative mutation
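The de novo criteria used in this case (heterozygous in the offspring, absent in both parents with real coverage from the BAM files, supported by reads, and absent or very rare in reference datasets) can be written down as a small filter. This is a minimal sketch, not the Report Builder implementation: the record layout, the function name, and every threshold are assumptions chosen for illustration. Only the child depth of 36 with 21 alternative reads comes from the slide; the parental numbers in the example are invented.

```python
from dataclasses import dataclass


@dataclass
class TrioCall:
    """Hypothetical per-variant record for one child and both parents."""
    child_genotype: str          # e.g. "0/1"
    child_depth: int
    child_alt_reads: int
    father_depth: int
    father_alt_reads: int
    mother_depth: int
    mother_alt_reads: int
    population_frequency: float  # 0.0 if absent from SNP/indel datasets


def is_de_novo_candidate(call: TrioCall,
                         min_parent_depth: int = 10,        # assumed threshold
                         min_child_alt_reads: int = 5,      # assumed threshold
                         min_child_alt_fraction: float = 0.25,  # assumed threshold
                         max_population_frequency: float = 0.001) -> bool:
    """Return True when the call passes all illustrative de novo checks."""
    heterozygous = call.child_genotype in ("0/1", "1/0", "0|1", "1|0")
    # Parents must be covered well enough that "absent" is meaningful.
    parents_covered = (call.father_depth >= min_parent_depth
                       and call.mother_depth >= min_parent_depth)
    parents_clean = call.father_alt_reads == 0 and call.mother_alt_reads == 0
    # Hard filters on the variant as observed in the child's reads.
    child_supported = (call.child_depth > 0
                       and call.child_alt_reads >= min_child_alt_reads
                       and call.child_alt_reads / call.child_depth >= min_child_alt_fraction)
    rare = call.population_frequency <= max_population_frequency
    return heterozygous and parents_covered and parents_clean and child_supported and rare


if __name__ == "__main__":
    # Child numbers from the slide; parental depths are invented.
    gjb2 = TrioCall("0/1", 36, 21, 30, 0, 28, 0, 0.0)
    print(is_de_novo_candidate(gjb2))
```

Checking parental depth explicitly matters because a parent with no reads at the site looks the same as a parent without the variant unless coverage is taken from the BAM files.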
Comments in local knowledgebase
• Annotate variants in both CSA and SM
• Can view them in CSA as known variants or as annotations
• Can view them in SM as a genome browser track or as variant drill-in reports
• The SM report builder provides a special app
  – Comments on variant or gene level
  – With carrier phenotypes (requires the subject report Phenotypes.rep to be open)
LOHZ regions – with recessive variants
Recessive analysis with multiple cases
• Report all variants in the cases, not just those observed in an index case
• Filter on a column in Participants.rep
• Create a list with cases and controls
• Fill out the dialog
Recessive analysis
• Separate frequency filters for alleles and compound alleles (CHZ)
• CHZ found in controls are eliminated (unless CtrlDelta > 0)
• Each CHZ is ranked by 1/score, where the score is calculated with the 2x2 ACMG Kingsmore scheme
• Each case adds and each control subtracts
• Sum over all possible CHZ in a gene
--------------------------------------------------------
| Allele1/Allele2 | Cat1 | Cat2 | Cat3a | Cat3b | Cat4 |
--------------------------------------------------------
| Cat1            | 1    | 1    | 2     | 3     | 5    |
| Cat2            | 1    | 1    | 2     | 3     | 5    |
| Cat3a           | 2    | 2    | 3     | 3     | 6    |
| Cat3b           | 3    | 3    | 3     | 4     | 6    |
| Cat4            | 5    | 5    | 6     | 6     | 7    |
--------------------------------------------------------
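One way to read the ranking rules above is sketched below: look up the pair score in the 2x2 table, weight each CHZ by 1/score, let each case add and each control subtract, and sum over all CHZ in the gene. The slide does not give the exact formula, so the combination `(cases - controls) / score` is an assumption, as are the function names and input layout. The CtrlDelta rule for eliminating CHZ seen in controls is not modeled here.

```python
CATEGORIES = ["Cat1", "Cat2", "Cat3a", "Cat3b", "Cat4"]

# Pair scores copied from the table (rows: Allele1, columns: Allele2).
SCORE_TABLE = [
    [1, 1, 2, 3, 5],
    [1, 1, 2, 3, 5],
    [2, 2, 3, 3, 6],
    [3, 3, 3, 4, 6],
    [5, 5, 6, 6, 7],
]


def chz_score(category1: str, category2: str) -> int:
    """Look up the table score for a pair of allele categories."""
    return SCORE_TABLE[CATEGORIES.index(category1)][CATEGORIES.index(category2)]


def gene_score(chz_list) -> float:
    """Sum (cases - controls) / score over all CHZ in one gene.

    chz_list holds tuples (category1, category2, n_cases, n_controls).
    The weighting is an assumed reading of the slide, not a confirmed formula.
    """
    return sum((n_cases - n_controls) / chz_score(cat1, cat2)
               for cat1, cat2, n_cases, n_controls in chz_list)


if __name__ == "__main__":
    # Invented example: two CHZ in the same gene.
    example = [("Cat1", "Cat2", 2, 0), ("Cat3a", "Cat4", 1, 1)]
    print(gene_score(example))
```

Under this reading a low table score (both alleles in strong categories) gives a CHZ more weight, and a CHZ seen equally often in cases and controls contributes nothing.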
Data Query tool (gorpipe)
tissue specific expression and regulation - examples
Example 1:
Here we use gorpipe to query dcref (ENCODE) and deCODE association results from Freeze 120521 (kidney cancer, assocsummary).
We want to locate regulatory regions that have an impact in kidney cancer. Footprint data still has to be added to dcref
(see http://massgenomics.org/2012/09/encode-regulatory-variation-in-the-human-genome.html )
to narrow down the real regulatory regions to the binding sites of transcription factors.
pgor #topCCvar# -s Phenotype -f Kidney_Cancer_ICR_CC_type_12102011 | join -r -snpseg #encodednasecell# |split lis_signalvalue,lis_cell | rank 1 lis_signalvalue o desc | select 1,2,name,nr_of_cells,lis_cell,lis_signalvalue,rank_lis_Signalvalue,description,pval | where (Contains(Description,'renal')) | where lis_signalValue > 200
and rank_lis_signalvalue < 10 and pval<1e-4
Data Query tool (gorpipe)
tissue specific expression and regulation - examples
Example 2:
Let's look for important regulatory regions in Myocardial_infarction.
1. Get top associations
2. Pick up the NHGRI GWAS catalog
3. Filter on pval < 1e-7
4. Add protein atlas expression data
5. Restrict genes to those expressed in heart_muscle and no more than 150 kb away
6. Hide columns not needed
7. Add ENCODE DNase open chromatin data
8. Sum signals in the ENCODE DNase data
gor TOP_GWAS_CC(Myocardial_Infarction_All_13102009) | join -snpsnp -l #gwas# | where pval < 1e-7 | join -l -snpseg <(#protatlas_normal# | where Tissue in ('heart_muscle') and (Level in
('Moderate','Strong'))) -f 150000 | hide name,phenotype,stable_id,description,biotype,gene_start,gene_end,neglogp,disrank,pubmedid | where Tissue in ('heart_muscle','')
| join -snpseg -r <(#encodednasecell# | where lis_signalvalue>50 and Tissue in ('blood vessel','heart') | hide
description,sex,documents,vendor_id,term_id,label,nr_of_cells,cells,lis_cell,tier,tissue) -l | group 1 -gc 3- -sum -fc lis_signalvalue -count | where sum_lis_signalvalue>0
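Step 5 of this example, joining SNPs to gene segments that lie within 150 kb, can be pictured with a small stand-alone sketch. This is a naive illustration of a distance-window join, not a description of how gorpipe implements `join -snpseg -f 150000`; the coordinate convention (1-based, inclusive segments), the function name, and the example coordinates are assumptions.

```python
def snps_near_segments(snps, segments, window=150_000):
    """Pair each SNP with every segment on the same chromosome within `window` bases.

    snps: list of (chrom, pos)
    segments: list of (chrom, start, end, name), start and end inclusive
    Returns a list of (chrom, pos, name, distance); distance is 0 inside a segment.
    """
    matches = []
    for chrom, pos in snps:
        for seg_chrom, start, end, name in segments:
            if seg_chrom != chrom:
                continue
            if pos < start:
                distance = start - pos
            elif pos > end:
                distance = pos - end
            else:
                distance = 0
            if distance <= window:
                matches.append((chrom, pos, name, distance))
    return matches


if __name__ == "__main__":
    # Invented coordinates and gene names, for illustration only.
    genes = [("chr1", 1000, 2000, "GENE_A"),
             ("chr1", 500000, 510000, "GENE_B"),
             ("chr2", 1000, 2000, "GENE_C")]
    snps = [("chr1", 1500), ("chr1", 160000), ("chr1", 351000)]
    for row in snps_near_segments(snps, genes):
        print(row)
```

The nested loop is quadratic and only suitable for small inputs; a genome-ordered stream such as gor can do the same join in a single pass over sorted data.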
Query track in Genome Browser
Use a track to view variation report
Define variation track with cases and controls
Neuromics – variant statistics
Neuromics – using the variant statistics
Neuromics – variant statistics
25% reduction in false positives