deCODE’s Clinical Sequence Analyzer
Neuromics workshop - Heidelberg
Sigurjón Axel Guðjónsson
Bioinformatics group
February 2014

How does it work
• deCODE issues sample aliases for each PI to use for their samples
• DNA samples are labeled with the deCODE aliases and shipped to Iceland
• Samples are prepared and sequenced (2-4 weeks)
• Sequencing data is processed through the secondary analysis pipeline
  – Create fastq files
  – BWA aligner -> bam files
  – Variant calling, using GATK (UnifiedGenotyper)
  – Coverage calculation
    ● Full coverage - all bases in the genome
    ● Gene/Transcript coverage
  – Variant annotation
    ● VEP
    ● Frequencies
    ● Other
• A deCODE employee creates a project within the Clinical Sequence Analyzer
• A deCODE employee creates user accounts for the PI and the PI's researchers from a list of email addresses

SM and CSA important components
A few important components shared by the web client (CSA) and the more powerful Java client (SM). Both use the following:
● dcref – reference data (ref)
● damp – source data (anno, bam, cov, var)
● gor_aliases (button in “Data Query”)
● deCODE's Genome Browser (bam, ucsc, gor)
● “Data Query” (gorpipe, drill in, transpose, perspectives)
● “Report Builder” (mendel, recessive, LOH regions, varstat, denovofilter)
● Local comments knowledgebase

Damp pipeline

dcref
Key features
• dcref is a reference data repository in gor format; its purpose is to help identify clinically important variants in exome and genome sequence data.
• dcref uses the same release of the Ensembl database as dcvep, a wrapper around VEP (Variant Effect Predictor, from Ensembl), currently release 73.
• Central gor files include clinical_variants.gorz and clinical_genes.gorz, which integrate several clinical variant databases, currently OMIM, ClinVar and HGMD.
• Uses rp, a utility to submit and manage jobs on the cluster, for PBS job dependency submission. dcref (currently DCREF_1-1-0) runs in about 10 hours on the cluster once all reference data is downloaded.

dcref future additions and fixes
● More expression datasets, and an integrated tissue expression map
● CNV public tracks
● Improve all.info (column info) used as metadata
● Add weblink document
● Better statistics
● Sanity checks
● Detailed map from biosystems
● Standardize better column names
● Coding exon maps for ensgenes and entrezgenes from the Ensembl database

dcvep
Key features
• dcvep is a set of bash scripts that runs the “Variant Effect Predictor” (VEP) on gor-varjoin/vcf format files (reformatting them to the Ensembl native format).
• VEP only works for hg19 variants; Ensembl moves to hg20 in 6 months.
• dcvep uses cache files from the Ensembl 73 database. It also adds information not available in VEP (transcript biotype, protein_domain) and keeps track of the id (i.e. when annotating hg18 variants).
• Uses rp for PBS job dependency submission. Annotates ~40 million variants in 800 jobs (refgene and ensgene) in ~1 hour on a free PBS cluster (~40 vars/sec for each job).

Gor aliases - gorpipe
Central access to important annotations
• Text file with short aliases for important gor files in the project filesystem
• Access from the SM “Query Tool”.
gor #egenes# | where gene_symbol = 'MYC'

GOR – varjoin
Just one example of a handy function
• Variations can be presented differently between two data sources
• Important to have the ability to join with other genomic information without worrying about this
• Normalized view
  – Present the variation with the least information possible
  – If more than one position is possible, always choose the lowest position

GOR - varmerge

Genome browser
Integrates deCODE and public annotations
ucsc_tracks: /katla/groups/bioinfo/users/sigurjon/clinseq/oa_hand/trackfile ;
session1: assembly-hg18,server-genome.ucsc.edu,hgsid-349215861;
session2: assembly-hg19,server-genome-test.cse.ucsc.edu,hgsid-5068388,offset-227395

Clinical Sequence Analyzer - Phenotips
Study setup
Phenotypes using Phenotips
Suggest Diagnosis

Clinical Sequence Analyzer - Phenotips
Study setup
Candidate Gene list

SM - Report Builder - Mendel

SM - Report Builder - denovo filter
Analysis De Novo
Advanced filter
Father and Mother exact coverage from BAM files

Sequence Miner - Query Tool - denovo filter
Add hard filters on variant observed in reads
Reduces output from 31 to 8

Sequence Miner - denovo filter - BAM files
GJB2 gene with convincing de novo variant and
evidence suggesting it to be the causative mutation

Erosive dermatosis
De novo mutation
– Heterozygote in offspring
– Absent in both parents, and absent/very rare in all SNP/indel datasets
9 potential variants:
– 8 have no disease link or are low quality due to a difficult region
– GJB2 p.29_29del
  ● Depth 36 (including 21 reads with the alternative allele)
  ● Non-frameshift deletion (3bp), high quality genotype
  ● Highly conserved position
  ● Amino acid deletion in the middle of the first transmembrane helix
GJB2, alias connexin 26, has previously been reported with
● Dominant mutations: missense reported in skin disorders
● Also recessive: loss of function reported in deafness
Best hypothesis for the causative mutation

Comments in local knowledgebase
• Annotate variants both in CSA and SM
• Can view them in CSA as known variants or as annotations
• Can view them in SM as a genome browser track or as variant drill-in reports
• SM report builder provides a special app
  – Comments on variant or gene level
  – With carrier phenotypes (requires the subject report Phenotypes.rep to be open)

LOHZ regions – with recessive variants

Recessive analysis with multiple cases
Report all variants in the cases, not just those observed in an index case
Filter on a column in the Participants.rep
Create list with cases and controls
Fill out dialog

Recessive analysis
• Separate frequency filter for alleles and compound alleles (CHZ)
• CHZ found in controls are eliminated (unless CtrlDelta > 0)
• Each CHZ is ranked according to 1/score, where the score is calculated by the 2x2 ACMG Kingsmore scheme
• Each case adds and each control subtracts
• Sum over all possible CHZ in a gene

--------------------------------------------------------
| Allele1/Allele2 | Cat1 | Cat2 | Cat3a | Cat3b | Cat4 |
--------------------------------------------------------
| Cat1            | 1    | 1    | 2     | 3     | 5    |
--------------------------------------------------------
| Cat2            | 1    | 1    | 2     | 3     | 5    |
--------------------------------------------------------
| Cat3a           | 2    | 2    | 3     | 3     | 6    |
--------------------------------------------------------
| Cat3b           | 3    | 3    | 3     | 4     | 6    |
--------------------------------------------------------
| Cat4            | 5    | 5    | 6     | 6     | 7    |
--------------------------------------------------------

Data Query tool (gorpipe)
Tissue specific expression and regulation - examples

Example 1: Here we use gorpipe to query dcref (ENCODE) and deCODE association results from Freeze 120521 (kidney cancer, assocsummary). We want to locate regulatory regions that have an impact in kidney cancer. Footprint data still has to be added to dcref (see http://massgenomics.org/2012/09/encode-regulatory-variation-in-the-human-genome.html ) to narrow down the real regulatory regions; this will narrow the regulatory regions down to binding sites of transcription factors (tfacs).

pgor #topCCvar# -s Phenotype -f Kidney_Cancer_ICR_CC_type_12102011
  | join -r -snpseg #encodednasecell#
  | split lis_signalvalue,lis_cell
  | rank 1 lis_signalvalue -o desc
  | select 1,2,name,nr_of_cells,lis_cell,lis_signalvalue,rank_lis_Signalvalue,description,pval
  | where (Contains(Description,'renal'))
  | where lis_signalValue > 200 and rank_lis_signalvalue < 10 and pval<1e-4

Data Query tool (gorpipe)
Tissue specific expression and regulation - examples

Example 2: Let's look for important regulatory regions in Myocardial_infarction.
1. Get the top associations
2. Pick up the NHGRI GWAS catalog
3. Filter on pval < 1e-7
4. Add protein atlas expression data
5. Restrict genes to those expressed in heart_muscle and no more than 150 kb away
6. Hide columns not needed
7. Add ENCODE DNase open chromatin data
8. Sum signals in the ENCODE DNase data

gor TOP_GWAS_CC(Myocardial_Infarction_All_13102009)
  | join -snpsnp -l #gwas#
  | where pval < 1e-7
  | join -l -snpseg <(#protatlas_normal# | where Tissue in ('heart_muscle') and (Level in ('Moderate','Strong'))) -f 150000
  | hide name,phenotype,stable_id,description,biotype,gene_start,gene_end,neglogp,disrank,pubmedid
  | where Tissue in ('heart_muscle','')
  | join -snpseg -r <(#encodednasecell# | where lis_signalvalue>50 and Tissue in ('blood vessel','heart') | hide description,sex,documents,vendor_id,term_id,label,nr_of_cells,cells,lis_cell,tier,tissue) -l
  | group 1 -gc 3- -sum -fc lis_signalvalue -count
  | where sum_lis_signalvalue>0

Query track in Genome Browser
Use a track to view variation report
Define variation track with cases and controls

Neuromics – variant statistics

Neuromics – using the variant statistics

Neuromics – variant statistics
25% reduction in false positives
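
The ranking rules on the Recessive analysis slide can be sketched in Python. This is a minimal sketch, and parts of it are assumptions that are not in the slides: each carrier of a CHZ is assumed to contribute 1/score, added for a case and subtracted for a control, and the names SCORE, chz_weight and gene_score are hypothetical. The CtrlDelta elimination rule is left out because the slides do not define it. Only the score table itself is taken from the slide.

```python
# Score table copied from the Recessive analysis slide (Allele1 x Allele2).
CATEGORIES = ["Cat1", "Cat2", "Cat3a", "Cat3b", "Cat4"]
SCORE_ROWS = [
    [1, 1, 2, 3, 5],
    [1, 1, 2, 3, 5],
    [2, 2, 3, 3, 6],
    [3, 3, 3, 4, 6],
    [5, 5, 6, 6, 7],
]
SCORE = {
    (a, b): SCORE_ROWS[i][j]
    for i, a in enumerate(CATEGORIES)
    for j, b in enumerate(CATEGORIES)
}


def chz_weight(cat1, cat2):
    """Rank weight of one compound pair: 1/score, so a low score ranks high."""
    return 1.0 / SCORE[(cat1, cat2)]


def gene_score(chz_list):
    """Sum over all CHZ in a gene.

    chz_list holds tuples (cat1, cat2, n_cases, n_controls).
    Assumption: every case carrier adds the weight and every control
    carrier subtracts it; the slides only say that cases add and
    controls subtract.
    """
    total = 0.0
    for cat1, cat2, n_cases, n_controls in chz_list:
        total += chz_weight(cat1, cat2) * (n_cases - n_controls)
    return total


if __name__ == "__main__":
    # Two cases carry a Cat1/Cat2 pair, one case carries a Cat3b/Cat3b pair.
    print(gene_score([("Cat1", "Cat2", 2, 0), ("Cat3b", "Cat3b", 1, 0)]))  # 2.25
```

Keeping the table as data makes it easy to compare against the slide and to swap in a different scoring scheme.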
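
The normalized view on the GOR – varjoin slide (least information possible, lowest position) corresponds to the common trim-and-left-align procedure for variants. The slides do not show how GOR implements it, so the Python below is a sketch of that general procedure and not GOR's code. The function name normalize_variant and the toy genome string are hypothetical, and positions are assumed to be 1-based.

```python
def normalize_variant(pos, ref, alt, genome):
    """Trim and left-align a variant against a reference sequence.

    pos is the 1-based position of the first base of ref in genome.
    Returns (pos, ref, alt) with shared bases removed and the variant
    moved to the lowest equivalent position.
    """
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        raise ValueError("ref and alt are identical, nothing to normalize")

    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base: it carries no information.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base to the left.
        # This is the step that shifts indels towards the lowest position.
        if (not ref or not alt) and pos > 1:
            pos -= 1
            base = genome[pos - 1].upper()
            ref, alt = base + ref, base + alt
            changed = True

    # Drop shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


if __name__ == "__main__":
    toy_genome = "GGGCACACACAGGG"  # hypothetical reference with a CA repeat
    # A CA deletion reported in the middle of the repeat moves to its left edge.
    print(normalize_variant(8, "CAC", "C", toy_genome))  # (3, 'GCA', 'G')
```

With both data sources normalized this way, the same deletion reported at different positions inside a repeat ends up with identical coordinates and alleles, which is what makes the join possible.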