Provisioning robust automated analytical pipelines for whole genome-based public health microbiological typing
Anthony Underwood
Bioinformatics Unit, Infectious Disease Informatics, Microbiological Services, Public Health England

Public Health England
• PHE is an executive agency, sponsored by the Department of Health, UK.
• We protect and improve the nation's health and wellbeing, and reduce health inequalities
• Microbiology Services
  • we provide specialist investigation and control of communicable disease outbreaks, chemical incidents, radiation and other environmental hazards
  • we provide the evidence-based science and clinical practice in specialist microbiology in support of the wider public health system and NHS hospitals

PHE Reference Microbiology
Reference Microbiology carries out a broad spectrum of work relating to prevention of infectious disease. The remit of the centre at Colindale includes:
• Infectious disease surveillance
• Providing specialist and reference microbiology and microbial epidemiology
• Research & Development
• Coordinating the investigation and cause of national and uncommon outbreaks
• Helping advise government on the risks posed by various infections
• Responding to international health alerts

PHE Specialist Microbiology Services
• The PHE Specialist Microbiology Services consists of 8 specialist clinical laboratories operating across England. These laboratories provide a comprehensive range of clinical diagnostic and public health microbiology tests and services to the NHS and allied healthcare providers sector.
• SMS also includes a further five dedicated food, water and environmental (FW&E) testing laboratories who undertake statutory testing for the NHS, local authorities, and other key stakeholders.

Bioinformatics Unit, Infectious Disease Informatics
• Formed in 2013
  • 3 staff
  • 2 Linux servers
  • Amongst first public health institutes to see the potential of bioinformatics and fund it
• Now
  • 15 staff
  • 512 cores (UGE)
  • 300 TB usable HPC storage

MS Public Health Functions
Questions we often ask of a pathogen isolate:
1. What is it?
   Identify the infectious agent or exclude particular infections and associated risks.
2. What characteristics does it possess?
   Antibiotic resistant? Presence of toxins?
3. How does it relate to other isolates?
   • Do cases of an infection have a common source or are they linked?
   • What is the source and what are the risk factors? e.g. food, school, travel to certain countries
   • What is the best way of:
     • treating the affected?
     • protecting others?
     • limiting further spread?

Pathogen typing
Giving bugs a label
• If we discover isolates with the same type we can
  • Include/exclude individual cases to an outbreak (e.g. MRSA in hospitals)
  • Establish an association between an outbreak of food poisoning and a specific food vehicle (e.g. egg mayo sandwich)
  • Trace the source of contaminants within a manufacturing process (e.g. chocolate factory, baby feed)
• The type also helps
  • Determine changes in microbial populations in response to interventions (e.g.
    vaccination strategies, vaccine escape)
  • Study variations and trends in the pathogenicity, virulence and antibiotic resistance within a species (e.g. new ABr acquisition)

20th Century Microbiology

Bacterial Identification
• Culture

Bacterial Identification
• Gram Stain and API Strips

Phenotypic Characterisation of Microbes
• Serotyping
• Sensitivity Testing
• Phage Typing

Gel-based Typing methods
Ciprofloxacin-resistant Salmonella Kentucky in Travellers
http://wwwnc.cdc.gov/eid/article/12/10/06-0589-f1.htm

MLST: multi-locus sequence typing

Locus  Putative function of gene         Size of sequenced  No. of alleles  No. (%) of polymorphic  % G+C  dn/ds  Position in GBS
                                         fragment (bp)      identified      nucleotide sites                      genome (bp)
adhP   Alcohol dehydrogenase (gbs0054)   498                11              12 (2.4)                43.1   0.13   72286
pheS   Phenylalanyl tRNA synthetase      501                5               7 (1.4)                 37.1   0.17   912817
atr    Amino acid transporter (gbs0538)  501                8               12 (2.4)                36.9   0.14   560085
glnA   Glutamine synthetase              498                6               6 (1.2)                 35.7   0.12   1868862
sdhA   Serine dehydratase (gbs2105)      519                6               13 (2.5)                41.4   0.12   2179923
glcK   Glucose kinase (gbs0518)          459                4               7 (1.5)                 42.6   0.13   538770
tkt    Transketolase (gbs0268)           480                5               8 (1.7)                 38.9   0.42   287111

Sørensen U B S et al. mBio 2010; doi: 10.1128/mBio.00178-10

Virus Identification

Typing of Viruses

Our vision for 21st Century Microbiology

Whole genome sequencing

Why whole genome sequencing?
• Cost – e.g. replacement of Salmonella serotyping
• Speed – e.g. replacement of TB drug resistance testing
• Added value – multiple outputs from one test
• Extra resolution – increased discriminatory power over traditional techniques

Salmonella population structure
Minimal spanning tree of MLST data for S.
enterica subspecies enterica
• Each circle corresponds to a sequence type (ST)
• eBGs are natural clusters of genetically related isolates
• MLST STs correlate with serotypes
Achtman et al., 2012

Hypothetical WGS-based workflow for Diagnostics & Reference Microbiology

Achieving the ambition

Pilot studies: Lab protocols
[Figure labels: Clinical scientist; Sequencer]

Pilot studies: Bioinformatics process
[Figure labels: Bioinformatician, with speech bubble "Blah, blah, blah X,Y,Z A,B,C Blah, blah ….."]

Pilot studies: Interpretation
[Figure labels: "?"; Department of Health officials, Doctors, Epidemiologists; speech bubble "Blah, blah, blah X,Y,Z A,B,C Blah, blah ….."]

Moving from pilot studies to routine WGS for public health microbiology

Writing scripts is easy
Creating software is hard
• In order for WGS to replace current tests the assays require accreditation (ISO15189)
  • Quality
  • Reproducibility
  • Audit trail
• Any WGS-based test suitable for public health intervention will need
  • Speed
  • Resilience

Quality
• Working with laboratory scientists and epidemiologists in a 3-phase approach
  1. Generation of a command line workflow based on user-requirements
  2. User testing and accustomisation using Galaxy
  3. Automation

How do we generate outputs?
• Quality assessment and trimming
• Important to be able to provide a quality score for the result as well as the reads
• Majority of our workflows use mapping rather than assembly
• Derivation of 7-locus MLST from mapping to loci that comprise the schema
• Gene profiling for ABr and virulence factors using single-copy housekeeping gene as +ve

Reproducibility
• Fastq files are tagged
    UID-sample_name-workflow-version.fastq.gz
• Workflows are described in a config file
    - campylobacter-jejuni-complex-typing :
        2-0-0 :
          kmerid_pattern: "Campylobacter (jejuni|coli)"
          components :
            - component_name: "phe/qa_and_trim"
              version: "1-1"
            - component_name: "phe/kmerid"
              version: "1-0"
            - component_name: "phe/mlst_typing"
              version: "1-1"
            - component_name: "phe/gene_finder"
              version: "1-0"
            - component_name: "phe/combine_xml"
              version: "1-0"
• The results for each sample are tagged with same workflow and version

Auditability
• Each sample is tracked throughout the process from sending lab to report output
  • Metrics from lab processing recorded
  • Sequencing quality is logged
  • Each component of the bioinformatics process logs its own progress and success/failure
• Only
  when all quality thresholds are achieved and all components are completed are results/reports transferred to end-users

[Figure: end-to-end workflow from receipt of pathogen isolates to reports. Recoverable labels: Pathogen isolates received; Culture; Nucleic acid extraction; Liquid handling robots (Sample Dilution, Library Preparation, Metrics => LIMS); 96 well plate; Sequencing on Illumina HiSeq 2500 Rapid Mode; Plate and form submission (Web-based form selecting workflows for samples); UGE High performance Computing Cluster; DDN Lustre-based storage; Bioinformatics Workflows (Kmer Identification, Trimming, Workflow-specific components); Ad Hoc Scripts; Automated workflows; Automated Reports. Outputs: MLST type, Gene profile, Serotype, Drug resistance, Consensus sequence. Roles: Clinical scientist, Sequencing Technician, UNIX System Administrator, Bioinformatician, Department of Health officials, Doctors, Epidemiologists. Stage timings are shown for sample preparation, library preparation and sequencing, and deplexing (4hrs); the remaining values are not recoverable.]

Speed and Resilience
Infrastructure

Speed and Resilience
[Figure: infrastructure diagram. Recoverable labels, in order of appearance: PHE/Colindale zone; Sequencing machines; HAproxy & iRODS server with SSL SAN certificate; UGE High Performance Computing cluster; High Performance Storage, DDN EXAScaler / Lustre filesystem; PHE/Birmingham zone; Sequencing machines; Computing and Storage system; DDN WOS object storage system; iRODS / PHE zone; iRODS / other zone? (WTSI?); PHE WAN; HAproxy, iRODS server, SSL SAN certificate; DDN WOS object storage system]

Examples of WGS in action
Routine samples processed from April to September 2014

Organism                  Number Processed
Salmonella                3954
Staphylococcus aureus     913
Streptococcus pyogenes    1274
Streptococcus pneumoniae  959
Other bacteria            238
HCV                       114
HEV                       3
HIV                       257
Influenza                 187
MeV                       47
Other viruses             35
Total                     7981

Current and Future Challenges
• Lustre FS problems
  • Apparently random failures to write or writing of incomplete files
• Bio-banking for phenotype-genotype studies
• Data release
  • Timely release of raw data
  • Minimal meta data
    o Date of isolation
    o Source (Human/Environmental/Food)
    o Place (Country?)
• Policy makers still wary

Current and Future Challenges
• Scalability
  • Plan to scale to 3000 samples/week
  • OpenStack for surge compute
  • Currently have 300 TB usable Lustre storage
  • Medium term archive to object store
  • Release data to ENA (SRA) at EBI
• 16 weeks: Lustre
  • Fastqs (CRAM?)
  • Bam files
  • All result/log/error and meta files
• 6 months: Object store
  • Fastqs (with workflow descriptor)
  • Text/pdf result files and reports
  • Meta data
• for ever…: SRA
  • Fastqs (CRAM?)
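
The retention tiers above could be expressed as a small lookup. This is only a minimal sketch, not the PHE implementation: the tier names and periods come from the slide, but the file-kind labels, the function name `tiers_holding`, the reading of "6 months" as 183 days, and the assumption that every period is measured from the file's creation date are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Tier:
    name: str
    max_age: Optional[timedelta]  # None means kept "for ever"
    kinds: Tuple[str, ...]        # hypothetical labels for the file types held


# Tier names and periods follow the slide; everything else is an assumption.
TIERS = (
    Tier("Lustre", timedelta(weeks=16),
         ("fastq", "bam", "result", "log", "error", "meta")),
    Tier("Object store", timedelta(days=183),  # assumption: 6 months ~ 183 days
         ("fastq", "result", "report", "meta")),
    Tier("SRA", None, ("fastq",)),
)


def tiers_holding(kind: str, created: date, today: date) -> List[str]:
    """Return the tiers that should still hold a file of this kind."""
    age = today - created
    return [
        tier.name
        for tier in TIERS
        if kind in tier.kinds and (tier.max_age is None or age <= tier.max_age)
    ]


if __name__ == "__main__":
    # A fastq created on 1 January 2014, checked on 1 June 2014.
    print(tiers_holding("fastq", date(2014, 1, 1), date(2014, 6, 1)))
```

Keeping the tiers as data rather than as branching logic means a change of policy (for example a longer Lustre window) is a one-line edit.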

Acknowledgements
• Virtual Pathogen Genomics Unit
• Bioinformatics Unit
  • Francesco Giannoccaro
  • Matthew Goulden, Steven Platt, Rediat Tewolde, Aleksey Jironkin, Ali Al-Shahib, Ulf Schaefer, Kieren Lythgow
  • Jonathan Green
• Other Bioinformaticians
  • Tim Dallman, Phil Ashton
  • Michel Doumith
• Reference and Specialist Microbiology laboratories
• Icons made by Freepik from www.flaticon.com