Provisioning robust automated
analytical pipelines for whole
genome-based public health
microbiological typing
Anthony Underwood
Bioinformatics Unit, Infectious Disease Informatics, Microbiological
Services, Public Health England
Public Health England
•  PHE is an executive agency, sponsored by the
Department of Health, UK.
•  We protect and improve the nation's health and
wellbeing, and reduce health inequalities
•  Microbiology Services
   •  We provide specialist investigation and control of communicable disease outbreaks, chemical incidents, radiation and other environmental hazards
   •  We provide the evidence-based science and clinical practice in specialist microbiology in support of the wider public health system and NHS hospitals
PHE Reference Microbiology
Reference Microbiology carries
out a broad spectrum of work
relating to prevention of infectious
disease.
The remit of the centre at Colindale includes:
•  Infectious disease surveillance,
•  Providing specialist and reference microbiology and microbial
epidemiology,
•  Research & Development
•  Coordinating the investigation of national and uncommon outbreaks
and identifying their cause,
•  Helping advise government on the risks posed by various infections
•  Responding to international health alerts.
PHE Specialist Microbiology Services
•  The PHE Specialist Microbiology Services
consists of 8 specialist clinical laboratories
operating across England. These
laboratories provide a comprehensive range
of clinical diagnostic and public health
microbiology tests and services to the NHS
and the allied healthcare provider sector.
•  SMS also includes a further five dedicated
food, water and environmental (FW&E)
testing laboratories that undertake statutory
testing for the NHS, local authorities, and
other key stakeholders.
Bioinformatics Unit
Infectious Disease Informatics
•  Formed in 2013
   •  3 staff
   •  2 Linux servers
   •  Amongst the first public health institutes to see the potential of bioinformatics and fund it
•  Now
   •  15 staff
   •  512 cores (UGE)
   •  300 TB usable HPC storage
MS Public Health Functions
Questions we often ask of a
pathogen isolate:
1. What is it?
2. What characteristics does it have?
3. How does it relate to other isolates?
1. What is it?
   •  Identify the infectious agent or exclude particular infections and associated risks.
2. What characteristics does it possess?
   •  Antibiotic resistant? Presence of toxins?
3. How does it relate to other isolates?
•  Do cases of an infection have a common
source or are they linked?
•  What is the source and what are the risk
factors? e.g: food, school, travel to certain
countries
•  What is the best way of:
•  treating the affected?
•  protecting others?
•  limiting further spread?
Pathogen typing
Giving bugs a label
•  If we discover isolates with the same type we can
   •  Include individual cases in, or exclude them from, an outbreak (e.g. MRSA in hospitals)
   •  Establish an association between an outbreak of food poisoning and a specific food vehicle (e.g. egg mayo sandwich)
   •  Trace the source of contaminants within a manufacturing process (e.g. chocolate factory, baby feed)
•  The type also helps
   •  Determine changes in microbial populations in response to interventions (e.g. vaccination strategies, vaccine escape)
   •  Study variations and trends in the pathogenicity, virulence and antibiotic resistance within a species (e.g. new ABr acquisition)
20th Century Microbiology
Bacterial Identification
Culture
Bacterial Identification
Gram Stain and API Strips
Phenotypic Characterisation of
Microbes
Serotyping
Sensitivity Testing
Phage Typing
Gel-based Typing methods
Ciprofloxacin-resistant Salmonella Kentucky in Travellers
http://wwwnc.cdc.gov/eid/article/12/10/06-0589-f1.htm
MLST: multi-locus sequence typing
Locus  Putative function of gene         Size of sequenced  No. of alleles  No. (%) of polymorphic  % G+C  dn/ds  Position in GBS
                                         fragment (bp)      identified      nucleotide sites                      genome (bp)
adhP   Alcohol dehydrogenase (gbs0054)   498                11              12 (2.4)                43.1   0.13   72286
pheS   Phenylalanyl tRNA synthetase      501                5               7 (1.4)                 37.1   0.17   912817
atr    Amino acid transporter (gbs0538)  501                8               12 (2.4)                36.9   0.14   560085
glnA   Glutamine synthetase              498                6               6 (1.2)                 35.7   0.12   1868862
sdhA   Serine dehydratase (gbs2105)      519                6               13 (2.5)                41.4   0.12   2179923
glcK   Glucose kinase (gbs0518)          459                4               7 (1.5)                 42.6   0.13   538770
tkt    Transketolase (gbs0268)           480                5               8 (1.7)                 38.9   0.42   287111
Sørensen U B S et al. mBio 2010; doi: 10.1128/mBio.00178-10
Virus Identification
Typing of Viruses
Our vision for 21st Century Microbiology
Whole genome
sequencing
Why whole genome sequencing?
•  Cost – e.g. replacement of Salmonella
serotyping
•  Speed – e.g. replacement of TB drug
resistance testing
•  Added value – multiple outputs from one
test
•  Extra resolution – increased
discriminatory power over traditional
techniques
Salmonella population structure
Minimal spanning tree of MLST
data for S. enterica subspecies
enterica
•  Each circle corresponds to a
sequence type (ST)
•  eBGs are natural clusters of
genetically related isolates
•  MLST STs correlate with
serotypes
Achtman et al., 2012
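The minimal spanning tree on this slide links sequence types according to how many loci differ between them. The idea can be sketched as follows, assuming invented 7-locus allele profiles, a plain Hamming distance and Prim's algorithm. The names `allele_distance`, `minimum_spanning_tree` and `EXAMPLE_PROFILES` are hypothetical, and the published analysis relies on dedicated software and tie-breaking rules that are not reproduced here.

```python
from typing import Dict, List, Tuple

Profile = Tuple[int, ...]  # one allele number per MLST locus


def allele_distance(a: Profile, b: Profile) -> int:
    """Number of loci at which two allele profiles differ (Hamming distance)."""
    return sum(x != y for x, y in zip(a, b))


def minimum_spanning_tree(profiles: Dict[str, Profile]) -> List[Tuple[str, str, int]]:
    """Prim's algorithm over the complete graph of sequence types.

    Returns a list of (ST in tree, newly attached ST, loci differing).
    """
    names = sorted(profiles)
    if not names:
        return []
    in_tree = {names[0]}
    edges: List[Tuple[str, str, int]] = []
    while len(in_tree) < len(names):
        # Cheapest edge joining a ST already in the tree to one outside it.
        dist, u, v = min(
            (allele_distance(profiles[u], profiles[v]), u, v)
            for u in in_tree
            for v in names
            if v not in in_tree
        )
        edges.append((u, v, dist))
        in_tree.add(v)
    return edges


# Hypothetical 7-locus profiles, invented for illustration only.
EXAMPLE_PROFILES: Dict[str, Profile] = {
    "ST-A": (1, 1, 1, 1, 1, 1, 1),
    "ST-B": (1, 1, 1, 1, 1, 1, 2),
    "ST-C": (1, 1, 1, 1, 1, 3, 2),
    "ST-D": (5, 6, 7, 8, 9, 9, 9),
}

if __name__ == "__main__":
    for u, v, dist in minimum_spanning_tree(EXAMPLE_PROFILES):
        print(f"{u} -- {v} (loci differing: {dist})")
```

In this toy data ST-A, ST-B and ST-C form a chain of single-locus variants, which is the kind of tight cluster the slide calls an eBG, while ST-D attaches only through a long edge.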
Hypothetical WGS-based workflow for
Diagnostics & Reference Microbiology
Achieving the ambition
Pilot studies: Lab protocols
Clinical scientist
Sequencer
Pilot studies: Bioinformatics process
[Figure: a bioinformatician producing output shown as "Blah, blah, blah", "X,Y,Z", "A,B,C", "Blah, blah ….."]
Pilot studies: Interpretation
[Figure: Department of Health officials, doctors and epidemiologists responding with "?" to output shown as "Blah, blah, blah", "X,Y,Z", "A,B,C", "Blah, blah ….."]
Moving from pilot
studies to routine WGS
for public health
microbiology
Writing scripts is easy
Creating software is hard
•  In order for WGS to replace current tests the assays require accreditation (ISO 15189)
   •  Quality
   •  Reproducibility
   •  Audit trail
•  Any WGS-based test suitable for public health intervention will need
   •  Speed
   •  Resilience
Quality
•  Working with laboratory scientists and
epidemiologists in a 3-phase approach
1.  Generation of a command line workflow
based on user-requirements
2.  User testing and familiarisation using
Galaxy
3.  Automation
How do we generate outputs?
•  Quality assessment and trimming
   •  Important to be able to provide a quality score for the result as well as the reads
•  Majority of our workflows use mapping rather than assembly
   •  Derivation of 7-locus MLST from mapping to loci that comprise the schema
   •  Gene profiling for ABr and virulence factors using single-copy housekeeping gene as +ve
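Once mapping has chosen one allele per locus, deriving the 7-locus MLST result reduces to a table lookup. A minimal sketch follows, assuming the locus names from the GBS scheme shown earlier and an invented profile table. `PROFILE_TABLE`, `call_sequence_type` and the ST numbers are hypothetical, and the criteria for accepting an allele from the mapping are not described in the talk.

```python
from typing import Dict, Optional, Tuple

# Locus names from the GBS scheme shown earlier in the talk.
LOCI = ("adhP", "pheS", "atr", "glnA", "sdhA", "glcK", "tkt")

# Hypothetical profile table: allele numbers per locus -> sequence type.
# Real tables come from the scheme's curators; these values are invented.
PROFILE_TABLE: Dict[Tuple[int, ...], int] = {
    (1, 1, 2, 1, 1, 2, 2): 101,
    (5, 4, 6, 3, 2, 1, 3): 102,
}


def call_sequence_type(alleles: Dict[str, Optional[int]]) -> Optional[int]:
    """Return the ST for a full allele profile, or None if it cannot be called.

    `alleles` maps each locus to the allele number chosen from read mapping,
    or None where no allele passed the (unspecified) quality criteria.
    """
    profile = tuple(alleles.get(locus) for locus in LOCI)
    if any(allele is None for allele in profile):
        return None  # incomplete profile: at least one locus was not typed
    return PROFILE_TABLE.get(profile)  # None also covers a novel combination


if __name__ == "__main__":
    sample = dict(zip(LOCI, (1, 1, 2, 1, 1, 2, 2)))
    print(call_sequence_type(sample))
```

Returning `None` rather than a best guess keeps an untyped locus or a novel profile visible to whoever reviews the result.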
Reproducibility
•  Fastq files are tagged
   UID-sample_name-workflow-version.fastq.gz
•  Workflows are described in a config file
   campylobacter-jejuni-complex-typing:
     2-0-0:
       kmerid_pattern: "Campylobacter (jejuni|coli)"
       components:
         - component_name: "phe/qa_and_trim"
           version: "1-1"
         - component_name: "phe/kmerid"
           version: "1-0"
         - component_name: "phe/mlst_typing"
           version: "1-1"
         - component_name: "phe/gene_finder"
           version: "1-0"
         - component_name: "phe/combine_xml"
           version: "1-0"
•  The results for each sample are tagged with the same workflow and version
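How the tagging and the config file might fit together can be sketched as below. The workflow description is copied from the slide into a Python dict. Treating `kmerid_pattern` as a regular expression matched against the species reported by the kmer identification step is an assumption, and `select_workflow` and `tagged_fastq_name` are hypothetical helper names, not part of the PHE software.

```python
import re
from typing import Dict, Optional, Tuple

# The workflow description from the slide, written as a plain Python dict.
WORKFLOWS: Dict[str, Dict[str, dict]] = {
    "campylobacter-jejuni-complex-typing": {
        "2-0-0": {
            "kmerid_pattern": "Campylobacter (jejuni|coli)",
            "components": [
                {"component_name": "phe/qa_and_trim", "version": "1-1"},
                {"component_name": "phe/kmerid", "version": "1-0"},
                {"component_name": "phe/mlst_typing", "version": "1-1"},
                {"component_name": "phe/gene_finder", "version": "1-0"},
                {"component_name": "phe/combine_xml", "version": "1-0"},
            ],
        }
    }
}


def select_workflow(species: str) -> Optional[Tuple[str, str]]:
    """Return (workflow, version) whose kmerid_pattern matches the species.

    Assumption: the pattern is a regular expression that must match the
    whole species name.
    """
    for name, versions in WORKFLOWS.items():
        for version, spec in versions.items():
            if re.fullmatch(spec["kmerid_pattern"], species):
                return name, version
    return None


def tagged_fastq_name(uid: str, sample_name: str, workflow: str, version: str) -> str:
    """Build a file name following UID-sample_name-workflow-version.fastq.gz."""
    return f"{uid}-{sample_name}-{workflow}-{version}.fastq.gz"


if __name__ == "__main__":
    match = select_workflow("Campylobacter jejuni")
    if match is not None:
        print(tagged_fastq_name("0001", "sampleA", *match))
```

Because the workflow name and version travel in the file name, a result can later be traced back to the exact set of component versions that produced it.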
Auditability
•  Each sample is tracked throughout the process from sending lab to report output
   •  Metrics from lab processing recorded
   •  Sequencing quality is logged
   •  Each component of the bioinformatics process logs its own progress and success/failure
•  Only when all quality thresholds are
achieved and all components are completed
are results/reports transferred to end-users
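The release rule in the last bullet can be sketched as a simple gate. This is an illustration under stated assumptions: the `SampleRecord` structure, the metric names and the thresholds are all hypothetical, and the talk does not say how PHE stores or evaluates these values.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Tuple


@dataclass
class SampleRecord:
    """Hypothetical audit record for one sample."""

    sample_id: str
    # metric name -> (observed value, minimum acceptable value)
    quality_metrics: Dict[str, Tuple[float, float]] = field(default_factory=dict)
    # component name -> True once it has logged success, False on failure
    component_status: Dict[str, bool] = field(default_factory=dict)


def release_blockers(record: SampleRecord, required: Sequence[str]) -> List[str]:
    """List every reason the sample's results may not be released yet."""
    blockers: List[str] = []
    for name, (observed, minimum) in sorted(record.quality_metrics.items()):
        if observed < minimum:
            blockers.append(f"quality threshold not met: {name}")
    for component in required:
        # A component that never logged anything counts as not completed.
        if not record.component_status.get(component, False):
            blockers.append(f"component not completed: {component}")
    return blockers


def can_release(record: SampleRecord, required: Sequence[str]) -> bool:
    """True only when all thresholds are met and all components succeeded."""
    return not release_blockers(record, required)


if __name__ == "__main__":
    record = SampleRecord(
        sample_id="0001",
        quality_metrics={"mean_read_quality": (32.0, 30.0)},  # invented values
        component_status={"phe/qa_and_trim": True, "phe/kmerid": False},
    )
    print(release_blockers(record, ["phe/qa_and_trim", "phe/kmerid"]))
```

Returning the list of blockers, not just a yes/no answer, gives the audit trail a record of why a report was held back.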
[Figure: end-to-end workflow from isolate receipt to report, with the staff responsible for each stage]
•  Laboratory (clinical scientist, sequencing technician): pathogen isolates received; plate and form submission (96 well plate, web-based form for selecting workflows for samples); culture; nucleic acid extraction; sample dilution and library preparation on liquid handling robots; sequencing on Illumina HiSeq 2500 in Rapid Mode; deplexing (4hrs); metrics => LIMS
•  Infrastructure (UNIX system administrator): UGE high performance computing cluster, DDN Lustre-based high performance storage
•  Bioinformatics (bioinformatician): automated workflows (trimming, kmer identification, workflow-specific components) and ad hoc scripts
•  Outputs: MLST type, serotype, gene profile, drug resistance, consensus sequence
•  Automated reports to doctors, epidemiologists and Department of Health officials
Speed and Resilience
Infrastructure
Speed and Resilience
[Figure: infrastructure diagram]
•  PHE/Colindale zone: sequencing machines; HAproxy & iRODS server with SSL SAN certificate; UGE High Performance Computing cluster; High Performance Storage (DDN EXAScaler / Lustre filesystem)
•  PHE/Birmingham zone: sequencing machines; computing and storage system; DDN WOS object storage system
•  PHE WAN; iRODS / PHE zone; iRODS / other zone? (WTSI?)
•  Also shown: a second HAproxy iRODS server with SSL SAN certificate and a second DDN WOS object storage system
Examples of WGS in action
Routine samples processed from April to September 2014
Organism                    Number Processed
Salmonella                  3954
Staphylococcus aureus       913
Streptococcus pyogenes      1274
Streptococcus pneumoniae    959
Other bacteria              238
HCV                         114
HEV                         3
HIV                         257
Influenza                   187
MeV                         47
Other viruses               35
Total                       7981
Current and Future Challenges
•  Lustre FS problems
   •  Apparently random failures to write or writing of incomplete files
•  Bio-banking for phenotype-genotype studies
•  Data release
   •  Timely release of raw data
   •  Minimal meta data
      o  Date of isolation
      o  Source (Human/Environmental/Food)
      o  Place (Country?)
•  Policy makers still wary
Current and Future Challenges
•  Scalability
   •  Plan to scale to 3000 samples/week
   •  OpenStack for surge compute
   •  Currently have 300 TB usable Lustre storage
   •  Medium term archive to object store
   •  Release data to ENA (SRA) at EBI
Retention   Location      Contents
16 weeks    Lustre        Fastqs (CRAM?); Bam files; all result/log/error and meta files
6 months    Object store  Fastqs (with workflow descriptor); text/pdf result files and reports; meta data
for ever…   SRA           Fastqs (CRAM?)
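The retention periods above can be read as a simple age-based rule. The sketch below is one possible reading, not the implemented policy: it assumes both periods are counted from the sequencing date, approximates "6 months" as 183 days, and treats the SRA copy as the only one left afterwards. `storage_tier` is a hypothetical function name.

```python
from datetime import date, timedelta

# Retention periods taken from the slide; "6 months" is approximated as 183 days.
LUSTRE_RETENTION = timedelta(weeks=16)
OBJECT_STORE_RETENTION = timedelta(days=183)


def storage_tier(sequenced_on: date, today: date) -> str:
    """Return the tier expected to hold a sample's fastqs on a given day.

    Assumption: both periods are measured from the sequencing date, and the
    SRA copy is the only one kept once the object-store period has passed.
    """
    age = today - sequenced_on
    if age < timedelta(0):
        raise ValueError("today is earlier than the sequencing date")
    if age <= LUSTRE_RETENTION:
        return "Lustre"
    if age <= OBJECT_STORE_RETENTION:
        return "Object store"
    return "SRA"


if __name__ == "__main__":
    print(storage_tier(date(2014, 4, 1), date(2014, 9, 1)))
```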
Acknowledgements
•  Virtual Pathogen Genomics Unit
•  Bioinformatics Unit
   •  Francesco Giannoccaro
   •  Matthew Goulden, Steven Platt, Rediat Tewolde, Aleksey Jironkin, Ali Al-Shahib, Ulf Schaefer, Kieren Lythgow
   •  Jonathan Green
•  Other Bioinformaticians
   •  Tim Dallman, Phil Ashton
   •  Michel Doumith
•  Reference and Specialist Microbiology laboratories
•  Icons made by Freepik from www.flaticon.com