Quest to a perfect (NGS) production pipeline

Mateusz Kuzak (eScienceCenter/UvA)
Wibowo Arindrarto (LUMC)
Peter van 't Hof (LUMC)
Leon Mei (LUMC)
…...
Agenda
NGS groups & Infrastructure
Cluster: LUMC (SGE), UMCG (PBS), WUR (SLURM), UMCU (SGE), KeyGene (?)
Server: ErasmusMC, AMC, VUMC, UMCN, UMCM
Cloud, Grid, Clusters at SURFsara
Pipeline
● a set of consecutive analysis steps
● the output of one step is the input for the next
● input: raw data
● output: alignments, count tables, visualizations
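The chaining idea can be sketched in a few lines of Python. This is only an illustration: the step functions `align`, `count` and `plot` and the file names are hypothetical stand-ins that map an input file name to an output file name, not real tools.

```python
from functools import reduce
from typing import Callable, Iterable

# A step takes the path produced by the previous step and returns its own output path.
Step = Callable[[str], str]


def run_pipeline(raw_input: str, steps: Iterable[Step]) -> str:
    """Feed the output of each step into the next one and return the final output."""
    return reduce(lambda data, step: step(data), steps, raw_input)


# Hypothetical stand-ins for real tools (assumption: each one only renames the file).
def align(fastq: str) -> str:
    return fastq.replace(".fastq", ".bam")


def count(bam: str) -> str:
    return bam.replace(".bam", ".counts.tsv")


def plot(counts: str) -> str:
    return counts.replace(".counts.tsv", ".png")


final_output = run_pipeline("sample.fastq", [align, count, plot])
print(final_output)
```

In a real pipeline each step would call an external program; the point here is only that the order of the list defines the flow of data.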
Perfect pipeline
● Sustainable: good support and a reliable community
● Robust: can rerun part of the pipeline
● Scalable: can utilize multiple cores (in a cluster)
● Modular, with no boilerplate code: similar components can be swapped in and out (e.g. switching aligners), and modules can be written in different languages
● Portable: can easily run at a different site
● Transparent control: directly manage scripts and file locations, and change parameters
● User-friendly: jobs and parameters should be defined in an easy-to-read file format (e.g. YAML)
● Provenance: explicit tracking of all scripts and options used and all steps executed, for report generation or monitoring via a webpage
● ...
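Two of the requirements above, partial reruns and provenance, can be illustrated with a small Python sketch. It assumes a make-style rule (a step is skipped when its output file exists and is not older than its input) and keeps the provenance log as a plain list of dictionaries. The names `is_up_to_date`, `run_step` and `to_upper` are hypothetical and not taken from any of the tools discussed here.

```python
import os
import tempfile
from typing import Callable, Dict, List


def is_up_to_date(input_path: str, output_path: str) -> bool:
    """True if the output exists and is at least as new as the input."""
    return (os.path.exists(output_path)
            and os.path.getmtime(output_path) >= os.path.getmtime(input_path))


def run_step(name: str, func: Callable[..., None], input_path: str,
             output_path: str, params: Dict[str, object],
             provenance: List[dict]) -> bool:
    """Run func unless its output is current; record the step either way.

    Returns True when the step was skipped.
    """
    skipped = is_up_to_date(input_path, output_path)
    if not skipped:
        func(input_path, output_path, **params)
    provenance.append({"step": name, "input": input_path,
                       "output": output_path, "params": params,
                       "skipped": skipped})
    return skipped


# Toy step (assumption): upper-case the input file and append a suffix.
def to_upper(input_path: str, output_path: str, suffix: str = "") -> None:
    with open(input_path) as src, open(output_path, "w") as dst:
        dst.write(src.read().upper() + suffix)


provenance: List[dict] = []
workdir = tempfile.mkdtemp()
raw = os.path.join(workdir, "raw.txt")
out = os.path.join(workdir, "upper.txt")
with open(raw, "w") as handle:
    handle.write("acgt")

# The first call does the work; the second finds a current output and skips it.
first_skipped = run_step("to_upper", to_upper, raw, out, {"suffix": "\n"}, provenance)
second_skipped = run_step("to_upper", to_upper, raw, out, {"suffix": "\n"}, provenance)
```

The provenance list could later be written out for a report; a timestamp comparison is the simplest rerun rule, and a checksum of inputs and parameters would be a stricter alternative.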
Options
● MOA: command-line workflows for bioinformatics
● Ruffus: lightweight Python computational pipeline management
● GNU make: standard Unix build tool
● Snakemake: Python-based language (DSL)
● Bpipe: Java and Groovy-based tool
● Bcbio-nextgen: community-based (Blue Collar Bioinformatics)
● Molgenis-compute: local expertise
● GATK queue: Scala-based pipeline
● Galaxy: GUI, active community
● Further possibilities: new initiatives from the open source community, scientific workflow projects
GATK queue #1
● java -jar Queue.jar -S <script>.scala
● java -Djava.io.tmpdir=tmp -jar Queue.jar -S ExampleCountReads.scala -R exampleFASTA.fasta -I exampleBAM.bam -run
● MIT license, made at the Broad Institute, roadmap is unclear
● Uses DRMAA (native support for LSF and Grid Engine; batches available for PBS, Condor, etc.)
● Can visualize the pipeline as a dot graph
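To show what a dot graph of a pipeline looks like, the sketch below renders a step dependency mapping as Graphviz DOT text in plain Python. It is not how Queue produces its graph; the function `to_dot` and the step names are hypothetical and serve only to illustrate the output format.

```python
from typing import Dict, List


def to_dot(dependencies: Dict[str, List[str]], name: str = "pipeline") -> str:
    """Render a mapping of step -> upstream steps as Graphviz DOT text."""
    lines = [f"digraph {name} {{"]
    for step in sorted(dependencies):
        # Declare every step as a node so that isolated steps are drawn too.
        lines.append(f'    "{step}";')
        for upstream in sorted(dependencies[step]):
            # An edge points from the producing step to the consuming step.
            lines.append(f'    "{upstream}" -> "{step}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical three-step pipeline: count depends on align, plot depends on count.
example_dot = to_dot({"align": [], "count": ["align"], "plot": ["count"]})
print(example_dot)
```

The resulting text can be saved to a file and rendered with the Graphviz `dot` program.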
GATK queue #2
GATK queue #3
Agenda