The GRAPE RNA-seq Analysis Pipeline Environment

The GRAPE RNA-seq Analysis Pipeline Environment
Sarah Djebali, CRG, Barcelona
[email protected]
EMBO Global Exchange Lecture Course on High-throughput / NGS
applied to infectious diseases
Institut Pasteur de Tunis – September 2014
Outline
RNA-seq processing: an introduction
Why do we need a novel RNA-seq processing pipeline?
GRAPE RNA-seq Analysis Pipeline Environment
Description of the different steps
Example of application for the current released version (1.0)
High-throughput sequencing
3
Metzker, Nature Reviews Genetics, 2010
RNA-seq: how does it work?
Population of RNA
RNA Purification
Fragmentation &
Reverse transcription
Library of cDNA
Sequencing
Mapping reads on a
reference genome
Expression Quantification
RPKM → # Reads / Kb x MMR (Million Mappable Reads)
4
RNA purification
We can ask different questions by modifying the RNA purification
procedure:
Different RNA types <-> Size selection: long RNA >200nt (eg mRNAs)
or small RNA <200nt (regulatory RNAs such as microRNA). Comparison of
long and short RNAs allows us to infer precursors
RNA Processing <->Polyadenylation: unprocessed mRNAs (ie in the act
of transcription) as well as some non-coding RNAs do not have a polyA
tail. We can compare polyA+ and polyA- populations to calculate rates of
splicing
RNA localisation and trafficking <-> Subcellular fractionation: we can
take whole cell (usual) or separate the nucleus and cytoplasm, to ask
where RNAs are within the cell
Decay rates <-> Transcriptional blocks: we can stop transcription and
measure RNA levels at subsequent timepoints, to calculate RNA half lives
Other approaches to measure protein binding or RNA binding involve
immunoprecipitation and / or crosslinking of RNAs
5
Metadata information: a prerequisite for any RNA-seq
analysis
gender? male / female / unknown / mix?
Donor
age?
cell type? cell line / primary cell / tissue?
Sample
treatment / perturbation?
RNA size selection? size?
RNA population
RNA fraction? PolyA+ / polyA- / total / Ribo- ...?
cell compartment? whole cell / cytosol / nucleus... ?
directional?
read type? single end (SE) / paired end (PE)?
RNA-seq
experiment
read length?
Insert size for paired end? Configuration of the two mates?
quality offset? number of lanes?
6
Metadata information: a prerequisite for any RNA-seq
analysis
Influences the choice of inputs and parameters for the basic RNAseq data processing steps:
male or female gene annotation?
all transcripts or only long transcripts?
expected configuration of the reads within a transcript?
Helps us to understand the data biologically
Is particularly convenient to store in an index file (introduced by
UCSC):
wgEncodeRikenCage/releaseLatest/wgEncodeRikenCageK562NucleolusTotalTssHmmV3.bedRnaElements.gz
project=wgEncode; grant=Gingeras; lab=RIKEN; composite=wgEncodeRikenCage; dataType=Cage;
view=TssHmm; localization=nucleolus; rnaExtract=total; dataVersion=ENCODE Mar 2012 Freeze; labExpId=I33;
bioRep=gen0144NL; dccAccession=wgEncodeEH000335; dateSubmitted=2012-02-10; dateUnrestricted=2012-11-10;
subId=5611; size=13M; geoSampleAccession=GSM849332;
tableName=wgEncodeRikenCageK562NucleolusTotalTssHmmV3; cell=K562; type=bedRnaElements;
md5sum=b563064e57ecacaac656b0392f4b611a; replicate=N/A; organism=human;
...
Bioinformatics analyses: aims
1. To discover novel genes, or novel isoforms of known genes
→ Return a GTF (general transfer format) output of new transcript structures
2. To quantify the expression level of known genes and transcripts
→ Unit of expression is RPKM (reads per kilobase of transcript per million mapped reads)
3. To identify genes or transcripts that are differentially expressed between
conditions or samples
→ Log2 fold change in gene levels, with adjusted statistical P value
Haas et al PMID 20458303
Bioinformatics analyses: main steps
1. Read mapping (eg Tophat, GEM) (problem more or less solved satisfactorily)
Requires raw sequences
Each read is “mapped” to the genome (usually ~100 million x ~70nt reads)
Need to specify read mapping parameters: # of mismatches, uniqueness
Some reads will span splice junctions: “split reads” - not all mappers can handle this
2. Transcript reconstruction (eg Cufflinks, Scripture) (problem incomplete!)
Requires mapped read locations
Reads are combined to infer the underlying structure of transcripts: start, end, splice
sites => annotation
Very challenging problem: many false positives, dependent on read depth
3. Quantifications (eg Cufflinks, Flux capacitor)
Requires mapped reads + annotation
Difficult to differentiate between transcript isoforms, contain many shared exons
Many biases when comparing genes: GC content, mappability, length ...etc
4. Differential expression (eg DESeq, Cuffdiff)
Requires mapped reads + annotation
Bioinformatics analyses: program benchmarking
RNA-seq is a young technique, therefore programs for performing the
different processing tasks are relatively recent, and users have
trouble finding which is the most adapted to their needs.
RGASP (RNA-seq Genome Annotation Assessment Project,
http://www.gencodegenes.org/rgasp/) has done benchmarking:
Spliced alignment programs:
Systematic evaluation of spliced alignment programs for RNAseq data, Engström et al, Nat Methods, 2013.
Transcript reconstruction programs:
Assessment of transcript reconstruction methods for RNA-seq,
Steijger et al, Nat Methods, 2013.
Two good reviews on RNA-seq read mapping:
Fonseca et al, Bioinformatics, 2012.
Ruffalo et al, Bioinformatics, 2011.
Motivation for RNA-seq analysis pipelines
Vast amounts of RNA-seq data currently being generated (GTeX,
ENCODE, Geuvadis ...)
RNA-seq is becoming the standard for transcriptome analyses
Need for RNA-seq analysis pipelines that:
1) Can cope with big amount of data efficiently (fast),
2) Allow for data organization, quality control, basic processing,
analysis and result visualization
3) Is user-friendly for configuration and error management
4) Is modular enough to integrate new tools when they become
available
5) Allow for reproducibility of results (controlled version system)
6) Allow for both within and between sample analyses (project level)
State of the art for RNA-seq analysis pipelines
RNA-seq
Reference
pipeline name
ArrayExpress Goncalves
HTS
et al,
Bioinformati
cs, 2011
RUM (RNAseq Unified
Mapper)
RseqFlow
Actions performed
- quality control
- read mapping
- annotated gene, transcript,
exon quantification
Tools used
Languag
e used
- shortRead
- several possible mappers
- several possible quantifiers
Possible
hardware for
computation
- EBI R cloud
Notes
R
Grant et al, - read mapping
Bioinformati - annotated gene, exon,
cs, 2011
junction quantification
- novel splice junctions
- Bowtie + Blat
- custom scripts
Wang et al, - quality control
Bioinformati - read mapping
cs, 2011
- annotated gene, exon,
junction quantification
- differential gene expression
analysis
- coding SNP calling
- DESeq
- custom scripts
- Amazon cloud
- Sun grid engine
- local machine
For paired
end data
Perl
Virtual machine
that can be run on
Pegasus local machine or
Workflow Amazon EC2
Manage
ment
System
- For single
end
illumina
data
- keep track
of
programs
and
parameters
GRAPE's novelty
Many different quality control and analysis steps
within sample
between samples
Implementation using NextFlow so that many cluster engines can be
supported in addition to local machine:
Amazon
DNA nexus
Sun Grid Engine
Able to run on big scale project without too much effort
Modular so that new tools can be added and assembled as needed
Visualization of the results
GRAPE 1.0
GRAPE (Grape Analysis Pipeline Environment) = workflow
management system for the organization, processing, analysis
and visualization of RNA-seq data.
Current GRAPE release (version 1.0):
Paper: David G. Knowles, Maik Röder, Angelika Merkel, and
Roderic Guigó, Grape RNA-Seq analysis pipeline
environment, Bioinformatics 29 (5): 614-621 (2013)
Website: http://big.crg.cat/services/grape
GRAPE 2.0
The development of a new version of GRAPE has been started
in 2013 in order to:
redesign the command line interface
use the NextFlow pipeline environment for a better support
to both local and cluster environment execution
http://www.nextflow.io/about-us.html
make the pipeline configuration and setup easier to the enduser
Github: http://github.com/grape-pipeline/grape
GRAPE 2.0: main features
user friendly configuration and execution
automated workflow
data organization and management at project level
full reproducibility
local/cluster execution support
visualization of the results
Different steps of GRAPE 2.0
Step
Program used
Reference
- RNA-seq read mapping to
- GEMtools RNA-seq mapping - Marco-Sola et al, Nature
genome / annotated transcriptome pipeline
Methods, 2012
/ novel exon-exon junctions
- http://gemtools.github.io
Gene / transcript / exon level
Sample -quantifications
level
analyses
- Splicing analyses
- Flux capacitor
- IPSA (Integrative Pipeline for Splicing Analyses)
https://github.com/pervouchine/i
psa
- Summary statistics (QC)
- in-house programs
* Mapping statistics
* genome / annotation coverage
* Gene RPKM distribution
Project
level
analysis
http://sammeth.net/confluence/di
splay/FLUX/Home
- can be made available on
demand
- Sample clustering
- in-house program using the - can be made available on
R hclust function
demand
- Differential Gene Expression
analysis
- in-house program using the - can be made available on
DESeq R package
demand
RNA-seq read mapping with GEMtools
Filtering based mapping algorithm
Formalize the choice of seeds toward exhaustive search by selecting non
overlapping segments of the same length
Uses the following strategy:
takes segments all having the same length,
given m = number of mismatches, consider exactly m+1 of them,
uses only exact alignments as candidates.
it is impossible to distribute
m mismatches in m+1 segments
without leaving at least one
segment without mismatches
Advantages:
as fast as seeded,
exhaustive,
predictable behaviour→tunable.
The GEM mapper
Ferragina Mancini index,
Exhaustive region-based
adaptive filtering algorithm,
partitions alignments into
“distance” strata,
uses base-call qualities,
reports multiple mappings,
supports longer reads (>250bp).
Marco-Sola et al, Nature methods, 2012
Paired end alignment
Both ends of the fragments are
sequenced→paired-end reads
Provides connectivity information
Insert size and read length are known in advance (from library
preparation):
insert size distribution can be used to solve ambiguities (or even
enhance the mapping process)
GEM based RNA-seq mapping pipeline
Gemtools
http://gemtools.github.io
RNA-seq read mapping parameters
Specific variables to consider when mapping RNA-seq data
intron size
overhang
number of bases from each side of the junction that
should be covered by the read
splice site consensus
donor/acceptor splice site consensus sequences
junction “filtering”
chromosome/strand
block order
min/max distance
Mapping filtering
GEM is exhaustive: it reports all possible mappings of a read given the
input parameters → need for filtering the mapping:
mapping quality
unique without any sub-dominant mapping:
251 <= MAPQ <= 254, XT=U
unique with sub-dominant mappings but a different score:
175 <= MAPQ <= 181, XT=U
putatively unique (not unique, but distinguishable by score):
119 <= MAPQ <= 127, XT=U
perfect ties:
78 <= MAPQ <= 90, XT=R
edit distance
max 4 edit operations allowed (including mismatches, insertions
and deletions)
multimaps
max 9 mappings per read are allowed.
SAM format
Sequence Alignment/Map
http://samtools.sourceforge.net/
Individual isoform quantification with flux capacitor
The flux capacitor program
Here we will use the flux capacitor program (exon-centric) to quantify
individual transcript abundances (http://sammeth.net/confluence/display/FLUX/Home ).
The flux capacitor program takes as input:
a reference gene annotation (gtf file),
a set of mapped reads (bam file).
The flux capacitor program includes 2 main steps:
the transcript read profile determination, based on the distribution of
reads across all single transcript loci,
the individual transcript coverage determination for each locus, based
on 2 steps (see next):
the locus segment graph building,
the comparison of the mapped reads to the segment graph in order
to weigh the segment graph edges. This enables the program to
build a system of linear equations which variables are individual
transcript coverages. The solution of this system will thus make it
possible to know the abundances of each transcript of the locus.
Segment graph
(exon segments, segment junctions)
From Micha Sammeth
Segment graph + Reads
= Flow network
From Micha Sammeth
Basic algorithm outline
Example of flux output file
Splicing analyses with IPSA
IPSA Analysis
Filtered mapped RNA-seq reads
(BAM)
IPSA Analysis
Splice junctions
(SSJ)
Splice boundaries
(SSC)
Splicing indices
PSI / CoSI (GFF)
- PSI (Ψ) = Percent Spliced-in Index = measure of inclusion rate
- CoSI (θ) = Complete Splicing Index = measure of processing rate
PSI (Ψ) and CoSI (θ) definitions
- PSI (Ψ) = Percent Spliced-in Index = measure of inclusion rate
- CoSI (θ) = Complete Splicing Index = measure of processing rate
­ i = inclusion
­ e = exclusion
­ r = retention
Counting split-mapped alignments with offsets
Splits are decided by the mapper (N in CIGAR)
Offset = the position of the split in the query sequence
Some offsets give rise to artifactually large counts
Fast counting by sjcount­2.0 utility
Different steps of IPSA
Diagnostic plots
Offset distribution
Strand disproportion plot log(c+)­log(c­)
Percent of junctions with positive counts on both strands
Example of IPSA output file
Summary statistics at the project level
Mapping statistics
Total reads
Mapped reads (number and %)
Uniquely mapped reads (number and %)
Mappings (including multimaps)
Number and % in exons / introns / intergenic regions ...
Allows to spot outlier samples ...
Genome and annotation coverage
1) Produce contigs, i.e. genomic regions covered by RNA-seq reads;
2) Compute the % of the following regions covered by contigs:
the genome
The annotated exons
The annotated introns
The intergenic regions.
Gene RPKM distribution
Look at the RPKM distribution for individual samples (min, max, mean, median):
● Detect outlier genes
● Spot possible biases
● Compare the RPKM distributions of the different samples:
● Assess the similarity between samples (useful for downstream analyses)
● Spot possible biases
●
Sample clustering
Sample hierarchical clustering
Context: a project with several samples interrogated by RNAseq (or any other technique);
Aim: understand the differences between samples in order to
detect technical issues about samples and/or sequencing
(quality control), or real biology;
How: represent the relationships between samples as a tree
(hierarchy) by clustering the samples based on the differences
between them:
each sample is represented as a vector of values (e.g.
RPKM, exon inclusion level, ...) obtained for a fixed set of
biological objects (e.g. genes, exons, ...).
Sample hierarchical clustering
based on gene expression
1) Assess the differences between each pair of samples based
on their gene expression (RPKM)→ each sample is a vector of
n gene RPKM values:
g1
g2
.
.
.
.
.
.
.
gn
x1
x2
.
.
.
.
.
.
.
xn
y1
y2
.
.
.
.
.
.
.
yn
Sample 1
Sample 2
Possible distances:
1) Euclidian:
√∑
2) Maximum:
max i ∈[1 : n] ( x i− y i)
3) Manhattan:
∑i=1 ∣x i− y i∣
4) Camberra:
∑i=1 ∣x i + y i∣
n
i=1
( x i− y i )2
n
∣x − y ∣
n
i
i
5) Binary: bits where only one is 'on', over bits
where at least one is 'on'
6) Minkowski:
n
1/ p
( ∑i=1 ∣x i− y i∣p )
of Euclidian and Manhattan)
7) 1-Pearson correlation
From the R function dist
8) 1-Spearman correlation.
(generalization
Sample hierarchical clustering
based on gene expression
2) Perform a hierarchical clustering of the samples based on the distances
between them, a distance measure and an agglomeration method:
each sample is initially assigned to its own cluster and then the algorithm
proceeds iteratively, at each stage joining the two most similar clusters,
continuing until there is just a single cluster;
There are various agglomeration methods:
1) Ward's minimum variance method: finds compact spherical clusters,
2) Single linkage method: close to minimal spanning tree, adopts a
'friends of friends' clustering strategy,
3) Complete linkage method: known as farthest neighbour clustering
method, finds similar clusters,
4) Average,
5) Mcquitty,
6) Median,
7) Centroid: above 4 find clusters with characteristics between 2) and 3).
From the R function hclust
Sample clustering example
Sample clustering example
Sample clustering example
Sample clustering example
Sample clustering example
Differential gene expression analysis
Differential gene expression (DGE) analysis
Aim: given 2 samples, identify genes that are more expressed in 1
sample than in the other:
read count of a gene in an experiment is related to its
abundance in this experiment (Anders, Genome Biology, 2010)
compare for each gene, its read count between 2 samples,
use statistical testing to decide whether, for a given gene, an
observed difference in read count is significant = is greater than
what would be expected just due to natural random variation
(Anders, Genome Biology, 2010).
Issues:
small number of biological replicates for each exp: not possible
to use non-parametric methods such as rank-based or
permutation tests,
biological data is more variable than predicted by multinomial or
Poisson distributions (overdispersion problem):
use of Negative Binomial distribution (DESEQ, EdgeR).
Basics of DGE: estimate parameters
3 steps:
1) the actual read count for a gene in one sample is drawn from
a Negative Binomial distribution,
2) the parameters (mean and standard deviation) of this
distribution are estimated from the data (for each gene in each
sample),
3) the probability to observe the actual read count for each gene
and each sample is computed using the estimated distribution.
Basics of DGE: assess significance
Null hypothesis (H0): the expression of the gene is the same in both
conditions,
Calculate a p-value,
Adjust the p-value for multiple testing:
e.g. adjusted p-value = 0.05 → 95% confidence in rejecting the
null hypothesis.
Anders, Genome Biology, 2010
Example from GRAPE 1.0
GRAPE 2.0: work in progress
Error management and reporting improvement
Flexible workflow and configuration
Centralized data management
Version control system for data and runs
Shared project configuration and pipeline setup
Take-home messages
RNA-seq is becoming the standard for transcriptomic profiling.
Need for automated workflow that are:
Efficient for large scale transcriptomic projects
User-friendly to both configure and execute
Modular to incorporate new tools
Able to keep track of metadata, program versions and
parameters, and organize the data at the project level
Able to provide visual information about analysis at both the
sample and the project levels.
GRAPE intends to be such an approach, however the current release
is not user-friendly enough, and therefore a new release is currently
under development.
Acknowledgements
Emilio Palumbo
Alessandra Breschi
David Gonzalez
Maik Röder
Paolo Ribeca
Micha Sammeth
Dmitri Pervouchine
Angelika Merkel
Roderic Guigó
References (I)
Sequencing technology review: Metzker, Sequencing technologies - the next generation, Nat
Rev Genet, 2010.
RNAseq review: Wang, Gerstein, Snyder – RNA-seq: a revolutionary tool for transcriptomics
mapping program review:
Fonseca,…, Marioni, Tools for mapping high-throughput sequencing data, Bioinformatics,
2012.
Ruffalo,..., Koyutürk, Comparative analysis of algorithms for next-generation sequencing
read alignment, Bioinformatics, 2011.
mapping algorithms
GEMsuite: Marco-Sola,..., Ribeca, The GEM mapper: fast, accurate and versatile
alignment by filtration, Nat Methods, 2012 /
http://algorithms.cnag.cat/wiki/The_GEM_library
Ricardo A. Baeza-Yates, Chris H. Perleberg, Fast and practical approximate string
matching, Information Processing Letters, Volume 59, Issue 1, 8 July 1996
gemtools: https://github.com/gemtools/gemtools
sam/bam format: http://samtools.sourceforge.net/SAMv1.pdf
samtools: http://samtools.sourceforge.net/
bamtools: Barnett,…, Marth, BamTools: a C++ API and toolkit for analyzing and managing
BAM files, Bioinformatics, 2011.
bedtools: Quinlan, …, Hall, BEDTools: a flexible suite of utilities for comparing genomic
features, Bioinformatics, 2010 / http://code.google.com/p/bedtools/
65
References (II)
RPKM definition: Mortazavi,..., Wold, Mapping and quantifying mammalian transcriptomes by
RNA-Seq, Nat Methods, 2008.
Multisplice: Huang, …, Liu, A robust method for transcript quantification with RNA-seq data,
RECOMB 2012
Review on transcript quantification methods: Pachter, Models for transcript quantification from
RNA-seq, arXiv, 2011
Flux capacitor: Montgomery,..., Dermitzakis, Transcriptome genetics using second generation
sequencing in a Caucasian population, Nature, 2010 /
http://sammeth.net/confluence/display/FLUX/Home
Lovén,...,Young, Revisiting global gene expression analysis, Cell, 2012.
UCSC genome browser: http://genome.ucsc.edu/
Differential expression analysis:
EdgeR: Robinson,..., Smyth, edgeR: a Bioconductor package for differential expression
analysis of digital gene expression data, Bioinformatics, 2010.
DESeq: Anders, Huber, Differential expression analysis for sequence count data,
Genome Biol. 2010.
Cuffdiff: Trapnell,..., Pachter, Differential analysis of gene regulation at transcript
resolution with RNA-seq, Nat Biotechnol. 2013.
Dillies, …, Jaffrézic, A comprehensive evaluation of normalization methods for Illumina highthroughput..., Brief Bioinform. 2012
Rapaport,..., Betel D, Comprehensive evaluation of differential gene expression analysis
methods..., Genome Biol, 2013.
Genome Res. 2013 Dec;23(12):1961-73. doi: 10.1101/gr.161315.113. Epub 2013 Oct 30.
Functional transcriptomics in the post-ENCODE era. Mudge JM1, Frankish A, Harrow J.
RGASP2 papers
66
Raisin visualization of Grape’s QC step. Panels a and b show the distribution
of quality scores and ambiguous nucleotides along the length of the reads.
Panel c and d show summaries of the number of reads in the dataset as well as
the fraction of reads with no ambiguous bases and the number of unique sequences
as a percentage of the total
Overview of Grape’s mapping strategy. The initial genome and junctions mapping is followed
by a round of split mapping. Remaining unmapped reads are aligned with additional mismatches,
and the still remaining ones are iteratively trimmed. The mappings resulting from the different
steps are combined into a final merged mapping
Raisin visualization of Grape’s mapping step. Panel a shows the overall mapping results
as well as the information on the genome annotation and number of mismatches used for
the alignments. Panel b shows the fraction of reads aligned in the final merged mapping.
Panels c, d and e show the same type of information for the different components of the
mapping process
Distribution of uniquely mapping reads along the annotated transcripts.
This allows us to identify biases that may be caused by issues such as
RNA degradation
Raisin visualization of the transcript quantification step. Panel a shows the distribution of
gene expression. Panels b and c show, respectively, the top genes and transcripts detected
in the different lanes of the analyzed samples
Raisin visualization of Grape’s splicing analysis. Panel a shows the summary table for
the detected splice sites. Panel b contains a table with the top included exons in the samples
examined, and panel c shows the distribution of the inclusion values over all internal exons
Raisin overview of Grape’s analysis results