The GRAPE RNA-seq Analysis Pipeline Environment Sarah Djebali, CRG, Barcelona [email protected] EMBO Global Exchange Lecture Course on High-throughput / NGS applied to infectious diseases Institut Pasteur de Tunis – September 2014 Outline RNA-seq processing: an introduction Why do we need a novel RNA-seq processing pipeline? GRAPE RNA-seq Analysis Pipeline Environment Description of the different steps Example of application for the current released version (1.0) High-throughput sequencing 3 Metzker, Nature Reviews Genetics, 2010 RNA-seq: how does it work? Population of RNA RNA Purification Fragmentation & Reverse transcription Library of cDNA Sequencing Mapping reads on a reference genome Expression Quantification RPKM → # Reads / Kb x MMR (Million Mappable Reads) 4 RNA purification We can ask different questions by modifying the RNA purification procedure: Different RNA types <-> Size selection: long RNA >200nt (eg mRNAs) or small RNA <200nt (regulatory RNAs such as microRNA). Comparison of long and short RNAs allows us to infer precursors RNA Processing <->Polyadenylation: unprocessed mRNAs (ie in the act of transcription) as well as some non-coding RNAs do not have a polyA tail. We can compare polyA+ and polyA- populations to calculate rates of splicing RNA localisation and trafficking <-> Subcellular fractionation: we can take whole cell (usual) or separate the nucleus and cytoplasm, to ask where RNAs are within the cell Decay rates <-> Transcriptional blocks: we can stop transcription and measure RNA levels at subsequent timepoints, to calculate RNA half lives Other approaches to measure protein binding or RNA binding involve immunoprecipitation and / or crosslinking of RNAs 5 Metadata information: a prerequisite for any RNA-seq analysis gender? male / female / unknown / mix? Donor age? cell type? cell line / primary cell / tissue? Sample treatment / perturbation? RNA size selection? size? RNA population RNA fraction? PolyA+ / polyA- / total / Ribo- ...? cell compartment? whole cell / cytosol / nucleus... ? directional? read type? single end (SE) / paired end (PE)? RNA-seq experiment read length? Insert size for paired end? Configuration of the two mates? quality offset? number of lanes? 6 Metadata information: a prerequisite for any RNA-seq analysis Influences the choice of inputs and parameters for the basic RNAseq data processing steps: male or female gene annotation? all transcripts or only long transcripts? expected configuration of the reads within a transcript? Helps us to understand the data biologically Is particularly convenient to store in an index file (introduced by UCSC): wgEncodeRikenCage/releaseLatest/wgEncodeRikenCageK562NucleolusTotalTssHmmV3.bedRnaElements.gz project=wgEncode; grant=Gingeras; lab=RIKEN; composite=wgEncodeRikenCage; dataType=Cage; view=TssHmm; localization=nucleolus; rnaExtract=total; dataVersion=ENCODE Mar 2012 Freeze; labExpId=I33; bioRep=gen0144NL; dccAccession=wgEncodeEH000335; dateSubmitted=2012-02-10; dateUnrestricted=2012-11-10; subId=5611; size=13M; geoSampleAccession=GSM849332; tableName=wgEncodeRikenCageK562NucleolusTotalTssHmmV3; cell=K562; type=bedRnaElements; md5sum=b563064e57ecacaac656b0392f4b611a; replicate=N/A; organism=human; ... Bioinformatics analyses: aims 1. To discover novel genes, or novel isoforms of known genes → Return a GTF (general transfer format) output of new transcript structures 2. To quantify the expression level of known genes and transcripts → Unit of expression is RPKM (reads per kilobase of transcript per million mapped reads) 3. To identify genes or transcripts that are differentially expressed between conditions or samples → Log2 fold change in gene levels, with adjusted statistical P value Haas et al PMID 20458303 Bioinformatics analyses: main steps 1. Read mapping (eg Tophat, GEM) (problem more or less solved satisfactorily) Requires raw sequences Each read is “mapped” to the genome (usually ~100 million x ~70nt reads) Need to specify read mapping parameters: # of mismatches, uniqueness Some reads will span splice junctions: “split reads” - not all mappers can handle this 2. Transcript reconstruction (eg Cufflinks, Scripture) (problem incomplete!) Requires mapped read locations Reads are combined to infer the underlying structure of transcripts: start, end, splice sites => annotation Very challenging problem: many false positives, dependent on read depth 3. Quantifications (eg Cufflinks, Flux capacitor) Requires mapped reads + annotation Difficult to differentiate between transcript isoforms, contain many shared exons Many biases when comparing genes: GC content, mappability, length ...etc 4. Differential expression (eg DESeq, Cuffdiff) Requires mapped reads + annotation Bioinformatics analyses: program benchmarking RNA-seq is a young technique, therefore programs for performing the different processing tasks are relatively recent, and users have trouble finding which is the most adapted to their needs. RGASP (RNA-seq Genome Annotation Assessment Project, http://www.gencodegenes.org/rgasp/) has done benchmarking: Spliced alignment programs: Systematic evaluation of spliced alignment programs for RNAseq data, Engström et al, Nat Methods, 2013. Transcript reconstruction programs: Assessment of transcript reconstruction methods for RNA-seq, Steijger et al, Nat Methods, 2013. Two good reviews on RNA-seq read mapping: Fonseca et al, Bioinformatics, 2012. Ruffalo et al, Bioinformatics, 2011. Motivation for RNA-seq analysis pipelines Vast amounts of RNA-seq data currently being generated (GTeX, ENCODE, Geuvadis ...) RNA-seq is becoming the standard for transcriptome analyses Need for RNA-seq analysis pipelines that: 1) Can cope with big amount of data efficiently (fast), 2) Allow for data organization, quality control, basic processing, analysis and result visualization 3) Is user-friendly for configuration and error management 4) Is modular enough to integrate new tools when they become available 5) Allow for reproducibility of results (controlled version system) 6) Allow for both within and between sample analyses (project level) State of the art for RNA-seq analysis pipelines RNA-seq Reference pipeline name ArrayExpress Goncalves HTS et al, Bioinformati cs, 2011 RUM (RNAseq Unified Mapper) RseqFlow Actions performed - quality control - read mapping - annotated gene, transcript, exon quantification Tools used Languag e used - shortRead - several possible mappers - several possible quantifiers Possible hardware for computation - EBI R cloud Notes R Grant et al, - read mapping Bioinformati - annotated gene, exon, cs, 2011 junction quantification - novel splice junctions - Bowtie + Blat - custom scripts Wang et al, - quality control Bioinformati - read mapping cs, 2011 - annotated gene, exon, junction quantification - differential gene expression analysis - coding SNP calling - DESeq - custom scripts - Amazon cloud - Sun grid engine - local machine For paired end data Perl Virtual machine that can be run on Pegasus local machine or Workflow Amazon EC2 Manage ment System - For single end illumina data - keep track of programs and parameters GRAPE's novelty Many different quality control and analysis steps within sample between samples Implementation using NextFlow so that many cluster engines can be supported in addition to local machine: Amazon DNA nexus Sun Grid Engine Able to run on big scale project without too much effort Modular so that new tools can be added and assembled as needed Visualization of the results GRAPE 1.0 GRAPE (Grape Analysis Pipeline Environment) = workflow management system for the organization, processing, analysis and visualization of RNA-seq data. Current GRAPE release (version 1.0): Paper: David G. Knowles, Maik Röder, Angelika Merkel, and Roderic Guigó, Grape RNA-Seq analysis pipeline environment, Bioinformatics 29 (5): 614-621 (2013) Website: http://big.crg.cat/services/grape GRAPE 2.0 The development of a new version of GRAPE has been started in 2013 in order to: redesign the command line interface use the NextFlow pipeline environment for a better support to both local and cluster environment execution http://www.nextflow.io/about-us.html make the pipeline configuration and setup easier to the enduser Github: http://github.com/grape-pipeline/grape GRAPE 2.0: main features user friendly configuration and execution automated workflow data organization and management at project level full reproducibility local/cluster execution support visualization of the results Different steps of GRAPE 2.0 Step Program used Reference - RNA-seq read mapping to - GEMtools RNA-seq mapping - Marco-Sola et al, Nature genome / annotated transcriptome pipeline Methods, 2012 / novel exon-exon junctions - http://gemtools.github.io Gene / transcript / exon level Sample -quantifications level analyses - Splicing analyses - Flux capacitor - IPSA (Integrative Pipeline for Splicing Analyses) https://github.com/pervouchine/i psa - Summary statistics (QC) - in-house programs * Mapping statistics * genome / annotation coverage * Gene RPKM distribution Project level analysis http://sammeth.net/confluence/di splay/FLUX/Home - can be made available on demand - Sample clustering - in-house program using the - can be made available on R hclust function demand - Differential Gene Expression analysis - in-house program using the - can be made available on DESeq R package demand RNA-seq read mapping with GEMtools Filtering based mapping algorithm Formalize the choice of seeds toward exhaustive search by selecting non overlapping segments of the same length Uses the following strategy: takes segments all having the same length, given m = number of mismatches, consider exactly m+1 of them, uses only exact alignments as candidates. it is impossible to distribute m mismatches in m+1 segments without leaving at least one segment without mismatches Advantages: as fast as seeded, exhaustive, predictable behaviour→tunable. The GEM mapper Ferragina Mancini index, Exhaustive region-based adaptive filtering algorithm, partitions alignments into “distance” strata, uses base-call qualities, reports multiple mappings, supports longer reads (>250bp). Marco-Sola et al, Nature methods, 2012 Paired end alignment Both ends of the fragments are sequenced→paired-end reads Provides connectivity information Insert size and read length are known in advance (from library preparation): insert size distribution can be used to solve ambiguities (or even enhance the mapping process) GEM based RNA-seq mapping pipeline Gemtools http://gemtools.github.io RNA-seq read mapping parameters Specific variables to consider when mapping RNA-seq data intron size overhang number of bases from each side of the junction that should be covered by the read splice site consensus donor/acceptor splice site consensus sequences junction “filtering” chromosome/strand block order min/max distance Mapping filtering GEM is exhaustive: it reports all possible mappings of a read given the input parameters → need for filtering the mapping: mapping quality unique without any sub-dominant mapping: 251 <= MAPQ <= 254, XT=U unique with sub-dominant mappings but a different score: 175 <= MAPQ <= 181, XT=U putatively unique (not unique, but distinguishable by score): 119 <= MAPQ <= 127, XT=U perfect ties: 78 <= MAPQ <= 90, XT=R edit distance max 4 edit operations allowed (including mismatches, insertions and deletions) multimaps max 9 mappings per read are allowed. SAM format Sequence Alignment/Map http://samtools.sourceforge.net/ Individual isoform quantification with flux capacitor The flux capacitor program Here we will use the flux capacitor program (exon-centric) to quantify individual transcript abundances (http://sammeth.net/confluence/display/FLUX/Home ). The flux capacitor program takes as input: a reference gene annotation (gtf file), a set of mapped reads (bam file). The flux capacitor program includes 2 main steps: the transcript read profile determination, based on the distribution of reads across all single transcript loci, the individual transcript coverage determination for each locus, based on 2 steps (see next): the locus segment graph building, the comparison of the mapped reads to the segment graph in order to weigh the segment graph edges. This enables the program to build a system of linear equations which variables are individual transcript coverages. The solution of this system will thus make it possible to know the abundances of each transcript of the locus. Segment graph (exon segments, segment junctions) From Micha Sammeth Segment graph + Reads = Flow network From Micha Sammeth Basic algorithm outline Example of flux output file Splicing analyses with IPSA IPSA Analysis Filtered mapped RNA-seq reads (BAM) IPSA Analysis Splice junctions (SSJ) Splice boundaries (SSC) Splicing indices PSI / CoSI (GFF) - PSI (Ψ) = Percent Spliced-in Index = measure of inclusion rate - CoSI (θ) = Complete Splicing Index = measure of processing rate PSI (Ψ) and CoSI (θ) definitions - PSI (Ψ) = Percent Spliced-in Index = measure of inclusion rate - CoSI (θ) = Complete Splicing Index = measure of processing rate i = inclusion e = exclusion r = retention Counting split-mapped alignments with offsets Splits are decided by the mapper (N in CIGAR) Offset = the position of the split in the query sequence Some offsets give rise to artifactually large counts Fast counting by sjcount2.0 utility Different steps of IPSA Diagnostic plots Offset distribution Strand disproportion plot log(c+)log(c) Percent of junctions with positive counts on both strands Example of IPSA output file Summary statistics at the project level Mapping statistics Total reads Mapped reads (number and %) Uniquely mapped reads (number and %) Mappings (including multimaps) Number and % in exons / introns / intergenic regions ... Allows to spot outlier samples ... Genome and annotation coverage 1) Produce contigs, i.e. genomic regions covered by RNA-seq reads; 2) Compute the % of the following regions covered by contigs: the genome The annotated exons The annotated introns The intergenic regions. Gene RPKM distribution Look at the RPKM distribution for individual samples (min, max, mean, median): ● Detect outlier genes ● Spot possible biases ● Compare the RPKM distributions of the different samples: ● Assess the similarity between samples (useful for downstream analyses) ● Spot possible biases ● Sample clustering Sample hierarchical clustering Context: a project with several samples interrogated by RNAseq (or any other technique); Aim: understand the differences between samples in order to detect technical issues about samples and/or sequencing (quality control), or real biology; How: represent the relationships between samples as a tree (hierarchy) by clustering the samples based on the differences between them: each sample is represented as a vector of values (e.g. RPKM, exon inclusion level, ...) obtained for a fixed set of biological objects (e.g. genes, exons, ...). Sample hierarchical clustering based on gene expression 1) Assess the differences between each pair of samples based on their gene expression (RPKM)→ each sample is a vector of n gene RPKM values: g1 g2 . . . . . . . gn x1 x2 . . . . . . . xn y1 y2 . . . . . . . yn Sample 1 Sample 2 Possible distances: 1) Euclidian: √∑ 2) Maximum: max i ∈[1 : n] ( x i− y i) 3) Manhattan: ∑i=1 ∣x i− y i∣ 4) Camberra: ∑i=1 ∣x i + y i∣ n i=1 ( x i− y i )2 n ∣x − y ∣ n i i 5) Binary: bits where only one is 'on', over bits where at least one is 'on' 6) Minkowski: n 1/ p ( ∑i=1 ∣x i− y i∣p ) of Euclidian and Manhattan) 7) 1-Pearson correlation From the R function dist 8) 1-Spearman correlation. (generalization Sample hierarchical clustering based on gene expression 2) Perform a hierarchical clustering of the samples based on the distances between them, a distance measure and an agglomeration method: each sample is initially assigned to its own cluster and then the algorithm proceeds iteratively, at each stage joining the two most similar clusters, continuing until there is just a single cluster; There are various agglomeration methods: 1) Ward's minimum variance method: finds compact spherical clusters, 2) Single linkage method: close to minimal spanning tree, adopts a 'friends of friends' clustering strategy, 3) Complete linkage method: known as farthest neighbour clustering method, finds similar clusters, 4) Average, 5) Mcquitty, 6) Median, 7) Centroid: above 4 find clusters with characteristics between 2) and 3). From the R function hclust Sample clustering example Sample clustering example Sample clustering example Sample clustering example Sample clustering example Differential gene expression analysis Differential gene expression (DGE) analysis Aim: given 2 samples, identify genes that are more expressed in 1 sample than in the other: read count of a gene in an experiment is related to its abundance in this experiment (Anders, Genome Biology, 2010) compare for each gene, its read count between 2 samples, use statistical testing to decide whether, for a given gene, an observed difference in read count is significant = is greater than what would be expected just due to natural random variation (Anders, Genome Biology, 2010). Issues: small number of biological replicates for each exp: not possible to use non-parametric methods such as rank-based or permutation tests, biological data is more variable than predicted by multinomial or Poisson distributions (overdispersion problem): use of Negative Binomial distribution (DESEQ, EdgeR). Basics of DGE: estimate parameters 3 steps: 1) the actual read count for a gene in one sample is drawn from a Negative Binomial distribution, 2) the parameters (mean and standard deviation) of this distribution are estimated from the data (for each gene in each sample), 3) the probability to observe the actual read count for each gene and each sample is computed using the estimated distribution. Basics of DGE: assess significance Null hypothesis (H0): the expression of the gene is the same in both conditions, Calculate a p-value, Adjust the p-value for multiple testing: e.g. adjusted p-value = 0.05 → 95% confidence in rejecting the null hypothesis. Anders, Genome Biology, 2010 Example from GRAPE 1.0 GRAPE 2.0: work in progress Error management and reporting improvement Flexible workflow and configuration Centralized data management Version control system for data and runs Shared project configuration and pipeline setup Take-home messages RNA-seq is becoming the standard for transcriptomic profiling. Need for automated workflow that are: Efficient for large scale transcriptomic projects User-friendly to both configure and execute Modular to incorporate new tools Able to keep track of metadata, program versions and parameters, and organize the data at the project level Able to provide visual information about analysis at both the sample and the project levels. GRAPE intends to be such an approach, however the current release is not user-friendly enough, and therefore a new release is currently under development. Acknowledgements Emilio Palumbo Alessandra Breschi David Gonzalez Maik Röder Paolo Ribeca Micha Sammeth Dmitri Pervouchine Angelika Merkel Roderic Guigó References (I) Sequencing technology review: Metzker, Sequencing technologies - the next generation, Nat Rev Genet, 2010. RNAseq review: Wang, Gerstein, Snyder – RNA-seq: a revolutionary tool for transcriptomics mapping program review: Fonseca,…, Marioni, Tools for mapping high-throughput sequencing data, Bioinformatics, 2012. Ruffalo,..., Koyutürk, Comparative analysis of algorithms for next-generation sequencing read alignment, Bioinformatics, 2011. mapping algorithms GEMsuite: Marco-Sola,..., Ribeca, The GEM mapper: fast, accurate and versatile alignment by filtration, Nat Methods, 2012 / http://algorithms.cnag.cat/wiki/The_GEM_library Ricardo A. Baeza-Yates, Chris H. Perleberg, Fast and practical approximate string matching, Information Processing Letters, Volume 59, Issue 1, 8 July 1996 gemtools: https://github.com/gemtools/gemtools sam/bam format: http://samtools.sourceforge.net/SAMv1.pdf samtools: http://samtools.sourceforge.net/ bamtools: Barnett,…, Marth, BamTools: a C++ API and toolkit for analyzing and managing BAM files, Bioinformatics, 2011. bedtools: Quinlan, …, Hall, BEDTools: a flexible suite of utilities for comparing genomic features, Bioinformatics, 2010 / http://code.google.com/p/bedtools/ 65 References (II) RPKM definition: Mortazavi,..., Wold, Mapping and quantifying mammalian transcriptomes by RNA-Seq, Nat Methods, 2008. Multisplice: Huang, …, Liu, A robust method for transcript quantification with RNA-seq data, RECOMB 2012 Review on transcript quantification methods: Pachter, Models for transcript quantification from RNA-seq, arXiv, 2011 Flux capacitor: Montgomery,..., Dermitzakis, Transcriptome genetics using second generation sequencing in a Caucasian population, Nature, 2010 / http://sammeth.net/confluence/display/FLUX/Home Lovén,...,Young, Revisiting global gene expression analysis, Cell, 2012. UCSC genome browser: http://genome.ucsc.edu/ Differential expression analysis: EdgeR: Robinson,..., Smyth, edgeR: a Bioconductor package for differential expression analysis of digital gene expression data, Bioinformatics, 2010. DESeq: Anders, Huber, Differential expression analysis for sequence count data, Genome Biol. 2010. Cuffdiff: Trapnell,..., Pachter, Differential analysis of gene regulation at transcript resolution with RNA-seq, Nat Biotechnol. 2013. Dillies, …, Jaffrézic, A comprehensive evaluation of normalization methods for Illumina highthroughput..., Brief Bioinform. 2012 Rapaport,..., Betel D, Comprehensive evaluation of differential gene expression analysis methods..., Genome Biol, 2013. Genome Res. 2013 Dec;23(12):1961-73. doi: 10.1101/gr.161315.113. Epub 2013 Oct 30. Functional transcriptomics in the post-ENCODE era. Mudge JM1, Frankish A, Harrow J. RGASP2 papers 66 Raisin visualization of Grape’s QC step. Panels a and b show the distribution of quality scores and ambiguous nucleotides along the length of the reads. Panel c and d show summaries of the number of reads in the dataset as well as the fraction of reads with no ambiguous bases and the number of unique sequences as a percentage of the total Overview of Grape’s mapping strategy. The initial genome and junctions mapping is followed by a round of split mapping. Remaining unmapped reads are aligned with additional mismatches, and the still remaining ones are iteratively trimmed. The mappings resulting from the different steps are combined into a final merged mapping Raisin visualization of Grape’s mapping step. Panel a shows the overall mapping results as well as the information on the genome annotation and number of mismatches used for the alignments. Panel b shows the fraction of reads aligned in the final merged mapping. Panels c, d and e show the same type of information for the different components of the mapping process Distribution of uniquely mapping reads along the annotated transcripts. This allows us to identify biases that may be caused by issues such as RNA degradation Raisin visualization of the transcript quantification step. Panel a shows the distribution of gene expression. Panels b and c show, respectively, the top genes and transcripts detected in the different lanes of the analyzed samples Raisin visualization of Grape’s splicing analysis. Panel a shows the summary table for the detected splice sites. Panel b contains a table with the top included exons in the samples examined, and panel c shows the distribution of the inclusion values over all internal exons Raisin overview of Grape’s analysis results
© Copyright 2025 ExpyDoc