RNA-Seq Analysis Methods

RNA-­‐Seq: Analysis Methods FAS Informa:cs April 2, 2014 Agenda •  Overview of current Illumina technology •  RNA-­‐Seq analysis –  Quality control –  Preprocessing –  Alignment to genome –  Transcript assembly –  Expression es:ma:on –  Differen:al expression test Agenda •  Overview of current Illumina technology •  RNA-­‐Seq analysis –  Quality control –  Preprocessing –  Alignment to genome –  Transcript assembly –  Expression es:ma:on –  Differen:al expression test Prepara:on Step: cDNA Library Construc:on •  RNA is split into fragments 200-­‐500 base pairs long •  Hybridiza:on produces cDNA •  random hexamer primers •  oligodT •  Machine-­‐specific adapter sequences are ligated •  A index can be ligated to each fragment •  Fragments are size-­‐selected and PCR-­‐amplified •  Quality control checks library concentra:on and length distribu:on Illumina Technology Step 1: cDNA enters flowcell •  Complementary adapter sequences and primers are ligated to flowcell surface •  Adaptor at fragment end hybridizes to complementary adapter sequence on the surface. Illumina Technology Step 2: Hybridiza:on and synthesis •  Free end bends to surface and hybridizes to a second adapter containing a primer •  The primer allows DNA polymerase to replicate the fragment in place via DNA synthesis Illumina Technology Step 3: Denatura:on and Repe::on •  The double-­‐stranded DNA is denatured, leaving two complementary fragments •  Hybridiza)on, DNA synthesis and denatura)on is repeated many :mes to create a cluster of fragments Illumina Technology Step 4: Complement fragments removed •  All complementary fragments are removed from the surface •  Remaining cluster consists of single-­‐stranded, iden:cal copies of original fragment. Called a clonal cluster •  A flowcell can contain 40-­‐60 million clonal clusters at a :me Illumina Technology Step 5: Single-­‐nucleo:de DNA synthesis •  Primers are bound to the free fragment ends in the cluster •  Fluorescently-­‐labeled, reversibly terminated nucleo)des are introduced •  One nucleo:de is hybed at the primer site; further synthesis is prevented by the terminator •  Unbound nucleo:des are removed and the surface is imaged Illumina Technology Step 6: Imaging •  The color of the bound fluorophores reveals the iden:ty of the nucleo:des that were added to the cluster •  Clonal clusters amplify the signal that would be generated by a single library fragment Agenda •  Overview of current Illumina technology •  Example analysis: Differen:al gene expression –  Quality control –  Preprocessing –  Alignment to genome –  Transcript assembly –  Expression es:ma:on –  Differen:al expression test Agenda •  Overview of current Illumina technology •  RNA-­‐Seq analysis –  Quality control –  Preprocessing –  Alignment to genome –  Transcript assembly –  Expression es:ma:on –  Differen:al expression test Example Analysis: Differen:ally-­‐expressed genes in cancer •  3 breast cancer samples (TNBC1, TNBC3, TNBC4) •  Subtype TNBC, associated with higher recurrence and mortality •  3 control samples (NBS1, NBS2, NBS3) •  Normal :ssue •  ~35M 60nt read pairs per sample •  GEO Accession: GSE52194 •  Ques:on: What genes are differen.ally expressed between these two condi.ons? Prepara:on Step: Download data from GEO •  Step 1: Find dataset on GEO by accession number (GSE52194) •  hbps://www.ncbi.nlm.nih.gov/gds Prepara:on Step: Download data from GEO •  Step 3: Download samples to cluster: • 
module load centos6/sratoolkit.2.3.4-2!
!
fastq-dump --split-files SRR1027171!
fastq-dump --split-files SRR1027173!
fastq-dump --split-files SRR1027174!
fastq-dump --split-files SRR1027188 !
...!
!
!
The ! “split-­‐files” parameter causes Reads 1 and 2 to be wriben to separate files Data format: Reads are in FASTQ format •  File names end with .fq or .fastq •  Read pairs ofen across two files (e.g., A.R1.fastq, A.R2.fastq) •  Format for a single read: <Header> <Sequence> +<Op:onal Header> <Quality Scores Encoding> •  An example read pair: First read in the file TNBC1.R1.fastq (cancer sample): @SRR1027171.1 HWUSI-­‐EAS371_0001:1:1:1017:2351 length=60 NAGCCACCAATTAAGAAAGCGTTCAAGCTCAACACCCACTACCTAAAAAATCCCAAACAT +SRR1027171.1 HWUSI-­‐EAS371_0001:1:1:1017:2351 length=60 BHHHHJGGGJW`W`S`````RRSSSPPMTP``````Q```WSWWLGHFFH````U````T First read in the file TNBC1.R2.fastq: @SRR1027171.1 HWUSI-­‐EAS371_0001:1:1:1017:2351 length=60 TNNNNNNTNGATTGTAGATANNNNNNNNNNNNNNGNCNNNTCNGNNNNNNNNTCNGACGC +SRR1027171.1 HWUSI-­‐EAS371_0001:1:1:1017:2351 length=60 BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB Data format: Reads are in FASTQ format •  Character range determines quality score encoding type •  Example quality string: BHJGJW`W`S```RRSSPPMTP```Q```WSWG P(error) = 10-­‐q/10 Q=40 -­‐> P(error) = .0001 Q=20 -­‐> P(error) = .1 Q=2 -­‐> P(error) = .6 hbp://en.wikipedia.org/wiki/FASTQ_format Example Analysis: Differen:ally-­‐expressed genes in cancer 1. 
2. 
3. 
4. 
5. 
Quality control Preprocessing Alignment to genome Transcript assembly and expression es:ma:on Differen:al expression test Code and data: /odyssey/informa:cs/examples/rnaseq/breast_cancer Example Analysis: Differen:ally-­‐expressed genes in cancer 1. 
2. 
3. 
4. 
5. 
Quality control Preprocessing Alignment to genome Transcript assembly and expression es:ma:on Differen:al expression test Code and data: /odyssey/informa:cs/examples/rnaseq/breast_cancer Differen:ally-­‐expressed genes in cancer Step 1: Quality Control with FastQC Script to run FastQC on a sample: !
#!/bin/bash!
#SBATCH –N 1
#SBATCH –n 6
#SBATCH -t 100
#SBATCH -p serial_requeue
#SBATCH --mem-per-cpu=1000
#
#
#
#
#
Number of nodes!
Number of cores!
Runtime in minutes!
Partition to submit to!
Memory per cpu in MB!
!
module load centos6/fastqc-0.10.1!
fastqc -t 6 --nogroup Sample1.R1.fastq!
fastqc -t 6 --nogroup Sample1.R2.fastq!
!
Save to file myscript.sh, and launch job with the command: !
sbatch myscript.sh!
!
Quality Score Differen:ally-­‐expressed genes in cancer Step 1: Quality Control with FastQC Read1 Base Index Read2 Base Index •  Quality scores diminish toward read ends •  R2 reads tend to have lower quality than R1, and more rapid quality loss at read end Frequency Differen:ally-­‐expressed genes in cancer Step 1: Quality Control with FastQC Read1 Base Index •  Sequence bias at read start •  Same bias is seen in R1 and R2 Read2 Base Index Sequence bias •  Bias due to random hexamer priming •  Cannot be removed by trimming reads •  Some downstream algorithms (e.g., Cufflinks) try to correct this bias Roberts et al. Genome Biology doi:10.1186/gb-­‐2011-­‐12-­‐3-­‐r22 Hansen et al. (2011) Nucleic Acids Res 38(12), e131. Differen:ally-­‐expressed genes in cancer 1. 
2. 
3. 
4. 
5. 
Quality control Preprocessing Alignment to genome Transcript assembly and expression es:ma:on Differen:al expression test Differen:ally-­‐expressed genes in cancer Step 2: Preprocessing •  Remove reads with undetermined bases •  Trim poly-­‐A tails and any adapter sequence •  Poly-­‐A tails will appear as poly-­‐T tails in cDNA •  Small RNAs (miRNA, ncRNA) shorter than read length will be sequenced into the adapter sequence •  Trim low-­‐quality bases •  Require a minimum read length Differen:ally-­‐expressed genes in cancer Step 2: Preprocessing Example script to preprocess reads with Trimmoma:c: !
#!/bin/bash!
#SBATCH –N 1
#SBATCH –n 6
#SBATCH -t 100
#SBATCH -p serial_requeue
#SBATCH --mem-per-cpu=1000
#
#
#
#
#
Number of nodes!
Number of cores!
Runtime in minutes!
Partition to submit to!
Memory per cpu in MB!
!
module load centos6/Trimmomatic-0.30!
!
java -jar $TRIMMOMATIC/trimmomatic-0.30.jar PE \!
-phred64 –threads 6 R1.fq R2.fq \!
out.R1.paired.fq out.R1.unpaired.fq \!
out.R2.paired.fq out.R2.unpaired.fastq \!
ILLUMINACLIP:$TRIMMOMATIC/adapters/TruSeq3-PE.fa:2:30:10 \!
SLIDINGWINDOW:40:30 MINLEN:30 LEADING:3 TRAILING:3!
!
Differen:ally-­‐expressed genes in cancer Step 2: Preprocessing Afer Preprocessing: Quality Score Before Preprocessing: Read Base Index Read Base Index Other Preprocessing Tools •  Fastx •  Clip adaptor sequences from start/end of reads •  Collapse reads (remove duplicates) •  Convert between FASTQ and FASTA format •  Cutadapt •  Remove adaptor sequence •  Trim low-­‐quality bases from read ends Differen:ally-­‐expressed genes in cancer 1. 
2. 
3. 
4. 
5. 
Quality control Preprocessing Alignment to genome Transcript assembly and expression es:ma:on Differen:al expression test Prepara:on Step: Download Genome •  Many genomes are available with annota:on and pre-­‐built indexes: •  hbp://cufflinks.cbcb.umd.edu/igenomes.html Read Alignment Algorithm (Bow:e2) •  Short “seed” substrings of reads are aligned via genome index, allowing no gaps •  Try to extend seed placement to full read alignment, allowing gaps via dynamic programming Other Tools for Non-­‐Spliced Alignment •  Aligner comparison by Bow:e2 team, using simulated data: Langmead et al. Nat Methods. ; 9(4): 357–359. TopHat2: Spliced Read Alignment to Reference Genome 1.  Align reads to transcriptome (if annota:on provided) 2.  Align unaligned reads (and any poorly mapped reads from Step 1) to genome 3.  Iden:fy novel transcripts, splice sites, indels and fusions via spliced alignment to genome: 1.  Split reads into small segments 2.  Align segments to genome 3.  Merge aligning segments 4.  Re-­‐align reads to novel transcript features TopHat2: Spliced Read Alignment Read Exon from annota:on Novel transcript Intron or intergenic region Bow:e2 + Transcriptome Bow:e2 + Genome Bow:e2 + Genome Kim et al. Genome Biology 2013, 14:R36 TopHat2: Spliced Read Alignment Read Novel transcript Intron or intergenic region Kim et al. Genome Biology 2013, 14:R36 Differen:ally-­‐expressed genes in cancer Step 3: Spliced alignment with TopHat2 First, generate transcriptome index for Bow:e2: !
#!/bin/bash
#SBATCH --nodes=1
# Number of nodes
#SBATCH --ntasks=1
# Number of cores
#SBATCH --time=00:20:00
# Runtime HH:MM:SS
#SBATCH --partition=serial_requeue # Partition!
#SBATCH --mem=500
# Memory in MB !
!
module load centos6/tophat-2.0.11.Linux_x86_64!
export BOWTIE2_INDEXES=/path/to/hg19!
!
tophat -G hg19_genes.gtf \!
--transcriptome-index=tophat/genes \!
$BOWTIE2_INDEXES/genome!
Differen:ally-­‐expressed genes in cancer Step 3: Spliced alignment with TopHat2 Now run TopHat2 on a sample: !
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --time=2-0
#SBATCH --partition=general
#SBATCH --mem=24000
#
#
#
#
#
Number of nodes
Number of cores
Runtime D-HH:MM:SS
Partition
Memory in MB
!
module load centos6/tophat-2.0.11.Linux_x86_64!
export BOWTIE2_INDEXES=/path/to/hg19!
!
tophat --output-dir tophat --num-threads 32 \!
--transcriptome-index=tophat/genes \!
$BOWTIE2_INDEXES/genome \!
TNBC1.R1.paired.fastq \!
TNBC1.R2.paired.fastq,TNBC1.orphans.fastq!
Differen:ally-­‐expressed genes in cancer Step 3: Spliced alignment with TopHat2 Visualize alignment BAM file in IGV: hbp://www.broadins:tute.org/igv Other Spliced-­‐Alignment Algorithms Compara:ve performance with Illumina 76nt paired-­‐end reads, real and simulated: Engström et al. Nat Methods 2013, 10(12):1185-­‐1191 Differen:ally-­‐expressed genes in cancer 1. 
2. 
3. 
4. 
5. 
Quality control Preprocessing Alignment to genome Transcript assembly and expression es:ma:on Differen:al expression test Transcript Assembly with Reference Genome Trapnell et al. (2010) doi:10.1038/nbt.1621 Transcript Assembly with Reference Genome and Gene Annota:ons Roberts et al. (2011) Bioinforma:cs doi:10.1093/bioinforma:cs/btr355 Expression Es:ma:on •  Expression measured in •  Reads Per Kilobase of transcript per Million mapped reads (RPKM) •  FPKM (Fragments Per Kilobase…) •  Gene expression is sum of isoform expression Differen:ally-­‐expressed genes in cancer Step 4: Transcript Assembly with Cufflinks Script to run Cufflinks on a sample (TNBC1): !
#!/bin/bash!
#SBATCH --nodes=1
#SBATCH --ntasks=10
#SBATCH --time=10:00:00
#SBATCH --partition=serial_requeue
#SBATCH --mem=10000
#
#
#
#
#
Number of nodes!
Number of cores!
Runtime HH:MM:SS!
Partition to submit to!
Memory in MB!
!
module load centos6/cufflinks-2.2.0.Linux_x86_64!
!
cufflinks --num-threads 10 \!
--GTF-guide hg19.gtf \!
--frag-bias-correct genome.fa \!
--multi-read-correct \!
TNBC1.bam!
Differen:ally-­‐expressed genes in cancer Step 4: Transcript Assembly with Cufflinks Script to run Cuffmerge to merge all novel transcripts into one annota:on: !
#!/bin/bash!
#SBATCH --nodes=1
#SBATCH --ntasks=6
#SBATCH --time=30:00
#SBATCH --partition=serial_requeue
#SBATCH --mem=2000
#
#
#
#
#
Number of nodes!
Number of cores!
Runtime MM:SS!
Partition to submit to!
Memory in MB!
!
module load centos6/cufflinks-2.2.0.Linux_x86_64!
!
cuffmerge --ref-gtf hg19_genes.gtf --num-threads 6 \!
GTFs.txt
# GTFs.txt contains list of GTF files to merge!
!
Differen:ally-­‐expressed genes in cancer 1. 
2. 
3. 
4. 
5. 
Quality control Preprocessing Alignment to genome Transcript assembly and expression es:ma:on Differen:al expression test Differen:al Expression Test •  Best to have at least 3 samples per class for sta:s:cal significance Differen:al Expression Test •  For each transcript, variance is determined by examining read counts across replicates •  Mean and variance (and assump:on of nega:ve binomial distribu:on) gives significance scores for differen:al expression Differen:ally-­‐expressed genes in cancer Step 5: Differen:al Expression Test Script to test for differen:al expression between two sample classes:!
!
#!/bin/bash !
#SBATCH –N 1
# Number of nodes!
#SBATCH –n 8
# Number of cores!
#SBATCH –t 3:00:00
# Runtime HH:MM:SS!
#SBATCH -p serial_requeue
# Partition to submit to!
#SBATCH --mem-per-cpu=4000
# Memory per cpu in MB!
!
module load centos6/cufflinks-2.1.1.Linux_x86_64!
!
cuffdiff -p 8 annotation.gtf \ !
NBS1.bam,NBS2.bam,NBS3.bam \ !
TNBC1.bam,TNBC3.bam,TNBC4.bam!
Other tools for differen:al expression •  edgeR •  Uses nega:ve binomial distribu:on to model read count varia:on •  Shares informa:on between genes to es:mate variance •  DESeq •  Also uses nega:ve binomial distribu:on •  More conserva:ve at calling differen:ally expressed genes Visualiza:on of differen:ally expressed genes with CummeRbund Summary: RNA-­‐Seq Analysis with Tuxedo Suite Trapnell et al. (2012) Nat Methods. doi:10.1038/nprot.2012.016