RNA-‐Seq: Analysis Methods FAS Informa:cs April 2, 2014 Agenda • Overview of current Illumina technology • RNA-‐Seq analysis – Quality control – Preprocessing – Alignment to genome – Transcript assembly – Expression es:ma:on – Differen:al expression test Agenda • Overview of current Illumina technology • RNA-‐Seq analysis – Quality control – Preprocessing – Alignment to genome – Transcript assembly – Expression es:ma:on – Differen:al expression test Prepara:on Step: cDNA Library Construc:on • RNA is split into fragments 200-‐500 base pairs long • Hybridiza:on produces cDNA • random hexamer primers • oligodT • Machine-‐specific adapter sequences are ligated • A index can be ligated to each fragment • Fragments are size-‐selected and PCR-‐amplified • Quality control checks library concentra:on and length distribu:on Illumina Technology Step 1: cDNA enters flowcell • Complementary adapter sequences and primers are ligated to flowcell surface • Adaptor at fragment end hybridizes to complementary adapter sequence on the surface. Illumina Technology Step 2: Hybridiza:on and synthesis • Free end bends to surface and hybridizes to a second adapter containing a primer • The primer allows DNA polymerase to replicate the fragment in place via DNA synthesis Illumina Technology Step 3: Denatura:on and Repe::on • The double-‐stranded DNA is denatured, leaving two complementary fragments • Hybridiza)on, DNA synthesis and denatura)on is repeated many :mes to create a cluster of fragments Illumina Technology Step 4: Complement fragments removed • All complementary fragments are removed from the surface • Remaining cluster consists of single-‐stranded, iden:cal copies of original fragment. Called a clonal cluster • A flowcell can contain 40-‐60 million clonal clusters at a :me Illumina Technology Step 5: Single-‐nucleo:de DNA synthesis • Primers are bound to the free fragment ends in the cluster • Fluorescently-‐labeled, reversibly terminated nucleo)des are introduced • One nucleo:de is hybed at the primer site; further synthesis is prevented by the terminator • Unbound nucleo:des are removed and the surface is imaged Illumina Technology Step 6: Imaging • The color of the bound fluorophores reveals the iden:ty of the nucleo:des that were added to the cluster • Clonal clusters amplify the signal that would be generated by a single library fragment Agenda • Overview of current Illumina technology • Example analysis: Differen:al gene expression – Quality control – Preprocessing – Alignment to genome – Transcript assembly – Expression es:ma:on – Differen:al expression test Agenda • Overview of current Illumina technology • RNA-‐Seq analysis – Quality control – Preprocessing – Alignment to genome – Transcript assembly – Expression es:ma:on – Differen:al expression test Example Analysis: Differen:ally-‐expressed genes in cancer • 3 breast cancer samples (TNBC1, TNBC3, TNBC4) • Subtype TNBC, associated with higher recurrence and mortality • 3 control samples (NBS1, NBS2, NBS3) • Normal :ssue • ~35M 60nt read pairs per sample • GEO Accession: GSE52194 • Ques:on: What genes are differen.ally expressed between these two condi.ons? Prepara:on Step: Download data from GEO • Step 1: Find dataset on GEO by accession number (GSE52194) • hbps://www.ncbi.nlm.nih.gov/gds Prepara:on Step: Download data from GEO • Step 3: Download samples to cluster: • module load centos6/sratoolkit.2.3.4-2! ! fastq-dump --split-files SRR1027171! fastq-dump --split-files SRR1027173! fastq-dump --split-files SRR1027174! fastq-dump --split-files SRR1027188 ! ...! ! ! The ! “split-‐files” parameter causes Reads 1 and 2 to be wriben to separate files Data format: Reads are in FASTQ format • File names end with .fq or .fastq • Read pairs ofen across two files (e.g., A.R1.fastq, A.R2.fastq) • Format for a single read: <Header> <Sequence> +<Op:onal Header> <Quality Scores Encoding> • An example read pair: First read in the file TNBC1.R1.fastq (cancer sample): @SRR1027171.1 HWUSI-‐EAS371_0001:1:1:1017:2351 length=60 NAGCCACCAATTAAGAAAGCGTTCAAGCTCAACACCCACTACCTAAAAAATCCCAAACAT +SRR1027171.1 HWUSI-‐EAS371_0001:1:1:1017:2351 length=60 BHHHHJGGGJW`W`S`````RRSSSPPMTP``````Q```WSWWLGHFFH````U````T First read in the file TNBC1.R2.fastq: @SRR1027171.1 HWUSI-‐EAS371_0001:1:1:1017:2351 length=60 TNNNNNNTNGATTGTAGATANNNNNNNNNNNNNNGNCNNNTCNGNNNNNNNNTCNGACGC +SRR1027171.1 HWUSI-‐EAS371_0001:1:1:1017:2351 length=60 BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB Data format: Reads are in FASTQ format • Character range determines quality score encoding type • Example quality string: BHJGJW`W`S```RRSSPPMTP```Q```WSWG P(error) = 10-‐q/10 Q=40 -‐> P(error) = .0001 Q=20 -‐> P(error) = .1 Q=2 -‐> P(error) = .6 hbp://en.wikipedia.org/wiki/FASTQ_format Example Analysis: Differen:ally-‐expressed genes in cancer 1. 2. 3. 4. 5. Quality control Preprocessing Alignment to genome Transcript assembly and expression es:ma:on Differen:al expression test Code and data: /odyssey/informa:cs/examples/rnaseq/breast_cancer Example Analysis: Differen:ally-‐expressed genes in cancer 1. 2. 3. 4. 5. Quality control Preprocessing Alignment to genome Transcript assembly and expression es:ma:on Differen:al expression test Code and data: /odyssey/informa:cs/examples/rnaseq/breast_cancer Differen:ally-‐expressed genes in cancer Step 1: Quality Control with FastQC Script to run FastQC on a sample: ! #!/bin/bash! #SBATCH –N 1 #SBATCH –n 6 #SBATCH -t 100 #SBATCH -p serial_requeue #SBATCH --mem-per-cpu=1000 # # # # # Number of nodes! Number of cores! Runtime in minutes! Partition to submit to! Memory per cpu in MB! ! module load centos6/fastqc-0.10.1! fastqc -t 6 --nogroup Sample1.R1.fastq! fastqc -t 6 --nogroup Sample1.R2.fastq! ! Save to file myscript.sh, and launch job with the command: ! sbatch myscript.sh! ! Quality Score Differen:ally-‐expressed genes in cancer Step 1: Quality Control with FastQC Read1 Base Index Read2 Base Index • Quality scores diminish toward read ends • R2 reads tend to have lower quality than R1, and more rapid quality loss at read end Frequency Differen:ally-‐expressed genes in cancer Step 1: Quality Control with FastQC Read1 Base Index • Sequence bias at read start • Same bias is seen in R1 and R2 Read2 Base Index Sequence bias • Bias due to random hexamer priming • Cannot be removed by trimming reads • Some downstream algorithms (e.g., Cufflinks) try to correct this bias Roberts et al. Genome Biology doi:10.1186/gb-‐2011-‐12-‐3-‐r22 Hansen et al. (2011) Nucleic Acids Res 38(12), e131. Differen:ally-‐expressed genes in cancer 1. 2. 3. 4. 5. Quality control Preprocessing Alignment to genome Transcript assembly and expression es:ma:on Differen:al expression test Differen:ally-‐expressed genes in cancer Step 2: Preprocessing • Remove reads with undetermined bases • Trim poly-‐A tails and any adapter sequence • Poly-‐A tails will appear as poly-‐T tails in cDNA • Small RNAs (miRNA, ncRNA) shorter than read length will be sequenced into the adapter sequence • Trim low-‐quality bases • Require a minimum read length Differen:ally-‐expressed genes in cancer Step 2: Preprocessing Example script to preprocess reads with Trimmoma:c: ! #!/bin/bash! #SBATCH –N 1 #SBATCH –n 6 #SBATCH -t 100 #SBATCH -p serial_requeue #SBATCH --mem-per-cpu=1000 # # # # # Number of nodes! Number of cores! Runtime in minutes! Partition to submit to! Memory per cpu in MB! ! module load centos6/Trimmomatic-0.30! ! java -jar $TRIMMOMATIC/trimmomatic-0.30.jar PE \! -phred64 –threads 6 R1.fq R2.fq \! out.R1.paired.fq out.R1.unpaired.fq \! out.R2.paired.fq out.R2.unpaired.fastq \! ILLUMINACLIP:$TRIMMOMATIC/adapters/TruSeq3-PE.fa:2:30:10 \! SLIDINGWINDOW:40:30 MINLEN:30 LEADING:3 TRAILING:3! ! Differen:ally-‐expressed genes in cancer Step 2: Preprocessing Afer Preprocessing: Quality Score Before Preprocessing: Read Base Index Read Base Index Other Preprocessing Tools • Fastx • Clip adaptor sequences from start/end of reads • Collapse reads (remove duplicates) • Convert between FASTQ and FASTA format • Cutadapt • Remove adaptor sequence • Trim low-‐quality bases from read ends Differen:ally-‐expressed genes in cancer 1. 2. 3. 4. 5. Quality control Preprocessing Alignment to genome Transcript assembly and expression es:ma:on Differen:al expression test Prepara:on Step: Download Genome • Many genomes are available with annota:on and pre-‐built indexes: • hbp://cufflinks.cbcb.umd.edu/igenomes.html Read Alignment Algorithm (Bow:e2) • Short “seed” substrings of reads are aligned via genome index, allowing no gaps • Try to extend seed placement to full read alignment, allowing gaps via dynamic programming Other Tools for Non-‐Spliced Alignment • Aligner comparison by Bow:e2 team, using simulated data: Langmead et al. Nat Methods. ; 9(4): 357–359. TopHat2: Spliced Read Alignment to Reference Genome 1. Align reads to transcriptome (if annota:on provided) 2. Align unaligned reads (and any poorly mapped reads from Step 1) to genome 3. Iden:fy novel transcripts, splice sites, indels and fusions via spliced alignment to genome: 1. Split reads into small segments 2. Align segments to genome 3. Merge aligning segments 4. Re-‐align reads to novel transcript features TopHat2: Spliced Read Alignment Read Exon from annota:on Novel transcript Intron or intergenic region Bow:e2 + Transcriptome Bow:e2 + Genome Bow:e2 + Genome Kim et al. Genome Biology 2013, 14:R36 TopHat2: Spliced Read Alignment Read Novel transcript Intron or intergenic region Kim et al. Genome Biology 2013, 14:R36 Differen:ally-‐expressed genes in cancer Step 3: Spliced alignment with TopHat2 First, generate transcriptome index for Bow:e2: ! #!/bin/bash #SBATCH --nodes=1 # Number of nodes #SBATCH --ntasks=1 # Number of cores #SBATCH --time=00:20:00 # Runtime HH:MM:SS #SBATCH --partition=serial_requeue # Partition! #SBATCH --mem=500 # Memory in MB ! ! module load centos6/tophat-2.0.11.Linux_x86_64! export BOWTIE2_INDEXES=/path/to/hg19! ! tophat -G hg19_genes.gtf \! --transcriptome-index=tophat/genes \! $BOWTIE2_INDEXES/genome! Differen:ally-‐expressed genes in cancer Step 3: Spliced alignment with TopHat2 Now run TopHat2 on a sample: ! #!/bin/bash #SBATCH --nodes=1 #SBATCH --ntasks=32 #SBATCH --time=2-0 #SBATCH --partition=general #SBATCH --mem=24000 # # # # # Number of nodes Number of cores Runtime D-HH:MM:SS Partition Memory in MB ! module load centos6/tophat-2.0.11.Linux_x86_64! export BOWTIE2_INDEXES=/path/to/hg19! ! tophat --output-dir tophat --num-threads 32 \! --transcriptome-index=tophat/genes \! $BOWTIE2_INDEXES/genome \! TNBC1.R1.paired.fastq \! TNBC1.R2.paired.fastq,TNBC1.orphans.fastq! Differen:ally-‐expressed genes in cancer Step 3: Spliced alignment with TopHat2 Visualize alignment BAM file in IGV: hbp://www.broadins:tute.org/igv Other Spliced-‐Alignment Algorithms Compara:ve performance with Illumina 76nt paired-‐end reads, real and simulated: Engström et al. Nat Methods 2013, 10(12):1185-‐1191 Differen:ally-‐expressed genes in cancer 1. 2. 3. 4. 5. Quality control Preprocessing Alignment to genome Transcript assembly and expression es:ma:on Differen:al expression test Transcript Assembly with Reference Genome Trapnell et al. (2010) doi:10.1038/nbt.1621 Transcript Assembly with Reference Genome and Gene Annota:ons Roberts et al. (2011) Bioinforma:cs doi:10.1093/bioinforma:cs/btr355 Expression Es:ma:on • Expression measured in • Reads Per Kilobase of transcript per Million mapped reads (RPKM) • FPKM (Fragments Per Kilobase…) • Gene expression is sum of isoform expression Differen:ally-‐expressed genes in cancer Step 4: Transcript Assembly with Cufflinks Script to run Cufflinks on a sample (TNBC1): ! #!/bin/bash! #SBATCH --nodes=1 #SBATCH --ntasks=10 #SBATCH --time=10:00:00 #SBATCH --partition=serial_requeue #SBATCH --mem=10000 # # # # # Number of nodes! Number of cores! Runtime HH:MM:SS! Partition to submit to! Memory in MB! ! module load centos6/cufflinks-2.2.0.Linux_x86_64! ! cufflinks --num-threads 10 \! --GTF-guide hg19.gtf \! --frag-bias-correct genome.fa \! --multi-read-correct \! TNBC1.bam! Differen:ally-‐expressed genes in cancer Step 4: Transcript Assembly with Cufflinks Script to run Cuffmerge to merge all novel transcripts into one annota:on: ! #!/bin/bash! #SBATCH --nodes=1 #SBATCH --ntasks=6 #SBATCH --time=30:00 #SBATCH --partition=serial_requeue #SBATCH --mem=2000 # # # # # Number of nodes! Number of cores! Runtime MM:SS! Partition to submit to! Memory in MB! ! module load centos6/cufflinks-2.2.0.Linux_x86_64! ! cuffmerge --ref-gtf hg19_genes.gtf --num-threads 6 \! GTFs.txt # GTFs.txt contains list of GTF files to merge! ! Differen:ally-‐expressed genes in cancer 1. 2. 3. 4. 5. Quality control Preprocessing Alignment to genome Transcript assembly and expression es:ma:on Differen:al expression test Differen:al Expression Test • Best to have at least 3 samples per class for sta:s:cal significance Differen:al Expression Test • For each transcript, variance is determined by examining read counts across replicates • Mean and variance (and assump:on of nega:ve binomial distribu:on) gives significance scores for differen:al expression Differen:ally-‐expressed genes in cancer Step 5: Differen:al Expression Test Script to test for differen:al expression between two sample classes:! ! #!/bin/bash ! #SBATCH –N 1 # Number of nodes! #SBATCH –n 8 # Number of cores! #SBATCH –t 3:00:00 # Runtime HH:MM:SS! #SBATCH -p serial_requeue # Partition to submit to! #SBATCH --mem-per-cpu=4000 # Memory per cpu in MB! ! module load centos6/cufflinks-2.1.1.Linux_x86_64! ! cuffdiff -p 8 annotation.gtf \ ! NBS1.bam,NBS2.bam,NBS3.bam \ ! TNBC1.bam,TNBC3.bam,TNBC4.bam! Other tools for differen:al expression • edgeR • Uses nega:ve binomial distribu:on to model read count varia:on • Shares informa:on between genes to es:mate variance • DESeq • Also uses nega:ve binomial distribu:on • More conserva:ve at calling differen:ally expressed genes Visualiza:on of differen:ally expressed genes with CummeRbund Summary: RNA-‐Seq Analysis with Tuxedo Suite Trapnell et al. (2012) Nat Methods. doi:10.1038/nprot.2012.016
© Copyright 2025 ExpyDoc