STAT435 - Lecture 6 - University of Otago

STAT435 - Lecture 6
RNA-seq - counts and analysis in R
Mik Black
Department of Biochemistry, University of Otago
RNA-seq: generating count data
· In lecture 5 we used SeqMonk to read in aligned sequence data (BAM
files) and generate read counts.
· The output was a text file containing a value representing the number
of reads that mapped (aligned) to each gene in the genome, across
each sample in the experiment.
· These values represent the level of gene expression activity for each
gene, in each sample.
· This information can also be generated in R.
2/30
Attribution notice
· The code in the following slides has been extracted from a presentation
given by Associate Professor Thomas Girke, Institute for Integrative
Genome Biology, UC Riverside:
http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/Rrnaseq.pdf
· Additional (excellent) R code for genomic analysis is available from
Thomas at:
http://manuals.bioinformatics.ucr.edu/home/ht-seq
3/30
Example data
· The example we'll follow along with is from an Arabidopsis Thaliana
flowering time RNA-seq experiment (a subset of 4 samples is used
here).
· Two conditions:
1. AP3: ``AP3 domain, flower stage 4''
2. TRL : ``Translatome, flower stage 4''
· Two replicates of each condition.
4/30
Count data in R: getting started
library(rtracklayer)
library(GenomicRanges)
library(GenomicAlignments)
library(Rsamtools)
· GenomicRanges: specialized data structures for holding information
about genomic intervals.
· Rsamtools: support for manipulating BAM files.
· rtracklayer: tools for working with annotation data (GFF files here), and
interacting with online genome browsers.
Code modified from:
http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/Rrnaseq.pdf
5/30
Genome annotation
## Import GFF file into R:
gff <- import.gff("TAIR10_GFF3_trunc.gff", asRangedData=FALSE)
· GFF file contains annotation (e.g., gene positions):
##gff-version 1
##date 2011-05-26
Chr1
TAIR10 chromosome
Chr1
TAIR10 gene
3631
Chr1
TAIR10 mRNA
3631
Chr1
TAIR10 protein 3760
Chr1
TAIR10 exon
3631
Chr1
TAIR10 five_prime_UTR
Chr1
TAIR10 CDS
3760
Chr1
TAIR10 exon
3996
Chr1
TAIR10 CDS
3996
Chr1
TAIR10 exon
4486
Chr1
TAIR10 CDS
4486
Chr1
TAIR10 exon
4706
Chr1
TAIR10 CDS
4706
Chr1
TAIR10 exon
5174
1
5899
5899
5630
3913
3631
3913
4276
4276
4605
4605
5095
5095
5326
100000
0
0
0
0
3759
0
0
0
0
0
0
0
0
0
+
+
+
+
0
+
+
+
+
+
+
+
+
*
.
.
.
.
+
0
.
2
.
0
.
0
.
.
ID=Chr1;Name=Chr1
ID=AT1G01010;Note=protein_codi
ID=AT1G01010.1;Parent=AT1G0101
ID=AT1G01010.1-Protein;Name=AT
Parent=AT1G01010.1
.
Parent=AT1G01010.1
Parent=AT1G01010.1,AT1G01010.1
Parent=AT1G01010.1
Parent=AT1G01010.1,AT1G01010.1
Parent=AT1G01010.1
Parent=AT1G01010.1,AT1G01010.1
Parent=AT1G01010.1
Parent=AT1G01010.1,AT1G01010.1
6/30
Parent=AT1G01010.1
Arabidopis (truncated) gene location info
seqlengths(gff) <- end(ranges(gff[which(elementMetadata(gff)[,"type"]=="chromosome"),]))
subgene_index <- which(elementMetadata(gff)[,"type"] == "exon")
# Extract just the gene ranges
gffsub <- gff[subgene_index,]
gffsub[1:2, c(2,5)]
## GRanges with 2 ranges and 2 metadata columns:
##
seqnames
ranges strand |
type
group
##
<Rle>
<IRanges> <Rle> | <factor>
<factor>
##
[1]
Chr1 [3631, 3913]
+ |
exon Parent=AT1G01010.1
##
[2]
Chr1 [3996, 4276]
+ |
exon Parent=AT1G01010.1
##
##
##
##
--seqlengths:
Chr1
Chr2
Chr3
Chr4
Chr5
ChrC
ChrM
100000 100000 100000 100000 100000 100000 100000
Code modified from:
http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/Rrnaseq.pdf
7/30
ids <- gsub("Parent=|\\..*", "", elementMetadata(gffsub)$group)
gffsub <- split(gffsub, ids)
gffsub
## $AT1G01010
## GRanges with 6 ranges and 5 metadata columns:
##
seqnames
ranges strand |
source
type
score
phase
##
##
[1]
<Rle>
<IRanges>
Chr1 [3631, 3913]
##
##
[2]
[3]
Chr1 [3996, 4276]
Chr1 [4486, 4605]
+ |
+ |
TAIR10
TAIR10
exon
exon
0
0
<NA>
<NA>
##
##
##
##
##
[4]
[5]
[6]
Chr1 [4706, 5095]
Chr1 [5174, 5326]
Chr1 [5439, 5899]
group
<factor>
+ |
+ |
+ |
TAIR10
TAIR10
TAIR10
exon
exon
exon
0
0
0
<NA>
<NA>
<NA>
##
##
[1] Parent=AT1G01010.1
[2] Parent=AT1G01010.1
##
##
##
##
[3]
[4]
[5]
[6]
Parent=AT1G01010.1
Parent=AT1G01010.1
Parent=AT1G01010.1
Parent=AT1G01010.1
<Rle> | <factor> <factor> <numeric> <integer>
+ |
TAIR10
exon
0
<NA>
8/30
Read BAM files and generate count data
samples <- readLines('bamfiles.txt')
samplespath <- paste("./BAMfiles/", samples, sep="")
names(samplespath) <- samples
bfl <- BamFileList(samplespath, index=character())
countDF <- summarizeOverlaps(gffsub, bfl, mode="Union", ignore.strand=TRUE)
countDF <- assays(countDF)$counts
countDF[1:4,]
##
SRR064154.bam SRR064155.bam SRR064166.bam SRR064167.bam
## AT1G01010
52
26
60
75
## AT1G01020
146
77
82
64
## AT1G01030
## AT1G01040
5
480
1
346
13
282
14
335
Code modified from:
http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/Rrnaseq.pdf
9/30
Count data!
· At this point, we have generated read counts from the aligned data
stored in the BAM files.
· It is in matrix form, so is very similar to the format produced by
SeqMonk (although the SeqMonk outout has a whole lot of extra
columns that need to be removed).
class(countDF)
## [1] "matrix"
dim(countDF)
## [1] 151
4
10/30
Detecting differential expression
· From here, the process of detecting differentially expressed genes is
similar to what we used for microarray data.
· Specialized packages exist for detecting "differentially expressed
genes" (DEGs) from RNA-seq experiments:
library(DESeq)
library(edgeR)
· Limma also contains methods for analysing RNA-seq data.
· Each method is slightly different - we'll compare the results produced
by each.
11/30
RPKM normalization
· This function generates RPKM values for each sample (i.e., normalized
by gene length and total reads per sample).
returnRPKM <- function(counts, gffsub) {
## Length of exon union per gene in kbp
geneLengthsInKB <- sum(width(reduce(gffsub)))/1000
## Factor for converting to million of mapped reads.
millionsMapped <- sum(counts)/1e+06
## RPK: reads per kilobase of exon model.
rpm <- counts/millionsMapped
## RPKM: reads per kilobase of exon model per million mapped reads.
rpkm <- rpm/geneLengthsInKB
return(rpkm)
}
Code modified from:
http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/Rrnaseq.pdf
12/30
RPKM normalization
countDFrpkm <- apply(countDF, 2, function(x) returnRPKM(counts=x, gffsub=gffsub))
countDFrpkm[1:8,]
##
SRR064154.bam SRR064155.bam SRR064166.bam SRR064167.bam
## AT1G01010
52.499
24.6495
389.82
366.89
##
##
##
##
##
AT1G01020
AT1G01030
AT1G01040
AT1G01046
AT1G01050
140.256
4.473
130.800
74.096
506.374
69.4616
0.8401
88.5372
61.8482
429.5948
506.92
74.84
494.51
52.98
9135.28
297.90
60.68
442.32
239.35
7995.16
## AT1G01060
115.933
66.3090
1237.75
1220.49
## AT1G01070
9.403
4.4147
385.73
301.82
Code modified from:
http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/Rrnaseq.pdf
13/30
RPKM - per group means (AP3 and TRL)
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/colAg.R")
countDFrpkm_mean <- colAg(myMA=countDFrpkm, group=c(1,1,2,2),myfct=mean)
countDFrpkm_mean[1:8,]
##
##
##
##
##
##
SRR064154.bam_SRR064155.bam SRR064166.bam_SRR064167.bam
AT1G01010
AT1G01020
AT1G01030
AT1G01040
AT1G01046
38.574
104.859
2.657
109.668
67.972
378.35
402.41
67.76
468.41
146.16
## AT1G01050
467.984
8565.22
## AT1G01060
## AT1G01070
91.121
6.909
1229.12
343.78
Code modified from:
http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/Rrnaseq.pdf
14/30
RPKM - log ratios
· Find genes with DE log-ratios larger than 2.
countDFrpkm_mean <- cbind(countDFrpkm_mean,
log2ratio=log2(countDFrpkm_mean[,2]/countDFrpkm_mean[,1]))
countDFrpkm_mean <- countDFrpkm_mean[is.finite(countDFrpkm_mean[,3]),]
degs2fold <- countDFrpkm_mean[countDFrpkm_mean[,3] >= 1 | countDFrpkm_mean[,3] <= -1,]
colnames(degs2fold) <- gsub(".bam","",colnames(degs2fold))
degs2fold[1:4,]
##
SRR064154_SRR064155 SRR064166_SRR064167 log2ratio
## AT1G01010
## AT1G01020
38.574
104.859
378.35
402.41
3.294
1.940
## AT1G01030
## AT1G01040
2.657
109.668
67.76
468.41
4.673
2.095
Code modified from:
http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/Rrnaseq.pdf
15/30
Analysis using the DESeq package
library(DESeq)
conds <- c("AP3","AP3","TRL","TRL")
# Create object of class CountDataSet (like microarray eSet class)
# Note - raw count data is used, not RPKM values
cds <- newCountDataSet(countDF, conds)
# CountDataSet has similar characteristics/methods as eSet class
counts(cds)[1:4, ]
##
SRR064154.bam SRR064155.bam SRR064166.bam SRR064167.bam
## AT1G01010
52
26
60
75
## AT1G01020
## AT1G01030
## AT1G01040
146
5
480
77
1
346
82
13
282
64
14
335
Code modified from:
http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/Rrnaseq.pdf
16/30
Analysis using the DESeq package
# Estimate library size factors from count data.
cds <- estimateSizeFactors(cds)
# Estimate the variance within replicates
cds <- estimateDispersions(cds)
# Identify DEGs using nbinomTest
res <- nbinomTest(cds, "AP3", "TRL")
res <- na.omit(res)
res2fold <- res[res$log2FoldChange >= 1 | res$log2FoldChange <= -1,]
res2foldpadj <- res2fold[res2fold$padj <= 0.05, ]
res2foldpadj[1:2,-2]
##
id baseMeanA baseMeanB foldChange log2FoldChange
pval
## 6 AT1G01050
272.3
910.1
3.342
1.741 1.711e-11
## 7 AT1G01060
169.4
434.5
2.564
1.358 4.558e-06
##
padj
## 6 1.898e-10
## 7 3.186e-05
Code modified from:
17/30
Analysis using the edgeR package
library(edgeR)
countDF <- read.table("./results/countDF")
# Construct a DGEList object (note - raw count data, not RPKM)
y <- DGEList(counts=countDF, group=conds)
# Estimates common dispersion (overall variability)
y <- estimateCommonDisp(y)
# Estimates tagwise dispersion (gene specific variance)
y <- estimateTagwiseDisp(y)
# Use negative binomial distirbution to perform count-based test
et <- exactTest(y, pair=c("AP3", "TRL"))
Code modified from:
http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/Rrnaseq.pdf
18/30
Analysis using the edgeR package
topTags(et, n=4)
## Comparison of groups: TRL-AP3
##
logFC logCPM
PValue
FDR
## AT4G00050 5.571 10.39 3.590e-58 5.420e-56
## AT1G01050 4.213
## AT3G01120 3.862
## AT3G01150 3.159
12.11 4.916e-48 3.711e-46
14.08 1.882e-45 9.472e-44
10.99 5.736e-34 2.165e-32
edge <- as.data.frame(topTags(et, n=50000))
edge2fold <- edge[edge$logFC >= 1 | edge$logFC <= -1,]
edge2foldpadj <- edge2fold[edge2fold$FDR <= 0.01, ]
Code modified from:
http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/Rrnaseq.pdf
19/30
Compare the results of the three methods
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/overLapper.R")
setlist <- list(edgeR=rownames(edge2foldpadj),
DESeq=as.character(res2foldpadj[,1]),
RPKM=rownames(degs2fold))
OLlist <- overLapper(setlist=setlist, sep="_", type="vennsets")
counts <- sapply(OLlist$Venn_List, length)
Code modified from:
http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/Rrnaseq.pdf
20/30
Comparing results across methods
vennPlot(counts=counts)
21/30
Limma for DEG detection in RNA-seq data
· Limma contains similar methods (to DESeq) for analysing RNA-seq
data.
· Makes a correction to account for increased variability at low count
numbers.
library(limma)
Counts<-countDF[rowSums(cpm(countDF))>9,]
nf <- calcNormFactors(Counts)
design <- model.matrix(~conds)
Code modified from limma users guide documentation.
22/30
Limma for DEG detection in RNA-seq data
y <- voom(Counts,design,plot=TRUE,lib.size=colSums(Counts)*nf)
23/30
Limma - detecting DEGs
· Methodoloy for identifying differentially expressed genes from RNA-seq
data is the same as we used for microarrays.
fit <- lmFit(y,design)
fit <- eBayes(fit)
tt<-topTable(fit,coef=2,n=nrow(Counts))
topTable(fit,coef=2,n=6)
##
logFC AveExpr
t
P.Value adj.P.Val
B
## ATMG00020 -4.999
## AT2G01021 -4.713
## ATMG00030 -6.123
13.274 -35.68 4.170e-08 4.378e-06 9.697
16.130 -28.56 1.534e-07 8.055e-06 8.520
9.850 -25.82 2.769e-07 9.692e-06 7.502
## AT2G01010 -2.605
## ATCG00490 -5.590
19.452 -18.55 1.898e-06 4.983e-05 5.841
8.706 -14.53 7.786e-06 1.635e-04 4.562
## AT4G00050
3.162
8.650
13.07 1.427e-05 2.498e-04 4.003
Code modified from limma users guide documentation.
24/30
Compare Limma results to other methods
setlist <- list(edgeR=rownames(edge2foldpadj),
DESeq=as.character(res2foldpadj[,1]),
Limma=rownames(tt)[tt$adj.P.Val<0.05])
OLlist <- overLapper(setlist=setlist, sep="_", type="vennsets")
counts <- sapply(OLlist$Venn_List, length)
Code modified from:
http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/Rrnaseq.pdf
25/30
Compare Limma results to other methods
vennPlot(counts=counts)
26/30
Gene set analysis (Gene Ontology)
library(GOstats)
library(GO.db)
library(ath1121501.db)
geneUniverse <- rownames(countDF)
geneSample <- res2foldpadj[,1]
params <- new("GOHyperGParams", geneIds = geneSample,
universeGeneIds = geneUniverse,
annotation="ath1121501", ontology = "MF",
pvalueCutoff = 0.5,
conditional = FALSE, testDirection = "over")
hgOver <- hyperGTest(params)
htmlReport(hgOver, file = "Output/MyhyperGresult.html")
Code modified from:
http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/Rrnaseq.pdf
27/30
Gene set analysis (Gene Ontology)
summary(hgOver)[1:4,]
##
GOMFID
Pvalue OddsRatio ExpCount Count Size
## 1 GO:0008324 0.002673
18
2.127
6
7
## 2 GO:0015075 0.002673
18
2.127
6
7
##
##
##
##
##
3 GO:0015077 0.002673
4 GO:0015078 0.002673
1
2
18
18
2.127
2.127
6
6
7
7
Term
cation transmembrane transporter activity
ion transmembrane transporter activity
## 3 monovalent inorganic cation transmembrane transporter activity
## 4
hydrogen ion transmembrane transporter activity
Code modified from:
http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/Rrnaseq.pdf
28/30
Simple HTML tabular output of GO analysis
29/30
Summary
· With a little work it is possible to generate read count data in R.
· There are a number of specialized packages for analyzing count data
from RNA-seq experiment.
· These packages give similar (but not identical) results.
30/30