STAT435 - Lecture 6 RNA-seq - counts and analysis in R Mik Black Department of Biochemistry, University of Otago RNA-seq: generating count data · In lecture 5 we used SeqMonk to read in aligned sequence data (BAM files) and generate read counts. · The output was a text file containing a value representing the number of reads that mapped (aligned) to each gene in the genome, across each sample in the experiment. · These values represent the level of gene expression activity for each gene, in each sample. · This information can also be generated in R. 2/30 Attribution notice · The code in the following slides has been extracted from a presentation given by Associate Professor Thomas Girke, Institute for Integrative Genome Biology, UC Riverside: http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/Rrnaseq.pdf · Additional (excellent) R code for genomic analysis is available from Thomas at: http://manuals.bioinformatics.ucr.edu/home/ht-seq 3/30 Example data · The example we'll follow along with is from an Arabidopsis Thaliana flowering time RNA-seq experiment (a subset of 4 samples is used here). · Two conditions: 1. AP3: ``AP3 domain, flower stage 4'' 2. TRL : ``Translatome, flower stage 4'' · Two replicates of each condition. 4/30 Count data in R: getting started library(rtracklayer) library(GenomicRanges) library(GenomicAlignments) library(Rsamtools) · GenomicRanges: specialized data structures for holding information about genomic intervals. · Rsamtools: support for manipulating BAM files. · rtracklayer: tools for working with annotation data (GFF files here), and interacting with online genome browsers. Code modified from: http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/Rrnaseq.pdf 5/30 Genome annotation ## Import GFF file into R: gff <- import.gff("TAIR10_GFF3_trunc.gff", asRangedData=FALSE) · GFF file contains annotation (e.g., gene positions): ##gff-version 1 ##date 2011-05-26 Chr1 TAIR10 chromosome Chr1 TAIR10 gene 3631 Chr1 TAIR10 mRNA 3631 Chr1 TAIR10 protein 3760 Chr1 TAIR10 exon 3631 Chr1 TAIR10 five_prime_UTR Chr1 TAIR10 CDS 3760 Chr1 TAIR10 exon 3996 Chr1 TAIR10 CDS 3996 Chr1 TAIR10 exon 4486 Chr1 TAIR10 CDS 4486 Chr1 TAIR10 exon 4706 Chr1 TAIR10 CDS 4706 Chr1 TAIR10 exon 5174 1 5899 5899 5630 3913 3631 3913 4276 4276 4605 4605 5095 5095 5326 100000 0 0 0 0 3759 0 0 0 0 0 0 0 0 0 + + + + 0 + + + + + + + + * . . . . + 0 . 2 . 0 . 0 . . ID=Chr1;Name=Chr1 ID=AT1G01010;Note=protein_codi ID=AT1G01010.1;Parent=AT1G0101 ID=AT1G01010.1-Protein;Name=AT Parent=AT1G01010.1 . Parent=AT1G01010.1 Parent=AT1G01010.1,AT1G01010.1 Parent=AT1G01010.1 Parent=AT1G01010.1,AT1G01010.1 Parent=AT1G01010.1 Parent=AT1G01010.1,AT1G01010.1 Parent=AT1G01010.1 Parent=AT1G01010.1,AT1G01010.1 6/30 Parent=AT1G01010.1 Arabidopis (truncated) gene location info seqlengths(gff) <- end(ranges(gff[which(elementMetadata(gff)[,"type"]=="chromosome"),])) subgene_index <- which(elementMetadata(gff)[,"type"] == "exon") # Extract just the gene ranges gffsub <- gff[subgene_index,] gffsub[1:2, c(2,5)] ## GRanges with 2 ranges and 2 metadata columns: ## seqnames ranges strand | type group ## <Rle> <IRanges> <Rle> | <factor> <factor> ## [1] Chr1 [3631, 3913] + | exon Parent=AT1G01010.1 ## [2] Chr1 [3996, 4276] + | exon Parent=AT1G01010.1 ## ## ## ## --seqlengths: Chr1 Chr2 Chr3 Chr4 Chr5 ChrC ChrM 100000 100000 100000 100000 100000 100000 100000 Code modified from: http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/Rrnaseq.pdf 7/30 ids <- gsub("Parent=|\\..*", "", elementMetadata(gffsub)$group) gffsub <- split(gffsub, ids) gffsub ## $AT1G01010 ## GRanges with 6 ranges and 5 metadata columns: ## seqnames ranges strand | source type score phase ## ## [1] <Rle> <IRanges> Chr1 [3631, 3913] ## ## [2] [3] Chr1 [3996, 4276] Chr1 [4486, 4605] + | + | TAIR10 TAIR10 exon exon 0 0 <NA> <NA> ## ## ## ## ## [4] [5] [6] Chr1 [4706, 5095] Chr1 [5174, 5326] Chr1 [5439, 5899] group <factor> + | + | + | TAIR10 TAIR10 TAIR10 exon exon exon 0 0 0 <NA> <NA> <NA> ## ## [1] Parent=AT1G01010.1 [2] Parent=AT1G01010.1 ## ## ## ## [3] [4] [5] [6] Parent=AT1G01010.1 Parent=AT1G01010.1 Parent=AT1G01010.1 Parent=AT1G01010.1 <Rle> | <factor> <factor> <numeric> <integer> + | TAIR10 exon 0 <NA> 8/30 Read BAM files and generate count data samples <- readLines('bamfiles.txt') samplespath <- paste("./BAMfiles/", samples, sep="") names(samplespath) <- samples bfl <- BamFileList(samplespath, index=character()) countDF <- summarizeOverlaps(gffsub, bfl, mode="Union", ignore.strand=TRUE) countDF <- assays(countDF)$counts countDF[1:4,] ## SRR064154.bam SRR064155.bam SRR064166.bam SRR064167.bam ## AT1G01010 52 26 60 75 ## AT1G01020 146 77 82 64 ## AT1G01030 ## AT1G01040 5 480 1 346 13 282 14 335 Code modified from: http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/Rrnaseq.pdf 9/30 Count data! · At this point, we have generated read counts from the aligned data stored in the BAM files. · It is in matrix form, so is very similar to the format produced by SeqMonk (although the SeqMonk outout has a whole lot of extra columns that need to be removed). class(countDF) ## [1] "matrix" dim(countDF) ## [1] 151 4 10/30 Detecting differential expression · From here, the process of detecting differentially expressed genes is similar to what we used for microarray data. · Specialized packages exist for detecting "differentially expressed genes" (DEGs) from RNA-seq experiments: library(DESeq) library(edgeR) · Limma also contains methods for analysing RNA-seq data. · Each method is slightly different - we'll compare the results produced by each. 11/30 RPKM normalization · This function generates RPKM values for each sample (i.e., normalized by gene length and total reads per sample). returnRPKM <- function(counts, gffsub) { ## Length of exon union per gene in kbp geneLengthsInKB <- sum(width(reduce(gffsub)))/1000 ## Factor for converting to million of mapped reads. millionsMapped <- sum(counts)/1e+06 ## RPK: reads per kilobase of exon model. rpm <- counts/millionsMapped ## RPKM: reads per kilobase of exon model per million mapped reads. rpkm <- rpm/geneLengthsInKB return(rpkm) } Code modified from: http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/Rrnaseq.pdf 12/30 RPKM normalization countDFrpkm <- apply(countDF, 2, function(x) returnRPKM(counts=x, gffsub=gffsub)) countDFrpkm[1:8,] ## SRR064154.bam SRR064155.bam SRR064166.bam SRR064167.bam ## AT1G01010 52.499 24.6495 389.82 366.89 ## ## ## ## ## AT1G01020 AT1G01030 AT1G01040 AT1G01046 AT1G01050 140.256 4.473 130.800 74.096 506.374 69.4616 0.8401 88.5372 61.8482 429.5948 506.92 74.84 494.51 52.98 9135.28 297.90 60.68 442.32 239.35 7995.16 ## AT1G01060 115.933 66.3090 1237.75 1220.49 ## AT1G01070 9.403 4.4147 385.73 301.82 Code modified from: http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/Rrnaseq.pdf 13/30 RPKM - per group means (AP3 and TRL) source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/colAg.R") countDFrpkm_mean <- colAg(myMA=countDFrpkm, group=c(1,1,2,2),myfct=mean) countDFrpkm_mean[1:8,] ## ## ## ## ## ## SRR064154.bam_SRR064155.bam SRR064166.bam_SRR064167.bam AT1G01010 AT1G01020 AT1G01030 AT1G01040 AT1G01046 38.574 104.859 2.657 109.668 67.972 378.35 402.41 67.76 468.41 146.16 ## AT1G01050 467.984 8565.22 ## AT1G01060 ## AT1G01070 91.121 6.909 1229.12 343.78 Code modified from: http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/Rrnaseq.pdf 14/30 RPKM - log ratios · Find genes with DE log-ratios larger than 2. countDFrpkm_mean <- cbind(countDFrpkm_mean, log2ratio=log2(countDFrpkm_mean[,2]/countDFrpkm_mean[,1])) countDFrpkm_mean <- countDFrpkm_mean[is.finite(countDFrpkm_mean[,3]),] degs2fold <- countDFrpkm_mean[countDFrpkm_mean[,3] >= 1 | countDFrpkm_mean[,3] <= -1,] colnames(degs2fold) <- gsub(".bam","",colnames(degs2fold)) degs2fold[1:4,] ## SRR064154_SRR064155 SRR064166_SRR064167 log2ratio ## AT1G01010 ## AT1G01020 38.574 104.859 378.35 402.41 3.294 1.940 ## AT1G01030 ## AT1G01040 2.657 109.668 67.76 468.41 4.673 2.095 Code modified from: http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/Rrnaseq.pdf 15/30 Analysis using the DESeq package library(DESeq) conds <- c("AP3","AP3","TRL","TRL") # Create object of class CountDataSet (like microarray eSet class) # Note - raw count data is used, not RPKM values cds <- newCountDataSet(countDF, conds) # CountDataSet has similar characteristics/methods as eSet class counts(cds)[1:4, ] ## SRR064154.bam SRR064155.bam SRR064166.bam SRR064167.bam ## AT1G01010 52 26 60 75 ## AT1G01020 ## AT1G01030 ## AT1G01040 146 5 480 77 1 346 82 13 282 64 14 335 Code modified from: http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/Rrnaseq.pdf 16/30 Analysis using the DESeq package # Estimate library size factors from count data. cds <- estimateSizeFactors(cds) # Estimate the variance within replicates cds <- estimateDispersions(cds) # Identify DEGs using nbinomTest res <- nbinomTest(cds, "AP3", "TRL") res <- na.omit(res) res2fold <- res[res$log2FoldChange >= 1 | res$log2FoldChange <= -1,] res2foldpadj <- res2fold[res2fold$padj <= 0.05, ] res2foldpadj[1:2,-2] ## id baseMeanA baseMeanB foldChange log2FoldChange pval ## 6 AT1G01050 272.3 910.1 3.342 1.741 1.711e-11 ## 7 AT1G01060 169.4 434.5 2.564 1.358 4.558e-06 ## padj ## 6 1.898e-10 ## 7 3.186e-05 Code modified from: 17/30 Analysis using the edgeR package library(edgeR) countDF <- read.table("./results/countDF") # Construct a DGEList object (note - raw count data, not RPKM) y <- DGEList(counts=countDF, group=conds) # Estimates common dispersion (overall variability) y <- estimateCommonDisp(y) # Estimates tagwise dispersion (gene specific variance) y <- estimateTagwiseDisp(y) # Use negative binomial distirbution to perform count-based test et <- exactTest(y, pair=c("AP3", "TRL")) Code modified from: http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/Rrnaseq.pdf 18/30 Analysis using the edgeR package topTags(et, n=4) ## Comparison of groups: TRL-AP3 ## logFC logCPM PValue FDR ## AT4G00050 5.571 10.39 3.590e-58 5.420e-56 ## AT1G01050 4.213 ## AT3G01120 3.862 ## AT3G01150 3.159 12.11 4.916e-48 3.711e-46 14.08 1.882e-45 9.472e-44 10.99 5.736e-34 2.165e-32 edge <- as.data.frame(topTags(et, n=50000)) edge2fold <- edge[edge$logFC >= 1 | edge$logFC <= -1,] edge2foldpadj <- edge2fold[edge2fold$FDR <= 0.01, ] Code modified from: http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/Rrnaseq.pdf 19/30 Compare the results of the three methods source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/overLapper.R") setlist <- list(edgeR=rownames(edge2foldpadj), DESeq=as.character(res2foldpadj[,1]), RPKM=rownames(degs2fold)) OLlist <- overLapper(setlist=setlist, sep="_", type="vennsets") counts <- sapply(OLlist$Venn_List, length) Code modified from: http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/Rrnaseq.pdf 20/30 Comparing results across methods vennPlot(counts=counts) 21/30 Limma for DEG detection in RNA-seq data · Limma contains similar methods (to DESeq) for analysing RNA-seq data. · Makes a correction to account for increased variability at low count numbers. library(limma) Counts<-countDF[rowSums(cpm(countDF))>9,] nf <- calcNormFactors(Counts) design <- model.matrix(~conds) Code modified from limma users guide documentation. 22/30 Limma for DEG detection in RNA-seq data y <- voom(Counts,design,plot=TRUE,lib.size=colSums(Counts)*nf) 23/30 Limma - detecting DEGs · Methodoloy for identifying differentially expressed genes from RNA-seq data is the same as we used for microarrays. fit <- lmFit(y,design) fit <- eBayes(fit) tt<-topTable(fit,coef=2,n=nrow(Counts)) topTable(fit,coef=2,n=6) ## logFC AveExpr t P.Value adj.P.Val B ## ATMG00020 -4.999 ## AT2G01021 -4.713 ## ATMG00030 -6.123 13.274 -35.68 4.170e-08 4.378e-06 9.697 16.130 -28.56 1.534e-07 8.055e-06 8.520 9.850 -25.82 2.769e-07 9.692e-06 7.502 ## AT2G01010 -2.605 ## ATCG00490 -5.590 19.452 -18.55 1.898e-06 4.983e-05 5.841 8.706 -14.53 7.786e-06 1.635e-04 4.562 ## AT4G00050 3.162 8.650 13.07 1.427e-05 2.498e-04 4.003 Code modified from limma users guide documentation. 24/30 Compare Limma results to other methods setlist <- list(edgeR=rownames(edge2foldpadj), DESeq=as.character(res2foldpadj[,1]), Limma=rownames(tt)[tt$adj.P.Val<0.05]) OLlist <- overLapper(setlist=setlist, sep="_", type="vennsets") counts <- sapply(OLlist$Venn_List, length) Code modified from: http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/Rrnaseq.pdf 25/30 Compare Limma results to other methods vennPlot(counts=counts) 26/30 Gene set analysis (Gene Ontology) library(GOstats) library(GO.db) library(ath1121501.db) geneUniverse <- rownames(countDF) geneSample <- res2foldpadj[,1] params <- new("GOHyperGParams", geneIds = geneSample, universeGeneIds = geneUniverse, annotation="ath1121501", ontology = "MF", pvalueCutoff = 0.5, conditional = FALSE, testDirection = "over") hgOver <- hyperGTest(params) htmlReport(hgOver, file = "Output/MyhyperGresult.html") Code modified from: http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/Rrnaseq.pdf 27/30 Gene set analysis (Gene Ontology) summary(hgOver)[1:4,] ## GOMFID Pvalue OddsRatio ExpCount Count Size ## 1 GO:0008324 0.002673 18 2.127 6 7 ## 2 GO:0015075 0.002673 18 2.127 6 7 ## ## ## ## ## 3 GO:0015077 0.002673 4 GO:0015078 0.002673 1 2 18 18 2.127 2.127 6 6 7 7 Term cation transmembrane transporter activity ion transmembrane transporter activity ## 3 monovalent inorganic cation transmembrane transporter activity ## 4 hydrogen ion transmembrane transporter activity Code modified from: http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/Rrnaseq.pdf 28/30 Simple HTML tabular output of GO analysis 29/30 Summary · With a little work it is possible to generate read count data in R. · There are a number of specialized packages for analyzing count data from RNA-seq experiment. · These packages give similar (but not identical) results. 30/30
© Copyright 2025 ExpyDoc