Galaxy: an open source Bioinformatics platform
Wen-Yu Chung
CSIE.KUAS

Outline
• What is Galaxy?
• Basic usage
  • Register/history/tools
  • Share histories
• Analysis pipeline
  • Quality control/Mapping
  • RNA-Seq/ChIP-Seq/Metagenomics
• Local installation/Toolshed

What is Galaxy?
• A web-based, open-source software system that aims to make sense of high-throughput data via informatics support that is
  • interactive
  • transparent
  • reproducible

Galaxy is designed for you
• Experimentalists
  • little informatics/programming experience
  • simple interfaces
  • computational details are managed automatically
• Computer Scientists
  • make your software easily available with little effort

Basic usage
usegalaxy.org

Main page
• Register to use more functionality
  • share histories/workflows with collaborators
• Left, central and right panels
  • left: tools
  • central: menu or result
  • right: history (datasets)
• Get Data
  • Upload from local/URL
  • Other servers: UCSC main
  • Shared data
http://galaxy.psu.edu/CPMB/TAF1_ChIP.txt

Histories
• Color coded
  • gray (preparing)
  • yellow (running)
  • green (ready)
• View/Edit/Delete
• Attributes/datatypes
• History options
  • New
  • Share
  • Delete

Tools
• Main server is hosted at Penn State
• But tools may be developed by third parties

Tool menu
• set input dataset(s) and parameters
• usage/examples

Galaxy 101
• Finding exons with the highest number of nucleotide polymorphisms
  • exon annotation from UCSC Table Browser
  • SNP coordinates
  • join coordinates
  • count the number of SNPs per exon
  • sort exons by SNP count
  • restore genomic location for exons
  • visualize results in UCSC Genome Browser
• interval datatype

FASTQ variants
In the Solexa variant, an offset of 64 was chosen, meaning ASCII 59 to 126 can be used, allowing Solexa scores from -5 to 62 inclusive.
Table 1.
The three described FASTQ variants, with columns giving the description, format name used in OBF projects, range of ASCII characters permitted in the quality string (in decimal notation), ASCII encoding offset, type of quality score encoded and the possible range of scores

Description, OBF name                  ASCII characters      Quality score
                                       Range     Offset      Type      Range
Sanger standard, fastq-sanger          33–126    33          PHRED     0 to 93
Solexa/early Illumina, fastq-solexa    59–126    64          Solexa    -5 to 62
Illumina 1.3+, fastq-illumina          64–126    64          PHRED     0 to 62

FASTQ Groomer
doi:10.1093/nar/gkp1137

Next-generation sequencing data analysis
• Data quality and cleaning
  • FastQC
  • Trim
• Alignment
  • Bowtie: jobs over 48hrs will be deleted from server
  • BWA
• Downstream analyses
  • RNA-Seq: (differential) expression
  • ChIP-Seq: peak calling
  • Metagenomics: taxonomy/phylogeny

Per base sequence quality
• green: good quality
• orange: reasonable quality
• red: poor quality

3.2 Per Base Sequence Quality
Summary
This view shows an overview of the range of quality values across all bases at each position in the FastQ file.
For each position a BoxWhisker type plot is drawn. The elements of the plot are as follows:
• The central red line is the median value
• The yellow box represents the inter-quartile range (25-75%)
• The upper and lower whiskers represent the 10% and 90% points
• The blue line represents the mean quality
The y-axis on the graph shows the quality scores. The higher the score the better the base call. The background of the graph divides the y axis into very good quality calls (green), calls of reasonable quality (orange), and calls of poor quality (red).
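As a rough illustration of the plot elements listed above, the sketch below decodes FASTQ quality strings and computes, for each read position, the median, the 25-75% range, the 10% and 90% points and the mean. It is not FastQC's own code: the function names (`decode_quality`, `percentile`, `per_base_summary`), the nearest-rank percentile rule and the toy reads are assumptions made for this example. The default offset of 33 corresponds to the fastq-sanger row of Table 1; pass 64 for fastq-illumina.

```python
from statistics import mean, median


def decode_quality(qual, offset=33):
    """Convert a FASTQ quality string to integer scores (ASCII code minus offset)."""
    return [ord(ch) - offset for ch in qual]


def percentile(sorted_vals, pct):
    """Nearest-rank percentile of a sorted, non-empty list (assumed rule)."""
    rank = max(1, -(-len(sorted_vals) * pct // 100))  # ceil(n * pct / 100)
    return sorted_vals[rank - 1]


def per_base_summary(quality_strings, offset=33):
    """Summarise quality scores per read position, one dict per position."""
    columns = []
    for qual in quality_strings:
        for pos, score in enumerate(decode_quality(qual, offset)):
            if pos == len(columns):
                columns.append([])  # first read that reaches this position
            columns[pos].append(score)

    summary = []
    for scores in columns:
        scores.sort()
        summary.append({
            "median": median(scores),
            "q25": percentile(scores, 25),
            "q75": percentile(scores, 75),
            "p10": percentile(scores, 10),
            "p90": percentile(scores, 90),
            "mean": mean(scores),
        })
    return summary


if __name__ == "__main__":
    # Toy quality strings, invented for illustration only.
    reads = ["IIII", "II5+", "I?5+", "II?#"]
    for pos, stats in enumerate(per_base_summary(reads), start=1):
        print(pos, stats)
```

With real data the quality strings would be every fourth line of a FASTQ file; reads of different lengths are handled because each position only collects the scores that exist for it.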
The quality of base calls on most platforms will degrade as the run progresses, so it is common to see [...]
FastQC manual

RNA-Seq alignment
• RNA-seq alignment, single-end
http://www.nature.com/nbt/journal/v27/n5/full/nbt0509455.html
doi:10.1038/nbt0509-455

Simple read counts
featureCounts
Measure gene expression in RNA-Seq experiments from SAM or BAM files
To run the analysis, run the wrapper with the following settings:
• Alignment file: GSM1244809 E2 Rep1.bam (a bam file from your history)
• GFF/GTF Source: Use a built-in index (which fits your reference)
• Reference Gene Sets used during alignment (GFF/GTF): UCSC hg18
• Output format: Gene-name "\t" gene-count "\t" gene-length (tab-delimited)
• Number of the CPU threads: 2
• featureCounts parameters: Default settings
The counting procedure will take about 5 min on an average computer.

Question 1
Before we proceed with the expression levels of the genes, we would like to get a quick impression of whether the counting has been performed correctly. Therefore we take a look at featureCounts' output summary file "featureCounts on...: GSM1244809 E2 Rep1.bam summary".
• How many reads are "Assigned"?
• How many reads are "UnAssigned (sum of all)"?
Youri Hoogstrate, GCC2014

• Bowtie: extremely fast, general purpose short read aligner

Figure 2 | An overview of the Tuxedo protocol. In an experiment involving two conditions, reads are first mapped to the genome with TopHat. The reads for each biological replicate are mapped independently. These mapped reads are provided as input to Cufflinks, which produces one file of assembled transfrags for each replicate. The assembly files are merged with the reference transcriptome annotation into a unified annotation for further analysis.
This merged annotation is quantified in each condition by Cuffdiff, which produces expression data in a set of tabular files. These files are indexed and visualized with CummeRbund to facilitate exploration of genes identified by Cuffdiff as differentially expressed, spliced, or transcriptionally regulated genes. FPKM, fragments per kilobase of transcript per million fragments mapped.

TopHat/Cufflinks
• TopHat
  • Aligns RNA-Seq reads to the genome using Bowtie
  • Discovers splice sites
• Cufflinks
  • Assembles transcripts
• Cuffcompare
  • Compares transcript assemblies to annotation
• Cuffmerge
  • Merges two or more transcript assemblies
• Cuffdiff
  • Finds differentially expressed genes and transcripts
  • Detects differential splicing and promoter use
• CummeRbund
  • Plots abundance and differential expression results from Cuffdiff

Figure: Tuxedo workflow, applied to the reads from Condition A and Condition B
• Step 1: TopHat (reads to mapped reads)
• Step 2: Cufflinks (mapped reads to assembled transcripts)
• Steps 3–4: Cuffmerge (assembled transcripts to final transcriptome assembly)
• Step 5: Cuffdiff (mapped reads and final transcriptome assembly to differential expression results)
• Steps 6–18: CummeRbund

[...] feel comfortable creating directories, moving files between them and editing text files in a UNIX environment. Installation of the tools may require additional expertise and permission from one's computing system administrators.

Read alignment with TopHat
Alignment of sequencing reads to a reference genome is a core step in the analysis workflows for many high-throughput sequencing assays, including ChIP-Seq [31], RNA-seq, ribosome profiling [32] and others. Sequence alignment itself is a classic problem in computer science and appears frequently in bioinformatics. Hence, it is perhaps not surprising that many read alignment programs have been developed within the last few years. One of the most popular and to date most efficient is Bowtie [33] (http://bowtie-bio.sourceforge.net/index.shtml), which uses an extremely economical data structure called the FM index [34] to store the reference genome sequence and allows it to be searched rapidly. Bowtie uses the FM index to align reads at a rate of tens of millions per CPU hour. However, Bowtie is not suitable for all sequence alignment tasks. It does not allow alignments between a read and the genome to contain large gaps; hence, it cannot align reads that span introns. TopHat was created to address this limitation.
TopHat uses Bowtie as an alignment 'engine' and breaks up reads that Bowtie cannot align on its own into smaller pieces called segments [...]
doi:10.1038/nprot.2012.016
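The FPKM definition given with the Tuxedo figure (fragments per kilobase of transcript per million fragments mapped) can be written out as a small calculation. This is a minimal sketch of the definition only, not of how Cufflinks estimates transcript abundances; the function name `fpkm` and the example numbers are made up for illustration.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million fragments mapped."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total mapped fragments must be positive")
    kilobases = transcript_length_bp / 1_000            # transcript length in kb
    millions = total_mapped_fragments / 1_000_000       # library size in millions
    return fragments / (kilobases * millions)


if __name__ == "__main__":
    # Hypothetical numbers: 500 fragments on a 2,000 bp transcript,
    # in a sample with 10 million mapped fragments.
    print(fpkm(500, 2_000, 10_000_000))
```

Dividing by transcript length and by library size is what makes values comparable between long and short transcripts and between samples sequenced to different depths.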
Figure (continued): Cuffmerge produces the final transcriptome assembly, Cuffdiff the differential expression results, and CummeRbund the expression plots.

Visualization

Workflow

Peak calling

Metagenomics

Local installation
• Get Galaxy
  • % hg clone https://bitbucket.org/galaxy/galaxy-dist
  • zip files at wiki.galaxyproject.org/Admin/GetGalaxy
• Start a Galaxy instance locally
  • % sh run.sh
• Open a browser
  • http://localhost (Galaxy listens on port 8080 by default, i.e. http://localhost:8080)
• Stop the instance
  • Ctrl + C

Configurations
• Main: config/galaxy.ini
  • add local administrative account
    • admin_users = [email protected]
  • enable Toolshed
    • tool_dependency_dir = dependency_dir
  • locations
    • tool_sheds_conf.xml.sample
• local Toolshed
  • config/tool_shed.ini
  • localhost:9009

Toolshed install

Toolshed upload

Local datasets
• Default relational database
  • SQLite
  • galaxy-dist/database/universe.sqlite
• Actual files
  • galaxy-dist/database/files

Other public sites
• http://toolshed.dtls.nl
• Genome Space

All you need to remember is ...
galaxyproject.org
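To double-check the two settings named on the Configurations slide, a short script can read them back from the INI file. The sketch below is only an illustration under stated assumptions: it parses an in-memory sample rather than a real config/galaxy.ini, the `[app:main]` section name and the comma-separated form of `admin_users` are assumed, and the address `admin@example.org` and the helper names (`read_settings`, `admin_list`) are hypothetical.

```python
import configparser

# Hypothetical sample standing in for config/galaxy.ini.
SAMPLE_INI = """\
[app:main]
admin_users = admin@example.org
tool_dependency_dir = dependency_dir
"""


def read_settings(ini_text, section="app:main"):
    """Return the key/value pairs of one INI section as a plain dict."""
    # interpolation=None keeps any literal '%' characters in values untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    return dict(parser[section])


def admin_list(settings):
    """Split admin_users into individual addresses (assumed comma-separated)."""
    raw = settings.get("admin_users", "")
    return [addr.strip() for addr in raw.split(",") if addr.strip()]


if __name__ == "__main__":
    settings = read_settings(SAMPLE_INI)
    print("admins:", admin_list(settings))
    print("tool_dependency_dir:", settings.get("tool_dependency_dir"))
```

To check a real installation, the same functions could be given the text of the actual file, for example `read_settings(open("config/galaxy.ini").read())`.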