Wen-Yu Chung CSIE.KUAS

Galaxy: an open-source bioinformatics platform
Outline
• What is Galaxy?
• Basic usage
  • Register/history/tools
  • Share histories
• Analysis pipeline
  • Quality control/Mapping
  • RNA-Seq/ChIP-Seq/Metagenomics
• Local installation/Toolshed
What is Galaxy?
• A web-based, open-source software system that aims to make sense of high-throughput data via informatics support
  • interactive
  • transparent
  • reproducible
Galaxy is designed for you
• Experimentalists
  • little informatics/programming experience needed
  • simple interfaces
  • computational details are managed automatically
• Computer scientists
  • make your software easily available with little effort
Basic usage
usegalaxy.org
Main page
• Register to access more functionality
  • share histories/workflows with collaborators
• Left, central and right panels
  • left: tools
  • central: menu or result
  • right: history (datasets)
• Get Data
  • Upload from local/URL
  • Other servers: UCSC main
• Shared data
http://galaxy.psu.edu/CPMB/TAF1_ChIP.txt
Histories
• Color coded
  • gray (preparing)
  • yellow (running)
  • green (ready)
• View/Edit/Delete
• Attributes/datatypes
• History options
  • New
  • Share
  • Delete
Tools
The main server is hosted at Penn State, but tools may be developed by third parties.
Tool menu
set input dataset(s) and parameters
usage/examples
Galaxy 101
• Finding exons with the highest number of nucleotide polymorphisms
  • exon annotation from UCSC Table Browser (interval datatype)
  • SNP coordinates
  • join coordinates
  • count the number of SNPs per exon
  • sort exons by SNP count
  • restore genomic location for exons
  • visualize results in UCSC Genome Browser
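The join, count and sort steps of Galaxy 101 can be sketched outside Galaxy in a few lines of Python. This is only an illustration of the idea, not how the Galaxy tools are implemented: the function names, the half-open interval convention and the toy coordinates are all assumptions made for the example.

```python
from collections import Counter


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def count_snps_per_exon(exons, snps):
    """Join exons with SNPs by coordinate overlap and count SNPs per exon.

    exons, snps: lists of (chrom, start, end, name) tuples (interval-like data).
    Returns (exon_name, snp_count) pairs sorted by count, highest first.
    Exons without any SNP do not appear, as in a plain join.
    """
    counts = Counter()
    for e_chrom, e_start, e_end, e_name in exons:
        for s_chrom, s_start, s_end, _ in snps:
            if e_chrom == s_chrom and overlaps(e_start, e_end, s_start, s_end):
                counts[e_name] += 1
    return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)


# Toy data, invented for this sketch (not real annotation).
exons = [
    ("chr22", 100, 200, "exonA"),
    ("chr22", 300, 400, "exonB"),
    ("chr22", 500, 600, "exonC"),
]
snps = [
    ("chr22", 150, 151, "rs1"),
    ("chr22", 199, 200, "rs2"),
    ("chr22", 350, 351, "rs3"),
    ("chr22", 200, 201, "rs4"),  # just outside exonA
]

ranked = count_snps_per_exon(exons, snps)
print(ranked)
```

The nested loop is quadratic and only suitable for small inputs; the point is the sequence join, count, sort that the slide lists.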
FASTQ variants
Table 1. The three described FASTQ variants, with columns giving the description, format name used in OBF projects, range of ASCII characters permitted in the quality string (in decimal notation), ASCII encoding offset, type of quality score encoded and the possible range of scores.

Description            OBF name        ASCII range  Offset  Score type  Score range
Sanger standard        fastq-sanger    33–126       33      PHRED       0 to 93
Solexa/early Illumina  fastq-solexa    59–126       64      Solexa      −5 to 62
Illumina 1.3+          fastq-illumina  64–126       64      PHRED       0 to 62

For the Solexa variant an offset of 64 was chosen, meaning ASCII 59 to 126 can be used, allowing Solexa scores from −5 to 62 inclusive.
FASTQ Groomer
doi:10.1093/nar/gkp1137
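The offsets in Table 1 translate directly into code. The sketch below decodes a quality string for each variant, re-encodes Illumina 1.3+ qualities as Sanger, and converts Solexa scores to PHRED scores with the usual relation Q_PHRED = 10·log10(10^(Q_Solexa/10) + 1). The function names are made up for this example; it is not the FASTQ Groomer implementation.

```python
import math

# ASCII offsets taken from Table 1, keyed by OBF format name.
OFFSETS = {"fastq-sanger": 33, "fastq-solexa": 64, "fastq-illumina": 64}


def decode_quality(quality_string, variant):
    """Turn a FASTQ quality string into a list of integer scores."""
    offset = OFFSETS[variant]
    return [ord(ch) - offset for ch in quality_string]


def solexa_to_phred(q_solexa):
    """Convert a Solexa score to a PHRED score (returns a float)."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def illumina_to_sanger(quality_string):
    """Re-encode an Illumina 1.3+ string (offset 64) with the Sanger offset (33)."""
    return "".join(chr(ord(ch) - 64 + 33) for ch in quality_string)


print(decode_quality("!I", "fastq-sanger"))   # lowest score and score 40
print(decode_quality(";", "fastq-solexa"))    # lowest Solexa score
print(illumina_to_sanger("h"))                # score 40 in both encodings
```

Decoding a file with the wrong offset shifts every score by 31, which is why the variant has to be known (or groomed to one standard) before quality control.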
Next-generation sequencing data analysis
• Data quality and cleaning
  • FastQC
  • Trim
• Alignment
  • Bowtie: jobs running over 48 hours will be deleted from the server
  • BWA
• Downstream analyses
  • RNA-Seq: (differential) expression
  • ChIP-Seq: peak calling
  • Metagenomics: taxonomy/phylogeny
Per base sequence quality
Summary
This view shows an overview of the range of quality values across all bases at each position in the FastQ file.
For each position a BoxWhisker type plot is drawn. The elements of the plot are as follows:
• The central red line is the median value
• The yellow box represents the inter-quartile range (25-75%)
• The upper and lower whiskers represent the 10% and 90% points
• The blue line represents the mean quality
The y-axis on the graph shows the quality scores. The higher the score the better the base call. The background of the graph divides the y axis into very good quality calls (green), calls of reasonable quality (orange), and calls of poor quality (red). The quality of calls on most platforms will degrade as the run progresses.
• green: good quality
• orange: reasonable quality
• red: poor quality
FastQC manual, section 3.2 Per Base Sequence Quality
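The plot elements listed above are ordinary summary statistics computed per read position. The following sketch shows one way to obtain them from a list of quality strings; it assumes Sanger encoding (offset 33) and a simple linear-interpolation percentile, which may differ from the exact method FastQC uses. All names are illustrative.

```python
def percentile(sorted_values, fraction):
    """Linear-interpolation percentile of an already sorted list (fraction in 0..1)."""
    position = fraction * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def per_base_summary(quality_strings, offset=33):
    """Return, for each read position, the values drawn in the box-whisker plot."""
    longest = max(len(q) for q in quality_strings)
    summary = []
    for pos in range(longest):
        # Reads shorter than this position simply do not contribute.
        scores = sorted(ord(q[pos]) - offset for q in quality_strings if pos < len(q))
        summary.append({
            "mean": sum(scores) / len(scores),      # blue line
            "median": percentile(scores, 0.50),     # central red line
            "q25": percentile(scores, 0.25),        # lower edge of the yellow box
            "q75": percentile(scores, 0.75),        # upper edge of the yellow box
            "p10": percentile(scores, 0.10),        # lower whisker
            "p90": percentile(scores, 0.90),        # upper whisker
        })
    return summary


# Three made-up reads: 'I' = 40, '?' = 30, '5' = 20 in Sanger encoding.
stats = per_base_summary(["II5", "I?5", "I5"])
for position, row in enumerate(stats, start=1):
    print(position, row)
```

A falling median or a widening box towards the end of the read is the degradation pattern the manual describes.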
RNA-Seq
[Figure: pipeline stages: quality control, alignment (single-end RNA-Seq alignment), expression analysis]
http://www.nature.com/nbt/journal/v27/n5/full/nbt0509455.html
doi:10.1038/nbt0509-455
Simple read counts
featureCounts: Measure gene expression in RNA-Seq experiments from SAM or BAM files
To run the analysis, run the featureCounts wrapper with the following settings:
• Alignment file: GSM1244809E2Rep1.bam (a BAM file from your history)
• GFF/GTF Source: Use a built-in index (which fits your reference)
• Reference Gene Sets used during alignment (GFF/GTF): UCSC hg18
• Output format: Gene-name "\t" gene-count "\t" gene-length (tab-delimited)
• Number of the CPU threads: 2
• featureCounts parameters: Default settings
The counting procedure will take about 5 min on an average computer.
Question 1
Before we proceed with the expression levels of the genes, we would like to get a small impression about whether the counting has been performed correctly. Therefore we take a look at featureCounts' output-summary file "featureCounts on ...: GSM1244809E2Rep1.bam summary".
• How many reads are "Assigned"?
• How many reads are "UnAssigned (sum of all)"?
Youri Hoogstrate GCC2014
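Question 1 can also be answered programmatically once the summary dataset is downloaded. The sketch below assumes the summary is a tab-delimited table with a "Status" header row, one "Assigned" row and several rows whose status starts with "Unassigned"; the exact row names can vary between featureCounts versions, and the counts in the example are invented.

```python
def summarize_counts(summary_text):
    """Return (assigned, unassigned_total) from a featureCounts-style summary table."""
    assigned = 0
    unassigned_total = 0
    for line in summary_text.strip().splitlines():
        fields = line.split("\t")
        if len(fields) < 2 or fields[0] == "Status":
            continue  # skip the header row and malformed lines
        status, count = fields[0], int(fields[1])
        if status == "Assigned":
            assigned = count
        elif status.lower().startswith("unassigned"):
            unassigned_total += count  # sum over all unassigned categories
    return assigned, unassigned_total


# Invented example content, not real results.
example_summary = (
    "Status\texample.bam\n"
    "Assigned\t900\n"
    "Unassigned_Ambiguity\t30\n"
    "Unassigned_NoFeatures\t70\n"
)

assigned, unassigned = summarize_counts(example_summary)
print("Assigned:", assigned)
print("Unassigned (sum of all):", unassigned)
```

A very low assigned fraction usually points at a mismatch between the annotation and the reference used during alignment, which is what the question is meant to catch.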
TopHat/Cufflinks
Figure 2 | An overview of the Tuxedo protocol. In an experiment involving two conditions, reads are first mapped to the genome with TopHat. The reads for each biological replicate are mapped independently. These mapped reads are provided as input to Cufflinks, which produces one file of assembled transfrags for each replicate. The assembly files are merged with the reference transcriptome annotation into a unified annotation for further analysis. This merged annotation is quantified in each condition by Cuffdiff, which produces expression data in a set of tabular files. These files are indexed and visualized with CummeRbund to facilitate exploration of genes identified by Cuffdiff as differentially expressed, spliced, or transcriptionally regulated genes. FPKM, fragments per kilobase of transcript per million fragments mapped.
Steps in the figure (Condition A and Condition B are processed in parallel):
• Step 1: TopHat (reads to mapped reads)
• Step 2: Cufflinks (mapped reads to assembled transcripts)
• Steps 3–4: Cuffmerge (assembled transcripts to final transcriptome assembly)
• Step 5: Cuffdiff (mapped reads and final transcriptome assembly to differential expression results)
• Steps 6–18: CummeRbund (differential expression results to expression plots)
Tools:
• Bowtie: extremely fast, general purpose short read aligner
• TopHat: aligns RNA-Seq reads to the genome using Bowtie; discovers splice sites
• Cufflinks package
  • Cufflinks: assembles transcripts
  • Cuffcompare: compares transcript assemblies to annotation
  • Cuffmerge: merges two or more transcript assemblies
  • Cuffdiff: finds differentially expressed genes and transcripts; detects differential splicing and promoter use
• CummeRbund: plots abundance and differential expression results from Cuffdiff
Requirements: being comfortable creating directories, moving files between them and editing text files in a UNIX environment. Installation of the tools may require additional expertise and permission from one's computing system administrators.
Read alignment with TopHat
Alignment of sequencing reads to a reference genome is a core step in the analysis workflows for many high-throughput sequencing assays, including ChIP-Seq [31], RNA-seq, ribosome profiling [32] and others. Sequence alignment itself is a classic problem in computer science and appears frequently in bioinformatics. Hence, it is perhaps not surprising that many read alignment programs have been developed within the last few years. One of the most popular and to date most efficient is Bowtie [33] (http://bowtie-bio.sourceforge.net/index.shtml), which uses an extremely economical data structure called the FM index [34] to store the reference genome sequence and allows it to be searched rapidly. Bowtie uses the FM index to align reads at a rate of tens of millions per CPU hour. However, Bowtie is not suitable for all sequence alignment tasks. It does not allow alignments between a read and the genome to contain large gaps; hence, it cannot align reads that span introns. TopHat was created to address this limitation. TopHat uses Bowtie as an alignment 'engine' and breaks up reads that Bowtie cannot align on its own into smaller pieces called segments.
doi:10.1038/nprot.2012.016
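The FPKM unit named in the figure caption is simple arithmetic: fragments mapped to a transcript, divided by the transcript length in kilobases and by the total number of mapped fragments in millions. The sketch below shows only this basic formula with invented numbers; Cufflinks itself estimates abundances with a statistical model, so its reported values are not obtained by this direct calculation.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million fragments mapped."""
    kilobases = transcript_length_bp / 1_000
    millions_mapped = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions_mapped)


# Invented example: 500 fragments on a 2 kb transcript, 10 million mapped fragments.
value = fpkm(fragments=500, transcript_length_bp=2_000, total_mapped_fragments=10_000_000)
print(value)
```

Normalising by length and by library size is what makes values comparable between transcripts of different lengths and between samples sequenced to different depths.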
Visualization
[Figure: Tuxedo protocol flow: mapped reads and assembled transcripts, Cuffmerge (final transcriptome assembly), Cuffdiff (differential expression results), CummeRbund (expression plots)]
Workflow
Peak calling
Metagenomics
Local installation
• Get Galaxy
  • % hg clone https://bitbucket.org/galaxy/galaxy-dist
  • zip files at wiki.galaxyproject.org/Admin/GetGalaxy
• Start a Galaxy instance locally
  • % sh run.sh
• Open a browser
  • http://localhost:8080 (Galaxy's default port)
• Stop the instance
  • Ctrl + C
Configurations
• Main: config/galaxy.ini
  • add local administrative account
    • admin_users = [email protected]
  • enable Toolshed
    • tool_dependency_dir = dependency_dir
  • locations
    • tool_sheds_conf.xml.sample
• local Toolshed
  • config/tool_shed.ini
  • localhost:9009
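A quick way to check that the two settings above are picked up is to read the ini file with Python's standard `configparser`. The sketch assumes the settings live in the `[app:main]` section of config/galaxy.ini and uses a placeholder address (admin@example.org) that is not taken from the slides; interpolation is disabled because Galaxy ini files can contain `%(here)s`-style values.

```python
import configparser

# Placeholder fragment for illustration; a real galaxy.ini has many more options.
EXAMPLE_INI = """
[app:main]
admin_users = admin@example.org
tool_dependency_dir = dependency_dir
"""


def read_settings(ini_text):
    """Return (list of admin e-mails, tool dependency directory) from ini text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    main = parser["app:main"]
    # admin_users is a comma-separated list of account e-mail addresses.
    admins = [a.strip() for a in main.get("admin_users", fallback="").split(",") if a.strip()]
    return admins, main.get("tool_dependency_dir", fallback=None)


admins, dependency_dir = read_settings(EXAMPLE_INI)
print(admins, dependency_dir)
```

To check an actual installation, the same function can be given the contents of config/galaxy.ini read from disk.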
Toolshed install
Toolshed upload
Local datasets
• Default relational database
  • SQLite
  • galaxy-dist/database/universe.sqlite
• Actual files
  • galaxy-dist/database/files
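Because the default database is a plain SQLite file, it can be inspected with Python's built-in `sqlite3` module. The sketch below only lists table names and demonstrates this on an in-memory database with two made-up tables, so it makes no claim about Galaxy's actual schema.

```python
import sqlite3


def list_tables(connection):
    """Return the names of all tables in an SQLite database, sorted by name."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Demonstration on an in-memory database; the table names are invented.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE demo_history (id INTEGER PRIMARY KEY, name TEXT)")
demo.execute("CREATE TABLE demo_dataset (id INTEGER PRIMARY KEY, history_id INTEGER)")

tables = list_tables(demo)
print(tables)
demo.close()
```

For a local instance one would instead connect to galaxy-dist/database/universe.sqlite, preferably while Galaxy is stopped, and only read from it.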
Other public sites
http://toolshed.dtls.nl
Genome Space
All you need to remember is ...
galaxyproject.org