
Visualization and Clustering
with High-dimensional
Genomic Data
Zhenqiu Liu, PhD
10/14/2014
Agenda
• Data Visualization
• Big data and their visualization
• PCA and MDS for data visualization
• Clustering and data mining
• Bioinformatics resources
Introduction
• Visualization is the use of graphical techniques to
communicate information and support reasoning
or analysis
• Visualizations are cost-effective because they
exploit
– powerful human visual processing capabilities and
– high quality graphics created at low cost
• Two kinds of visualizations
– Scientific Visualization
– Information Visualization
Visualization
• Scientific Visualization
– Graphical representations built from the results of mathematical models, computational data, and simulations
– Involves research in computer graphics, statistics, image processing, high-performance computing, and other areas
– It's not just a pretty picture or animation
• Information Visualization
– The use of computer-supported, interactive, visual representations of abstract data to amplify cognition
• Visualization is not only looking at a pretty picture…
– understanding the data
– being able to analyze and interpret the data
Goals of Information Visualization
Visualization should:
– Make large datasets coherent
(Present huge amounts of information compactly)
– Present information from various viewpoints
– Present information at several levels of detail
(from overviews to fine structure)
– Support visual comparisons
– Tell stories about the data
Why Visualization?
Use the eye for pattern recognition; people are good at
scanning
recognizing
remembering images
Graphical elements facilitate comparisons via
length
shape
orientation
texture
Animation shows changes across time
Color helps make distinctions
A Key Question
How do we convert abstract information into a visual representation, while still preserving the underlying meaning, and at the same time providing new insight?
Problem Solving - Example
• London cholera epidemic of 1854
• At that time, two hypotheses of causes of cholera:
– Cholera is related to miasmas concentrated in the
swampy areas of the city
– Cholera is related to ingestion of contaminated water
• Input Data
– Locations of deaths due to cholera
– Locations of water pumps
Dr Snow’s Cholera Map
[Map: dots locate deaths due to cholera; crosses locate water pumps]
Dr Snow’s Cholera Map
John Snow’s deduction that a cholera epidemic was caused by a bad water pump, circa 1854. Horizontal lines indicate locations of deaths.
From Visual Explanations by Edward Tufte, Graphics Press, 1997.
Dr Snow’s Cholera Map
• Plotting the input data on the map helped Dr
Snow
– to detect the epicentre of the epidemic
– Close to a pump on Broad Street
• Considered a classic case of visualization
helping reasoning with data
LifeLines
• Visualization of computerised medical records
• For a patient
– Horizontal lines (time lines) represent medical
problems, hospitalization and medications
– Icons on these lines represent events such as tests and
physician consultations
• All the patient information is fitted into one screen
Screenshot of LifeLines
Types of Visualization
(Kosslyn 89)
• Graphs
• Charts
• Maps
• Diagrams
Common Graph Types
[Example charts: bar chart of # of accesses per URL; line graph of # of accesses over days; plots of length of access vs. length of page]
Scatter Plots
• Qualitatively determine if variables
– are highly correlated
• linear mapping between horizontal & vertical axes
– have low correlation
• spherical, rectangular, or irregular distributions
– have a nonlinear relationship
• a curvature in the pattern of plotted points
• Place points of interest in context
– color representing special entities
Visualizing Temporal Data
• Traditionally time series are visualized
using trend graphs and seasonality graphs
– A time series can be expressed in terms of its
trend and seasonality components
– Data = trend + seasonal + remainder
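As a quick illustration of this decomposition, the sketch below uses R's stl() on the built-in monthly co2 dataset (an assumed example series, not data from these slides) to split a series into trend, seasonal, and remainder components:

# Decompose a monthly time series into trend + seasonal + remainder (illustrative sketch)
data(co2)                                # built-in monthly CO2 concentration series
fit <- stl(co2, s.window = "periodic")   # loess-based seasonal-trend decomposition
plot(fit)                                # panels: data, seasonal, trend, remainder
head(fit$time.series)                    # columns: seasonal, trend, remainder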
Trend and Seasonality in Time Series
[Figure: example trend and seasonality decomposition plots]
When to use which type?
• Line graph
– x-axis requires a quantitative variable
– Variables have continuous values
– Ordering among ordinals
• Bar graph
– comparison of relative point values
• Scatter plot
– convey overall impression of relationship between two variables
• Pie Chart?
– Emphasizing differences in proportion among a few numbers
Growth Chart Of GEO (RNA etc)
Gene Expression Omnibus (GEO) database holds over 10 000 experiments comprising 300 000 samples, 16 billion
individual abundance measurements, for over 500 organisms, submitted by 5000 laboratories from around the world. The database typically
receives over 60 000 query hits and 10 000 bulk FTP downloads per day, and has been cited in over 5000 manuscripts.
GenBank growth chart (DNA sequences)
There are 126 billion bases in 135 million sequence records in the traditional GenBank divisions and 191 billion bases in 62 million sequence records in the WGS division as of April 2011.
Big Omics Data
• Many genes and samples; heterogeneous data structures and data types
• Big data collections vs. big data objects
• Big data collection: aggregates of many data sets (multi-source, multi-disciplinary, heterogeneous, and possibly distributed)
• Big data object: a single object too large
– for main memory
– for local disk
– even for remote disk
Basic Types of Omics Data
• Nominal (qualitative)
– (no inherent order) SNP, Sequencing, ...
• Ordinal (qualitative)
– (ordered, but not at measurable intervals)
– first, second, third, …
– Clinical phenotypes (e.g. cancer stages)
• Quantitative
– list of integers or reals
– Gene expression, protein expression.
• Structural (PPI networks)
Dimension Reduction
• High dimensional data points are difficult to visualize
• Always good to plot data in 2D
– Easier to detect or confirm the relationship among data points
– Catch stupid mistakes (e.g. in clustering)
• Two ways to reduce:
– By genes: some genes are similar or have little information
– By experiments: some experiments are similar or have little information
Principal Component Analysis
• An optimal linear transformation that chooses a new coordinate system for the data, projecting the data onto new axes (the principal components) ordered by the variance they capture
• Components are orthogonal (mutually uncorrelated)
• A few PCs may capture most of the variation in the original data
• E.g. reduce 2D data to 1D
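A minimal R sketch of this idea with prcomp(), assuming an expression matrix with samples in rows and genes in columns (simulated here; the variable name expr is ours, not from the slides):

# PCA of a simulated expression matrix: 20 samples x 100 genes (illustrative sketch)
set.seed(1)
expr <- matrix(rnorm(20 * 100), nrow = 20)          # rows = samples, columns = genes
pca  <- prcomp(expr, center = TRUE, scale. = TRUE)  # center and scale each gene
summary(pca)$importance[, 1:3]                      # variance explained by PC1-PC3
plot(pca$x[, 1], pca$x[, 2],                        # project samples onto PC1 vs PC2
     xlab = "PC1", ylab = "PC2", main = "Samples in PC space")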
PCA
[Figure: rotating from the original coordinate axes to the principal axes v1 and v2]
Dimension Reduction (PCA)
[Figure: 3D data (x, y, z) with new axes 1–3] Principal components pick out the directions in the data that capture the greatest variability.
Representing data in a reduced space
The first new axis is projected through the data so as to explain the greatest proportion of the variance in the data (most important). The second new axis is orthogonal and explains the next largest amount of variance.
Typical Analysis
• PCA analysis: plot the eigenvalues (scree plot) and select the number of components to keep
• Array projection: plot PC1 vs PC2, etc.
• Gene projection
[Figure: scree plot of eigenvalues]
Interpreting a PCA
Each axis represents a different “trend” or set of profiles.
– The further from the origin, the greater the loading/contribution (i.e. higher expression)
– Points in the same direction from the origin share the same trend
PCA of Gene Expression Data
[Figure: example PCA plot of gene expression data]
Multidimensional scaling (MDS)
• MDS deals with the following problem: given a set of observed similarities (or distances) between every pair of N items, find a representation of the items in a few dimensions such that the similarity (distance) structure nearly matches the original similarities (or distances).
• The numerical measure of how close the low-dimensional distances are to the original distances is called the stress.
MDS
1. MDS attempts to map objects to a visible 2D or 3D Euclidean space. The goal is to best preserve the distance structure after the mapping.
2. The original data can live in a high-dimensional or even non-metric space. The method only cares about the distance (dissimilarity) structure.
3. It can be shown that the results of PCA are exactly those of classical MDS if the distances calculated from the data matrix are Euclidean.
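A minimal R sketch of MDS (reusing the simulated matrix expr from the PCA sketch above): classical MDS with cmdscale() on Euclidean distances, and Kruskal's non-metric MDS with isoMDS(), which reports the stress of the solution:

# Classical and non-metric MDS on pairwise distances between samples (illustrative sketch)
library(MASS)
d   <- dist(expr, method = "euclidean")   # dissimilarity structure between samples
mds <- cmdscale(d, k = 2)                 # classical (metric) MDS in 2 dimensions
plot(mds[, 1], mds[, 2], xlab = "Dimension 1", ylab = "Dimension 2",
     main = "Classical MDS of samples")
nmds <- isoMDS(d, k = 2)                  # Kruskal's non-metric MDS
nmds$stress                               # stress: mismatch between original and mapped distances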
PCA vs. MDS
Input data:
– PCA: data matrix (S subjects in G dimensions)
– MDS: dissimilarity structure (distance between any pair of subjects)
Method:
– PCA: “project” subjects to a low-dimensional space, preserving as much variance as possible
– MDS: find a low-dimensional space that best keeps the dissimilarity structure
Restrictions:
– PCA: data have to be in Euclidean space
– MDS: flexible to any data structure as long as the dissimilarity structure can be defined
Pros and cons:
– PCA: the PCs can be used in downstream analyses; if a new subject is added, it can be similarly projected
– MDS: flexibility and visualization, but if a new subject is added, it can't be shown in an existing MDS solution
PCA application: genomic study
• Population stratification: allele frequency differences between cases and controls due to systematic ancestry differences can cause spurious associations in disease studies.
• PCA can be used to infer the underlying population structure.
References:
– Alkes L Price, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A Shadick & David Reich (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38, 904-909.
– Chao Tian, Peter K Gregersen & Michael F Seldin (2008). Accounting for ancestry: population substructure and genome-wide association studies.
Software for dimension reduction & visualization
PCA in R:
– prcomp(stats): Principal Components Analysis (preferred)
– princomp(stats): Principal Components Analysis
– screeplot(stats): screeplot of PCA results
PCA in IMSL (a commercial C library)
MDS in R:
– isoMDS(MASS): Kruskal's Non-metric Multidimensional Scaling
– cmdscale(stats): Classical (Metric) Multidimensional Scaling
– sammon(MASS): Sammon's Non-Linear Mapping
MDS: various software and resources about MDS
– http://www.granular.com/MDS/
Heatmap visualization:
– Treeview: http://rana.lbl.gov/EisenSoftware.htm
Visualization vs. Analysis?
• Applications to data mining and data discovery
– Visualization tools are helpful for exploring hunches and presenting results
• Examples: scatterplots
– They are the WRONG primary tool when the goal is to find a good prediction model in a complex situation
Data Mining and Machine Learning
• Machine learning and data mining can be used to recognize or classify complex items (objects, situations, etc.), to predict future data or events, and to explore the structure in the data.
• On the boundary of Computer Science and Statistics.
• Very exciting area.
Why Data Mining?
• A lot of data
• Data is noisy
• No clear biological theory
• Large number of features (genes)
• Complex relationships
• Let the data do the talking!
KDD Process
Unsupervised Learning
Unsupervised learning attempts to discover interesting structure in the available data (data mining, clustering).
Example 1: group people of similar sizes together to make “small”, “medium” and “large” T-shirts.
– Tailor-made for each person: too expensive
– One-size-fits-all: does not fit all
Example 2: in medicine, identify patient subtypes based on their omics profiles.
– Personalized medicine
Supervised Learning
[Diagram: observations from an unknown system, paired with the property of interest, form the training dataset; an ML algorithm fits a model, and the model predicts the property (classification) for a new observation]
Topics
• Clustering and Data Mining
• Classification and Prediction
• Bioinformatics Resources
The challenge
• Biologists are estimated to produce 25,000,000,000,000,000 bytes (25 petabytes) of data each year (roughly 35 million CD-ROMs).
• How do we learn something from this data?
• Find patterns/structure in the data → use cluster analysis
Cluster analysis
• Definition: Clustering is the process of
grouping several objects into a number of
groups, or clusters.
• Goal: Objects in the same cluster are more
similar to one another than they are to
objects in other clusters.
Basic principles of clustering
Aim: to group observations or variables that are “similar” based on predefined criteria.
Issues:
– Which genes / genomic technology to use?
– Which similarity or dissimilarity measure?
– Which method to use to join the clusters/observations?
– Which clustering algorithm?
– How to validate the resulting clusters?
Clustering of genes
[Workflow: omics data → for each gene, calculate a summary statistic and/or adjusted p-value → set of candidate DE genes → choose similarity metric and clustering algorithm → clustering → descriptive interpretation → biological verification]
Which similarity or dissimilarity measure?
• A metric is a measure of the similarity or dissimilarity
between two data objects
• Two main classes of metric:
– Correlation coefficients (similarity)
• Compares shape of expression curves
– Kernel matrix (e.g. string kernel)
– Distance metrics (dissimilarity)
• City Block (Manhattan) distance
• Euclidean distance
Correlation (a measure between -1 and 1)
• Pearson Correlation Coefficient (centered correlation)
  r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{S_x} \right) \left( \frac{y_i - \bar{y}}{S_y} \right)
  Sx = standard deviation of x, Sy = standard deviation of y
• Others include Spearman's rho and Kendall's tau
[Figure: examples of positive and negative correlation]
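A small R sketch of these correlation measures on two simulated profiles (x and y are made-up vectors, not data from the slides):

# Pearson, Spearman, and Kendall correlation between two profiles (illustrative sketch)
set.seed(2)
x <- rnorm(50)
y <- 0.8 * x + rnorm(50, sd = 0.5)   # y is positively correlated with x
cor(x, y, method = "pearson")        # centered (Pearson) correlation
cor(x, y, method = "spearman")       # Spearman's rho (rank-based)
cor(x, y, method = "kendall")        # Kendall's tau
1 - cor(x, y)                        # a common correlation-based dissimilarity for clustering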
Distance metrics
• City Block (Manhattan) distance:
– Sum of differences across dimensions
– Less sensitive to outliers
– Diamond-shaped clusters
  d(X, Y) = \sum_i |x_i - y_i|
• Euclidean distance:
– Most commonly used distance
– Sphere-shaped clusters
– Corresponds to the geometric distance in the multidimensional space
  d(X, Y) = \sqrt{\sum_i (x_i - y_i)^2}
where gene X = (x1,…,xn) and gene Y = (y1,…,yn)
[Figure: genes X and Y plotted against Condition 1 and Condition 2, showing the Manhattan and Euclidean distances between them]
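A minimal R sketch computing both distances with dist() for two hypothetical gene profiles (the numbers are made up for illustration):

# Manhattan vs. Euclidean distance between two gene profiles (illustrative sketch)
X <- c(1.2, 0.5, 2.3, 1.8)               # gene X across 4 conditions
Y <- c(0.9, 1.1, 2.0, 2.5)               # gene Y across the same conditions
dist(rbind(X, Y), method = "manhattan")  # sum of absolute differences
dist(rbind(X, Y), method = "euclidean")  # square root of the sum of squared differences
# For a full genes-by-conditions matrix, dist() returns all pairwise distances between rows.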
Euclidean vs Correlation (I)
[Figure: example profiles compared using Euclidean distance vs. correlation]
Clustering algorithms
• Clustering algorithms come in two basic flavors:
– Partitioning
– Hierarchical
Hierarchical methods
• Hierarchical clustering methods produce a tree or
dendrogram.
• They avoid specifying how many clusters are
appropriate by providing a partition for each k
obtained from cutting the tree at some level.
• The tree can be built in two distinct ways
– bottom-up: agglomerative clustering (usually used).
– top-down: divisive clustering.
Agglomerative clustering illustration
[Figure: five points in two-dimensional space; agglomerative clustering successively merges {1,5}, then {1,2,5} and {3,4}, and finally {1,2,3,4,5}, producing a dendrogram]
Relationships between these pairwise distances – Clustering Algorithms
• Different algorithms
– Bottom-up or top-down
– A popular choice is hierarchical bottom-up (agglomerative) clustering
– The distance between a cluster and the remaining clusters can be measured using the minimum, maximum, or average distance
– The single linkage algorithm uses the minimum distance
Comparison of Linkage Methods
– Single linkage: join by the minimum distance between clusters
– Average linkage: join by the average distance
– Complete linkage: join by the maximum distance
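A short R sketch comparing these linkage options in hclust() (reusing the simulated matrix expr from the PCA sketch above):

# Hierarchical clustering of samples under different linkage rules (illustrative sketch)
d <- dist(expr, method = "euclidean")
hc_single   <- hclust(d, method = "single")    # join by minimum distance
hc_average  <- hclust(d, method = "average")   # join by average distance
hc_complete <- hclust(d, method = "complete")  # join by maximum distance
par(mfrow = c(1, 3))
plot(hc_single); plot(hc_average); plot(hc_complete)
cutree(hc_complete, k = 3)                     # cut the tree into 3 clusters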
Partitioning methods
• Partition the data into a pre-specified number k of mutually exclusive and exhaustive groups.
• Iteratively reallocate the observations to clusters until some criterion is met, e.g. minimizing the within-cluster sums of squares. Ideally, dissimilarity between clusters is maximized while dissimilarity within clusters is minimized.
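A minimal R sketch of one partitioning method, k-means, again on the simulated matrix expr (k and the data are assumptions for illustration):

# k-means partitioning into a pre-specified number of clusters (illustrative sketch)
set.seed(3)
km <- kmeans(expr, centers = 2, nstart = 25)    # k = 2, with 25 random restarts
km$cluster                                      # cluster assignment of each sample
km$tot.withinss                                 # total within-cluster sum of squares (the minimized criterion)
plot(prcomp(expr)$x[, 1:2], col = km$cluster,   # view the partition in PC space
     xlab = "PC1", ylab = "PC2")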
Partitioning methods
[Figure: example partition with K = 2]
Partitioning methods
[Figure: example partition with K = 4]
Cluster Analysis in R
– dist(): compute the dissimilarity matrix
– hclust(): hierarchical clustering
– heatmap(): clustered heatmap
– library(Heatplus): enhanced heatmaps
Many publications present both.
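A sketch of the typical pipeline with these functions, again on the simulated expr matrix (heatmap() builds the row and column dendrograms itself):

# From dissimilarities to a clustered heatmap (illustrative sketch)
d  <- dist(expr)                       # pairwise distances between samples (rows)
hc <- hclust(d, method = "average")    # hierarchical clustering
plot(hc)                               # dendrogram
heatmap(expr,                          # heatmap with row and column dendrograms
        distfun = dist, hclustfun = hclust,
        scale = "column")              # scale each gene (column) before plotting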
Summary
Data Mining (Clustering)
– Similarity measures
– Partitioning and hierarchical clustering
– Overlapping Clustering
Topics
• Clustering and Data Mining
• Classification (Discrimination) and
Prediction
• Bioinformatics resources
Classification and Prediction
[Diagram: a learning set (data with known classes) is fed to a classification technique to learn a classification rule (discrimination); the rule is then applied to data with unknown classes to predict their class assignment]
Classification in Bioinformatics
• Computational diagnostic: early cancer
detection
• Tumor biomarker discovery
• Protein folding prediction
• Protein-protein binding sites prediction
• Gene function prediction
• …
Learning set
[Diagram: predefined classes based on clinical outcome (bad prognosis: recurrence < 5 yrs; good prognosis: recurrence > 5 yrs); the objects are arrays and the feature vectors are gene expression profiles; a classification rule learned from the arrays predicts the class of a new array]
Reference: L van't Veer et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan.
Microarray Data Challenges to
Machine Learning Algorithms:
• Few samples for analysis (38 labeled)
• Extremely high-dimensional data (7129 gene expression values
per sample)
• Noisy data
• Complex underlying mechanisms, not fully understood
Some genes are more useful than others for building classification models
– Example: genes 36569_at and 36495_at are useful [scatter plot separates AML from ALL samples]
– Example: genes 37176_at and 36563_at are not useful
Importance of Feature (Gene) Selection
• The majority of genes are not directly related to cancer
• Having a large number of features enhances the model's flexibility, but makes it prone to overfitting
• Noise and the small number of training samples make this even more likely
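As a hedged sketch of one simple filter approach (not the method used in the cited studies): rank genes by a two-sample t-statistic between classes and keep the top few, assuming the simulated expr matrix from above and hypothetical ALL/AML labels:

# Simple t-statistic filter for gene selection (illustrative sketch, not the published method)
labels <- factor(rep(c("ALL", "AML"), each = 10))    # hypothetical class labels for 20 samples
tstats <- apply(expr, 2, function(g)                 # one t-statistic per gene (column)
  abs(t.test(g[labels == "ALL"], g[labels == "AML"])$statistic))
top_genes  <- order(tstats, decreasing = TRUE)[1:10] # keep the 10 most discriminative genes
expr_small <- expr[, top_genes]                      # reduced feature matrix for model building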
A 6-gene signature of lung metastasis
Landemaine T et al. Cancer Res. 2008 Aug 1;68(15):6092-9.
Topics
• Clustering and Data Mining
• Classification (Discrimination) and
Prediction
• Bioinformatics Resources
Five websites that all biologists
should know
• NCBI (The National Center for Biotechnology Information)
– http://www.ncbi.nlm.nih.gov/
• EBI (The European Bioinformatics Institute)
– http://www.ebi.ac.uk/
• The Canadian Bioinformatics Resource
– http://www.cbr.nrc.ca/
• SwissProt/ExPASy (Swiss Bioinformatics Resource)
– http://expasy.cbr.nrc.ca/sprot/
• PDB (The Protein Databank)
– http://www.rcsb.org/PDB/
A few more resources to be aware of
• Human Genome Working Draft
– http://genome.ucsc.edu/
• TIGR (The Institute for Genomic Research)
– http://www.tigr.org/
• Celera
– http://www.celera.com/
• (Model) Organism-specific information:
– Yeast: http://genome-www.stanford.edu/Saccharomyces/
– Arabidopsis: http://www.tair.org/
– Mouse: http://www.jax.org/
– Fruitfly: http://www.fruitfly.org/
– Nematode: http://www.wormbase.org/
• Nucleic Acids Research Database Issue
– http://nar.oupjournals.org/ (first issue every year)
Challenges
• Confusing choice of tools
• Developed independently
• Written by and for nerds
• Need help from bioinformatician
Outline
• What is R
• What is Bioconductor
• Getting and using Bioconductor
• Overview of Bioconductor packages
• Demo
R
• R is a language and environment for
statistical computing and graphics.
• what sorts of things is R good at?
– there are very many statistical algorithms
– there are very many machine learning
algorithms
– visualization
– it is possible to write scripts that can be reused
Goals of Bioconductor
• Provide access to powerful statistical and
graphical methods for the analysis of genomic
data.
• Facilitate the integration of biological metadata
(GenBank, GO, LocusLink, PubMed) in the
analysis of experimental data.
• Allow the rapid development of extensible,
interoperable, and scalable software.
• Promote high-quality documentation and
reproducible research.
• Provide training in computational and statistical
methods.
Microarray data analysis
Installation
1. Main R software: download from CRAN
(cran.r-project.org), use latest release, now
1.8.0.
2. Bioconductor packages: download from
Bioconductor (www.bioconductor.org),
use latest release, now 1.3.
Available for Linux/Unix, Windows, and
Mac OS.
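The release numbers above reflect the slides' era; as a hedged sketch, a current Bioconductor release is normally installed from within R via the BiocManager package (the packages named below are examples, not requirements of these slides):

# Installing Bioconductor packages with BiocManager (current mechanism)
install.packages("BiocManager")               # BiocManager itself comes from CRAN
BiocManager::install()                        # core Bioconductor packages
BiocManager::install(c("limma", "Biobase"))   # example packages; substitute the ones you need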
Documentation and help
• R manuals and tutorials:available from the R website or
on-line in an R session.
• R on-line help system: detailed on-line documentation,
available in text, HTML, PDF, and LaTeX formats.
> help.start()
> help(lm)
> ?hclust
> apropos(mean)
> example(hclust)
> demo()
> demo(image)
R cluster analysis packages
• cclust: convex clustering methods.
• class: self-organizing maps (SOM).
• cluster:
– AGglomerative NESting (agnes),
– Clustering LARge Applications (clara),
– DIvisive ANAlysis (diana),
– Fuzzy Analysis (fanny),
– MONothetic Analysis (mona),
– Partitioning Around Medoids (pam).
• e1071:
– fuzzy C-means clustering (cmeans),
– bagged clustering (bclust).
• flexmix: flexible mixture modeling.
• fpc: fixed point clusters, clusterwise regression and discriminant plots.
• GeneSOM: self-organizing maps.
• mclust, mclust98: model-based cluster analysis.
• mva:
– hierarchical clustering (hclust),
– k-means (kmeans).
• Specialized summary, plot, and print methods for clustering results.
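A small sketch of one of these, Partitioning Around Medoids from the cluster package, on the simulated expr matrix from earlier (k = 3 is an arbitrary choice):

# Partitioning Around Medoids (pam) from the cluster package (illustrative sketch)
library(cluster)
pm <- pam(expr, k = 3)         # partition samples around 3 medoids
pm$clustering                  # cluster assignment of each sample
pm$medoids                     # the representative (medoid) observations
plot(silhouette(pm))           # silhouette plot to assess cluster quality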
Hierarchical clustering
[Figure: dendrogram produced by the hclust function from the mva package]
Heatmaps
[Figure: clustered heatmap produced by the heatmap function from the mva package]
References
• R: www.r-project.org, cran.r-project.org
– software (CRAN);
– documentation;
– newsletter: R News;
– mailing list.
• Bioconductor: www.bioconductor.org
– software, data, and documentation (vignettes);
– training materials from short courses;
– mailing list.
Conclusions
• Visualization
• PCA and MDS visualization
• Clustering
• Classification
• Bioinformatics resources
• R and Bioconductor