
Linear and Non-linear Dimensionality Reduction
applied to gene expression data of cancer
tissue samples
Franck Olivier Ndjakou Njeunje
Applied Mathematics, Statistics, and Scientific Computation
Advisers
Wojtek Czaja
John J. Benedetto
Norbert Wiener Center for Harmonic Analysis
Department of Mathematics
University of Maryland - College Park
October 9, 2014

Outline

- Introduction
  - Background
  - Approach
- Software and Hardware
- Principal Component Analysis [1]
- Laplacian Eigenmap [2]
- Hierarchical Clustering
- K-means Clustering
- Implementation
- Validation and Test Problem
- Validation Methods
- Extra
  - Timeline
  - Deliverable
  - Bibliography

Gene expression matrix

Gene expression data:
- Interaction between genes and environment
- High-density micro-array
- High dimension

Similarity learning:
- Gene function
- Disease mechanism

Dimension reduction:
- Linear (LDR)
- Non-linear (NDR)

[Figure: gene expression matrix; expression of a single gene/variable.]

Similarity learning: Clustering

- Clustering: elements in the same cluster are more similar to one another than to elements in other clusters.
- Example: K-means and Expectation-Maximization clustering applied to an artificial "mouse" dataset (image by Chire, Wikipedia).

Approach

For this project I will be considering the following methods:

1. Dimension reduction algorithms
   - LDR: Principal Component Analysis (PCA)
   - NDR: Laplacian Eigenmap (LE)
2. Clustering algorithms
   - Hierarchical clustering (HC)
   - K-means (KM)

Principal Component Analysis [1]

- Principal Component Analysis:
  - Preserves the most variance in the data.
  - Does not do well on data with non-linear structure.

- Step 1: Compute the standardized matrix X̃ of the original matrix X,

    X̃ = (x̃_1, x̃_2, ..., x̃_M)                                              (1)
      = ( (x_1 − x̄_1)/√σ_11, (x_2 − x̄_2)/√σ_22, ..., (x_M − x̄_M)/√σ_MM ).  (2)

  Here x̄_1, x̄_2, ..., x̄_M and σ_11, σ_22, ..., σ_MM are respectively the mean values and the variances of the corresponding variable vectors.

- Step 2: Compute the covariance matrix of X̃, then perform a spectral decomposition to get the eigenvalues and their corresponding eigenvectors,

    C = X̃ᵀX̃ = UΛUᵀ.                                                        (3)

  Here Λ = diag(λ_1, λ_2, ..., λ_M) with λ_1 ≥ λ_2 ≥ ... ≥ λ_M, and U = (u_1, u_2, ..., u_M); λ_i and u_i are respectively the i-th eigenvalue and corresponding eigenvector of the covariance matrix C.

- Step 3: Determine the number of principal components based on a predetermined value (e.g., the fraction of variance to retain). Supposing the number to be m, the i-th principal component can be computed as X̃u_i, and the reduced (N × m)-dimensional subspace is X̃U_m, where U_m = (u_1, ..., u_m) [1].
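
A minimal Matlab sketch of Steps 1-3, assuming X is an N × M matrix (samples in rows, variables in columns) and m ≤ M components are kept; pca_sketch is an illustrative name, not the project's final implementation:

    function Y = pca_sketch(X, m)
        % Step 1: standardize each variable (zero mean, unit variance).
        xbar  = mean(X, 1);                         % column means
        sigma = var(X, 0, 1);                       % column variances sigma_ii
        Xt = bsxfun(@rdivide, bsxfun(@minus, X, xbar), sqrt(sigma));

        % Step 2: covariance matrix, Eq. (3), and its spectral decomposition.
        C = Xt' * Xt;                               % a 1/(N-1) factor would not
        [U, Lambda] = eig(C);                       % change the eigenvectors
        [~, order] = sort(diag(Lambda), 'descend'); % lambda_1 >= ... >= lambda_M
        U = U(:, order);

        % Step 3: keep the first m principal components, Y = Xt * U_m.
        Y = Xt * U(:, 1:m);                         % N-by-m reduced subspace
    end

The same subspace can be obtained from [~, ~, V] = svd(Xt, 'econ') without forming C, which is the route listed later under Implementation (PCA from the singular value decomposition) and is numerically better conditioned.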

Laplacian Eigenmap [2]

- Laplacian Eigenmap:
  - Non-linear.
  - Preserves the natural geometrical structure of the data.

- Step 1: Given a set of N points x_1, x_2, ..., x_N in R^M, construct a weighted graph with N nodes.
  - ε-neighborhoods: link nodes that are within ε of each other. This could lead to a disconnected graph.
  - k nearest neighbors: each node is linked to its k nearest neighbors.

- Step 2: Choose the weights for the edges and construct the weight matrix W [2].
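
A small Matlab sketch of Steps 1-2, assuming a k-nearest-neighbor graph and the heat-kernel weights W_ij = exp(−‖x_i − x_j‖²/t) proposed in [2]; le_weights and the kernel width t are illustrative choices of mine:

    function W = le_weights(X, k, t)
        % Step 1: pairwise squared Euclidean distances between the N points.
        N  = size(X, 1);
        sq = sum(X.^2, 2);
        D2 = bsxfun(@plus, sq, sq') - 2 * (X * X');

        W = zeros(N);
        for i = 1:N
            [~, idx] = sort(D2(i, :));       % idx(1) is the point itself
            nbrs = idx(2:k+1);               % link to the k nearest neighbors
            % Step 2: heat-kernel weights on the chosen edges.
            W(i, nbrs) = exp(-D2(i, nbrs) / t);
        end
        W = max(W, W');  % symmetrize: keep an edge if either endpoint chose it
    end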

- Step 3: For each connected sub-graph, solve the generalized eigenvector problem

    Lf = λDf,                                                               (4)

  where D_ii = Σ_j W_ji is the diagonal degree matrix and L = D − W is the Laplacian matrix.

  Let f_0, f_1, ..., f_{N−1} be the solutions of (4) with corresponding eigenvalues λ_0, λ_1, ..., λ_{N−1}, such that Lf_i = λ_i Df_i for i from 0 to N − 1 and 0 = λ_0 ≤ λ_1 ≤ ... ≤ λ_{N−1}. Then the m-dimensional Euclidean space embedding is given by

    x_i → y_i = (f_1(i), ..., f_m(i)).                                      (5)
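
Continuing the sketch, Step 3 on a connected graph reduces to one call to Matlab's generalized eigensolver (dense eig here for clarity; eigs would be the practical choice for large N):

    function Y = le_embed(W, m)
        D = diag(sum(W, 2));            % degree matrix, D_ii = sum_j W_ji
        L = D - W;                      % graph Laplacian
        [F, Lam] = eig(L, D);           % generalized problem L*f = lambda*D*f
        [~, order] = sort(diag(Lam));   % 0 = lambda_0 <= lambda_1 <= ...
        F = F(:, order);
        Y = F(:, 2:m+1);                % drop the constant f_0; rows give Eq. (5)
    end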

Hierarchical Clustering

- Choose a metric:
  - Euclidean distance:

      ‖a − b‖_2 = √( Σ_i (a_i − b_i)² )                                     (6)

  - Manhattan distance:

      ‖a − b‖_1 = Σ_i |a_i − b_i|                                           (7)

[Figure: a data set and its bottom-up HC dendrogram.]
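
As a quick sanity check of (6)-(7), both metrics are one-liners in Matlab for two vectors a and b of the same length:

    d2 = sqrt(sum((a - b).^2));   % Euclidean distance, Eq. (6)
    d1 = sum(abs(a - b));         % Manhattan distance, Eq. (7)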

- Linkage criteria:
  - Maximum or CLINK (complete linkage clustering):

      max{ d(a, b) : a ∈ A, b ∈ B }.                                        (8)

  - Minimum or SLINK (single linkage clustering):

      min{ d(a, b) : a ∈ A, b ∈ B }.                                        (9)

  - Mean or average linkage clustering:

      (1/(|A||B|)) Σ_{a∈A} Σ_{b∈B} d(a, b).                                 (10)
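
A sketch of bottom-up HC under the three linkage criteria, assuming the pdist, linkage, cluster, and dendrogram functions of Matlab's Statistics Toolbox are available and X is N × M with one observation per row:

    d  = pdist(X, 'euclidean');           % metric (6); 'cityblock' gives (7)
    Zc = linkage(d, 'complete');          % maximum / CLINK, Eq. (8)
    Zs = linkage(d, 'single');            % minimum / SLINK, Eq. (9)
    Za = linkage(d, 'average');           % mean / average linkage, Eq. (10)
    labels = cluster(Za, 'maxclust', 3);  % cut the dendrogram into 3 clusters
    dendrogram(Za);                       % visualize the bottom-up merges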

K-means Clustering

- Initialize a set of k means m_1^(1), m_2^(1), ..., m_k^(1).
- Assignment step: assign each observation x_p to exactly one set S_i containing the nearest mean to x_p,

    S_i^(t) = { x_p : ‖x_p − m_i^(t)‖² ≤ ‖x_p − m_j^(t)‖² ∀ j, 1 ≤ j ≤ k }.  (11)

- Update step: update the mean within each cluster,

    m_i^(t+1) = (1/|S_i^(t)|) Σ_{x_j ∈ S_i^(t)} x_j.                        (12)

- Repeat the two previous steps.
- Stop when no new assignments are made.
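
A minimal Matlab sketch of this loop, assuming the k initial means are drawn as random distinct rows of X; kmeans_sketch is an illustrative name (Matlab's built-in kmeans could be used instead):

    function [labels, mu] = kmeans_sketch(X, k)
        N = size(X, 1);
        mu = X(randperm(N, k), :);       % k initial means m_1^(1), ..., m_k^(1)
        labels = zeros(N, 1);
        while true
            % Assignment step, Eq. (11): each x_p goes to its nearest mean.
            d2 = bsxfun(@plus, sum(X.^2, 2), sum(mu.^2, 2)') - 2 * (X * mu');
            [~, new] = min(d2, [], 2);
            if isequal(new, labels)      % stop when no new assignments are made
                break;
            end
            labels = new;
            % Update step, Eq. (12): recompute the mean within each cluster.
            for i = 1:k
                if any(labels == i)      % guard against empty clusters
                    mu(i, :) = mean(X(labels == i, :), 1);
                end
            end
        end
    end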

K-means illustration:

[Figure: four panels showing k initial means (k = 3), the assignment step, the update step, and the repeated iterations.]

Software and Hardware

1. Platform
   - Matlab
   - Personal laptop
   - Norbert Wiener Center lab
2. Database
   - NCI-60 [3, p. 95-96]
   - Computer-generated data for test runs
3. Complexity: large matrix operations
4. Algorithms implemented
   - PCA from the singular value decomposition
   - Laplacian Eigenmaps

Validation and Test Problem

- DRtoolbox [4] contains:
  - Implementations of the PCA and LE methods described above; these will be used for comparison.
  - Well-understood data sets such as the Swiss roll and the Twin peaks.

[Figure: the Swiss roll and Twin peaks data sets.]

- Clustering algorithms will be borrowed and used as a tool to compare the outputs of the DR methods.
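
A hedged sketch of the planned comparison; the generate_data and compute_mapping entry points and their arguments below are my reading of the DRtoolbox v0.8.1b documentation [4], so names and signatures should be checked against the toolbox itself:

    [X, labels] = generate_data('swiss', 2000);      % well-understood test data
    Y_pca = compute_mapping(X, 'PCA', 2);            % toolbox PCA, for comparison
    Y_le  = compute_mapping(X, 'Laplacian', 2, 12);  % toolbox LE, 12-NN graph
    scatter(Y_le(:, 1), Y_le(:, 2), 9, labels);      % color by the known structure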

Timeline

- October - November:
  - Implementation of PCA algorithm.
  - Resolve issues that come up (storage and memory).
  - Testing and validating.
- December: Mid-year presentation.
- January: First-semester progress report.
- February - April:
  - Implementation of LE algorithm.
  - Testing and validating.
- April - May:
  - Implementation of a clustering algorithm (if time permits).
- May: Final report.

Deliverable

- Weekly reports
- Self introduction
- Project proposal
- First-semester progress report
- Mid-year status report
- Final report
- Code for PCA implementation
- Code for LE implementation
- NCI-60 data set

Bibliography

[1] Jinlong Shi, Zhigang Luo. Nonlinear dimensionality reduction of gene expression data for visualization and clustering analysis of cancer tissue samples. Computers in Biology and Medicine 40 (2010) 723-732.

[2] Mikhail Belkin, Partha Niyogi. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation 15, 1373-1396 (2003).

[3] Vinodh N. Rajapakse (2013). Data Representation for Learning and Information Fusion in Bioinformatics. Digital Repository at the University of Maryland, University of Maryland (College Park, Md.).

[4] Laurens van der Maaten, Delft University of Technology. Matlab Toolbox for Dimensionality Reduction (v0.8.1b), March 21, 2013.