Linear and Non-linear Dimensionality Reduction Applied to Gene Expression Data of Cancer Tissue Samples

Franck Olivier Ndjakou Njeunje
Applied Mathematics, Statistics, and Scientific Computation
Advisers: Wojtek Czaja, John J. Benedetto
Norbert Wiener Center for Harmonic Analysis
Department of Mathematics, University of Maryland - College Park
October 9, 2014

Outline
- Introduction: Background, Approach
- Software and Hardware
- Methods: Principal Component Analysis [1], Laplacian Eigenmap [2], Hierarchical Clustering, K-means Clustering, Implementation
- Validation and Test Problem: Validation Methods
- Extra: Timeline, Deliverables, Bibliography

Gene expression matrix
Gene expression data:
- record the interaction between genes and environment
- produced by high-density micro-arrays
- high-dimensional
Similarity learning reveals:
- gene function
- disease mechanisms
Dimension reduction:
- Linear (LDR)
- Non-linear (NDR)
(Figure: a gene expression matrix; each column is the expression of a single gene/variable.)

Similarity learning: Clustering
Clustering: elements in the same cluster are more similar to one another than to elements in other clusters.
Example: K-means and Expectation-Maximization clustering applied to an artificial "mouse" dataset.¹
¹ Image by Chire, Wikipedia.

Approach
For this project I will be considering the following methods:
1. Dimension reduction algorithms
   - LDR: Principal Component Analysis (PCA)
   - NDR: Laplacian Eigenmap (LE)
2. Clustering algorithms
   - Hierarchical clustering (HC)
   - K-means (KM)

Principal Component Analysis [1]
- Linear; preserves the most variance in the data.
- Does not do well on data with non-linear structure.
- Step 1: Compute the standardized matrix \tilde{X} of the original matrix X:

    \tilde{X} = (\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_M)                                   (1)
              = \left( \frac{x_1 - \bar{x}_1}{\sqrt{\sigma_{11}}}, \frac{x_2 - \bar{x}_2}{\sqrt{\sigma_{22}}}, \ldots, \frac{x_M - \bar{x}_M}{\sqrt{\sigma_{MM}}} \right).   (2)

  Here \bar{x}_1, \ldots, \bar{x}_M and \sigma_{11}, \ldots, \sigma_{MM} are respectively the means and the variances of the corresponding variable vectors.
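The standardization in Step 1 can be sketched in a few lines of NumPy (a hedged illustration, not the project's Matlab code; the data matrix X is assumed to hold the N samples as rows and the M variables as columns):

```python
import numpy as np

def standardize(X):
    """Step 1: column-wise standardization of the N x M data matrix X."""
    x_bar = X.mean(axis=0)                 # mean of each variable vector
    sigma = X.var(axis=0)                  # variances sigma_11, ..., sigma_MM
    return (X - x_bar) / np.sqrt(sigma)    # (x_j - x_bar_j) / sqrt(sigma_jj)
```

After this step every column has zero mean and unit variance, so no single variable dominates the covariance computed next.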
- Step 2: Compute the covariance matrix of \tilde{X}, then perform a spectral decomposition to obtain its eigenvalues and the corresponding eigenvectors:

    C = \tilde{X}' \tilde{X} = U \Lambda U'.                                                      (3)

  Here \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_M) with \lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_M, and U = (u_1, u_2, \ldots, u_M); \lambda_i and u_i are respectively the i-th eigenvalue and the corresponding eigenvector of the covariance matrix C.
- Step 3: Determine the number of principal components from a preset threshold. Supposing that number to be m, the i-th principal component is \tilde{X} u_i, and the reduced m-dimensional (N × m) subspace is \tilde{X} U_m.²

² Jinlong Shi, Zhigang Luo, Nonlinear dimensionality reduction of gene expression data for visualization and clustering analysis of cancer tissue samples [1].

Laplacian Eigenmap [2]
- Non-linear; preserves the natural geometric structure of the data.
- Step 1: Given a set of N points x_1, x_2, \ldots, x_N in R^M, construct a weighted graph with N nodes. Two ways to place the edges:
  - \epsilon-neighborhoods: link nodes that are within \epsilon of each other. This can lead to a disconnected graph.
  - k nearest neighbors: each node is linked to its k nearest neighbors.
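The k-nearest-neighbor construction in Step 1 can be sketched as follows (a NumPy illustration assuming distinct points; the adjacency is symmetrized so the graph stays undirected):

```python
import numpy as np

def knn_graph(X, k):
    """Step 1 (kNN variant): link each of the N points (rows of X)
    to its k nearest neighbors; returns the N x N adjacency matrix."""
    # pairwise squared Euclidean distances
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    A = np.zeros_like(d2)
    for i in range(len(X)):
        nbrs = np.argsort(d2[i])[1:k + 1]   # skip the point itself
        A[i, nbrs] = 1.0
    return np.maximum(A, A.T)               # i~j if either is a kNN of the other
```

Symmetrizing with the elementwise maximum keeps an edge whenever at least one endpoint counts the other among its k nearest neighbors.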
- Step 2: Choose the weights for the edges and construct the weight matrix W.³
- Step 3: For each connected sub-graph, solve the generalized eigenvector problem

    L f = \lambda D f,                                                                            (4)

  where D_{ii} = \sum_j W_{ji} is the diagonal degree matrix and L = D - W is the Laplacian matrix. Let f_0, f_1, \ldots, f_{N-1} be the solutions of (4) with corresponding eigenvalues \lambda_0, \lambda_1, \ldots, \lambda_{N-1}, so that L f_i = \lambda_i D f_i for i = 0, \ldots, N-1 and 0 = \lambda_0 \leq \lambda_1 \leq \ldots \leq \lambda_{N-1}. The m-dimensional Euclidean embedding is then given by

    x_i \to y_i = (f_1(i), \ldots, f_m(i)).                                                       (5)

³ Mikhail Belkin, Partha Niyogi, Laplacian Eigenmaps for Dimensionality Reduction and Data Representation [2].

Hierarchical Clustering
- Choose a metric:
  - Euclidean distance:

      \|a - b\|_2 = \sqrt{ \sum_i (a_i - b_i)^2 }                                                 (6)

  - Manhattan distance:

      \|a - b\|_1 = \sum_i |a_i - b_i|                                                            (7)

(Figure: a data set and the dendrogram produced by bottom-up HC.)
- Linkage criteria:
  - Maximum or CLINK (complete-linkage clustering):

      \max \{ d(a, b) : a \in A, b \in B \}.                                                      (8)

  - Minimum or SLINK (single-linkage clustering):

      \min \{ d(a, b) : a \in A, b \in B \}.                                                      (9)
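Steps 2 and 3 of the Laplacian Eigenmap can be sketched as follows (a hedged NumPy/SciPy illustration: it takes an already-built symmetric weight matrix W, whether heat-kernel or simple 0/1 weights, and assumes every node has positive degree so that the generalized eigenproblem is well posed):

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmap(W, m):
    """Steps 2-3: given a symmetric weight matrix W on a connected graph,
    solve L f = lambda D f and return the m-dimensional embedding (5)."""
    D = np.diag(W.sum(axis=1))        # degree matrix, D_ii = sum_j W_ji
    L = D - W                         # graph Laplacian
    # generalized symmetric eigenproblem; eigenvalues come back ascending
    vals, F = eigh(L, D)
    # drop f_0 (constant eigenvector with lambda_0 = 0);
    # the embedding is y_i = (f_1(i), ..., f_m(i))
    return F[:, 1:m + 1]
```

`scipy.linalg.eigh(L, D)` solves the generalized problem directly, which matches (4) without explicitly forming D^{-1}L.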
  - Mean or average-linkage clustering:

      \frac{1}{|A||B|} \sum_{a \in A} \sum_{b \in B} d(a, b).                                     (10)

K-means Clustering
- Initialize a set of k means m_1^{(1)}, m_2^{(1)}, \ldots, m_k^{(1)}.
- Assignment step: assign each observation x_p to exactly one set S_i, the one containing the nearest mean to x_p:

    S_i^{(t)} = \{ x_p : \|x_p - m_i^{(t)}\|^2 \leq \|x_p - m_j^{(t)}\|^2 \ \forall j, 1 \leq j \leq k \}.   (11)

- Update step: update the mean within each cluster:

    m_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_j \in S_i^{(t)}} x_j.                             (12)

- Repeat the two previous steps; stop when no new assignments are made.

K-means illustration
(Figure: k initial means with k = 3; assignment step; update step; repeat.)

Software and Hardware
1. Platform
   - Matlab
   - Personal laptop
   - Norbert Wiener Center lab
2. Database
   - NCI-60 [3, p. 95-96]
   - Computer-generated data for test runs
3. Complexity: large matrix operations
4. Algorithms implemented
   - PCA from the singular value decomposition
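Both clustering algorithms above have compact reference forms. The sketch below implements the K-means assignment/update loop (Eqs. (11)-(12)) in NumPy and notes how SciPy exposes the three linkage criteria (a hedged illustration, not the project's Matlab tooling):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def kmeans(X, k, init=None, iters=100, seed=0):
    """Lloyd's algorithm: alternate the assignment step (11) and the
    update step (12); stop when the assignments no longer change."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)] if init is None else init
    labels = None
    for _ in range(iters):
        # assignment step: each x_p goes to the set S_i of its nearest mean
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                  # no new assignments
        labels = new_labels
        # update step: mean of each cluster (old mean kept if a cluster empties)
        means = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                          else means[i] for i in range(k)])
    return labels, means

# Hierarchical clustering with the linkage criteria (8)-(10) via SciPy:
#   Z = linkage(X, method='complete')  # CLINK, Eq. (8)
#   Z = linkage(X, method='single')    # SLINK, Eq. (9)
#   Z = linkage(X, method='average')   # Eq. (10)
#   labels = fcluster(Z, t=2, criterion='maxclust')
```

The optional `init` argument mirrors the "initialize a set of k means" step; by default the initial means are drawn from the data points themselves.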
   - Laplacian Eigenmaps

Validation and Test Problem
- The DRtoolbox⁴ [4] contains:
  - implementations of the PCA and LE methods described above, which will be used for comparison;
  - well-understood data sets such as the Swiss roll and the twin peaks.
- Clustering algorithms will be borrowed and used as a tool to compare the outputs of the DR methods.
⁴ Laurens van der Maaten, Delft University of Technology.

Timeline
- October - November:
  - Implementation of the PCA algorithm.
  - Resolve issues that come up (storage and memory).
  - Testing and validation.
- December: Mid-year presentation.
- January: First-semester progress report.
- February - April:
  - Implementation of the LE algorithm.
  - Testing and validation.
  - Implementation of a clustering algorithm (if time permits).
- April - May:
  Final report.

Deliverables
- Weekly reports
- Self introduction
- Project proposal
- First-semester progress report
- Mid-year status report
- Final report
- Code for the PCA implementation
- Code for the LE implementation
- NCI-60 data set

Bibliography
[1] Jinlong Shi, Zhigang Luo. Nonlinear dimensionality reduction of gene expression data for visualization and clustering analysis of cancer tissue samples. Computers in Biology and Medicine 40 (2010) 723-732.
[2] Mikhail Belkin, Partha Niyogi. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation 15 (2003) 1373-1396.
[3] Vinodh N. Rajapakse. Data Representation for Learning and Information Fusion in Bioinformatics. Digital Repository at the University of Maryland, University of Maryland (College Park, Md.), 2013.
[4] Laurens van der Maaten, Delft University of Technology. Matlab Toolbox for Dimensionality Reduction (v0.8.1b), March 21, 2013.
