CSCI567 Machine Learning (Fall 2014)
Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu)
November 11, 2014

Outline
1. Dimensionality reduction
   - Curse of dimensionality
   - Principal component analysis

Dimensionality reduction: Motivation
Given high-dimensional data x ∈ R^D, we want to find a low-dimensional representation y ∈ R^M with M < D, in order to:
- Visualize the data and discover intrinsic structure
- Save computational and storage cost
- Enable robust statistical modeling (avoid the curse of dimensionality)

Curse of Dimensionality
Intuition: the higher the dimensionality, the more data points we need to train a model.
- To fill a unit cube in R^D uniformly with data points, we need r^D cells, where each axis is divided into r equal-sized bins. Thus, if the data are distributed that way, models such as decision trees need on the order of r^D training samples to make sure every cell is covered, in case a test sample falls into one of those cells.
- For a unit ball ‖x‖ ≤ 1, a large fraction of the data lives in the shell between the surface ‖x‖ = 1 and the surface ‖x‖ = 1 − ε. This fraction, 1 − (1 − ε)^D, approaches 1 as D grows. Thus, most data points in a high-dimensional space are crowded in the shell and are about the same distance from each other. To have data points that "look" substantially different from the others, we need an exponentially large (in D) number of samples to fill the interior ‖x‖ < 1 − ε.

Problem Definition: Dimension Reduction
We have data x_1, ..., x_n, with each vector x_i being p-dimensional. We want to transform them into λ_1, ..., λ_n, with each vector λ_i being r-dimensional, where r < p. In other words, we reduce the dimensionality of the data. The key question is what properties the new space should have.

Summary of Different Dimension Reduction Techniques
- Principal component analysis (PCA): components with maximum variance
- Independent component analysis (ICA): components that are statistically independent of each other
- Probabilistic principal component analysis: probabilistic version of PCA
- Factor analysis: probabilistic version of ICA

PCA Interpretation #1: Minimum Projection Cost (Pearson, 1901)
Principal components are a useful way to explore high-dimensional data: we look for meaningful linear projections of the data. What do we mean by meaningful? The map from a low-dimensional representation λ to the data's projection on the subspace spanned by W_r (a p × r matrix) is

    f(\lambda) = \mu + W_r \lambda.

The "best fitting hyperplane" solves

    \min_{\{\lambda_i\},\, W_r} \sum_{i=1}^{n} \| x_i - \mu - W_r \lambda_i \|^2.

The solution takes some derivation, but is doable (Homework).
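As a sanity check on Interpretation #1, here is a minimal numpy sketch (not part of the original slides; the synthetic data, variable names, and the use of the SVD to solve the minimization are illustrative assumptions): the best-fitting r-dimensional hyperplane μ + W_r λ is obtained by centering the data and taking the top-r right singular vectors.

```python
# Minimal sketch (assumed setup): solve min_{lambda_i, W_r} sum_i ||x_i - mu - W_r lambda_i||^2
# by centering X and keeping the top-r right singular vectors of the centered data.
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 200, 5, 2
X = rng.normal(size=(n, 3)) @ rng.normal(size=(3, p))   # data with low intrinsic dimension

mu = X.mean(axis=0)                     # estimate of mu
Xc = X - mu                             # centered data
U, D, Vt = np.linalg.svd(Xc, full_matrices=False)

W_r = Vt[:r].T                          # p x r basis of the fitted hyperplane
lam = Xc @ W_r                          # n x r coordinates lambda_i
X_hat = mu + lam @ W_r.T                # projections back into R^p

cost = np.sum((X - X_hat) ** 2)         # value of the minimized objective
print(cost)                             # small: only the discarded variance remains
```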
PCA Interpretation #2: Direction of Maximum Variation (Hotelling, 1933)
Given a data matrix X with n data points of dimension p, find w ∈ R^p with ‖w‖ = 1 such that Var(Xw) is maximized. The vector that satisfies this is called the first principal component. Note that

    \mathrm{Var}(Xw) = \frac{1}{N}\, w^T (X - \bar{X})^T (X - \bar{X})\, w = w^T \Sigma_X w,

where Σ_X is the sample covariance matrix of X.

Key observation: the first principal component w_1 is simply the eigenvector of Σ_X corresponding to its largest eigenvalue.

Finding the K-th Principal Component
We may want to find the k directions of maximum variation. Let

    w_1 = \arg\max_{\|w\| = 1} \mathrm{Var}(Xw)

be the first principal component. The second principal component is defined as

    w_2 = \arg\max_{\|w\| = 1,\; w^T w_1 = 0} \mathrm{Var}(Xw),

that is, the direction of maximal variation that is orthogonal to w_1. The 3rd, 4th, ..., k-th principal components can be defined recursively in this way. They correspond to the k eigenvectors associated with the k largest eigenvalues of Σ_X.

A Refresher of the Maths
Principal components are usually computed by the singular value decomposition (SVD) of the (centered) data matrix X:

    X = U D V^T,

where U is an n × p matrix with orthonormal columns, D is a p × p diagonal matrix, and V is a p × p orthogonal matrix. Schematically,

    X_{n \times p} = U_{n \times p}\, \mathrm{diag}(D_1, \ldots, D_p)\, V^T_{p \times p}.

The columns of V are the principal component vectors, also called "loadings". The columns of U are sometimes called "scores". The magnitudes of the projections of X onto V are in the columns of UD. The diagonal elements of D are the singular values; the variance along the i-th principal component is D_i^2 / N.

Deriving SVD
Find the eigendecomposition of X^T X:

    X^T X = V \Lambda V^T,

where Λ is the diagonal matrix with the eigenvalues λ_i of X^T X on its diagonal, and V is an orthogonal matrix whose columns v_i are eigenvectors of X^T X:

    X^T X v_i = \lambda_i v_i, \quad i = 1, \ldots, p.

It is also known that λ_i ≥ 0. If the rank of X is r (r ≤ p), there are exactly r positive eigenvalues, λ_1, ..., λ_r.

Derivation on Blackboard.

Interpretation of Principal Components
If the variances of the principal components drop off quickly, then X is highly collinear. To reduce the dimensionality of the data, we keep only the principal components with the highest variance. The principal vectors are derived projections of the data and may not have a specific meaning. The scree plot, which shows how the variance decreases with the k-th principal component, is useful.
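The following numpy sketch (not from the slides; the synthetic data and variable names are assumptions) checks the connection stated above: the eigenvectors of the sample covariance Σ_X coincide with the right singular vectors V of the centered data, and the eigenvalues equal D_i^2 / n, which are the quantities one would plot in a scree plot.

```python
# Sketch (assumed setup): PCA via eigendecomposition of Sigma_X versus SVD of centered X.
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 4
X = rng.normal(size=(n, p)) @ np.diag([3.0, 2.0, 0.5, 0.1])

Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / n                        # sample covariance Sigma_X (1/N convention)
evals, evecs = np.linalg.eigh(Sigma)         # returned in ascending order
evals, evecs = evals[::-1], evecs[:, ::-1]   # sort by decreasing eigenvalue

U, D, Vt = np.linalg.svd(Xc, full_matrices=False)
print(np.allclose(evals, D**2 / n))              # True: eigenvalues = squared singular values / n
print(np.allclose(np.abs(Vt.T), np.abs(evecs)))  # True: loadings match eigenvectors up to sign
print(evals / evals.sum())                       # scree-plot data: fraction of variance per component
```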
Other Dimension Reduction Techniques: Independent Component Analysis (ICA)
Dimension reduction: find a set of vectors w_1, ..., w_r that forms a basis for W so that

    x_i = f_{i,1} w_1 + f_{i,2} w_2 + \ldots + f_{i,r} w_r

for some coefficients f_{ij}.
- PCA: find an orthogonal set of vectors w_1, ..., w_r (r < p) that maximizes the projection variance.
- ICA: find a set of basis vectors w_1, ..., w_r (r = p) that are statistically independent.

Differences between ICA and PCA
- Both methods perform linear transformations.
- PCA: low-rank matrix factorization for compression; ICA: full-rank matrix factorization to remove dependency between the rows.
- PCA removes only correlations, not higher-order dependence; ICA removes correlations and higher-order dependence.
- PCA: some components are more important than others (based on eigenvalues); ICA: components are equally important.

ICA Solution and Issues
ICA issue: the Gaussian distribution is spherically symmetric, so mixing Gaussian sources with an orthogonal matrix produces the same distribution; the independent components cannot be identified in that case.

Other Dimension Reduction Techniques: Factor Analysis and Probabilistic PCA
Factor analysis: attempts to interpret principal components may lead to entities with lives of their own, called factors, and consequently to the technique (or bundle of techniques) dealing with them. Suppose we have observations x_1, ..., x_n of a random variable X. Factor analysis assumes

    X = \mu + F W + \epsilon,

where μ is the expected value of X, F is the unknown loading matrix (parameters), W contains the uncorrelated common factors (hidden variables), and ε is Gaussian noise.

Probabilistic PCA: a probabilistic, generative view of the data which assumes the underlying latent variable is Gaussian and the relationship between latent and observed variables is linear. Probabilistic PCA assumes

    W_i \sim N(0, I), \qquad X \sim N(F W + \mu, \sigma^2 I).
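As an illustration of the probabilistic PCA model above, here is a short numpy sketch (not from the slides; the dimensions, variable names, and noise level are assumptions) that generates data from W_i ~ N(0, I), X ~ N(F W + μ, σ² I) and checks that the sample covariance of the generated data is close to the model covariance F F^T + σ² I.

```python
# Sketch of the probabilistic PCA generative model (hypothetical parameters):
# latent w_i ~ N(0, I_r), observation x_i ~ N(F w_i + mu, sigma^2 I_p).
import numpy as np

rng = np.random.default_rng(2)
n, p, r = 5000, 6, 2
F = rng.normal(size=(p, r))          # loading matrix (parameters)
mu = rng.normal(size=p)              # mean of X
sigma = 0.1                          # isotropic noise level

W = rng.normal(size=(n, r))                          # latent variables, N(0, I)
X = W @ F.T + mu + sigma * rng.normal(size=(n, p))   # observed data

# Under this model, Cov(X) = F F^T + sigma^2 I; the sample covariance should be close.
Sigma_model = F @ F.T + sigma**2 * np.eye(p)
Sigma_sample = np.cov(X, rowvar=False)
print(np.abs(Sigma_model - Sigma_sample).max())      # small for large n
```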