CSCI567 Machine Learning (Fall 2014)
Drs. Sha & Liu
{feisha,yanliu.cs}@usc.edu
November 11, 2014
Outline
1. Dimensionality reduction
   - Curse of Dimensionality
   - Principal component analysis
Dimensionality reduction
Motivation: Given high-dimensional data $x \in \mathbb{R}^D$, we want to find a low-dimensional representation $y \in \mathbb{R}^M$ with $M < D$, in order to:
- Visualize data and discover intrinsic structure
- Save computational and storage cost
- Enable robust statistical modeling (avoid the curse of dimensionality)
Curse of dimensionality
Intuition: The higher the dimensionality, the more data points we need to train a model.
- To fill a unit cube in $\mathbb{R}^D$ uniformly with data points, we need $(1/r)^D$ cells, where $r$ is the edge length of the small cells (i.e., each axis is divided into segments of length $r$). Thus, if the data are distributed that way, models such as decision trees need on the order of $(1/r)^D$ training samples to make sure every cell is covered, in case a test sample falls into one of those cells.
- For a unit ball $\|x\| \le 1$, a large percentage of the data live in the shell between the surface $\|x\| = 1$ and the surface $\|x\| = 1 - \epsilon$. That percentage, $1 - (1 - \epsilon)^D$, approaches 1 as $D$ grows. Thus, most data points in a high-dimensional space are crowded in the shell and are about the same distance from each other. To have data points that "look" substantially different from the others, we need an exponentially large (in $D$) number of samples to fill the void $\|x\| < 1 - \epsilon$.
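As a quick numerical illustration (not from the slides), the following numpy sketch evaluates the shell fraction $1 - (1 - \epsilon)^D$ for a few dimensions, with an illustrative choice of $\epsilon = 0.1$:

```python
import numpy as np

# Fraction of a unit ball's volume that lies in the outer shell
# between radius 1 - eps and 1, as a function of dimension D.
eps = 0.1  # shell thickness (an illustrative choice)
for D in [1, 2, 10, 50, 100, 500]:
    shell_fraction = 1.0 - (1.0 - eps) ** D
    print(f"D = {D:4d}: fraction of volume in shell = {shell_fraction:.4f}")
```

Already at $D = 50$ the shell holds more than 99% of the volume, which is the crowding effect described above.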
Problem Definition: Dimension Reduction
Problem Definition
We have data $x_1, \ldots, x_n$, with each vector $x_i$ being $p$-dimensional.
We want to transform them into $\lambda_1, \ldots, \lambda_n$, with each vector $\lambda_i$ being $r$-dimensional, where $r < p$.
In other words, we reduce the dimensionality of the data.
The key question is what properties the new space should have.
Summary of Different Dimension Reduction Techniques
Principal component analysis: components with maximum variance
Independent component analysis: components that are statistically independent from each other
Probabilistic principal component analysis: probabilistic version of PCA
Factor analysis: probabilistic version of ICA
PCA Interpretation #1: Minimum Projection Cost
(Pearson, 1901)
Principal components
- A useful way to explore high-dimensional data.
- Look for meaningful linear projections of the data. What do we mean by meaningful?

A data point is approximated from its projection onto the subspace $w_r$ (a $p \times r$ matrix) by
$$f(\lambda) = \mu + w_r \lambda.$$
"Best fitting hyperplane":
$$\min_{\{\lambda_i\},\, w_r} \sum_{i=1}^{n} \| x_i - \mu - w_r \lambda_i \|^2$$
The solution takes some derivation, but is doable (Homework).
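A minimal numpy sketch (not from the slides) of this minimum-reconstruction-error view: fit an $r$-dimensional subspace by truncated SVD of the centered data and evaluate the objective above. The variable names and toy data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 200, 5, 2
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))  # correlated toy data

mu = X.mean(axis=0)
Xc = X - mu                        # centered data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
W_r = Vt[:r].T                     # p x r basis of the best-fitting subspace
Lam = Xc @ W_r                     # n x r coordinates (the lambda_i's)

# Reconstruction error: sum_i || x_i - mu - W_r lambda_i ||^2
err = np.sum((Xc - Lam @ W_r.T) ** 2)
print("reconstruction error:", err)
print("sum of discarded squared singular values:", np.sum(S[r:] ** 2))
```

The two printed numbers agree, reflecting that the truncated SVD minimizes the projection cost.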
PCA Interpretation #2: Direction of maximum variation
(Hotelling, 1933)
Given a data matrix $X$ with $n$ data points of dimension $p$, find $W \in \mathbb{R}^p$ such that
$$\|W\| = 1 \quad \text{and} \quad \mathrm{Var}(XW) \text{ is maximized.}$$
The vector that satisfies the above is called the first principal component. Note that
$$\mathrm{Var}(XW) = \frac{1}{n}\, W^\top (X - \bar{X})^\top (X - \bar{X})\, W = W^\top \Sigma_X W,$$
where $\Sigma_X$ is the sample covariance matrix of $X$.
Key Observation: The first principal component $w_1$ is simply the eigenvector of $\Sigma_X$ corresponding to its largest eigenvalue.
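A small numpy check of this observation (illustrative, not part of the slides): compute the sample covariance of toy data and compare its top eigenvalue with the variance of the projection onto the corresponding eigenvector.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 4
X = rng.normal(size=(n, p)) @ np.diag([3.0, 1.0, 0.5, 0.2])  # anisotropic toy data

Xc = X - X.mean(axis=0)
Sigma = (Xc.T @ Xc) / n                   # sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigh returns ascending eigenvalues
w1 = eigvecs[:, -1]                       # eigenvector of the largest eigenvalue

# The variance of the projection X w1 equals the largest eigenvalue.
print("largest eigenvalue:", eigvals[-1])
print("variance along w1 :", np.var(Xc @ w1))
```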
Finding the k-th Principal Component
We may want to find the $k$ directions of maximum variation. Let
$$w_1 = \arg\max_{\|w\| = 1} \mathrm{Var}(Xw)$$
be the first principal component. The second principal component is defined as
$$w_2 = \arg\max_{\|w\| = 1,\; w^\top w_1 = 0} \mathrm{Var}(Xw),$$
that is, the direction of maximal variation that is orthogonal to $w_1$.
The 3rd, 4th, ..., $k$-th principal components can be defined recursively in this way. They correspond to the $k$ eigenvectors associated with the $k$ largest eigenvalues of $\Sigma_X$.
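Continuing the earlier sketch (illustrative, not from the slides), the top-$k$ eigenvectors of the sample covariance can be taken directly; a quick check confirms they are orthonormal and that the projected variances are the $k$ largest eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))
Xc = X - X.mean(axis=0)
Sigma = (Xc.T @ Xc) / len(Xc)

k = 3
eigvals, eigvecs = np.linalg.eigh(Sigma)      # ascending order
W_k = eigvecs[:, ::-1][:, :k]                 # p x k matrix of top-k eigenvectors

print("orthonormal        :", np.allclose(W_k.T @ W_k, np.eye(k)))
print("projected variances:", np.var(Xc @ W_k, axis=0))
print("top-k eigenvalues  :", eigvals[::-1][:k])
```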
A Math Refresher
Principal components are usually computed by the Singular Value Decomposition (SVD) of $X$, which gives
$$X = U D V^\top,$$
where
- $U$: $n \times p$ orthogonal matrix,
- $D$: $p \times p$ diagonal matrix,
- $V$: $p \times p$ orthogonal matrix.
We can visualize it as follows:
$$\underbrace{X}_{n \times p} \;=\; \underbrace{U}_{n \times p} \;\underbrace{\mathrm{diag}(D_1, \ldots, D_p)}_{p \times p} \;\underbrace{V^\top}_{p \times p}$$
The columns of $V$ are the principal component vectors, also called "loadings". The columns of $U$ are sometimes called "scores". The projections of $X$ onto $V$ are the columns of $UD$. The squared diagonal elements of $D$, divided by $n$, are the variances along the principal component vectors.
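A short numpy sketch (illustrative only; variable names are made up) that ties the SVD of the centered data back to the PCA quantities above:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
Xc = X - X.mean(axis=0)                    # PCA is applied to centered data
n = len(Xc)

U, D, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * D                             # columns of U D: projections of X onto V

Sigma = (Xc.T @ Xc) / n
eigvals = np.linalg.eigvalsh(Sigma)[::-1]  # descending eigenvalues

print("D^2 / n             :", D ** 2 / n)
print("eigenvalues of Sigma:", eigvals)
print("scores == Xc @ V    :", np.allclose(scores, Xc @ Vt.T))
```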
Deriving SVD
Find the eigendecomposition of $X^\top X$:
$$X^\top X = V \Lambda V^\top,$$
where $\Lambda$ is the diagonal matrix with the eigenvalues $\lambda_i$ of $X^\top X$ on its diagonal.
The matrix $V$ is orthogonal, and its columns $v_i$ are eigenvectors of $X^\top X$:
$$X^\top X v_i = \lambda_i v_i, \quad \text{for } i = 1, \ldots, p.$$
It is also known that $\lambda_i \ge 0$. If the rank of $X$ is $r$ ($r \le p$), there are exactly $r$ positive eigenvalues, $\lambda_1, \ldots, \lambda_r$.
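As a quick illustration of this connection (not from the slides), the singular values of $X$ are the square roots of the eigenvalues of $X^\top X$:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))

# Eigendecomposition of X^T X ...
lam, V = np.linalg.eigh(X.T @ X)       # ascending eigenvalues, orthonormal V
lam = lam[::-1]                        # sort descending to match SVD convention

# ... versus the SVD of X: the singular values are sqrt(lambda_i).
_, S, _ = np.linalg.svd(X, full_matrices=False)
print("sqrt(eigenvalues):", np.sqrt(np.clip(lam, 0, None)))
print("singular values  :", S)
```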
Derivation on Blackboard.
Interpretation of Principal Components
- If the variances of the principal components drop off quickly, then $X$ is highly collinear.
- To reduce the dimensionality of the data, we keep only the principal components with the highest variance.
- The principal component vectors are directions derived from the data, and may not have a specific meaning.
- The scree plot, which shows how the variance decreases with the $k$-th principal component, is a useful diagnostic.
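A minimal matplotlib sketch (illustrative, not from the slides) of a scree plot built from the singular values of centered toy data:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 8))
Xc = X - X.mean(axis=0)

# Variance explained by each principal component.
_, S, _ = np.linalg.svd(Xc, full_matrices=False)
variances = S ** 2 / len(Xc)

plt.plot(np.arange(1, len(variances) + 1), variances, "o-")
plt.xlabel("principal component k")
plt.ylabel("variance")
plt.title("Scree plot")
plt.show()
```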
Other Dimension Reduction Techniques: Independent
Component Analysis (ICA)
Dimension Reduction: Find a set of vectors $w_1, \ldots, w_r$ that forms a basis for $W$ so that
$$x_i = f_{i,1} w_1 + f_{i,2} w_2 + \cdots + f_{i,r} w_r$$
for some coefficients $f_{i,j}$.
- PCA: Find an orthogonal set of vectors $w_1, \ldots, w_r$ ($r < p$) that maximizes the projection variance.
- ICA: Find a set of basis vectors $w_1, \ldots, w_r$ ($r = p$) that are statistically independent.
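As an illustrative sketch only (assuming scikit-learn's FastICA is available; this is not part of the lecture), the classic two-source demixing setup looks like this:

```python
import numpy as np
from sklearn.decomposition import FastICA  # assumes scikit-learn is installed

rng = np.random.default_rng(6)
t = np.linspace(0, 8, 2000)
s1 = np.sign(np.sin(3 * t))             # square-wave source
s2 = rng.uniform(-1, 1, size=t.shape)   # uniform-noise source (non-Gaussian)
S = np.c_[s1, s2]

A = np.array([[1.0, 0.5], [0.5, 1.0]])  # mixing matrix
X = S @ A.T                             # observed mixtures

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)            # recovered sources (up to scale/permutation)
print("recovered source shape:", S_hat.shape)
```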
Differences between ICA and PCA
- Both methods perform linear transformations.
- PCA: low-rank matrix factorization for compression; ICA: full-rank matrix factorization to remove dependency between the rows.
- PCA removes only correlations, not higher-order dependence; ICA removes correlations and higher-order dependence.
- PCA: some components are more important than others (based on eigenvalues); ICA: components are equally important.
ICA Solution and Issues
ICA Solutions
ICA Issues: The Gaussian distribution is spherically symmetric. Mixing it with an orthogonal matrix ... produces the same distribution, so ICA cannot identify the mixing when the sources are Gaussian.
Other Dimension Reduction Techniques: Factor Analysis
and Probabilistic PCA
Factor analysis: Attempts to interpret principal components may lead to entities with lives of their own: factors, and consequently to the technique (or bundle of techniques) for dealing with them. Suppose we have observations $x_1, \ldots, x_n$ of a random variable $X$. The factor analysis model assumes
$$X = \mu + F W + \epsilon,$$
where $\mu$ is the expected value of $X$, $F$ is the unknown loading matrix (parameters), $W$ contains the uncorrelated common factors (hidden variables), and $\epsilon$ is Gaussian noise.
Probabilistic PCA: A probabilistic, generative view of the data, assuming the underlying latent variable has a Gaussian distribution and the relationship between latent and observed variables is linear. Probabilistic PCA assumes
$$W_i \sim N(0, I), \qquad X \mid W \sim N(F W + \mu, \sigma^2 I).$$
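A small generative sketch (illustrative only; the dimensions and parameter values below are made up) that samples data from the probabilistic PCA model stated above and checks its implied covariance $F F^\top + \sigma^2 I$:

```python
import numpy as np

rng = np.random.default_rng(7)
p, r, n = 5, 2, 1000              # observed dim, latent dim, sample size
F = rng.normal(size=(p, r))       # loading matrix (parameters)
mu = rng.normal(size=p)           # mean of X
sigma = 0.1                       # noise standard deviation

W = rng.normal(size=(n, r))                          # latent factors: W_i ~ N(0, I)
X = W @ F.T + mu + sigma * rng.normal(size=(n, p))   # X | W ~ N(F W + mu, sigma^2 I)

# Sanity check: the model implies Cov(X) = F F^T + sigma^2 I.
diff = np.cov(X, rowvar=False) - (F @ F.T + sigma ** 2 * np.eye(p))
print("max |empirical - model covariance|:", np.max(np.abs(diff)))
```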