Power Method for computing SVD

E0 229: Foundations of Data Science
Lecture 9, 6-Feb-2014
Lecturer: Ravi Kannan
Scribe: Vidya Narayanan
The power method is a conceptual starting point of algorithms for computing singular vectors. The intent is to look at an algorithm that approximately computes the SVD and establish clean bounds (without many parameters) on the approximation. Let $A$ be an $n \times d$ matrix of rank $r$ and $A = \sum_t \sigma_t u_t v_t^T$ be its SVD. We want to find the first singular vector $v_1$ of $A$. (Since the greedy approach works, we can repeat the process on the subspace orthogonal to $v_1$ to find the next, and so on.) Let $B = A^T A$; $B$ is a square ($d \times d$) matrix.
\begin{align*}
B = A^T A &= \Big[\sum_t \sigma_t u_t v_t^T\Big]^T \Big[\sum_s \sigma_s u_s v_s^T\Big] \\
&= \sum_t \sum_s \sigma_t \sigma_s v_t (u_t^T u_s) v_s^T \\
&= \sum_t \sigma_t^2 v_t v_t^T \qquad (\because u_t^T u_s = 0 \text{ for } t \neq s)
\end{align*}
\[
B^2 = \sum_t \sigma_t^4 v_t v_t^T, \qquad B^k = \sum_t \sigma_t^{2k} v_t v_t^T
\]
Assume $\sigma_1 > \sigma_2$. Then
\[
\frac{B^k}{\sigma_1^{2k}} = v_1 v_1^T + \frac{\sigma_2^{2k}}{\sigma_1^{2k}} v_2 v_2^T + \frac{\sigma_3^{2k}}{\sigma_1^{2k}} v_3 v_3^T + \cdots
\]
\[
\lim_{k \to \infty} \frac{B^k}{\sigma_1^{2k}} = v_1 v_1^T
\]
From the matrix $v_1 v_1^T$ it is easy to find $v_1$, since each column is a scalar multiple of $v_1$. The value of $k$ required depends on the gap $\sigma_1 - \sigma_2$.
If $\sigma_2 = (1 - \varepsilon)\sigma_1$ with $\varepsilon$ small,
\[
\left(\frac{\sigma_2}{\sigma_1}\right)^{2k} = (1 - \varepsilon)^{2k} \approx e^{-2k\varepsilon} \qquad (\because 1 + x < e^x)
\]
As a rule of thumb, we need $k \gg 1/\varepsilon$.
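The limit $B^k/\sigma_1^{2k} \to v_1 v_1^T$ can be observed numerically. The following sketch (my own illustration, not from the lecture) builds a small matrix with a prescribed singular-value gap and checks how close $B^k/\sigma_1^{2k}$ is to $v_1 v_1^T$:

```python
import numpy as np

# Construct an n x d matrix with prescribed singular values (my own test data):
# sigma_1 = 4 with a clear gap down to sigma_2, so convergence is fast.
rng = np.random.default_rng(0)
G = rng.standard_normal((50, 8))
U, _, Vt = np.linalg.svd(G, full_matrices=False)
s = np.linspace(4.0, 1.0, 8)
A = (U * s) @ Vt                      # A = U diag(s) V^T
B = A.T @ A                           # B = A^T A, a d x d PSD matrix

v1 = Vt[0]                            # first right singular vector of A
k = 80
Bk = np.linalg.matrix_power(B, k) / s[0] ** (2 * k)   # B^k / sigma_1^{2k}

# Every column of the limit v_1 v_1^T is a scalar multiple of v_1.
err = np.linalg.norm(Bk - np.outer(v1, v1))
print(err)                            # small for large k
```

With the gap above, $(\sigma_2/\sigma_1)^{2k}$ is already below $10^{-7}$ at $k = 80$, so the residual is tiny; with a smaller gap the same code needs a much larger $k$, exactly as the $k \gg 1/\varepsilon$ rule of thumb predicts.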
An issue with the above approach is the use of matrix multiplication: each step involves a $d \times d$ matrix multiplication requiring $O(d^3)$ computation, which is costly. Moreover, $A$ is often a large, sparse matrix, but $B = A^T A$ need not be sparse, and it may not be easy to even store all the entries of $B$. To avoid matrix multiplication, consider the following.
Let $x$ be a random unit $d$-dimensional vector. First, $B^k x$ too gives us $v_1$ in the limit:
\[
B^k x = \Big(\sum_t \sigma_t^{2k} v_t v_t^T\Big) x = \sum_t \sigma_t^{2k} v_t (v_t^T x)
\]
\[
\frac{B^k x}{\sigma_1^{2k}} = v_1 (v_1^T x) + \frac{\sigma_2^{2k}}{\sigma_1^{2k}} v_2 (v_2^T x) + \frac{\sigma_3^{2k}}{\sigma_1^{2k}} v_3 (v_3^T x) + \cdots
\]
\begin{align*}
\lim_{k \to \infty} \frac{B^k x}{\sigma_1^{2k}} &= v_1 (v_1^T x) \\
&= c \cdot v_1 \qquad (c = v_1^T x, \text{ a scalar})
\end{align*}
Second, using the recursion
\[
B^k x = B(B^{k-1} x) = A^T \big(A (B^{k-1} x)\big),
\]
we can compute $B^k x$ by matrix-vector multiplication alone, which is computationally far more efficient than matrix multiplication: $B^k x$ can be computed with $2k$ matrix-vector multiplications. However, if $x$ is orthogonal or nearly orthogonal to $v_1$, then $\lim_{k \to \infty} \frac{B^k x}{\sigma_1^{2k}} = v_1 (v_1^T x) = 0$.
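The matrix-vector version of the iteration can be sketched as follows. The function name and the test matrix are my own choices, not the lecture's; renormalizing each iterate keeps the numbers bounded without changing the direction of $B^k x$:

```python
import numpy as np

def power_method(A, k, rng):
    """Approximate the top right singular vector v_1 of A with 2k matvecs."""
    x = rng.standard_normal(A.shape[1])
    x /= np.linalg.norm(x)               # random unit start; |v_1^T x| = Omega(1/sqrt(d)) w.h.p.
    for _ in range(k):
        x = A.T @ (A @ x)                # B x computed as A^T (A x), never forming B
        x /= np.linalg.norm(x)           # renormalize so the iterate does not overflow
    return x

# Test matrix with a prescribed singular-value gap (my own construction).
rng = np.random.default_rng(1)
G = rng.standard_normal((100, 20))
U, _, Vt = np.linalg.svd(G, full_matrices=False)
A = (U * np.linspace(10, 1, 20)) @ Vt    # sigma_1 = 10, sigma_2 ~ 9.5

v = power_method(A, k=200, rng=rng)
err = min(np.linalg.norm(v - Vt[0]), np.linalg.norm(v + Vt[0]))
print(err)                               # small: v agrees with v_1 up to sign
```

Note that each iteration touches $A$ only through products $Ax$ and $A^T y$, so a sparse $A$ stays sparse throughout; this is the whole point of avoiding the explicit $B = A^T A$.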
If $\sigma_1 = \sigma_2$ or $\sigma_1 \approx \sigma_2$, the above procedure finds a $v$ in the space spanned by $v_1$ and $v_2$ with $|Av| = \sigma_1$. The following theorem shows that the vector found by the power method has a high component in the space spanned by the singular vectors corresponding to (nearly) highest singular values, and a low component in the orthogonal space.

Let $w = \frac{(A^T A)^k x}{|(A^T A)^k x|}$ be the result of running the power method for $k$ steps. Let $\sigma_1, \ldots, \sigma_l \geq (1 - \varepsilon)\sigma_1$ and $\sigma_{l+1}, \ldots, \sigma_r < (1 - \varepsilon)\sigma_1$. Assume $x$ satisfies $|v_1^T x| \geq \delta > 0$.

Theorem 1. If $k \geq \frac{\ln(1/\varepsilon\delta)}{2\varepsilon}$, the component of $w$ along the subspace $V$ spanned by the first $l$ singular vectors is at least $1 - \varepsilon$.
Proof.
\begin{align*}
\text{Component of } (A^T A)^k x \text{ along } V &\geq \text{component of } (A^T A)^k x \text{ along } v_1 \\
&= |v_1^T (A^T A)^k x| \\
&= \Big|v_1^T \Big(\sum_t \sigma_t^{2k} v_t v_t^T\Big) x\Big| \\
&= |\sigma_1^{2k} v_1^T x| \\
&\geq \delta \sigma_1^{2k}
\end{align*}
\begin{align*}
(\text{Component})^2 \text{ of } (A^T A)^k x \text{ along the space orthogonal to } V
&= [v_{l+1}^T (A^T A)^k x]^2 + [v_{l+2}^T (A^T A)^k x]^2 + \cdots \\
&= \Big[v_{l+1}^T \Big(\sum_t \sigma_t^{2k} v_t v_t^T\Big) x\Big]^2 + \cdots \\
&= (\sigma_{l+1}^{2k} v_{l+1}^T x)^2 + (\sigma_{l+2}^{2k} v_{l+2}^T x)^2 + \cdots \\
&= \sigma_{l+1}^{4k} (v_{l+1}^T x)^2 + \sigma_{l+2}^{4k} (v_{l+2}^T x)^2 + \cdots \\
&\leq (1 - \varepsilon)^{4k} \sigma_1^{4k} \big((v_{l+1}^T x)^2 + (v_{l+2}^T x)^2 + \cdots\big) \qquad (\because v_{l+1}, \ldots, v_r \text{ can be extended to a basis}) \\
&\leq (1 - \varepsilon)^{4k} \sigma_1^{4k} |x|^2
\end{align*}
\begin{align*}
\text{Component of } w \text{ along the space orthogonal to } V
&\leq \frac{(1 - \varepsilon)^{2k} \sigma_1^{2k}}{|(A^T A)^k x|} \\
&\leq \frac{(1 - \varepsilon)^{2k} \sigma_1^{2k}}{\delta \sigma_1^{2k}} \\
&= \frac{(1 - \varepsilon)^{2k}}{\delta} \\
&\leq e^{-2k\varepsilon - \ln \delta} \qquad (\because 1 + x < e^x) \\
&= \varepsilon \qquad (\text{for a suitable } k)
\end{align*}
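The bound in the theorem can be checked empirically. The experiment below is my own (the matrix, $\varepsilon$, and the starting vector are arbitrary choices): it runs $k = \lceil \ln(1/\varepsilon\delta)/(2\varepsilon) \rceil$ normalized power iterations and measures the component of $w$ orthogonal to the span $V$ of the singular vectors with $\sigma_t \geq (1-\varepsilon)\sigma_1$:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((80, 15))
U, s, Vt = np.linalg.svd(A)

eps = 0.1
l = int(np.sum(s >= (1 - eps) * s[0]))   # number of singular values >= (1-eps)*sigma_1

x = rng.standard_normal(15)
x /= np.linalg.norm(x)                   # random unit start
delta = abs(Vt[0] @ x)                   # |v_1^T x| >= delta

k = int(np.ceil(np.log(1 / (eps * delta)) / (2 * eps)))
w = x.copy()
for _ in range(k):
    w = A.T @ (A @ w)                    # one application of B = A^T A
    w /= np.linalg.norm(w)               # normalization does not change the direction

# Component of w orthogonal to V = span(v_1, ..., v_l).
perp = np.linalg.norm(w - Vt[:l].T @ (Vt[:l] @ w))
print(perp <= eps)                       # True, matching the theorem's guarantee
```

The proof's chain of inequalities guarantees the perpendicular component is at most $(1-\varepsilon)^{2k}/\delta \leq \varepsilon$ for this choice of $k$, so the printed check holds for any start with $\delta > 0$.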
For a random unit vector $x$, with high probability $|v_1^T x| = \Omega(1/\sqrt{d})$ (Lemma 4.10 in the text), and so even with $k \geq c \cdot \frac{\ln d}{\varepsilon}$ ($c$ a positive constant) the power method converges. We can now carry out a greedy procedure for finding the next singular vector in the space perpendicular to $w$. The greedy algorithm still works for approximate SVD (the proof requires more work). By approximate we mean that the left singular vectors are only approximately orthogonal: when $A = \sum_t \sigma_t' u_t w_t^T$, where the $w_t$ were computed using approximate SVD, the $w_t$ are orthogonal by construction, but the $u_t$ are only nearly orthogonal. Note that the equality is exact.
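The greedy procedure can be sketched as follows. This is an assumed implementation, not code from the notes: each round runs the matrix-vector power method while projecting the iterate onto the complement of the directions already found, which is how "repeat in the space perpendicular to $w$" is realized in practice:

```python
import numpy as np

def greedy_singular_vectors(A, num_vectors, iters=100, seed=0):
    """Approximate the top right singular vectors of A by power method + deflation."""
    rng = np.random.default_rng(seed)
    d = A.shape[1]
    found = []                           # approximate singular vectors found so far
    for _ in range(num_vectors):
        x = rng.standard_normal(d)
        for _ in range(iters):
            x = A.T @ (A @ x)            # apply B = A^T A via two matvecs
            for w in found:
                x -= (w @ x) * w         # project out previously found directions
            x /= np.linalg.norm(x)
        found.append(x)
    return np.array(found)

# Test matrix with prescribed, well-separated singular values (my own choice).
rng = np.random.default_rng(2)
G = rng.standard_normal((60, 10))
U, _, Vt = np.linalg.svd(G, full_matrices=False)
s = np.array([10.0, 7.0, 5.0, 3.0, 2.0, 1.5, 1.0, 0.8, 0.5, 0.2])
A = (U * s) @ Vt

W = greedy_singular_vectors(A, 3)
errs = [min(np.linalg.norm(W[i] - Vt[i]), np.linalg.norm(W[i] + Vt[i])) for i in range(3)]
print(errs)                              # each entry small: rows of W match v_1, v_2, v_3 up to sign
```

Because each round only ever approximates the true singular direction, the recovered left vectors $u_t = Aw_t/|Aw_t|$ are only nearly orthogonal, exactly the caveat about approximate SVD made above.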