Numerical Optimization
Lecture 9: Global convergence of line search and convergence rates of
steepest descent/Newton’s methods
Sangkyun Lee
The content is from Nocedal and Wright (2006). Topics marked with ** are
optional.
1 Some Background

1.1 Cauchy-Schwarz Inequality
For two vectors $x, y \in \mathbb{R}^n$, we have
\[ |x^T y| \le \|x\|_2 \|y\|_2 \qquad \text{(Cauchy-Schwarz)} \]
** In general, for $p, q \in [1, \infty]$ with $1/p + 1/q = 1$,
\[ |x^T y| \le \|x\|_p \|y\|_q \qquad \text{(Hölder's inequality)} \]
Note that Hölder's inequality holds for all measurable functions $x$ and $y$.
1.2 Matrix Norms (Operator Norms)

Given the vector norms $\|x\|_1 = \sum_{i=1}^n |x_i|$, $\|x\|_2 = (\sum_{i=1}^n x_i^2)^{1/2}$, and $\|x\|_\infty = \max_{i=1,\dots,n} |x_i|$ for $x \in \mathbb{R}^n$, the matrix norm induced by a vector norm (called the operator norm) of a matrix $A \in \mathbb{R}^{m \times n}$ is defined as
\[ \|A\| := \sup_{x \in \mathbb{R}^n,\, x \neq 0} \frac{\|Ax\|}{\|x\|} = \sup_{x \in \mathbb{R}^n,\, \|x\| = 1} \|Ax\|. \]
The following can be derived from the above definition:
• $\|A\|_1 = \max_{j=1,\dots,n} \sum_{i=1}^m |A_{ij}|$, the maximal absolute column sum.
• $\|A\|_\infty = \max_{i=1,\dots,m} \sum_{j=1}^n |A_{ij}|$, the maximal absolute row sum.
• $\|A\|_2 = \sqrt{\lambda_{\max}(A^T A)} = \sigma_{\max}(A)$, where $\lambda_{\max}(A^T A)$ is the largest eigenvalue of $A^T A$ and $\sigma_{\max}(A)$ is the largest singular value of $A$. When $A$ is a square symmetric matrix, then $\lambda(A^T A) = \{\lambda(A)\}^2$, so that $\|A\|_2 = \max_i |\lambda_i(A)|$.¹
From the definition, it is clear that
\[ \|Ax\| \le \|A\| \, \|x\|. \]
This looks like the Cauchy-Schwarz inequality, but it is different since $A$ is a matrix, not a vector.
¹ For a square matrix $A$, $\rho(A) = \max_{i=1,\dots,n} |\lambda_i(A)|$ is called the spectral radius of $A$; when $A$ is symmetric, $\rho(A) = \|A\|_2$.
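As a quick numerical sanity check of these formulas (my own illustration, not part of the lecture), the following NumPy sketch compares the induced norms returned by numpy.linalg.norm with the column-sum, row-sum, and singular-value expressions above; the matrix A is an arbitrary example.

\begin{verbatim}
import numpy as np

# An arbitrary test matrix (illustrative values).
A = np.array([[1.0, -2.0, 3.0],
              [0.5,  4.0, -1.0]])

# Induced (operator) norms computed by NumPy.
norm_1   = np.linalg.norm(A, 1)        # max absolute column sum
norm_inf = np.linalg.norm(A, np.inf)   # max absolute row sum
norm_2   = np.linalg.norm(A, 2)        # largest singular value

# The same quantities from the formulas in the text.
print(norm_1,   np.max(np.sum(np.abs(A), axis=0)))
print(norm_inf, np.max(np.sum(np.abs(A), axis=1)))
print(norm_2,   np.sqrt(np.max(np.linalg.eigvalsh(A.T @ A))))
\end{verbatim}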
1.3 Convex Function

Lemma 9.1. Let $C \subseteq \mathbb{R}^n$ be a convex set and let $f : \mathbb{R}^n \to \mathbb{R}$ be continuously differentiable over $\mathbb{R}^n$.
(a) $f$ is convex over $C$ if and only if
\[ f(y) \ge f(x) + \nabla f(x)^T (y - x), \quad \text{for all } x, y \in C. \]
(b) $f$ is strictly convex over $C$ if and only if the above inequality is strict whenever $x \neq y$.
Proof. ** ($\Leftarrow$): Suppose that the inequality in (a) holds. Choose any $x, y \in C$ and $\alpha \in [0, 1]$ and let $z = \alpha x + (1 - \alpha) y$. Using the inequality in (a) twice, we have
\[ f(x) \ge f(z) + \nabla f(z)^T (x - z), \qquad f(y) \ge f(z) + \nabla f(z)^T (y - z). \]
Multiplying the first inequality by $\alpha$, the second by $(1 - \alpha)$, and adding them together, we obtain
\[ \alpha f(x) + (1 - \alpha) f(y) \ge f(z) + \nabla f(z)^T [\alpha x + (1 - \alpha) y - z] = f(z), \]
which proves that $f$ is convex.

($\Rightarrow$): Conversely, suppose that $f$ is convex. Let $x, y \in C$ be any vectors, and consider the function
\[ g(\alpha) = \frac{f(x + \alpha(y - x)) - f(x)}{\alpha}, \qquad \alpha \in (0, 1]. \]
We will show that $g(\alpha)$ is monotonically increasing in $\alpha$. Consider any $\alpha_1, \alpha_2$ with $0 < \alpha_1 < \alpha_2 \le 1$, and let
\[ \bar\alpha = \frac{\alpha_1}{\alpha_2} \in (0, 1), \qquad \bar y = x + \alpha_2 (y - x). \]
Since $f$ is convex, we have
\[ f(x + \bar\alpha(\bar y - x)) \le \bar\alpha f(\bar y) + (1 - \bar\alpha) f(x). \]
That is,
\[ \frac{f(x + \bar\alpha(\bar y - x)) - f(x)}{\bar\alpha} \le f(\bar y) - f(x). \]
Plugging in the definitions of $\bar y$ and $\bar\alpha$, we get
\[ \frac{f(x + \alpha_1(y - x)) - f(x)}{\alpha_1} \le \frac{f(x + \alpha_2(y - x)) - f(x)}{\alpha_2}, \]
that is, $g(\alpha_1) \le g(\alpha_2)$. This shows that $g(\alpha)$ is monotonically increasing in $\alpha \in (0, 1]$. Then, since $f$ is continuously differentiable,
\[ (y - x)^T \nabla f(x) = \lim_{\alpha \downarrow 0} g(\alpha) \le g(1) = f(y) - f(x). \]
This concludes the proof for case (a). The proof for case (b) is omitted.
The quantity $E(x, y) = f(y) - f(x) - \nabla f(x)^T (y - x)$ is called the excess function. Therefore, for a given continuously differentiable function $f$, having $E(x, y) \ge 0$ for all $x, y \in C$ implies that $f$ is convex on $C$.
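As an illustration (added here, not from the notes), the following sketch evaluates the excess function at random point pairs for the convex function $f(x) = \|x\|_2^2$; by Lemma 9.1, every value should be nonnegative. The helper names f, grad_f, and excess are my own.

\begin{verbatim}
import numpy as np

def f(x):
    return np.dot(x, x)        # f(x) = ||x||_2^2, a convex function

def grad_f(x):
    return 2.0 * x             # its gradient

def excess(x, y):
    # E(x, y) = f(y) - f(x) - grad f(x)^T (y - x)
    return f(y) - f(x) - grad_f(x) @ (y - x)

rng = np.random.default_rng(0)
vals = [excess(rng.standard_normal(3), rng.standard_normal(3))
        for _ in range(1000)]
print(min(vals) >= 0)          # expected: True, consistent with convexity
\end{verbatim}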
Theorem 9.2. Let $f : D \subset \mathbb{R}^n \to \mathbb{R}$ be a twice continuously differentiable function on an open convex set $D$. Then
• $f$ is convex if and only if $\nabla^2 f(x)$ is positive semi-definite for all $x \in D$.
• if $\nabla^2 f(x)$ is positive definite for all $x \in D$, then $f$ is strictly convex. (The converse does not hold in general; e.g. $f(x) = x^4$ is strictly convex but $f''(0) = 0$.)
Proof. ** From Taylor's theorem, we have
\[ f(y) = f(x) + \nabla f(x)^T (y - x) + \frac{1}{2} (y - x)^T \nabla^2 f(x + \alpha(y - x)) (y - x) \]
for some $\alpha \in [0, 1]$. When the Hessian is positive semi-definite everywhere on $D$, then
\[ E(x, y) = f(y) - f(x) - \nabla f(x)^T (y - x) \ge 0, \]
and therefore $f$ is convex due to Lemma 9.1.
On the other hand, suppose that the Hessian is not p.s.d. at some point $x \in D$, say $d^T \nabla^2 f(x) d < 0$ for some direction $d$. Since $\nabla^2 f$ is continuous, we can choose $y = x + t d \in D$ with $t > 0$ small enough that $(y - x)^T \nabla^2 f(x + \alpha(y - x)) (y - x) < 0$ for all $\alpha \in [0, 1]$. Then from Taylor's theorem (as above), we have
\[ E(x, y) = f(y) - f(x) - \nabla f(x)^T (y - x) < 0, \]
and therefore $f$ is not convex. The contrapositive gives that when $f$ is convex, the Hessian is p.s.d. at all points of $D$.
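To connect this criterion to computation, here is a minimal sketch (my own example) that checks positive semi-definiteness of the Hessian of the convex log-sum-exp function through its eigenvalues; lse_hessian uses the standard closed form diag(p) - p p^T with p the softmax of x.

\begin{verbatim}
import numpy as np

def lse_hessian(x):
    # Hessian of f(x) = log(sum(exp(x))): diag(p) - p p^T, p = softmax(x)
    p = np.exp(x - x.max())
    p /= p.sum()
    return np.diag(p) - np.outer(p, p)

x0 = np.array([0.3, -1.0, 2.0])
eigs = np.linalg.eigvalsh(lse_hessian(x0))
print(eigs)                    # all eigenvalues >= 0 (up to rounding)
print(eigs.min() >= -1e-12)    # True: p.s.d., so log-sum-exp is convex
                               # (but not p.d.: it is not strictly convex)
\end{verbatim}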
2 Global Convergence of Wolfe Line Search
Consider the following unconstrained minimization of a continuously differentiable objective function $f : \mathbb{R}^n \to \mathbb{R}$,
\[ \min_{x \in \mathbb{R}^n} f(x). \]
Given the current iterate $x_k \in \mathbb{R}^n$, $k \ge 1$, and a descent direction $p_k \in \mathbb{R}^n$, we generate the next iterate $x_{k+1}$ by
\[ x_{k+1} = x_k + \alpha_k p_k, \]
where the step length $\alpha_k > 0$ can be determined by a line search strategy satisfying the Wolfe conditions (for $0 < c_1 < c_2 < 1$):
\[ f(x_k + \alpha_k p_k) \le f(x_k) + c_1 \alpha_k \nabla f(x_k)^T p_k \qquad \text{(Sufficient Decrease)} \]
\[ \nabla f(x_k + \alpha_k p_k)^T p_k \ge c_2 \nabla f(x_k)^T p_k \qquad \text{(Curvature)} \]
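As a concrete illustration, a minimal bracketing/bisection search for a step length satisfying both conditions might look as follows. This is only a sketch, not the line search algorithm of Nocedal and Wright (2006); the function name wolfe_step and the defaults c1 = 1e-4 and c2 = 0.9 are my own choices.

\begin{verbatim}
import numpy as np

def wolfe_step(f, grad, x, p, c1=1e-4, c2=0.9, max_iter=100):
    """Search for a step length satisfying both Wolfe conditions
    along the descent direction p (simple bracketing/bisection sketch)."""
    fx, gTp = f(x), grad(x) @ p          # f(x_k) and directional derivative (< 0)
    lo, hi, alpha = 0.0, np.inf, 1.0
    for _ in range(max_iter):
        if f(x + alpha * p) > fx + c1 * alpha * gTp:   # sufficient decrease fails
            hi = alpha
            alpha = 0.5 * (lo + hi)
        elif grad(x + alpha * p) @ p < c2 * gTp:       # curvature fails
            lo = alpha
            alpha = 2.0 * alpha if np.isinf(hi) else 0.5 * (lo + hi)
        else:
            return alpha                               # both conditions hold
    return alpha                                       # fallback after max_iter
\end{verbatim}

A production implementation would replace the plain bisection with interpolation, but this is enough to experiment with the two conditions.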
2.1 $\theta_k$
For the analysis here, the angle $\theta_k$ between $-\nabla f(x_k)$ and the search direction $p_k$ plays an important role; it is defined by
\[ \cos\theta_k = \frac{-\nabla f(x_k)^T p_k}{\|\nabla f(x_k)\|_2 \, \|p_k\|_2}. \]
Theorem 9.3 (Global Convergence of Line Search). Consider any iteration of the form $x_{k+1} = x_k + \alpha_k p_k$, where $p_k$ is a descent direction and $\alpha_k$ satisfies the Wolfe conditions. Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is bounded below in $\mathbb{R}^n$ and $f$ is continuously differentiable in an open set $\mathcal{N}$ containing the level set $\mathcal{L} := \{x \in \mathbb{R}^n : f(x) \le f(x_0)\}$, where $x_0$ is the starting point of the iteration. Suppose also that the gradient $\nabla f$ is Lipschitz continuous on $\mathcal{N}$, i.e. there exists a constant $L > 0$ such that
\[ \|\nabla f(x) - \nabla f(y)\|_2 \le L \|x - y\|_2, \quad \text{for all } x, y \in \mathcal{N}. \]
Then
\[ \sum_{k \ge 0} \cos^2\theta_k \, \|\nabla f(x_k)\|_2^2 < \infty \qquad \text{(Zoutendijk condition).} \]
Proof. From the curvature condition and $x_{k+1} = x_k + \alpha_k p_k$, subtracting $\nabla f(x_k)^T p_k$ from both sides, we have
\[ (\nabla f(x_{k+1}) - \nabla f(x_k))^T p_k \ge (c_2 - 1) \nabla f(x_k)^T p_k. \]
On the other hand, the Cauchy-Schwarz inequality and the Lipschitz condition give
\[ (\nabla f(x_{k+1}) - \nabla f(x_k))^T p_k \le \|\nabla f(x_{k+1}) - \nabla f(x_k)\|_2 \|p_k\|_2 \le \alpha_k L \|p_k\|_2^2. \]
From the above two expressions, we obtain
\[ \alpha_k \ge \frac{(c_2 - 1) \nabla f(x_k)^T p_k}{L \|p_k\|_2^2}. \]
Plugging this into the sufficient decrease condition (note that $\nabla f(x_k)^T p_k < 0$), we get
\[ f(x_{k+1}) \le f(x_k) + c_1 \alpha_k \nabla f(x_k)^T p_k \le f(x_k) + c_1 \frac{(c_2 - 1) (\nabla f(x_k)^T p_k)^2}{L \|p_k\|_2^2}. \]
Using the definition of $\cos\theta_k$, this can be rewritten as
\[ f(x_{k+1}) \le f(x_k) - c \cos^2\theta_k \|\nabla f(x_k)\|_2^2 \]
for $c := c_1 (1 - c_2)/L > 0$. Summing up this expression over all indices up to $k$, we have
\[ f(x_{k+1}) \le f(x_0) - c \sum_{j=0}^{k} \cos^2\theta_j \|\nabla f(x_j)\|_2^2. \]
That is, for all $k$ we have
\[ \sum_{j=0}^{k} \cos^2\theta_j \|\nabla f(x_j)\|_2^2 \le \frac{1}{c} (f(x_0) - f(x_{k+1})). \]
Since $f$ is bounded below, $f(x_{k+1}) > -M$ for some positive scalar $M$, and therefore $f(x_0) - f(x_{k+1}) < f(x_0) + M \le M'$ for some positive scalar $M'$. Hence, by taking $k \to \infty$, we get the desired result:
\[ \sum_{k=0}^{\infty} \cos^2\theta_k \|\nabla f(x_k)\|_2^2 < \infty. \]
Note that the Zoutendijk condition implies that
\[ \lim_{k \to \infty} \cos^2\theta_k \|\nabla f(x_k)\|_2^2 = 0. \]
When $p_k$ is a descent direction, we have $\cos\theta_k > 0$ for all $k$ (strictly speaking, we need $\cos\theta_k \ge \delta$ for all $k$ and some $\delta > 0$, as we discuss below), and therefore the above condition implies that
\[ \lim_{k \to \infty} \|\nabla f(x_k)\|_2 = 0 \qquad \text{(Global Convergence).} \]
We call algorithms that satisfy the above condition globally convergent. Theorem 9.3 is the strongest global convergence result available for the Wolfe line search strategy: it only guarantees that we find a stationary point, not necessarily a minimizer (recall, however, that in convex minimization a stationary point is a global minimizer).
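To make the guarantee concrete, here is a small demonstration (my own addition, reusing the wolfe_step sketch above) that runs steepest descent with a Wolfe line search on an illustrative smooth convex function and prints the gradient norms, which should tend to zero.

\begin{verbatim}
import numpy as np

# Illustrative smooth objective, bounded below, with Lipschitz gradient.
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])                 # symmetric positive definite
f    = lambda x: 0.5 * x @ Q @ x + np.log(1.0 + x[0]**2)
grad = lambda x: Q @ x + np.array([2.0 * x[0] / (1.0 + x[0]**2), 0.0])

x = np.array([5.0, -3.0])
for k in range(30):
    p = -grad(x)                           # steepest descent: cos(theta_k) = 1
    alpha = wolfe_step(f, grad, x, p)      # step length from the sketch above
    x = x + alpha * p
    if k % 5 == 0:
        print(k, np.linalg.norm(grad(x)))  # gradient norms tend to zero
\end{verbatim}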
2.2 Condition Number
As we’ve seen above, the condition cos θk > 0 is crucial to guarantee the global
convergence of line search.
• Steepest descent direction: $p_k = -\nabla f(x_k)$, so it is easy to check that $\cos\theta_k = 1 > 0$ whenever $\nabla f(x_k) \neq 0$.
• Newton's direction: $p_k^N = -H_k^{-1} \nabla f(x_k)$, where $H_k = \nabla^2 f(x_k)$ is the Hessian. Recall that when $H_k$ is positive definite, we showed that $p_k^N$ is a descent direction. Furthermore, suppose that $H_k$ is positive definite with a uniformly bounded condition number, that is, there exists a constant $M > 0$ such that
\[ \kappa(H_k) = \|H_k\|_2 \|H_k^{-1}\|_2 \le M, \quad \text{for all } k, \]
where $\kappa(H_k)$ is called the condition number of $H_k$. Note that when $H_k$ is symmetric positive definite, $\kappa(H_k)$ is as simple as
\[ \kappa(H_k) = \|H_k\|_2 \|H_k^{-1}\|_2 = \frac{\lambda_n}{\lambda_1}, \]
the ratio between the largest eigenvalue $\lambda_n$ and the smallest eigenvalue $\lambda_1$ of $H_k$. When $\kappa(H_k) \le M$, then we can show that
\[ \cos\theta_k \ge \frac{1}{M} > 0 \quad \text{for all } k, \]
as sketched in the derivation below.
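A short derivation of this bound, filling in the step the notes leave implicit, follows from the operator-norm facts of Section 1.2. Writing $g_k := \nabla f(x_k)$,
\[
\cos\theta_k
= \frac{-\nabla f(x_k)^T p_k^N}{\|\nabla f(x_k)\|_2 \, \|p_k^N\|_2}
= \frac{g_k^T H_k^{-1} g_k}{\|g_k\|_2 \, \|H_k^{-1} g_k\|_2}
\ge \frac{\|g_k\|_2^2 / \|H_k\|_2}{\|g_k\|_2 \, \|H_k^{-1}\|_2 \, \|g_k\|_2}
= \frac{1}{\kappa(H_k)}
\ge \frac{1}{M},
\]
where we used $g_k^T H_k^{-1} g_k \ge \|g_k\|_2^2 / \lambda_{\max}(H_k) = \|g_k\|_2^2 / \|H_k\|_2$ for the positive definite matrix $H_k$, and $\|H_k^{-1} g_k\|_2 \le \|H_k^{-1}\|_2 \|g_k\|_2$.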
3 Rate of Convergence
3.1 Convergence Rate of Steepest Descent
The behavior of steepest descent can be understood in an ideal case, in which we consider a strongly convex quadratic objective function
\[ f(x) = \frac{1}{2} x^T Q x + c^T x, \]
where $Q \in \mathbb{R}^{n \times n}$ is symmetric and positive definite. In this particular case, it is easy to perform an exact line search, finding $\alpha > 0$ that minimizes $f(x_k - \alpha \nabla f(x_k))$,
\[ f(x_k - \alpha \nabla f(x_k)) = \frac{1}{2} (x_k - \alpha \nabla f(x_k))^T Q (x_k - \alpha \nabla f(x_k)) + c^T (x_k - \alpha \nabla f(x_k)). \]
From the optimality conditions, the minimizer $\alpha_k$ is
\[ \alpha_k = \frac{\nabla f(x_k)^T \nabla f(x_k)}{\nabla f(x_k)^T Q \nabla f(x_k)} > 0. \]
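The one-line calculation behind this formula, written out for completeness with $g_k := \nabla f(x_k)$ (so that $\nabla f(x) = Qx + c$ and $\nabla f(x_k - \alpha g_k) = g_k - \alpha Q g_k$):
\[
\frac{d}{d\alpha} f(x_k - \alpha g_k)
= -g_k^T \nabla f(x_k - \alpha g_k)
= -g_k^T g_k + \alpha \, g_k^T Q g_k = 0
\quad \Longrightarrow \quad
\alpha_k = \frac{g_k^T g_k}{g_k^T Q g_k}.
\]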
(Note that if $Q$ is not positive definite, $\alpha_k$ expressed as above can have undesirable properties.) With this exact minimizer, we can write an explicit expression for the steepest descent update for minimizing strongly convex quadratic functions:
\[ x_{k+1} = x_k - \frac{\nabla f(x_k)^T \nabla f(x_k)}{\nabla f(x_k)^T Q \nabla f(x_k)} \nabla f(x_k). \]
To quantify the rate of convergence, we use the weighted norm
\[ \|x\|_Q^2 = x^T Q x. \]
(Note that this defines a norm only when $Q$ is positive definite.) Using the fact that the unconstrained minimizer $x^*$ of $f(x)$ satisfies $Qx^* + c = 0$, we can show that
\[ \frac{1}{2} \|x - x^*\|_Q^2 = f(x) - f(x^*). \tag{9.1} \]
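For completeness, a short verification of (9.1), substituting $c = -Qx^*$:
\[
f(x) - f(x^*)
= \tfrac{1}{2} x^T Q x + c^T x - \tfrac{1}{2} x^{*T} Q x^* - c^T x^*
= \tfrac{1}{2} x^T Q x - x^{*T} Q x + \tfrac{1}{2} x^{*T} Q x^*
= \tfrac{1}{2} (x - x^*)^T Q (x - x^*)
= \tfrac{1}{2} \|x - x^*\|_Q^2.
\]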
Using the update rule above and the fact that $g_k := \nabla f(x_k) = Q(x_k - x^*)$, we can show that
\[ \|x_{k+1} - x^*\|_Q^2 = \left\{ 1 - \frac{(g_k^T g_k)^2}{(g_k^T Q g_k)(g_k^T Q^{-1} g_k)} \right\} \|x_k - x^*\|_Q^2. \]
This can be further simplified to give the following theorem:

Theorem 9.4 (Convergence Rate of Steepest Descent: Quadratic Case). Using the steepest descent directions with exact line searches for minimizing the strongly convex quadratic function $f(x) = \frac{1}{2} x^T Q x + c^T x$, we have
\[ \|x_{k+1} - x^*\|_Q^2 \le \left( \frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1} \right)^2 \|x_k - x^*\|_Q^2, \]
where $0 < \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$ are the eigenvalues of $Q$.
This theorem can be shown using the Kantorovich inequality:

Lemma 9.5 (Kantorovich Inequality). For a positive definite symmetric $n \times n$ matrix $Q$ and for any nonzero vector $x \in \mathbb{R}^n$,
\[ \frac{(x^T x)^2}{(x^T Q x)(x^T Q^{-1} x)} \ge \frac{4 \lambda_1 \lambda_n}{(\lambda_1 + \lambda_n)^2}, \]
where $\lambda_1$ and $\lambda_n$ are the smallest and the largest eigenvalues of $Q$.
A few observations can be made about Theorem 9.4:
• Using (9.1), it is implied that
\[ f(x_{k+1}) - f(x^*) \le \left( \frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1} \right)^2 (f(x_k) - f(x^*)). \]
• That is, the rate of convergence in terms of objective function values is linear (Lecture 2), since
\[ \frac{f(x_{k+1}) - f(x^*)}{f(x_k) - f(x^*)} \le \left( \frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1} \right)^2 < 1 \quad \text{for all } k. \]
• The rate of convergence depends on the condition number $\kappa(Q) = \lambda_n / \lambda_1$, since
\[ \left( \frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1} \right)^2 = \left( \frac{\kappa(Q) - 1}{\kappa(Q) + 1} \right)^2. \]
• If $\kappa(Q)$ is large (the contours of the quadratic $f$ are strongly elongated ellipses), then convergence becomes slow. In fact, zigzagging can happen.
• In particular, when all eigenvalues are the same, so that $\lambda_1 = \lambda_n$ (so that the contours of $f$ are perfectly spherical), steepest descent converges in a single iteration in terms of objective function values.
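As a numerical illustration (my own addition, not from the notes), the following sketch runs the exact-line-search update on a small quadratic and compares the observed per-step contraction of $f(x_k) - f(x^*)$ with the factor $((\kappa - 1)/(\kappa + 1))^2$; the matrix Q, the vector c, and the starting point are arbitrary choices.

\begin{verbatim}
import numpy as np

# Strongly convex quadratic f(x) = 0.5 x^T Q x + c^T x, condition number 10.
Q = np.diag([1.0, 10.0])
c = np.array([1.0, -2.0])
x_star = np.linalg.solve(Q, -c)              # minimizer: Q x* + c = 0
f = lambda x: 0.5 * x @ Q @ x + c @ x

kappa = np.linalg.cond(Q)
bound = ((kappa - 1.0) / (kappa + 1.0))**2   # theoretical contraction factor

x = np.array([3.0, 3.0])
for k in range(10):
    g = Q @ x + c                            # gradient
    alpha = (g @ g) / (g @ Q @ g)            # exact line search
    x_new = x - alpha * g
    ratio = (f(x_new) - f(x_star)) / (f(x) - f(x_star))
    print(k, round(ratio, 4), "<=", round(bound, 4))
    x = x_new
\end{verbatim}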
The convergence rate of steepest descent with exact line searches for general nonlinear objective functions is given by the following theorem:

Theorem 9.6 (Convergence of Steepest Descent). Let $f : \mathbb{R}^n \to \mathbb{R}$ be twice continuously differentiable, and suppose that the iterates generated by steepest descent with exact line searches converge to a point $x^*$ at which $\nabla^2 f(x^*)$ is positive definite. Let $r$ be a scalar such that
\[ r \in \left( \frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1}, \, 1 \right), \]
where $\lambda_1 \le \cdots \le \lambda_n$ are the eigenvalues of $\nabla^2 f(x^*)$. Then for all sufficiently large $k$, we have
\[ f(x_{k+1}) - f(x^*) \le r^2 (f(x_k) - f(x^*)). \]
When inexact line searches are used (e.g. the Wolfe line search), the rate of convergence does not improve in general.
3.2 Convergence Rate of Newton's Method

For a twice differentiable function $f : \mathbb{R}^n \to \mathbb{R}$, Newton directions are given by
\[ p_k^N = -(\nabla^2 f(x_k))^{-1} \nabla f(x_k). \]
Suppose that the inverse exists. Recall that if $\nabla^2 f(x_k)$ is not positive definite, $p_k^N$ may not be a descent direction. There are two remedies when the Hessian is not p.d.:
• Modify the Hessian so that it becomes p.d. (one common scheme is sketched below).
• Use trust region methods.
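One common way to realize the first remedy is to add a multiple of the identity to the Hessian, increasing the shift until a Cholesky factorization succeeds. The sketch below is my own simplified version of that idea; the function name and the parameters beta and grow are illustrative, and it is not the exact algorithm from Nocedal and Wright (2006).

\begin{verbatim}
import numpy as np

def modified_newton_direction(H, g, beta=1e-3, grow=10.0):
    """Return p solving (H + tau*I) p = -g, where tau >= 0 is increased
    until H + tau*I is positive definite (checked via Cholesky)."""
    n = H.shape[0]
    tau = 0.0 if np.min(np.diag(H)) > 0 else beta - np.min(np.diag(H))
    while True:
        try:
            L = np.linalg.cholesky(H + tau * np.eye(n))  # fails if not p.d.
            break
        except np.linalg.LinAlgError:
            tau = max(grow * tau, beta)
    # Solve (H + tau*I) p = -g using the Cholesky factor L (L L^T = H + tau*I).
    p = np.linalg.solve(L.T, np.linalg.solve(L, -g))
    return p
\end{verbatim}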
Here we discuss the local convergence of Newton's method: we know that if $\nabla^2 f(x^*)$ is p.d. and the Hessian $\nabla^2 f(x)$ is continuous, then $\nabla^2 f(x)$ is p.d. as well in a neighborhood of $x^*$. In such a region, Newton directions are well defined and are descent directions.
Theorem 9.7 (Local Convergence of Newton's Method). Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is twice differentiable and that $\nabla^2 f(x)$ is Lipschitz continuous in a neighborhood of a solution $x^*$ with a constant $L > 0$,² that is,
\[ \|\nabla^2 f(x) - \nabla^2 f(y)\|_2 \le L \|x - y\|_2 \quad \forall x, y \in \mathcal{N}(x^*), \]
where at $x^*$ the second-order sufficient conditions are satisfied. Consider iterations $x_{k+1} = x_k + p_k^N$ (so that step lengths are $1$). Suppose also that the starting point $x_0$ is sufficiently close to $x^*$. Then,
• the sequence $\{x_k\}$ converges quadratically to $x^*$,
• the sequence of gradient norms $\{\|\nabla f(x_k)\|_2\}$ converges quadratically to zero.
Proof. **
\[ x_{k+1} - x^* = x_k + p_k^N - x^* = x_k - x^* - \nabla^2 f(x_k)^{-1} \nabla f(x_k)
   = \nabla^2 f(x_k)^{-1} \big[ \nabla^2 f(x_k)(x_k - x^*) - (\nabla f(x_k) - \nabla f(x^*)) \big]. \]
Note that we have used $\nabla f(x^*) = 0$ from the SOSC. Using the relation
\[ \nabla f(x_k) - \nabla f(x^*) = \int_0^1 \frac{d}{dt} \Big[ \nabla f(x^* + t(x_k - x^*)) \Big] dt
   = \int_0^1 \nabla^2 f(x^* + t(x_k - x^*)) (x_k - x^*) \, dt, \]
we get
\[ \begin{aligned}
\|x_{k+1} - x^*\|_2
 &= \big\| \nabla^2 f(x_k)^{-1} \big[ \nabla^2 f(x_k)(x_k - x^*) - (\nabla f(x_k) - \nabla f(x^*)) \big] \big\|_2 \\
 &\le \|\nabla^2 f(x_k)^{-1}\|_2 \, \big\| \nabla^2 f(x_k)(x_k - x^*) - (\nabla f(x_k) - \nabla f(x^*)) \big\|_2 \\
 &= \|\nabla^2 f(x_k)^{-1}\|_2 \, \Big\| \int_0^1 \big[ \nabla^2 f(x_k) - \nabla^2 f(x^* + t(x_k - x^*)) \big](x_k - x^*) \, dt \Big\|_2 \\
 &\le \|\nabla^2 f(x_k)^{-1}\|_2 \int_0^1 \big\| \big[ \nabla^2 f(x_k) - \nabla^2 f(x^* + t(x_k - x^*)) \big](x_k - x^*) \big\|_2 \, dt \\
 &\le \|\nabla^2 f(x_k)^{-1}\|_2 \int_0^1 \|\nabla^2 f(x_k) - \nabla^2 f(x^* + t(x_k - x^*))\|_2 \, \|x_k - x^*\|_2 \, dt \\
 &\le \|\nabla^2 f(x_k)^{-1}\|_2 \, \|x_k - x^*\|_2^2 \int_0^1 L(1 - t) \, dt
  = \frac{L}{2} \|\nabla^2 f(x_k)^{-1}\|_2 \, \|x_k - x^*\|_2^2.
\end{aligned} \]
Since $\nabla^2 f(x^*)$ is positive definite, there exists a radius $r > 0$ such that $\|\nabla^2 f(x_k)^{-1}\|_2 \le 2 \|\nabla^2 f(x^*)^{-1}\|_2$ for all $x_k$ with $\|x_k - x^*\|_2 \le r$. Together with the expression above, we obtain
\[ \|x_{k+1} - x^*\|_2 \le L \|\nabla^2 f(x^*)^{-1}\|_2 \, \|x_k - x^*\|_2^2 = \tilde{L} \|x_k - x^*\|_2^2, \]
where $\tilde{L} = L \|\nabla^2 f(x^*)^{-1}\|_2$.
² Lipschitz continuity implies continuity.
Choosing $x_0$ so that $\|x_0 - x^*\|_2 \le \min(r, 1/(2\tilde{L}))$, we can deduce from this inequality that the sequence converges to $x^*$ (try this by yourself), and the rate of convergence is quadratic.
Finally, using the facts that $x_{k+1} - x_k = p_k^N$ and $\nabla f(x_k) + \nabla^2 f(x_k) p_k^N = 0$,
\[ \begin{aligned}
\|\nabla f(x_{k+1})\|_2
 &= \|\nabla f(x_{k+1}) - \nabla f(x_k) - \nabla^2 f(x_k) p_k^N\|_2 \\
 &= \Big\| \int_0^1 \nabla^2 f(x_k + t p_k^N)(x_{k+1} - x_k) \, dt - \nabla^2 f(x_k) p_k^N \Big\|_2 \\
 &\le \int_0^1 \big\| \big[ \nabla^2 f(x_k + t p_k^N) - \nabla^2 f(x_k) \big] p_k^N \big\|_2 \, dt \\
 &\le \|p_k^N\|_2^2 \int_0^1 L t \, dt = \frac{L}{2} \|p_k^N\|_2^2 \\
 &\le \frac{L}{2} \|\nabla^2 f(x_k)^{-1}\|_2^2 \, \|\nabla f(x_k)\|_2^2
  \le 2 L \|\nabla^2 f(x^*)^{-1}\|_2^2 \, \|\nabla f(x_k)\|_2^2
\end{aligned} \]
for $x_k$ sufficiently close to $x^*$, which shows that the gradient norms converge quadratically to zero.
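To see both quadratic rates numerically, here is a small illustration I added: full Newton steps on an arbitrary smooth, strongly convex test function. Once the iterates are close to the minimizer, the printed gradient norm is roughly squared at every iteration.

\begin{verbatim}
import numpy as np

# Illustrative smooth, strongly convex function with a non-constant Hessian.
f    = lambda x: np.exp(x[0]) + x[0]**2 + 2.0 * x[1]**2
grad = lambda x: np.array([np.exp(x[0]) + 2.0 * x[0], 4.0 * x[1]])
hess = lambda x: np.array([[np.exp(x[0]) + 2.0, 0.0],
                           [0.0,                4.0]])

x = np.array([1.0, 1.0])
for k in range(6):
    p = np.linalg.solve(hess(x), -grad(x))   # Newton direction, unit step
    x = x + p
    print(k, np.linalg.norm(grad(x)))        # roughly squared each iteration
\end{verbatim}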
References
Nocedal, J. and Wright, S. J. (2006). Numerical Optimization. Springer, second
edition.