Numerical Optimization

Lecture 9: Global convergence of line search and convergence rates of steepest descent/Newton's methods

Sangkyun Lee

The content is from Nocedal and Wright (2006). Topics marked with ** are optional.

1 Some Background

1.1 Cauchy-Schwarz Inequality

For two vectors $x, y \in \mathbb{R}^n$, we have
$$|x^T y| \le \|x\|_2 \|y\|_2. \qquad \text{(Cauchy-Schwarz)}$$
** In general, for $p, q \in [1, \infty]$ with $1/p + 1/q = 1$,
$$|x^T y| \le \|x\|_p \|y\|_q. \qquad \text{(Hölder's inequality)}$$
Note that Hölder's inequality holds more generally for measurable functions $x$ and $y$.

1.2 Matrix Norms (Operator Norms)

Given the vector norms $\|x\|_1 = \sum_{i=1}^n |x_i|$, $\|x\|_2 = \left(\sum_{i=1}^n x_i^2\right)^{1/2}$, and $\|x\|_\infty = \max_{i=1,\dots,n} |x_i|$ for $x \in \mathbb{R}^n$, the matrix norm induced by a vector norm (called an operator norm) of a matrix $A \in \mathbb{R}^{m \times n}$ is defined as
$$\|A\| := \sup_{x \in \mathbb{R}^n,\, x \neq 0} \frac{\|Ax\|}{\|x\|} = \sup_{x \in \mathbb{R}^n,\, \|x\| = 1} \|Ax\|.$$
The following can be derived from the above definition:

• $\|A\|_1 = \max_{j=1,\dots,n} \sum_{i=1}^m |A_{ij}|$, the maximal absolute column sum.

• $\|A\|_\infty = \max_{i=1,\dots,m} \sum_{j=1}^n |A_{ij}|$, the maximal absolute row sum.

• $\|A\|_2 = \sqrt{\lambda_{\max}(A^T A)} = \sigma_{\max}(A)$, where $\lambda_{\max}(A^T A)$ is the largest eigenvalue of $A^T A$ and $\sigma_{\max}(A)$ is the largest singular value of $A$. When $A$ is square and symmetric, the eigenvalues of $A^T A$ are the squares of the eigenvalues of $A$, so that $\|A\|_2 = \max_{i=1,\dots,n} |\lambda_i(A)| = \rho(A)$, the spectral radius of $A$.

From the definition, it is clear that $\|Ax\| \le \|A\|\|x\|$. This looks like the Cauchy-Schwarz inequality, but it is different since $A$ is a matrix, not a vector.

1.3 Convex Functions

Lemma 9.1. Let $C \subseteq \mathbb{R}^n$ be a convex set and let $f : \mathbb{R}^n \to \mathbb{R}$ be continuously differentiable over $\mathbb{R}^n$.

(a) $f$ is convex over $C$ if and only if
$$f(y) \ge f(x) + \nabla f(x)^T (y - x) \quad \text{for all } x, y \in C.$$

(b) $f$ is strictly convex over $C$ if and only if the above inequality is strict whenever $x \neq y$.

Proof. **
(⇐) Suppose that the inequality in (a) holds. Choose any $x, y \in C$ and $\alpha \in [0, 1]$, and let $z = \alpha x + (1-\alpha)y$. Using the inequality in (a) twice, we have
$$f(x) \ge f(z) + \nabla f(z)^T (x - z), \qquad f(y) \ge f(z) + \nabla f(z)^T (y - z).$$
Multiplying the first inequality by $\alpha$ and the second by $1-\alpha$, and adding them together, we obtain
$$\alpha f(x) + (1-\alpha) f(y) \ge f(z) + \nabla f(z)^T [\alpha x + (1-\alpha)y - z] = f(z),$$
which proves that $f$ is convex.

(⇒) Conversely, suppose that $f$ is convex. Let $x, y \in C$ be any vectors, and consider the function
$$g(\alpha) = \frac{f(x + \alpha(y-x)) - f(x)}{\alpha}, \qquad \alpha \in (0, 1].$$
We will show that $g(\alpha)$ is monotonically increasing in $\alpha$. Consider any $\alpha_1, \alpha_2$ with $0 < \alpha_1 < \alpha_2 \le 1$, and let
$$\bar\alpha = \frac{\alpha_1}{\alpha_2} \in (0, 1), \qquad \bar y = x + \alpha_2(y - x).$$
Since $f$ is convex, we have
$$f(x + \bar\alpha(\bar y - x)) \le \bar\alpha f(\bar y) + (1 - \bar\alpha) f(x),$$
that is,
$$\frac{f(x + \bar\alpha(\bar y - x)) - f(x)}{\bar\alpha} \le f(\bar y) - f(x).$$
Plugging in the definitions of $\bar y$ and $\bar\alpha$, we get
$$\frac{f(x + \alpha_1(y-x)) - f(x)}{\alpha_1} \le \frac{f(x + \alpha_2(y-x)) - f(x)}{\alpha_2},$$
that is, $g(\alpha_1) \le g(\alpha_2)$. This shows that $g$ is monotonically increasing on $(0, 1]$. Then
$$(y - x)^T \nabla f(x) = \lim_{\alpha \downarrow 0} g(\alpha) \le g(1) = f(y) - f(x).$$
This concludes the proof of case (a). The proof of case (b) is omitted. □

The quantity $E(x, y) = f(y) - f(x) - \nabla f(x)^T (y - x)$ is called the excess function. Therefore, showing $E(x, y) \ge 0$ for all $x, y \in C$ for a given function $f$ implies that $f$ is convex on $C$.
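As a quick numerical illustration of Lemma 9.1, the following minimal Python sketch checks that the excess function $E(x, y)$ is nonnegative on random pairs of points for a convex quadratic. The particular test function, random sampling, and the helper name `excess` are illustrative assumptions, not part of the original lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Build a convex quadratic f(x) = 0.5 x^T Q x + c^T x with Q positive semidefinite.
B = rng.standard_normal((n, n))
Q = B.T @ B                      # Q = B^T B is symmetric positive semidefinite
c = rng.standard_normal(n)

f = lambda x: 0.5 * x @ Q @ x + c @ x
grad = lambda x: Q @ x + c

def excess(x, y):
    """Excess function E(x, y) = f(y) - f(x) - grad f(x)^T (y - x)."""
    return f(y) - f(x) - grad(x) @ (y - x)

# Lemma 9.1: E(x, y) >= 0 for all x, y when f is convex (small round-off tolerance).
for _ in range(1000):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    assert excess(x, y) >= -1e-10
print("E(x, y) >= 0 held on all random pairs, as Lemma 9.1 predicts.")
```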
Theorem 9.2. Let $f : D \subset \mathbb{R}^n \to \mathbb{R}$ be a twice continuously differentiable function on an open convex set $D$. Then

• $f$ is convex over $D$ if and only if $\nabla^2 f(x)$ is positive semi-definite for all $x \in D$.

• If $\nabla^2 f(x)$ is positive definite for all $x \in D$, then $f$ is strictly convex over $D$. (The converse does not hold in general: $f(x) = x^4$ is strictly convex although $f''(0) = 0$.)

Proof. ** From Taylor's theorem, we have
$$f(y) = f(x) + \nabla f(x)^T (y - x) + \frac{1}{2}(y - x)^T \nabla^2 f(x + \alpha(y - x))(y - x)$$
for some $\alpha \in [0, 1]$. When the Hessian is positive semi-definite everywhere on $D$, then
$$E(x, y) = f(y) - f(x) - \nabla f(x)^T (y - x) \ge 0,$$
and therefore $f$ is convex due to Lemma 9.1.

On the other hand, suppose that the Hessian fails to be p.s.d. at some $x \in D$, so that $d^T \nabla^2 f(x)\, d < 0$ for some direction $d$. Since $\nabla^2 f$ is continuous, we can choose $y = x + \epsilon d \in D$ with $\epsilon > 0$ small enough that $(y - x)^T \nabla^2 f(x + \alpha(y - x))(y - x) < 0$ for all $\alpha \in [0, 1]$. Then, from Taylor's theorem as above,
$$E(x, y) = f(y) - f(x) - \nabla f(x)^T (y - x) < 0,$$
and therefore $f$ is not convex. The contrapositive gives that when $f$ is convex, the Hessian is p.s.d. at all points of $D$. □

2 Global Convergence of Wolfe Line Search

Consider the unconstrained minimization of a continuously differentiable objective function $f : \mathbb{R}^n \to \mathbb{R}$,
$$\min_{x \in \mathbb{R}^n} f(x).$$
Given the current iterate $x_k \in \mathbb{R}^n$ and a descent direction $p_k \in \mathbb{R}^n$, we generate the next iterate $x_{k+1}$ by
$$x_{k+1} = x_k + \alpha_k p_k,$$
where the step length $\alpha_k > 0$ can be determined by a line search strategy satisfying the Wolfe conditions (for $0 < c_1 < c_2 < 1$):
$$f(x_k + \alpha_k p_k) \le f(x_k) + c_1 \alpha_k \nabla f(x_k)^T p_k, \qquad \text{(sufficient decrease)}$$
$$\nabla f(x_k + \alpha_k p_k)^T p_k \ge c_2 \nabla f(x_k)^T p_k. \qquad \text{(curvature)}$$

2.1 The Angle $\theta_k$

For the analysis here, the angle $\theta_k$ between $-\nabla f(x_k)$ and the search direction $p_k$ plays an important role; it is defined by
$$\cos\theta_k = \frac{-\nabla f(x_k)^T p_k}{\|\nabla f(x_k)\|_2 \|p_k\|_2}.$$

Theorem 9.3 (Global Convergence of Line Search). Consider any iteration of the form $x_{k+1} = x_k + \alpha_k p_k$, where $p_k$ is a descent direction and $\alpha_k$ satisfies the Wolfe conditions. Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is bounded below on $\mathbb{R}^n$ and continuously differentiable in an open set $\mathcal{N}$ containing the level set $\mathcal{L} := \{x \in \mathbb{R}^n : f(x) \le f(x_0)\}$, where $x_0$ is the starting point of the iteration. Suppose also that the gradient $\nabla f$ is Lipschitz continuous on $\mathcal{N}$, i.e., there exists a constant $L > 0$ such that
$$\|\nabla f(x) - \nabla f(y)\|_2 \le L \|x - y\|_2 \quad \text{for all } x, y \in \mathcal{N}.$$
Then
$$\sum_{k \ge 0} \cos^2\theta_k \, \|\nabla f(x_k)\|_2^2 < \infty. \qquad \text{(Zoutendijk condition)}$$

Proof. From the curvature condition and $x_{k+1} = x_k + \alpha_k p_k$, subtracting $\nabla f(x_k)^T p_k$ from both sides, we have
$$(\nabla f(x_{k+1}) - \nabla f(x_k))^T p_k \ge (c_2 - 1) \nabla f(x_k)^T p_k.$$
On the other hand, the Cauchy-Schwarz inequality and the Lipschitz condition give
$$(\nabla f(x_{k+1}) - \nabla f(x_k))^T p_k \le \|\nabla f(x_{k+1}) - \nabla f(x_k)\|_2 \|p_k\|_2 \le \alpha_k L \|p_k\|_2^2.$$
From the above two expressions, we obtain
$$\alpha_k \ge \frac{(c_2 - 1)\nabla f(x_k)^T p_k}{L \|p_k\|_2^2}.$$
Plugging this into the sufficient decrease condition (and using $\nabla f(x_k)^T p_k < 0$), we get
$$f(x_{k+1}) \le f(x_k) + c_1 \alpha_k \nabla f(x_k)^T p_k \le f(x_k) + \frac{c_1 (c_2 - 1)(\nabla f(x_k)^T p_k)^2}{L \|p_k\|_2^2}.$$
Using the definition of $\cos\theta_k$, this can be rewritten as
$$f(x_{k+1}) \le f(x_k) - c \cos^2\theta_k \|\nabla f(x_k)\|_2^2$$
for $c := c_1(1 - c_2)/L > 0$. Summing this expression over all indices up to $k$, we have
$$f(x_{k+1}) \le f(x_0) - c \sum_{j=0}^{k} \cos^2\theta_j \|\nabla f(x_j)\|_2^2.$$
That is, for all $k$ we have
$$\sum_{j=0}^{k} \cos^2\theta_j \|\nabla f(x_j)\|_2^2 \le \frac{1}{c}\left(f(x_0) - f(x_{k+1})\right).$$
Since $f$ is bounded below, $f(x_{k+1}) > -M$ for some positive scalar $M$, and therefore $f(x_0) - f(x_{k+1}) < f(x_0) + M =: M' < \infty$. Hence, taking $k \to \infty$, we get the desired result:
$$\sum_{k=0}^{\infty} \cos^2\theta_k \|\nabla f(x_k)\|_2^2 < \infty. \qquad \square$$

Note that the Zoutendijk condition implies that
$$\lim_{k \to \infty} \cos^2\theta_k \|\nabla f(x_k)\|_2^2 = 0.$$
When $p_k$ is a descent direction, we have $\cos\theta_k > 0$ for all $k$ (strictly speaking, we need $\cos\theta_k \ge \delta$ for all $k$ and some $\delta > 0$, as we discuss below), and therefore the above condition implies that
$$\lim_{k \to \infty} \|\nabla f(x_k)\|_2 = 0. \qquad \text{(global convergence)}$$
We call algorithms that satisfy the above condition globally convergent.
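In practice, a step length satisfying both Wolfe conditions can be found by bracketing. The following minimal Python sketch uses a simple expand-then-bisect scheme rather than the interpolating line search algorithm of Nocedal and Wright (2006); the function name `wolfe_line_search`, its parameters, and the usage example are illustrative assumptions, assuming `f` and `grad` are callables for the objective and its gradient.

```python
import numpy as np

def wolfe_line_search(f, grad, x, p, c1=1e-4, c2=0.9, alpha0=1.0, max_iter=50):
    """Search for a step length satisfying both Wolfe conditions.

    A minimal sketch: expand alpha until a bracket is found, then bisect.
    Not the interpolation-based algorithm from Nocedal & Wright (2006).
    """
    phi0, dphi0 = f(x), grad(x) @ p          # dphi0 < 0 for a descent direction
    lo, hi, alpha = 0.0, np.inf, alpha0
    for _ in range(max_iter):
        if f(x + alpha * p) > phi0 + c1 * alpha * dphi0:
            hi = alpha                        # sufficient decrease fails: step too long
        elif grad(x + alpha * p) @ p < c2 * dphi0:
            lo = alpha                        # curvature fails: step too short
        else:
            return alpha                      # both Wolfe conditions hold
        alpha = 2.0 * alpha if np.isinf(hi) else 0.5 * (lo + hi)
    return alpha                              # fallback after max_iter trials

# Usage: one steepest descent step on an ill-conditioned quadratic.
Q = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x
x = np.array([1.0, 1.0])
p = -grad(x)                                  # steepest descent direction
print("Wolfe step length:", wolfe_line_search(f, grad, x, p))
```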
Theorem 9.3 is, in a sense, the strongest global convergence result available for the Wolfe line search strategy: it only guarantees that we find a stationary point, not necessarily a minimizer (recall, however, that in convex minimization a stationary point is a global minimizer).

2.2 Condition Number

As we have seen above, keeping $\cos\theta_k$ bounded away from zero is crucial to guarantee the global convergence of line search methods.

• Steepest descent direction: $p_k = -\nabla f(x_k)$, so it is easy to check that $\cos\theta_k = 1 > 0$ whenever $\nabla f(x_k) \neq 0$.

• Newton's direction: $p_k^N = -H_k^{-1} \nabla f(x_k)$, where $H_k = \nabla^2 f(x_k)$ is the Hessian. Recall that when $H_k$ is positive definite, we showed that $p_k^N$ is a descent direction. Suppose furthermore that $H_k$ is positive definite with a uniformly bounded condition number, that is, there exists a constant $M > 0$ such that
$$\kappa(H_k) := \|H_k\|_2 \|H_k^{-1}\|_2 \le M \quad \text{for all } k,$$
where $\kappa(H_k)$ is called the condition number of $H_k$. Note that for a symmetric positive definite $H_k$, the condition number is simply
$$\kappa(H_k) = \|H_k\|_2 \|H_k^{-1}\|_2 = \frac{\lambda_n}{\lambda_1},$$
the ratio between the largest eigenvalue $\lambda_n$ and the smallest eigenvalue $\lambda_1$ of $H_k$. When $\kappa(H_k) \le M$ for all $k$, one can show that
$$\cos\theta_k \ge \frac{1}{M} > 0 \quad \text{for all } k.$$

3 Rate of Convergence

3.1 Convergence Rate of Steepest Descent

The behavior of steepest descent can be understood in an ideal case, in which we consider a strongly convex quadratic objective function
$$f(x) = \frac{1}{2} x^T Q x + c^T x,$$
where $Q \in \mathbb{R}^{n \times n}$ is symmetric and positive definite. In this particular case, it is easy to perform an exact line search, i.e., to find the $\alpha > 0$ that minimizes
$$f(x_k - \alpha \nabla f(x_k)) = \frac{1}{2}\left(x_k - \alpha \nabla f(x_k)\right)^T Q \left(x_k - \alpha \nabla f(x_k)\right) + c^T \left(x_k - \alpha \nabla f(x_k)\right).$$
From the optimality conditions, the minimizer $\alpha_k$ is
$$\alpha_k = \frac{\nabla f(x_k)^T \nabla f(x_k)}{\nabla f(x_k)^T Q \nabla f(x_k)} > 0.$$
(Note that if $Q$ is not positive definite, $\alpha_k$ as expressed above can have undesirable properties.) With this exact minimizer, we can write an explicit expression for the steepest descent update for minimizing strongly convex quadratic functions:
$$x_{k+1} = x_k - \frac{\nabla f(x_k)^T \nabla f(x_k)}{\nabla f(x_k)^T Q \nabla f(x_k)} \nabla f(x_k).$$
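As a numerical illustration of this update, here is a minimal Python sketch (the particular $Q$, $c$, and starting point are illustrative assumptions, not from the lecture). It runs the exact-line-search steepest descent iteration and prints the per-step contraction of $\|x_k - x^*\|_Q^2$, which can be compared against the bound of Theorem 9.4 stated below.

```python
import numpy as np

# Steepest descent with exact line search on f(x) = 0.5 x^T Q x + c^T x.
Q = np.diag([1.0, 5.0, 25.0])          # eigenvalues lambda_1 = 1, lambda_n = 25
c = np.array([1.0, -2.0, 3.0])
x_star = np.linalg.solve(Q, -c)        # minimizer satisfies Q x* + c = 0

def q_norm_sq(v):
    """Weighted norm ||v||_Q^2 = v^T Q v."""
    return v @ Q @ v

x = np.zeros(3)
for k in range(20):
    g = Q @ x + c                      # gradient of the quadratic
    alpha = (g @ g) / (g @ Q @ g)      # exact line search step length
    x_new = x - alpha * g
    ratio = q_norm_sq(x_new - x_star) / q_norm_sq(x - x_star)
    print(f"k={k:2d}  contraction of ||x_k - x*||_Q^2 = {ratio:.4f}")
    x = x_new

bound = ((25.0 - 1.0) / (25.0 + 1.0)) ** 2
print(f"Theorem 9.4 bound ((lam_n - lam_1)/(lam_n + lam_1))^2 = {bound:.4f}")
```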
To quantify the rate of convergence, we use the weighted norm $\|x\|_Q^2 = x^T Q x$. (Note that this defines a norm only when $Q$ is positive definite.) Using the fact that the unconstrained minimizer $x^*$ of $f$ satisfies $Q x^* + c = 0$, we can show that
$$\frac{1}{2}\|x - x^*\|_Q^2 = f(x) - f(x^*). \qquad (9.1)$$
Using the update rule above and the fact that $g_k := \nabla f(x_k) = Q(x_k - x^*)$, we can show that
$$\|x_{k+1} - x^*\|_Q^2 = \left[1 - \frac{(g_k^T g_k)^2}{(g_k^T Q g_k)(g_k^T Q^{-1} g_k)}\right] \|x_k - x^*\|_Q^2.$$
This can be further simplified to give the following theorem.

Theorem 9.4 (Convergence Rate of Steepest Descent: Quadratic Case). Using the steepest descent directions with exact line searches to minimize the strongly convex quadratic function $f(x) = \frac{1}{2}x^T Q x + c^T x$, we have
$$\|x_{k+1} - x^*\|_Q^2 \le \left(\frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1}\right)^2 \|x_k - x^*\|_Q^2,$$
where $0 < \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$ are the eigenvalues of $Q$.

This theorem can be shown using the Kantorovich inequality:

Lemma 9.5 (Kantorovich Inequality). For a symmetric positive definite $n \times n$ matrix $Q$ and any nonzero vector $x \in \mathbb{R}^n$,
$$\frac{(x^T x)^2}{(x^T Q x)(x^T Q^{-1} x)} \ge \frac{4\lambda_1\lambda_n}{(\lambda_1 + \lambda_n)^2},$$
where $\lambda_1$ and $\lambda_n$ are the smallest and the largest eigenvalues of $Q$.

A few observations can be made about Theorem 9.4:

• Using (9.1), it implies that
$$f(x_{k+1}) - f(x^*) \le \left(\frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1}\right)^2 \left(f(x_k) - f(x^*)\right).$$

• That is, the rate of convergence in terms of objective function values is linear (Lecture 2), since
$$\frac{f(x_{k+1}) - f(x^*)}{f(x_k) - f(x^*)} \le \left(\frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1}\right)^2 < 1 \quad \text{for all } k.$$

• The rate of convergence depends on the condition number $\kappa(Q) = \lambda_n/\lambda_1$, since
$$\left(\frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1}\right)^2 = \left(\frac{\kappa(Q) - 1}{\kappa(Q) + 1}\right)^2.$$

• If $\kappa(Q)$ is large (the contours of the quadratic $f$ are elongated ellipses), then convergence becomes slow; in fact, zigzagging can occur.

• In particular, when all eigenvalues are equal, so that $\lambda_1 = \lambda_n$ (the contours of $f$ are perfectly spherical), steepest descent converges in a single iteration.

The convergence rate of steepest descent with exact line searches for general nonlinear objective functions is given by the following theorem.

Theorem 9.6 (Convergence of Steepest Descent). Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is twice continuously differentiable and that the iterates generated by steepest descent with exact line searches converge to a point $x^*$ at which $\nabla^2 f(x^*)$ is positive definite. Let $r$ be a scalar such that
$$r \in \left(\frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1},\, 1\right),$$
where $\lambda_1 \le \cdots \le \lambda_n$ are the eigenvalues of $\nabla^2 f(x^*)$. Then for all sufficiently large $k$, we have
$$f(x_{k+1}) - f(x^*) \le r^2 \left(f(x_k) - f(x^*)\right).$$

When inexact line searches are used (e.g., the Wolfe line search), the rate of convergence does not improve in general.

3.2 Convergence Rate of Newton's Method

For a twice differentiable function $f : \mathbb{R}^n \to \mathbb{R}$, Newton directions are given by
$$p_k^N = -\left(\nabla^2 f(x_k)\right)^{-1} \nabla f(x_k),$$
supposing that the inverse exists. Recall that if $\nabla^2 f(x_k)$ is not positive definite, $p_k^N$ may not be a descent direction. There are two remedies when the Hessian is not p.d.:

• modify the Hessian so that it becomes p.d.;

• use trust-region methods.

Here we discuss the local convergence of Newton's method: we know that if $\nabla^2 f(x^*)$ is p.d. and the Hessian $\nabla^2 f(x)$ is continuous, then $\nabla^2 f(x)$ will be p.d. as well in a neighborhood of $x^*$. In such a region, Newton directions are well defined and are descent directions.
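Before stating the local convergence result, here is a minimal Python sketch of the pure Newton iteration with unit step lengths. The helper name `newton` and the test function are illustrative assumptions; the sketch presumes the Hessian stays positive definite (hence invertible) near the solution, as discussed above.

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    """Pure Newton iteration x_{k+1} = x_k + p_k^N with unit step lengths.

    A minimal local sketch: it assumes the Hessian remains positive definite
    near the solution, so the Newton system is always solvable.
    """
    x = x0.astype(float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        p = np.linalg.solve(hess(x), -g)   # Newton direction p_k^N
        x = x + p                          # unit step length
    return x

# Illustrative example (not from the lecture): f(x) = sum_i (exp(x_i) - x_i),
# whose unique minimizer is x* = 0 and whose Hessian diag(exp(x)) is p.d.
grad = lambda x: np.exp(x) - 1.0
hess = lambda x: np.diag(np.exp(x))
print(newton(grad, hess, np.array([0.5, -0.3, 0.8])))   # converges to ~0
```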
Theorem 9.7 (Local Convergence of Newton's Method). Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is twice differentiable and that $\nabla^2 f$ is Lipschitz continuous in a neighborhood $\mathcal{N}(x^*)$ of a solution $x^*$ with a constant $L > 0$ (recall that Lipschitz continuity implies continuity), that is,
$$\|\nabla^2 f(x) - \nabla^2 f(y)\|_2 \le L \|x - y\|_2 \quad \text{for all } x, y \in \mathcal{N}(x^*),$$
where the second-order sufficient conditions are satisfied at $x^*$. Consider the iteration $x_{k+1} = x_k + p_k^N$ (so that all step lengths are 1), and suppose that the starting point $x_0$ is sufficiently close to $x^*$. Then

• the sequence $\{x_k\}$ converges quadratically to $x^*$;

• the sequence of gradient norms $\{\|\nabla f(x_k)\|_2\}$ converges quadratically to zero.

Proof. **
$$x_{k+1} - x^* = x_k + p_k^N - x^* = x_k - x^* - \nabla^2 f(x_k)^{-1}\nabla f(x_k) = \nabla^2 f(x_k)^{-1}\left[\nabla^2 f(x_k)(x_k - x^*) - (\nabla f(x_k) - \nabla f(x^*))\right],$$
where we have used $\nabla f(x^*) = 0$ from the second-order sufficient conditions.

Using the relation
$$\nabla f(x_k) - \nabla f(x^*) = \int_0^1 \frac{d}{dt}\,\nabla f(x^* + t(x_k - x^*))\, dt = \int_0^1 \nabla^2 f(x^* + t(x_k - x^*))\,(x_k - x^*)\, dt,$$
we get
$$\begin{aligned}
\|x_{k+1} - x^*\|_2 &= \left\|\nabla^2 f(x_k)^{-1}\left[\nabla^2 f(x_k)(x_k - x^*) - (\nabla f(x_k) - \nabla f(x^*))\right]\right\|_2 \\
&\le \|\nabla^2 f(x_k)^{-1}\|_2\, \left\|\nabla^2 f(x_k)(x_k - x^*) - (\nabla f(x_k) - \nabla f(x^*))\right\|_2 \\
&= \|\nabla^2 f(x_k)^{-1}\|_2\, \left\|\int_0^1 \left[\nabla^2 f(x_k) - \nabla^2 f(x^* + t(x_k - x^*))\right](x_k - x^*)\, dt\right\|_2 \\
&\le \|\nabla^2 f(x_k)^{-1}\|_2 \int_0^1 \left\|\left[\nabla^2 f(x_k) - \nabla^2 f(x^* + t(x_k - x^*))\right](x_k - x^*)\right\|_2 dt \\
&\le \|\nabla^2 f(x_k)^{-1}\|_2 \int_0^1 \left\|\nabla^2 f(x_k) - \nabla^2 f(x^* + t(x_k - x^*))\right\|_2 \|x_k - x^*\|_2\, dt \\
&\le \|\nabla^2 f(x_k)^{-1}\|_2\, \|x_k - x^*\|_2^2 \int_0^1 L(1 - t)\, dt \\
&= \frac{L}{2}\, \|\nabla^2 f(x_k)^{-1}\|_2\, \|x_k - x^*\|_2^2.
\end{aligned}$$
Since $\nabla^2 f(x^*)$ is positive definite, there exists a radius $r > 0$ such that $\|\nabla^2 f(x_k)^{-1}\|_2 \le 2\|\nabla^2 f(x^*)^{-1}\|_2$ for all $x_k$ with $\|x_k - x^*\|_2 \le r$. Together with the expression above, we obtain
$$\|x_{k+1} - x^*\|_2 \le L\,\|\nabla^2 f(x^*)^{-1}\|_2\, \|x_k - x^*\|_2^2 = \tilde L\, \|x_k - x^*\|_2^2,$$
where $\tilde L := L\|\nabla^2 f(x^*)^{-1}\|_2$. Choosing $x_0$ so that $\|x_0 - x^*\|_2 \le \min(r, 1/(2\tilde L))$, we can deduce from this inequality that the sequence converges to $x^*$ (try this by yourself), and the rate of convergence is quadratic.

Finally, using the facts that $x_{k+1} - x_k = p_k^N$ and $\nabla f(x_k) + \nabla^2 f(x_k) p_k^N = 0$,
$$\begin{aligned}
\|\nabla f(x_{k+1})\|_2 &= \|\nabla f(x_{k+1}) - \nabla f(x_k) - \nabla^2 f(x_k) p_k^N\|_2 \\
&= \left\|\int_0^1 \nabla^2 f(x_k + t p_k^N)(x_{k+1} - x_k)\, dt - \nabla^2 f(x_k) p_k^N\right\|_2 \\
&\le \int_0^1 \left\|\left[\nabla^2 f(x_k + t p_k^N) - \nabla^2 f(x_k)\right] p_k^N\right\|_2 dt \\
&\le \|p_k^N\|_2^2 \int_0^1 L t\, dt = \frac{L}{2}\|p_k^N\|_2^2 \\
&\le \frac{L}{2}\, \|\nabla^2 f(x_k)^{-1}\|_2^2\, \|\nabla f(x_k)\|_2^2 \\
&\le 2L\, \|\nabla^2 f(x^*)^{-1}\|_2^2\, \|\nabla f(x_k)\|_2^2
\end{aligned}$$
for $x_k$ sufficiently close to $x^*$, which shows that the gradient norms converge quadratically to zero. □

References

Nocedal, J. and Wright, S. J. (2006). Numerical Optimization. Springer, second edition.