Splitting Envelopes: Accelerated Second-order Proximal Methods
Panos Patrinos (joint work with Lorenzo Stella and Alberto Bemporad)
September 8, 2014

Outline
- forward-backward envelope (FBE)
- forward-backward Newton method (FBN)
- dual FBE and the augmented Lagrangian
- alternating minimization Newton method (AMNM)
- Douglas-Rachford envelope (DRE)
- accelerated Douglas-Rachford splitting (ADRS)

Based on
1. P. Patrinos and A. Bemporad. Proximal Newton methods for convex composite optimization. In Proc. 52nd IEEE Conference on Decision and Control (CDC), pages 2358-2363, Florence, Italy, 2013.
2. P. Patrinos, L. Stella, and A. Bemporad. Forward-backward truncated Newton methods for convex composite optimization. Submitted, arXiv:1402.6655, 2014.
3. P. Patrinos, L. Stella, and A. Bemporad. Douglas-Rachford splitting: complexity estimates and accelerated variants. In Proc. 53rd IEEE Conference on Decision and Control (CDC), Los Angeles, CA, arXiv:1407.6723, 2014.
4. L. Stella, P. Patrinos, and A. Bemporad. Alternating minimization Newton method for separable convex optimization. Submitted, 2014.

Fixed-point implementation for MPC
- A. Guiggiani, P. Patrinos, and A. Bemporad. Fixed-point implementation of a proximal Newton method for embedded model predictive control. In 19th IFAC World Congress, South Africa, 2014.

Convex composite optimization

    minimize F(x) = f(x) + g(x)

- f : \mathbb{R}^n \to \mathbb{R} convex, twice continuously differentiable, with \|\nabla f(x) - \nabla f(y)\| \le L_f \|x - y\| for all x, y \in \mathbb{R}^n
- g : \mathbb{R}^n \to \mathbb{R} convex, nonsmooth, with an inexpensive proximal mapping
      prox_{\gamma g}(x) = \arg\min_{z \in \mathbb{R}^n} \{ g(z) + \tfrac{1}{2\gamma}\|z - x\|^2 \}
- many problem classes: QPs, cone programs, sparse least-squares, rank minimization, total variation minimization, ...
- applications: control, system identification, signal processing, image analysis, machine learning, ...

Proximal mappings

    prox_{\gamma g}(x) = \arg\min_{z \in \mathbb{R}^n} \{ g(z) + \tfrac{1}{2\gamma}\|z - x\|^2 \}, \quad \gamma > 0

- resolvent of the maximal monotone operator \partial g: prox_{\gamma g}(x) = (I + \gamma \partial g)^{-1}(x)
- single-valued and (firmly) nonexpansive
- explicitly computable for many functions (see Parikh & Boyd '14, Combettes & Pesquet '10)
- reduces to a projection when g is the indicator of a convex set C: prox_{\gamma \delta_C}(x) = \Pi_C(x)
- z = prox_{\gamma g}(x) is an implicit subgradient step: 0 \in \partial g(z) + \gamma^{-1}(z - x), i.e., z = x - \gamma v with v \in \partial g(z)

Proximal Minimization Algorithm

    minimize g(x),   g : \mathbb{R}^n \to \mathbb{R} closed, proper, convex

given x^0 \in \mathbb{R}^n, repeat
    x^{k+1} = prox_{\gamma g}(x^k),   \gamma > 0

- fixed-point iteration for the optimality condition
      0 \in \partial g(x^\star) \iff x^\star \in (I + \gamma\partial g)(x^\star) \iff x^\star = prox_{\gamma g}(x^\star)
- special case of the proximal point algorithm (Martinet '70, Rockafellar '76)
- converges under very general conditions
- mostly a conceptual algorithm

Moreau envelope

Moreau envelope of a closed, proper, convex g : \mathbb{R}^n \to \mathbb{R}:

    g^\gamma(x) = \inf_{z \in \mathbb{R}^n} \{ g(z) + \tfrac{1}{2\gamma}\|z - x\|^2 \}, \quad \gamma > 0

- g^\gamma is real-valued, convex, and differentiable with 1/\gamma-Lipschitz gradient
      \nabla g^\gamma(x) = \tfrac{1}{\gamma}(x - prox_{\gamma g}(x))
- minimizing the nonsmooth g is equivalent to minimizing the smooth g^\gamma
- proximal minimization algorithm = gradient method for g^\gamma:
      x^{k+1} = x^k - \gamma \nabla g^\gamma(x^k)
- can use any method of unconstrained smooth minimization on g^\gamma
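As a concrete illustration of the last two slides, the sketch below (not part of the original slides; a minimal numpy example assuming g(x) = \lambda\|x\|_1) computes prox_{\gamma g} by soft-thresholding, evaluates the Moreau envelope g^\gamma, and checks the gradient formula \nabla g^\gamma(x) = (x - prox_{\gamma g}(x))/\gamma against finite differences.

```python
import numpy as np

# Assumed example (not from the slides): g(x) = lam * ||x||_1, whose prox is soft-thresholding.
def prox_l1(x, t):
    """prox_{t*||.||_1}(x), i.e. soft-thresholding with threshold t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def moreau_env(x, gamma, lam):
    """g^gamma(x) = g(z) + ||z - x||^2 / (2*gamma) with z = prox_{gamma*g}(x)."""
    z = prox_l1(x, gamma * lam)
    return lam * np.abs(z).sum() + np.sum((z - x) ** 2) / (2 * gamma)

def grad_moreau_env(x, gamma, lam):
    """Gradient formula from the slides: (x - prox_{gamma*g}(x)) / gamma."""
    return (x - prox_l1(x, gamma * lam)) / gamma

rng = np.random.default_rng(0)
x, gamma, lam = rng.standard_normal(5), 0.5, 0.3

# Finite-difference check of the gradient formula (g^gamma is C^1, so central differences apply).
g_analytic = grad_moreau_env(x, gamma, lam)
eps = 1e-6
g_fd = np.array([(moreau_env(x + eps * e, gamma, lam) - moreau_env(x - eps * e, gamma, lam)) / (2 * eps)
                 for e in np.eye(len(x))])
print(np.allclose(g_analytic, g_fd, atol=1e-4))  # expect True
```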
Forward-Backward Splitting (FBS)

    minimize F(x) = f(x) + g(x)

- optimality condition: x^\star \in \mathbb{R}^n is optimal if and only if
      x^\star = prox_{\gamma g}(x^\star - \gamma \nabla f(x^\star)), \quad \gamma > 0
- forward-backward splitting (aka proximal gradient):
      x^{k+1} = prox_{\gamma g}(x^k - \gamma \nabla f(x^k)), \quad \gamma \in (0, 2/L_f)
- FBS is a fixed-point iteration
- g = 0: gradient method; g = \delta_C: gradient projection; f = 0: proximal minimization
- accelerated versions (Nesterov)

Forward-Backward Envelope

    x - prox_{\gamma g}(x - \gamma \nabla f(x)) = 0

- use prox_{\gamma g}(y) = y - \gamma \nabla g^\gamma(y) with y = x - \gamma \nabla f(x):
      \gamma \nabla f(x) + \gamma \nabla g^\gamma(x - \gamma \nabla f(x)) = 0
- multiplying by \gamma^{-1}(I - \gamma \nabla^2 f(x)) (positive definite for \gamma \in (0, 1/L_f)) gives the gradient of the Forward-Backward Envelope (FBE)
      F^{FB}_\gamma(x) = f(x) - \tfrac{\gamma}{2}\|\nabla f(x)\|^2 + g^\gamma(x - \gamma \nabla f(x))
- alternative expression for the FBE (linearize f around x):
      F^{FB}_\gamma(x) = \inf_{z \in \mathbb{R}^n} \{ f(x) + \langle \nabla f(x), z - x \rangle + g(z) + \tfrac{1}{2\gamma}\|z - x\|^2 \}

Properties of the FBE
- stationary points of F^{FB}_\gamma = minimizers of F
- reformulates the original nonsmooth problem as a smooth one:
      minimize_x F^{FB}_\gamma(x)   is equivalent to   minimize_x F(x)
- F^{FB}_\gamma is real-valued and continuously differentiable, with
      \nabla F^{FB}_\gamma(x) = \gamma^{-1}(I - \gamma \nabla^2 f(x))(x - prox_{\gamma g}(x - \gamma \nabla f(x)))
- FBS is a variable metric gradient method for the FBE:
      x^{k+1} = x^k - \gamma D_k^{-1} \nabla F^{FB}_\gamma(x^k),   D_k = I - \gamma \nabla^2 f(x^k)

Forward-Backward Newton Method (FBN)

Input: x^0 \in \mathbb{R}^n, \gamma \in (0, 1/L_f), \sigma \in (0, 1/2)
for k = 0, 1, 2, ... do
    Newton direction: choose H^k \in \hat{\partial}^2 F^{FB}_\gamma(x^k) and compute d^k by solving (approximately)
        H^k d = -\nabla F^{FB}_\gamma(x^k)
    Line search: compute the stepsize \tau_k by backtracking until
        F^{FB}_\gamma(x^k + \tau_k d^k) \le F^{FB}_\gamma(x^k) + \sigma \tau_k \langle \nabla F^{FB}_\gamma(x^k), d^k \rangle
    Update: x^{k+1} = x^k + \tau_k d^k
end

Linear Newton approximation

The FBE is C^1 but not C^2. Solve H d = -\nabla F^{FB}_\gamma(x), where
    \nabla F^{FB}_\gamma(x) = \gamma^{-1}(I - \gamma \nabla^2 f(x))(x - prox_{\gamma g}(x - \gamma \nabla f(x)))
and \hat{\partial}^2 F^{FB}_\gamma(x) is an approximate generalized Hessian:
    H = \gamma^{-1}(I - \gamma \nabla^2 f(x))(I - P(I - \gamma \nabla^2 f(x))) \in \hat{\partial}^2 F^{FB}_\gamma(x),
    P \in \partial_C(prox_{\gamma g})(x - \gamma \nabla f(x))   (Clarke's generalized Jacobian)
- preserves all favorable properties of the Hessian of a C^2 function
- "Gauss-Newton" generalized Hessian: third-order terms are omitted

Generalized Jacobians of proximal mappings
- \partial_C prox_{\gamma g}(x) is the convex hull of limits of (ordinary) Jacobians of prox_{\gamma g} along every sequence of differentiability points converging to x (Clarke, 1983)
- prox_{\gamma g}(x) simple to compute  =>  some P \in \partial_C(prox_{\gamma g})(x) comes for free
- g (block) separable  =>  P \in \partial_C(prox_{\gamma g})(x) is (block) diagonal

Example: \ell_1 norm (more examples in Patrinos, Stella, Bemporad (2014))
- g(x) = \|x\|_1:
      prox_{\gamma g}(x)_i = x_i + \gamma  if x_i \le -\gamma;   0  if -\gamma \le x_i \le \gamma;   x_i - \gamma  if x_i \ge \gamma
- the P \in \partial_C(prox_{\gamma g})(x) are diagonal matrices with
      P_{ii} = 1  if |x_i| > \gamma;   P_{ii} \in [0, 1]  if |x_i| = \gamma;   P_{ii} = 0  if |x_i| < \gamma

Convergence of FBN
- every limit point of \{x^k\} belongs to \arg\min_{x} F(x)
- all H \in \hat{\partial}^2 F^{FB}_\gamma(x^\star) nonsingular  =>  Q-quadratic asymptotic rate

Extension: FBN II
- apply an FB step after each Newton step
- same asymptotic rate, plus global complexity estimates:
  - non-strongly convex f: sublinear rate for F(x^k) - F(x^\star)
  - strongly convex f: linear rate for F(x^k) - F(x^\star) and for \|x^k - x^\star\|^2
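To make the FBN iteration and the \ell_1 generalized Jacobian above concrete, here is a minimal sketch (my own illustration, not code from the papers) for f(x) = \tfrac{1}{2}\|Ax - b\|^2 and g(x) = \lambda\|x\|_1; it uses the diagonal P from the \ell_1 example and adds a small regularization \delta I to the Newton system for safety.

```python
import numpy as np

# Minimal FBN sketch (illustration only) for:  minimize 0.5*||Ax - b||^2 + lam*||x||_1
rng = np.random.default_rng(1)
m, n, lam = 60, 120, 0.5
A, b = rng.standard_normal((m, n)), rng.standard_normal(m)
Q = A.T @ A                              # Hessian of f
Lf = np.linalg.norm(Q, 2)                # Lipschitz constant of grad f
gamma, sigma = 0.95 / Lf, 1e-4

def prox_l1(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def fbe(x):
    """F^FB_gamma(x) = f(x) - (gamma/2)*||grad f(x)||^2 + g^gamma(x - gamma*grad f(x))."""
    grad = Q @ x - A.T @ b
    u = x - gamma * grad
    z = prox_l1(u, gamma * lam)
    g_env = lam * np.abs(z).sum() + np.sum((z - u) ** 2) / (2 * gamma)
    return 0.5 * np.sum((A @ x - b) ** 2) - 0.5 * gamma * np.sum(grad ** 2) + g_env

def grad_fbe(x):
    grad = Q @ x - A.T @ b
    u = x - gamma * grad
    return (np.eye(n) - gamma * Q) @ (x - prox_l1(u, gamma * lam)) / gamma

x, I = np.zeros(n), np.eye(n)
for k in range(50):
    gF = grad_fbe(x)
    if np.linalg.norm(gF) <= 1e-8:
        break
    u = x - gamma * (Q @ x - A.T @ b)                # forward (gradient) point
    P = (np.abs(u) > gamma * lam).astype(float)      # diagonal of P in Clarke Jacobian of prox
    M = I - gamma * Q
    H = M @ (I - P[:, None] * M) / gamma             # generalized Hessian of the FBE
    d = np.linalg.solve(H + 1e-6 * I, -gF)           # regularized Newton direction
    Fx, tau = fbe(x), 1.0
    while fbe(x + tau * d) > Fx + sigma * tau * (gF @ d) and tau > 1e-12:
        tau *= 0.5                                   # backtracking line search
    x = x + tau * d

print("final ||grad FBE||:", np.linalg.norm(grad_fbe(x)))
```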
FBN-CG (large problems)

Run conjugate gradient (CG) on a regularized Newton system until the residual satisfies
    \|(H^k + \delta_k I) d^k + \nabla F^{FB}_\gamma(x^k)\| \le \eta_k \|\nabla F^{FB}_\gamma(x^k)\|,
with \eta_k = O(\|\nabla F^{FB}_\gamma(x^k)\|) and \delta_k = O(\|\nabla F^{FB}_\gamma(x^k)\|).

Properties
- no need to form \nabla^2 f(x) and H^k -- only matrix-vector products
- same convergence properties

Box-constrained convex programs

    minimize f(x)   subject to \ell \le x \le u

The Newton direction solves
    minimize   \tfrac{1}{2}\langle d, \nabla^2 f(x^k) d \rangle + \langle \nabla f(x^k), d \rangle
    subject to d_i = \ell_i - x^k_i, i \in \beta_1;   d_i = u_i - x^k_i, i \in \beta_2,
where
    \beta_1 = \{ i \mid x^k_i - \gamma \nabla_i f(x^k) \le \ell_i \}   (estimate of x^\star_i = \ell_i)
    \beta_2 = \{ i \mid x^k_i - \gamma \nabla_i f(x^k) \ge u_i \}   (estimate of x^\star_i = u_i)
The Newton system becomes
    Q_{\delta\delta} d_\delta = -(\nabla_\delta f(x^k) + \nabla^2_{\delta\beta} f(x^k)\, d_\beta),   with \beta = \beta_1 \cup \beta_2 and \delta = [n] \setminus \beta.

Example

    minimize \tfrac{1}{2}\langle x, Qx \rangle + \langle q, x \rangle   subject to \ell \le x \le u,   n = 1000

[Figure: F(x^\nu) - F^\star versus time for cond(Q) = 10^4 and cond(Q) = 10^8; FBN, FBN II, and accelerated FBS (FGM/AFBS) compared. Reference solvers on the two instances: GUROBI 4.87 s and 5.96 s, CPLEX 3.73 s and 4.83 s.]

FBN: much less sensitive to bad conditioning.

Sparse least-squares

    minimize \tfrac{1}{2}\|Ax - b\|^2 + \lambda\|x\|_1

- the Newton system becomes
      d_\beta = -x_\beta,
      A_{\cdot\delta}^\top A_{\cdot\delta}\, d_\delta = -\left[ A_{\cdot\delta}^\top (A_{\cdot\delta} x_\delta - b) + \lambda\,\mathrm{sign}(x_\delta - \gamma \nabla_\delta f(x)) \right]
- \delta is an estimate of the nonzero components of x^\star:
      \delta = \{ i \mid |x_i - \gamma \nabla_i f(x)| > \lambda\gamma \}
- close to the solution, \delta is small

FBN methods are robust
- 548 datasets taken from http://wwwopt.mathematik.tu-darmstadt.de/spear/
- compared against AFBS, YALL1 (ADMM-based), SpaRSA (Barzilai-Borwein), l1-ls (interior point)
- performance plot: a point (x, y) indicates that the algorithm is at most x times slower than the best on a fraction y of the problems

[Figure: performance profiles for FBN-CG I, FBN-CG II, accelerated FBS, YALL1, SpaRSA, l1-ls.]

Sparse logistic regression

    minimize_{x, y} \sum_{i=1}^{m} \log\left(1 + e^{-b_i(a_i^\top x + y)}\right) + \lambda\|x\|_1,   \lambda > 0

[Figure: F^k - F^\star versus time for AFBS and FBN.]

Augmented Lagrangians and Moreau envelopes

    minimize f(x)   subject to Ax = b,     f : \mathbb{R}^n \to \mathbb{R} convex (can be nonsmooth),  A \in \mathbb{R}^{p \times n}

Augmented Lagrangian
    L_\gamma(x, y) = f(x) + \langle y, Ax - b \rangle + \tfrac{\gamma}{2}\|Ax - b\|^2

Augmented Lagrangian method (ALM, or method of multipliers), Hestenes (1969), Powell (1969):
    x^k = \arg\min_{x} L_\gamma(x, y^k)
    y^{k+1} = y^k + \gamma(Ax^k - b)

ALM = proximal minimization for the dual (Rockafellar 1973, 1976)
    = gradient method for the Moreau envelope of the dual

Separable convex problems

    minimize f(x) + g(z)   subject to Ax + Bz = b,     f, g convex (can be nonsmooth),  A \in \mathbb{R}^{p \times n},  B \in \mathbb{R}^{p \times m}

- f is strongly convex with convexity parameter \mu_f
- f, g are nice, e.g. separable; coupling is introduced only through the constraints
- ALM is not amenable to decomposition: the x- and z-updates are coupled

Alternating Minimization Method (AMM): FBS applied to the dual (see the sketch below)
    x^{k+1} = \arg\min_{x} L_0(x, z^k, y^k)
    z^{k+1} = \arg\min_{z} L_\gamma(x^{k+1}, z, y^k)
    y^{k+1} = y^k + \gamma(Ax^{k+1} + Bz^{k+1} - b),     \gamma \in (0, 2\mu_f/\|A\|^2)
with augmented Lagrangian
    L_\gamma(x, z, y) = f(x) + g(z) + \langle y, Ax + Bz - b \rangle + \tfrac{\gamma}{2}\|Ax + Bz - b\|^2
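A minimal numerical sketch of the AMM iteration above (my own illustration, not from the slides): f(x) = \tfrac{1}{2}\|x - p\|^2 (so \mu_f = 1), g = \delta_{[\ell,u]}, and the constraint Ax - z = 0, i.e. computing the projection of p onto \{x : \ell \le Ax \le u\}; both subproblem minimizers are available in closed form.

```python
import numpy as np

# AMM sketch (illustration):  minimize 0.5*||x - p||^2 + delta_[l,u](z)  s.t.  A x - z = 0,
# i.e. project p onto {x : l <= A x <= u}.  Here f(x) = 0.5*||x - p||^2 has mu_f = 1.
rng = np.random.default_rng(2)
n_cons, n = 30, 20
A = rng.standard_normal((n_cons, n))
p = rng.standard_normal(n)
l, u = -np.ones(n_cons), np.ones(n_cons)

gamma = 1.9 / np.linalg.norm(A, 2) ** 2         # gamma in (0, 2*mu_f/||A||^2)
y = np.zeros(n_cons)
for k in range(2000):
    x = p - A.T @ y                             # x-update: argmin_x L_0(x, z, y)
    z = np.clip(A @ x + y / gamma, l, u)        # z-update: argmin_z L_gamma(x, z, y)
    y = y + gamma * (A @ x - z)                 # dual update

print("constraint violation:", np.max(np.maximum(A @ x - u, l - A @ x).clip(min=0)))
```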
Dual FBE

The FBE of the dual problem is the augmented Lagrangian:
    F^{FB}_\gamma(y) = L_\gamma(x(y), z(y), y),
where x(y), z(y) are the AMM updates
    x(y) = \arg\min_{x} \{ f(x) + \langle y, Ax \rangle \}
    z(y) = \arg\min_{z} \{ g(z) + \langle y, Bz \rangle + \tfrac{\gamma}{2}\|Ax(y) + Bz - b\|^2 \}

Connection between AMM and the FBE

The dual problem is equivalent to
    maximize_{y \in \mathbb{R}^p} F^{FB}_\gamma(y) = L_\gamma(x(y), z(y), y),     \gamma \in (0, \mu_f/\|A\|^2)
- f \in C^2(\mathbb{R}^n)  =>  F^{FB}_\gamma \in C^1(\mathbb{R}^p), with
      \nabla F^{FB}_\gamma(y) = D(y)\,(Ax(y) + Bz(y) - b),     D(y) = I - \gamma A (\nabla^2 f(x(y)))^{-1} A^\top
- AMM is a variable metric gradient method on the dual FBE:
      y^{k+1} = y^k + \gamma D(y^k)^{-1} \nabla F^{FB}_\gamma(y^k)
- AMNM: FBN applied to the dual

Strictly convex QPs

    minimize \tfrac{1}{2}\langle x, Qx \rangle + \langle q, x \rangle   subject to \ell \le Ax \le u,     cond(Q) = 10^4,  A \in \mathbb{R}^{2000 \times 1000}

[Figure: duality gap and primal infeasibility versus time for AMNM, AMNM II, and fast AMM (FGM).]
GUROBI: 6.7 s, CPLEX: 60.8 s, fast AMM: 41 s, AMNM: 8.1 s, AMNM II: 11.3 s.

Projection onto convex sets

    minimize \tfrac{1}{2}\|x - p\|^2 + \sum_{i=1}^{m} \delta_{C_i}(z_i)   subject to x = z_i,  i = 1, \ldots, m

- \delta_C is the indicator of C
- x^\star is the projection of p onto C_1 \cap \ldots \cap C_m
- compared methods: Dykstra, fast AMM, AMNM (a reference Dykstra sketch is given below)

Random problem with m = 100 random hyperplanes in \mathbb{R}^{120}
[Figure: \|x^k - x^\star\| versus time for AMM, fast AMM, Dykstra, ADMM, AMNM.]
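For reference, here is a minimal sketch of the classical Dykstra scheme used as a baseline in the comparison above (my own illustration; the hyperplane sets, problem sizes, and sweep count are assumptions, not taken from the slides), projecting a point onto the intersection of m random hyperplanes \{x : a_i^\top x = c_i\}.

```python
import numpy as np

# Dykstra's algorithm (classical baseline in the comparison above).  Illustration only:
# project p onto the intersection of m hyperplanes C_i = {x : a_i^T x = c_i}.
rng = np.random.default_rng(3)
m, n = 100, 120
A = rng.standard_normal((m, n))
x_feas = rng.standard_normal(n)
c = A @ x_feas                      # guarantees the intersection is nonempty
p = rng.standard_normal(n)

def proj_hyperplane(x, a, ci):
    return x - (a @ x - ci) / (a @ a) * a

x = p.copy()
q = np.zeros((m, n))                # Dykstra correction terms, one per set
for sweep in range(200):
    for i in range(m):
        y = proj_hyperplane(x + q[i], A[i], c[i])
        q[i] = x + q[i] - y         # update the correction for set i
        x = y

print("max residual |Ax - c|:", np.max(np.abs(A @ x - c)))
```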
Distributed MPC

    minimize \sum_{i=1}^{M} \Big( \sum_{t=1}^{N-1} \big( \|\xi_i(t)\|^2_{Q_i} + \|u_i(t)\|^2_{R_i} \big) + \|\xi_i(N)\|^2_{P_i} \Big)
    subject to \xi_i(0) = \xi_{i0},   i \in \mathbb{N}_{[1,M]}
               \xi_i(t+1) = \sum_{j \in N_i} \big( \Phi_{ij}\xi_j(t) + \Gamma_{ij} u_j(t) \big),   t \in \mathbb{N}_{[0,N-1]},  i \in \mathbb{N}_{[1,M]}
               (\xi_i(t), u_i(t)) \in Y_i,   t \in \mathbb{N}_{[0,N-1]},  i \in \mathbb{N}_{[1,M]}
               \xi_i(N) \in Z_i,   i \in \mathbb{N}_{[1,M]}

- solve an optimal control problem over a network of agents
- local constraint sets Y_i, Z_i are simple; coupling enters through the dynamics
- can handle complicated local and coupled constraints as well

DMPC simulations

M = 100 subsystems, 8 states, 3 inputs, N = 10; 23600 variables, 20600 constraints; results averaged over 20 instances.

[Figure: primal residual versus time for fast AMM and AMNM.]

Communication rounds (in thousands):
    M     FAMM (local)   AMNM (local)   AMNM (global)
    5     29.5           2.1            2.1
    10    47.0           2.2            2.1
    20    65.7           2.4            2.4
    50    104.8          3.3            3.3
    100   139.2          3.9            3.8
    200   159.3          4.4            4.3

Douglas-Rachford Splitting

    minimize F(x) = f(x) + g(x)

- optimality condition: x^\star is optimal if and only if x^\star = prox_{\gamma f}(\tilde{x}), where \tilde{x} solves
      prox_{\gamma g}(2\,prox_{\gamma f}(x) - x) - prox_{\gamma f}(x) = 0
- DRS:
      y^k = prox_{\gamma f}(x^k)
      z^k = prox_{\gamma g}(2y^k - x^k)
      x^{k+1} = x^k + \lambda_k(z^k - y^k)
- \gamma > 0 and \lambda_k \in [0, 2] with \sum_{k \in \mathbb{N}} \lambda_k(2 - \lambda_k) = +\infty
- DRS is a relaxed fixed-point iteration

ADMM

    minimize f(x) + g(z)   subject to Ax + Bz = b,     f, g convex (can be nonsmooth),  A \in \mathbb{R}^{p \times n},  B \in \mathbb{R}^{p \times m}

Alternating Direction Method of Multipliers:
    x^{k+1} = \arg\min_{x} L_\gamma(x, z^k, y^k)
    z^{k+1} = \arg\min_{z} L_\gamma(x^{k+1}, z, y^k)
    y^{k+1} = y^k + \gamma(Ax^{k+1} + Bz^{k+1} - b)

ADMM = DRS applied to the dual (Eckstein & Bertsekas, 1992)

Douglas-Rachford Envelope

Assume f convex and C^2 with \|\nabla f(x) - \nabla f(y)\| \le L_f\|x - y\| for all x, y \in \mathbb{R}^n; then the Moreau envelope f^\gamma is C^2.
- optimality condition:
      prox_{\gamma f}(x) - prox_{\gamma g}(2\,prox_{\gamma f}(x) - x) = 0
- use \nabla h^\gamma(x) = \gamma^{-1}(x - prox_{\gamma h}(x)):
      \nabla f^\gamma(x) + \nabla g^\gamma(x - 2\gamma\nabla f^\gamma(x)) = 0
- multiply by (I - 2\gamma\nabla^2 f^\gamma(x)), \gamma \in (0, 1/L_f), and "integrate": the Douglas-Rachford Envelope (DRE)
      F^{DR}_\gamma(x) = f^\gamma(x) - \gamma\|\nabla f^\gamma(x)\|^2 + g^\gamma(x - 2\gamma\nabla f^\gamma(x))

Properties of the DRE
- for \gamma \in (0, 1/L_f), minimizing the DRE = minimizing the nonsmooth F:
      \inf F(x) = \inf F^{DR}_\gamma(x),     \arg\min F(x) = prox_{\gamma f}(\arg\min F^{DR}_\gamma(x))
- f \in C^2(\mathbb{R}^n)  =>  the DRE is C^1 on \mathbb{R}^n
- f quadratic  =>  the DRE is convex with (1/\gamma)-Lipschitz gradient

Connection between the DRE and the FBE
- partial linearization of F around x:
      \ell_F(z; x) = f(x) + \langle \nabla f(x), z - x \rangle + g(z)
- the FBE is
      F^{FB}_\gamma(x) = \min_{z \in \mathbb{R}^n} \{ \ell_F(z; x) + \tfrac{1}{2\gamma}\|z - x\|^2 \}
- the DRE is [use \nabla f^\gamma(x) = \gamma^{-1}(x - prox_{\gamma f}(x)) = \nabla f(prox_{\gamma f}(x))]
      F^{DR}_\gamma(x) = \min_{z \in \mathbb{R}^n} \{ \ell_F(z; prox_{\gamma f}(x)) + \tfrac{1}{2\gamma}\|z - prox_{\gamma f}(x)\|^2 \}
- the DRE equals the FBE evaluated at prox_{\gamma f}(x):
      F^{DR}_\gamma(x) = F^{FB}_\gamma(prox_{\gamma f}(x))

Connection between DRS and FBS
- FBS (with relaxation):
      x^{k+1} = x^k + \lambda_k\big( prox_{\gamma g}(x^k - \gamma\nabla f(x^k)) - x^k \big)
- DRS:
      y^k = prox_{\gamma f}(x^k)
      x^{k+1} = x^k + \lambda_k\big( prox_{\gamma g}(2y^k - x^k) - y^k \big)
- if f \in C^1(\mathbb{R}^n), then y^k = prox_{\gamma f}(x^k) = x^k - \gamma\nabla f(prox_{\gamma f}(x^k)), so DRS becomes
      y^k = prox_{\gamma f}(x^k)
      x^{k+1} = x^k + \lambda_k\big( prox_{\gamma g}(y^k - \gamma\nabla f(y^k)) - y^k \big)
- a DRS iteration = an FBS iteration at the "shifted" point y^k = prox_{\gamma f}(x^k)

DRS as a variable metric method

DRS:
    x^{k+1} = x^k + \lambda_k\big( prox_{\gamma g}(2\,prox_{\gamma f}(x^k) - x^k) - prox_{\gamma f}(x^k) \big)
Gradient of the DRE:
    \nabla F^{DR}_\gamma(x) = \gamma^{-1}\big( I - 2\gamma\nabla^2 f^\gamma(x) \big)\big( prox_{\gamma f}(x) - prox_{\gamma g}(2\,prox_{\gamma f}(x) - x) \big)
DRS is a variable metric gradient method applied to the DRE:
    x^{k+1} = x^k - \lambda_k D_k \nabla F^{DR}_\gamma(x^k),     D_k = \gamma\big( I - 2\gamma\nabla^2 f^\gamma(x^k) \big)^{-1}
- the relaxation parameter \lambda_k of DRS acts as the stepsize of the gradient method
- backtracking can be used to select \lambda_k

DRS: complexity estimates
- assume f quadratic  =>  the DRE is convex for \gamma \in (0, 1/L_f)
- DRS is a preconditioned gradient method under the change of variables x = Sw, S = D^{1/2}
- convergence rate of DRS with \lambda_k = \lambda = (1 - \gamma L_f)/(1 + \gamma L_f):
      F(z^{k+1}) - F_\star \le \frac{\|x^0 - \tilde{x}\|^2}{2\gamma\lambda k}
- optimal prox-parameter for the DRE:
      \gamma_\star = \frac{\sqrt{2} - 1}{L_f}
- linear convergence rate if F is strongly convex

Accelerated DRS
- the DRE is convex with (1/\gamma)-Lipschitz gradient for f quadratic and \gamma \in (0, 1/L_f)
- Nesterov's fast gradient method applied to the (preconditioned) DRE: with x^0 = x^{-1} \in \mathbb{R}^n,
      u^k = x^k + \beta_k(x^k - x^{k-1})
      y^k = prox_{\gamma f}(u^k)
      z^k = prox_{\gamma g}(2y^k - u^k)
      x^{k+1} = u^k + \lambda(z^k - y^k)
  with \lambda = (1 - \gamma L_f)/(1 + \gamma L_f); one can choose \beta_k = \frac{k-1}{k+2}
- convergence rate is O(1/k^2):
      F(z^k) - F_\star \le \frac{4\|x^0 - \tilde{x}\|^2}{\gamma\lambda(k+2)^2}
- linear convergence if f or g is strongly convex
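A minimal sketch of the accelerated DRS iteration above (my own illustration, not code from the paper) for the sparse least-squares problem of the next slide: f(x) = \tfrac{1}{2}\|Ax - b\|^2 is quadratic (so the DRE is convex) and g = \lambda\|\cdot\|_1; prox_{\gamma f} is a prefactored linear solve and prox_{\gamma g} is soft-thresholding. The problem data and iteration count are assumptions.

```python
import numpy as np

# Accelerated DRS (ADRS) sketch for:  minimize 0.5*||Ax - b||^2 + lam*||x||_1  (f quadratic).
rng = np.random.default_rng(4)
m, n, lam = 80, 200, 1.0
A, b = rng.standard_normal((m, n)), rng.standard_normal(m)
Lf = np.linalg.norm(A.T @ A, 2)
gamma = (np.sqrt(2) - 1) / Lf                     # gamma_star from the complexity slide
relax = (1 - gamma * Lf) / (1 + gamma * Lf)       # relaxation lambda from the slides

# prox_{gamma f}(u) = argmin_v 0.5*||Av - b||^2 + ||v - u||^2/(2*gamma)  -> linear solve, prefactored once
M_inv = np.linalg.inv(np.eye(n) + gamma * (A.T @ A))
prox_f = lambda u: M_inv @ (u + gamma * (A.T @ b))
prox_g = lambda u: np.sign(u) * np.maximum(np.abs(u) - gamma * lam, 0.0)   # soft-thresholding
F = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.abs(x).sum()

x_prev = x = np.zeros(n)
for k in range(500):
    beta = (k - 1) / (k + 2)                      # momentum coefficient beta_k
    u = x + beta * (x - x_prev)
    y = prox_f(u)
    z = prox_g(2 * y - u)
    x_prev, x = x, u + relax * (z - y)

print("F(z) after 500 ADRS iterations:", F(z))
```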
Sparse least-squares

    minimize \tfrac{1}{2}\|Ax - b\|^2 + \lambda\|x\|_1

[Figure: relative suboptimality versus iterations for DRS (left) and fast DRS (right), for several prox-parameters \gamma including \gamma_\star.]

Take-home message I
- proximal minimization = gradient method on the Moreau envelope (70's)
- FBS = (variable metric) gradient method on the FBE (this talk)
- DRS = (variable metric) gradient method on the DRE (this talk)

Take-home message II
- ALM = proximal minimization for the dual (Rockafellar, 1973); the Moreau envelope of the dual is
      \min_{x \in \mathbb{R}^n} L_\gamma(x, y)
- AMM = FBS for the dual; the FBE of the dual is
      L_\gamma(x(y), z(y), y),   where   x(y) = \arg\min_{x} L_0(x, z, y),   z(y) = \arg\min_{z} L_\gamma(x(y), z, y)

To conclude
- interpretation of operator splitting algorithms as gradient methods
- splitting envelopes can lead to new, exciting algorithms
- examples: FBN, AMNM, and ADRS
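To back up take-home message II numerically, the sketch below (my own check, assuming a small equality-constrained QP as test problem) verifies that one ALM update y + \gamma(Ax^+ - b), with x^+ minimizing the augmented Lagrangian, coincides with the proximal-point update on the dual, computed directly from the closed-form dual of the QP.

```python
import numpy as np

# Check "ALM = proximal minimization for the dual" on a small QP (assumed test problem):
#   minimize 0.5*x'Qx + q'x  subject to  Ax = b.
# Negative dual: -d(y) = 0.5*(q + A'y)' Q^{-1} (q + A'y) + b'y  (a convex quadratic, so its prox is a linear solve).
rng = np.random.default_rng(5)
n, p, gamma = 8, 3, 0.7
L = rng.standard_normal((n, n))
Q = L @ L.T + n * np.eye(n)                 # positive definite
q, A, b = rng.standard_normal(n), rng.standard_normal((p, n)), rng.standard_normal(p)
y = rng.standard_normal(p)

# (1) ALM update: x+ = argmin_x L_gamma(x, y),  y+ = y + gamma*(A x+ - b)
x_plus = np.linalg.solve(Q + gamma * A.T @ A, -q - A.T @ y + gamma * A.T @ b)
y_alm = y + gamma * (A @ x_plus - b)

# (2) Proximal-point update on the dual: y+ = prox_{gamma*(-d)}(y),
#     i.e. solve (A Q^{-1} A' + I/gamma) u = y/gamma - A Q^{-1} q - b.
Qinv = np.linalg.inv(Q)
M = A @ Qinv @ A.T
y_prox = np.linalg.solve(M + np.eye(p) / gamma, y / gamma - A @ Qinv @ q - b)

print(np.allclose(y_alm, y_prox))           # expect True
```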