
Splitting Envelopes
Accelerated Second-order Proximal Methods
Panos Patrinos
(joint work with Lorenzo Stella, Alberto Bemporad)
September 8, 2014
Outline
X forward-backward envelope (FBE)
X forward-backward Newton method (FBN)
X dual FBE and Augmented Lagrangian
X alternating minimization Newton method (AMNM)
X Douglas-Rachford envelope (DRE)
X accelerated Douglas-Rachford splitting (ADRS)
based on
1. P. Patrinos and A. Bemporad. Proximal Newton methods for convex composite optimization. In Proc. 52nd IEEE Conference on Decision and Control (CDC), pages 2358–2363, Florence, Italy, 2013.
2. P. Patrinos, L. Stella, and A. Bemporad. Forward-backward truncated Newton methods for convex composite optimization. arXiv:1402.6655, 2014 (submitted).
3. P. Patrinos, L. Stella, and A. Bemporad. Douglas-Rachford splitting: complexity estimates and accelerated variants. In Proc. 53rd IEEE Conference on Decision and Control (CDC), Los Angeles, CA, arXiv:1407.6723, 2014.
4. L. Stella, P. Patrinos, and A. Bemporad. Alternating minimization Newton method for separable convex optimization, 2014 (submitted).
fixed-point implementation for MPC
X A. Guiggiani, P. Patrinos, and A. Bemporad. Fixed-point implementation of a proximal Newton method for embedded model predictive control. In Proc. 19th IFAC World Congress, Cape Town, South Africa, 2014.
Convex composite optimization
minimize F(x) = f(x) + g(x)
X f : IR^n → IR convex, twice continuously differentiable with
‖∇f(x) − ∇f(y)‖ ≤ L_f ‖x − y‖   for all x, y ∈ IR^n
X g : IR^n → IR convex, nonsmooth with inexpensive proximal mapping
prox_{γg}(x) = arg min_{z ∈ IR^n} { g(z) + (1/(2γ))‖z − x‖² }
X many problem classes: QPs, cone programs, sparse least-squares, rank minimization, total variation minimization, ...
X applications: control, system identification, signal processing, image analysis, machine learning, ...
Proximal mappings
prox_{γg}(x) = arg min_{z ∈ IR^n} { g(z) + (1/(2γ))‖z − x‖² },   γ > 0
X resolvent of the maximal monotone operator ∂g
prox_{γg}(x) = (I + γ∂g)^{−1}(x)
X single-valued and (firmly) nonexpansive
X explicitly computable for many functions (see Parikh, Boyd ’14, Combettes, Pesquet ’10)
X reduces to projection when g is the indicator of a convex set C
prox_{γδ_C}(x) = Π_C(x)
X z = prox_{γg}(x) is an implicit subgradient step (0 ∈ ∂g(z) + γ^{−1}(z − x)):
z = x − γv,   v ∈ ∂g(z)
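
As a quick illustration (my own, not part of the slides): NumPy versions of two prox operators that appear throughout the talk, soft-thresholding for g = λ‖·‖₁ and projection onto a box; the function names prox_l1 and prox_box are hypothetical.

import numpy as np

def prox_l1(x, gamma, lam=1.0):
    # prox of g = lam*||.||_1 with parameter gamma: componentwise soft-thresholding
    return np.sign(x) * np.maximum(np.abs(x) - gamma * lam, 0.0)

def prox_box(x, lower, upper):
    # prox of the indicator delta_C of the box C = [lower, upper]: Euclidean projection
    return np.clip(x, lower, upper)

x = np.array([-2.0, 0.3, 1.5])
print(prox_l1(x, gamma=0.5))      # [-1.5  0.   1. ]
print(prox_box(x, -1.0, 1.0))     # [-1.   0.3  1. ]
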
Proximal Minimization Algorithm
minimize g(x),   g : IR^n → IR closed proper convex
given x^0 ∈ IR^n, repeat
x^{k+1} = prox_{γg}(x^k),   γ > 0
X fixed point iteration for optimality conditions
0 ∈ ∂g(x^⋆) ⇐⇒ x^⋆ ∈ (I + γ∂g)(x^⋆) ⇐⇒ x^⋆ = prox_{γg}(x^⋆)
X special case of proximal point algorithm (Martinet ’70, Rockafellar ’76)
X converges under very general conditions
X mostly conceptual algorithm
Moreau envelope
Moreau envelope of a closed proper convex g : IR^n → IR
g^γ(x) = inf_{z ∈ IR^n} { g(z) + (1/(2γ))‖z − x‖² },   γ > 0
X g^γ is real-valued, convex, differentiable with (1/γ)-Lipschitz gradient
∇g^γ(x) = (1/γ)(x − prox_{γg}(x))
X minimizing the nonsmooth g is equivalent to minimizing the smooth g^γ
X proximal minimization algorithm = gradient method for g^γ
x^{k+1} = x^k − γ∇g^γ(x^k)
X can use any method of unconstrained smooth minimization for g^γ
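
A small numerical sketch of my own (g = ‖·‖₁, helper names hypothetical) checking the two facts above: the gradient formula ∇g^γ(x) = (1/γ)(x − prox_{γg}(x)), and that one proximal-minimization step equals one gradient step on g^γ.

import numpy as np

def prox_l1(x, gamma):
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def moreau_l1(x, gamma):
    # g^gamma(x) = min_z ||z||_1 + ||z - x||^2/(2*gamma), attained at z = prox_{gamma g}(x)
    z = prox_l1(x, gamma)
    return np.sum(np.abs(z)) + np.sum((z - x) ** 2) / (2 * gamma)

def grad_moreau_l1(x, gamma):
    # gradient formula from the slide: (1/gamma)*(x - prox_{gamma g}(x))
    return (x - prox_l1(x, gamma)) / gamma

gamma = 0.5
x = np.array([2.0, -0.2, 1.0])
print(moreau_l1(x, gamma))                       # smooth envelope value at x
print(prox_l1(x, gamma))                         # proximal minimization step
print(x - gamma * grad_moreau_l1(x, gamma))      # gradient step on g^gamma: same point
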
Forward-Backward Splitting (FBS)
minimize F(x) = f(x) + g(x)
X optimality condition: x^⋆ ∈ IR^n is optimal if and only if
x^⋆ = prox_{γg}(x^⋆ − γ∇f(x^⋆)),   γ > 0
X forward-backward splitting (aka proximal gradient)
x^{k+1} = prox_{γg}(x^k − γ∇f(x^k)),   γ ∈ (0, 2/L_f)
X FBS is a fixed-point iteration
X g = 0: gradient method; g = δ_C: gradient projection; f = 0: proximal minimization
X accelerated versions (Nesterov)
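
For concreteness, a minimal proximal-gradient sketch (mine) for the lasso instance used later in the talk, f(x) = ½‖Ax − b‖², g = λ‖·‖₁; data and function names are illustrative.

import numpy as np

def fbs_lasso(A, b, lam, iters=500):
    # forward-backward splitting: x+ = prox_{gamma g}(x - gamma*grad f(x))
    Lf = np.linalg.norm(A, 2) ** 2                 # Lipschitz constant of grad f, f = 0.5*||Ax-b||^2
    gamma = 1.0 / Lf                               # any gamma in (0, 2/Lf)
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        y = x - gamma * (A.T @ (A @ x - b))        # forward (gradient) step
        x = np.sign(y) * np.maximum(np.abs(y) - gamma * lam, 0.0)   # backward (prox) step
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
x_true = np.zeros(100); x_true[:5] = 1.0
b = A @ x_true
print(np.round(fbs_lasso(A, b, lam=0.1)[:8], 3))   # first entries; close to x_true in this noiseless instance
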
Forward-Backward Envelope
x − prox_{γg}(x − γ∇f(x)) = 0
X use prox_{γg}(y) = y − γ∇g^γ(y) for y = x − γ∇f(x):
γ∇f(x) + γ∇g^γ(x − γ∇f(x)) = 0
X multiply with γ^{−1}(I − γ∇²f(x)) (positive definite for γ ∈ (0, 1/L_f))
X the left-hand side becomes the gradient of the Forward-Backward Envelope (FBE)
F^FB_γ(x) = f(x) − (γ/2)‖∇f(x)‖₂² + g^γ(x − γ∇f(x))
X alternative expression for the FBE (linearize f around x):
F^FB_γ(x) = inf_{z ∈ IR^n} { f(x) + ⟨∇f(x), z − x⟩ + g(z) + (1/(2γ))‖z − x‖² }
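
A sketch of my own evaluating the first FBE formula for the lasso setup above; it also checks the elementary bound F^FB_γ(x) ≤ F(x), which follows from the alternative expression by taking z = x.

import numpy as np

def fbe_lasso(x, A, b, lam, gamma):
    # F_gamma^FB(x) = f(x) - (gamma/2)*||grad f(x)||^2 + g^gamma(x - gamma*grad f(x))
    r = A @ x - b
    grad = A.T @ r                                               # grad f(x)
    y = x - gamma * grad                                         # forward step
    z = np.sign(y) * np.maximum(np.abs(y) - gamma * lam, 0.0)    # prox_{gamma g}(y)
    g_env = lam * np.sum(np.abs(z)) + np.sum((z - y) ** 2) / (2 * gamma)
    return 0.5 * r @ r - 0.5 * gamma * grad @ grad + g_env

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 80)); b = rng.standard_normal(30); x = rng.standard_normal(80)
lam, gamma = 0.1, 0.9 / np.linalg.norm(A, 2) ** 2
F = 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))
print(fbe_lasso(x, A, b, lam, gamma) <= F)    # True: take z = x in the infimum above
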
Properties of FBE
X stationary points of F^FB_γ = minimizers of F
X reformulates the original nonsmooth problem into a smooth one:
minimize_{x ∈ IR^n} F^FB_γ(x)   is equivalent to   minimize_{x ∈ IR^n} F(x)
X F^FB_γ is real-valued, continuously differentiable
∇F^FB_γ(x) = γ^{−1}(I − γ∇²f(x))(x − prox_{γg}(x − γ∇f(x)))
X FBS is a variable metric gradient method for the FBE
x^{k+1} = x^k − γ D_k^{−1} ∇F^FB_γ(x^k),   D_k = I − γ∇²f(x^k)
Forward-Backward Newton Method (FBN)
Input: x^0 ∈ IR^n, γ ∈ (0, 1/L_f), σ ∈ (0, 1/2)
for k = 0, 1, 2, . . . do
Newton direction
Choose H^k ∈ ∂̂²F^FB_γ(x^k). Compute d^k by solving (approximately)
H^k d = −∇F^FB_γ(x^k)
Line search
Compute stepsize τ_k by backtracking until
F^FB_γ(x^k + τ_k d^k) ≤ F^FB_γ(x^k) + στ_k ⟨∇F^FB_γ(x^k), d^k⟩
Update: x^{k+1} = x^k + τ_k d^k
end
Linear Newton approximation
FBE is C¹ but not C²
H d = −∇F^FB_γ(x)
where
∇F^FB_γ(x) = γ^{−1}(I − γ∇²f(x))(x − prox_{γg}(x − γ∇f(x)))
and ∂̂²F_γ(x) is an approximate generalized Hessian
H = γ^{−1}(I − γ∇²f(x))(I − P(I − γ∇²f(x))) ∈ ∂̂²F_γ(x)
where P ∈ ∂_C(prox_{γg})(x − γ∇f(x))   (Clarke’s generalized Jacobian)
X preserves all favorable properties of the Hessian for C² functions
X “Gauss-Newton” generalized Hessian: 3rd-order terms are omitted
Generalized Jacobians of proximal mappings
X ∂_C prox_{γg}(x) is the following set of matrices (Clarke, 1983):
conv { limits of (ordinary) Jacobians along every sequence converging to x and consisting of points where prox_{γg} is differentiable }
I prox_{γg}(x) simple to compute =⇒ P ∈ ∂_C(prox_{γg})(x) for free
I g (block) separable =⇒ P ∈ ∂_C(prox_{γg})(x) (block) diagonal
example: ℓ1 norm, g(x) = ‖x‖₁   (more examples in Patrinos, Stella, Bemporad (2014))
X prox_{γg}(x)_i = x_i + γ if x_i ≤ −γ,   0 if −γ ≤ x_i ≤ γ,   x_i − γ if x_i ≥ γ
X P ∈ ∂_C(prox_{γg})(x) are diagonal matrices with
P_ii = 1 if |x_i| > γ,   P_ii ∈ [0, 1] if |x_i| = γ,   P_ii = 0 if |x_i| < γ.
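
Combining the FBN scheme of the previous slides with the ℓ1 Jacobian above, here is a compact dense-matrix sketch of my own (not the authors' implementation, no guarantees claimed); the extra forward-backward safeguard step mimics the FBN II variant mentioned on the next slide.

import numpy as np

def soft(y, t):
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

def fbn_lasso(A, b, lam, iters=30, sigma=1e-4):
    # FBN sketch for minimize 0.5*||Ax - b||^2 + lam*||x||_1 (dense, small problems only)
    Q = A.T @ A                                    # Hessian of f
    gamma = 0.95 / np.linalg.norm(Q, 2)            # gamma in (0, 1/Lf)
    n = A.shape[1]
    I = np.eye(n)
    Atb = A.T @ b

    def fbe_and_grad(x):
        grad_f = Q @ x - Atb
        y = x - gamma * grad_f
        z = soft(y, gamma * lam)                   # prox_{gamma g} of the forward step
        fbe = (0.5 * np.sum((A @ x - b) ** 2) - 0.5 * gamma * grad_f @ grad_f
               + lam * np.sum(np.abs(z)) + np.sum((z - y) ** 2) / (2 * gamma))
        grad_fbe = (I - gamma * Q) @ (x - z) / gamma
        P = np.diag((np.abs(y) > gamma * lam).astype(float))     # Clarke Jacobian of the prox at y
        H = (I - gamma * Q) @ (I - P @ (I - gamma * Q)) / gamma  # approximate generalized Hessian
        return fbe, grad_fbe, H

    x = np.zeros(n)
    for _ in range(iters):
        fbe, g, H = fbe_and_grad(x)
        d = np.linalg.solve(H + 1e-8 * I, -g)      # (regularized) Newton direction
        gd, tau = g @ d, 1.0
        if gd < 0:                                 # backtracking line search on the FBE
            while tau > 1e-12 and fbe_and_grad(x + tau * d)[0] > fbe + sigma * tau * gd:
                tau *= 0.5
            if tau > 1e-12:
                x = x + tau * d
        # forward-backward safeguard step, in the spirit of FBN II on the next slide
        x = soft(x - gamma * (Q @ x - Atb), gamma * lam)
    return x

# usage (illustrative): x_hat = fbn_lasso(A, b, lam=0.1) for a dense matrix A
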
Convergence of FBN
X every limit point of {x^k} belongs to arg min_{x ∈ IR^n} F(x)
X all H ∈ ∂̂²F_γ(x^⋆) nonsingular =⇒ Q-quadratic asymptotic rate
extension: FBN II
X apply a FB step after each Newton step
X same asymptotic rate + global complexity estimates
I non-strongly convex f: sublinear rate for F(x^k) − F(x^⋆)
I strongly convex f: linear rate for F(x^k) − F(x^⋆) and ‖x^k − x^⋆‖²
FBN–CG
large problems: run conjugate gradient (CG) on the regularized Newton system until the residual satisfies
‖(H^k + δ_k I)d^k + ∇F^FB_γ(x^k)‖ ≤ η_k ‖∇F^FB_γ(x^k)‖
with η_k = O(‖∇F^FB_γ(x^k)‖), δ_k = O(‖∇F^FB_γ(x^k)‖)
properties
X no need to form ∇²f(x) and H^k – only matrix-vector products
X same convergence properties
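
A matrix-free sketch (mine) of the direction computation above for the lasso FBE: the regularized generalized Hessian is applied only through matrix-vector products and handed to SciPy's CG via a LinearOperator; names and the truncation rule (a fixed maxiter instead of the η_k test) are illustrative.

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def fbn_cg_direction(A, b, lam, x, gamma, delta):
    # inexact Newton direction for the lasso FBE, using only matvecs with Q = A^T A
    grad_f = A.T @ (A @ x - b)
    y = x - gamma * grad_f
    z = np.sign(y) * np.maximum(np.abs(y) - gamma * lam, 0.0)    # prox_{gamma g}(y)
    grad_fbe = ((x - z) - gamma * (A.T @ (A @ (x - z)))) / gamma
    p = (np.abs(y) > gamma * lam).astype(float)                  # diagonal of P

    def hess_matvec(v):
        # (H + delta*I) v with H = gamma^{-1} (I - gamma*Q)(I - P(I - gamma*Q))
        w = v - gamma * (A.T @ (A @ v))          # (I - gamma*Q) v
        u = v - p * w                            # (I - P(I - gamma*Q)) v
        return (u - gamma * (A.T @ (A @ u))) / gamma + delta * v

    Hop = LinearOperator((x.size, x.size), matvec=hess_matvec)
    d, _ = cg(Hop, -grad_fbe, maxiter=50)        # truncated CG in place of the eta_k test
    return d
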
Box-constrained convex programs
minimize f(x)
subject to ℓ ≤ x ≤ u
Newton direction solves
minimize (1/2)⟨d, ∇²f(x^k)d⟩ + ⟨∇f(x^k), d⟩
subject to d_i = ℓ_i − x^k_i, i ∈ β1,   d_i = u_i − x^k_i, i ∈ β2
where
β1 = {i | x^k_i − γ∇_i f(x^k) ≤ ℓ_i}   (estimate of x^⋆_i = ℓ_i)
β2 = {i | x^k_i − γ∇_i f(x^k) ≥ u_i}   (estimate of x^⋆_i = u_i)
Newton system becomes
Q_δδ d_δ = −(∇_δ f(x^k) + Q_δβ d_β),   where Q = ∇²f(x^k), β = β1 ∪ β2, δ = [n] \ β
Example
minimize (1/2)⟨x, Qx⟩ + ⟨q, x⟩ subject to ℓ ≤ x ≤ u,   n = 1000

[plots: F(x^ν) − F_⋆ vs. time [sec] for FBN (PNM), FBN II (PGNM II), and accelerated FBS (FGM/AFBS); cond(Q) = 10^8: GUROBI 4.87 sec, CPLEX 3.73 sec; cond(Q) = 10^4: GUROBI 5.96 sec, CPLEX 4.83 sec]

FBN: much less sensitive to bad conditioning
Sparse least-squares
minimize (1/2)‖Ax − b‖₂² + λ‖x‖₁
X Newton system becomes
d_β = −x_β
A_{·δ}^T A_{·δ} d_δ = −[A_{·δ}^T (A_{·δ} x_δ − b) + λ sign(x_δ − γ∇_δ f(x))]
X δ is an estimate of the set of nonzero components of x^⋆
δ = {i | |x_i − γ∇_i f(x)| > λγ}
X close to the solution, δ is small
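
A short sketch of my own of the index estimate δ and the reduced Newton system above; it assumes A_{·δ}ᵀA_{·δ} is nonsingular, which is reasonable when δ is small near the solution.

import numpy as np

def lasso_newton_direction(A, b, x, lam, gamma):
    # assumes A[:, delta]^T A[:, delta] is nonsingular (delta is small near the solution)
    grad = A.T @ (A @ x - b)
    delta = np.abs(x - gamma * grad) > lam * gamma      # estimate of the support of x*
    beta = ~delta
    d = np.empty_like(x)
    d[beta] = -x[beta]                                  # push the (estimated) zero entries to zero
    Ad = A[:, delta]
    rhs = -(Ad.T @ (Ad @ x[delta] - b) + lam * np.sign((x - gamma * grad)[delta]))
    d[delta] = np.linalg.solve(Ad.T @ Ad, rhs)          # reduced Newton system on delta
    return d
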
FBN Methods are robust
X 548 datasets taken from
http://wwwopt.mathematik.tu-darmstadt.de/spear/
X compared against AFBS, YALL1 (ADMM-based), SpaRSA (Barzilai-Borwein), l1-ls (interior point)
X performance plot: a point (x, y) means the algorithm is at most x times slower than the best solver on a fraction y of the problems
[performance plot: frequency vs. performance ratio for FBN-CG I, FBN-CG II, accelerated FBS, YALL1, SpaRSA, l1-ls]
Sparse Logistic Regression
minimize_{x,y} Σ_{i=1}^m log(1 + exp(−b_i(⟨a_i, x⟩ + y))) + λ‖x‖₁,   λ > 0

[plot: F^k − F_⋆ vs. time (s) for AFBS and FBN]
Augmented Lagrangians and Moreau envelopes
minimize f(x) subject to Ax = b,   f : IR^n → IR convex (can be nonsmooth), A ∈ IR^{p×n}
Augmented Lagrangian
L_γ(x, y) = f(x) + ⟨y, Ax − b⟩ + (γ/2)‖Ax − b‖²
Augmented Lagrangian method (ALM or Method of Multipliers)   Hestenes (1969), Powell (1969)
x^k = arg min_{x ∈ IR^n} L_γ(x, y^k)
y^{k+1} = y^k + γ(Ax^k − b)
ALM = proximal minimization for the dual   Rockafellar (1973, 1976)
    = gradient method for the Moreau envelope of the dual
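
An illustrative ALM loop of my own for a toy instance, f(x) = ½‖x − p‖² with Ax = b (projection onto an affine set), where the inner minimization has a closed form; all names are hypothetical.

import numpy as np

def alm_affine_projection(A, b, p, gamma=1.0, iters=50):
    # minimize 0.5*||x - p||^2  subject to  Ax = b   (projection of p onto {x : Ax = b})
    m, n = A.shape
    y = np.zeros(m)
    M = np.eye(n) + gamma * A.T @ A
    for _ in range(iters):
        # x-update: argmin_x L_gamma(x, y), available in closed form here
        x = np.linalg.solve(M, p - A.T @ y + gamma * A.T @ b)
        # multiplier update = gradient step on the Moreau envelope of the dual
        y = y + gamma * (A @ x - b)
    return x, y

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 20)); b = rng.standard_normal(5); p = rng.standard_normal(20)
x, _ = alm_affine_projection(A, b, p)
print(np.linalg.norm(A @ x - b))    # feasibility residual, shrinking with the iterations
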
Separable convex problems
minimize f(x) + g(z) subject to Ax + Bz = b,   f, g convex (can be nonsmooth), A ∈ IR^{p×n}, B ∈ IR^{p×m}
X f is strongly convex with convexity parameter µ_f
X f, g are nice, e.g. separable; coupling is introduced only through the constraints
X ALM is not amenable to decomposition: the x and z updates are coupled
Alternating Minimization Method (AMM): FBS applied to the dual
x^{k+1} = arg min_{x ∈ IR^n} L_0(x, z^k, y^k)
z^{k+1} = arg min_{z ∈ IR^m} L_γ(x^{k+1}, z, y^k)
y^{k+1} = y^k + γ(Ax^{k+1} + Bz^{k+1} − b),   γ ∈ (0, 2µ_f/‖A‖²)
Augmented Lagrangian
L_γ(x, z, y) = f(x) + g(z) + ⟨y, Ax + Bz − b⟩ + (γ/2)‖Ax + Bz − b‖²
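
A toy AMM sketch of my own with f(x) = ½‖x − p‖² (strongly convex, µ_f = 1), g = ‖·‖₁, and the coupling x − z = 0, so every subproblem is in closed form; here γ = 1 ∈ (0, 2µ_f/‖A‖²).

import numpy as np

def soft(y, t):
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

def amm_toy(p, gamma=1.0, iters=200):
    # minimize 0.5*||x - p||^2 + ||z||_1  subject to  x - z = 0   (A = I, B = -I, b = 0)
    y = np.zeros_like(p)
    for _ in range(iters):
        x = p - y                                # argmin_x L_0(x, z, y), closed form
        z = soft(x + y / gamma, 1.0 / gamma)     # argmin_z L_gamma(x, z, y)
        y = y + gamma * (x - z)                  # multiplier update (FBS step on the dual)
    return x, z

p = np.array([2.0, -0.4, 0.7, -3.0])
x, z = amm_toy(p)
print(z)     # converges to soft(p, 1) = [ 1. -0.  0. -2.]
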
Dual FBE
the FBE of the dual problem is the augmented Lagrangian!
F^FB_γ(y) = L_γ(x(y), z(y), y)
where x(y), z(y) are the AMM updates
x(y) = arg min_{x ∈ IR^n} f(x) + ⟨y, Ax⟩
z(y) = arg min_{z ∈ IR^m} g(z) + ⟨y, Bz⟩ + (γ/2)‖Ax(y) + Bz − b‖²
Connection between AMM and FBE
the dual problem is equivalent to
maximize_{y ∈ IR^p} F^FB_γ(y) = L_γ(x(y), z(y), y),   γ ∈ (0, µ_f/‖A‖²)
X f ∈ C²(IR^n) =⇒ F^FB_γ ∈ C¹(IR^p)
∇F^FB_γ(y) = D(y)(Ax(y) + Bz(y) − b),   where D(y) = I − γA(∇²f(x(y)))^{−1}A^T
X AMM is a variable metric gradient method on the dual FBE
y^{k+1} = y^k + γ D(y^k)^{−1} ∇F^FB_γ(y^k)
X AMNM: FBN applied to the dual
Strictly convex QPs
minimize (1/2)⟨x, Qx⟩ + ⟨q, x⟩ subject to ℓ ≤ Ax ≤ u,   cond(Q) = 10^4, A ∈ IR^{2000×1000}

[plots: primal infeasibility and duality gap vs. time [sec] for AMNM (PNM), AMNM II (PGNM II), and fast AMM (FGM/FAMM)]

GUROBI: 6.7 s, CPLEX: 60.8 s, Fast AMM: 41 s, AMNM: 8.1 s, AMNM II: 11.3 s
Projection onto Convex Sets
minimize_{x,z} (1/2)‖x − p‖² + Σ_{i=1}^m δ_{C_i}(z_i)
subject to x = z_i,   i = 1, . . . , m
X δ_C is the indicator of C
X x^⋆ is the projection of p onto C_1 ∩ . . . ∩ C_m
Random problem with m = 100 random hyperplanes in IR^120

[plot: ‖x^k − x^⋆‖ vs. time (s) for AMM, Fast AMM, Dykstra, ADMM, and AMNM]
Distributed MPC
minimize Σ_{i=1}^{M} [ Σ_{t=1}^{N−1} ( ‖ξ_i(t)‖²_{Q_i} + ‖u_i(t)‖²_{R_i} ) + ‖ξ_i(N)‖²_{P_i} ]
subject to ξ_i(0) = ξ_{i0},   i ∈ N_[1,M]
ξ_i(t+1) = Σ_{j ∈ N_i} ( Φ_ij ξ_j(t) + Γ_ij u_j(t) ),   t ∈ N_[0,N−1], i ∈ N_[1,M]
(ξ_i(t), u_i(t)) ∈ Y_i,   t ∈ N_[0,N−1], i ∈ N_[1,M]
ξ_i(N) ∈ Z_i,   i ∈ N_[1,M]
X solve an optimal control problem over a network of agents
X local constraint sets Y_i, Z_i are simple; the coupling is in the dynamics
X can handle complicated local and coupled constraints as well
DMPC simulations
M = 100 subsystems
23600 vars, 20600 cons
average over 20 instances
8 states, 3 inputs, N = 10
[plot: primal residual vs. time (s) for Fast AMM and AMNM]

communication rounds (in thousands):
M     FAMM (local)   AMNM (local)   AMNM (global)
5     29.5           2.1            2.1
10    47.0           2.2            2.1
20    65.7           2.4            2.4
50    104.8          3.3            3.3
100   139.2          3.9            3.8
200   159.3          4.4            4.3
Douglas-Rachford Splitting
minimize F(x) = f(x) + g(x)
X optimality condition: x^⋆ is optimal if and only if x^⋆ = prox_{γf}(x̃), where x̃ solves
prox_{γg}(2 prox_{γf}(x) − x) − prox_{γf}(x) = 0
X DRS
y^k = prox_{γf}(x^k)
z^k = prox_{γg}(2y^k − x^k)
x^{k+1} = x^k + λ_k(z^k − y^k)
X γ > 0 and λ_k ∈ [0, 2] with Σ_{k∈N} λ_k(2 − λ_k) = +∞
X DRS is a relaxed fixed-point iteration
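
A direct transcription of the DRS iteration (my own sketch) for the lasso, where prox_{γf} is a linear solve and prox_{γg} is soft-thresholding; λ_k ≡ 1 for simplicity.

import numpy as np

def drs_lasso(A, b, lam, gamma=0.1, iters=300):
    # minimize 0.5*||Ax - b||^2 + lam*||x||_1 via Douglas-Rachford splitting
    n = A.shape[1]
    M = gamma * A.T @ A + np.eye(n)          # prox_{gamma f} is the linear solve below
    Atb = A.T @ b
    x = np.zeros(n)
    for _ in range(iters):
        y = np.linalg.solve(M, gamma * Atb + x)                       # y = prox_{gamma f}(x)
        w = 2 * y - x
        z = np.sign(w) * np.maximum(np.abs(w) - gamma * lam, 0.0)     # z = prox_{gamma g}(w)
        x = x + (z - y)                                               # relaxation lambda_k = 1
    return z

# usage (illustrative): z = drs_lasso(A, b, lam=0.1) for any dense A, b
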
ADMM
minimize f(x) + g(z) subject to Ax + Bz = b,   f, g convex (can be nonsmooth), A ∈ IR^{p×n}, B ∈ IR^{p×m}
Alternating Direction Method of Multipliers
x^{k+1} = arg min_{x ∈ IR^n} L_γ(x, z^k, y^k)
z^{k+1} = arg min_{z ∈ IR^m} L_γ(x^{k+1}, z, y^k)
y^{k+1} = y^k + γ(Ax^{k+1} + Bz^{k+1} − b)
DRS applied to the dual (Eckstein, Bertsekas, 1992)
Douglas-Rachford Envelope
assume f convex and C², with ‖∇f(x) − ∇f(y)‖ ≤ L_f ‖x − y‖ for all x, y ∈ IR^n  =⇒  the Moreau envelope f^γ is C²
X optimality condition
prox_{γf}(x) − prox_{γg}(2 prox_{γf}(x) − x) = 0
X use ∇h^γ(x) = γ^{−1}(x − prox_{γh}(x)):
∇f^γ(x) + ∇g^γ(x − 2γ∇f^γ(x)) = 0
X multiply by (I − 2γ∇²f^γ(x)), γ ∈ (0, 1/L_f), and “integrate”
Douglas-Rachford Envelope (DRE)
F^DR_γ(x) = f^γ(x) − γ‖∇f^γ(x)‖² + g^γ(x − 2γ∇f^γ(x))
Properties of DRE
X γ ∈ (0, 1/L_f): minimizing the DRE = minimizing the nonsmooth F
inf F(x) = inf F^DR_γ(x)
arg min F(x) = prox_{γf}(arg min F^DR_γ(x))
X f ∈ C²(IR^n) =⇒ DRE is C¹ on IR^n
X f quadratic =⇒ DRE is convex with (1/γ)-Lipschitz gradient
Connection between DRE and FBE
X partial linearization of F around x
ℓ_F(z; x) = f(x) + ⟨∇f(x), z − x⟩ + g(z)
X FBE is
F^FB_γ(x) = min_{z ∈ IR^n} { ℓ_F(z; x) + (1/(2γ))‖z − x‖² }
X DRE is [use ∇f^γ(x) = γ^{−1}(x − prox_{γf}(x)) = ∇f(prox_{γf}(x))]
F^DR_γ(x) = min_{z ∈ IR^n} { ℓ_F(z; prox_{γf}(x)) + (1/(2γ))‖z − prox_{γf}(x)‖² }
X DRE is equal to the FBE evaluated at prox_{γf}(x)
F^DR_γ(x) = F^FB_γ(prox_{γf}(x))
Connection between DRS and FBS
X FBS (with relaxation)
x^{k+1} = x^k + λ_k ( prox_{γg}(x^k − γ∇f(x^k)) − x^k )
X DRS
y^k = prox_{γf}(x^k)
x^{k+1} = x^k + λ_k ( prox_{γg}(2y^k − x^k) − y^k )
X f ∈ C¹(IR^n):
y^k = prox_{γf}(x^k) = x^k − γ∇f(prox_{γf}(x^k))
X DRS becomes
y^k = prox_{γf}(x^k)
x^{k+1} = x^k + λ_k ( prox_{γg}(y^k − γ∇f(y^k)) − y^k )
X DRS iteration = FBS iteration at the “shifted” point y^k = prox_{γf}(x^k)
DRS as a variable metric method
DRS
x^{k+1} = x^k + λ_k ( prox_{γg}(2 prox_{γf}(x^k) − x^k) − prox_{γf}(x^k) )
gradient of DRE
∇F^DR_γ(x^k) = (I − 2γ∇²f^γ(x^k)) ( prox_{γf}(x^k) − prox_{γg}(2 prox_{γf}(x^k) − x^k) )
DRS: variable metric method applied to the DRE
x^{k+1} = x^k − λ_k D_k ∇F^DR_γ(x^k),   where D_k = (I − 2γ∇²f^γ(x^k))^{−1}
X relaxation parameter λ_k of DRS =⇒ stepsize of the gradient method
X can use backtracking to select λ_k
DRS – complexity estimates
X assume f quadratic =⇒ DRE is convex for γ ∈ (0, 1/L_f)
X DRS is a preconditioned gradient method under the change of variables x = Sw, S = D^{1/2}
X convergence rate of DRS with λ_k = λ = (1 − γL_f)/(1 + γL_f):
F(z^{k+1}) − F_⋆ ≤ ‖x^0 − x̃‖² / (2γλk)
X optimal prox-parameter γ for the DRE: γ_⋆ = (√2 − 1)/L_f
X linear convergence rate if F is strongly convex
Accelerated DRS
X DRE is convex with (1/γ)-Lipschitz gradient for f quadratic and γ ∈ (0, 1/L_f)
X Nesterov’s FGM applied to the (preconditioned) DRE: x^0 = x^{−1} ∈ IR^n
u^k = x^k + β_k(x^k − x^{k−1})
y^k = prox_{γf}(u^k)
z^k = prox_{γg}(2y^k − u^k)
x^{k+1} = u^k + λ(z^k − y^k)
X λ = (1 − γL_f)/(1 + γL_f); can choose β_k = (k − 1)/(k + 2)
X convergence rate is O(1/k²):
F(z^k) − F_⋆ ≤ 4‖x^0 − x̃‖² / (γλ(k + 2)²)
X linear convergence for f or g strongly convex
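
The accelerated scheme above transcribed into code (my own sketch) for the same lasso setup, with γ = γ_⋆ = (√2 − 1)/L_f, β_k = (k − 1)/(k + 2), and λ = (1 − γL_f)/(1 + γL_f) as on the slides.

import numpy as np

def adrs_lasso(A, b, lam, iters=500):
    # accelerated Douglas-Rachford splitting for 0.5*||Ax - b||^2 + lam*||x||_1 (f quadratic)
    n = A.shape[1]
    Lf = np.linalg.norm(A, 2) ** 2
    gamma = (np.sqrt(2) - 1) / Lf                    # gamma_star from the previous slide
    relax = (1 - gamma * Lf) / (1 + gamma * Lf)      # lambda in the scheme above
    M = gamma * A.T @ A + np.eye(n)
    Atb = A.T @ b
    x_prev = x = np.zeros(n)
    for k in range(iters):
        beta = (k - 1) / (k + 2)                     # momentum coefficient beta_k
        u = x + beta * (x - x_prev)
        y = np.linalg.solve(M, gamma * Atb + u)      # y = prox_{gamma f}(u)
        w = 2 * y - u
        z = np.sign(w) * np.maximum(np.abs(w) - gamma * lam, 0.0)    # prox_{gamma g}(w)
        x_prev, x = x, u + relax * (z - y)
    return z
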
Sparse least-squares
minimize (1/2)‖Ax − b‖₂² + λ‖x‖₁

[plots: relative suboptimality vs. iterations; left: DRS for γ = 0.2/L_f, γ_⋆, 0.6/L_f, 0.8/L_f; right: DRS vs. Fast DRS]
Take home message I
X proximal minimization = gradient method on the Moreau envelope (70’s)
X FBS = (variable metric) gradient method on the FBE (this talk)
X DRS = (variable metric) gradient method on the DRE (this talk)
Take home message II
X ALM = proximal minimization for the dual   (Rockafellar, 1973)
Moreau envelope of the dual = min_{x ∈ IR^n} L_γ(x, y)
X AMM = FBS for the dual
FBE of the dual = L_γ(x(y), z(y), y)
where
x(y) = arg min_{x ∈ IR^n} L_0(x, z, y)
z(y) = arg min_{z ∈ IR^m} L_γ(x(y), z, y)
to conclude
X interpretation of operator splitting algorithms as gradient methods
X splitting envelopes can lead to new exciting algorithms
X examples: FBN, AMNM and ADRS