Closed-Form Training of Conditional Random Fields

Closed-Form Training of
Conditional Random Fields
Alex Kolesnikov (IST Austria),
Matthieu Guillaumin (ETH Zurich),
Vittorio Ferrari (U Edinburgh),
Christoph H. Lampert (IST Austria)
to appear at ECCV 2014
16. Juni 2014
Computer Vision Goes Big
I
3D reconstruction:
I
I
Image classification:
I
I
I
80M Tiny Images: 80,000,000 (tiny) images
ANN-SIFT1B: 1,000,000,000 SIFT descriptors
Object detection:
I
I
ImageNet: >14,000,000 training images
(Image) retrieval:
I
I
>2,800,000 training images
ImageSearch-100k: 32,000,000 images
(Semantic) image segmentation (with CRFs/S-SVMs)
I
PASCAL VOC 2012: 9,993 images
Why Did Image Segmentation Stays Small?
We don’t have the data.
I
Collecting segmentation annotation is tedious and costly, even (or
especially) using MTurk.
→ nobody bothers to go beyond a few hundred images per category
But: mscoco on the horizon!
Structured Prediction Methods Don’t Scale
What if we did have a segmentation dataset with, e.g., 100,000 images?
Conditional Random Fields (CRFs)
Iterative training, e.g. Stochastic Gradient Descent:
I
gradient needs probabilistic inference (marginals)
I
at 1s per call →
I
many iterations
>24 hour for 1 pass through the data (epoch)
Structured Support Vector Machine training (SSVMs)
Iterative training, e.g. Stochastic Subgradient Method, or Frank-Wolfe
I
subgradient needs MAP prediction (energy minimization)
I
at 0.2s per call →
I
many many iterations
>5 hours per epoch
So are we’re stuck with training per-pixel classifiers?
For CRFs, not quite: pseudo-likelihood, piecewise training (see later)
A Dream: A Closed-Form Expression for Training CRFs
Standard setting: pairwise CRF:
I
(cyclic) graph: vertex set: V = {1, . . . , k},
edge set: E ⊂ V × V
I
label set: L
I
model conditional distribution of labelings y
given the input images x ,
p(y|x, w) =
E(x, y; w) =
1
e −E(x,y;w)
Z (x, w)
XX
Es;i (x; w)Jys = iK
s∈V i∈L
+
X
X
Est;ij (x; w)Jys = i ∧ yt = jK
s,t∈E i,j∈L×L
Es;i (x; w) = hwi , ϕ(xs )i,
Est;ij (x; w) = hwij , ϕ(xs , xt )i
Wouldn’t it be great if we had a closed-form formula: w = F(x, y)?
A Closed-Form Expression for Maximum-Likelihood Parameters
I
acyclic graph: vertex set: V = {1, . . . , k},
edge set: E ⊂ V × V
I
label set: L
I
no conditioning on x, just a distribution on y
1 −E(y;θ)
e
Z X
X
X
E(y; θ) =
θs;i Jys = iK +
p(y; θ) =
s∈V i∈L
X
s,t∈E i,j∈L×L
θst;ij Jys = i ∧ yt = jK
has a closed for expression for the maximum-likelihood parameters!
Given samples, y 1 , . . . , y n ,
θs;i = log µs;i
µs;i =
and
n
1X
Jy k = iK
n k=1 s
θst;ij = log
and
µst;ij
,
µs;i µt;j
µst;ij =
for
n
1X
Jy k = i ∧ ytk = jK.
n k=1 s
Adapt to the Conditional Situation
I
acyclic graph: (V, E)
I
conditional distribution
1
e −E(x,y;w)
p(y|x, w) = Z (x,w)
I
samples, (x 1 , y 1 ), . . . , (x n , y n ),
Est;ij (x; w) = log fij (xs , xt )
Es;i (x; w) = log gi (xs ),
gi (xs ) = hwi , ϕ(xs )i,
fij (xs ) = hwij , ϕ(xs , xt )i
Find parameters, wi , wij , such that for any sample (x k , y k ),
hwi , ϕs (x k )i ≈ µs;i (x k ),
and
hwij , ϕst (x k )i ≈ µst;ij (x k ).
→ regression problem!
Note: unless there’s duplicates in the training set, we have
µs;i (x k ) = Jysk = iK,
µst;ij (x k ) = Jysk = i ∧ ytk = jK.
Adapt to the CRF Situation
What loss function? Let’s try squared loss, i.e. ridge regression,
wi = argminw
wij = argminw
n X
X
k=1 s∈V
n X
X
2
hw, ϕs (x k )i − µs;i (x k )K
hw, ϕst (x k )i − µst;ij (x k )
+ λkwk2
2
+ λkwk2
k=1 st∈E
Closed-form solution: wij = (ΦΦ> + λ I)−1 Φµ,
(same form for wi ).
I
independent problem for each label or label pair → easy to parallelize
I
only matrix operations → a few lines of Python or C++ code
I
trains within minutes on millions of samples (i.e. pixels/superpixels)
Other losses and other (non-linear) regression functions possible:
I
paper: logistic loss + gradient boosted decision tree
Adapt to the CRF Situation
I
cyclic graph: (V, E)
I
conditional distribution:
1
e −E(x,y;w)
p(y|x, w) = Z (x,w)
For learning: same trick as for piecewise (PW) learning:
I
decompose graph into smaller parts (here: individual edges)
I
provides a bound on the partition function, Z (x, w) ≤ ZPW (x, w)
I
learn energy for each edge (unaries cancel out in construction)
For prediction:
I
optimize over parts, while
enforcing label consistency
I
same as putting pieces back
together
I
joint prediction, e.g. GraphCut
But... didn’t you say we don’t have data?
Also in the paper: two large datasets, DogSeg, HorseSeg.
Training set:
I 25,078 images of horses from ImageNet/PASCAL VOC,
I 156,368 images of dogs from ImageNet/PASCAL VOC.
I figure-ground annotation of three kinds:
I
I
I
147/249 (HorseSeg/DogSeg) manually annotated (PASCAL VOC)
6,044/42,763 generated using segmentation propagation (ECCV’12)
from bounding box annotation
18,887/113,356 also generated using segmentation propagation from
per-image labels
Test set:
I 241/306 ImageNet images, manually generated ground truth
Three possible use scenarios:
I benchmark techniques for large-scale CRF/SSVM training
I image segmentation with (partially) unreliable training masks
I weakly/semi-supervised segmentation (ignore generated labelings)
Example Images
Left: HorseSeg,
right: DogSeg
Annotation in each block:
I
left: manual (good and detailed)
I
middle: from bounding box (usually good but coarser)
I
right: from per-image label (can be very bad)
Results
DogSeg: approx. 90K/3.5M/13M data samples (superpixel pairs)
HorseSeg: approx. 45K/500K/2M data samples (superpixel pairs)
a) Training time
(excluding superpixels/features)
HorseSeg - manual (147 images)
HorseSeg - bbox (6K images)
HorseSeg - all (25K images)
DogSeg - manual (249 images)
DogSeg - bbox (43K images)
DogSeg - all (156K images)
linear
unary
<1m
<1m
1m
2.5m
7m
20m
linear
pairwise
<1m
<1m
2m
2.5m
14m
46m
nonlinear
unary
<1m
9m
88m
<1m
110m
519m
nonlinear
pairwise
1.5m
32m
354m
2m
348m
668m
b) Accuracy
linear
unary
81.4
82.0
81.4
77.8
78.5
78.1
linear
pairwise
82.2
83.6
82.5
78.5
80.9
80.0
nonlinear
unary
81.6
83.6
83.3
79.1
81.2
80.2
nonlinear
pairwise
83.8
86.4
84.5
80.8
83.8
82.2
HorseSeg - manual (147 images)
HorseSeg - bbox (6K images)
HorseSeg - all (25K images)
DogSeg - manual (249 images)
DogSeg - bbox (43K images)
DogSeg - all (156K images)
Results
DogSeg: approx. 90K/3.5M/13M data samples (superpixel pairs)
HorseSeg: approx. 45K/500K/2M data samples (superpixel pairs)
a) Training time
(excluding superpixels/features)
HorseSeg - manual (147 images)
HorseSeg - bbox (6K images)
HorseSeg - all (25K images)
DogSeg - manual (249 images)
DogSeg - bbox (43K images)
DogSeg - all (156K images)
linear
unary
<1m
<1m
1m
2.5m
7m
20m
linear
pairwise
<1m
<1m
2m
2.5m
14m
46m
nonlinear
unary
<1m
9m
88m
<1m
110m
519m
nonlinear
pairwise
1.5m
32m
354m
2m
348m
668m
b) Accuracy
linear
unary
81.4
82.0
81.4
77.8
78.5
78.1
linear
pairwise
82.2
83.6
82.5
78.5
80.9
80.0
nonlinear
unary
81.6
83.6
83.3
79.1
81.2
80.2
nonlinear
pairwise
83.8
86.4
84.5
80.8
83.8
82.2
HorseSeg - manual (147 images)
HorseSeg - bbox (6K images)
HorseSeg - all (25K images)
DogSeg - manual (249 images)
DogSeg - bbox (43K images)
DogSeg - all (156K images)
Results
DogSeg: approx. 90K/3.5M/13M data samples (superpixel pairs)
HorseSeg: approx. 45K/500K/2M data samples (superpixel pairs)
a) Training time
(excluding superpixels/features)
HorseSeg - manual (147 images)
HorseSeg - bbox (6K images)
HorseSeg - all (25K images)
DogSeg - manual (249 images)
DogSeg - bbox (43K images)
DogSeg - all (156K images)
linear
unary
<1m
<1m
1m
2.5m
7m
20m
linear
pairwise
<1m
<1m
2m
2.5m
14m
46m
nonlinear
unary
<1m
9m
88m
<1m
110m
519m
nonlinear
pairwise
1.5m
32m
354m
2m
348m
668m
b) Accuracy
linear
unary
81.4
82.0
81.4
77.8
78.5
78.1
linear
pairwise
82.2
83.6
82.5
78.5
80.9
80.0
nonlinear
unary
81.6
83.6
83.3
79.1
81.2
80.2
nonlinear
pairwise
83.8
86.4
84.5
80.8
83.8
82.2
HorseSeg - manual (147 images)
HorseSeg - bbox (6K images)
HorseSeg - all (25K images)
DogSeg - manual (249 images)
DogSeg - bbox (43K images)
DogSeg - all (156K images)
Results
DogSeg: approx. 90K/3.5M/13M data samples (superpixel pairs)
HorseSeg: approx. 45K/500K/2M data samples (superpixel pairs)
a) Training time
(excluding superpixels/features)
HorseSeg - manual (147 images)
HorseSeg - bbox (6K images)
HorseSeg - all (25K images)
DogSeg - manual (249 images)
DogSeg - bbox (43K images)
DogSeg - all (156K images)
linear
unary
<1m
<1m
1m
2.5m
7m
20m
linear
pairwise
<1m
<1m
2m
2.5m
14m
46m
nonlinear
unary
<1m
9m
88m
<1m
110m
519m
nonlinear
pairwise
1.5m
32m
354m
2m
348m
668m
b) Accuracy
linear
unary
81.4
82.0
81.4
77.8
78.5
78.1
linear
pairwise
82.2
83.6
82.5
78.5
80.9
80.0
nonlinear
unary
81.6
83.6
83.3
79.1
81.2
80.2
nonlinear
pairwise
83.8
86.4
84.5
80.8
83.8
82.2
HorseSeg - manual (147 images)
HorseSeg - bbox (6K images)
HorseSeg - all (25K images)
DogSeg - manual (249 images)
DogSeg - bbox (43K images)
DogSeg - all (156K images)
Results
DogSeg: approx. 90K/3.5M/13M data samples (superpixel pairs)
HorseSeg: approx. 45K/500K/2M data samples (superpixel pairs)
a) Training time
(excluding superpixels/features)
HorseSeg - manual (147 images)
HorseSeg - bbox (6K images)
HorseSeg - all (25K images)
DogSeg - manual (249 images)
DogSeg - bbox (43K images)
DogSeg - all (156K images)
linear
unary
<1m
<1m
1m
2.5m
7m
20m
linear
pairwise
<1m
<1m
2m
2.5m
14m
46m
nonlinear
unary
<1m
9m
88m
<1m
110m
519m
nonlinear
pairwise
1.5m
32m
354m
2m
348m
668m
b) Accuracy
linear
unary
81.4
82.0
81.4
77.8
78.5
78.1
linear
pairwise
82.2
83.6
82.5
78.5
80.9
80.0
nonlinear
unary
81.6
83.6
83.3
79.1
81.2
80.2
nonlinear
pairwise
83.8
86.4
84.5
80.8
83.8
82.2
HorseSeg - manual (147 images)
HorseSeg - bbox (6K images)
HorseSeg - all (25K images)
DogSeg - manual (249 images)
DogSeg - bbox (43K images)
DogSeg - all (156K images)
Results
DogSeg: approx. 90K/3.5M/13M data samples (superpixel pairs)
HorseSeg: approx. 45K/500K/2M data samples (superpixel pairs)
a) Training time
(excluding superpixels/features)
HorseSeg - manual (147 images)
HorseSeg - bbox (6K images)
HorseSeg - all (25K images)
DogSeg - manual (249 images)
DogSeg - bbox (43K images)
DogSeg - all (156K images)
linear
unary
<1m
<1m
1m
2.5m
7m
20m
linear
pairwise
<1m
<1m
2m
2.5m
14m
46m
nonlinear
unary
<1m
9m
88m
<1m
110m
519m
nonlinear
pairwise
1.5m
32m
354m
2m
348m
668m
b) Accuracy
linear
unary
81.4
82.0
81.4
77.8
78.5
78.1
linear
pairwise
82.2
83.6
82.5
78.5
80.9
80.0
nonlinear
unary
81.6
83.6
83.3
79.1
81.2
80.2
nonlinear
pairwise
83.8
86.4
84.5
80.8
83.8
82.2
HorseSeg - manual (147 images)
HorseSeg - bbox (6K images)
HorseSeg - all (25K images)
DogSeg - manual (249 images)
DogSeg - bbox (43K images)
DogSeg - all (156K images)
Summary – Training
bg
bg
1.00
fg
fg
bg
0.75
bg
fg
bg
fg
fg
fg
bg
fg
0.50
bg
bg
0.25
bg
bg
bg
bg
bg
superpixel graph
0.00
pairwise decomposition
regression
Summary – Prediction
E12 fg bg
fg 7
1
2
bg 3 -1
E23 fg bg
fg 4 -3
2
bg 8 -1
E34 fg bg
fg -4 -1
bg 4
2
3
4
predicted energy
segmentation result
Thanks to...
Thanks to...
The reviewer who called the method: "as simple as beautiful".
Thanks to...
The reviewer who called the method: "as simple as beautiful".
The reviewer who wrote: "the approach is brutally simple."
Thanks to...
The reviewer who called the method: "as simple as beautiful".
The reviewer who wrote: "the approach is brutally simple."
Collaborators:
Alex Kolesnikov
Funding:
Vitto Ferrari
Matthieu Guillaumin
IST Austria, Vienna
I
Research institute, public funding
I
natural sciences, mathematics,
computer science
I
computer vision: Vladimir
Kolmogorov, Bernd Bickel, myself
PhD Programme
I
1(2) + 3 yr PhD program
I
join with MSc or BSc
PostDoc Fellowships/Positions
I
curiosity-driven research
Internships:
More information:
I
3 to 6 months
I
undergraduate or postgraduate
www.ist.ac.at
or ask me after the workshop