Closed-Form Training of Conditional Random Fields Alex Kolesnikov (IST Austria), Matthieu Guillaumin (ETH Zurich), Vittorio Ferrari (U Edinburgh), Christoph H. Lampert (IST Austria) to appear at ECCV 2014 16. Juni 2014 Computer Vision Goes Big I 3D reconstruction: I I Image classification: I I I 80M Tiny Images: 80,000,000 (tiny) images ANN-SIFT1B: 1,000,000,000 SIFT descriptors Object detection: I I ImageNet: >14,000,000 training images (Image) retrieval: I I >2,800,000 training images ImageSearch-100k: 32,000,000 images (Semantic) image segmentation (with CRFs/S-SVMs) I PASCAL VOC 2012: 9,993 images Why Did Image Segmentation Stays Small? We don’t have the data. I Collecting segmentation annotation is tedious and costly, even (or especially) using MTurk. → nobody bothers to go beyond a few hundred images per category But: mscoco on the horizon! Structured Prediction Methods Don’t Scale What if we did have a segmentation dataset with, e.g., 100,000 images? Conditional Random Fields (CRFs) Iterative training, e.g. Stochastic Gradient Descent: I gradient needs probabilistic inference (marginals) I at 1s per call → I many iterations >24 hour for 1 pass through the data (epoch) Structured Support Vector Machine training (SSVMs) Iterative training, e.g. Stochastic Subgradient Method, or Frank-Wolfe I subgradient needs MAP prediction (energy minimization) I at 0.2s per call → I many many iterations >5 hours per epoch So are we’re stuck with training per-pixel classifiers? For CRFs, not quite: pseudo-likelihood, piecewise training (see later) A Dream: A Closed-Form Expression for Training CRFs Standard setting: pairwise CRF: I (cyclic) graph: vertex set: V = {1, . . . , k}, edge set: E ⊂ V × V I label set: L I model conditional distribution of labelings y given the input images x , p(y|x, w) = E(x, y; w) = 1 e −E(x,y;w) Z (x, w) XX Es;i (x; w)Jys = iK s∈V i∈L + X X Est;ij (x; w)Jys = i ∧ yt = jK s,t∈E i,j∈L×L Es;i (x; w) = hwi , ϕ(xs )i, Est;ij (x; w) = hwij , ϕ(xs , xt )i Wouldn’t it be great if we had a closed-form formula: w = F(x, y)? A Closed-Form Expression for Maximum-Likelihood Parameters I acyclic graph: vertex set: V = {1, . . . , k}, edge set: E ⊂ V × V I label set: L I no conditioning on x, just a distribution on y 1 −E(y;θ) e Z X X X E(y; θ) = θs;i Jys = iK + p(y; θ) = s∈V i∈L X s,t∈E i,j∈L×L θst;ij Jys = i ∧ yt = jK has a closed for expression for the maximum-likelihood parameters! Given samples, y 1 , . . . , y n , θs;i = log µs;i µs;i = and n 1X Jy k = iK n k=1 s θst;ij = log and µst;ij , µs;i µt;j µst;ij = for n 1X Jy k = i ∧ ytk = jK. n k=1 s Adapt to the Conditional Situation I acyclic graph: (V, E) I conditional distribution 1 e −E(x,y;w) p(y|x, w) = Z (x,w) I samples, (x 1 , y 1 ), . . . , (x n , y n ), Est;ij (x; w) = log fij (xs , xt ) Es;i (x; w) = log gi (xs ), gi (xs ) = hwi , ϕ(xs )i, fij (xs ) = hwij , ϕ(xs , xt )i Find parameters, wi , wij , such that for any sample (x k , y k ), hwi , ϕs (x k )i ≈ µs;i (x k ), and hwij , ϕst (x k )i ≈ µst;ij (x k ). → regression problem! Note: unless there’s duplicates in the training set, we have µs;i (x k ) = Jysk = iK, µst;ij (x k ) = Jysk = i ∧ ytk = jK. Adapt to the CRF Situation What loss function? Let’s try squared loss, i.e. ridge regression, wi = argminw wij = argminw n X X k=1 s∈V n X X 2 hw, ϕs (x k )i − µs;i (x k )K hw, ϕst (x k )i − µst;ij (x k ) + λkwk2 2 + λkwk2 k=1 st∈E Closed-form solution: wij = (ΦΦ> + λ I)−1 Φµ, (same form for wi ). I independent problem for each label or label pair → easy to parallelize I only matrix operations → a few lines of Python or C++ code I trains within minutes on millions of samples (i.e. pixels/superpixels) Other losses and other (non-linear) regression functions possible: I paper: logistic loss + gradient boosted decision tree Adapt to the CRF Situation I cyclic graph: (V, E) I conditional distribution: 1 e −E(x,y;w) p(y|x, w) = Z (x,w) For learning: same trick as for piecewise (PW) learning: I decompose graph into smaller parts (here: individual edges) I provides a bound on the partition function, Z (x, w) ≤ ZPW (x, w) I learn energy for each edge (unaries cancel out in construction) For prediction: I optimize over parts, while enforcing label consistency I same as putting pieces back together I joint prediction, e.g. GraphCut But... didn’t you say we don’t have data? Also in the paper: two large datasets, DogSeg, HorseSeg. Training set: I 25,078 images of horses from ImageNet/PASCAL VOC, I 156,368 images of dogs from ImageNet/PASCAL VOC. I figure-ground annotation of three kinds: I I I 147/249 (HorseSeg/DogSeg) manually annotated (PASCAL VOC) 6,044/42,763 generated using segmentation propagation (ECCV’12) from bounding box annotation 18,887/113,356 also generated using segmentation propagation from per-image labels Test set: I 241/306 ImageNet images, manually generated ground truth Three possible use scenarios: I benchmark techniques for large-scale CRF/SSVM training I image segmentation with (partially) unreliable training masks I weakly/semi-supervised segmentation (ignore generated labelings) Example Images Left: HorseSeg, right: DogSeg Annotation in each block: I left: manual (good and detailed) I middle: from bounding box (usually good but coarser) I right: from per-image label (can be very bad) Results DogSeg: approx. 90K/3.5M/13M data samples (superpixel pairs) HorseSeg: approx. 45K/500K/2M data samples (superpixel pairs) a) Training time (excluding superpixels/features) HorseSeg - manual (147 images) HorseSeg - bbox (6K images) HorseSeg - all (25K images) DogSeg - manual (249 images) DogSeg - bbox (43K images) DogSeg - all (156K images) linear unary <1m <1m 1m 2.5m 7m 20m linear pairwise <1m <1m 2m 2.5m 14m 46m nonlinear unary <1m 9m 88m <1m 110m 519m nonlinear pairwise 1.5m 32m 354m 2m 348m 668m b) Accuracy linear unary 81.4 82.0 81.4 77.8 78.5 78.1 linear pairwise 82.2 83.6 82.5 78.5 80.9 80.0 nonlinear unary 81.6 83.6 83.3 79.1 81.2 80.2 nonlinear pairwise 83.8 86.4 84.5 80.8 83.8 82.2 HorseSeg - manual (147 images) HorseSeg - bbox (6K images) HorseSeg - all (25K images) DogSeg - manual (249 images) DogSeg - bbox (43K images) DogSeg - all (156K images) Results DogSeg: approx. 90K/3.5M/13M data samples (superpixel pairs) HorseSeg: approx. 45K/500K/2M data samples (superpixel pairs) a) Training time (excluding superpixels/features) HorseSeg - manual (147 images) HorseSeg - bbox (6K images) HorseSeg - all (25K images) DogSeg - manual (249 images) DogSeg - bbox (43K images) DogSeg - all (156K images) linear unary <1m <1m 1m 2.5m 7m 20m linear pairwise <1m <1m 2m 2.5m 14m 46m nonlinear unary <1m 9m 88m <1m 110m 519m nonlinear pairwise 1.5m 32m 354m 2m 348m 668m b) Accuracy linear unary 81.4 82.0 81.4 77.8 78.5 78.1 linear pairwise 82.2 83.6 82.5 78.5 80.9 80.0 nonlinear unary 81.6 83.6 83.3 79.1 81.2 80.2 nonlinear pairwise 83.8 86.4 84.5 80.8 83.8 82.2 HorseSeg - manual (147 images) HorseSeg - bbox (6K images) HorseSeg - all (25K images) DogSeg - manual (249 images) DogSeg - bbox (43K images) DogSeg - all (156K images) Results DogSeg: approx. 90K/3.5M/13M data samples (superpixel pairs) HorseSeg: approx. 45K/500K/2M data samples (superpixel pairs) a) Training time (excluding superpixels/features) HorseSeg - manual (147 images) HorseSeg - bbox (6K images) HorseSeg - all (25K images) DogSeg - manual (249 images) DogSeg - bbox (43K images) DogSeg - all (156K images) linear unary <1m <1m 1m 2.5m 7m 20m linear pairwise <1m <1m 2m 2.5m 14m 46m nonlinear unary <1m 9m 88m <1m 110m 519m nonlinear pairwise 1.5m 32m 354m 2m 348m 668m b) Accuracy linear unary 81.4 82.0 81.4 77.8 78.5 78.1 linear pairwise 82.2 83.6 82.5 78.5 80.9 80.0 nonlinear unary 81.6 83.6 83.3 79.1 81.2 80.2 nonlinear pairwise 83.8 86.4 84.5 80.8 83.8 82.2 HorseSeg - manual (147 images) HorseSeg - bbox (6K images) HorseSeg - all (25K images) DogSeg - manual (249 images) DogSeg - bbox (43K images) DogSeg - all (156K images) Results DogSeg: approx. 90K/3.5M/13M data samples (superpixel pairs) HorseSeg: approx. 45K/500K/2M data samples (superpixel pairs) a) Training time (excluding superpixels/features) HorseSeg - manual (147 images) HorseSeg - bbox (6K images) HorseSeg - all (25K images) DogSeg - manual (249 images) DogSeg - bbox (43K images) DogSeg - all (156K images) linear unary <1m <1m 1m 2.5m 7m 20m linear pairwise <1m <1m 2m 2.5m 14m 46m nonlinear unary <1m 9m 88m <1m 110m 519m nonlinear pairwise 1.5m 32m 354m 2m 348m 668m b) Accuracy linear unary 81.4 82.0 81.4 77.8 78.5 78.1 linear pairwise 82.2 83.6 82.5 78.5 80.9 80.0 nonlinear unary 81.6 83.6 83.3 79.1 81.2 80.2 nonlinear pairwise 83.8 86.4 84.5 80.8 83.8 82.2 HorseSeg - manual (147 images) HorseSeg - bbox (6K images) HorseSeg - all (25K images) DogSeg - manual (249 images) DogSeg - bbox (43K images) DogSeg - all (156K images) Results DogSeg: approx. 90K/3.5M/13M data samples (superpixel pairs) HorseSeg: approx. 45K/500K/2M data samples (superpixel pairs) a) Training time (excluding superpixels/features) HorseSeg - manual (147 images) HorseSeg - bbox (6K images) HorseSeg - all (25K images) DogSeg - manual (249 images) DogSeg - bbox (43K images) DogSeg - all (156K images) linear unary <1m <1m 1m 2.5m 7m 20m linear pairwise <1m <1m 2m 2.5m 14m 46m nonlinear unary <1m 9m 88m <1m 110m 519m nonlinear pairwise 1.5m 32m 354m 2m 348m 668m b) Accuracy linear unary 81.4 82.0 81.4 77.8 78.5 78.1 linear pairwise 82.2 83.6 82.5 78.5 80.9 80.0 nonlinear unary 81.6 83.6 83.3 79.1 81.2 80.2 nonlinear pairwise 83.8 86.4 84.5 80.8 83.8 82.2 HorseSeg - manual (147 images) HorseSeg - bbox (6K images) HorseSeg - all (25K images) DogSeg - manual (249 images) DogSeg - bbox (43K images) DogSeg - all (156K images) Results DogSeg: approx. 90K/3.5M/13M data samples (superpixel pairs) HorseSeg: approx. 45K/500K/2M data samples (superpixel pairs) a) Training time (excluding superpixels/features) HorseSeg - manual (147 images) HorseSeg - bbox (6K images) HorseSeg - all (25K images) DogSeg - manual (249 images) DogSeg - bbox (43K images) DogSeg - all (156K images) linear unary <1m <1m 1m 2.5m 7m 20m linear pairwise <1m <1m 2m 2.5m 14m 46m nonlinear unary <1m 9m 88m <1m 110m 519m nonlinear pairwise 1.5m 32m 354m 2m 348m 668m b) Accuracy linear unary 81.4 82.0 81.4 77.8 78.5 78.1 linear pairwise 82.2 83.6 82.5 78.5 80.9 80.0 nonlinear unary 81.6 83.6 83.3 79.1 81.2 80.2 nonlinear pairwise 83.8 86.4 84.5 80.8 83.8 82.2 HorseSeg - manual (147 images) HorseSeg - bbox (6K images) HorseSeg - all (25K images) DogSeg - manual (249 images) DogSeg - bbox (43K images) DogSeg - all (156K images) Summary – Training bg bg 1.00 fg fg bg 0.75 bg fg bg fg fg fg bg fg 0.50 bg bg 0.25 bg bg bg bg bg superpixel graph 0.00 pairwise decomposition regression Summary – Prediction E12 fg bg fg 7 1 2 bg 3 -1 E23 fg bg fg 4 -3 2 bg 8 -1 E34 fg bg fg -4 -1 bg 4 2 3 4 predicted energy segmentation result Thanks to... Thanks to... The reviewer who called the method: "as simple as beautiful". Thanks to... The reviewer who called the method: "as simple as beautiful". The reviewer who wrote: "the approach is brutally simple." Thanks to... The reviewer who called the method: "as simple as beautiful". The reviewer who wrote: "the approach is brutally simple." Collaborators: Alex Kolesnikov Funding: Vitto Ferrari Matthieu Guillaumin IST Austria, Vienna I Research institute, public funding I natural sciences, mathematics, computer science I computer vision: Vladimir Kolmogorov, Bernd Bickel, myself PhD Programme I 1(2) + 3 yr PhD program I join with MSc or BSc PostDoc Fellowships/Positions I curiosity-driven research Internships: More information: I 3 to 6 months I undergraduate or postgraduate www.ist.ac.at or ask me after the workshop
© Copyright 2025 ExpyDoc