Dictionary learning for fast classification based on soft-thresholding
Alhussein Fawzi
Joint work with Mike Davies (University of Edinburgh) and Pascal Frossard (EPFL)
EPFL – Signal Processing Laboratory (LTS4), http://lts4.epfl.ch

Sparse dictionary learning

- Dictionary learning is a technique used in signal processing to learn a representation of the data that 'sparsifies' the vectors:

      \min_{D, \{c_i\}} \sum_{i=1}^n \underbrace{\| x_i - D c_i \|_2^2}_{\text{reconstruction}} + \underbrace{\alpha \, \psi(c_i)}_{\text{sparsity}}

  where D is the dictionary, the c_i are the sparse codes, and ψ is a sparsity-inducing regularizer. [Olshausen & Field, '96]
- It has been successfully applied in many image processing applications such as denoising and inpainting, leading to state-of-the-art results. [Mairal, Bach, Ponce, Sapiro, and Zisserman, '09] [Mairal, Sapiro, and Elad, '09] [Mairal, Elad, and Sapiro, '08]

Dictionary learning for classification

- Dictionary learning has recently been applied to classification problems with the following setup: a test point x is mapped to its sparse code c*(x | D, α), which is then fed to a classifier that outputs the label, where

      c^*(x \mid D, \alpha) = \arg\min_c \| x - D c \|_2^2 + \alpha \, \psi(c).

- How to train the dictionary?
  - Reconstructive approach: might not lead to a good discrimination.
  - Discriminative approach: shown to achieve state-of-the-art results on some datasets (e.g., [Mairal, Bach, Ponce, '10]).
- However, the above scheme computes the sparse coding map c*(x | D, α) for each test point, which is computationally expensive.

Sparse coding map

- Consider the non-negative ℓ1 sparse regularizer. The sparse coding map becomes:

      c^*(x \mid D, \alpha) = \arg\min_{c \ge 0} \| x - D c \|_2^2 + \alpha \| c \|_1.

- To solve the sparse coding problem, one can use the proximal gradient method and iterate to convergence:

      c^{k+1} = \max\big(0, \; c^k + t D^T (x - D c^k) - \alpha t \big),

  which requires on the order of 1/ε iterations to reach an accuracy of ε.
  [Diagram: the iterations unrolled as a feed-forward chain, mapping x through D^T and the successive proximal steps to c^1, c^2, c^3, ...]
- We consider instead only one iteration of the proximal gradient algorithm, for efficiency reasons.

Classification architecture

- The resulting classifier is the pipeline x → D^T → h_α → w^T → label, where h_α(z) = max(0, z - α) is the feature extraction step and w is a linear classifier.
- Equivalent representation: a neural network with one hidden layer and activation function h_α. [Diagram: input x, hidden layer parametrized by the dictionary D with activation h_α, output given by the normal vector w.]
- Goal: jointly learn the dictionary D and the linear classifier w to maximize classification performance.
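As a concrete illustration, here is a minimal NumPy sketch of this test-time pipeline, assuming a dictionary D and a hyperplane w have already been learned; the function names and shapes are illustrative, not the authors' code:

```python
import numpy as np

def soft_threshold_features(X, D, alpha=1.0):
    """Feature map h_alpha(D^T x) = max(0, D^T x - alpha), applied to each row of X.

    X: (m, n) test points, D: (n, N) dictionary, alpha: sparsity parameter.
    This is exactly one non-negative proximal gradient step started from c = 0
    (with unit step size), used in place of iterating sparse coding to convergence.
    """
    return np.maximum(0.0, X @ D - alpha)

def predict_binary(X, D, w, alpha=1.0):
    """Binary label sign(w^T h_alpha(D^T x)) for each test point."""
    return np.sign(soft_threshold_features(X, D, alpha) @ w)
```

The cost per test point is a single matrix-vector product with D followed by a thresholding, which is what makes the scheme fast at test time.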
Problem formulation

- Consider a binary classification task with training points {x_i}_{i=1}^m and associated labels {y_i}_{i=1}^m ∈ {-1, 1}:

      \arg\min_{D, w} \underbrace{\sum_{i=1}^m L\big( y_i \, w^T h_\alpha(D^T x_i) \big)}_{\text{loss term}} + \underbrace{\frac{\nu}{2} \| w \|_2^2}_{\text{regularization}}

- The parameter α (controlling the sparsity of the features) can be set to 1 without loss of generality, since h_α(z) = α h_1(z/α) and therefore

      \sum_{i=1}^m L\big( y_i \, \alpha w^T h_1(D^T x_i / \alpha) \big) + \frac{\nu}{2} \| w \|_2^2
      = \sum_{i=1}^m L\big( y_i \, \tilde{w}^T h_1(\tilde{D}^T x_i) \big) + \frac{\tilde{\nu}}{2} \| \tilde{w} \|_2^2,

  with the rescaled variables D̃ = D/α, w̃ = αw and ν̃ = ν/α².
- The learning problem is nonconvex and difficult to solve in general.

Problem formulation (cont'd)

- We rewrite the problem in a form appropriate for optimization.
- Change of variables: u_j ← |w_j| d_j, v_j ← |w_j|, s_j ← sgn(w_j). The learning problem is then equivalent to:

      \arg\min_{U, v, s} \sum_{i=1}^m L\Big( y_i \sum_{j=1}^N s_j \max(0, u_j^T x_i - v_j) \Big) + \frac{\nu}{2} \| v \|_2^2,
      \quad \text{subject to } v > 0.

- Approximations:
  - s is known a priori: the proportion of class 1 and class -1 atoms in the desired dictionary is known.
  - We replace max(0, z) with a smooth approximation q(z). [Plot: the smooth approximation q(z) of max(0, z) over z ∈ [-2, 2].]

  This yields the problem

      (P): \quad \arg\min_{U, v} \sum_{i=1}^m L\Big( y_i \sum_{j=1}^N s_j \, q(u_j^T x_i - v_j) \Big) + \frac{\nu}{2} \| v \|_2^2,
      \quad \text{subject to } v \ge \epsilon.

Difference-of-convex (DC) decomposition

- A function f is called DC if it can be written in the form f = g - h, where g and h are convex functions. Note that the decomposition is not unique.
- One can show that, for any convex loss function L and any convex function q, problem (P) is a DC program.
- Moreover, when L is the hinge loss, we have the following decomposition:

      g = \frac{\nu}{2} \| v \|_2^2 + \sum_{i=1}^m \max\Big( \sum_{j : s_j = y_i} q(u_j^T x_i - v_j), \; 1 + \sum_{j : s_j \ne y_i} q(u_j^T x_i - v_j) \Big),
      \qquad
      h = \sum_{i=1}^m \sum_{j : s_j = y_i} q(u_j^T x_i - v_j).

Learning Algorithm for Soft-Thresholding classifier (LAST)

- The objective function of (P) can be written as g(U, v) - h(U, v). We solve (P) through a sequence of convex programs: the concave part -h is linearized around the current point, and the resulting convex problem is solved at each iteration [Tao and An, '98].

Algorithm 1: LAST (Learning Algorithm for Soft-Thresholding classifier)
  1. Choose any initial point U^0 and v^0 ≥ ε.
  2. For k = 0, ..., K - 1:
     2.1 Compute (A, b) = ∇h(U^k, v^k).
     2.2 Solve the convex optimization problem:

         (U^{k+1}, v^{k+1}) \leftarrow \arg\min_{U, v} \big\{ g(U, v) - \mathrm{Tr}(U^T A) - v^T b \big\}, \quad \text{subject to } v \ge \epsilon.

     2.3 If (U^{k+1}, v^{k+1}) ≈ (U^k, v^k), return (U^{k+1}, v^{k+1}).

- For large-scale problems, we use a stochastic gradient descent algorithm to solve the convex problem at each iteration of LAST.
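The outer loop of Algorithm 1 is a standard DC (concave-convex) procedure. The sketch below shows its structure only; the hypothetical callables `grad_h` and `solve_convex_subproblem` stand in for the problem-specific gradient of h and the inner convex solver (handled with stochastic gradient descent in the paper), and the defaults are illustrative, not the authors' implementation:

```python
import numpy as np

def last_outer_loop(grad_h, solve_convex_subproblem, U0, v0, K=20, tol=1e-5):
    """Generic DC outer loop in the spirit of LAST (Algorithm 1).

    grad_h(U, v)                  -> (A, b), gradient of the concave part h
    solve_convex_subproblem(A, b) -> (U, v) minimizing g(U, v) - Tr(U^T A) - v^T b
                                     subject to v >= eps
    """
    U, v = U0, v0
    for _ in range(K):
        A, b = grad_h(U, v)                        # linearize -h around the current point
        U_new, v_new = solve_convex_subproblem(A, b)
        if np.linalg.norm(U_new - U) + np.linalg.norm(v_new - v) < tol:
            return U_new, v_new                    # successive iterates (approximately) equal
        U, v = U_new, v_new
    return U, v
```

Each outer iteration thus evaluates ∇h once and solves one convex problem, and the loop stops when successive iterates are approximately equal, as in step 2.3.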
Experimental setup

- We perform experiments on textures, digits and natural images.
- Extension to the multi-class setting uses a one-vs-all approach (a small sketch of the resulting prediction rule is given after the tables below).
- Experimental validation in two steps:
  1. Learning algorithm: we fix the classification scheme x → D^T → h_α → w^T → label and compare our learning algorithm to other learning methods (random samples, K-means, reconstructive dictionary learning, stochastic gradient descent).
  2. Whole classifier: comparison of the soft-thresholding based classifier, where (D, w) are trained using LAST, to linear SVM, RBF SVM, sparse coding classifiers, and deep rectifier networks.

Evaluation of the learning algorithm

- Binary texture classification task (textures from the Brodatz dataset) and deer vs. horse classification task (images from CIFAR-10).
- [Plots: classification accuracy vs. dictionary size on both tasks, comparing supervised random samples, supervised K-means, dictionary learning for ℓ1 sparse coding, SGD, and LAST.]
- [Plots: distribution of class 1 and class 2 test points in the learned feature space max(0, d_j^T x - 1), for dictionaries with N = 2 atoms obtained with K-means and with LAST.]

Classification performance on binary datasets

Texture classification tasks (accuracy, %):

    Method                    Task 1   Task 2
    Linear SVM                49.5     49.1
    RBF kernel SVM            98.5     90.1
    Sparse coding (N = 50)    97.5     85.5
    Sparse coding (N = 400)   98.1     90.9
    NN (N = 50)               94.3     84.1
    NN (N = 400)              97.8     86.6
    LAST (N = 50)             98.7     87.3
    LAST (N = 400)            98.6     93.5

Deer vs. horse classification task (accuracy, %):

    Method                    "deer" vs. "horse"
    Linear SVM                72.6
    RBF kernel SVM            83.5
    Sparse coding (N = 10)    70.6
    Sparse coding (N = 100)   76.2
    NN (N = 10)               67.7
    NN (N = 100)              70.9
    LAST (N = 10)             80.1
    LAST (N = 100)            82.8

- On these examples, the proposed algorithm gives comparable (or better) performance with respect to RBF-SVM and sparse coding.
- LAST is however much faster at test time!

Classification performance in multi-class scenarios

We extend our classifier to multi-class scenarios using a one-vs-all approach.

Digits datasets (MNIST and USPS), classification error (%):

    Method                                    MNIST   USPS
    Linear SVM                                8.19    9.07
    RBF kernel SVM                            1.4     4.2
    K-NN (ℓ2)                                 5.0     5.2
    Sparse coding                             3.0     5.33
    Huang and Aviyente (2006)                 -       6.05
    SDL-G L (Mairal et al., 2008)             3.56    6.67
    SDL-D L (Mairal et al., 2008)             1.05    3.54
    Ramirez et al. (2010)                     1.26    3.98
    SGD                                       2.22    5.88
    3-layer ReLU net (Glorot et al., 2011)    1.43    -
    LAST                                      1.32    4.53

(The methods from Sparse coding to Ramirez et al. are dictionary learning methods using a sparse coding mapping; SGD and the 3-layer ReLU net are neural network methods.)

Natural images (CIFAR-10), classification error (%):

    Method                                CIFAR-10
    Linear SVM                            59.70
    LAST (N = 400)                        46.56
    SGD (N = 400)                         52.96
    3-layer ReLU net                      50.86
    3-layer ReLU net + sup. pre-train     49.96

- LAST gives results comparable to recent dictionary learning methods.
- The comparison with neural networks shows the advantage of our DC-based optimization.

Complexity and running time needed to classify the 10,000 MNIST test examples:

    Method            Complexity (per test sample)   Time [s]
    Linear SVM        O(n)                           0.4
    RBF kernel SVM    O(nm)                          154
    Sparse coding     O(nN / √ε)                     14
    LAST classifier   O(nN)                          1.0

All experiments are carried out on a 2.6 GHz Intel Core i7 machine with 16 GB RAM.
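For reference, here is a minimal NumPy sketch (with illustrative names) of the one-vs-all prediction rule used in the multi-class experiments, assuming one soft-thresholding classifier (D_k, w_k) has been trained for each class against the rest:

```python
import numpy as np

def one_vs_all_predict(X, dictionaries, hyperplanes, alpha=1.0):
    """Predict the class whose binary classifier gives the largest score.

    X: (m, n) test points; dictionaries[k]: (n, N) dictionary of class k;
    hyperplanes[k]: (N,) linear classifier of class k.
    Each score is w_k^T max(0, D_k^T x - alpha), i.e. the binary
    soft-thresholding classifier of class k applied to x.
    """
    scores = np.stack(
        [np.maximum(0.0, X @ D - alpha) @ w
         for D, w in zip(dictionaries, hyperplanes)],
        axis=1,  # shape (m, number of classes)
    )
    return np.argmax(scores, axis=1)
```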
Conclusions

- A supervised dictionary learning algorithm tailored to the soft-thresholding based classifier.
- The proposed algorithm leverages the DC structure of the problem and outperforms more 'generic' learning procedures (e.g., SGD).
- Despite the simplicity of the classification scheme, LAST provides results comparable to sparse coding methods, while being much faster at test time.

For more information, see our paper "Dictionary learning for fast classification based on soft-thresholding", to appear in the International Journal of Computer Vision (IJCV), available on arXiv.