
Dictionary learning for fast classification
based on soft-thresholding
Alhussein Fawzi
Joint work with Mike Davies (University of Edinburgh) and Pascal Frossard (EPFL)
EPFL – Signal Processing Laboratory (LTS4)
http://lts4.epfl.ch
Sparse dictionary learning
- Dictionary learning is a technique used in signal processing to learn a representation of the data that 'sparsifies' the vectors:
\min_{D,\,\{c_i\}} \; \sum_{i=1}^{n} \underbrace{\|x_i - D c_i\|_2^2}_{\text{Reconstruction}} \; + \; \alpha \underbrace{\psi(c_i)}_{\text{Sparsity}}

where D is the dictionary, the c_i are the sparse codes, and \psi is a sparsity-inducing penalty.
[Olshausen & Field, ’96]
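As a concrete illustration, here is a minimal numpy sketch that evaluates this objective for a given dictionary and set of codes, taking \psi to be the \ell_1 norm (one common choice); the function and variable names are ours, not from the slides.

import numpy as np

def dictionary_learning_objective(X, D, C, alpha):
    # X: (dim, n) signals as columns, D: (dim, N) dictionary, C: (N, n) sparse codes
    reconstruction = np.sum((X - D @ C) ** 2)   # sum_i ||x_i - D c_i||_2^2
    sparsity = alpha * np.sum(np.abs(C))        # alpha * sum_i psi(c_i), with psi = l1 norm here
    return reconstruction + sparsity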
- It has been successfully applied in many image processing applications, such as denoising and inpainting, leading to state-of-the-art results.
[Mairal, Bach, Ponce, Sapiro, and Zisserman, ’09]
[Mairal, Sapiro, and Elad, ’09]
[Mairal, Elad, and Sapiro, ’08]
Dictionary learning for classification
- Recently, dictionary learning has been applied to classification problems with the following setup:
x  →  [ DL: dictionary D ]  →  c^\star(x|D,\alpha)  →  [ Classifier ]  →  Label

with c^\star(x|D,\alpha) = \arg\min_c \|x - Dc\|_2^2 + \alpha\,\psi(c).

- How to train the dictionary?
  - Reconstructive approach: might not lead to good discrimination.
  - Discriminative approach: shown to achieve state-of-the-art results on some datasets (e.g., [Mairal, Bach, Ponce, '10]).
- However, the above scheme computes the sparse coding map c^\star(x|D,\alpha) for each test point, which is computationally expensive.
Sparse coding map
- Consider the non-negative \ell_1 sparse regularizer. The sparse coding map becomes:

  c^\star(x|D,\alpha) = \arg\min_{c} \|x - Dc\|_2^2 + \alpha\|c\|_1 \quad \text{subject to } c \ge 0.
- To solve the sparse coding problem, one can use the proximal gradient method. Iterate to convergence:

  c^{k+1} = \max\big(0,\; c^k + t\,D^T(x - Dc^k) - \alpha t\big),

  which requires roughly 1/\epsilon iterations to reach an accuracy of \epsilon.

[Figure: the iterations unrolled as a feed-forward network, with x entering through D^T and repeated blocks of the form I - D^T D producing c^1, c^2, c^3, ...]
- For efficiency reasons, we consider instead only one iteration of the proximal gradient algorithm.
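As an illustration, the sketch below contrasts the full proximal-gradient (iterate-to-convergence) sparse coder with the single-iteration map used here. It is a minimal numpy sketch under our own naming and step-size choices, not the authors' code.

import numpy as np

def sparse_code_prox(x, D, alpha, n_iter=500):
    # Non-negative l1 sparse coding by proximal gradient, iterated to (approximate) convergence.
    t = 1.0 / np.linalg.norm(D, 2) ** 2          # step size; a standard safe choice
    c = np.zeros(D.shape[1])
    for _ in range(n_iter):
        c = np.maximum(0.0, c + t * (D.T @ (x - D @ c)) - alpha * t)
    return c

def one_step_features(x, D, alpha):
    # A single proximal-gradient iteration started at c = 0 gives, up to the constant
    # factor t, the soft-thresholding map h_alpha(D^T x) = max(0, D^T x - alpha).
    return np.maximum(0.0, D.T @ x - alpha)

On a toy problem, one_step_features is a crude but extremely cheap approximation of sparse_code_prox, which is exactly the trade-off exploited in the rest of the talk.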
Classification architecture
Feature extraction: h_\alpha(D^T x), with activation function h_\alpha(z) = \max(0, z - \alpha).
Linear classifier: w^T h_\alpha(D^T x)  →  Label.

- Equivalent representation: a neural network with one hidden layer and activation function h_\alpha (input x, hidden layer given by the dictionary D, output given by the normal vector w).
Goal: jointly learn the dictionary D and the linear classifier w to maximize classification performance.
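For concreteness, a minimal sketch of the resulting test-time pipeline (assuming a learned D, w, and threshold \alpha; the binary label is taken as the sign of the score):

import numpy as np

def predict(x, D, w, alpha=1.0):
    features = np.maximum(0.0, D.T @ x - alpha)   # hidden layer: h_alpha(D^T x)
    return np.sign(w @ features)                  # output layer: linear classifier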
Problem formulation
Consider a binary classification task with training points \{x_i\}_{i=1}^m and associated labels \{y_i\}_{i=1}^m \in \{-1, 1\}.
\underset{D, w}{\arg\min} \;\; \underbrace{\sum_{i=1}^{m} L\big(y_i\, w^T h_\alpha(D^T x_i)\big)}_{\text{Loss term}} + \underbrace{\frac{\nu}{2}\,\|w\|_2^2}_{\text{Regularization}}
The parameter \alpha (controlling the sparsity of the features) can be set to 1 without loss of generality, since the objective function satisfies

\sum_{i=1}^{m} L\big(y_i\, \alpha w^T h_1(D^T x_i / \alpha)\big) + \frac{\nu}{2}\|w\|_2^2 \;=\; \sum_{i=1}^{m} L\big(y_i\, \tilde{w}^T h_1(\tilde{D}^T x_i)\big) + \frac{\nu'}{2}\|\tilde{w}\|_2^2.
The learning problem is nonconvex and difficult to solve in general.
Problem formulation (Cont’d)
- We rewrite the problem in an appropriate form for optimization.
- Change of variables: u_j = |w_j|\, d_j, \quad v_j = |w_j|, \quad s_j = \mathrm{sgn}(w_j).
- Then, the learning problem is equivalent to:
  \underset{U, v, s}{\arg\min} \; \sum_{i=1}^{m} L\Big(y_i \sum_{j=1}^{N} s_j \max\big(0,\, u_j^T x_i - v_j\big)\Big) + \frac{\nu}{2}\,\|v\|_2^2, \quad \text{subject to } v > 0.
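The equivalence follows from the positive homogeneity of \max(0, \cdot): writing w_j = s_j v_j,

w_j \max(0,\, d_j^T x_i - 1) = s_j v_j \max(0,\, d_j^T x_i - 1) = s_j \max(0,\, v_j d_j^T x_i - v_j) = s_j \max(0,\, u_j^T x_i - v_j),

since u_j = |w_j|\, d_j = v_j d_j.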
- Approximations:
  - s is known a priori: the proportion of class 1 and class -1 atoms in the desired dictionary is known.
  - We replace \max(0, z) by a smooth approximation q(z) (shown in the figure below).
  (P): \quad \underset{U, v}{\arg\min} \; \sum_{i=1}^{m} L\Big(y_i \sum_{j=1}^{N} s_j\, q\big(u_j^T x_i - v_j\big)\Big) + \frac{\nu}{2}\,\|v\|_2^2, \quad \text{subject to } v \ge \epsilon.

[Figure: the smooth approximation q(z) of \max(0, z), plotted for z \in [-2, 2].]
Difference-of-convex (DC) decomposition
- A function f is called DC if it can be written in the form f = g - h, where g and h are convex functions. Note that the decomposition is not unique!
- One can show that, for any convex loss function L and any convex function q, problem (P) is a DC program.
- Moreover, when L is the hinge loss, we have the following decomposition:
  g = \frac{\nu}{2}\,\|v\|_2^2 + \sum_{i=1}^{m} \max\Big( \sum_{j:\, s_j = y_i} q\big(u_j^T x_i - v_j\big),\;\; 1 + \sum_{j:\, s_j \ne y_i} q\big(u_j^T x_i - v_j\big) \Big),

  h = \sum_{i=1}^{m} \sum_{j:\, s_j = y_i} q\big(u_j^T x_i - v_j\big).
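To check this decomposition, write A_i = \sum_{j: s_j = y_i} q(u_j^T x_i - v_j) and B_i = \sum_{j: s_j \ne y_i} q(u_j^T x_i - v_j), so that y_i \sum_j s_j q(u_j^T x_i - v_j) = A_i - B_i. For the hinge loss L(z) = \max(0, 1 - z),

L(A_i - B_i) = \max(0,\; 1 - A_i + B_i) = \max(A_i,\; 1 + B_i) - A_i,

and summing over i (with the regularizer attached to the first term) gives exactly g - h. Both g and h are convex whenever q is convex, since sums and pointwise maxima of convex functions are convex.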
Learning Algorithm for Soft-Thresholding classifier (LAST)
- The objective function of (P) can be written as g(U, v) - h(U, v).
- We solve (P) by solving a sequence of convex programs: the concave part -h is linearized around the current point, the resulting convex problem is solved, and this step is iterated [Tao and An, '98].
Algorithm 1: LAST (Learning Algorithm for Soft-Thresholding classifier)
  1. Choose any initial point U^0 and v^0 \ge \epsilon.
  2. For k = 0, \dots, K - 1:
     2.1 Compute (A, b) = \nabla h(U^k, v^k).
     2.2 Solve the convex optimization problem:
         (U^{k+1}, v^{k+1}) \leftarrow \underset{(U, v)}{\arg\min}\; \big\{ g(U, v) - \mathrm{Tr}(U^T A) - v^T b \big\} \quad \text{subject to } v \ge \epsilon.
     2.3 If (U^{k+1}, v^{k+1}) \approx (U^k, v^k), return (U^{k+1}, v^{k+1}).
- For large-scale problems, we use a stochastic gradient descent algorithm to solve the convex problem at each iteration of LAST (a schematic sketch of the overall procedure follows).
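To make the structure concrete, below is a schematic numpy sketch of such a DC (convex-concave) iteration on problem (P). It is not the authors' implementation: the softplus choice for q, the projected-subgradient inner solver, the initialization, and all step sizes and iteration counts are illustrative assumptions.

import numpy as np

def q(z, beta=10.0):
    # smooth convex surrogate of max(0, z) (softplus); the paper's exact choice of q may differ
    return np.logaddexp(0.0, beta * z) / beta

def dq(z, beta=10.0):
    # derivative of q (logistic sigmoid)
    return 1.0 / (1.0 + np.exp(-beta * z))

def grad_h(U, v, X, y, s):
    # gradient of h(U, v) = sum_i sum_{j: s_j = y_i} q(u_j^T x_i - v_j)
    G = dq(X @ U - v) * (s[None, :] == y[:, None])
    return X.T @ G, -G.sum(axis=0)

def subgrad_g(U, v, X, y, s, nu):
    # subgradient of g(U, v) = (nu/2)||v||^2 + sum_i max(A_i, 1 + B_i):
    # for each sample, take the gradient of the active branch of the max
    Z = X @ U - v
    same = (s[None, :] == y[:, None])
    A = np.sum(q(Z) * same, axis=1)
    B = np.sum(q(Z) * ~same, axis=1)
    mask = np.where((1.0 + B > A)[:, None], ~same, same)
    G = dq(Z) * mask
    return X.T @ G, nu * v - G.sum(axis=0)

def last_dca(X, y, s, nu=1e-2, eps=1e-3, outer=10, inner=200, lr=1e-2, seed=0):
    # X: (m, n) training points as rows; y: (m,) labels in {-1, +1}; s: (N,) atom signs in {-1, +1}
    n, N = X.shape[1], len(s)
    rng = np.random.default_rng(seed)
    U, v = 0.1 * rng.standard_normal((n, N)), np.ones(N)
    for _ in range(outer):
        A_lin, b_lin = grad_h(U, v, X, y, s)            # linearize the concave part -h (step 2.1)
        for _ in range(inner):                          # approximately solve the convex subproblem
            gU, gv = subgrad_g(U, v, X, y, s, nu)       # (step 2.2) by projected subgradient steps
            U = U - lr * (gU - A_lin)
            v = np.maximum(eps, v - lr * (gv - b_lin))  # projection onto {v >= eps}
    return U, v

Each outer pass mirrors Algorithm 1: compute the gradient of h at the current iterate, then minimize the convexified objective g(U, v) - Tr(U^T A) - v^T b under the constraint v \ge \epsilon.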
Experimental setup
- We perform experiments on textures, digits, and natural images.
- Extension to the multi-class setting using a one-vs-all approach.
- Experimental validation in two steps:
  1. Learning algorithm: we fix the scheme, and compare our learning algorithm to other learning methods (random samples, K-means, reconstructive dictionary learning, stochastic gradient descent).
[Diagram: the classification scheme x → h_\alpha(D^T x) → w^T → Label, with question marks over the parts whose training is being compared.]
  2. Whole classifier: comparison of the soft-thresholding based classifier, where (D, w) are trained using LAST, to linear SVM, RBF SVM, sparse coding classifiers, and deep rectifier networks.
Evaluation of the learning algorithm
[Figure: classification accuracy vs. dictionary size on two binary tasks: a texture classification task (textures from 32 Brodatz) and a deer vs. horse classification task (images from CIFAR-10). Compared learning methods: supervised random samples, supervised K-means, dictionary learning for \ell_1 sparse coding, SGD, and LAST.]

[Figure: 2-D features (\max(0, d_1^T x - 1), \max(0, d_2^T x - 1)) of class 1 and class 2 test points, for dictionaries of size N = 2 obtained with K-means and with LAST.]
Classification performance on binary datasets
Texture classification task (accuracy %):

  Method                    Task 1 [%]   Task 2 [%]
  Linear SVM                49.5         49.1
  RBF kernel SVM            98.5         90.1
  Sparse coding (N = 50)    97.5         85.5
  Sparse coding (N = 400)   98.1         90.9
  NN (N = 50)               94.3         84.1
  NN (N = 400)              97.8         86.6
  LAST (N = 50)             98.7         87.3
  LAST (N = 400)            98.6         93.5

Table 1: Classification accuracy for binary texture classification tasks.

Deer vs. horse classification task (accuracy %):

  Method                    "deer" vs. "horse" [%]
  Linear SVM                72.6
  RBF kernel SVM            83.5
  Sparse coding (N = 10)    70.6
  Sparse coding (N = 100)   76.2
  NN (N = 10)               67.7
  NN (N = 100)              70.9
  LAST (N = 10)             80.1
  LAST (N = 100)            82.8

Table 2: Classification accuracy on the binary problem "deer" vs. "horse".
- On these examples, the proposed algorithm gives comparable (or better) performance with respect to RBF SVM and sparse coding.
- LAST is, however, much faster at test time!
Classification performance in multi-class scenarios
We extend our classifier to multi-class scenarios using a one-vs-all approach.
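Concretely, one binary soft-thresholding classifier (D_k, w_k) is trained per class, and a test point is assigned to the class with the largest score. A minimal sketch of this rule (names are ours):

import numpy as np

def predict_one_vs_all(x, dicts, weights, alpha=1.0):
    # dicts[k], weights[k]: dictionary and linear classifier of the binary "class k vs. rest" problem
    scores = [w @ np.maximum(0.0, D.T @ x - alpha) for D, w in zip(dicts, weights)]
    return int(np.argmax(scores))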
Digits datasets (MNIST and USPS), error rate (%):

  Method                                     MNIST   USPS
  Linear SVM                                 8.19    9.07
  RBF kernel SVM                             1.4     4.2
  K-NN ℓ2                                    5.0     5.2
  LAST                                       1.32    4.53
  Dictionary learning methods (sparse coding mapping):
  Sparse coding                              3.0     5.33
  Huang and Aviyente (2006)                  -       6.05
  SDL-G L (Mairal et al., 2008)              3.56    6.67
  SDL-D L (Mairal et al., 2008)              1.05    3.54
  Ramirez et al. (2010)                      1.26    3.98
  Neural network methods:
  SGD                                        2.22    5.88
  3 layers ReLU net (Glorot et al., 2011)    1.43    -

Table 1: Classification error (percentage) on the MNIST and USPS datasets.

Natural images (CIFAR-10), error rate (%):

  Method                                  CIFAR-10
  Linear SVM                              59.70
  LAST (N = 400)                          46.56
  SGD (N = 400)                           52.96
  3 layers ReLU net                       50.86
  3 layers ReLU net + sup. pre-train      49.96

Table 2: Classification error (percentage) on the CIFAR-10 dataset.
- LAST gives results comparable to recent dictionary learning methods.
- The comparison with neural networks shows the advantage of our DC-based optimization.
Complexity and running time
  Method            Complexity                Time [s] (10,000 MNIST test samples)
  Linear SVM        O(n)                      0.4
  RBF kernel SVM    O(nm)                     154
  Sparse coding     O(nN / \sqrt{\epsilon})   14
  LAST classifier   O(nN)                     1.0

(Here n denotes the signal dimension, m the number of training samples, N the dictionary size, and \epsilon the sparse coding precision.)
Table 1: Computational complexity for classifying one test sample, and time needed to predict the labels of the 10,000 test samples in the MNIST dataset.
For reference, all the experiments are carried out on a 2.6 GHz Intel Core i7 machine with 16 GB RAM.
Conclusions
- A supervised dictionary learning algorithm tailored for the soft-thresholding based classifier.
- The proposed algorithm leverages the DC structure of the problem, and outperforms more 'generic' learning procedures (e.g., SGD).
- Despite the simplicity of the classification scheme, LAST provides results comparable to sparse coding methods, while being much faster at test time.
For more info, check our paper: "Dictionary learning for fast classification based on soft-thresholding," to appear in the International Journal of Computer Vision (IJCV). (Available on arXiv.)