ICSEng '14 International Conference on Systems Engineering (DRAFT VERSION)
Accelerated learning for Restricted Boltzmann
Machine with momentum term
Szymon Zaręba, Adam Gonczarek, Jakub M. Tomczak, Jerzy Świątek
Institute of Computer Science, Wroclaw University of Technology
Wyb. Wyspiańskiego 27, 50-370, Wrocław
Email:{szymon.zareba,jakub.tomczak,adam.gonczarek,jerzy.swiatek}@pwr.wroc.pl
Abstract. Restricted Boltzmann Machines are generative models which
can be used as standalone feature extractors or as a parameter initialization for deeper models. Typically, these models are trained using the Contrastive Divergence algorithm, an approximation of the stochastic gradient descent method. In this paper, we aim at speeding up the convergence of the learning procedure by applying the momentum method and the Nesterov's accelerated gradient technique. We evaluate these two techniques empirically on the MNIST image dataset.
Keywords: Deep learning, Contrastive Divergence, stochastic gradient
descent, Nesterov’s momentum
1 Introduction
Deep learning has recently become a field of interest due to its ability to automatically extract high-level features [1]. Deep architectures are powerful models that achieve high performance on difficult pattern recognition problems, such as image analysis [2], motion tracking [3], speech recognition [4], and other applications, e.g., collaborative filtering [5] and text analysis [6].
Typically, the building blocks of deep models are Restricted Boltzmann Machines (RBM). RBM are generative models with hidden variables which aim at modelling the distribution of visible variables. In the case of classification problems, RBM are used as standalone feature extractors or as a parameter initialization for deeper models. However, when RBM are trained in an unsupervised fashion, there is no guarantee that they provide discriminative features. To address this problem, information about the class label can be incorporated into the RBM, which leads to the Classification Restricted Boltzmann Machine (ClassRBM) [12].
Although the representational power of deep models, including RBM, is very tempting, their broad applicability was limited until recently because of the difficulty of learning deep architectures. Deep learning became a topic of high interest thanks to the breakthrough learning algorithm called Contrastive Divergence [13]. Contrastive Divergence makes it possible to perform an efficient, approximate stochastic gradient descent learning procedure. Since Contrastive Divergence serves as an elemental learning procedure, different techniques can be applied on top of it in order to speed up the convergence or to obtain more reliable estimates. The most common approach is to modify the learning objective by adding an extra regularization term, e.g., an ℓ2 regularizer (weight decay) to keep the parameters below some threshold, or a sparsity regularizer to enforce sparse activation of hidden units [14,15,16]. Another regularization procedure is based on randomly dropping a subset of hidden units (or, alternatively, a subset of weight parameters) during one updating epoch, which leads to techniques such as dropout and its extensions [17,18,19,20]. A different approach is to apply optimization techniques, e.g., the momentum method [15] or the Nesterov's accelerated gradient (Nesterov's momentum) [21].
In this paper, we aim at accelerating learning of RBM by applying the momentum term and the Nesterov’s momentum. We formulate the following research questions:
Q1: Does the application of the momentum term or the Nesterov’s momentum
speed up the convergence of learning?
Q2: Does the application of the momentum term or the Nesterov’s momentum
increase the classification accuracy?
Q3: Does the Nesterov’s momentum perform better than the momentum term?
In order to answer these questions, we carry out experiments on the MNIST image dataset.
2 Classification using Restricted Boltzmann Machine
Let x ∈ {0, 1}^D be the D-dimensional vector of visible variables, h ∈ {0, 1}^M be the M-dimensional vector of hidden variables, and y be the vector coding the class label using the 1-of-K coding scheme, i.e., y_k = 1 iff observation x belongs to class k, where k = 1, . . . , K. The joint dependency between the observed and hidden variables, (x, y, h), is described by the following energy function:
E(x, y, h|θ) = −b^T x − c^T h − d^T y − h^T Wx − h^T Uy    (1)
where x_i, y_k and h_j are the binary state of visible unit i, the k-th entry of the vector coding the class label, and the binary state of hidden unit j, respectively. Further, b_i, d_k and c_j denote the bias parameters associated with x_i, y_k and h_j, respectively. Finally, W_{ji} is the weight parameter modeling the relationship between x_i and h_j, whereas U_{jk} is the weight between h_j and y_k. For brevity, let θ = {b, c, d, W, U} denote the model parameters. The ClassRBM defines the joint probability distribution over the observed and hidden variables as follows:
p(x, y, h|θ) = (1/Z) e^{−E(x, y, h|θ)},    (2)

where Z is the partition function obtained by summing out all possible configurations of the visible, label, and hidden variables, Z = Σ_x Σ_y Σ_h e^{−E(x, y, h|θ)}.
The visible, hidden, and label variables are conditionally independent given the other variables, and the conditional probabilities can be written as follows:

p(x_i = 1|h, θ) = sigm(b_i + h^T W_{·i})    (3)

p(h_j = 1|x, y, θ) = sigm(c_j + W_{j·} x + U_{jl})    (4)

where sigm(·) is the logistic sigmoid function, and W_{j·} and W_{·i} denote the j-th row and the i-th column of the matrix W, respectively. Finally, l denotes the index such that l = {k : y_k = 1}.
It turns out that calculation of the exact form of the conditional probability
distribution p(y|x, θ) is tractable:
p(y_l = 1|x, θ) = exp(d_l) ∏_{j=1}^{M} (1 + exp(c_j + W_{j·} x + U_{jl})) / [ Σ_{k=1}^{K} exp(d_k) ∏_{j=1}^{M} (1 + exp(c_j + W_{j·} x + U_{jk})) ]    (5)
In the remainder of the paper, learning the joint distribution p(x, y|θ) of the ClassRBM is referred to as the generative ClassRBM, while learning the conditional distribution p(y|x, θ) is referred to as the discriminative ClassRBM.
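For concreteness, a minimal NumPy sketch of Eq. (5) is given below. The function name and array shapes are our own illustrative choices (they are not part of the paper), and the products over hidden units are evaluated in the log domain (softplus plus a log-sum-exp shift) so that large activations do not overflow.

```python
import numpy as np

def predict_proba(x, W, U, c, d):
    """Class posterior p(y_k = 1 | x) of a ClassRBM, following Eq. (5).

    Shapes (our own convention): x (D,), W (M, D), U (M, K), c (M,), d (K,).
    """
    a = c + W @ x                                                    # (M,) activations without the label term
    log_scores = d + np.logaddexp(0.0, a[:, None] + U).sum(axis=0)   # (K,) log numerators of Eq. (5)
    log_scores -= log_scores.max()                                   # stabilise before exponentiating
    p = np.exp(log_scores)
    return p / p.sum()                                               # normalise over the K classes
```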
3 Learning
Let D be the training set containing N observation-label pairs, D = {(x_n, y_n)}.
In generative learning we aim at minimizing the following negative log-likelihood
function:
L_G(θ) = − Σ_{n=1}^{N} ln p(x_n, y_n|θ) = − Σ_{n=1}^{N} ln p(y_n|x_n, θ) − Σ_{n=1}^{N} ln p(x_n|θ)    (6)
Therefore, the generative learning objective consists of two components, namely, the supervised objective Σ_{n=1}^{N} ln p(y_n|x_n, θ), where we fit the parameters to predict the label given the observation, and the unsupervised objective Σ_{n=1}^{N} ln p(x_n|θ), which can be seen as a data-dependent regularizer.
Hence, omitting the last component in (6) leads to the following objective:
L_D(θ) = − Σ_{n=1}^{N} ln p(y_n|x_n, θ).    (7)
Notice that we obtain the negative conditional log-likelihood, which is the natural objective in discriminative learning.
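Given a routine for p(y|x, θ) such as the predict_proba sketch above, the objective (7) is simply a sum of negative log class posteriors. A small illustrative sketch, with hypothetical names:

```python
import numpy as np

def discriminative_loss(X, Y, W, U, c, d):
    """Negative conditional log-likelihood L_D(theta) of Eq. (7).

    X: (N, D) binary observations, Y: (N,) integer class labels in {0, ..., K-1}.
    predict_proba() is the Eq. (5) sketch from Section 2.
    """
    loss = 0.0
    for x, y in zip(X, Y):
        p = predict_proba(x, W, U, c, d)
        loss -= np.log(p[y] + 1e-12)   # small constant guards against log(0)
    return loss
```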
4 Classical momentum term and Nesterov's momentum
Applying the classical momentum term to modify the search direction is a simple method for increasing the rate of convergence [15]. In general, the idea is to modify the update step of a gradient-based method by adding a fraction of the previous update (the velocity) to the current step:

v(new) = αv(old) − η∆(θ),    (8)

where ∆(θ) is the step computed at the current parameter values, v denotes the velocity, i.e., the accumulated change of the parameters, η is the learning rate, and α is the momentum parameter determining how strongly the previous velocity influences the new one.
Recently, it has been shown that a modified version of the momentum term, based on the Nesterov's accelerated gradient technique, gives even better results [21]. The Nesterov's accelerated gradient (henceforth called Nesterov's momentum) is based on the idea of evaluating the step at the parameters shifted by the accumulated velocity (a look-ahead point), rather than at the current parameters. Such an approach helps to avoid instabilities and to respond more quickly to an inappropriately chosen direction. The Nesterov's momentum is calculated as follows:
v(new) = αv(old) − η∆(θ + αv(old)).    (9)
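To make the difference between (8) and (9) concrete, a minimal sketch of both update rules follows; here delta_fn stands for any routine returning the step ∆(θ), e.g., a Contrastive Divergence estimate (Section 4.1) or the exact gradient (Section 4.2), and all names are our own.

```python
def momentum_step(theta, v, delta_fn, eta, alpha, nesterov=False):
    """One update with the classical momentum (8) or the Nesterov's momentum (9).

    theta, v : parameter and velocity arrays of the same shape.
    delta_fn : callable returning the step Delta(theta).
    """
    if nesterov:
        # Eq. (9): evaluate the step at the look-ahead point theta + alpha * v.
        v_new = alpha * v - eta * delta_fn(theta + alpha * v)
    else:
        # Eq. (8): evaluate the step at the current parameters.
        v_new = alpha * v - eta * delta_fn(theta)
    theta_new = theta + v_new      # parameters then move by the new velocity,
    return theta_new, v_new        # as in the final steps of Algorithms 1 and 2 below
```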
4.1 Learning generative ClassRBM
In ClassRBM learning one optimizes the objective function (6) with respect to the parameters θ = {W, U, b, c, d}. However, gradient-based optimization methods cannot be applied directly because the exact gradient calculation is intractable. Fortunately, we can adopt the Contrastive Divergence algorithm, which approximates the exact gradient using sampling methods.
In fact, the Contrastive Divergence algorithm aims at minimizing the difference between two Kullback-Leibler divergences [14]:

KL(Q||P) − KL(P_τ||P),    (10)

where Q is the empirical probability distribution, P_τ is the probability distribution over the visible variables after τ steps of the Gibbs sampler, and P is the true distribution. The approximation error vanishes for τ → ∞, i.e., when P_τ becomes the stationary distribution and KL(P_τ||P) = 0 with probability 1. Therefore, it would be beneficial to choose a large value of τ; however, it has been observed that τ = 1 works well in practice [1]. The Contrastive Divergence procedure for the generative ClassRBM is presented in Algorithm 1.
Algorithm 1: Contrastive Divergence algorithm for generative ClassRBM

Input: data D, learning rate η, momentum parameter α
Output: parameters θ

for each example (x, y) do
  1. Generate samples using Gibbs sampling:
     Set (x(0), y(0)) := (x, y).
     Calculate probabilities ĥ(0), where ĥ_j(0) := p(h_j = 1|x(0), y(0), θ^temp).
     Generate sample h̄(0) for given probabilities ĥ(0).
     for t = 0 to τ − 1 do
        Calculate probabilities x̂(t+1), where x̂_i(t+1) := p(x_i = 1|h̄(t), θ^temp).
        Generate sample x̄(t+1) for given probabilities x̂(t+1).
        Calculate probabilities ŷ(t+1), where ŷ_k(t+1) := p(y_k = 1|x̄(t), θ^temp).
        Generate sample ȳ(t+1) for given probabilities ŷ(t+1).
        Calculate probabilities ĥ(t+1), where ĥ_j(t+1) := p(h_j = 1|x̂(t+1), ŷ(t+1), θ^temp).
        Generate sample h̄(t+1) for given probabilities ĥ(t+1).
     end for
  2. Compute step ∆(θ^temp). In particular:
     ∆b := x(0) − x̄(τ)
     ∆c := ĥ(0) − ĥ(τ)
     ∆d := y(0) − ȳ(τ)
     ∆W := ĥ(0) x(0)^T − ĥ(τ) x̄(τ)^T
     ∆U := ĥ(0) y(0)^T − ĥ(τ) ȳ(τ)^T
  3. Update momentum term v(new) := αv(old) − η∆(θ^temp) and set v(old) := v(new).
  4. Update parameters:
     θ := θ + v(new),
     and set θ^temp := θ in case of the classical momentum (8) or θ^temp := θ + αv(old) in case of the Nesterov's momentum (9).
end for
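A compact NumPy sketch of a single CD-1 step (τ = 1, one training example, classical momentum) is given below. It follows Algorithm 1 with the following caveats: all names and array shapes are our own; the label is reconstructed from the freshly sampled visible vector via Eq. (5) (the predict_proba sketch of Section 2), which is one common choice; and since the listed CD statistics are positive-phase minus negative-phase quantities, i.e., an ascent direction for the log-likelihood, the sketch adds them to the velocity.

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(x, y, theta, v, eta, alpha, rng):
    """One CD-1 update of the generative ClassRBM with the classical momentum term.

    x: (D,) binary input, y: (K,) one-hot label, rng: numpy random Generator.
    theta, v: dicts with 'W' (M, D), 'U' (M, K), 'b' (D,), 'c' (M,), 'd' (K,);
    v holds the velocities (initialised to zeros) and is updated in place.
    """
    W, U, b, c, d = (theta[k] for k in ('W', 'U', 'b', 'c', 'd'))

    # Positive phase: hidden probabilities and a sample given the data (x, y).
    h0_prob = sigm(c + W @ x + U @ y)
    h0_samp = (rng.random(h0_prob.shape) < h0_prob).astype(float)

    # Negative phase: one Gibbs step (tau = 1).
    x1_prob = sigm(b + W.T @ h0_samp)
    x1_samp = (rng.random(x1_prob.shape) < x1_prob).astype(float)
    y1_prob = predict_proba(x1_samp, W, U, c, d)      # label reconstruction via Eq. (5)
    y1_samp = np.eye(len(d))[rng.choice(len(d), p=y1_prob)]
    h1_prob = sigm(c + W @ x1_prob + U @ y1_prob)     # mean-field h, as in Algorithm 1

    # CD statistics (positive phase minus negative phase) and momentum update.
    grad = {'b': x - x1_samp,
            'c': h0_prob - h1_prob,
            'd': y - y1_samp,
            'W': np.outer(h0_prob, x) - np.outer(h1_prob, x1_samp),
            'U': np.outer(h0_prob, y) - np.outer(h1_prob, y1_samp)}
    for k in theta:
        v[k] = alpha * v[k] + eta * grad[k]           # ascent on the log-likelihood
        theta[k] = theta[k] + v[k]
    return theta, v
```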
4.2 Learning discriminative ClassRBM
In discriminative learning we need to minimize the objective (7) with respect to the parameters θ = {W, U, b, c, d}. Unlike in the generative case, here the gradient can be calculated analytically. Hence, the parameters of the discriminative ClassRBM can be determined using the stochastic gradient descent algorithm. The learning procedure for the discriminative case is presented in Algorithm 2.
Algorithm 2: Stochastic gradient algorithm for discriminative ClassRBM

Input: data D, learning rate η, momentum parameter α
Output: parameters θ

for each example (x, y) do
  1. Compute step ∆(θ^temp). In particular:
     Set l := {k : y_k = 1}.
     Calculate probabilities ŷ, where ŷ_k := p(y_k = 1|x, θ^temp).
     Calculate σ, where σ_jk = sigm(c_j^temp + W_{j·}^temp x + U_{jk}^temp).
     ∆c := Σ_{k=1}^{K} ŷ_k σ_{·k} − ŷ_l 1
     ∆d := ŷ − y
     ∆W := Σ_{k=1}^{K} ŷ_k σ_{·k} x^T − σ_{·l} x^T
     ∆U := σ_{·l} (ŷ − 1)^T
  2. Update momentum term v(new) := αv(old) − η∆(θ^temp) and set v(old) := v(new).
  3. Update parameters:
     θ := θ + v(new),
     and set θ^temp := θ in case of the classical momentum (8) or θ^temp := θ + αv(old) in case of the Nesterov's momentum (9).
end for
5 Experiments

5.1 Details
Dataset. We evaluate the presented learning techniques on the MNIST image dataset (http://yann.lecun.com/exdb/mnist/). The MNIST dataset contains 50,000 training, 10,000 validation, and 10,000 test images (28 × 28 pixels) of ten hand-written digits (from 0 to 9).
Learning details. We performed learning using Contrastive Divergence with mini-batches of 10 examples. Both the generative and the discriminative ClassRBM were trained with and without the classical momentum and the Nesterov's momentum. The learning rate was set to η = 0.005 and η = 0.05 for the generative and the discriminative ClassRBM, respectively. The momentum parameter was set to α ∈ {0.5, 0.9}. The number of iterations over the training set was determined using early stopping according to the validation set classification error, with a look-ahead of 15 iterations. Each experiment was repeated five times.
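The early-stopping rule with a look-ahead can be summarized by the following sketch; all names are illustrative, update_fn stands for one mini-batch update (e.g., CD or SGD with momentum), and valid_error_fn for the validation-set classification error.

```python
def train_with_early_stopping(train_batches, update_fn, valid_error_fn,
                              theta, v, look_ahead=15):
    """Iterate over the training set until the validation error has not
    improved for `look_ahead` consecutive iterations (epochs)."""
    best_err, best_theta, since_best = float('inf'), None, 0
    while since_best < look_ahead:
        for batch in train_batches:                   # mini-batches of 10 examples
            theta, v = update_fn(batch, theta, v)
        err = valid_error_fn(theta)
        if err < best_err:                            # keep the best model seen so far
            best_err = err
            best_theta = {k: p.copy() for k, p in theta.items()}
            since_best = 0
        else:
            since_best += 1
    return best_theta, best_err
```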
Evaluation methodology. We use the classification accuracy as the evaluation metric. We compare learning with Contrastive Divergence (CD) and stochastic gradient descent (SGD) alone against learning with the classical momentum term (CM) and the Nesterov's momentum (N). The classification accuracy (expressed in %) is calculated on the test set.
5.2 Results and Discussion
The results (mean values and standard deviations over five repetitions) for the generative ClassRBM are presented in Table 1, and for the discriminative ClassRBM in Table 2. The mean classification accuracy over training epochs is shown in Figure 1 for the generative ClassRBM and in Figure 2 for the discriminative ClassRBM; in both figures the number of hidden units was equal to 900.
We notice that the application of the momentum term and the Nesterov's momentum indeed increases the speed of convergence (see Figures 1 and 2). This phenomenon is especially noticeable in the case of the discriminative ClassRBM with α = 0.9 (Figure 2). The Nesterov's momentum performs slightly better than the momentum term in terms of the speed of convergence only during the first epochs (see Figure 2).
On the other hand, for a larger number of hidden units, i.e., for M larger than 400 in the case of the generative ClassRBM and M larger than 100 for the discriminative ClassRBM, the application of the momentum term and the Nesterov's momentum resulted in an increase of the classification accuracy and in more stable outcomes (smaller standard deviations). Moreover, the Nesterov's momentum is slightly more robust than the classical momentum in terms of the standard deviations (see Tables 1 and 2). This effect can be explained in the following way. The Nesterov's momentum first makes a partial update of the parameters and only then computes the gradient of the objective with respect to the partially updated parameters. Such a procedure allows the velocity to be corrected more quickly if an undesirable increase in the objective occurs. Our result is another empirical confirmation of this phenomenon, previously reported in [21].
Table 1: Classification accuracy (in %) for the generative ClassRBM with the Contrastive Divergence (CD), the classical momentum term (CM), and the Nesterov's momentum (N); mean and standard deviation over five repetitions.

Hidden units |     CD      | CM α = 0.5  | CM α = 0.9  |  N α = 0.5  |  N α = 0.9
             | mean   std  | mean   std  | mean   std  | mean   std  | mean   std
      9      | 54.79  4.15 | 60.79  4.47 | 63.60  1.33 | 63.69  2.09 | 63.04  3.13
     25      | 81.21  0.33 | 80.32  1.26 | 82.21  1.01 | 80.51  0.48 | 81.76  0.60
    100      | 90.81  0.36 | 90.74  0.32 | 90.70  0.48 | 90.67  0.30 | 91.11  0.50
    400      | 94.02  0.29 | 94.34  0.27 | 94.51  0.35 | 94.54  0.20 | 94.61  0.23
    900      | 95.08  0.23 | 95.62  0.08 | 95.53  0.42 | 95.45  0.19 | 95.25  0.23
Table 2: Classification accuracy (in %) for the discriminative ClassRBM with the stochastic gradient descent (SGD), the classical momentum term (CM), and the Nesterov's momentum (N); mean and standard deviation over five repetitions.

Hidden units |    SGD      | CM α = 0.5  | CM α = 0.9  |  N α = 0.5  |  N α = 0.9
             | mean   std  | mean   std  | mean   std  | mean   std  | mean   std
      9      | 92.39  0.14 | 92.34  0.26 | 91.91  0.32 | 92.30  0.46 | 91.41  0.35
     25      | 95.74  0.32 | 95.49  0.36 | 95.05  0.14 | 95.51  0.29 | 95.15  0.09
    100      | 97.36  0.09 | 97.35  0.17 | 97.46  0.05 | 97.40  0.06 | 97.67  0.11
    400      | 97.63  0.07 | 97.67  0.02 | 97.96  0.05 | 97.77  0.10 | 97.86  0.06
    900      | 97.52  0.11 | 97.72  0.01 | 97.85  0.14 | 97.73  0.04 | 97.82  0.02
Fig. 1: Convergence of the generative ClassRBM with the Contrastive Divergence
(CD), and with the classical momentum term (CM) and Nesterov’s momentum
(N) with α = 0.5 and α = 0.9 for the classification accuracy measured on the
test set.
Fig. 2: Convergence of the discriminative ClassRBM with the stochastic gradient
descent (SGD), and with the classical momentum term (CM) and Nesterov’s
momentum (N) with α = 0.5 and α = 0.9 for the classification accuracy measured on the test set.
6 Conclusions
In this paper, we have presented accelerated learning of the Restricted Boltzmann Machine with the classical momentum term and the Nesterov's momentum. We have applied the outlined learning procedures to the generative and the discriminative Classification Restricted Boltzmann Machine. In order to evaluate the approach and answer the stated research questions, we have performed experiments on the MNIST image dataset for different numbers of hidden units. The obtained results show that the application of the momentum term and the Nesterov's momentum indeed accelerates the convergence of learning and increases the classification accuracy. However, our comparative analysis does not indicate the superiority of the Nesterov's momentum over the classical momentum term; the two techniques behave alike.
References
1. Bengio, Y.: Learning deep architectures for AI. Foundations and Trends in Machine
Learning 2(1) (2009) 1–127
2. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with
neural networks. Science 313(5786) (2006) 504–507
3. Taylor, G.W., Hinton, G.E., Roweis, S.T.: Modeling human motion using binary
latent variables. In Schölkopf, B., Platt, J.C., Hoffman, T., eds.: NIPS, MIT Press
(2006) 1345–1352
4. Mohamed, A.R., Hinton, G.E.: Phone recognition using Restricted Boltzmann Machines. In: ICASSP, IEEE (2010) 4354–4357
5. Salakhutdinov, R., Mnih, A., Hinton, G.E.: Restricted Boltzmann Machines for collaborative filtering. In Ghahramani, Z., ed.: ICML. Volume 227 of ACM International Conference Proceeding Series., ACM (2007) 791–798
6. Salakhutdinov, R., Hinton, G.E.: Replicated softmax: an undirected topic model.
In Bengio, Y., Schuurmans, D., Lafferty, J.D., Williams, C.K.I., Culotta, A., eds.:
NIPS, Curran Associates, Inc. (2009) 1607–1614
7. Neapolitan, R.E.: Probabilistic reasoning in expert systems - theory and algorithms. Wiley (1990)
8. Pearl, J.: Probabilistic reasoning in intelligent systems - networks of plausible inference. Morgan Kaufmann series in representation and reasoning. Morgan Kaufmann
(1989)
9. Hopfield, J.J.: Neural networks and physical systems with emergent collective
computational abilities. Proceedings of the National Academy of Sciences of the
United States of America 79(8) (1982) 2554–2558
10. Hopfield, J.J.: The effectiveness of neural computing. In: IFIP Congress. (1989)
503–507
11. Ackley, D.H., Hinton, G.E., Sejnowski, T.J.: A learning algorithm for Boltzmann
Machines. Cognitive Science 9(1) (1985) 147–169
12. Larochelle, H., Bengio, Y.: Classification using discriminative restricted boltzmann
machines. In Cohen, W.W., McCallum, A., Roweis, S.T., eds.: ICML. Volume 307
of ACM International Conference Proceeding Series., ACM (2008) 536–543
13. Hinton, G.E.: Training products of experts by minimizing contrastive divergence.
Neural Computation 14(8) (2002) 1771–1800
14. Fischer, A., Igel, C.: An introduction to Restricted Boltzmann Machines. In Álvarez, L., Mejail, M., Déniz, L.G., Jacobo, J.C., eds.: CIARP. Volume 7441 of Lecture Notes in Computer Science., Springer (2012) 14–36
15. Hinton, G.E.: A practical guide to training Restricted Boltzmann Machines. In: Neural Networks: Tricks of the Trade (2nd ed.). (2012) 599–619
16. Swersky, K., Chen, B., Marlin, B.M., de Freitas, N.: A tutorial on stochastic approximation algorithms for training Restricted Boltzmann Machines and deep belief nets. In: ITA, IEEE (2010) 80–89
17. Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Improving neural networks by preventing co-adaptation of feature detectors. CoRR
abs/1207.0580 (2012)
18. Wager, S., Wang, S., Liang, P.: Dropout training as adaptive regularization. CoRR
abs/1307.1493 (2013)
19. Wan, L., Zeiler, M.D., Zhang, S., LeCun, Y., Fergus, R.: Regularization of neural
networks using dropconnect. In: ICML (3). (2013) 1058–1066
20. Wang, S., Manning, C.D.: Fast dropout training. In: ICML (2). (2013) 118–126
21. Sutskever, I., Martens, J., Dahl, G.E., Hinton, G.E.: On the importance of initialization and momentum in deep learning. In: ICML (3). Volume 28 of JMLR
Proceedings., JMLR.org (2013) 1139–1147