
COMP 598 – Applied Machine Learning
Lecture 15: Neural Networks (cont’d)
Instructor: Joelle Pineau ([email protected])
TAs: Pierre-Luc Bacon ([email protected])
Angus Leigh ([email protected])
Class web page: www.cs.mcgill.ca/~jpineau/comp598
Quiz #5: SVMs
Consider learning an SVM to do binary classification on the dataset below.
1.  Draw an approximate solution for each of the following kernel types:
–  a linear kernel
–  a quadratic kernel
–  a Gaussian kernel
2.  For the linear kernel, circle the 3 points that are the support vectors.
3.  For the linear kernel, what is the equation corresponding to the decision boundary?
Feed-forward neural networks
Notation:
•  w_ji denotes the weight on the connection from unit i to unit j.
•  By convention, x_j0 = 1, ∀j.
•  The output of unit j, denoted o_j, is computed using a sigmoid:
      o_j = σ(w_j · x_j)
   where w_j is the vector of weights entering unit j and x_j is the vector of inputs to unit j.
•  By definition, x_ji = o_i.
Given an input, how do we compute the output? How do we train the weights?
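As a concrete illustration of the forward pass (how the output is computed from an input), here is a minimal sketch in NumPy for a network with one hidden layer of sigmoid units and a single sigmoid output unit. The array shapes and function names are illustrative assumptions, not the slide's notation.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x, W_hidden, w_output):
        """Forward propagation: o_j = sigma(w_j . x_j) at every unit.

        W_hidden: (H, N+1) weights entering the H hidden units (column 0 = bias).
        w_output: (H+1,)   weights entering the output unit (entry 0 = bias).
        """
        x_b = np.concatenate(([1.0], x))         # x_j0 = 1 by convention
        o_hidden = sigmoid(W_hidden @ x_b)       # outputs of the hidden units
        h_b = np.concatenate(([1.0], o_hidden))  # hidden outputs feed the output unit
        o = sigmoid(w_output @ h_b)
        return o, o_hidden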
Stochastic gradient descent
•  Initialization: initialize all weights to small random numbers.
•  Repeat until convergence:
–  Pick a training example.
–  Forward propagation: feed the example through the network to compute the output o = o_{N+H+1}.
–  Backpropagation: for the output unit, compute the correction; for each hidden unit h, compute its share of the correction.
–  Gradient descent: update each network weight, for h = 1, …, H.
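The correction terms were given as equations on the original slide. Below is a hedged sketch of one stochastic-gradient step for the same one-hidden-layer sigmoid network, using the standard sigmoid delta corrections and reusing forward() from the sketch above; the squared-error assumption and variable names are illustrative.

    def sgd_step(x, y, W_hidden, w_output, alpha=0.1):
        """One weight update from a single training example (x, y)."""
        o, o_hidden = forward(x, W_hidden, w_output)
        x_b = np.concatenate(([1.0], x))
        h_b = np.concatenate(([1.0], o_hidden))

        # Correction for the output unit (sigmoid output, squared error).
        delta_out = o * (1.0 - o) * (y - o)

        # Each hidden unit h gets its share of the correction.
        delta_hidden = o_hidden * (1.0 - o_hidden) * w_output[1:] * delta_out

        # Gradient-descent updates for all weights.
        w_output += alpha * delta_out * h_b
        W_hidden += alpha * np.outer(delta_hidden, x_b)
        return W_hidden, w_output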
Organizing the training data
•  Stochastic gradient descent: Compute error on a single
example at a time (as in previous slide).
•  Batch gradient descent: Compute error on all examples.
–  Loop through the training data, accumulating weight changes.
–  Update all weights and repeat.
•  Mini-batch gradient descent: Compute error on small subset.
–  Randomly select a “mini-batch” (i.e. subset of training examples).
–  Calculate error on mini-batch, apply to update weights, and repeat.
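A small sketch of how the data can be organized for the three variants above, assuming NumPy arrays X and Y; batch_size = 1 recovers stochastic gradient descent and batch_size = len(X) recovers batch gradient descent. The helper name is illustrative.

    import numpy as np

    def minibatches(X, Y, batch_size, seed=0):
        """Yield randomly selected mini-batches of training examples."""
        idx = np.random.default_rng(seed).permutation(len(X))
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            yield X[batch], Y[batch]

For each mini-batch, accumulate the weight changes over its examples, apply them once, and repeat.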
Expressiveness of feed-forward NN
A single sigmoid neuron?
•  Same representational power as a perceptron: Boolean AND, OR,
NOT, but not XOR.
A neural network with a single hidden layer?
Learning the identity function
Learning the identity function
•  Neural network structure:
•  Learned hidden layer weights:
Expressiveness of feed-forward NN
A single sigmoid neuron?
•  Same representational power as a perceptron: Boolean AND, OR,
NOT, but not XOR.
A neural network with a single hidden layer?
•  Can represent every boolean function, but might require a number
of hidden units that is exponential in the number of inputs.
•  Every bounded continuous function can be approximated with
arbitrary precision by a network with a single hidden layer.
A neural network with two hidden layers?
•  Any function can be approximated to arbitrary accuracy by a
network with two hidden layers.
Network architecture
•  Overfitting occurs if there are too many parameters compared to
the amount of data available.
•  Choosing the number of hidden units:
–  Too few hidden units do not allow the concept to be learned.
–  Too many lead to slow learning and overfitting.
–  If the m inputs are binary, log m is a good heuristic choice.
•  Choosing the number of layers
–  Always start with one hidden layer.
–  Don’t usually go beyond 2 hidden layers, unless the task structure
suggests something different.
Overtraining
•  Traditional overfitting is concerned with the number of parameters vs. the number of instances.
•  In neural networks there is an additional phenomenon called overtraining, which occurs when weights take on large magnitudes, pushing the sigmoids into saturation.
•  Effectively, as learning progresses, the network has more actual parameters.
•  Use a validation set to decide when to stop training!
•  Regularization is also very effective.
[Figure: Error versus weight updates (example 1); training set error and validation set error as a function of the number of weight updates.]
Convergence of backpropagation
•  Backpropagation = gradient descent over all parameters in the network.
•  If the learning rate is appropriate, the algorithm is guaranteed to converge to a local minimum of the cost function.
–  NOT the global minimum.
–  Can be much WORSE than the global minimum.
–  There can be MANY local minima.
–  Solution: random restarts = train multiple nets with different initial weights.
–  In practice, the solution found is often very good.
•  Training can take thousands of iterations - VERY SLOW! But using the network after training is very fast.
•  Can we find a solution faster (i.e. in fewer iterations)?
Adding momentum
•  On the t-th training sample, instead of the update:
      ∆w_ij ← α_ij δ_j x_ij
   we do:
      ∆w_ij(t) ← α_ij δ_j x_ij + β ∆w_ij(t − 1)
•  The second term is called momentum.
•  Advantages:
–  Easy to pass small local minima.
–  Keeps the weights moving in areas where the error is flat.
–  Increases the speed where the gradient stays unchanged.
•  Disadvantages:
–  With too much momentum, it can get out of a global minimum!
–  One more parameter to tune, and more chances of divergence.
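A minimal sketch of the momentum update above, assuming the backprop term α_ij δ_j x_ij has already been computed for each weight (here as the array grad_term); the 0.9 default for beta is only an illustrative choice, not a value from the slide.

    def momentum_update(w, grad_term, prev_delta, beta=0.9):
        """delta_w(t) = grad_term + beta * delta_w(t-1), then apply it to the weights."""
        delta_w = grad_term + beta * prev_delta
        return w + delta_w, delta_w

prev_delta starts at zero (same shape as w) and is carried from one update to the next.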
Optimization: Repeated Line Searches
•  Given the optimization problem:
      w* = argmin_w f(w)
   Think of f(w) as your neural network's error.
•  Consider an update: w_j = argmin_w f(w_1, w_2, …, w_{j-1}, w_j, w_{j+1}, …, w_m)
–  Optimizing a single dimension w_j, holding the rest constant.
•  Repeat for all weights.
•  Then, repeat again until all weights are simultaneously optimal.
•  This works well when the individual line searches do not interfere with one another (the condition examined on the next slide).
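To make the procedure concrete, here is a small sketch of repeated line searches on a quadratic stand-in f(w) = 0.5 wᵀAw − bᵀw, for which each one-dimensional minimization has a closed form; the quadratic objective is an assumption for illustration, not the actual network error.

    import numpy as np

    def repeated_line_searches(A, b, n_sweeps=50):
        """Minimize f(w) = 0.5 w^T A w - b^T w one coordinate at a time.

        Setting df/dw_j = 0 with the other coordinates held fixed gives
        w_j = (b_j - sum_{i != j} A_{ji} w_i) / A_{jj}.
        """
        w = np.zeros_like(b, dtype=float)
        for _ in range(n_sweeps):                 # repeat until simultaneously optimal
            for j in range(len(b)):               # one line search per weight
                w[j] = (b[j] - A[j] @ w + A[j, j] * w[j]) / A[j, j]
        return w

When the coordinates interact strongly (large off-diagonal entries in A), many sweeps are needed, which motivates the next two slides.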
Optimization: Conjugate gradient descent
•  The condition is not always true.
•  Can we find a coordinate system where it is satisfied?
•  Yes! The eigenvectors of the Hessian matrix.
•  Apply repeated line search in this coordinate system.
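Under the usual assumption that the error is locally quadratic with Hessian H, the missing condition and the eigenvector property can be written as follows (a reconstruction, not the slide's exact formulas):

    d_i^\top H \, d_j = 0 \quad (i \neq j)
    \qquad\text{and}\qquad
    H v_k = \lambda_k v_k \;\Rightarrow\; v_i^\top H v_j = \lambda_j \, v_i^\top v_j = 0 \quad (i \neq j),

i.e. line searches along the eigenvectors of the Hessian do not interfere with one another.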
Optimization: Second-order method
•  The Newton-Raphson method looks at both the first- and second-order derivatives of the error function.
•  Approximate the first-order derivative:
•  Solve for G(w_{t+1}) = 0:
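A hedged reconstruction of the two missing equations, writing G for the gradient of the error and H for its Hessian (the notation beyond G is assumed): linearize the gradient around the current weights w_t and set it to zero.

    G(w_{t+1}) \approx G(w_t) + H(w_t)\,(w_{t+1} - w_t)
    \quad\Longrightarrow\quad
    w_{t+1} = w_t - H(w_t)^{-1} G(w_t)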
Some equations taken from: http://ttic.uchicago.edu/~dmcallester/ttic101-07/lectures/neural/neural.pdf
Choosing the learning rate
•  Backprop is very sensitive to the choice of learning rate.
–  Too large ⇒ divergence.
–  Too small ⇒ VERY slow learning.
–  The learning rate also influences the ability to escape local optima.
•  Very often, different learning rates are used for units in different
layers. Hard to tune by hand!
•  Heuristic: track performance on a validation set; when it
stabilizes, divide the learning rate by 2.
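A sketch of this heuristic, assuming a list of per-epoch validation errors; the patience window and tolerance are illustrative choices, not values from the slide.

    def maybe_halve(alpha, val_errors, window=5, tol=1e-4):
        """Halve the learning rate once validation error has stopped improving."""
        if len(val_errors) > window and \
           min(val_errors[-window:]) > min(val_errors[:-window]) - tol:
            return alpha / 2.0
        return alpha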
Optimization method: Adagrad
•  Calculate adaptive learning rate per parameter.
•  Intuition: Adapt learning rate depending on previous updates to
that parameter.
–  Learn slowly for frequent features.
–  Learn faster for rare but informative features.
•  Can add regularization term.
See: Duchi, Hazan, Singer (2011) Adaptive subgradient methods for online learning
and stochastic optimization. JMLR.
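A minimal sketch of the per-parameter Adagrad update following Duchi et al. (2011): the accumulated squared gradients scale down the step for frequently updated parameters; epsilon is the usual small constant for numerical stability, and the default values are illustrative.

    import numpy as np

    def adagrad_update(w, grad, accum, alpha=0.01, eps=1e-8):
        """w <- w - alpha * grad / sqrt(accum + eps), elementwise per parameter."""
        accum = accum + grad ** 2                    # running sum of squared gradients
        w = w - alpha * grad / np.sqrt(accum + eps)  # per-parameter effective step size
        return w, accum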
Encoding the input: Discrete inputs
•  Discrete inputs with k possible values are often encoded using a
1-hot or 1-of-k encoding:
–  k input bits are associated with the variable (one for each possible
value).
–  For any instance, all bits are 0 except the one corresponding to the
value found in the data, which is set to 1.
–  If the value is missing, all inputs are set to 0.
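A small sketch of the 1-of-k encoding described above, with an all-zero vector for a missing value; the function name and the use of None for a missing value are illustrative.

    import numpy as np

    def one_hot(value, possible_values):
        """k input bits, one per possible value; a missing value -> all zeros."""
        bits = np.zeros(len(possible_values))
        if value is not None:
            bits[possible_values.index(value)] = 1.0
        return bits

    one_hot("green", ["red", "green", "blue"])   # -> array([0., 1., 0.])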
Encoding the input: Real-valued inputs
•  For continuous (numeric) inputs, it is important to scale the
inputs so they are all in a common, reasonable range.
•  One standard transformation is to “normalize” the data, i.e.,
make it have mean 0 and unit variance, by subtracting the
mean and dividing by the standard deviation.
•  When is this good? When is this bad?
–  Good if the data is roughly normal, but bad if we have outliers.
•  Alternatives:
–  1-to-n encoding: discretize the variable into a given number of intervals n.
–  Thermometer encoding: like 1-to-n, but if the variable falls in the i-th interval,
all bits 1..i are set to 1.
–  The thermometer encoding is usually better than 1-to-n encoding.
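A sketch of the transformations above: standardization of a numeric column, and a thermometer encoding over n intervals; the equal-width binning is an assumption for illustration.

    import numpy as np

    def standardize(x):
        """Subtract the mean and divide by the standard deviation (mean 0, unit variance)."""
        return (x - x.mean()) / x.std()

    def thermometer(value, low, high, n):
        """n equal-width intervals; if value falls in the i-th interval, bits 1..i are set."""
        i = int(np.clip((value - low) / (high - low) * n, 0, n - 1))
        bits = np.zeros(n)
        bits[:i + 1] = 1.0
        return bits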
Encoding the output
•  A network can have several output units.
•  This can be useful when we want to predict a real-valued
variable, and it has several ranges.
–  E.g. Dean Pomerleau’s autonomous driving car.
•  Alternatively, we can allow one output unit, without a sigmoid
function.
•  Normalization can be used if the output data is roughly normal.
More application-specific tricks
•  If there is too little data, it can be perturbed by random noise;
this helps escape local minima and gives more robust results.
–  In image classification and pattern recognition tasks, extra data can
be generated, e.g., by applying transformations that make sense.
•  Weight sharing can be used to indicate parameters that should
have the same value based on prior knowledge.
–  In this case, each update is computed separately using backprop,
then the tied parameters are updated with an average.
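A tiny sketch of the tied-parameter update described in the last bullet: compute each shared weight's update separately with backprop, then apply their average to the single shared value; names are illustrative.

    import numpy as np

    def share_weights(value, tied_updates):
        """Tied parameters share one value; apply the average of their separate updates."""
        return value + np.mean(tied_updates)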
Example applications
•  Speech phoneme recognition [Waibel] and synthesis [Nettalk].
•  Image classification [Kanade, Baluja, Rowley].
•  Digit recognition [Bengio, Bottou, LeCun et al – LeNet].
•  Financial prediction.
•  Learning control [Pomerleau et al].
When to consider using NNs
•  Input is high-dimensional discrete or real-valued (e.g. raw
sensor input).
•  Output is discrete or real valued, or a vector of values.
•  Possibly noisy data.
•  Training time is not important.
•  Form of target function is unknown.
•  Human readability of result is not important.
•  The computation of the output based on the input has to be fast.
What you should know
•  Definition / components of neural networks.
•  Training by backpropagation.
•  Overfitting (and how to avoid it).
•  When to use NNs.
•  Some strategies for successful application of NNs.