COMP 598 – Applied Machine Learning
Lecture 15: Neural Networks (cont’d)
Instructor: Joelle Pineau ([email protected])
TAs: Pierre-Luc Bacon ([email protected]), Angus Leigh ([email protected])
Class web page: www.cs.mcgill.ca/~jpineau/comp598

Quiz #5: SVMs
Consider learning an SVM to do binary classification on the dataset below.
1. Draw an approximate solution for each of the following kernel types:
– a linear kernel
– a quadratic kernel
– a Gaussian kernel
2. For the linear kernel, circle the 3 points that are the support vectors.
3. For the linear kernel, what is the equation corresponding to the decision boundary?

Feed-forward neural networks
Notation:
• w_ji denotes the weight on the connection from unit i to unit j.
• By convention, x_j0 = 1 for all j.
• The output of unit j, denoted o_j, is computed using a sigmoid: o_j = σ(w_j · x_j), where w_j is the vector of weights entering unit j and x_j is the vector of inputs to unit j.
• By definition, x_ji = o_i.
Given an input, how do we compute the output? How do we train the weights?

Stochastic gradient descent
• Initialize all weights to small random numbers. (Initialization)
• Repeat until convergence:
– Pick a training example.
– Feed the example through the network to compute the output o = o_(N+H+1). (Forward propagation)
– For the output unit, compute the correction: δ ← o(1 − o)(y − o).
– For each hidden unit h, h = 1, …, H, compute its share of the correction: δ_h ← o_h(1 − o_h) w_(N+H+1),h δ. (Backpropagation)
– Update each network weight: w_ji ← w_ji + α δ_j x_ji. (Gradient descent)

Organizing the training data
• Stochastic gradient descent: compute the error on a single example at a time (as above).
• Batch gradient descent: compute the error on all examples.
– Loop through the training data, accumulating weight changes.
– Update all weights and repeat.
• Mini-batch gradient descent: compute the error on a small subset.
– Randomly select a “mini-batch” (i.e. a subset of training examples).
– Calculate the error on the mini-batch, apply it to update the weights, and repeat.

Expressiveness of feed-forward NNs
A single sigmoid neuron?
• Same representational power as a perceptron: Boolean AND, OR, NOT, but not XOR.
What about a neural network with a single hidden layer?

Learning the identity function
• Neural network structure: [figure]
• Learned hidden layer weights: [figure]
(A code sketch of training such a network appears below.)

Expressiveness of feed-forward NNs (cont’d)
A neural network with a single hidden layer?
• Can represent every Boolean function, but might require a number of hidden units that is exponential in the number of inputs.
• Every bounded continuous function can be approximated with arbitrary precision by a network with a single hidden layer.
A neural network with two hidden layers?
• Any function can be approximated to arbitrary accuracy by a network with two hidden layers.
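To make the backpropagation procedure above concrete, here is a minimal sketch of the stochastic-gradient-descent training loop for a one-hidden-layer sigmoid network, applied to the identity-function task. The lecture gives no code; the 8-3-8 architecture (8 one-hot inputs, 3 hidden units, 8 outputs), the learning rate, and the epoch count are assumptions chosen for illustration.

```python
# Minimal sketch (not from the lecture): SGD with backpropagation for a
# one-hidden-layer sigmoid network, trained on the 8-bit identity task.
# The 8-3-8 architecture, learning rate and epoch count are illustrative choices.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Training data: the 8 one-hot vectors; the target equals the input.
X = np.eye(8)
Y = X.copy()

n_in, n_hidden, n_out = 8, 3, 8
# Each weight matrix has an extra bias column (by convention x_j0 = 1).
W1 = rng.uniform(-0.05, 0.05, size=(n_hidden, n_in + 1))
W2 = rng.uniform(-0.05, 0.05, size=(n_out, n_hidden + 1))

alpha = 0.3  # learning rate
for epoch in range(5000):
    for x, y in zip(X, Y):
        # Forward propagation
        x1 = np.concatenate(([1.0], x))       # inputs with bias term
        h = sigmoid(W1 @ x1)                  # hidden-unit outputs o_h
        h1 = np.concatenate(([1.0], h))
        o = sigmoid(W2 @ h1)                  # output-unit activations

        # Backpropagation of the corrections (sigmoid derivative is o(1 - o))
        delta_o = o * (1 - o) * (y - o)                  # output units
        delta_h = h * (1 - h) * (W2[:, 1:].T @ delta_o)  # hidden units' share

        # Gradient-descent weight updates: Δw_ji = α δ_j x_ji
        W2 += alpha * np.outer(delta_o, h1)
        W1 += alpha * np.outer(delta_h, x1)

# Inspect the 3 hidden-unit activations for the first input; after training
# they tend toward a compact, binary-like code of the 8 inputs.
print(np.round(sigmoid(W1 @ np.concatenate(([1.0], X[0]))), 2))
```

With only 3 hidden units for 8 distinct inputs, the network is forced to learn a compressed internal representation, which is the point of the identity-function example.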
Network architecture
• Overfitting occurs if there are too many parameters compared to the amount of data available.
• Choosing the number of hidden units:
– Too few hidden units do not allow the concept to be learned.
– Too many lead to slow learning and overfitting.
– If the m inputs are binary, log m is a good heuristic choice.
• Choosing the number of layers:
– Always start with one hidden layer.
– Don’t usually go beyond 2 hidden layers, unless the task structure suggests something different.

Overtraining
• Traditional overfitting is concerned with the number of parameters vs. the number of instances.
• In neural networks there is an additional phenomenon, called overtraining, which occurs when weights take on large magnitudes, pushing the sigmoids into saturation.
• Effectively, as learning progresses, the network has more actual parameters.
• Use a validation set to decide when to stop training!
• Regularization is also very effective.
[Figure: error versus number of weight updates, showing training set error and validation set error.]

Convergence of backpropagation
• Backpropagation = gradient descent over all parameters in the network.
• If the learning rate is appropriate, the algorithm is guaranteed to converge to a local minimum of the cost function.
– NOT the global minimum.
– It can be much WORSE than the global minimum.
– There can be MANY local minima.
– Solution: random restarts = train multiple nets with different initial weights.
– In practice, the solution found is often very good.
• Training can take thousands of iterations – VERY SLOW! But using the network after training is very fast.
• Can we find a solution faster (i.e. in fewer iterations)?

Adding momentum
• On the t-th training sample, instead of the update Δw_ij ← α_ij δ_j x_ij, we do:
Δw_ij(t) ← α_ij δ_j x_ij + β Δw_ij(t − 1)
• The second term is called momentum.
• Advantages:
– Easy to pass small local minima.
– Keeps the weights moving in areas where the error is flat.
– Increases the speed where the gradient stays unchanged.
• Disadvantages:
– With too much momentum, it can get out of a global minimum!
– One more parameter to tune, and more chances of divergence.
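As a small illustration of the momentum rule Δw_ij(t) ← α_ij δ_j x_ij + β Δw_ij(t − 1), here is a sketch of the update applied to a toy quadratic objective; the values of α and β and the objective itself are made up for the example, not taken from the lecture.

```python
# Sketch of a weight update with momentum: Δw(t) = α·(δ x) + β·Δw(t-1).
# alpha (learning rate) and beta (momentum coefficient) are illustrative values.
import numpy as np

def momentum_update(w, grad_term, prev_delta, alpha=0.1, beta=0.9):
    """Return the updated weights and the Δw to carry into the next step.

    grad_term plays the role of δ_j · x_ji from the backpropagation update.
    """
    delta = alpha * grad_term + beta * prev_delta
    return w + delta, delta

# Toy usage: descend a quadratic bowl f(w) = ||w||^2 / 2; the momentum term
# keeps the weights moving even where individual gradient steps are small.
w = np.array([2.0, -3.0])
prev_delta = np.zeros_like(w)
for t in range(200):
    grad_term = -w                      # negative gradient of f
    w, prev_delta = momentum_update(w, grad_term, prev_delta)
print(w)                                # approaches the minimum at the origin
```

Setting β closer to 1 gives the weights more inertia, which is where both the advantages and the divergence risk listed above come from.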
Optimization: Repeated line searches
• Given the optimization problem w* = argmin_w f(w):
– Think of f(w) as the error of your neural network.
• Consider the update w_j = argmin_{w_j} f(w_1, w_2, …, w_{j−1}, w_j, w_{j+1}, …, w_m):
– Optimize a single dimension w_j, holding the rest constant.
• Repeat for all weights.
• Then repeat again, until all weights are simultaneously optimal.
• This works well when the weights do not interact, i.e. when the error is separable across the coordinates.

Optimization: Conjugate gradient descent
• The condition above is not always true.
• Can we find a coordinate system where it is satisfied?
• Yes: the eigenvectors of the Hessian matrix.
• Apply repeated line search in this coordinate system.

Optimization: Second-order methods
• The Newton–Raphson method looks at both the first- and second-order derivatives of the error function.
• Approximate the first-order derivative (the gradient G) around the current weights: G(w_{t+1}) ≈ G(w_t) + H(w_t)(w_{t+1} − w_t), where H is the Hessian.
• Solve for G(w_{t+1}) = 0: w_{t+1} = w_t − H(w_t)^{−1} G(w_t).
Some equations taken from: http://ttic.uchicago.edu/~dmcallester/ttic101-07/lectures/neural/neural.pdf

Choosing the learning rate
• Backprop is very sensitive to the choice of learning rate.
– Too large ⇒ divergence.
– Too small ⇒ VERY slow learning.
– The learning rate also influences the ability to escape local optima.
• Very often, different learning rates are used for units in different layers. Hard to tune by hand!
• It can be shown that each unit has its own optimal learning rate.
• Heuristic: track performance on a validation set; when it stabilizes, divide the learning rate by 2.

Optimization method: Adagrad
• Calculate an adaptive learning rate per parameter.
• Intuition: adapt the learning rate depending on previous updates to that parameter.
– Learn slowly for frequent features.
– Learn faster for rare but informative features.
• Can add a regularization term.
See: Duchi, Hazan, Singer (2011). Adaptive subgradient methods for online learning and stochastic optimization. JMLR.
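Below is a minimal sketch of the per-parameter Adagrad update described above (following Duchi et al., 2011); the step size eta, the constant eps, and the toy objective are illustrative choices, not values from the lecture.

```python
# Sketch of the Adagrad idea: a per-parameter learning rate that shrinks for
# parameters that have received large/frequent gradient updates.
import numpy as np

def adagrad_step(w, grad, grad_sq_sum, eta=0.1, eps=1e-8):
    """One Adagrad update: w <- w - eta * grad / sqrt(sum of squared grads)."""
    grad_sq_sum = grad_sq_sum + grad ** 2
    w = w - eta * grad / (np.sqrt(grad_sq_sum) + eps)
    return w, grad_sq_sum

# Toy usage: minimize f(w) = 0.5 * w^T diag(10, 0.1) w, whose gradient has very
# different scales across the two coordinates; Adagrad roughly equalizes the steps.
scales = np.array([10.0, 0.1])
w = np.array([1.0, 1.0])
grad_sq_sum = np.zeros_like(w)
for _ in range(500):
    grad = scales * w
    w, grad_sq_sum = adagrad_step(w, grad, grad_sq_sum)
print(w)  # both coordinates move toward 0 despite the scale difference
```

Because the accumulated squared gradients only grow, parameters that receive frequent large gradients end up with small effective learning rates, while rarely-updated parameters keep larger ones, matching the intuition in the bullets above.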
Encoding the input: Discrete inputs
• Discrete inputs with k possible values are often encoded using a 1-hot (1-of-k) encoding:
– k input bits are associated with the variable (one for each possible value).
– For any instance, all bits are 0 except the one corresponding to the value found in the data, which is set to 1.
– If the value is missing, all inputs are set to 0.

Encoding the input: Real-valued inputs
• For continuous (numeric) inputs, it is important to scale the inputs so they are all in a common, reasonable range.
• One standard transformation is to “normalize” the data, i.e. make it have mean 0 and unit variance, by subtracting the mean and dividing by the standard deviation.
• When is this good? When is this bad?
– Good if the data is roughly normal, but bad if we have outliers.
• Alternatives:
– 1-to-n encoding: discretize the variable into a given number n of intervals.
– Thermometer encoding: like 1-to-n, but if the variable falls in the i-th interval, all bits 1..i are set to 1.
– The thermometer encoding is usually better than the 1-to-n encoding.
(These encodings are sketched in code after the next section.)

Encoding the output
• A network can have several output units.
• This can be useful when we want to predict a real-valued variable and it has several ranges.
– E.g. Dean Pomerleau’s autonomous driving car.
• Alternatively, we can allow one output unit, without a sigmoid function.
• Normalization can be used if the output data is roughly normal.
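To make the input encodings above concrete, here is a small sketch of the 1-of-k, normalization, and thermometer encodings; the category lists, interval boundaries, and example values are made up for illustration.

```python
# Illustrative sketches of the input encodings discussed above. Categories,
# interval boundaries and example values are invented for the example.
import numpy as np

def one_of_k(value, categories):
    """1-hot / 1-of-k encoding; a missing value (None) gives all zeros."""
    bits = np.zeros(len(categories))
    if value is not None:
        bits[categories.index(value)] = 1.0
    return bits

def normalize(column):
    """Zero-mean, unit-variance scaling of a real-valued column."""
    return (column - column.mean()) / column.std()

def thermometer(value, boundaries):
    """Thermometer encoding: if value falls in the i-th interval, bits 1..i are set."""
    interval = np.searchsorted(boundaries, value)   # index of the interval
    bits = np.zeros(len(boundaries) + 1)
    bits[:interval + 1] = 1.0
    return bits

print(one_of_k("red", ["red", "green", "blue"]))    # [1. 0. 0.]
print(one_of_k(None, ["red", "green", "blue"]))     # [0. 0. 0.] (missing value)
print(normalize(np.array([1.0, 2.0, 3.0, 4.0])))
print(thermometer(2.7, boundaries=[1.0, 2.0, 3.0])) # [1. 1. 1. 0.]
```

The thermometer encoding preserves the ordering of the intervals, which is one reason it usually works better than the plain 1-to-n encoding.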
More application-specific tricks
• If there is too little data, it can be perturbed by random noise; this helps escape local minima and gives more robust results.
– In image classification and pattern recognition tasks, extra data can be generated, e.g. by applying transformations that make sense.
• Weight sharing can be used to indicate parameters that should have the same value based on prior knowledge.
– In this case, each update is computed separately using backprop, then the tied parameters are updated with an average.

Example applications
• Speech phoneme recognition [Waibel] and synthesis [NETtalk].
• Image classification [Kanade, Baluja, Rowley].
• Digit recognition [Bengio, Bottou, LeCun et al. – LeNet].
• Financial prediction.
• Learning control [Pomerleau et al.].

When to consider using NNs
• Input is high-dimensional discrete or real-valued (e.g. raw sensor input).
• Output is discrete or real-valued, or a vector of values.
• Possibly noisy data.
• Training time is not important.
• Form of the target function is unknown.
• Human readability of the result is not important.
• The computation of the output based on the input has to be fast.

What you should know
• Definition / components of neural networks.
• Training by backpropagation.
• Overfitting (and how to avoid it).
• When to use NNs.
• Some strategies for successful application of NNs.