Deep Learning

Some slides adapted from the NAACL 2013 "Deep Learning for NLP" tutorial, the Winter School on Deep Learning, and Deview 2013.
Deep Networks
•  A deep network is a neural network with multiple levels of nonlinear operations.
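As a concrete, purely illustrative picture of "multiple levels of nonlinear operations", the sketch below stacks a few fully connected layers with a nonlinearity between them using NumPy. The layer sizes and the tanh nonlinearity are arbitrary choices, not anything prescribed by the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    # Small random weights and zero biases (illustrative initialization).
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)

# A "deep" network here is just a stack of (linear transform + nonlinearity).
layer_sizes = [784, 500, 250, 30]            # e.g. MNIST-sized input, arbitrary hidden sizes
layers = [init_layer(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x, layers):
    h = x
    for W, b in layers:
        h = np.tanh(h @ W + b)               # one level of nonlinear operation
    return h

x = rng.random((5, 784))                     # a fake mini-batch of 5 inputs
print(forward(x, layers).shape)              # -> (5, 30)
```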
Why Deep Networks?
•  Similar to the human perception process
  –  Abstraction from low-level representations to high-level representations.
Why Deep Networks?: Integrated Learning
•  Deep networks optimize both the feature extractor and the classifier simultaneously.
•  Conventional system
  –  Handcrafting features is time-consuming.
•  Deep network
  –  Features and classifier are learned jointly from the data.
Why Deep Networks?: Unsupervised Feature and Weight Learning
•  Today, most practical, good NLP & ML methods require labeled training data (i.e., supervised learning)
  –  But almost all data is unlabeled.
•  Most information must be acquired unsupervised.
  –  Fortunately, a good model of the observed data can really help you learn classification decisions.
Difficulties with Deep Networks
•  In many cases, deep networks are hard to optimize.
  –  The conventional back-propagation algorithm does not work well for deep networks.
  –  Particularly, the bottom layer is very hard to train.
    •  The error signal doesn't reach the low-level layers very well.
•  Requires heavy computation.
•  Many parameters (over-fitting problems).
•  Requires much effort to implement and debug.
Why Now?
•  DNN back-propagation (1980s)
  –  Overfitting, slow training time
  –  Before 2006, training deep architectures was unsuccessful.
•  What has changed?
  –  New methods for unsupervised pre-training have been developed (restricted Boltzmann machines, autoencoders, …)
  –  More efficient parameter estimation methods
  –  Better understanding of model regularization
  –  Big data, GPUs, …
Deep Belief Network [Hinton06]
•  Key idea
  –  Pre-train layers with an unsupervised learning algorithm in phases.
  –  Then, fine-tune the whole network by supervised learning.
•  DBNs are stacks of restricted Boltzmann machines.
Restricted Boltzmann Machine
•  A restricted Boltzmann machine (RBM) is a generative stochastic neural network that can learn a probability distribution over its set of inputs.
•  Major applications
  –  Dimensionality reduction
  –  Topic modeling, …
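The slide does not show how an RBM is actually trained. As a rough illustration, the sketch below implements one step of contrastive divergence (CD-1), the usual training rule for binary RBMs; the sizes, learning rate, and toy data are made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 6, 3, 0.1

W = rng.normal(0.0, 0.1, size=(n_visible, n_hidden))
b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0):
    # Up: hidden probabilities and samples given the visible data.
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Down-up: one step of Gibbs sampling (the "reconstruction").
    p_v1 = sigmoid(h0 @ W.T + b_v)
    p_h1 = sigmoid(p_v1 @ W + b_h)
    # Gradient approximation: positive phase minus negative phase.
    dW = v0.T @ p_h0 - p_v1.T @ p_h1
    return dW / len(v0), (p_h0 - p_h1).mean(axis=0), (v0 - p_v1).mean(axis=0)

v0 = (rng.random((8, n_visible)) < 0.5).astype(float)   # toy binary mini-batch
dW, db_h, db_v = cd1_step(v0)
W += lr * dW; b_h += lr * db_h; b_v += lr * db_v
```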
Training DBN: Pre-Training
•  1. Layer-wise greedy unsupervised pre-training
  –  Train layers in phases, starting from the bottom layer.
Training DBN: Fine-Tuning
•  2. Supervised fine-tuning for the classification task
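Putting the two phases together, the sketch below uses scikit-learn's BernoulliRBM to greedily pre-train two feature layers and then trains a logistic regression classifier on top. Note this is a simplified stand-in: a real DBN fine-tunes all layers with back-propagation, whereas here only the top classifier is trained in the supervised phase. Layer sizes, hyperparameters, and the toy data are arbitrary.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = (rng.random((200, 64)) < 0.3).astype(float)   # toy binary "images"
y = rng.integers(0, 2, size=200)                  # toy labels

# Phase 1: greedy layer-wise unsupervised pre-training, bottom layer first.
rbm1 = BernoulliRBM(n_components=32, learning_rate=0.05, n_iter=10, random_state=0)
h1 = rbm1.fit_transform(X)                        # train layer 1 on the raw input
rbm2 = BernoulliRBM(n_components=16, learning_rate=0.05, n_iter=10, random_state=0)
h2 = rbm2.fit_transform(h1)                       # train layer 2 on layer-1 features

# Phase 2: supervised training on top of the pre-trained features
# (a full DBN would instead fine-tune every layer by back-propagation).
clf = LogisticRegression(max_iter=1000).fit(h2, y)
print("train accuracy:", clf.score(h2, y))
```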
Other Pre-Training Methods
•  PCA, ICA
•  Stacked autoencoder
•  Stacked denoising autoencoder
•  …
Autoencoder
•  An autoencoder is an ANN whose desired output is the same as its input.
  –  The aim of an autoencoder is to learn a compressed representation (encoding) for a set of data.
  –  Find weight matrices A and B that minimize: Σ_i (y_i − x_i)²
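A minimal NumPy sketch of the objective on this slide: encode x with weights A, decode with weights B, and take gradient steps on the squared reconstruction error Σ_i (y_i − x_i)². The sizes, learning rate, and sigmoid encoder are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, lr = 20, 5, 0.01

A = rng.normal(0, 0.1, (n_in, n_hidden))    # encoder weights
B = rng.normal(0, 0.1, (n_hidden, n_in))    # decoder weights
b1, b2 = np.zeros(n_hidden), np.zeros(n_in)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = rng.random((100, n_in))                 # toy data

for epoch in range(200):
    h = sigmoid(X @ A + b1)                 # compressed code
    Y = h @ B + b2                          # reconstruction
    loss = ((Y - X) ** 2).sum()             # Σ_i (y_i − x_i)²
    # Back-propagate the reconstruction error by hand.
    dY = 2 * (Y - X)
    dB, db2 = h.T @ dY, dY.sum(axis=0)
    dh = dY @ B.T
    dZ = dh * h * (1 - h)                   # sigmoid derivative
    dA, db1 = X.T @ dZ, dZ.sum(axis=0)
    for p, g in ((A, dA), (B, dB), (b1, db1), (b2, db2)):
        p -= lr * g / len(X)

print("final reconstruction error:", loss)
```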
Stacked Autoencoders
•  After training, the hidden nodes extract features from the input nodes.
•  Stacking autoencoders constructs a deep network.
Denoising Autoencoder [Vincent08]
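The figure for this slide is not reproduced here. The key idea in [Vincent08] is to corrupt the input and train the autoencoder to reconstruct the clean version; the fragment below only sketches the corruption step and the training target, assuming an autoencoder like the one above and an arbitrarily chosen 30% masking-noise level.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(X, drop_prob=0.3):
    # Masking noise: randomly zero out a fraction of the input components.
    mask = rng.random(X.shape) >= drop_prob
    return X * mask

X_clean = rng.random((100, 20))
X_noisy = corrupt(X_clean)
# Training pairs for a denoising autoencoder:
#   input  = X_noisy   (corrupted)
#   target = X_clean   (reconstruct the original, uncorrupted data)
```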
Experiments – MNIST (Larochelle et al., 2009)
Why Does Unsupervised Pre-Training Work So Well?
•  Regularization hypothesis:
  –  Representations that are good for P(x) are good for P(y|x).
•  Optimization hypothesis:
  –  Unsupervised initialization starts near a better local minimum of the supervised training error.
  –  These minima are otherwise not achievable by random initialization.
Drop-Out [Hinton 2012]
•  During training, randomly drop out hidden units with probability p.
Drop-Out – cont'd
•  Overfitting can be reduced by using dropout.
  –  Drop out 20% of the input units and 50% of the hidden units.
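A minimal sketch of the dropout rule described on these two slides, using "inverted" dropout (rescaling at training time so nothing special is needed at test time). The 50% hidden-unit rate follows the slide; the layer, its size, and everything else are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_drop, training):
    # Inverted dropout: drop units with probability p_drop during training,
    # and rescale the survivors so the expected activation stays the same.
    if not training:
        return h                      # at test time, use all units unchanged
    mask = (rng.random(h.shape) >= p_drop) / (1.0 - p_drop)
    return h * mask

h = np.tanh(rng.random((4, 10)))      # some hidden-layer activations
print(dropout(h, p_drop=0.5, training=True))   # ~50% of hidden units zeroed
print(dropout(h, p_drop=0.5, training=False))  # unchanged at test time
```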
Rectified Linear Hidden Unit (ReLU)
•  Sparse coding
  –  Allow only a small number of computational units to have non-zero values.
ReLU – cont'd
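ReLU itself is one line; the sketch below shows the definition and the sparsity mentioned on the previous slide (only units with positive pre-activations stay active). The random pre-activations are made up.

```python
import numpy as np

def relu(z):
    # Rectified linear unit: pass positives through, clamp negatives to zero.
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
z = rng.normal(size=(1000,))               # fake pre-activations
a = relu(z)
print("fraction of non-zero activations:", (a > 0).mean())   # ≈ 0.5 here;
# with a negative bias or a sparsity penalty the active fraction shrinks further.
```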
Deep Learning Tip: Drop-Out or RBM
•  Drop-Out is better than RBM.
  –  If the goal of pre-training is just to reduce overfitting, there is no need to use an RBM (Naver Labs speech recognition).
  –  Use ReLU together with Drop-Out.
  –  Where an RBM is still useful:
    •  Unsupervised training
    •  Feature extractor
Deep Learning Tip: Noisy Input Data
•  If you have plenty of noisy input data, you do not need to use an RBM or Drop-Out at all (Naver Labs speech recognition).
  –  Role of noisy data
    •  Helps avoid getting stuck in poor local minima
    •  Acts as a regularizer through randomness
  –  If you have no noisy data → create some and add it (see the sketch below).
    •  Images → left-right flips, slight distortions
    •  Speech → insert random background noise
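A small sketch of the "create noisy data if you don't have it" advice: horizontally flip an image array, jitter its pixels slightly, and mix random background noise into a waveform. Shapes, noise levels, and the fake audio signal are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Image: horizontal flip (left-right reversal) plus a little pixel jitter.
image = rng.random((28, 28))                       # toy grayscale image
flipped = image[:, ::-1]                           # left-right flip
jittered = np.clip(image + rng.normal(0, 0.02, image.shape), 0.0, 1.0)

# Speech: mix random background noise into a waveform at a chosen level.
waveform = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)   # 1 s of fake audio
noise = rng.normal(0, 1, waveform.shape)
noise_scale = 0.05                                 # arbitrary noise level
augmented = waveform + noise_scale * noise
```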
Deep Learning: Parameters
•  DNN Structure
  –  Number of nodes in the input layer
  –  Number of nodes in the output layer
  –  Number of hidden layers
  –  Number of nodes in each hidden layer
•  Training Parameters
  –  Momentum
  –  Learning rate
  –  Initial weight values
  –  Drop-out
  –  Mini-batch size
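To make the training parameters listed above concrete, here is a sketch of a mini-batch SGD update with momentum, with the learning rate, weight-initialization scale, dropout rate, and batch size collected in one configuration dict. The particular values are placeholders, not recommendations from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

config = {
    "hidden_layers": [512, 512],   # structure: number and size of hidden layers
    "learning_rate": 0.01,
    "momentum": 0.9,
    "init_scale": 0.01,            # scale of the random initial weights
    "dropout": 0.5,
    "batch_size": 128,
}

# One momentum-SGD update for a single weight matrix.
W = rng.normal(0.0, config["init_scale"], size=(784, config["hidden_layers"][0]))
velocity = np.zeros_like(W)

def sgd_momentum_step(W, velocity, grad, cfg):
    velocity[:] = cfg["momentum"] * velocity - cfg["learning_rate"] * grad
    W += velocity
    return W, velocity

grad = rng.normal(size=W.shape)          # stand-in for a mini-batch gradient
W, velocity = sgd_momentum_step(W, velocity, grad, config)
```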
Neural Language Model
•  A Neural Probabilistic Language Model
  –  Each word is represented by a distributed continuous-valued code.
  –  Generalizes to sequences of words that are semantically similar to training sequences.
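A rough sketch of the idea behind a neural probabilistic language model: each word in a fixed-size context is mapped to a continuous embedding vector, the embeddings are concatenated, fed through a hidden layer, and a softmax predicts the next word. The vocabulary size, embedding size, and single hidden layer are assumptions, not the exact architecture from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, emb_dim, context, hidden = 1000, 50, 3, 100

C = rng.normal(0, 0.01, (vocab, emb_dim))             # word embedding table
W1 = rng.normal(0, 0.01, (context * emb_dim, hidden))
W2 = rng.normal(0, 0.01, (hidden, vocab))

def next_word_probs(word_ids):
    # Look up and concatenate the embeddings of the context words.
    x = C[word_ids].reshape(-1)                       # (context * emb_dim,)
    h = np.tanh(x @ W1)                               # hidden representation
    logits = h @ W2
    e = np.exp(logits - logits.max())                 # softmax over the vocabulary
    return e / e.sum()

p = next_word_probs(np.array([12, 7, 512]))           # three context word ids
print(p.shape, p.sum())                               # (1000,) 1.0
```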
Word Embedding
Korean Word Embedding (KNU)
NLP (Almost) From Scratch (Collobert and Weston, 2011)
Collobert and Weston, 2011 – cont'd