
Local Deep Kernel Learning for Efficient Non-linear SVM Prediction
Suraj Jain (MSR India), Vinayak Agrawal (IIT Delhi), Cijo Jose (EPFL),
Prasoon Goyal (IIT Delhi), Manik Varma (MSR India)
Non-linear SVM Prediction
• Non-linear SVM prediction can be accurate
• Cost of prediction = O(DN)
Linear SVM Prediction
• Cost of prediction = O(D)
Local Deep Kernel Learning Prediction
• Cost of prediction = O(D log N)
Our Contributions
• Model
  • We learn a locally linear composite kernel
  • We learn a tree-structured kernel for logarithmic prediction costs
  • Each node in the tree generates a real-valued feature
• Optimization
  • We learn all the tree parameters jointly
  • We optimize efficiently using primal SGD
  • We scale to problems with millions of data points
• Prediction speedup and accuracy
  • 13,000 times faster than the RBF-SVM in some cases
  • Significantly higher accuracy than the state-of-the-art
Localized Multiple Kernel Learning
• Prediction function
  $y(\mathbf{x}) = \mathrm{sign}\Big(\sum_k p(\mathbf{w}_k \mid \mathbf{x})\, \mathbf{w}_k^t \boldsymbol{\phi}_k(\mathbf{x}) + b\Big)$
• Kernel function
  $K_p(\mathbf{x}_i, \mathbf{x}_j) = \sum_k p(\mathbf{w}_k \mid \mathbf{x}_i)\, K_k(\mathbf{x}_i, \mathbf{x}_j)\, p(\mathbf{w}_k \mid \mathbf{x}_j)$
• Dual optimization
  $\min_p \max_{\boldsymbol{\alpha}}\; \mathbf{1}^t \boldsymbol{\alpha} - \tfrac{1}{2} \boldsymbol{\alpha}^t Y K_p Y \boldsymbol{\alpha}$
  s.t. $\mathbf{1}^t Y \boldsymbol{\alpha} = 0, \quad \mathbf{0} \le \boldsymbol{\alpha} \le C$
[Gönen & Alpaydın, ICML 2008]
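To make the gated prediction concrete, here is a minimal sketch of LMKL-style prediction. It is not Gönen & Alpaydın's code: the softmax gating and the two toy feature maps are illustrative assumptions.

```python
import numpy as np

def gating(x, G):
    """Softmax gating p(w_k|x) from gating parameters G (one row per kernel)."""
    s = G @ x
    e = np.exp(s - s.max())
    return e / e.sum()

def lmkl_predict(x, W, G, b, feature_maps):
    """y(x) = sign(sum_k p(w_k|x) * w_k^t phi_k(x) + b)."""
    p = gating(x, G)
    score = sum(p[k] * (W[k] @ fm(x)) for k, fm in enumerate(feature_maps)) + b
    return np.sign(score)

# Toy usage with identity and element-wise-square feature maps (illustrative).
rng = np.random.default_rng(0)
D = 4
feature_maps = [lambda x: x, lambda x: x ** 2]
W = [rng.normal(size=D) for _ in feature_maps]
G = rng.normal(size=(len(feature_maps), D))
print(lmkl_predict(rng.normal(size=D), W, G, b=0.1, feature_maps=feature_maps))
```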
Non-linear SVM Prediction
• Non-linear SVM prediction
  $y(\mathbf{x}) = \mathrm{sign}\Big(\sum_{i=1}^{N} \alpha_i y_i K(\mathbf{x}, \mathbf{x}_i)\Big)$
• RBF kernel: $K(\mathbf{x}, \mathbf{x}_i) = e^{-\gamma \|\mathbf{x} - \mathbf{x}_i\|_2^2}$
[Figure: univariate Gaussian density, $p(x)$ against $x$]
• Cost of prediction = O(DN)
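To see where the O(DN) cost comes from, here is a minimal sketch of RBF-SVM prediction: every test point is compared against all N support vectors, each comparison costing O(D). All names and values are illustrative.

```python
import numpy as np

def rbf_svm_predict(x, support_vectors, alpha_y, gamma, b=0.0):
    """y(x) = sign(sum_i alpha_i y_i exp(-gamma ||x - x_i||^2) + b)."""
    sq_dists = np.sum((support_vectors - x) ** 2, axis=1)  # N distances, O(DN)
    return np.sign(alpha_y @ np.exp(-gamma * sq_dists) + b)

rng = np.random.default_rng(0)
N, D = 1000, 32
support_vectors = rng.normal(size=(N, D))
alpha_y = rng.normal(size=N)                               # alpha_i * y_i
print(rbf_svm_predict(rng.normal(size=D), support_vectors, alpha_y, gamma=0.1))
```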
Composite Kernel Prediction
• Non-linear SVM prediction
  $y(\mathbf{x}) = \mathrm{sign}\Big(\sum_{i=1}^{N} \alpha_i y_i K(\mathbf{x}, \mathbf{x}_i)\Big)$
• LDKL's composite kernel
  $K(\mathbf{x}, \mathbf{x}_i) = K_N(\mathbf{x}, \mathbf{x}_i)\, K_L(\mathbf{x}, \mathbf{x}_i) = \boldsymbol{\phi}^t(\mathbf{x})\, \boldsymbol{\phi}(\mathbf{x}_i)\; \mathbf{x}^t \mathbf{x}_i$
• LDKL's prediction function
  $y(\mathbf{x}) = \mathrm{sign}\Big(\sum_{k=1}^{M} \phi_k(\mathbf{x})\, \mathbf{w}_k^t \mathbf{x}\Big)$
Composite Kernel Prediction
• LDKL's prediction function: $y(\mathbf{x}) = \mathrm{sign}\big(\sum_{k=1}^{M} \phi_k(\mathbf{x})\, \mathbf{w}_k^t \mathbf{x}\big)$
[Figure: toy dataset with the local linear classifiers $\mathbf{w}_1, \mathbf{w}_2, \mathbf{w}_3$ and the gating hyperplanes $\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \boldsymbol{\theta}_3$ overlaid on the learnt decision boundary]
Composite Kernel Prediction
• LDKL's prediction function: $y(\mathbf{x}) = \mathrm{sign}\big(\sum_{k=1}^{M} \phi_k(\mathbf{x})\, \mathbf{w}_k^t \mathbf{x}\big)$
• Hyperbolic tangent gating features: $\phi_k(\mathbf{x}) = \tanh(\sigma\, \boldsymbol{\theta}_k^t \mathbf{x})$
[Figure: the $\tanh$ activation, $\phi(x)$ against $x$, for $\sigma = 1$]
A Shallow Architecture
[Figure: $\mathbf{x}$ feeds $M$ parallel units; unit $k$ multiplies the local score $\mathbf{w}_k^t \mathbf{x}$ with the gating response $\boldsymbol{\theta}_k^t \mathbf{x}$; a sketch of this predictor follows]
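A minimal sketch of this shallow predictor, combining the tanh gating features with the local linear scores; M, D, σ and the random parameters are illustrative assumptions. Both terms cost O(MD), which is the cost the tree structure on the next slides reduces.

```python
import numpy as np

def ldkl_shallow_predict(x, W, Theta, sigma):
    """y(x) = sign(sum_k tanh(sigma * theta_k^t x) * w_k^t x)."""
    phi = np.tanh(sigma * (Theta @ x))  # M gating features, O(MD)
    return np.sign(phi @ (W @ x))       # combine M local scores, O(MD)

rng = np.random.default_rng(0)
M, D = 8, 16
W, Theta = rng.normal(size=(M, D)), rng.normal(size=(M, D))
print(ldkl_shallow_predict(rng.normal(size=D), W, Theta, sigma=1.0))
```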
Learning Tree Structured Features
• LDKL's prediction function
  $y(\mathbf{x}) = \mathrm{sign}\Big(\sum_{k=1}^{M} \phi_k(\mathbf{x})\, \mathbf{w}_k^t \mathbf{x}\Big)$
[Figure: the features $\phi_1, \dots, \phi_7$ arranged as a binary tree; $\mathbf{x}$ is routed from the root to a leaf by Yes/No branching decisions]
Learning Tree Structured Features
• LDKL's prediction accumulates along $\mathbf{x}$'s root-to-leaf path (see the traversal sketch below):
  • At the root: $y(\mathbf{x}) = \mathrm{sign}(\mathbf{w}_1^t \mathbf{x})$
  • After the branch $\boldsymbol{\theta}_1^t \mathbf{x} > 0$: $y(\mathbf{x}) = \mathrm{sign}(\mathbf{w}_1^t \mathbf{x} + \mathbf{w}_2^t \mathbf{x})$
  • After the branch $\boldsymbol{\theta}_2^t \mathbf{x} > 0$: $y(\mathbf{x}) = \mathrm{sign}(\mathbf{w}_1^t \mathbf{x} + \mathbf{w}_2^t \mathbf{x} + \mathbf{w}_4^t \mathbf{x})$
[Figure: a depth-three tree with node classifiers $\mathbf{w}_1, \dots, \mathbf{w}_7$ and branching functions $\boldsymbol{\theta}_1^t \mathbf{x} > 0$, $\boldsymbol{\theta}_2^t \mathbf{x} > 0$, $\boldsymbol{\theta}_3^t \mathbf{x} > 0$]
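A minimal sketch of the traversal, assuming heap-style node indexing (an illustrative layout, not the authors' code): only depth-many nodes are evaluated, which is where the logarithmic prediction cost comes from.

```python
import numpy as np

def tree_path_predict(x, W, Theta, depth):
    """y(x) = sign(sum of w_k^t x over x's root-to-leaf path)."""
    score, k = 0.0, 0                    # k = 0 is the root (heap indexing)
    for _ in range(depth):
        score += W[k] @ x                # local linear score at this node
        left = Theta[k] @ x > 0          # branch on the sign of theta_k^t x
        k = 2 * k + (1 if left else 2)   # heap children: 2k+1 (Yes) / 2k+2 (No)
    return np.sign(score)

rng = np.random.default_rng(0)
D, depth = 16, 3                         # a depth-3 tree has 2**3 - 1 = 7 nodes
W, Theta = rng.normal(size=(7, D)), rng.normal(size=(7, D))
print(tree_path_predict(rng.normal(size=D), W, Theta, depth))
```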
Comparison to a Perceptron Tree
Comparison to a Decision Tree
Learning Tree Structured Features
• LDKL's prediction function with learned gating, e.g. for a point routed through nodes 1, 2 and 4:
  $y(\mathbf{x}) = \mathrm{sign}\big(\tanh(\sigma \mathbf{v}_1^t \mathbf{x})\, \mathbf{w}_1^t \mathbf{x} + \tanh(\sigma \mathbf{v}_2^t \mathbf{x})\, \mathbf{w}_2^t \mathbf{x} + \tanh(\sigma \mathbf{v}_4^t \mathbf{x})\, \mathbf{w}_4^t \mathbf{x}\big)$
[Figure: each tree node $k$ now carries the pair $(\mathbf{w}_k, \mathbf{v}_k)$; branching is still decided by $\boldsymbol{\theta}_k^t \mathbf{x} > 0$]
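The same traversal with the learned gating: each visited node now contributes $\tanh(\sigma \mathbf{v}_k^t \mathbf{x})\, \mathbf{w}_k^t \mathbf{x}$. Again a sketch under the heap-indexing assumption, with illustrative parameters.

```python
import numpy as np

def ldkl_predict(x, W, V, Theta, sigma, depth):
    """y(x) = sign(sum over the path of tanh(sigma * v_k^t x) * w_k^t x)."""
    score, k = 0.0, 0
    for _ in range(depth):
        score += np.tanh(sigma * (V[k] @ x)) * (W[k] @ x)
        k = 2 * k + (1 if Theta[k] @ x > 0 else 2)
    return np.sign(score)

rng = np.random.default_rng(0)
D, depth, n_nodes = 16, 3, 7
W, V, Theta = (rng.normal(size=(n_nodes, D)) for _ in range(3))
print(ldkl_predict(rng.normal(size=D), W, V, Theta, sigma=1.0, depth=depth))
```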
LDKL's Decision Boundaries
[Figure: the LDKL tree of $(\mathbf{w}_k, \mathbf{v}_k)$ nodes with splits $\boldsymbol{\theta}_k^t \mathbf{x} > 0$, shown alongside the decision boundaries induced at each node by $\mathbf{w}_k^t \mathbf{x}$ and $\mathbf{v}_k^t \mathbf{x}$]
Training LDKL
• LDKL's prediction function
  $y(\mathbf{x}) = \mathrm{sign}\big(\tanh(\sigma \mathbf{v}_1^t \mathbf{x})\, \mathbf{w}_1^t \mathbf{x} + \tanh(\sigma \mathbf{v}_2^t \mathbf{x})\, \mathbf{w}_2^t \mathbf{x} + \tanh(\sigma \mathbf{v}_4^t \mathbf{x})\, \mathbf{w}_4^t \mathbf{x}\big)$
[Figure: the LDKL tree with node parameters $(\mathbf{w}_k, \mathbf{v}_k)$ and hard splits $\boldsymbol{\theta}_k^t \mathbf{x} > 0$]
Training LDKL
• Smoothing the tree: the hard splits $\boldsymbol{\theta}_k^t \mathbf{x} > 0$ are replaced by smooth gates $p(\boldsymbol{\theta}_k^t \mathbf{x})$, making the routing differentiable (a sketch follows)
[Figure: the same tree with every split replaced by $p(\boldsymbol{\theta}_k^t \mathbf{x})$]
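A sketch of one way to realize the smooth gates. The slide only says that the hard indicators become $p(\boldsymbol{\theta}_k^t \mathbf{x})$; using a sigmoid for $p$ is an assumption here. Every node then carries a differentiable path weight instead of a 0/1 indicator.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_path_weights(x, Theta, n_nodes):
    """Differentiable analogue of the path indicators I_k(x), heap-indexed."""
    I = np.zeros(n_nodes)
    I[0] = 1.0                            # the root is always on the path
    for k in range((n_nodes - 1) // 2):   # loop over internal nodes
        p = sigmoid(Theta[k] @ x)         # soft probability of branching left
        I[2 * k + 1] = I[k] * p           # left child's path weight
        I[2 * k + 2] = I[k] * (1.0 - p)   # right child's path weight
    return I

rng = np.random.default_rng(0)
D, n_nodes = 16, 7
Theta = rng.normal(size=(n_nodes, D))
print(soft_path_weights(rng.normal(size=D), Theta, n_nodes))
```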
Training LDKL
• Primal optimization via stochastic sub-gradient descent
  $\min_{\mathbf{V}, \mathbf{W}, \boldsymbol{\Theta}} P = \frac{\lambda_{\mathbf{W}}}{2} \mathrm{Tr}(\mathbf{W}^t \mathbf{W}) + \frac{\lambda_{\mathbf{V}}}{2} \mathrm{Tr}(\mathbf{V}^t \mathbf{V}) + \frac{\lambda_{\boldsymbol{\Theta}}}{2} \mathrm{Tr}(\boldsymbol{\Theta}^t \boldsymbol{\Theta}) + \sum_i \max\big(0,\, 1 - y_i\, \boldsymbol{\phi}_L^t(\mathbf{x}_i)\, \mathbf{W}^t \mathbf{x}_i\big)$
• Sub-gradients
  $\nabla_{\mathbf{w}_k} P(\mathbf{x}_i) = \lambda_{\mathbf{W}} \mathbf{w}_k - \delta_i y_i\, \phi_L^k(\mathbf{x}_i)\, \mathbf{x}_i$
  $\nabla_{\boldsymbol{\theta}_k} P(\mathbf{x}_i) = \lambda_{\boldsymbol{\Theta}} \boldsymbol{\theta}_k - \delta_i y_i \sum_l \tanh(\sigma \mathbf{v}_l^t \mathbf{x}_i)\, \big(\nabla_{\boldsymbol{\theta}_k} I_l(\mathbf{x}_i)\big)\, \mathbf{w}_l^t \mathbf{x}_i$
  $\nabla_{\mathbf{v}_k} P(\mathbf{x}_i) = \lambda_{\mathbf{V}} \mathbf{v}_k - \delta_i y_i\, \sigma \big(1 - \tanh^2(\sigma \mathbf{v}_k^t \mathbf{x}_i)\big)\, I_k(\mathbf{x}_i)\, \mathbf{w}_k^t \mathbf{x}_i\, \mathbf{x}_i$
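A minimal sketch of one stochastic sub-gradient step for $\mathbf{W}$ (the $\boldsymbol{\Theta}$ and $\mathbf{V}$ updates follow the same pattern from the sub-gradients above). The hinge indicator $\delta_i$ and the symbols are taken from the objective; everything else, such as the learning rate, is illustrative.

```python
import numpy as np

def sgd_step_W(W, x_i, y_i, phi_L, lam_W, eta):
    """w_k <- w_k - eta * (lam_W * w_k - delta_i * y_i * phi_L^k(x_i) * x_i)."""
    margin = y_i * (phi_L @ (W @ x_i))    # y_i * phi_L^t(x_i) W^t x_i
    delta = 1.0 if margin < 1.0 else 0.0  # sub-gradient of the hinge loss
    grad = lam_W * W - delta * y_i * np.outer(phi_L, x_i)
    return W - eta * grad

rng = np.random.default_rng(0)
M, D = 7, 16
W = rng.normal(size=(M, D))
x_i, phi_L = rng.normal(size=D), rng.normal(size=M)  # phi_L: local features of x_i
W = sgd_step_W(W, x_i, y_i=1.0, phi_L=phi_L, lam_W=1e-3, eta=0.1)
print(W.shape)
```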
Accuracy Comparison: RBF vs Linear
[Figure: classification accuracy (%) of the RBF-SVM versus the linear SVM on CoverType, Letter, MNIST, CIFAR, Banana, IJCNN, USPS and MAGIC4; accuracy axis from 50 to 100]
RBF-SVM: Cost of Prediction
• Normalized prediction time = RBF time / linear time
[Figure: normalized RBF-SVM prediction time per dataset, log scale from 1 to 100,000]
LDKL's Prediction Accuracy
[Figure: classification accuracy (%) of the RBF-SVM, LDKL and the linear SVM per dataset; accuracy axis from 50 to 100]
LDKL's Prediction Cost
[Figure: normalized prediction times of the RBF-SVM and LDKL, log scale from 1 to 100,000]
Training Time Comparison: RBF vs LDKL
• Training time on a single core of a 2.68 GHz Xeon processor with 8 GB RAM.
[Figure: RBF vs LDKL training times on small datasets (IJCNN, Letter, MAGIC4, USPS; seconds, log scale) and on large datasets (minutes, log scale)]
Training Time Comparison: RBF vs LDKL
• Training time in minutes on a single core of a 2.68 GHz Xeon processor with 8 GB RAM.

Data Set    Linear SVM   RBF-SVM    LDKL
Banana      1.48E-05     0.006      0.01
CIFAR       6.76E-02     1283.68    0.278
CoverType   2.35         1369.96    7.990
IJCNN       3.93E-02     0.45       0.090
Letter      5.75E-04     0.43       2.200
Magic04     1.74E-03     0.17       0.047
MNIST       2.86E-01     39.12      1.376
USPS        5.97E-03     0.75       0.096
RCV1        0.13         -          0.5
MNIST8M     0.7          -          65.21
[Figure: per-dataset accuracy (%) plots for Banana, CoverType, IJCNN, CIFAR10, Letter, MNIST, USPS and Magic04]
LDKL’s Decision Boundaries
The RBF-SVM’s Decision Boundaries
LDKL: Mimicking the RBF-SVM
Pruning LDKL

            Accuracy (%)        Prediction Time (ms)   # Nodes
Dataset     Original  Pruned    Original  Pruned       Original  Pruned
Banana      89.53     89.50     1.08      1.07         15        12
CoverType   90.22     90.26     104.38    96.9         255       230
CIFAR       76.14     76.14     38.53     37.41        7         7
IJCNN       98.20     98.20     43.2      38.38        15        9
Letter      95.94     95.83     6.68      4.82         1023      115
Magic       85.93     86.01     1.49      1.28         15        9
MNIST       97.27     97.26     128.43    123.48       31        24
USPS        95.51     95.46     6.95      6.58         15        13
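The slides do not spell out the pruning criterion, so the sketch below makes a plausible assumption: greedily mask one node at a time and keep the mask whenever held-out accuracy does not drop. For brevity it uses the shallow (non-tree) form of the predictor.

```python
import numpy as np

def masked_accuracy(X, y, W, Theta, sigma, mask):
    """Accuracy when only unmasked nodes contribute to the score."""
    phi = np.tanh(sigma * (X @ Theta.T)) * mask  # (n, M) gated features
    scores = np.sum(phi * (X @ W.T), axis=1)     # sum_k phi_k(x) * w_k^t x
    return np.mean(np.sign(scores) == y)

def prune(X, y, W, Theta, sigma=1.0):
    mask = np.ones(W.shape[0])
    base = masked_accuracy(X, y, W, Theta, sigma, mask)
    for k in range(W.shape[0]):                  # try dropping each node
        trial = mask.copy()
        trial[k] = 0.0
        if masked_accuracy(X, y, W, Theta, sigma, trial) >= base:
            mask = trial                         # node k was irrelevant
    return mask

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = np.where(X[:, 0] * X[:, 1] > 0, 1.0, -1.0)   # toy labels
W, Theta = rng.normal(size=(15, 8)), rng.normal(size=(15, 8))
print(int(prune(X, y, W, Theta).sum()), "of 15 nodes kept")
```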
Generating Training Data
• Sample points and have them labelled by the RBF-SVM
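A minimal sketch of this step, with scikit-learn's SVC standing in as the RBF-SVM teacher; the toy dataset, sampling range and hyper-parameters are all illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)  # toy non-linear labels

teacher = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, y)

X_extra = rng.uniform(-3.0, 3.0, size=(5000, 2))  # sample fresh points...
y_extra = teacher.predict(X_extra)                # ...and label them with the RBF-SVM
X_ext = np.vstack([X, X_extra])                   # extended training set for LDKL
y_ext = np.concatenate([y, y_extra])
print(X_ext.shape, y_ext.shape)
```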
Training LDKL on the Extended Data Set
• Learn a balanced LDKL tree and then prune irrelevant nodes
[Figure sequence: LDKL's decision boundaries as nodes are added one at a time, from one node up to nine nodes]
LDKL’s Final Decision Boundaries
Conclusions
• Publications and code
  • ICML 2013 paper
  • Code: http://research.microsoft.com/~manik/code/LDKL/download.html
• LDKL learns a local, deep composite kernel for efficient non-linear SVM prediction
• LDKL can be exponentially faster than the state-of-the-art
• Efficiency is important during both training and prediction
Acknowledgements
• Samy Bengio
• Purushottam Kar
• Prateek Jain
• Yann LeCun
• Vinod Nair
• Yashoteja Prabhu
• Nishal Shah