Local Deep Kernel Learning for Efficient Non-linear SVM Prediction Suraj Jain Vinayak Agrawal Cijo Jose MSR India IIT Delhi EPFL Prasoon Goyal Manik Varma IIT Delhi MSR India Non-linear SVM Prediction • Non-linear SVM prediction can be accurate Non-linear SVM Prediction • Cost of prediction = O(DN) Linear SVM Prediction • Cost of prediction = O(D) Local Deep Kernel Learning Prediction • Cost of prediction = O(D log N) Our Contributions • Model • We learn a locally linear composite kernel • We learn a tree structured kernel for logarithmic prediction • Each node in the tree generates real-valued feature • Optimization • We learn all the tree parameters jointly • We optimize efficiently using primal SGD • We scale to problems with millions of data points • Prediction speedup and accuracy • 13,000 times faster than the RBF-SVM in some cases • Significantly higher accuracy over the state-of-the-art Localized Multiple Kernel Learning • Prediction function 𝑝 𝐰𝑘 𝐱 𝐰𝑘𝑡 𝛟𝑘 𝐱 + 𝑏 𝑦 𝐱 = sign 𝑘 • Kernel function 𝐾𝑝 𝐱 𝑖 , 𝐱𝑗 = 𝑝 𝐰𝑘 𝐱 𝑖 𝐾𝑘 𝐱𝑖 , 𝐱𝑗 𝑝 𝐰𝑘 𝐱𝑗 𝑘 • Dual optimization Minp Max 1t – ½tYKpY s. t. 1tY = 0 0C Gonen & Alpaydin ICML 2008 Non-linear SVM Prediction • Non-linear SVM prediction 𝑁 𝑦 𝐱 = sign 𝛼𝑖 𝑦𝑖 𝐾(𝐱, 𝐱 𝑖 ) 𝑖=1 • RBF kernel: 𝐾 𝐱, 𝐱𝑖 = 2 −𝛾||𝐱−𝐱 || 𝑖 2 𝑒 Univariate Gaussian 0.5 0.45 0.4 0.35 p(x) 0.3 0.25 0.2 0.15 0.1 0.05 0 x • Cost of prediction = O(DN) Composite Kernel Prediction • Non-linear SVM prediction 𝑁 𝑦 𝐱 = sign 𝛼𝑖 𝑦𝑖 𝐾(𝐱, 𝐱 𝑖 ) 𝑖=1 • LDKL’s composite kernel 𝐾 𝐱, 𝐱𝑖 = 𝐾𝑁 𝐱, 𝐱 𝑖 𝐾𝐿 𝐱, 𝐱𝑖 = 𝛟𝑡 𝐱 𝛟 𝐱 𝑖 𝐱 𝑡 𝐱 𝑖 • LDKL’s prediction function 𝑀 𝝓𝑘 𝐱 𝐰𝑘𝑡 𝐱 𝑦 𝐱 = sign 𝑘=1 Composite Kernel Prediction • LDKL’s prediction function: 𝑦 𝐱 = sign W2 W3 W1 𝑀 𝑘=1 𝛟𝑘 𝐱 𝐰𝑘𝑡 𝐱 Composite Kernel Prediction • LDKL’s prediction function: 𝑦 𝐱 = sign 𝛉2 𝛉1 𝛉3 𝑀 𝑘=1 𝛟𝑘 𝐱 𝐰𝑘𝑡 𝐱 Composite Kernel Prediction • LDKL’s prediction function: 𝑦 𝐱 = sign 𝛉2 𝛉1 W2 W3 W1 𝛉3 𝑀 𝑘=1 𝛟𝑘 𝐱 𝐰𝑘𝑡 𝐱 Composite Kernel Prediction • LDKL’s prediction function: 𝑦 𝐱 = sign 𝑀 𝑘=1 𝛟𝑘 𝐱 𝐰𝑘𝑡 𝐱 Composite Kernel Prediction • LDKL’s prediction function: 𝑦 𝐱 = sign 1 Tan Hyperbolic = 1 𝝓𝑘 𝐱 = tanh(𝜎𝜽𝒕𝒌 x) (x) 0.5 0 -0.5 -1 -4 -2 0 x 2 4 𝑀 𝑘=1 𝜙𝑘 𝐱 𝐰𝑘𝑡 𝐱 A Shallow Architecture ∗ 𝐰1𝑡 𝐱 ∗ 𝜽1𝑡 𝐱 𝐰2𝑡 𝐱 ∗ 𝜽𝑡2 𝐱 ⋯ 𝐱 ∗ ⋯ 𝑡 𝐰𝑀 𝐱 𝜽𝑡𝑀 𝐱 Learning Tree Structured Features 𝐱 • LDKL’s prediction function 𝑀 𝝓𝑘 𝐱 𝐰𝑘𝑡 𝐱 𝑦 𝐱 = sign 𝜙1 𝑘=1 Yes 𝐱 ∗ No 𝜙2 Yes 𝜙4 𝜙3 No 𝜙5 𝜙6 𝜙7 Learning Tree Structured Features • LDKL’s prediction function 𝑦 𝐱 = sign 𝐰1𝑡 𝐱 𝐱 𝐰1 𝐱 ∗ Learning Tree Structured Features 𝐱 • LDKL’s prediction function 𝑦 𝐱 = sign 𝐰1𝑡 𝐱 + 𝐰2𝑡 𝐱 𝐰1 𝛉1𝑡 𝐱 > 0 𝛉1 Yes 𝐱 No ∗ 𝐰2 𝐰3 Learning Tree Structured Features 𝐱 • LDKL’s prediction function 𝑦 𝐱 = sign 𝐰1𝑡 𝐱 + 𝐰2𝑡 𝐱 + 𝐰4𝑡 𝐱 𝐰1 𝛉1𝑡 𝐱 > 0 𝛉1 Yes 𝐱 ∗ No 𝐰2 𝛉𝑡2 𝐱 > 0 𝛉3 𝐰3 𝛉𝑡3 𝐱 > 0 Yes No 𝐰4 𝐰5 𝛉2 𝐰6 𝐰7 Comparison to a Perceptron Tree Comparison to a Decision Tree Comparison to a Decision Tree Learning Tree Structured Features 𝐱 • LDKL’s prediction function tanh 𝜎𝐯1𝑡 𝐱 𝐰1𝑡 𝐱 𝑦 𝐱 = sign + tanh 𝜎𝐯2𝑡 𝐱 𝐰2𝑡 𝐱 + tanh 𝜎𝐯4𝑡 𝐱 𝐰4𝑡 𝐱 𝐰1 , 𝐯1 𝛉1𝑡 𝐱 > 0 𝛉1 Yes 𝐱 ∗ No 𝐰2 , 𝐯2 𝛉𝑡2 𝐱 > 0 𝐰3 , 𝐯3 𝛉𝑡3 𝐱 > 0 Yes No 𝐰4 , 𝐯4 𝐰5 , 𝐯5 𝛉3 𝛉2 𝐰6 , 𝐯6 𝐰7 , 𝐯7 LDKL’s Decision Boundaries 𝐱 𝐰1 , 𝐯1 𝛉1𝑡 𝐱 > 0 𝐰1𝑡 𝐱 𝐰2 , 𝐯2 𝛉𝑡2 𝐱 > 0 … 𝐰4 , 𝐯4 … … ∗ 𝐯1𝑡 𝐱 𝐰3 , 𝐯3 𝛉𝑡3 𝐱 > 0 ∗ … ∗ … 𝐰5 , 𝐯5 … … ∗ 𝐰6 , 𝐯6 … … ∗ … ∗ 𝐰7 , 𝐯7 … … ∗ Training LDKL 𝐱 • LDKL’s prediction function tanh 𝜎𝐯1𝑡 𝐱 𝐰1𝑡 𝐱 𝑦 𝐱 = sign + tanh 𝜎𝐯2𝑡 𝐱 𝐰2𝑡 𝐱 + tanh 𝜎𝐯4𝑡 𝐱 𝐰4𝑡 𝐱 𝐰1 , 𝐯1 𝛉1𝑡 𝐱 > 0 𝛉1 Yes 𝐱 ∗ No 𝐰2 , 𝐯2 𝛉𝑡2 𝐱 > 0 𝐰3 , 𝐯3 𝛉𝑡3 𝐱 > 0 Yes No 𝐰4 , 𝐯4 𝐰5 , 𝐯5 𝛉3 𝛉2 𝐰6 , 𝐯6 𝐰7 , 𝐯7 Training LDKL 𝐱 • Smoothing the tree 𝐰1 , 𝐯1 𝑝(𝛉1𝑡 𝐱) 𝛉1 𝐱 ∗ 𝐰2 , 𝐯2 𝑝(𝛉𝑡2 𝐱) 𝐰3 , 𝐯3 𝑝(𝛉𝑡3 𝐱) 𝛉3 𝛉2 𝐰4 , 𝐯4 𝐰5 , 𝐯5 𝐰6 , 𝐯6 𝐰7 , 𝐯7 Training LDKL • Primal optimization via stochastic sub-gradient descent 𝜆𝐖 Tr 2 Min𝐕,𝐖,Θ 𝑃 = + 𝐖𝑡 𝐖 𝑖 max(0,1 𝜆𝐕 𝜆𝚯 𝑡 + Tr 𝐕 𝐕 + Tr 2 2 − 𝑦𝑖 𝛟𝒕𝑳 𝐱 𝑖 𝐖 𝒕 𝐱 𝑖 ) 𝚯𝑡 𝚯 • Sub-gradients 𝛁𝐰𝑘 𝑃 𝐱 𝑖 𝛁𝛉𝑘 𝑃 𝐱 𝑖 𝛁𝐯𝑘 𝑃 𝐱 𝑖 = 𝜆𝐖 𝐰𝑘 − 𝛿𝑖 𝑦𝑖 𝜙𝐿𝑘 𝐱 𝑖 𝐱 𝑖 = 𝜆𝚯 𝜽𝑘 − 𝛿𝑖 𝑦𝑖 𝑙 tanh 𝜎𝐯𝑙𝑡 𝐱 𝑖 𝛁𝛉𝑘 𝐼𝑙 𝐱 𝑖 𝐰𝑙𝑡 𝐱 𝑖 = 𝜆𝐕 𝐯𝑘 − 𝛿𝑖 𝑦𝑖 𝜎 1 − tanh2 𝜎𝐯𝑘𝑡 𝐱 𝑖 𝐼𝑘 𝐱 𝑖 𝐰𝑘𝑡 𝐱 𝑖 𝐱 𝑖 Accuracy Comparison: RBF vs Linear 100 95 90 85 80 75 RBF Linear 70 65 60 55 50 CoverType Letter MNIST CIFAR Banana IJCNN USPS MAGIC4 RBF-SVM: Cost of Prediction Normalized Prediction Time = RBF Time / Linear Time 1,00,000 10,000 1,000 100 10 1 LDKL’s Prediction Accuracy 100 95 90 85 80 75 70 65 60 55 50 RBF LDKL Linear LDKL’s Prediction Cost 1,00,000 10,000 1,000 100 10 1 RBF Time (Normalized) LDKL Time (Normalized) Training Time Comparison: RBF vs LDKL • Training time on a single core of a 2.68 Ghz Xeon processor with 8 GB RAM. Small Datasets (Timing in seconds) Large Datasets (Timing in minutes) 1000 10000 1000 100 100 10 10 1 1 IJCNN Letter RBF MAGIC4 LDKL USPS RBF LDKL Training Time Comparison: RBF vs LDKL • Training time in minutes on a single core of a 2.68 Ghz Xeon processor with 8 GB RAM. Data Set Linear SVM RBF-SVM LDKL BANANA CIFAR CoverType IJCNN Letter Magic04 MNIST USPS RCV1 MNIST8M 1.48E-05 6.76E-02 2.35 3.93E-02 5.75E-04 1.74E-03 2.86E-01 5.97E-03 0.13 0.7 0.006 1283.68 1369.96 0.45 0.43 0.17 39.12 0.75 - 0.01 0.278 7.990 0.090 2.200 0.047 1.376 0.096 0.5 65.21 Banana 90 80 70 60 50 CoverType 90 80 70 60 50 IJCNN 99 Axis Title 97 95 93 91 89 CIFAR10 79 74 69 64 59 54 49 Letter 100 90 80 70 60 50 MNIST 100 90 80 70 60 50 USPS 100 95 90 85 80 75 70 65 60 Magic04 90 85 80 75 70 65 60 LDKL’s Decision Boundaries The RBF-SVM’s Decision Boundaries LDKL: Mimicking the RBF-SVM Pruning LDKL Accuracy (%) Dataset Banana CoverType CIFAR IJCNN Letter Magic MNIST USPS Prediction Time (ms) # Nodes Original Pruned Original Pruned Original Pruned Tree Tree Tree Tree Tree Tree 89.53 89.50 1.08 1.07 15 12 90.22 90.26 104.38 96.9 255 230 76.14 76.14 38.53 37.41 7 7 98.20 98.20 43.2 38.38 15 9 95.94 95.83 6.68 4.82 1023 115 85.93 86.01 1.49 1.28 15 9 97.27 97.26 128.43 123.48 31 24 95.51 95.46 6.95 6.58 15 13 Generating Training Data • Sample points and have them labelled by the RBF-SVM Training LDKL on the Extended Data Set • Learn a balanced LDKL tree and then prune irrelevant nodes Adding One Node Adding Two Nodes Adding Three Nodes Adding Four Nodes Adding Five Nodes Adding Six Nodes Adding Seven Nodes Adding Eight Nodes Adding Nine Nodes LDKL’s Final Decision Boundaries Conclusions • Publications and code • ICML 2013 paper • Code: http://research.microsoft.com/~manik/code/LDKL/download.html • LDKL learns a local, deep composite kernel for efficient nonlinear SVM prediction • LDKL can be exponentially faster than the state-of-the-art • Efficiency is important during both training and prediction Acknowledgements • Samy Bengio • Purushottam Kar • Prateek Jain • Yann Lecun • Vinod Nair • Yashoteja Prabhu • Nishal Shah
© Copyright 2025 ExpyDoc