Fast Methods for Kernel-based Text Analysis
Taku Kudo
Yuji Matsumoto
NAIST (Nara Institute of Science and Technology)

41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan
Background
- Kernel methods (e.g., SVM) have become popular
- Prior knowledge can be incorporated independently of the machine learning algorithm by supplying a task-dependent kernel (a generalized dot product)
- High accuracy
Problem
- Kernel-based text analyzers are too slow for real NL applications (e.g., QA or text mining) because of their inefficiency at testing time
- Some kernel-based parsers run at only 2-3 seconds per sentence
Goals
- Build fast but still accurate kernel-based text analyzers
- Make it possible to use them in a wider range of NL applications
Outline
- Polynomial Kernel of degree d
- Fast Methods for the Polynomial Kernel
  - PKI
  - PKE
- Experiments
- Conclusions and Future Work
Kernel Methods
$f(X) = \sum_{i=1}^{L} \alpha_i \, \phi(X_i) \cdot \phi(X) = \sum_{i=1}^{L} \alpha_i K(X_i, X)$

Training data: $T = \{X_1, X_2, \dots, X_L\}$

- No need to represent an example as an explicit feature vector
- Complexity of testing is $O(L \cdot |X|)$
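To make the decision function concrete, here is a minimal sketch (the function names are ours; the toy data anticipates the example used later in the talk) of kernelized classification over set-valued examples, using the simple set-intersection kernel introduced on the next slides:

```python
def intersection_kernel(x1, x2):
    """K(X1, X2) = |X1 intersect X2| for set-valued examples."""
    return len(x1 & x2)

def f(x, support_vectors, alphas, kernel):
    """f(X) = sum_i alpha_i * K(X_i, X); testing costs O(L * |X|)."""
    return sum(a * kernel(sv, x) for a, sv in zip(alphas, support_vectors))

svs = [{"a", "b", "c"}, {"a", "b", "d"}, {"b", "c", "d"}]
alphas = [1.0, 0.5, -2.0]
print(f({"a", "c", "e"}, svs, alphas, intersection_kernel))  # 1*2 + 0.5*1 - 2*1 = 0.5
```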
Kernels for Sets (1/3)
- Focus on the special case where examples are represented as sets
- Instances in NLP are usually represented as sets (e.g., bag-of-words)

Feature set: $F = \{i_1, i_2, \dots, i_N\}$
Training data: $T = \{X_1, X_2, \dots, X_L\}$, where $X_j \subseteq F$
Kernels for Sets (2/3)
$X_1 = \{a, b, c, d\}, \quad X_2 = \{a, b, d, e\}$

Simple definition: $K(X_1, X_2) = |X_1 \cap X_2| = |\{a, b, d\}| = 3$

Combinations (subsets) of features:
- 2nd order: $\{\{a, b\}, \{a, d\}, \{b, d\}\}$
- 3rd order: $\{\{a, b, d\}\}$
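The higher-order combinations above are simply the r-element subsets of the intersection; a quick illustration (our own) with itertools:

```python
from itertools import combinations

x1, x2 = {"a", "b", "c", "d"}, {"a", "b", "d", "e"}
common = sorted(x1 & x2)  # ['a', 'b', 'd']
for r in (2, 3):
    print(r, [set(s) for s in combinations(common, r)])
# 2 [{'a', 'b'}, {'a', 'd'}, {'b', 'd'}]
# 3 [{'a', 'b', 'd'}]
```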
Kernels for Sets (3/3)
Dependent (+1) or independent (-1)?

[Figure: "I/PRP ate/VBD a/DT cake/NN", with "ate" marked as head and "cake" as modifier]

Basic features:
X = { Head-word: ate, Head-POS: VBD, Modifier-word: cake, Modifier-POS: NN }

Heuristic selection of feature combinations:
X = { ..., Head-POS/Modifier-POS: VBD/NN, Head-word/Modifier-POS: ate/NN, ... }

- Subsets (combinations) of basic features are critical for improving overall accuracy in many NL tasks
- Previous approaches select combinations heuristically
Polynomial Kernel of degree d
Implicit form:

$K_d(X_1, X_2) = (|X_1 \cap X_2| + 1)^d, \quad d \in \{1, 2, 3, \dots\}$

Explicit form:

$K_d(X_1, X_2) = \sum_{r=0}^{d} c_d(r) \, |P_r(X_1 \cap X_2)|$

- $P_r(X)$ is the set of all subsets of $X$ with exactly $r$ elements
- $c_d(r)$ is the prior weight given to subsets of size $r$ (the subset weight):

$c_d(r) = \sum_{l=r}^{d} \binom{d}{l} \left( \sum_{m=0}^{r} (-1)^{r-m} \binom{r}{m} m^l \right)$
Example (Cubic Kernel d=3 )
$X_1 = \{a, b, c, d\}, \quad X_2 = \{a, b, d, e\}$

Implicit form:

$K_3(X_1, X_2) = (|X_1 \cap X_2| + 1)^3 = (3 + 1)^3 = 64$

Explicit form:

$c_3(0) = 1, \quad P_0(X_1 \cap X_2) = \{\emptyset\}$
$c_3(1) = 7, \quad P_1(X_1 \cap X_2) = \{\{a\}, \{b\}, \{d\}\}$
$c_3(2) = 12, \quad P_2(X_1 \cap X_2) = \{\{a, b\}, \{a, d\}, \{b, d\}\}$
$c_3(3) = 6, \quad P_3(X_1 \cap X_2) = \{\{a, b, d\}\}$

$K_3(X_1, X_2) = 1 \cdot 1 + 7 \cdot 3 + 12 \cdot 3 + 6 \cdot 1 = 64$

Subsets of size up to 3 are used as new features.
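A short sketch (ours, not from the talk) that computes the subset weights c_d(r) from the formula two slides back and checks that the explicit form reproduces the implicit one on this example:

```python
from math import comb

def c(d, r):
    """c_d(r) = sum_{l=r}^{d} C(d,l) * sum_{m=0}^{r} (-1)^(r-m) C(r,m) m^l."""
    return sum(
        comb(d, l) * sum((-1) ** (r - m) * comb(r, m) * m ** l for m in range(r + 1))
        for l in range(r, d + 1)
    )

d = 3
x1, x2 = {"a", "b", "c", "d"}, {"a", "b", "d", "e"}
n = len(x1 & x2)                                            # |X1 intersect X2| = 3
implicit = (n + 1) ** d                                     # 4^3 = 64
explicit = sum(c(d, r) * comb(n, r) for r in range(d + 1))  # |P_r| = C(n, r)
print([c(d, r) for r in range(d + 1)], implicit, explicit)  # [1, 7, 12, 6] 64 64
```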
Toy Example
Feature set: F = {a, b, c, d, e}

Examples (#SVs L = 3):

 j | α_j | X_j
 1 | 1   | {a, b, c}
 2 | 0.5 | {a, b, d}
 3 | -2  | {b, c, d}

Kernel: $K_3(X_1, X_2) = (|X_1 \cap X_2| + 1)^3$

Test example: X = {a, c, e}
PKB (Baseline)
$K(X, X') = (|X \cap X'| + 1)^3$

 j | α_j | X_j       | K(X_j, X)
 1 | 1   | {a, b, c} | (2 + 1)^3
 2 | 0.5 | {a, b, d} | (1 + 1)^3
 3 | -2  | {b, c, d} | (1 + 1)^3

Test example: X = {a, c, e}

$f(X) = 1 \cdot (2+1)^3 + 0.5 \cdot (1+1)^3 - 2 \cdot (1+1)^3 = 15$

Complexity is always $O(L \cdot |X|)$
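The same loop as the earlier sketch, now with the implicit cubic kernel, reproduces the slide's value of 15 (toy data as above):

```python
svs = [{"a", "b", "c"}, {"a", "b", "d"}, {"b", "c", "d"}]
alphas = [1.0, 0.5, -2.0]
x = {"a", "c", "e"}

# PKB: evaluate the kernel against every support vector, O(L * |X|).
print(sum(a * (len(sv & x) + 1) ** 3 for a, sv in zip(alphas, svs)))  # 27 + 4 - 16 = 15.0
```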
PKI (Inverted Representation)
$K(X, X') = (|X \cap X'| + 1)^3$

 j | α_j | X_j
 1 | 1   | {a, b, c}
 2 | 0.5 | {a, b, d}
 3 | -2  | {b, c, d}

Inverted index (B = average entry size):
 a → {1, 2}
 b → {1, 2, 3}
 c → {1, 3}
 d → {2, 3}

Test example: X = {a, c, e}

$f(X) = 1 \cdot (2+1)^3 + 0.5 \cdot (1+1)^3 - 2 \cdot (1+1)^3 = 15$

- Average complexity is $O(B \cdot |X| + L)$
- Efficient if the feature space is sparse
- Suitable for many NL tasks
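A sketch (ours) of PKI: an inverted index from features to support-vector ids lets us accumulate the intersection sizes by touching only the SVs that share a feature with X, then finish with the O(L) weighted sum:

```python
from collections import defaultdict

svs = [{"a", "b", "c"}, {"a", "b", "d"}, {"b", "c", "d"}]
alphas = [1.0, 0.5, -2.0]

# Inverted index: feature -> ids of the support vectors containing it.
index = defaultdict(set)
for j, sv in enumerate(svs):
    for item in sv:
        index[item].add(j)

def pki_classify(x, d=3):
    """Accumulate |X_j ∩ X| via the index (O(B*|X|) on average), then sum (O(L))."""
    counts = defaultdict(int)
    for item in x:
        for j in index.get(item, ()):  # features unseen in training hit nothing
            counts[j] += 1
    return sum(a * (counts.get(j, 0) + 1) ** d for j, a in enumerate(alphas))

print(pki_classify({"a", "c", "e"}))  # 15.0
```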
PKE (Expanded Representation)
$f(X) = \sum_{i=1}^{L} \alpha_i K(X_i, X) = \sum_{i=1}^{L} \alpha_i \, \phi(X_i) \cdot \phi(X) = w \cdot \phi(X), \quad \text{where } w = \sum_{i=1}^{L} \alpha_i \, \phi(X_i)$

- Convert to linear form by calculating the vector $w$ in advance
- $\phi(X)$ projects $X$ into the space of its subsets
PKE (Expanded Representation)
$K(X, X') = (|X \cap X'| + 1)^3$, with $c_3(0) = 1$, $c_3(1) = 7$, $c_3(2) = 12$, $c_3(3) = 6$

 j | α_j | X_j
 1 | 1   | {a, b, c}
 2 | 0.5 | {a, b, d}
 3 | -2  | {b, c, d}

e.g., w({b, d}) = 12 · (0.5 - 2) = -18

W (expansion table):

 s         | c_3(|s|) | w
 φ         | 1        | -0.5
 {a}       | 7        | 10.5
 {b}       | 7        | -3.5
 {c}       | 7        | -7
 {d}       | 7        | -10.5
 {a, b}    | 12       | 18
 {a, c}    | 12       | 12
 {a, d}    | 12       | 6
 {b, c}    | 12       | -12
 {b, d}    | 12       | -18
 {c, d}    | 12       | -24
 {a, b, c} | 6        | 6
 {a, b, d} | 6        | 3
 {a, c, d} | 6        | 0
 {b, c, d} | 6        | -12

Test example: X = {a, c, e}
Subsets of X: {φ, {a}, {c}, {e}, {a, c}, {a, e}, {c, e}, {a, c, e}}

$f(X) = -0.5 + 10.5 - 7 + 12 = 15$

- Complexity is $O(|X|^d)$, independent of the number of SVs (L)
- Efficient if the number of SVs is large
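A sketch (ours) of PKE on the toy data: precompute w(s) for every subset of every support vector, then classify by summing the weights of the subsets of X, with no kernel evaluations at test time:

```python
from itertools import combinations
from math import comb

def c(d, r):
    """Subset weight c_d(r), as defined earlier."""
    return sum(
        comb(d, l) * sum((-1) ** (r - m) * comb(r, m) * m ** l for m in range(r + 1))
        for l in range(r, d + 1)
    )

def subsets(x, d):
    """All subsets of x with at most d elements, as sorted tuples (dict keys)."""
    items = sorted(x)
    for r in range(d + 1):
        yield from combinations(items, r)

def build_expansion_table(svs, alphas, d=3):
    """w(s) = c_d(|s|) * sum of alpha_i over the support vectors containing s."""
    w = {}
    for a, sv in zip(alphas, svs):
        for s in subsets(sv, d):
            w[s] = w.get(s, 0.0) + c(d, len(s)) * a
    return w

svs = [{"a", "b", "c"}, {"a", "b", "d"}, {"b", "c", "d"}]
alphas = [1.0, 0.5, -2.0]
w = build_expansion_table(svs, alphas)
print(w[("b", "d")])  # 12 * (0.5 - 2) = -18.0

# Test time: one table lookup per subset of X, O(|X|^d), independent of L.
print(sum(w.get(s, 0.0) for s in subsets({"a", "c", "e"}, 3)))  # 15.0
```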
PKE in Practice
- It is hard to calculate the expansion table exactly
- Use an approximated expansion table: subsets with small |w| can be removed, since |w| represents a subset's contribution to the final classification
- Use a subset mining (a.k.a. basket mining) algorithm for efficient calculation
Subset Mining Problem
Transaction database:

 id | set
 1  | {a, c, d}
 2  | {a, b, c}
 3  | {a, b, d}
 4  | {b, c, e}

Results (σ = 2):
 {a}:3  {b}:3  {c}:3  {d}:2
 {a, b}:2  {b, c}:2  {a, c}:2  {a, d}:2

- Extract all subsets that occur in no fewer than σ sets of the transaction database
- With σ = 1 and no size constraints, the problem is NP-hard
- Efficient algorithms have been proposed (e.g., Apriori, PrefixSpan)
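A minimal Apriori-style miner (our sketch; real implementations such as Apriori or PrefixSpan are far more efficient) run on the transaction database above with σ = 2:

```python
def apriori(transactions, sigma):
    """Return every itemset occurring in >= sigma transactions. Apriori idea:
    a frequent itemset of size r+1 must extend a frequent itemset of size r."""
    def support(s):
        return sum(s <= t for t in transactions)

    items = sorted(set().union(*transactions))
    frequent, level = {}, {frozenset([i]) for i in items}
    while level:
        counted = {s: support(s) for s in level}
        survivors = {s: n for s, n in counted.items() if n >= sigma}
        frequent.update(survivors)
        # Candidate generation: extend each survivor by one more item.
        level = {s | {i} for s in survivors for i in items if i not in s}
    return frequent

db = [{"a", "c", "d"}, {"a", "b", "c"}, {"a", "b", "d"}, {"b", "c", "e"}]
for s, n in sorted(apriori(db, 2).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(s), n)
# ['a'] 3, ['b'] 3, ['c'] 3, ['d'] 2, then the four frequent pairs (as on the slide)
```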
Feature Selection as Mining
With σ = 10:

Training data:

 i | α_i | X_i
 1 | 1   | {a, b, c}
 2 | 0.5 | {a, b, d}
 3 | -2  | {b, c, d}

Exhaustive generation and testing of all 15 entries of the expansion table is impractical. Instead, directly generate the subsets with |w| ≥ σ using subset mining:

 s         | w
 {a}       | 10.5
 {d}       | -10.5
 {a, b}    | 18
 {a, c}    | 12
 {b, c}    | -12
 {b, d}    | -18
 {c, d}    | -24
 {b, c, d} | -12

- The approximated table can be built efficiently
- σ controls the rate of approximation
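In spirit (our simplification: the talk's method prunes during mining rather than filtering a fully materialized table), the approximation keeps only the subsets whose weight magnitude clears σ:

```python
# Full expansion table for the toy data (values worked out on the earlier slides).
w = {(): -0.5, ("a",): 10.5, ("b",): -3.5, ("c",): -7.0, ("d",): -10.5,
     ("a", "b"): 18.0, ("a", "c"): 12.0, ("a", "d"): 6.0, ("b", "c"): -12.0,
     ("b", "d"): -18.0, ("c", "d"): -24.0, ("a", "b", "c"): 6.0,
     ("a", "b", "d"): 3.0, ("a", "c", "d"): 0.0, ("b", "c", "d"): -12.0}

sigma = 10
w_approx = {s: v for s, v in w.items() if abs(v) >= sigma}
print(len(w_approx), sorted(w_approx, key=lambda s: (len(s), s)))
# 8 subsets survive: {a}, {d}, {a,b}, {a,c}, {b,c}, {b,d}, {c,d}, {b,c,d}
```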
Experimental Settings
Three NL tasks:
- English Base-NP Chunking (EBC)
- Japanese Word Segmentation (JWS)
- Japanese Dependency Parsing (JDP)

Kernel settings:
- Quadratic kernel (d = 2) applied to EBC
- Cubic kernel (d = 3) applied to JWS and JDP
Results (English Base-NP Chunking)

               | Time (Sec./Sent.) | Speedup Ratio | F-score
 PKB           | .164              | 1.0           | 93.84
 PKI           | .020              | 8.3           | 93.84
 PKE (σ=.01)   | .0016             | 105.2         | 93.79
 PKE (σ=.005)  | .0016             | 101.3         | 93.85
 PKE (σ=.001)  | .0017             | 97.7          | 93.84
 PKE (σ=.0005) | .0017             | 96.8          | 93.84
Results (Japanese Word Segmentation)

               | Time (Sec./Sent.) | Speedup Ratio | Accuracy (%)
 PKB           | .85               | 1.0           | 97.94
 PKI           | .49               | 1.7           | 97.94
 PKE (σ=.01)   | .0024             | 358.2         | 97.93
 PKE (σ=.005)  | .0028             | 300.1         | 97.95
 PKE (σ=.001)  | .0034             | 242.6         | 97.94
 PKE (σ=.0005) | .0035             | 238.8         | 97.94
Results (Japanese Dependency Parsing)

               | Time (Sec./Sent.) | Speedup Ratio | Accuracy (%)
 PKB           | .285              | 1.0           | 89.29
 PKI           | .0226             | 12.6          | 89.29
 PKE (σ=.01)   | .0042             | 66.8          | 88.91
 PKE (σ=.005)  | .0060             | 47.8          | 89.05
 PKE (σ=.001)  | .0086             | 33.3          | 89.26
 PKE (σ=.0005) | .0090             | 31.8          | 89.29
Results (Summary)
- 2-12 fold speedup with PKI
- 30-300 fold speedup with PKE
- Accuracy is preserved when an appropriate σ is set
Comparison with Related Work

XQK [Isozaki et al. 02]:
- Same concept as PKE
- Designed only for the quadratic kernel
- Exhaustively creates the expansion table

PKE:
- Designed for general polynomial kernels
- Uses subset mining algorithms to create the expansion table
Conclusions
- Proposed two fast methods for the polynomial kernel of degree d:
  - PKI (inverted)
  - PKE (expanded)
- 2-12 fold speedup with PKI, 30-300 fold speedup with PKE
- Accuracy is preserved
Future Work
- Examine the effectiveness on general machine learning datasets
- Apply PKE to other convolution kernels, e.g., the Tree Kernel [Collins 00]:
  - Dot product between trees
  - The feature space is all subtrees
  - Apply a subtree mining algorithm [Zaki 02]
English Base-NP Chunking
Extract non-overlapping noun phrases from text:

[NP He ] reckons [NP the current account deficit ] will narrow to [NP only # 1.8 billion ] in [NP September ] .

BIO representation (treating chunking as a tagging task):
- B: beginning of a chunk
- I: non-initial token inside a chunk
- O: outside any chunk

- Pairwise method for the 3-class problem
- Training: WSJ sections 15-18, Test: WSJ section 20 (standard set)
Japanese Word Segmentation
Sentence: 太郎は花子に本を読ませた ("Taro made Hanako read a book"), with candidate word boundaries between adjacent characters

$Y_i = +1$ if there is a word boundary between characters $i$ and $i+1$, otherwise $Y_i = -1$

$X_i = \{c_{i-2}, c_{i-1}, c_i, c_{i+1}, c_{i+2}, c_{i+3}\}$

- The relative position of each character is distinguished
- The character types of Japanese are also used
- Training: KUC 01-08, Test: KUC 09
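A sketch (our reading of the feature definition above; the character-type features are omitted) of extracting the position-tagged character window for one candidate boundary:

```python
def boundary_features(sentence, i):
    """Features for the boundary between characters i and i+1: the window
    c_{i-2}..c_{i+3}, each character tagged with its relative position."""
    feats = set()
    for offset in range(-2, 4):
        if 0 <= i + offset < len(sentence):
            feats.add(f"{offset}:{sentence[i + offset]}")
    return feats

print(sorted(boundary_features("太郎は花子に本を読ませた", 2)))
# ['-1:郎', '-2:太', '0:は', '1:花', '2:子', '3:本']
```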
Japanese Dependency Parsing
Example: 私は (I-top) ケーキを (cake-acc.) 食べる (eat), "I eat a cake"

- Identify the correct dependency relations between two bunsetsu (roughly, base phrases)
- Linguistic features related to the modifier and the head (word, POS, POS-subcat, inflections, punctuation, etc.)
- Binary classification (+1: dependent, -1: independent)
- Cascaded Chunking Model [Kudo et al. 02]
- Training: KUC 01-08, Test: KUC 09
Kernel Methods (1/2)
Suppose a learning task $g: X \to \{-1, +1\}$ with training examples $T = \{X_1, \dots, X_L\}$:

$g(X) = \mathrm{sgn}(f(X)), \quad f(X) = \sum_{i=1}^{L} \alpha_i \, \phi(X_i) \cdot \phi(X)$

- $X$: the example to be classified
- $X_i$: training examples
- $\alpha_i$: weight of each example
- $\phi$: a function mapping examples to another vector space
PKE (Expanded Representation)
$f(X) = \sum_{i=1}^{L} \alpha_i \left( \sum_{r=0}^{d} c_d(r) \, |P_r(X_i \cap X)| \right)$

If we calculate in advance, for all subsets $s \in \Gamma_d(F) = \bigcup_{r=0}^{d} P_r(F)$ (where $I(\cdot)$ is the indicator function):

$w(s) = \sum_{i=1}^{L} \alpha_i \, c_d(|s|) \, I\left(s \in P_{|s|}(X_i)\right)$

then

$f(X) = \sum_{s \in \Gamma_d(X)} w(s), \quad \text{where } \Gamma_d(X) = \bigcup_{r=0}^{d} P_r(X)$
TRIE representation
The pruned table (σ = 10):

 s         | w
 {a}       | 10.5
 {d}       | -10.5
 {a, b}    | 18
 {a, c}    | 12
 {b, c}    | -12
 {b, d}    | -18
 {c, d}    | -24
 {b, c, d} | -12

[Figure: the same table stored as a TRIE rooted at φ, with edges labeled by features a, b, c, d and the weights w(s) attached to the nodes]

- Compresses redundant structures (shared prefixes are stored once)
- Classification can be done by simply traversing the TRIE
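A sketch (ours) of the idea: subsets are stored as sorted tuples, and classification walks the trie over the sorted features of X so that shared prefixes are visited only once:

```python
def build_trie(weights):
    """Nested-dict trie over sorted feature tuples; '$' stores w(s) at a node."""
    root = {}
    for s, v in weights.items():
        node = root
        for item in s:
            node = node.setdefault(item, {})
        node["$"] = v
    return root

def classify(trie, x):
    """Sum w(s) over every subset s of X present in the trie, by DFS."""
    items = sorted(x)

    def dfs(node, start):
        total = node.get("$", 0.0)
        for k in range(start, len(items)):
            child = node.get(items[k])
            if child is not None:
                total += dfs(child, k + 1)
        return total

    return dfs(trie, 0)

w_approx = {("a",): 10.5, ("d",): -10.5, ("a", "b"): 18.0, ("a", "c"): 12.0,
            ("b", "c"): -12.0, ("b", "d"): -18.0, ("c", "d"): -24.0,
            ("b", "c", "d"): -12.0}
print(classify(build_trie(w_approx), {"a", "c", "e"}))
# 10.5 + 12.0 = 22.5 (the exact f(X) is 15; pruning dropped w(φ) and w({c}))
```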