
Japanese Dependency Analysis
using Cascaded Chunking
Taku Kudo 工藤 拓
Yuji Matsumoto 松本 裕治
Nara Institute of Science and Technology,
JAPAN
Motivation

• Kudo, Matsumoto 2000 (VLC)
  - Presented a state-of-the-art Japanese dependency parser using SVMs (89.09% on the standard dataset)
  - Showed the high generalization performance and feature-selection ability of SVMs
• Problems
  - Not scalable: 2 weeks of training on 7,958 sentences; hard to train with larger data
  - Slow parsing: 2-3 sec./sentence; too slow for actual NL applications
Goal

Improve the scalability and the parsing efficiency without losing accuracy!

How?
• Apply the cascaded chunking model to dependency parsing and to the selection of training examples
  - Reduce the number of times SVMs are consulted in parsing
  - Reduce the number of negative examples learned
Outline

• Japanese dependency analysis
• Two models
  - Probabilistic model (previous)
  - Cascaded chunking model (new!)
• Features used for training and classification
• Experiments and results
• Conclusion and future work

Japanese Dependency Analysis (1/2)

• Analysis of the relationship between phrasal units called bunsetsu (segments), similar to base phrases in English
• Two constraints
  - Each segment modifies one of the segments to its right (Japanese is a head-final language)
  - Dependencies do not cross each other
Japanese Dependency Analysis (2/2)

Raw text
  私は彼女と京都に行きます
  (I go to Kyoto with her.)

Morphological analysis and bunsetsu identification
  私は / 彼女と / 京都に / 行きます
  I-top / with her / to Kyoto-loc / go

Dependency analysis
  私は / 彼女と / 京都に / 行きます
  (each of the first three segments modifies 行きます "go")
Probabilistic Model

Input
  私は1 / 彼女と2 / 京都に3 / 行きます4
  I-top / with her / to Kyoto-loc / go

1. Build a dependency matrix with ME, DT, or SVMs
   (how probable it is that one segment modifies another)
2. Search for the optimal dependencies that maximize the sentence
   probability using the CYK or chart algorithm (sketched below)

Output: dependency matrix

                   Modifiee
                 2     3     4
  Modifier  1   0.1   0.2   0.7
            2         0.2   0.8
            3               1.0
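The search in step 2 can be sketched as a small dynamic program over head-final, non-crossing trees. This is an illustrative reconstruction (the memoised recursion and all names are mine, not the authors' implementation):

```python
import math
from functools import lru_cache

def parse_probabilistic(probs, n):
    """Find heads maximizing the product of probs[(modifier, modifiee)]
    over projective, head-final trees (1-based segment indices)."""
    @lru_cache(maxsize=None)
    def best(i, j):
        # Best (log-score, arcs) for segments i..j-1, whose heads all fall in (i, j].
        if i >= j:
            return 0.0, ()
        return max(
            (math.log(probs[(i, h)]) + best(i + 1, h)[0] + best(h, j)[0],
             ((i, h),) + best(i + 1, h)[1] + best(h, j)[1])
            for h in range(i + 1, j + 1)      # candidate head of segment i
        )
    log_score, arcs = best(1, n)
    return dict(arcs), math.exp(log_score)

# Dependency matrix from the slide: 私は1 / 彼女と2 / 京都に3 / 行きます4
probs = {(1, 2): 0.1, (1, 3): 0.2, (1, 4): 0.7,
         (2, 3): 0.2, (2, 4): 0.8, (3, 4): 1.0}
heads, p = parse_probabilistic(probs, 4)
print(heads, p)   # {1: 4, 2: 4, 3: 4}, p ~ 0.56: every segment modifies 行きます
```

The nested loops over (i, h, j) make this O(n^3), which is exactly the parsing cost criticized on the next slides.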
Problems of Probabilistic Model (1/2)

• Selection of training examples: every candidate pair of two segments is used
  - Dependency relation → positive example
  - No dependency relation → negative example
• This straightforward selection requires a total of n(n-1)/2 training examples per sentence (where n is the number of segments)
• Difficult to combine the probabilistic model with SVMs, which require polynomial computational cost
Problems of Probabilistic Model (2/2)

• O(n^3) parsing time is necessary with CYK or chart parsing
• Even if beam search is applied, O(n^2) parsing time is always necessary
• The classification cost of SVMs is much higher than that of other ML algorithms such as ME and DT
Cascaded Chunking Model

• Based on cascaded chunking for English parsing [Abney 1991]
• Parses a sentence deterministically, deciding only whether the current segment modifies the segment on its immediate right-hand side
• Training examples are extracted using this algorithm itself (see the sketches below)

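A minimal sketch of the deterministic parsing loop (illustrative only; `classify` stands in for the trained SVM, and the stall guard at the end is my assumption to guarantee termination, not part of the published algorithm):

```python
def cascaded_chunking_parse(segments, classify):
    """Repeatedly ask a binary classifier whether each remaining segment
    modifies its immediate right-hand neighbour (tag D) or not (tag O);
    D-tagged segments get their head fixed and are removed (0-based indices)."""
    remaining = list(range(len(segments)))      # indices of undecided segments
    heads = {}                                  # modifier index -> head index
    while len(remaining) > 1:
        tags = [classify(segments, remaining, k)        # 'D' or 'O' for each
                for k in range(len(remaining) - 1)]     # adjacent remaining pair
        kept = []
        for k, idx in enumerate(remaining):
            if k < len(tags) and tags[k] == 'D':
                heads[idx] = remaining[k + 1]   # head = immediate right neighbour
            else:
                kept.append(idx)
        if len(kept) == len(remaining):         # nothing chunked: force the
            heads[remaining[-2]] = remaining[-1]  # rightmost pair (assumption)
            kept.remove(remaining[-2])
        remaining = kept
    return heads
```

Because decided modifiers drop out of `remaining`, later passes only compare a shrinking list of segments, which is where the efficiency gain over the probabilistic model comes from.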
Example: Training Phase

Annotated sentence
  彼は1 彼女の2 温かい3 真心に4 感動した。5
  He / her / warm / heart / be moved
  (He was moved by her warm heart.)

[Figure: the sentence is chunked pass by pass; in each pass every remaining
segment receives a D or O tag, with the tag decided by the annotated corpus,
until only one segment is left ("finish").]

Pairs of tag (D or O) and context (features) are stored as training data for
SVMs (sketched below).
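An illustrative sketch of this extraction step (not the authors' code; `features` is a stand-in extractor, and the condition that a head is kept until all of its own left modifiers are attached is my reading of the figure):

```python
def extract_training_examples(segments, gold_heads, features):
    """Replay the chunking cascade on an annotated sentence, deciding each
    D/O tag from the gold heads instead of the SVM, and collect
    (tag, feature_dict) pairs as SVM training data (0-based indices).
    Assumes a well-formed projective, head-final annotation."""
    remaining = list(range(len(segments)))
    examples = []
    while len(remaining) > 1:
        kept = []
        for k, idx in enumerate(remaining[:-1]):
            right = remaining[k + 1]
            # D iff idx really modifies its immediate right neighbour and no
            # still-undecided segment to its left modifies idx itself.
            still_needed = any(gold_heads[j] == idx for j in remaining[:k])
            tag = 'D' if gold_heads[idx] == right and not still_needed else 'O'
            examples.append((tag, features(segments, remaining, k)))
            if tag == 'O':
                kept.append(idx)
        kept.append(remaining[-1])          # the last segment is never a modifier
        remaining = kept
    return examples

# Example from the slide (0-based heads):
# extract_training_examples(['彼は', '彼女の', '温かい', '真心に', '感動した。'],
#                           {0: 4, 1: 3, 2: 3, 3: 4}, features)
```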
Example: Test Phase

Test sentence
  彼は1 彼女の2 温かい3 真心に4 感動した。5
  He / her / warm / heart / be moved
  (He was moved by her warm heart.)

[Figure: the same pass-by-pass chunking, but now each D or O tag is decided
by the SVMs built in the training phase, until only one segment is left
("finish").]
Advantages of the Cascaded Chunking Model

• Simple and efficient
  - Probabilistic: O(n^3) vs. cascaded chunking: O(n^2)
  - In practice lower than O(n^2), since most segments modify the segment on their immediate right-hand side
  - The number of training examples is much smaller
• Independent of the ML algorithm
  - Can be combined with any ML algorithm that works as a binary classifier
  - Probabilities of dependency are not necessary
Features

Modify or not?
        B               A             C
  彼の1 友人は2   この本を3   持っている4 女性を5 探している6
  his  friend-top this book-acc  have      lady-acc be looking for
       (modifier)                                   (modifiee)
  (His friend is looking for a lady who has this book.)

• Static features
  - Modifier/modifiee segments:
    head/functional word (surface, POS, POS-subcategory, inflection-type,
    inflection-form), brackets, quotations, punctuation, position
  - Between segments:
    distance, case particles, brackets, quotations, punctuation
• Dynamic features [Kudo, Matsumoto 2000]
  - A, B: static features of the functional word
  - C: static features of the head word
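A simplified sketch of how such a feature set might be assembled for one candidate pair (the field and key names are illustrative, not the authors' exact feature templates):

```python
def pair_features(modifier, modifiee, between, children_of):
    """Build a feature dict for a modifier/modifiee candidate pair.
    Segments are dicts with 'head_word' and 'func_word' entries carrying
    surface/POS fields; `children_of` returns the segments already attached
    to a segment earlier in the cascade (source of the dynamic features)."""
    feats = {}
    # Static features of the modifier and modifiee segments
    for role, seg in (('mod', modifier), ('head', modifiee)):
        for slot in ('head_word', 'func_word'):
            for attr in ('surface', 'pos', 'pos_sub', 'infl_type', 'infl_form'):
                feats[f'{role}:{slot}:{attr}'] = seg[slot].get(attr, 'NONE')
    # Static features between the two segments
    feats['between:distance'] = min(len(between) + 1, 5)     # bucketed distance
    feats['between:case_particles'] = '|'.join(
        s['func_word']['surface'] for s in between
        if s['func_word'].get('pos') == '助詞')               # particles in between
    # Dynamic features: segments already chunked onto either side
    feats['dyn:mod_children'] = '|'.join(
        c['func_word']['surface'] for c in children_of(modifier))
    feats['dyn:head_children'] = '|'.join(
        c['func_word']['surface'] for c in children_of(modifiee))
    return feats
```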
Experimental Setting

• Kyoto University Corpus 2.0/3.0
  - Standard data set
    Training: 7,958 sentences / Test: 1,246 sentences
    Same data as [Uchimoto et al. 98, Kudo, Matsumoto 00]
  - Large data set
    2-fold cross-validation using all 38,383 sentences
• Kernel function: 3rd-degree polynomial
• Evaluation
  - Dependency accuracy
  - Sentence accuracy
Results

Data Set  Model              Dependency  Sentence  # of training  # of training  Training      Parsing time
                             Acc. (%)    Acc. (%)  sentences      examples       time (hours)  (sec./sent.)
Standard  Cascaded Chunking  89.29       47.53     7,956          110,355        8             0.5
Standard  Probabilistic      89.09       46.17     7,956          459,105        336           2.1
Large     Cascaded Chunking  90.04       53.16     19,191         251,254        48            0.7
Large     Probabilistic      N/A         N/A       19,191         1,074,316      N/A           N/A
Effect of Dynamic Features (1/2)
Effect of Dynamic Features (2/2)

Modify or not?
        B               A             C
  彼の1 友人は2   この本を3   持っている4 女性を5 探している6
  his  friend-top this book-acc  have      lady-acc be looking for
       (modifier)                                   (modifiee)

Difference from the model with all dynamic features:

Deleted type of      Dependency Acc.  Sentence Acc.
dynamic features
A                    -0.28 %          -0.89 %
B                    -0.10 %          -0.89 %
C                    -0.28 %          -0.56 %
A,B                  -0.33 %          -1.21 %
A,C                  -0.55 %          -0.97 %
B,C                  -0.54 %          -1.61 %
A,B,C                -0.58 %          -2.34 %
Probabilistic vs. Cascaded Chunking (1/2)

The probabilistic model uses all candidate dependency pairs as training data:

  彼は1   この本を2      持っている3  女性を4   探している5
  He-top  this book-acc  have         lady-acc  be looking for
          modifier                              modifiee
  (He is looking for a lady who has this book.)

  Positive:  この本を2 → 持っている3
  Negative:  この本を2 → 探している5   (unnecessary)

In cascaded chunking, この本を2 is chunked as soon as it attaches to 持っている3,
so the pair (この本を2, 探している5) is never examined.
The probabilistic model commits to a number of unnecessary examples.
Probabilistic vs. Cascaded Chunking (2/2)

           Probabilistic                  Cascaded Chunking
Strategy   Maximize sentence probability  Shift-reduce, deterministic
Merit      Can see all candidates         Simple, efficient and scalable;
           of dependency                  as accurate as the prob. model
Demerit    Not efficient; commits to      Cannot see all the (posterior)
           unnecessary training examples  candidates of dependency
Conclusion

• A new Japanese dependency parser using a cascaded chunking model
• It outperforms the previous probabilistic model with respect to accuracy, efficiency and scalability
• Dynamic features contribute significantly to the performance

Future Work

• Coordinate structure analysis
  - Coordinate structures frequently appear in long Japanese sentences and make analysis hard
• Use of posterior context
  - Sentences like the following are hard to parse using the cascaded chunking model alone:
      僕の  母の      ダイヤの  指輪
      my    mother's  diamond   ring
Comparison with Related Work

                    Model                      Training Corpus       Acc. (%)
                                               (# of sentences)
Our Model           Cascaded Chunking + SVMs   Kyoto Univ. (19,191)  90.46
                                               Kyoto Univ. (7,956)   89.29
Kudo et al. 00      Prob. + SVMs               Kyoto Univ. (7,956)   89.09
Uchimoto et al. 00  Prob. + ME                 Kyoto Univ. (7,956)   87.93
Kanayama et al. 00  Prob. + ME + HPSG          EDR (192,778)         88.55
Haruno et al. 98    Prob. + DT + Boosting      EDR (50,000)          85.03
Fujio et al. 98     Prob. + ML                 EDR (190,000)         86.67
Support Vector Machines [Vapnik]

[Figure: two classes separated by the hyperplane w·x + b = 0, with supporting
hyperplanes w·x + b = +1 (y_i = +1) and w·x + b = -1 (y_i = -1).]

Maximize the margin d:
  2d = |w·x_i + b - 1| / ||w|| + |w·x_i + b + 1| / ||w|| = 2 / ||w||

Minimize:    L(w) = ||w||^2 / 2
Subject to:  y_i [ (w·x_i) + b ] ≥ 1

• Soft margin
• Kernel function
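A minimal sketch of training the binary D/O chunker with the 3rd-degree polynomial kernel used in the experiments, here with scikit-learn standing in for the original SVM implementation (the function name and the use of DictVectorizer are my assumptions):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import SVC

def train_chunker_svm(examples):
    """examples: list of (tag, feature_dict) pairs, e.g. the output of the
    extract_training_examples() sketch above."""
    tags, feats = zip(*examples)
    vec = DictVectorizer()
    X = vec.fit_transform(feats)            # sparse feature matrix
    clf = SVC(kernel='poly', degree=3)      # 3rd-degree polynomial kernel
    clf.fit(X, tags)                        # binary D/O classifier
    return vec, clf
```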