Applying Conditional Random Fields to Japanese Morphological Analysis
Taku Kudo 1*, Kaoru Yamamoto 2, Yuji Matsumoto 1
1 Nara Institute of Science and Technology
2 CREST, Tokyo Institute of Technology
* Currently, NTT Communication Science Labs.
Background

- Conditional Random Fields [Lafferty 01]
  - a variant of Markov Random Fields
  - many applications: POS tagging [Lafferty 01], shallow parsing [Sha 03], NE recognition [McCallum 03], IE [Pinto 03, Peng 04]
- Japanese Morphological Analysis (JMA)
  - must cope with word segmentation
  - must incorporate many features
  - must minimize the influence of the length bias
Japanese Morphological Analysis

INPUT:  東京都に住む (I live in the Tokyo Metropolis.)
OUTPUT: 東京 / 都 / に / 住む

  東京 (Tokyo)   NOUN-PROPER-LOC-GENERAL
  都   (Metro.)  NOUN-SUFFIX-LOC
  に   (in)      PARTICLE-GENERAL
  住む (live)    VERB, BASE-FORM

The task involves:
- word segmentation (no explicit spaces in Japanese)
- POS tagging
- lemmatization, stemming
Simple approach for JMA

- Character-based begin / inside (B/I) tagging
  東/B 京/I 都/B に/B 住/B む/I
- A non-standard method in JMA
  - cannot directly reflect lexicons
    (over 90% accuracy can be achieved by naive longest-prefix matching with a lexicon, as sketched below)
  - decoding is slow
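To make the baseline above concrete, here is a minimal Python sketch of naive longest-prefix matching; the toy LEXICON and the max_len cutoff are illustrative assumptions, not the JUMAN dictionary used in the experiments.

```python
# Naive longest-prefix matching against a lexicon (baseline mentioned above).
# The toy lexicon and max_len cutoff are illustrative, not the JUMAN dictionary.
LEXICON = {"東", "京", "東京", "京都", "都", "に", "住む"}

def longest_match_segment(text, lexicon, max_len=4):
    """Greedily take the longest lexicon entry starting at each position."""
    tokens, i = [], 0
    while i < len(text):
        match = None
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                match = candidate
                break
        if match is None:          # no lexicon entry: fall back to a 1-char token
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens

print(longest_match_segment("東京都に住む", LEXICON))
# -> ['東京', '都', 'に', '住む']  (the greedy choice happens to be correct here)
```

Greedy matching fails whenever the longest dictionary entry is not the right segmentation, which is one reason the lattice-based search below keeps all candidates instead of committing early.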
Our approach for JMA

- Assume that a lexicon is available → word lattice
  - represents all candidate outputs
  - reduces redundant outputs
- Unknown word processing
  - invoked when no matching word can be found in the lexicon
  - uses character types, e.g., Chinese characters (kanji), hiragana, katakana, numbers, etc. (see the sketch below)
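As a rough illustration of the character-type cue, the following sketch classifies characters by Unicode range; the ranges are a simplification assumed here, not the exact rule used by the analyzer.

```python
# Character-type detection, used to hypothesize unknown-word candidates when
# nothing in the lexicon matches.  Unicode ranges below are a simplification.
def char_type(ch):
    code = ord(ch)
    if 0x3041 <= code <= 0x3096:
        return "hiragana"
    if 0x30A1 <= code <= 0x30FA or code == 0x30FC:   # include the prolonged sound mark
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"                               # Chinese characters
    if ch.isdigit():
        return "number"
    return "other"

print([char_type(c) for c in "東京タワー2024へ"])
# -> ['kanji', 'kanji', 'katakana', 'katakana', 'katakana',
#     'number', 'number', 'number', 'number', 'hiragana']
```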
Problem Setting

Input: 東京都に住む (I live in the Tokyo Metropolis)

[Lattice figure: all candidate (word, POS) nodes between BOS and EOS, e.g.
東 (east) [noun], 京 (capital) [noun], 東京 (Tokyo) [noun], 京都 (Kyoto) [noun],
都 (Metro.) [suffix], に (in) [particle], に (resemble) [verb], 住む (live) [verb],
built from a lexicon containing 東 noun, 京 noun, 東京 noun, 京都 noun, …
(a toy lattice-construction sketch follows below)]

GOAL: select the optimal path out of all candidates

Input:  X
Output: Y = \langle (w_1, t_1), \ldots, (w_{\#Y}, t_{\#Y}) \rangle
NOTE: the number of tokens #Y varies
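The lattice sketched in the figure can be enumerated from a lexicon as follows; the toy (word, POS) entries are illustrative and unknown-word candidates are omitted.

```python
# Enumerate every lexicon entry that matches at each input position; this is
# the set of candidate (word, tag) nodes the figure above depicts.
LEXICON = {"東": ["noun"], "京": ["noun"], "東京": ["noun"], "京都": ["noun"],
           "都": ["suffix"], "に": ["particle", "verb"], "住む": ["verb"]}

def build_lattice(text, lexicon, max_len=4):
    """Return {start: [(word, tag, end), ...]} for every start position."""
    lattice = {i: [] for i in range(len(text))}
    for i in range(len(text)):
        for length in range(1, min(max_len, len(text) - i) + 1):
            word = text[i:i + length]
            for tag in lexicon.get(word, []):
                lattice[i].append((word, tag, i + length))
    return lattice

for start, arcs in build_lattice("東京都に住む", LEXICON).items():
    print(start, arcs)
# position 0 yields 東/noun and 東京/noun, position 1 yields 京/noun and 京都/noun, etc.
```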
Long-standing Problems in JMA
Complex tagset

- Hierarchical tagset
  e.g., 京都 (Kyoto): Noun → Proper → Loc → General → Kyoto
- HMMs cannot capture it → how should the hidden classes be selected?
  - TOP level → lack of granularity
  - bottom level → data sparseness
  - some functional particles should be lexicalized
  - semi-automatic hidden class selection [Asahara 00]
Complex tagset, cont.

- Must capture a variety of features

  token          POS hierarchy                    lexical form   inflection
  京都 (Kyoto)   noun - proper - loc - general    Kyoto          φ
  に (in)        particle - general - φ - φ       に             φ
  住む (live)    verb - independent - φ - φ       live           base-form

- These overlapping features are important to JMA:
  POS hierarchy, character types, prefixes / suffixes, lexicalization, inflections
JMA with MEMMs [Uchimoto 00-03]

- Use a discriminative model, e.g., a maximum entropy (ME) model, to capture a variety of features
- Sequential application of ME models:

  P(Y | X) = \prod_i p(w_i, t_i | w_{i-1}, t_{i-1})

[Lattice figure: local decisions such as
P(に, particle | 都, suffix) > P(に, verb | 都, suffix) and P(東 | BOS) < P(東京 | BOS)]
Problems of MEMMs

- Label bias [Lafferty 01]

  P(Y | X) = \prod_i p(w_i, t_i | w_{i-1}, t_{i-1})

[Figure: a toy lattice where state A (probability 0.6 from BOS) has two outgoing
transitions (0.4 to C, 0.6 to D), while state B (probability 0.4 from BOS) has a
single outgoing transition (1.0 to E)]

  P(A, D | x) = 0.6 * 0.6 * 1.0 = 0.36
  P(B, E | x) = 0.4 * 1.0 * 1.0 = 0.4
  → P(A, D | x) < P(B, E | x): paths with low entropy are preferred
Problems of MEMMs in JMA

- Length bias

  P(Y | X) = \prod_i p(w_i, t_i | w_{i-1}, t_{i-1})

[Figure: the same toy lattice, but B is a single long word that spans the input
covered by the two short words A and D]

  P(A, D | x) = 0.6 * 0.6 * 1.0 = 0.36
  P(B | x)    = 0.4 * 1.0       = 0.4
  → P(A, D | x) < P(B | x): long words are preferred
- The length bias has been ignored in JMA!
Long-standing problems

- Must incorporate a variety of features
  - overlapping features, POS hierarchy, lexicalization, character types
  → HMMs are not sufficient
- Must minimize the influence of the length bias
  - another bias observed especially in JMA
  → MEMMs are not sufficient

Can CRFs solve these problems?  Yes!
Use of CRFs for JMA
CRFs for word lattice

[Lattice figure: the same candidate lattice for 東京都に住む between BOS and EOS]

- A path encodes a variety of uni-gram and bi-gram features,
  e.g., BOS - noun, noun - suffix, noun / Tokyo

  Global feature vector  F(Y, X) = ( … 1 … 1 … 1 … )
  Parameter vector       Λ       = ( … 3 … 20 … 20 … )

  P(Y | X) = \frac{\exp(\Lambda \cdot F(Y, X))}{Z_X},
  \qquad
  Z_X = \sum_{Y' \in \mathcal{Y}(X)} \exp(\Lambda \cdot F(Y', X))

  where \mathcal{Y}(X) is the set of all candidate paths for X.
CRFs for word lattice, cont.

- A single exponential model over the entire path:

  P(Y | X) = \frac{1}{Z_X} \exp\Big( \sum_i \sum_k \lambda_k f_k(w_{i-1}, t_{i-1}, w_i, t_i) \Big)

  e.g.,
  f_{1234}(w_{i-1}, t_{i-1}, w_i, t_i) = 1 if t_{i-1} = noun and t_i = particle, 0 otherwise

  (a toy computation of this score follows below)

- Fewer restrictions in the feature design
  → can incorporate a variety of features
  → can solve the problems of HMMs
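To make the globally normalized score concrete, here is a small sketch that computes Λ·F(Y, X) as a sum of local bigram features and normalizes over an explicitly enumerated candidate set; the feature templates and weights are invented for illustration.

```python
import math

# CRF path score: the global feature vector F(Y, X) is the sum of local bigram
# features over consecutive (word, tag) pairs, and P(Y|X) normalizes
# exp(Λ·F(Y, X)) over all candidate paths.  Templates and weights are toy values.

def local_features(prev, cur):
    """Binary features fired on a single lattice transition (w', t') -> (w, t)."""
    (pw, pt), (w, t) = prev, cur
    return {f"bigram:{pt}->{t}", f"unigram:{t}/{w}"}

def path_score(path, weights):
    """Λ·F(Y, X) = sum of the weights of all features fired along the path."""
    pairs = zip([("BOS", "BOS")] + path, path + [("EOS", "EOS")])
    return sum(weights.get(f, 0.0) for prev, cur in pairs
               for f in local_features(prev, cur))

def path_probability(path, all_paths, weights):
    z = sum(math.exp(path_score(p, weights)) for p in all_paths)
    return math.exp(path_score(path, weights)) / z

candidates = [
    [("東京", "noun"), ("都", "suffix"), ("に", "particle"), ("住む", "verb")],
    [("東", "noun"), ("京都", "noun"), ("に", "particle"), ("住む", "verb")],
]
weights = {"bigram:noun->suffix": 2.0, "unigram:noun/東京": 1.5}
print(path_probability(candidates[0], candidates, weights))
```

In the actual model Z_X is computed with the forward-backward recursion shown at the end of the slides rather than by enumerating the candidate paths.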
Encoding

- Maximum likelihood estimation:

  \hat{\Lambda} = \arg\max_{\Lambda} L_\Lambda

  L_\Lambda = \sum_{j=1}^{N} \log P(Y_j | X_j)
            = \sum_{j=1}^{N} \Big( \Lambda \cdot F(Y_j, X_j)
              - \log \sum_{Y' \in \mathcal{Y}(X_j)} \exp(\Lambda \cdot F(Y', X_j)) \Big)

- All candidate paths are taken into account in encoding
  → the influence of the length bias will be minimized
  → can solve the problems of MEMMs
- A variant of Forward-Backward [Lafferty 01] can also be applied to the word lattice
  (the gradient used by the optimizer is written out below)
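For reference, the gradient a gradient-based optimizer uses for this objective is the standard CRF difference between observed and expected feature counts (this follows the usual CRF derivation rather than being copied from the slide); the expectation term is exactly what the forward-backward pass at the end computes:

  \frac{\partial L_\Lambda}{\partial \lambda_k}
    = \sum_{j=1}^{N} \Big( F_k(Y_j, X_j) - E_{P(Y | X_j)}\big[ F_k(Y, X_j) \big] \Big)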
MAP estimation

  L_\Lambda = C \sum_{j=1}^{N} \log P(Y_j | X_j) -
    \begin{cases} \|\Lambda\|_1 & \text{(L1-CRF)} \\ \|\Lambda\|_2^2 / 2 & \text{(L2-CRF)} \end{cases}

- L2-CRF (Gaussian prior)
  - non-sparse solution (all features have non-zero weights)
  - good if most given features are relevant
  - unconstrained optimizers, e.g., L-BFGS, are used
- L1-CRF (Laplacian prior)
  - sparse solution (most features have zero weights)
  - good if most given features are irrelevant
  - constrained optimizers, e.g., L-BFGS-B, are used
- C is a hyper-parameter
Decoding

  \hat{Y} = \arg\max_{Y \in \mathcal{Y}(X)} P(Y | X)
          = \arg\max_{Y \in \mathcal{Y}(X)} \Lambda \cdot F(Y, X)

- Viterbi algorithm (a sketch over the word lattice follows below)
- essentially the same search architecture as HMMs and MEMMs
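A minimal sketch of Viterbi search over the word-lattice format used in the earlier build_lattice sketch; the score function stands in for Λ·f on each transition, the transition feature into EOS is omitted, and everything here is illustrative rather than the actual implementation.

```python
# Viterbi over a word lattice {start: [(word, tag, end), ...]}.  `score(prev, cur)`
# stands in for the edge score Λ·f; the transition into EOS is omitted for brevity.
def viterbi(text, lattice, score):
    BOS = ("BOS", "BOS")
    best = {(0, BOS): (0.0, None)}               # (pos, node) -> (score, backpointer)
    for pos in range(len(text)):                 # left-to-right positions = topological order
        for (p, node), (s, _) in list(best.items()):
            if p != pos:
                continue
            for word, tag, end in lattice.get(pos, []):
                key, new = (end, (word, tag)), s + score(node, (word, tag))
                if key not in best or new > best[key][0]:
                    best[key] = (new, (p, node))
    # pick the best node reaching the end of the input, then backtrace
    state = max((k for k in best if k[0] == len(text)), key=lambda k: best[k][0])
    path = []
    while state[1] != BOS:
        path.append(state[1])
        state = best[state][1]
    return list(reversed(path))

# With the toy lattice from the earlier sketch and simple tag-bigram weights:
# weights = {"noun->suffix": 1.0, "suffix->particle": 1.0, "particle->verb": 1.0}
# score = lambda prev, cur: weights.get(f"{prev[1]}->{cur[1]}", 0.0)
# viterbi("東京都に住む", build_lattice("東京都に住む", LEXICON), score)
# -> [('東京', 'noun'), ('都', 'suffix'), ('に', 'particle'), ('住む', 'verb')]
```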
Experiments
Data

KC and RWCP, widely-used Japanese annotated corpora; statistics for KC:

  source                 Mainichi News Article '95
  lexicon (size)         JUMAN 3.61 (1,983,173)
  POS structure          2-level POS, c-form, c-type, base form
  # training sentences   7,958
  # training tokens      198,514
  # test sentences       1,246
  # test tokens          31,302
  # features             791,798
Features

  f_{1234}(w_{i-1}, t_{i-1}, w_i, t_i) = 1 if t_{i-1} = noun and t_i = particle, 0 otherwise

  token          POS hierarchy                    lexical form   inflection
  京都 (Kyoto)   noun - proper - loc - general    Kyoto          φ
  に (in)        particle - general - φ - φ       に             φ
  住む (live)    verb - independent - φ - φ       live           base-form

- Feature types: POS hierarchy, character types, prefixes / suffixes,
  lexicalization, overlapping features, inflections
Evaluation

  F = 2 * recall * precision / (recall + precision)
  recall    = # correct tokens / # tokens in test corpus
  precision = # correct tokens / # tokens in system output
  (a toy computation of these metrics follows below)

- Three criteria of correctness:
  - seg: word segmentation only
  - top: word segmentation + top level of POS
  - all: all information
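A small sketch of the token-level precision / recall / F computation; the (start, end, tag) token representation and the toy sentences are assumptions for illustration.

```python
# Token-based evaluation: a system token counts as correct only if its span
# (and, depending on the criterion, its POS information) matches a gold token.
def prf(gold_tokens, sys_tokens):
    """gold_tokens / sys_tokens: lists of (start, end, tag) tuples."""
    correct = len(set(gold_tokens) & set(sys_tokens))
    recall = correct / len(gold_tokens)
    precision = correct / len(sys_tokens)
    f = 2 * recall * precision / (recall + precision) if correct else 0.0
    return precision, recall, f

gold = [(0, 2, "noun"), (2, 3, "suffix"), (3, 4, "particle"), (4, 6, "verb")]
sys_ = [(0, 1, "noun"), (1, 3, "noun"), (3, 4, "particle"), (4, 6, "verb")]
print(prf(gold, sys_))   # -> (0.5, 0.5, 0.5)
```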
Results

Significance tests: McNemar's paired test on the labeling disagreements

            seg     top     all
  L2-CRFs   98.96   98.31   96.75
  L1-CRFs   98.80   98.14   96.55
  HMMs      96.22   94.99   91.85
  MEMMs     96.44   95.81   94.28

- L1/L2-CRFs outperform HMMs and MEMMs
- L2-CRFs outperform L1-CRFs
Influence of the length bias

            # long word err.   # short word err.
  HMMs      306 (44%)          387 (56%)
  L2-CRFs    79 (40%)          120 (60%)
  MEMMs     416 (70%)          183 (30%)

- HMMs, CRFs: the relative ratios are not much different
- MEMMs: the number of long word errors is large
  → influenced by the length bias
L1-CRFs vs. L2-CRFs

- L2-CRFs > L1-CRFs
  - most given features are relevant
    (POS hierarchies, suffixes / prefixes, character types)
- L1-CRFs produce a compact model
  - # of active features: L2: 791,798 vs. L1: 90,163 (about 11%)
- L1-CRFs are worth examining when there are practical constraints
Conclusions

- An application of CRFs to JMA
  - does not use character-based begin / inside tags,
    but a word lattice built with a lexicon
- CRFs offer an elegant solution to the problems with HMMs and MEMMs
  - can use a wide variety of features
    (hierarchical POS tags, inflections, character types, etc.)
  - can minimize the influence of the length bias
    (a bias that has been ignored in JMA!)
Future work

- Tri-gram features
  - using all tri-grams is impractical, as it makes decoding significantly slower
  → a practical feature selection method is needed, e.g., [McCallum 03]
- Apply the approach to other non-segmented languages, e.g., Chinese or Thai
CRFs encoding

- A variant of Forward-Backward [Lafferty 01] can also be applied to the word lattice:

  E_{P(Y|X)}[F_k(Y, X)] =
    \sum_{\langle w', t' \rangle, \langle w, t \rangle}
      \frac{\alpha_{\langle w', t' \rangle}
            \cdot f_k(\langle w', t' \rangle, \langle w, t \rangle)
            \cdot \exp\big( \sum_{k'} \lambda_{k'} f_{k'}(\langle w', t' \rangle, \langle w, t \rangle) \big)
            \cdot \beta_{\langle w, t \rangle}}{Z_X}

  where \alpha and \beta are accumulated over the lattice,

  \alpha_{\langle w, t \rangle} =
    \sum_{\langle w', t' \rangle} \alpha_{\langle w', t' \rangle}
      \exp\big( \sum_k \lambda_k f_k(\langle w', t' \rangle, \langle w, t \rangle) \big)

  \alpha_{BOS} = \beta_{EOS} = 1, \qquad Z_X = \alpha_{EOS} = \beta_{BOS}

  (a toy forward-pass sketch follows below)
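A minimal sketch of the forward (alpha) pass on a tiny lattice; edges carry exp(Σ_k λ_k f_k) as their weight, the edge list is assumed to be in topological order, and both the lattice and the weights are invented for illustration.

```python
import math
from collections import defaultdict

# Forward pass on a word lattice: alpha accumulates path mass from BOS, and
# Z_X is alpha at EOS.  Edges are (prev_node, node, weight) with
# weight = exp(sum_k lambda_k f_k); nodes and weights here are toy values.
def forward(edges):
    alpha = defaultdict(float)
    alpha["BOS"] = 1.0
    for prev, node, weight in edges:          # edges listed in topological order
        alpha[node] += alpha[prev] * weight
    return alpha

edges = [
    ("BOS", "東京/noun", math.exp(1.0)),
    ("BOS", "東/noun",   math.exp(0.2)),
    ("東/noun", "京都/noun", math.exp(0.5)),
    ("東京/noun", "都/suffix", math.exp(1.5)),
    ("都/suffix", "に/particle", math.exp(1.0)),
    ("京都/noun", "に/particle", math.exp(0.3)),
    ("に/particle", "住む/verb", math.exp(0.8)),
    ("住む/verb", "EOS", 1.0),
]
alpha = forward(edges)
print("Z_X =", alpha["EOS"])   # normalizer used in P(Y|X) and in the expectations
```

The beta pass is symmetric from EOS, and the per-edge products alpha * exp(.) * beta / Z_X give the expectations in the formula above.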
Influence of the length bias, cont.

- "The romance on the sea they bet is …"
  MEMMs select: ロマンは (romanticist)
  correct:      ロマン (romance) / は (particle)
  context:      海 (sea) / に (particle) / かけた (bet)

- "A heart which beats rough waves is …"
  MEMMs select: ない心 (one's heart)
  correct:      ない (not) / 心 (heart)
  context:      荒波 (rough waves) / に (particle) / 負け (lose)

- These errors are caused rather by the influence of the length bias
  (CRFs analyze these sentences correctly)
Cause of label and length bias

- MEMMs use only the correct path in encoding
  → transition probabilities of unobserved paths are distributed uniformly

[Lattice figure: the candidate lattice for 東京都に住む; only the transitions on
the correct path BOS - 東京 [noun] - 都 [suffix] - に [particle] - … are observed
in training]