Applying Conditional Random Fields to Japanese Morphological Analysis
Taku Kudo (1)(*), Kaoru Yamamoto (2), Yuji Matsumoto (1)
(1) Nara Institute of Science and Technology
(2) CREST, Tokyo Institute of Technology
(*) Currently at NTT Communication Science Labs.

Backgrounds
- Conditional Random Fields [Lafferty 01]
  - a variant of Markov Random Fields
  - many applications: POS tagging [Lafferty 01], shallow parsing [Sha 03], NE recognition [McCallum 03], IE [Pinto 03, Peng 04]
- Japanese Morphological Analysis (JMA)
  - must cope with word segmentation
  - must incorporate many features
  - must minimize the influence of the length bias

Japanese Morphological Analysis
- INPUT: 東京都に住む (I live in the Tokyo Metropolis.)
- OUTPUT: 東京 / 都 / に / 住む
  - 東京 (Tokyo): NOUN-PROPER-LOC-GENERAL
  - 都 (Metro.): NOUN-SUFFIX-LOC
  - に (in): PARTICLE-GENERAL
  - 住む (live): VERB, BASE-FORM
- the task combines word segmentation (there are no explicit spaces in Japanese), POS tagging, and lemmatization/stemming

Simple approach to JMA
- character-based begin/inside (B/I) tagging: 東京都に住む → B I B B B I
  - a non-standard method in JMA
  - cannot directly reflect lexicons, even though over 90% accuracy can be achieved by naive longest-prefix matching against a lexicon
  - decoding is slow

Our approach to JMA
- assume that a lexicon is available
  - a word lattice represents all candidate outputs
  - this reduces redundant outputs
- unknown word processing
  - invoked when no matching word is found in the lexicon
  - based on character types, e.g., Chinese characters (kanji), hiragana, katakana, numbers, etc.

Problem Setting
- Input: 東京都に住む (I live in the Tokyo Metropolis)
- Lexicon entries: 東 (east, noun), 京 (capital, noun), 東京 (Tokyo, noun), 京都 (Kyoto, noun), 都 (Metro., suffix), に (in, particle), に (resemble, verb), 住む (live, verb), ...
- Lattice: BOS and EOS nodes with all lexicon matches in between; every path from BOS to EOS is a candidate analysis
- GOAL: select the optimal path out of all candidates
- Input: X;  Output: Y = ((w_1, t_1), ..., (w_#Y, t_#Y))
- NOTE: the number of tokens #Y varies from candidate to candidate

Long-standing Problems in JMA

Complex tagset
- the tagset is hierarchical, e.g., 京都 (Kyoto): Noun - Proper - Loc - General - Kyoto
- HMMs cannot capture this directly: how should the hidden classes be selected?
  - top level → lack of granularity
  - bottom level → data sparseness
- some functional particles should be lexicalized
- semi-automatic hidden-class selection [Asahara 00]

Complex tagset, cont.
- must capture a variety of features:

  word           POS level 1  level 2      level 3  level 4   lexical form   inflection
  京都 (Kyoto)   noun         proper       loc      general   京都 (Kyoto)   -
  に (in)        particle     general      φ        φ         に             -
  住む (live)    verb         independent  φ        φ         住む (live)    base-form

- relevant feature types: POS hierarchy, character types, prefixes/suffixes, lexicalization, overlapping features, inflections
- these features are important to JMA

JMA with MEMMs [Uchimoto 00-03]
- use a discriminative model, e.g., a maximum entropy model, to capture a variety of features
- sequential application of ME models: P(Y|X) = Π_i p(w_i, t_i | w_{i-1}, t_{i-1})
- e.g., P(に, particle | 都, suffix) > P(に, verb | 都, suffix);  P(東 | BOS) < P(東京 | BOS)

Problems of MEMMs: label bias [Lafferty 01]
- P(Y|X) = Π_i p(w_i, t_i | w_{i-1}, t_{i-1}), with locally normalized transitions
- toy lattice: BOS → A (0.6) or B (0.4); A → C (0.4) or D (0.6); B → E (1.0); C, D, E → EOS (1.0)
- P(A, D | x) = 0.6 × 0.6 × 1.0 = 0.36 < P(B, E | x) = 0.4 × 1.0 × 1.0 = 0.4
- paths through low-entropy states are preferred

Problems of MEMMs in JMA: length bias
- same toy lattice, but B is one long word that reaches EOS directly (B → EOS = 1.0)
- P(A, D | x) = 0.6 × 0.6 × 1.0 = 0.36 < P(B | x) = 0.4 × 1.0 = 0.4
- long words are preferred; the length bias has been ignored in JMA so far!
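To make the label-bias and length-bias arithmetic above concrete, here is a minimal Python sketch (not part of the original presentation; only the lattice topology and the 0.6/0.4/1.0 probabilities are taken from the toy figures above). It multiplies locally normalized transition probabilities along each path, the way an MEMM scores candidates, and reproduces the two comparisons: the path through the single-successor state B wins, and the path through the long word B wins, despite a worse first step in both cases.

```python
# Toy illustration (assumed lattices, matching the slide figures) of how
# locally normalized path scores produce label bias and length bias.
from functools import reduce

def path_prob(path, probs):
    """Multiply locally normalized transition probabilities along a path."""
    return reduce(lambda p, arc: p * probs[arc], zip(path, path[1:]), 1.0)

# Label-bias lattice: B has a single outgoing arc (low entropy).
label_bias = {
    ("BOS", "A"): 0.6, ("BOS", "B"): 0.4,
    ("A", "C"): 0.4, ("A", "D"): 0.6, ("B", "E"): 1.0,
    ("C", "EOS"): 1.0, ("D", "EOS"): 1.0, ("E", "EOS"): 1.0,
}
print(path_prob(["BOS", "A", "D", "EOS"], label_bias))  # 0.36
print(path_prob(["BOS", "B", "E", "EOS"], label_bias))  # 0.40 -> preferred

# Length-bias lattice: B is one long word, so its path has one fewer factor.
length_bias = {
    ("BOS", "A"): 0.6, ("BOS", "B"): 0.4,
    ("A", "C"): 0.4, ("A", "D"): 0.6,
    ("C", "EOS"): 1.0, ("D", "EOS"): 1.0, ("B", "EOS"): 1.0,
}
print(path_prob(["BOS", "A", "D", "EOS"], length_bias))  # 0.36
print(path_prob(["BOS", "B", "EOS"], length_bias))       # 0.40 -> long word preferred
```

Because each local distribution must sum to one, the model cannot reserve probability mass to penalize B later in the sequence; CRFs avoid this by normalizing once over the whole lattice, as the next part of the talk shows.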
Long-standing problems (summary)
- must incorporate a variety of features (overlapping features, POS hierarchy, lexicalization, character types) → HMMs are not sufficient
- must minimize the influence of the length bias, another bias observed especially in JMA → MEMMs are not sufficient
- Can CRFs solve these problems? Yes!

Use of CRFs for JMA

CRFs for the word lattice
- the same word lattice as before: BOS ... 東 / 京 / 東京 / 京都 / 都 / に [particle] / に [verb] / 住む ... EOS
- a path encodes a variety of unigram and bigram features, e.g., "BOS - noun", "noun - suffix", "noun / 東京" (lexicalized)
- the global feature vector F(Y, X) counts how often each feature fires on the path; Λ is the corresponding weight vector
- P(Y|X) = exp(Λ·F(Y, X)) / Z_X,  where  Z_X = Σ_{Y' ∈ Ω(X)} exp(Λ·F(Y', X))
- Ω(X): the set of all candidate paths for X

CRFs for the word lattice, cont.
- a single exponential model over entire paths:
  P(Y|X) = (1/Z_X) exp( Σ_i Σ_k λ_k f_k(⟨w_{i-1}, t_{i-1}⟩, ⟨w_i, t_i⟩) )
- example feature: f_1234(⟨w_{i-1}, t_{i-1}⟩, ⟨w_i, t_i⟩) = 1 if t_{i-1} = noun and t_i = particle, 0 otherwise
- fewer restrictions on feature design: a variety of features can be incorporated → solves the problems of HMMs

Encoding (parameter estimation)
- maximum likelihood estimation: Λ̂ = argmax_Λ L_Λ
- L_Λ = Σ_j log P(Y_j | X_j) = Σ_j [ Λ·F(Y_j, X_j) − log Σ_{Y' ∈ Ω(X_j)} exp(Λ·F(Y', X_j)) ]
- all candidate paths are taken into account in encoding, so the influence of the length bias is minimized → solves the problems of MEMMs
- a variant of the forward-backward algorithm [Lafferty 01] can also be applied to the word lattice

MAP estimation
- regularized objective: L_Λ = Σ_j log P(Y_j | X_j) − ||Λ||_1 / C  (L1-CRF)  or  − ||Λ||_2^2 / (2C)  (L2-CRF); C is a hyper-parameter
- L2-CRF (Gaussian prior): non-sparse solution (all features get non-zero weights); good if most of the given features are relevant; unconstrained optimizers, e.g., L-BFGS, are used
- L1-CRF (Laplacian prior): sparse solution (most features get zero weight); good if most of the given features are irrelevant; constrained optimizers, e.g., L-BFGS-B, are used

Decoding
- Ŷ = argmax_{Y ∈ Ω(X)} P(Y|X) = argmax_{Y ∈ Ω(X)} Λ·F(Y, X)
- solved with the Viterbi algorithm; essentially the same architecture as HMMs and MEMMs (a sketch appears after the results below)

Experiments

Data
- KC and RWCP, widely used Japanese annotated corpora (the figures below are for KC)
  - source: Mainichi News Article '95
  - lexicon (size): JUMAN 3.61 (1,983,173)
  - POS structure: 2-level POS, c-form, c-type, base
  - # training sentences: 7,958
  - # training tokens: 198,514
  - # test sentences: 1,246
  - # test tokens: 31,302
  - # features: 791,798

Features
- the feature types illustrated earlier for 京都 (Kyoto), に (in), 住む (live): POS hierarchy, character types, prefixes/suffixes, lexicalization, overlapping features, inflections
- example: f_1234(⟨w_{i-1}, t_{i-1}⟩, ⟨w_i, t_i⟩) = 1 if t_{i-1} = noun and t_i = particle, 0 otherwise

Evaluation
- F = 2 × recall × precision / (recall + precision)
- recall = # correct tokens / # tokens in the test corpus
- precision = # correct tokens / # tokens in the system output
- three criteria of correctness:
  - seg: word segmentation only
  - top: word segmentation + top level of POS
  - all: all information

Results
- significance tests: McNemar's paired test on the labeling disagreements

            seg     top     all
  L2-CRFs   98.96   98.31   96.75
  L1-CRFs   98.80   98.14   96.55
  HMMs      96.22   94.99   91.85
  MEMMs     96.44   95.81   94.28

- L1/L2-CRFs outperform HMMs and MEMMs
- L2-CRFs outperform L1-CRFs
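The following is a minimal sketch of the decoding step on a toy version of the 東京都に住む lattice. The node set, the two feature templates (a lexicalized unigram and a tag bigram), and all weights below are hypothetical illustrations, not the model trained in the experiments; the point is only that Λ·F(Y, X) decomposes into per-edge scores, so a Viterbi search over the word lattice recovers argmax_Y exactly as in HMMs and MEMMs.

```python
# Viterbi decoding over a toy word lattice (hypothetical nodes and weights).
from collections import defaultdict

# Each node covers the character span [start, end) of the input 東京都に住む.
nodes = [
    (0, 1, "東", "noun"), (0, 2, "東京", "noun"), (1, 2, "京", "noun"),
    (1, 3, "京都", "noun"), (2, 3, "都", "suffix"),
    (3, 4, "に", "particle"), (3, 4, "に", "verb"), (4, 6, "住む", "verb"),
]

# Hypothetical weights for lexicalized unigram and tag-bigram features.
unigram_w = defaultdict(float, {
    ("東京", "noun"): 2.0, ("京都", "noun"): 1.5, ("都", "suffix"): 1.0,
    ("に", "particle"): 1.0, ("住む", "verb"): 1.0,
})
bigram_w = defaultdict(float, {
    ("BOS", "noun"): 0.5, ("noun", "suffix"): 1.0, ("suffix", "particle"): 1.0,
    ("particle", "verb"): 1.0, ("noun", "particle"): 0.2, ("verb", "EOS"): 0.5,
})

def edge_score(prev_tag, node):
    """Λ·f summed over the features that fire on this lattice edge."""
    return bigram_w[(prev_tag, node[3])] + unigram_w[(node[2], node[3])]

def viterbi(nodes, length):
    best = {("BOS",): (0.0, None)}               # node -> (best score, back-pointer)
    for n in sorted(nodes, key=lambda x: x[1]):  # predecessors always end earlier
        preds = [("BOS",)] if n[0] == 0 else [p for p in nodes if p[1] == n[0]]
        cands = [(best[p][0] + edge_score(p[-1], n), p) for p in preds if p in best]
        if cands:
            best[n] = max(cands, key=lambda c: c[0])
    finals = [(best[n][0] + bigram_w[(n[3], "EOS")], n)
              for n in nodes if n[1] == length and n in best]
    score, node = max(finals, key=lambda c: c[0])
    path = []
    while node != ("BOS",):
        path.append(node)
        node = best[node][1]
    return score, list(reversed(path))

score, path = viterbi(nodes, 6)
print(score, [f"{w}/{t}" for _, _, w, t in path])
# -> 9.0 ['東京/noun', '都/suffix', 'に/particle', '住む/verb']
```

The same dynamic program, with sums of exponentiated edge scores instead of max, gives Z_X and the marginals needed during training; a worked example appears after the appendix at the end.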
Influence of the length bias

              # long word err.   # short word err.
  HMMs        306 (44%)          387 (56%)
  L2-CRFs      79 (40%)          120 (60%)
  MEMMs       416 (70%)          183 (30%)

- HMMs and CRFs: the relative ratios are not much different
- MEMMs: the number of long-word errors is large → influenced by the length bias

L1-CRFs vs. L2-CRFs
- L2-CRFs > L1-CRFs: most of the given features are relevant (POS hierarchies, suffixes/prefixes, character types)
- L1-CRFs produce a compact model: # of active features, L2: 791,798 vs. L1: 90,163 (about 11%)
- L1-CRFs are worth examining when there are practical constraints, e.g., on model size

Conclusions
- an application of CRFs to JMA: not character-based begin/inside tags, but a word lattice built with a lexicon
- CRFs offer an elegant solution to the problems with HMMs and MEMMs
  - can use a wide variety of features (hierarchical POS tags, inflections, character types, etc.)
  - can minimize the influence of the length bias (which had been ignored in JMA!)

Future work
- trigram features: using all trigrams is impractical, as they make decoding significantly slower; a practical feature selection method is needed, e.g., [McCallum 03]
- apply the approach to other non-segmented languages, e.g., Chinese or Thai

Appendix: CRF encoding
- a variant of the forward-backward algorithm [Lafferty 01] can also be applied to the word lattice
- expected feature count:
  E_{P(Y|X)}[F_k(Y, X)] = Σ_{⟨w', t'⟩, ⟨w, t⟩} α_{⟨w', t'⟩} · f_k(⟨w', t'⟩, ⟨w, t⟩) · exp(Σ_{k'} λ_{k'} f_{k'}(⟨w', t'⟩, ⟨w, t⟩)) · β_{⟨w, t⟩} / Z_X
- forward and backward scores:
  α_{⟨w, t⟩} = Σ_{⟨w', t'⟩ left-adjacent to ⟨w, t⟩} α_{⟨w', t'⟩} · exp(Σ_k λ_k f_k(⟨w', t'⟩, ⟨w, t⟩))
  β_{⟨w, t⟩} = Σ_{⟨w', t'⟩ right-adjacent to ⟨w, t⟩} β_{⟨w', t'⟩} · exp(Σ_k λ_k f_k(⟨w, t⟩, ⟨w', t'⟩))
- α_BOS = β_EOS = 1,  Z_X = α_EOS = β_BOS
- (a worked example on a toy lattice follows at the end)

Influence of the length bias, cont. (error examples)
- 海にかけたロマンは… ("The romance they bet on the sea is ...")
  MEMMs select 海 (sea) / に (particle) / かけた (bet) / ロマンは (romanticist), taking ロマンは as one long word; the correct analysis ends ... / ロマン (romance) / は (particle)
- 荒波に負けない心… ("A heart which beats rough waves is ...")
  MEMMs select 荒波 (rough waves) / に (particle) / 負け / ない心, taking ない心 as one long word; the correct analysis ends ... / ない (not) / 心 (heart)
- these errors are caused by the influence of the length bias rather than by missing features (CRFs analyze these sentences correctly)

Cause of label and length bias
- MEMMs use only the correct path in encoding (training)
- transition probabilities of unobserved paths are therefore distributed uniformly
- (illustrated on the 東京都に住む lattice: wrong transitions such as BOS → 東京 are never penalized during training)
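As a worked illustration of the appendix recursions, the sketch below runs forward-backward on a small hand-made lattice. The nodes, the adjacency, and the per-edge scores Σ_k λ_k f_k are all hypothetical; the sketch only shows that α_EOS = β_BOS = Z_X and how the expected count of a single feature, the quantity needed for the gradient of L_Λ, falls out of α, β, and Z_X.

```python
# Forward-backward on a toy word lattice (assumed nodes and edge scores).
import math
from collections import defaultdict

# Edges carry the linear score Σ_k λ_k f_k of the features firing on them.
edges = {
    ("BOS", "東/noun"): 0.5, ("BOS", "東京/noun"): 1.0,
    ("東/noun", "京都/noun"): 0.3, ("東京/noun", "都/suffix"): 2.0,
    ("京都/noun", "に/particle"): 0.5, ("都/suffix", "に/particle"): 1.5,
    ("に/particle", "住む/verb"): 1.2, ("住む/verb", "EOS"): 0.4,
}
order = ["BOS", "東/noun", "東京/noun", "京都/noun", "都/suffix",
         "に/particle", "住む/verb", "EOS"]

# Forward scores: α_BOS = 1, accumulated over left-adjacent nodes.
alpha = defaultdict(float, {"BOS": 1.0})
for n in order[1:]:
    for (l, r), s in edges.items():
        if r == n:
            alpha[n] += alpha[l] * math.exp(s)

# Backward scores: β_EOS = 1, accumulated over right-adjacent nodes.
beta = defaultdict(float, {"EOS": 1.0})
for n in reversed(order[:-1]):
    for (l, r), s in edges.items():
        if l == n:
            beta[n] += beta[r] * math.exp(s)

Z = alpha["EOS"]
assert math.isclose(Z, beta["BOS"])      # Z_X = α_EOS = β_BOS

# Expected count of one feature under P(Y|X): assumed to fire only on the
# 都/suffix -> に/particle edge (a suffix-particle tag bigram).
l, r = "都/suffix", "に/particle"
expectation = alpha[l] * math.exp(edges[(l, r)]) * beta[r] / Z
print(Z, expectation)
```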