Graph-based Bilingual Phrase Sense Disambiguation for Statistical Machine Translation
Mamoru Komachi
2008-06-04

Background
  - The success of supervised ML methods depends on annotated corpora
    - Hard to maintain
  - Weakly supervised methods require only a small amount of tagged data
    - Can reduce the amount of human effort
  - Remaining problems
    - WSD is crucial to weakly supervised methods (cf. semantic drift)
    - Parallel corpora (and dictionaries) may help disambiguate word senses
    - WSD models in SMT systems are gaining much attention (Carpuat and Wu, 2007)

WSD and SMT
  - "Improving Statistical Machine Translation using Word Sense Disambiguation" (Carpuat and Wu, EMNLP 2007)
  - SMT is known to suffer from inaccurate lexical choice (based on a Senseval-style sense inventory)
    - Domain adaptation problem
    - Input is typically a word
    - Limited contextual features

Phrase-based WSD models for SMT
  - Sense annotations are derived from the phrase alignment learned during SMT training
  - WSD senses come from the SMT phrasal translation lexicon (the "phrase table")
  - Not only words but also phrases are disambiguated
  - Supervised WSD (an ensemble of naïve Bayes, maximum entropy (ME), boosting, and kernel PCA models)

Phrase table is highly ambiguous
  - Phrase table constructed from the NTCIR-7 J-E parallel corpus
    - 0.5 to 1 GB (gzip-compressed)
    - 2.53 candidates per phrase (3.24 candidates per phrase for phrases shorter than 5 words)
    - Includes function words as well
  - Example entries:
      Plant ||| 工場
      Plant ||| 植物
      Plant ||| 設備
      Plant ||| 発電 プラント
      Plant ||| 工場 内
      Plant ||| 制御 対象
      Plant ||| 動植物
      Plant ||| 供給 プラント
  - "Plant" has 120 translations in the phrase table!
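The ambiguity figures for the phrase table (candidates per phrase) could be measured roughly as in the sketch below. It assumes only the `source ||| target` layout of the entries shown on the slide; any further `|||`-separated fields are ignored. The function names `load_phrase_table` and `mean_candidates` are hypothetical, and the sample holds just the eight "Plant" entries listed above, not the real NTCIR-7 table.

```python
from collections import defaultdict


def load_phrase_table(lines):
    """Map each source phrase to the set of its target-side candidates."""
    table = defaultdict(set)
    for line in lines:
        fields = [field.strip() for field in line.split("|||")]
        if len(fields) < 2 or not fields[0] or not fields[1]:
            continue  # skip blank or malformed lines
        source, target = fields[0], fields[1]
        table[source].add(target)
    return table


def mean_candidates(table):
    """Average number of distinct translation candidates per source phrase."""
    if not table:
        return 0.0
    return sum(len(targets) for targets in table.values()) / len(table)


# Sample taken from the slide (assumption: one entry per line).
sample = [
    "Plant ||| 工場",
    "Plant ||| 植物",
    "Plant ||| 設備",
    "Plant ||| 発電 プラント",
    "Plant ||| 工場 内",
    "Plant ||| 制御 対象",
    "Plant ||| 動植物",
    "Plant ||| 供給 プラント",
]

table = load_phrase_table(sample)
print(len(table["Plant"]), mean_candidates(table))
```

On a gzip-compressed table, the same function could be fed the lines of a file opened with `gzip.open(path, "rt", encoding="utf-8")`.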

Motivation
  - Propose a novel graph-based approach to phrase sense disambiguation
    - Can exploit bilingual contextual patterns
  - Evaluate phrase sense disambiguation within an SMT framework

Monolingual bootstrapping
  - Pioneered by Yarowsky (1995)
  - Learns decision lists from a small set of seed instances (input: instances I; output: classifier)
  - "One sense per discourse" constraint

Bootstrapping
  - Iteratively conducts pattern induction and instance extraction, starting from seed instances
  - Can expand a small set of seed instances
  - [Figure: the instances "vaio", "Toshiba satellite", and "HP xb3000" are matched against a query log (corpus), giving the contexts "Compare vaio laptop", "Compare toshiba satellite laptop", and "Compare HP xb3000 laptop", which generalize to the contextual pattern "Compare # laptop" (#: slot)]

Bilingual bootstrapping
  - "Word Translation Disambiguation Using Bilingual Bootstrapping" (Li and Li, ACL 2002)
  - [Figure: Japanese candidates (… 工場, 植物 …) and English candidates (… Mill, Plant, Vegetable …), with a corpus on each language side feeding a WSD classifier]

Formalization of bootstrapping
  - Score vector of the seed instance: $i_0 = (0, \ldots, 1, \ldots, 0)$
  - Pattern-instance matrix $P$: the $(p, i)$ element is the co-occurrence of pattern $p$ and instance $i$
  - [Figure: example binary pattern-instance matrix]
  - Iterate: $p_n = P\, i_n$, $\quad i_{n+1} = P^T p_n$
  - With the instance similarity matrix $A = P^T P$, applying this step recursively gives $i_n = A^n i_0$
  - Output the instances ranked by final score when the stopping criterion is met

Bilingual phrase sense disambiguation
  - The $(p, i)$ element of the pattern-instance matrix $P$ is the co-occurrence between pattern $p$ and instance $i$
    - $p$: contextual features from both language sides, with phrase alignment from GIZA++
    - $i$: candidate (monolingual) phrase to disambiguate
  - $A = P^T P$
  - Similarity is given by the regularized Laplacian kernel
  - Predict the final sense by k-NN given the target instance

Regularized Laplacian kernel
  - Laplacian of graph $G$: $L = D - A$
    - $A$: adjacency matrix
    - $D$: diagonal degree matrix, whose $i$-th diagonal element is $D(i,i) = \sum_j A(i,j)$
  - Regularized Laplacian matrix: $R_\beta = \sum_{n=0}^{\infty} \beta^n (-L)^n = (I + \beta L)^{-1}$
    - $\beta$: diffusion factor
  - Obtained from the Neumann kernel matrix $K_\beta = A \sum_{n=0}^{\infty} \beta^n A^n$ by using $-L$ in place of $A$ and dropping the leading factor $A$ on the right-hand side

NTCIR-7 Patent Translation Task
  - Large-scale Japanese-English parallel corpus
    - 2M sentences (comparable to A-E, C-E MT)
    - Mainly technical documents
  - Timeline
    - 2008.01: dry run
    - 2008.05: formal run
    - 2008.12: final meeting
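The graph-based scoring from the formalization and kernel slides can be sketched as follows: build the instance similarity matrix $A = P^T P$, form the Laplacian $L = D - A$, and score instances with $R_\beta\, i_0$ where $R_\beta = (I + \beta L)^{-1}$. This is a minimal pure-Python sketch, not the author's implementation. The toy matrix `P`, the value `beta=0.1`, and all function names are assumptions made for illustration, and the scores are obtained by solving $(I + \beta L)\,x = i_0$ instead of inverting the matrix explicitly.

```python
def similarity_matrix(P):
    """A = P^T P for a pattern-by-instance matrix P (rows are patterns)."""
    n = len(P[0])
    return [[sum(row[i] * row[j] for row in P) for j in range(n)]
            for i in range(n)]


def laplacian(A):
    """L = D - A, with D(i, i) = sum_j A(i, j)."""
    n = len(A)
    return [[(sum(A[i]) if i == j else 0.0) - A[i][j] for j in range(n)]
            for i in range(n)]


def solve(M, b):
    """Solve M x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [[float(v) for v in M[r]] + [float(b[r])] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[r][n] / aug[r][r] for r in range(n)]


def regularized_laplacian_scores(P, seed, beta=0.1):
    """Return R_beta @ seed, where R_beta = (I + beta * L)^{-1}."""
    L = laplacian(similarity_matrix(P))
    n = len(L)
    M = [[(1.0 if i == j else 0.0) + beta * L[i][j] for j in range(n)]
         for i in range(n)]
    return solve(M, seed)


# Toy data (assumption): 3 patterns x 3 instances, instance 0 is the seed.
P = [
    [1, 1, 0],  # pattern 0 co-occurs with instances 0 and 1
    [0, 1, 1],  # pattern 1 co-occurs with instances 1 and 2
    [0, 0, 1],  # pattern 2 co-occurs with instance 2 only
]
seed = [1, 0, 0]

scores = regularized_laplacian_scores(P, seed)
ranking = sorted(range(len(scores)), key=lambda i: -scores[i])
print(ranking, scores)
```

In this toy graph the seed keeps the highest score, its direct neighbour comes second, and the instance two steps away comes last. A k-NN step over such scores would then choose the sense of the nearest labelled instances.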

NAIST-NTT at NTCIR-7
  - Bilingual dictionary extracted (solely) from Wikipedia
    - Used langlinks from the Wikipedia DB
    - 1:n translations are expanded to n bilingual phrase pairs (en, ja)
    - Extracted ~200,000 pairs
    - 12,000 pairs appear in the training corpus (8.8%)
    - 44.7% of word tokens are covered by the automatically constructed bilingual lexicon (GIZA++)
    - Learned 1,193 (0.6%) novel translations

Proposed method (training not yet finished…)
  - Extract translation pairs relevant to the given domain (patent translation task)
  - Construct a pattern-instance matrix P
    - Pattern features: bag-of-words features and link features extracted from Wikipedia ja-en abstracts
    - Instances: translation pairs (en, ja)
    - Seed instances: 40 translation pairs from the target domain
  - Apply the Laplacian kernel

Evaluation (BLEU score)

  Output file                      -Wikipedia   +Wikipedia
  Fmlrun-int.en.out                28.24        27.28
  Fmlrun-int.ja.out                26.39        26.48
  Fmlrun-int.ja.out.recased        25.34        25.47
  Fmlrun-int.ja-out.detokenized    20.38        20.52

  - E-J translation (en) gets worse with the Wikipedia dictionary
  - J-E translation (ja) performs slightly better with the Wikipedia dictionary than without

Future work
  - Implement bilingual phrase sense disambiguation
  - Evaluate this method on the IWSLT 2006 J-E/E-J and NTCIR-7 J-E/E-J datasets

Future work (2)
  - Automatic extraction of a biomedical lexicon starting from a life-science dictionary (mining from MedLine, etc…)
  - Summarization (Harendra's work)…
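The dictionary-extraction step on the NAIST-NTT slide (expanding 1:n translations into n (en, ja) pairs, then checking which pairs appear in the training corpus) might look like the sketch below. The input shape (a dict from an English title to a list of Japanese titles), the substring-based corpus check, the function names, and the toy data are all assumptions for illustration. The slides do not say how the langlinks were actually stored or matched.

```python
def expand_langlinks(langlinks):
    """Expand {en_title: [ja_title, ...]} into a flat list of (en, ja) pairs."""
    return [(en, ja) for en, ja_titles in langlinks.items() for ja in ja_titles]


def pairs_in_corpus(pairs, sentence_pairs):
    """Keep pairs whose two sides both occur in one aligned sentence pair.

    Matching is a plain substring test, which is a simplification.
    """
    return [
        (en, ja)
        for en, ja in pairs
        if any(en in en_sent and ja in ja_sent for en_sent, ja_sent in sentence_pairs)
    ]


# Toy data (hypothetical, not taken from Wikipedia or NTCIR-7).
langlinks = {"plant": ["植物", "工場"], "kernel": ["カーネル"]}
sentence_pairs = [("the plant was shut down", "工場 は 停止 した")]

pairs = expand_langlinks(langlinks)
found = pairs_in_corpus(pairs, sentence_pairs)
coverage = len(found) / len(pairs)
print(pairs, found, coverage)
```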