Forest-to-String SMT for Asian Language Translation: NAIST at WAT 2014
Graham Neubig
Nara Institute of Science and Technology (NAIST), Japan

Framework: F2S SMT
[Figure: forest-to-string derivation translating 友達 と ご飯 を 食べ た into "ate a meal with my friend" via synchronous rules over VP, PP, N, P, V, and SUF nodes]

Base Model

Translation Model
● Synchronous tree substitution grammar
● 5 composed rules
● Kneser-Ney count smoothing
● Right binarization

Language Model
● KenLM trained 6-gram model

Optimization
● Minimum error rate training
● Tested with BLEU or BLEU+RIBES

Data Preparation

Data Selection
● ja-zh TM: All data (672k)
● ja-en TM: ASPEC first 2M
● LM: All data
● Optional dictionaries: EIJIRO, EDICT, Wiki

Parsing
● Parser: Egret
● en → Penn TB, zh → Penn CTB, ja → Ja. Word Dependency Corpus [Mori+14]
● For Japanese, convert with head rules
[Figure: parse tree of これ は テスト です 。 ("This is a test .")]

Recurrent Neural Network LM [Mikolov+ 10]
● Improves robustness, uses longer context
● 500 hidden nodes, 300 classes
● Trained on first 500k sentences
● 10,000-best reranking
[Figure: RNN LM predicting "I can eat an apple </s>"]

Unknown Splitting (ja-en) [Koehn+ 03]
● Choose the split that maximizes unigram probability:
  P( 球 )P( 内部 ) > P( 球内 )P( 部 ), so これ を 球内部 に (with 球内部 unknown) becomes これ を 球 内部 に
● Test time only

Alignment
● ja-zh: GIZA++
● ja-en: Nile (supervised syntactic aligner), trained on KFTT aligned data
[Figure: word alignment of これ は テスト です 。 with "This is a test ."]

Transliteration/Dictionaries

ja-zh, zh-ja
● Post: Convert Simplified Chinese ↔ Japanese Kanji
  (e.g. 臭気鉴定师 ↔ 臭気鑑定師, イチョウ黄叶 ↔ イチョウ黄葉)

ja-en
● Pre: Normalize 標題 to 表題
● Post: If a word exists in a dictionary (Eijiro, Edict, Wiki language links), translate it
● Post: Romanize remaining hiragana/katakana words (e.g. Japan インテック → Japan Intekku)
● Post: Delete remaining Japanese words

Results
First place in all tasks!
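The test-time unknown-word splitting described above can be sketched as follows. This is a minimal illustration, not the system's code: `unigram_prob` stands in for the real unigram counts, its values and the `floor` backoff are invented, and only two-way splits are tried.

```python
# Sketch of unknown-word splitting [Koehn+ 03]: among the two-way
# splits of an OOV word, keep the one whose parts maximize the
# product of unigram probabilities (vs. leaving the word whole).
# `unigram_prob` and `floor` are hypothetical stand-ins.

def split_unknown(word, unigram_prob, floor=1e-10):
    """Return the best segmentation of `word` as a list of tokens."""
    best, best_score = [word], unigram_prob.get(word, floor)
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        score = unigram_prob.get(left, floor) * unigram_prob.get(right, floor)
        if score > best_score:
            best, best_score = [left, right], score
    return best

# Toy probabilities reproducing the poster's example:
unigram_prob = {"球": 0.002, "内部": 0.004, "球内": 0.0001, "部": 0.003}
print(split_unknown("球内部", unigram_prob))  # ['球', '内部']
```

Since P(球)P(内部) = 8e-6 exceeds P(球内)P(部) = 3e-7 and the unknown whole word backs off to the floor, the split 球 内部 wins, matching the poster's inequality.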
Tokenization
● en: Stanford Tokenizer (+ split "-" and "/")
● zh: Stanford Segmenter
● ja: KyTea
● Unknown splitting example: 農産業 → 農 産業
● Simplified ↔ Japanese Kanji conversion: use the Kanconvit.pm script

Recurrent Neural Network LM
● For ja, interpolated zh-ja and en-ja data

Results

[Figure: bar charts of BLEU and HUMAN evaluation scores, NAIST vs. other systems]
● BLEU gains over other systems: en-ja +3.6, ja-en +2.2, zh-ja +1.8, ja-zh +2.7
● HUMAN gains over other systems: en-ja +28.3, ja-en +13.0, zh-ja +15.0, ja-zh +3.8

RNNLM Helps!

BLEU      en-ja  ja-en  zh-ja  ja-zh
w/o RNN   36.50  23.76  39.82  29.27
w/ RNN    37.21  24.72  40.61  29.78

Tuning w/ RIBES hurts human eval!

          Tune: BLEU          Tune: BLEU+RIBES
Eval      B     R     H       B     R     H
en-ja     37.2  80.2  56.3    37.2  80.7  51.5
zh-ja     41.3  83.5  50.8    40.8  83.8  38.0
ja-zh     30.5  81.8  17.8    29.8  83.0  1.3

Why? Too-short hypotheses.

Scripts available! http://phontron.com/project/wat2014
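The 10,000-best reranking with the RNN LM amounts to log-linear rescoring: add a weighted RNNLM log-probability to each hypothesis's base model score and re-sort. This is a sketch only; the interpolation weight and the toy stand-in LM below are invented, not the system's actual model.

```python
# Sketch of n-best reranking with an RNN LM: each hypothesis keeps
# its base-model score, a weighted RNNLM log-probability is added,
# and the list is re-sorted. Weight and scores are illustrative.

def rerank(nbest, rnnlm_logprob, weight=0.5):
    """nbest: list of (hypothesis, base_score) pairs.
    Returns the list re-sorted by combined score, best first."""
    rescored = [
        (hyp, base + weight * rnnlm_logprob(hyp))
        for hyp, base in nbest
    ]
    return sorted(rescored, key=lambda x: x[1], reverse=True)

# Hypothetical stand-in for the RNN LM: penalizes each word equally.
toy_lm = lambda hyp: -2.0 * len(hyp.split())
nbest = [("this is a test .", -10.0), ("this is test .", -10.5)]
print(rerank(nbest, toy_lm)[0][0])  # this is test .
```

In the real system the reranker would call the trained RNNLM (500 hidden nodes, 300 classes) over the 10,000-best list, with the weight tuned alongside the other features.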