
Forest-to-String SMT for Asian Language Translation: NAIST at WAT 2014
Graham Neubig
Nara Institute of Science and Technology (NAIST), Japan
Framework: F2S SMT

[Figure: forest-to-string derivation of 「友達 と ご飯 を 食べ た」 → “ate a meal with my friend”: tree fragments covering 「友達 と」 (“my friend”) and 「ご飯 を 食べ た」 (“ate a meal”) are combined by a synchronous rule whose target side is “x1 with x0”.]
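The target-side substitution in the figure can be pictured with a minimal sketch (a hypothetical helper, not the system's code): each xN variable in a rule's target string is replaced by the translation of the corresponding source subtree.

```python
def apply_rule(target_side, child_translations):
    """Substitute xN variables in a synchronous rule's target side with
    the translations of the corresponding source subtrees (toy sketch)."""
    out = []
    for tok in target_side.split():
        if tok[:1] == "x" and tok[1:].isdigit():
            out.append(child_translations[int(tok[1:])])
        else:
            out.append(tok)
    return " ".join(out)

# x0 = translation of the 友達-と subtree, x1 = translation of the VP subtree.
print(apply_rule("x1 with x0", ["my friend", "ate a meal"]))
# → ate a meal with my friend
```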
Base Model

Translation Model
● Synchronous tree substitution grammar
● 5 composed rules
● Kneser-Ney count smoothing
● Right binarization

Transliteration/Dictionaries

ja-zh, zh-ja
● Post: Convert Simplified ↔ Japanese Kanji
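The Kneser-Ney count smoothing used for the translation model can be pictured with absolute discounting, its core operation. A minimal sketch with made-up rule counts and a uniform backoff (the real training code and discount are not shown in the poster):

```python
def smoothed_probs(counts, discount=0.75):
    """Absolute discounting (the core of Kneser-Ney count smoothing):
    subtract a fixed discount from every observed count and redistribute
    the freed mass over a backoff distribution (uniform here)."""
    total = sum(counts.values())
    n_types = len(counts)
    backoff = 1.0 / n_types                 # toy uniform backoff
    freed = discount * n_types / total      # probability mass freed by discounting
    return {e: max(c - discount, 0.0) / total + freed * backoff
            for e, c in counts.items()}

# Toy counts of English sides observed for one source rule.
probs = smoothed_probs({"ate a meal": 3, "had a meal": 1})
```

The discount shaves a little probability off frequent rules and moves it toward rare ones, while the distribution still sums to one.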
Language Model
● 6-gram model trained with KenLM
Optimization
● Minimum error rate training
● Tested with BLEU or BLEU+RIBES
Data Preparation

Data Selection
● ja-zh TM: All data (672k)
● ja-en TM: ASPEC first 2M
● LM: All data
● Optional dictionaries: EIJIRO, EDICT, Wiki
Parsing
● Parser: Egret
● en → Penn Treebank, zh → Penn Chinese Treebank,
  ja → Japanese Word Dependency Corpus [Mori+ 14]
● For Japanese, convert w/ head rules

[Figure: parse of the example sentence 「これ は テスト です 。」]
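The dependency-to-constituency step can be pictured with a toy projection of an unlabeled dependency tree into brackets; the real conversion uses head rules to assign constituent labels, which this sketch (and its toy analysis) omits.

```python
from collections import defaultdict

def to_brackets(words, heads):
    """Project an unlabeled dependency tree (heads[i] = parent index,
    -1 = root) into a bracketed structure: each head forms a
    constituent spanning itself and its dependents."""
    children = defaultdict(list)
    for i, h in enumerate(heads):
        children[h].append(i)

    def build(i):
        parts = ([build(c) for c in children[i] if c < i]
                 + [words[i]]
                 + [build(c) for c in children[i] if c > i])
        return "(" + " ".join(parts) + ")" if len(parts) > 1 else words[i]

    return build(heads.index(-1))

# Toy analysis (not the corpus annotation): これ→は, は→です, テスト→です
print(to_brackets(["これ", "は", "テスト", "です"], [1, 3, 3, -1]))
# → ((これ は) テスト です)
```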
Recurrent Neural Network LM [Mikolov+ 10]
● Improves robustness, uses longer context
● 500 hidden nodes, 300 classes
● Trained on first 500k sentences
● 10,000-best reranking
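N-best reranking with the RNNLM amounts to adding a weighted LM score to each hypothesis's base model score and picking the best; the weight and scores below are illustrative, not the tuned values.

```python
def rerank(nbest, rnnlm_score, weight=0.5):
    """Pick the hypothesis maximizing base_score + weight * RNNLM score.
    nbest: list of (hypothesis, base_model_score) pairs."""
    return max(nbest, key=lambda h: h[1] + weight * rnnlm_score(h[0]))

# Toy scores: the RNNLM prefers the fluent hypothesis.
rnnlm = {"ate a meal with my friend": -2.0, "meal ate friend with": -9.0}
best = rerank([("ate a meal with my friend", -10.1),
               ("meal ate friend with", -10.0)], rnnlm.get)
```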
Unknown Splitting (ja-en) [Koehn+ 03]
● Choose the split of an UNK word that maximizes the product of unigram probabilities
● Example input: これ を 球内部 に
Alignment
● ja-zh: GIZA++
● ja-en: Nile (supervised syntactic aligner), trained on KFTT aligned data
● Unknown splitting is applied at test time only; since P( 球 )P( 内部 ) > P( 球内 )P( 部 ), これ を 球内部 に becomes これ を 球 内部 に
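The split search can be sketched as a small dynamic program over substrings (toy code and counts, not the system's implementation):

```python
def best_split(word, unigram_prob):
    """Choose the segmentation of an unknown word that maximizes the
    product of unigram probabilities of its parts."""
    n = len(word)
    best = [1.0] + [0.0] * n   # best[i]: best probability for word[:i]
    back = [0] * (n + 1)       # back[i]: start of the last piece in best[i]
    for i in range(1, n + 1):
        for j in range(i):
            p = best[j] * unigram_prob(word[j:i])
            if p > best[i]:
                best[i], back[i] = p, j
    parts, i = [], n
    while i > 0:               # recover the pieces from the back-pointers
        parts.append(word[back[i]:i])
        i = back[i]
    return list(reversed(parts))

# Toy counts: 球 and 内部 are frequent, 球内 and 部 are rare.
counts = {"球": 5, "内部": 5, "球内": 1, "部": 1}
total = sum(counts.values())
prob = lambda w: counts.get(w, 0) / total or 1e-9  # tiny floor for unseen pieces
print(best_split("球内部", prob))
# → ['球', '内部']
```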
Kanji conversion examples: 臭気鑑定師 ↔ 臭気鉴定师, イチョウ黄葉 ↔ イチョウ黄叶
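Simplified-to-Japanese conversion is essentially a character table lookup; a minimal sketch with a tiny hypothetical table (the system uses the Kanconvit.pm script, not this code):

```python
# Tiny illustrative character table, covering only the poster's examples.
S2J = str.maketrans({"鉴": "鑑", "师": "師", "叶": "葉"})

def simplified_to_japanese(text):
    """Map Simplified Chinese characters to their Japanese Kanji variants."""
    return text.translate(S2J)

print(simplified_to_japanese("臭気鉴定师"))
# → 臭気鑑定師
```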
ja-en
● Pre: Normalize 標題 to 表題
● Post: If a word exists in a dictionary (Eijiro, Edict, Wiki language links), translate it
● Post: Romanize Hiragana/Katakana words, e.g. “Japan インテック” → “Japan Intekku”
● Post: Delete remaining Japanese words
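The romanization step can be sketched with a tiny kana table; the table below covers only the poster's example and the sokuon rule, and is an illustrative assumption, not the system's transliterator.

```python
KATAKANA = {"イ": "i", "ン": "n", "テ": "te", "ク": "ku"}  # tiny illustrative table

def romanize(word):
    """Romanize a Katakana word; the sokuon 'ッ' doubles the next consonant."""
    out, geminate = [], False
    for ch in word:
        if ch == "ッ":
            geminate = True
            continue
        roman = KATAKANA.get(ch, ch)
        if geminate:
            if roman[:1] not in "aiueo":
                roman = roman[0] + roman   # double the consonant: テック → tekku
            geminate = False
        out.append(roman)
    return "".join(out).capitalize()

print(romanize("インテック"))
# → Intekku
```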
Results

First place in all tasks!

(Splitting example: 農産業 → 農 産業)
Tokenization
● en: Stanford Tokenizer (+ split “-” and “/”)
● zh: Stanford Segmenter
● ja: KyTea, e.g. これ は テスト です 。 (“This is a test .”)
● ja-zh/zh-ja kanji conversion uses the Kanconvit.pm script
Recurrent Neural Network LM
● For ja, interpolated zh-ja and en-ja data
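Interpolating models trained on the zh-ja and en-ja data amounts to a weighted mixture of their probabilities; the weight and toy distributions below are illustrative, not the tuned values.

```python
def interpolate(p1, p2, lam=0.6):
    """Linear interpolation of two LMs: p(w) = lam*p1(w) + (1-lam)*p2(w),
    taken over the union of their vocabularies."""
    return {w: lam * p1.get(w, 0.0) + (1 - lam) * p2.get(w, 0.0)
            for w in set(p1) | set(p2)}

mixed = interpolate({"です": 0.6, "ます": 0.4}, {"です": 0.8, "だ": 0.2})
```

Because each input distribution sums to one, the mixture does too.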
[Figure: BLEU by task, NAIST vs. other systems, with gains of +3.6 (en-ja), +2.2 (ja-en), +1.8 (zh-ja), +2.7 (ja-zh)]

[Figure: HUMAN evaluation, NAIST vs. Other, with gains of +28.3 (en-ja), +13.0 (ja-en), +15.0 (zh-ja), +3.8 (ja-zh)]
RNNLM Helps!

BLEU      en-ja   ja-en   zh-ja   ja-zh
w/o RNN   36.50   23.76   39.82   29.27
w/ RNN    37.21   24.72   40.61   29.78
Tuning w/ RIBES hurts human eval!

             Tune: BLEU           Tune: BLEU+RIBES
          B      R      H       B      R      H
en-ja    37.2   80.2   56.3    37.2   80.7   51.5
zh-ja    41.3   83.5   50.8    40.8   83.8   38.0
ja-zh    30.5   81.8   17.8    29.8   83.0    1.3
(B = BLEU, R = RIBES, H = human evaluation)

Why? Too-short hypotheses.
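One way to see the length effect: BLEU multiplies its precision score by a brevity penalty that decays sharply for short output, while RIBES's length penalty is (by default) much weaker, so tuning toward RIBES lets hypotheses shrink. A sketch of BLEU's penalty:

```python
import math

def brevity_penalty(hyp_len, ref_len):
    """BLEU's brevity penalty: 1 if the hypothesis is at least as long
    as the reference, exp(1 - r/c) otherwise."""
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)

# A hypothesis 20% shorter than the reference already loses ~22% of its score.
print(brevity_penalty(8, 10))
```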
Scripts available!
http://phontron.com/project/wat2014