
Language & Knowledge Engineering Lab
Example-based Machine Translation Pursuing Fully Structural NLP
Kurohashi-lab M1
56430 Toshiaki Nakazawa
Outline
I. History of Machine Translation
II. Introduction of recent MT systems
  i. Statistical Machine Translation (SMT)
  ii. Example-based Machine Translation (EBMT)
III. Related work for EBMT
  i. Logical Form
  ii. Efficient retrieval method
IV. EBMT pursuing fully structural NLP
V. Conclusion
History of Machine Translation
[Timeline figure]
– Beginning of Machine Translation [Warren Weaver, 1947]: When I look at an article in Russian, I say: "This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode."
– Doldrums of MT: MT quality did not improve despite much money being spent
– "Machine Translation based on analogy" is proposed [Nagao, 1981]
– "Mu project" started
– MT quality improved because of the development of NLP
– SMT became active [Brown et al., 1993]
– Not enough quality yet…
Statistical Machine Translation (SMT)

Learn models for translation from a parallel corpus statistically
Does not use any linguistic resources
Small translation unit (= “word”)
Requires a large parallel corpus for highly accurate translation

Parallel Corpus (example)
Japanese:
田植えフェスティバル
石川県輪島市で外国の大使や一般の参加者など千人あまりが急な斜面の棚田で田植えを体験する催しが行われました。
輪島市白米町(しろよねまち)には千枚田(せんまいだ)と呼ばれる大小二千百枚の棚田が急な斜面から海に向かって拡がっています。
田植え体験は農作業を通して米作りの意義などを考えていこうという地球環境平和財団の呼び掛けで開かれたもので、海外三十四ヵ国の大使や書記官、それに一般の参加者ら合わせておよそ千人が集まりました。
田植えに使われた苗は去年の秋、天皇陛下が皇居で収穫された稲籾から育てたものです。
参加者たちは裸足になって水田に足を踏み入れ地元に伝わる田植え歌に合わせて慣れない手つきで苗を植えていました。
きょうの輪島市は雲が広がったもののまずまずの天気となり、出席された高円宮さまも海からの風に吹かれながら田植えに加わっていました。
地球環境平和財団では今年の夏休みに全国の子どもたちを対象に草刈りや生きものの観察会を開く他、秋には稲刈体験を行なう予定にしています。
English:
Ambassadors and diplomats from 37 countries took part in a rice planting festival on Sunday in small paddies on steep hillsides in Wajima, central Japan.
About one-thousand people gathered at the hill, where some two-thousand 100 miniature paddies, called Senmaida, stretch toward the Sea of Japan.
The event was organized by the private Foundation for Global Peace and Environment.
The rice seedlings are grown from grain harvested by the Emperor at the Imperial Palace in Tokyo last autumn.
Barefoot participants waded into the paddies to plant the seedlings by hand while singing a local folk song about the practice of rice planting.
Basic Method for SMT

Translate by maximizing the probability:
$$\hat{E} = \arg\max_{E} P(E \mid J) = \arg\max_{E} P(E)\, P(J \mid E)$$
P(E): Language Model
P(J | E): Translation Model
Learn from a parallel corpus
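The maximization above can be illustrated with a minimal Python sketch. It assumes a fixed Japanese input J and a small hand-written candidate list. The candidate sentences, the function name `decode`, and all probability values are made up for illustration and do not come from any trained model.

```python
# Toy noisy-channel decoder: choose the English candidate E that
# maximizes P(E) * P(J | E) for one fixed Japanese input J.
# All candidates and probabilities below are invented for illustration.

def decode(candidates, language_model, translation_model):
    """Return the candidate with the highest P(E) * P(J | E)."""
    return max(
        candidates,
        key=lambda e: language_model[e] * translation_model[e],
    )

candidates = [
    "the light was green",
    "green was the light",
    "the signal was blue",
]

# P(E): how fluent each candidate is as English (hypothetical values).
language_model = {
    "the light was green": 0.05,
    "green was the light": 0.001,
    "the signal was blue": 0.02,
}

# P(J | E): how well each candidate explains the input J (hypothetical values).
translation_model = {
    "the light was green": 0.3,
    "green was the light": 0.3,
    "the signal was blue": 0.4,
}

best = decode(candidates, language_model, translation_model)
print(best)
```

In this toy setting the translation model alone would prefer the literal "the signal was blue", and the language model shifts the choice to the more fluent candidate. A real decoder searches a far larger candidate space instead of scoring a fixed list.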
Translation Model

IBM Model 4 [Brown et al., 93]
Translation Model =
  Probability of translation from one E word to one J word
  × Model for word order
  × # of Japanese words which each English word generates
  × Model for generating NULL to justify the # of words
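The product of four components can be sketched in Python as follows. This is only a simplified illustration of multiplying the factors for one aligned sentence pair. It is not the actual parameterization of Model 4 in [Brown et al., 93], which conditions each factor on more context. The function name and all numbers are hypothetical.

```python
import math

def translation_model_score(lexical, word_order, fertility, null_generation):
    """Multiply the four component probabilities for one aligned sentence pair.

    lexical         -- P(J word | E word) for each aligned word pair
    word_order      -- probability of each word's position
    fertility       -- P(# of J words | E word) for each E word
    null_generation -- probability of the J words generated from NULL
    """
    return (
        math.prod(lexical)
        * math.prod(word_order)
        * math.prod(fertility)
        * null_generation
    )

# Hypothetical values for a two-word English sentence.
score = translation_model_score(
    lexical=[0.5, 0.4],
    word_order=[0.5, 0.5],
    fertility=[0.8, 0.5],
    null_generation=0.5,
)
print(score)
```

Practical implementations usually sum log probabilities instead of multiplying raw values, to avoid numerical underflow on long sentences.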
Overview of EBMT
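[Figure: Parallel Corpus → Alignment → TMDB; Input → Translation → Output. Translation draws on the TMDB, and the whole pipeline is supported by advanced NLP technologies. Example TMDB entry: 交差点で、 ↔ at the intersection]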
Example-based Machine Translation (EBMT)

Divide the input sentence into a few parts
Find similar expressions (= examples, TMs) in the parallel corpus for each part
Combine the examples to generate the output translation
Use linguistic resources as much as possible
A larger translation unit (larger example) is better
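The divide, find, and combine steps above can be sketched in Python under strong simplifying assumptions. The sketch uses exact matching instead of similarity, applies no reordering, and relies on a tiny hand-written example table. The function name `translate` and the example entries are hypothetical, and this is not the system described in these slides.

```python
def translate(words, examples):
    """Cover the input left to right, always preferring the longest example."""
    output = []
    i = 0
    while i < len(words):
        # Try the longest remaining span first: larger examples are better.
        for j in range(len(words), i, -1):
            source = tuple(words[i:j])
            if source in examples:
                output.append(examples[source])
                i = j
                break
        else:
            # No example covers this word: pass it through unchanged.
            output.append(words[i])
            i += 1
    return " ".join(output)

# Hypothetical translation examples (source word tuple -> target string).
examples = {
    ("私", "の"): "my",
    ("信号",): "signal",
    ("信号", "は"): "traffic light",
    ("青", "でした"): "was green",
}

result = translate(["私", "の", "信号", "は", "青", "でした"], examples)
print(result)
```

The two-word example for 信号 は is chosen over the one-word example for 信号, which mirrors the preference for larger translation units. Combining examples in source order only works here because this toy input needs no reordering.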
Flow of EBMT
Furthermore...

Translation algorithm is implicit in EBMT
→ Probabilistic Model for EBMT [Aramaki et al., 05]
Recently, the number of studies handling larger translation units is increasing
Difference between SMT and EBMT is becoming smaller
Most active study = Phrase-based SMT
SMT and EBMT will be merged (?)
Alignment method using Logical Form

Logical Form [Arul et al., 01]
– Represents the relations among the content words of a sentence as an unordered graph
  Nodes are content words
  Branches indicate underlying semantic relations
– Abstracts away from language-particular aspects of a sentence
  Ex. word order, inflectional morphology, function words
[Figure: Logical Forms of a Spanish sentence and its English counterpart, "Under Hyperlink Information, click the hyperlink address"]
Efficient Retrieval Method [Doi et al., 04]

Similarity between input and examples is calculated by word-based Edit Distance
Finding suitable examples from a large parallel corpus takes a long time
This problem is addressed by
– Classifying sentences into groups according to the # of content words and function words
– Compressing all sentences in a group into a “directed word graph”
– Searching for the best example in a group with the A* algorithm
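The word-based edit distance mentioned above can be sketched in Python, together with the slow exhaustive search that the grouping, word graph, and A* search are meant to avoid. The sketch assumes unit costs for insertion, deletion, and substitution, which may differ from the weighting used in [Doi et al., 04]. The function names and example sentences are hypothetical.

```python
def word_edit_distance(a, b):
    """Levenshtein distance between two word lists, with unit costs."""
    previous = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        current = [i]
        for j, word_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (word_a != word_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]

def retrieve(input_words, examples):
    """Linear scan over all examples: simple, but slow for a large corpus."""
    return min(examples, key=lambda ex: word_edit_distance(input_words, ex))

examples = [
    "the light was red".split(),
    "my house is green".split(),
    "at the intersection".split(),
]
query = "the light was green".split()
closest = retrieve(query, examples)
print(" ".join(closest))
```

Each query costs one dynamic-programming table per stored example, so retrieval time grows linearly with corpus size. That growth is the motivation for restricting the search to a group and sharing work across sentences in a word graph.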
Why EBMT?

Pursuing structural NLP
– Improvement of basic analyses leads to improvement of MT as an application of basic analyses
– Feedback from the application (MT) can be expected

Adequacy of problem settings
– Not a large corpus, but similar examples in a relatively close domain
  Ex. translating a new version of an instruction manual, a related patent document ...
Overview of EBMT
[Figure: Parallel Corpus → Alignment → TMDB; Input → Translation → Output (EBMT), supported by advanced NLP technologies]
Alignment
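Japanese: 交差点で、突然あの車が飛び出して来たのです。
English: The car came at me from the side at the intersection.
[Figure: dependency structures of both sentences. Japanese nodes: 交差 / 点で、 / 突然 / あの / 車が / 飛び出して 来た のです 。 English nodes: the car / came / at me / from the side / at the intersection]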
1. Transform into dependency structure
2. Word-based alignment using bilingual lexicon
3. Extend the correspondence of phrases
4. Extract Translation Examples
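Step 2 of the procedure above, word-based alignment using a bilingual lexicon, can be sketched in Python as a plain lexicon lookup. The sketch ignores the dependency structure from step 1 and the phrase extension in step 3. The function name, the word segmentation, and the lexicon entries are hypothetical.

```python
def align_with_lexicon(source_words, target_words, lexicon):
    """Return (source_index, target_index) links licensed by the lexicon."""
    links = []
    used_targets = set()
    for i, source in enumerate(source_words):
        for j, target in enumerate(target_words):
            # Link each source word to the first unused target word
            # that the lexicon lists as one of its translations.
            if j not in used_targets and target in lexicon.get(source, ()):
                links.append((i, j))
                used_targets.add(j)
                break
    return links

# Hypothetical bilingual lexicon (Japanese word -> set of English words).
lexicon = {
    "交差点": {"intersection", "crossing"},
    "車": {"car"},
    "来た": {"came"},
}

source_words = ["交差点", "突然", "車", "来た"]
target_words = ["the", "car", "came", "at", "the", "intersection"]
links = align_with_lexicon(source_words, target_words, lexicon)
print(links)
```

突然 has no lexicon entry and stays unaligned. Words like this are what the phrase extension in step 3 would have to attach to neighbouring correspondences.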
Translation
Input: 交差点に入る時私の信号は青でした。
  交差 (cross) / 点に (point) / 入る (enter) / 時 (when) / 私 の (my) / 信号 は (signal) / 青 (blue) / でした 。 (was)
Translation Examples
– 交差 (cross) / 点 で 、 (point) / 突然 (suddenly) / 飛び出して 来た のです 。 (rush out) : came / at me / from the side / at the intersection
– 家に (house) / 入る (enter) / 時 (when) / 脱ぐ (put off) : to remove / when / entering / a house
– 私 の (my) / サイン (signal) : my / signature
– 信号 は (signal) / 青 (blue) / でした 。 (was) : The light / traffic / was green
Combined translation: my / traffic / The light / was green / when / entering / the intersection
Language Model
Output: My traffic light was green when entering the intersection.
IWSLT2005

IWSLT
– International Workshop on Spoken Language Translation
– Aims at translating ASR (Automatic Speech Recognition) output

Outline of campaign
– Training set: parallel corpus of 20K sentences
– Development set: two sets of 500 and 506 sentences
– Test set: manual transcription and ASR output (500 sentences each)
Evaluation Results
Manual Transcription (Supplied & Tools)

Name        BLEU      Name        NIST
ATR-C3      0.4774    ATR-C3      8.1720
MICROSOFT   0.4057    MICROSOFT   8.0375
ATR-SLR     0.3884    TUV         7.8472
TUV         0.3718    NGKUT       7.7158
NGKUT       0.3418    ATR-SLR     4.3928
USC         0.2741    USC         2.9648
Conclusion

In this presentation …
– History of Machine Translation
– SMT and EBMT
– Two related works for EBMT
– Introduction of our EBMT system

Future work
– Improve our EBMT system
  Resolve paraphrase problem
  Apply anaphora resolution