Structural Phrase Alignment
Based on Consistency Criteria
Core Steps of Alignment
Flow of Our EBMT System
• Searching Correspondence Candidates
Translation Examples
Input: 交差点に入る時私の信号は青でした。
交差 (cross)
交差
(cross)
点に
(point)
at me
my
突然 (suddenly)
traffic
at the intersection
The light
家に
(house)
to remove
入る
(enter)
時 (when)
二百十六万 → 2,160,000 ← 2.16 million
entering
a house
私 の (my)
• Numeral normalization
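The numeral normalization example on this poster (二百十六万 → 2,160,000 ← 2.16 million) can be illustrated with a minimal sketch. The function names and the small unit tables below are assumptions made for illustration, not the authors' implementation; the idea shown is only that both sides are mapped to the same integer before matching.

```python
from decimal import Decimal

# Hypothetical lookup tables; only common units are covered.
KANJI_DIGITS = {"〇": 0, "一": 1, "二": 2, "三": 3, "四": 4,
                "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
LARGE_UNITS = {"万": 10**4, "億": 10**8, "兆": 10**12}
ENGLISH_SCALES = {"thousand": 10**3, "million": 10**6, "billion": 10**9}


def kanji_to_int(text: str) -> int:
    """Convert a kanji numeral such as 二百十六万 to an integer."""
    total = 0    # value of completed 万/億/兆 groups
    section = 0  # value accumulated inside the current group
    digit = 0    # digit(s) waiting for a unit
    for ch in text:
        if ch in KANJI_DIGITS:
            digit = digit * 10 + KANJI_DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A bare unit (e.g. 十) counts as one of that unit.
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch in LARGE_UNITS:
            total += (section + digit) * LARGE_UNITS[ch]
            section = 0
            digit = 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + section + digit


def english_to_int(text: str) -> int:
    """Convert strings such as '2.16 million' or '2,160,000' to an integer."""
    parts = text.lower().replace(",", "").split()
    value = Decimal(parts[0])  # Decimal avoids binary floating-point error
    scale = ENGLISH_SCALES[parts[1]] if len(parts) > 1 else 1
    return int(value * scale)


def normalized_form(n: int) -> str:
    """Render the shared normalized form with thousands separators."""
    return f"{n:,}"
```

With this sketch, `kanji_to_int("二百十六万")` and `english_to_int("2.16 million")` both yield the integer that `normalized_form` renders as `2,160,000`, so the two expressions become a correspondence candidate.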
when
entering
私 の (my)
ローズワイン → rosuwain ⇔ rose wine (similarity:0.78)
新宿 → shinjuku ⇔ shinjuku (similarity:1.0)
was green
when
脱ぐ (put off)
信号 は
(signal)
青
(blue)
でした 。
(was)
• Bilingual dictionaries
• Transliteration (Katakana words, NEs)
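The transliteration examples (rosuwain ⇔ rose wine, shinjuku ⇔ shinjuku) compare a romanized Japanese string with an English candidate. The poster does not say how the similarity is computed, so the sketch below swaps in a normalized edit-distance similarity as an assumption; it will not reproduce the 0.78 shown for ローズワイン, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(romanized: str, english: str) -> float:
    """Assumed measure: 1 - distance / longer length, ignoring case and spaces."""
    a = romanized.lower().replace(" ", "")
    b = english.lower().replace(" ", "")
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))
```

Identical strings score 1.0, as in the 新宿 example; near matches such as "rosuwain" and "rose wine" score below 1.0 and could be kept or dropped by a threshold, which the poster does not specify.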
from the side
飛び出して 来た のです 。
(rush out)
入る
(enter)
時
(when)
– Fine-grained alignment is effective for translation
– Search for as many candidates as possible using a variety of linguistic information
came
点 で 、(point)
Toshiaki Nakazawa, Kun Yu, Sadao Kurohashi
(Graduate School of Informatics, Kyoto University)
{nakazawa, kunyu}@nlp.kuee.kyoto-u.ac.jp
[email protected]
• Japanese flexible matching (Odani et al. 2007)
• Substring co-occurrence measure (Cromieres 2006)
the intersection
my
サイン (signal)
signature
信号 は
(signal)
青
(blue)
でした 。
(was)
Language Models
traffic
Output: My traffic light was green when entering the intersection.
The light
was green
• Selecting Correspondence Candidates
– More candidates lead to more ambiguities and improper alignments
– A robust alignment method is needed that can align parallel sentences consistently by selecting an adequate set of candidates
Selecting Correspondence Candidates
Using Consistency Score and Dependency Type
Ambiguities!
日本 で
$cs(d_J, d_E) = \frac{1}{d_J} + \frac{1}{d_E}$
you
(in Japan)
保険
Near!
will have to file
(insurance)
会社 に 対して
insurance
Far!
(to company)
保険
Far!
an claim
(insurance)
請求 の
1/1+1/2=1.5
insurance
(claim)
申し立て が
baseline
with the office
Near!
(instance)
可能ですよ
(you can)
Improper
alignments!
in Japan
How to reflect the inconsistency?
Japanese
$\operatorname*{arg\,max}_{\text{alignment}} \sum_{i} \sum_{j} cs\bigl(d_J(a_i, a_j),\, d_E(a_i, a_j)\bigr)$
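The arg max selection can be sketched as follows: each candidate alignment is a set of links, every pair of links is scored from its J-side and E-side distances, and the set with the highest total is kept. The baseline score 1/d_J + 1/d_E is taken from the poster's worked value 1/1+1/2=1.5. The link representation and the toy distance functions are assumptions for illustration only, since the poster measures distance on dependency trees.

```python
from itertools import combinations


def baseline_cs(d_j: float, d_e: float) -> float:
    """Baseline consistency score: 1/d_J + 1/d_E."""
    return 1.0 / d_j + 1.0 / d_e


def alignment_score(links, dist_j, dist_e, cs=baseline_cs) -> float:
    """Sum the consistency score over all unordered pairs of links.

    Summing unordered pairs differs from a double sum over i and j only by
    a constant factor, so the arg max is unchanged.
    """
    return sum(cs(dist_j(a_i, a_j), dist_e(a_i, a_j))
               for a_i, a_j in combinations(links, 2))


def best_alignment(candidates, dist_j, dist_e, cs=baseline_cs):
    """Return the candidate set of links with the highest total score."""
    return max(candidates,
               key=lambda links: alignment_score(links, dist_j, dist_e, cs))


# Toy setup (assumption): a link is (j_index, e_index) and distance is the
# index difference; a real system would use dependency-tree distance.
def toy_dist_j(a, b):
    return abs(a[0] - b[0])


def toy_dist_e(a, b):
    return abs(a[1] - b[1])


consistent = [(0, 0), (1, 1), (2, 2)]
inconsistent = [(0, 0), (1, 5), (2, 2)]
chosen = best_alignment([consistent, inconsistent], toy_dist_j, toy_dist_e)
```

In this toy case `chosen` is the `consistent` set, because its link pairs are near each other on both sides. Passing a different `cs` function lets the same search use a score that penalizes "Near-Far" pairs instead of the baseline.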
J-Side Distance E-Side Distance
Consistency Score
predicate: level C
6
S / SBAR / SQ …
5
predicate: level B+/B
5
VP / WHADVP
4
predicate: level B-/A
4
WHADJP
case no / rentai
2
Inside clause
1
ADVP / ADJP
NP / PP / INTJ
Others
3
QP / PRT / PRN
predicate: level A-
Frequency
(log)
3
Others
1
Dependency Type Distance
3
NP you
3
デ格 日本 で
1
文節内
Dist of J-Side
Distribution of the distances of alignment pairs in hand-annotated data (Mainichi newspaper, 40K sentence pairs) [Uchimoto04]
保険
Score
[renyou]
1
[inside clause] 文節内
2
ノ格
will have to file
1
NN
E-Side
Distance
[case “ga”] (instance)
可能です よ
J-Side Distance
Experimental Results
1
NN
(claim)
Pair 2: (Ds, Dt) = (1, 7) → Negative Score
(you can)
insurance
3
PP with the office
3
PP in Japan
Quality of Other Language Pairs
500 test sentences from Mainichi newspaper parallel corpus
Bilingual dictionary: KENKYUSYA J-E/E-J, 500K entries
Evaluation criteria: Precision / Recall / F-measure
Character-based for Japanese, word-based for English
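The evaluation criteria can be illustrated with a short sketch that computes precision, recall, and F-measure over sets of alignment links. Representing a link as a (Japanese unit, English unit) index pair is an assumption; per the setting above, the Japanese unit would be a character and the English unit a word.

```python
def precision_recall_f(system_links, gold_links):
    """Precision, recall and F-measure of system links against gold links."""
    system, gold = set(system_links), set(gold_links)
    correct = len(system & gold)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    f_measure = harmonic_mean(precision, recall)
    return precision, recall, f_measure


def harmonic_mean(precision: float, recall: float) -> float:
    """F-measure as the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy links (assumption): (japanese_char_index, english_word_index)
system = {(0, 0), (1, 1), (2, 3)}
gold = {(0, 0), (1, 1), (2, 2), (3, 3)}
p, r, f = precision_recall_f(system, gold)
```

As a consistency check, `harmonic_mean(77.47, 64.32)` gives about 70.29, which matches the F value reported for the Baseline.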
Method                       Pre     Rec     F
Baseline                     77.47   64.32   70.29
+Consistency Score           80.30   66.90   72.99
Proposed (+CS, +DpndType)    80.77   69.14   74.51
Filtering (80%)              82.48   71.31   76.49
Moses (SMT Toolkit)*         60.19   33.15   42.75
Manual (upper bound)         95.58   89.80   92.60
* Using 300K newspaper domain bi-sentences for training

[Figure: dependency trees of the example sentence pair, annotated with dependency type distances. Japanese side: 保険 (insurance) [inside clause]; 請求 の [case “no”]; 3 ガ格 申し立て が; 3 連用 会社 に 対して (to company); (in Japan) [case “de”]. English side: insurance; 3 NP an claim]

Consistency Score Function
[Figure: plot of the consistency score function; axis label: Dist of E-Side]
Pair 1: (Ds, Dt) = (1, 1) → Positive Score
“Near-Near” pair → Positive Score
“Far-Far” pair → 0
“Near-Far” pair → Negative Score
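The three rules of the consistency score function ("Near-Near" positive, "Far-Far" zero, "Near-Far" negative) can be sketched as a piecewise function. The poster gives only the signs, so the near/far threshold and the magnitudes below are hypothetical values chosen for illustration; the actual function is a surface over both distances whose parameters are not recoverable here.

```python
NEAR_MAX = 2  # hypothetical threshold: distances up to this count as "near"


def is_near(distance: int) -> bool:
    """Assumed near/far split; the poster does not state a threshold."""
    return distance <= NEAR_MAX


def consistency_score(d_j: int, d_e: int,
                      reward: float = 1.0, penalty: float = 1.0) -> float:
    """Piecewise sketch of the consistency score for one pair of links.

    reward and penalty are placeholder magnitudes (assumptions).
    """
    near_j, near_e = is_near(d_j), is_near(d_e)
    if near_j and near_e:
        return reward        # "Near-Near" pair: positive score
    if not near_j and not near_e:
        return 0.0           # "Far-Far" pair: no evidence either way
    return -penalty          # "Near-Far" pair: inconsistent, penalized
```

Under these assumed settings, Pair 1 with (Ds, Dt) = (1, 1) receives a positive score and Pair 2 with (Ds, Dt) = (1, 7) receives a negative one, matching the signs shown on the poster.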
Alignment error rate (AER):
HLT-NAACL 2003 / ACL 2005 / (Gildea, 2003) / GIZA++
English-French: 5.71, 15.89
English-Romanian: 28.86, 26.55, 27.19
English-Korean: 32, 35
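For readers comparing these figures, AER is usually computed with the Och and Ney definition from sure (S) and possible (P) gold links. The sketch below shows that standard formula; that the cited results all use exactly this variant is an assumption, and the link representation is illustrative.

```python
def alignment_error_rate(predicted, sure, possible) -> float:
    """AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), with S a subset of P."""
    predicted, sure = set(predicted), set(sure)
    possible = set(possible) | sure  # enforce S ⊆ P
    denominator = len(predicted) + len(sure)
    if denominator == 0:
        return 0.0
    matched = len(predicted & sure) + len(predicted & possible)
    return 1.0 - matched / denominator


# Toy links (assumption): (source_index, target_index)
predicted_links = {(0, 0), (1, 1), (2, 3)}
sure_links = {(0, 0), (1, 1)}
possible_links = {(0, 0), (1, 1), (2, 2)}
aer = alignment_error_rate(predicted_links, sure_links, possible_links)
```

Lower is better: a system that proposes exactly the sure links scores 0. AER is an error rate, so it is not directly comparable in direction to the F-measure reported for Japanese-English.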
Conclusion
• Proposed a new phrase alignment method using consistency criteria.
• Achieved sufficient alignment accuracy compared with other language pairs.
• We need to acquire the parameters automatically by machine learning.
• We plan to extend the framework so that it revises the parse result.
(A translation demo by NICT in the exhibition corner uses our system!)