
Japanese-Chinese Phrase Alignment Using Common Chinese Characters Information
Chenhui Chu, Toshiaki Nakazawa and Sadao Kurohashi
Graduate School of Informatics, Kyoto University
Introduction
Alignment Model
• Common Chinese characters information may be valuable in word/phrase alignment between Japanese and Chinese
– Chinese characters are used both in Japanese (Kanji) and Chinese (Hanzi)
– There exist common Chinese characters between Kanji and Hanzi
– Parallel sentences contain equivalent meanings in each language, and we can assume common Chinese characters appear in the sentences
– Category 1: identical to Simplified Chinese
– Category 2: identical to Traditional Chinese but different from Simplified Chinese
– Category 3: visual variations

Meaning | Kanji | Traditional Chinese | Simplified Chinese
snow country | 雪国 | 雪國 | 雪国
love | 愛 | 愛 | 爱
begin | 発 | 發 | 发
hair | 髪 | 髪 | 发
Common Chinese Characters Detection
Simplified Chinese: 说 钱 干 故 仿 ・・・
Traditional Chinese: 說 錢 干,幹,乾 故 仿,彷,倣 ・・・

Kanji: 説 発 銭 検 経 焼 ・・・
Traditional Chinese: 說 發 錢 檢 經 燒 ・・・
• Example of common Chinese characters detection
Kanji: 開発 (develop)
– Category 3 Kanji: 発, converted to the Category 2 Kanji 發 (発→發 ・・・)
– Category 2 Kanji: 開 and 發, converted to Simplified Chinese (開→开, 發→发 ・・・)
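The two-step conversion in this example can be sketched in Python as follows. The lookup tables are tiny hypothetical stand-ins that contain only characters shown on this poster; a real system would load full tables from resources such as the Unihan database. The function names are illustrative assumptions, not names from the authors' implementation.

```python
# Hypothetical toy tables; only characters that appear on this poster.
# Category 3: Kanji that are visual variants of a Traditional Chinese character.
KANJI_TO_TRADITIONAL = {
    "発": "發", "説": "說", "銭": "錢", "検": "檢", "経": "經", "焼": "燒",
}
# Category 2: Traditional Chinese characters that differ from Simplified Chinese.
TRADITIONAL_TO_SIMPLIFIED = {
    "開": "开", "發": "发", "說": "说", "錢": "钱", "愛": "爱", "國": "国",
}


def kanji_to_simplified(text: str) -> str:
    """Convert each character to its Simplified Chinese form where a mapping is known."""
    out = []
    for ch in text:
        ch = KANJI_TO_TRADITIONAL.get(ch, ch)        # Category 3 -> Category 2 form
        ch = TRADITIONAL_TO_SIMPLIFIED.get(ch, ch)   # Category 2 -> Simplified Chinese
        out.append(ch)                               # Category 1 and other characters pass through
    return "".join(out)


def common_characters(ja: str, zh: str) -> set:
    """Characters shared by the converted Japanese string and a Simplified Chinese string."""
    return set(kanji_to_simplified(ja)) & set(zh)
```

With these toy tables, `kanji_to_simplified("開発")` yields `开发`. The sketch does no filtering, so shared punctuation or digits would also be reported as common characters.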
Unihan database
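$$\theta_T(\langle e, f \rangle) = p \, \theta_N(\langle e, f \rangle) + (1 - p) \, \theta_A(\langle e, f \rangle) \tag{2}$$
$$\theta_A(\langle e, f \rangle) \sim DP(M_A, \alpha_A) \tag{3}$$
$$M_A(\langle e, f \rangle) = \left[ P_f(f) \, P_{WA}(e \mid f) \cdot P_e(e) \, P_{WA}(f \mid e) \right]^{\frac{1}{2}} \tag{4}$$
$$P_{WA}(e \mid f) = \prod_{j=1}^{l_e} \frac{1}{l_f} \sum_{i=1}^{l_f} t(e_j \mid f_i) \tag{5}$$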
• Common Chinese characters information incorporation
– Base distribution adjustment
$$t'(e_j \mid f_i) = t(e_j \mid f_i) \cdot w \tag{6}$$
$$t(e_j \mid f_i) = \frac{t'(e_j \mid f_i)}{\sum_{i=1}^{n} t'(e_j \mid f_i)} \tag{7}$$
– Model modification
Pe, f , D  P PD | e, f   w T e, f  
(8)
e, f
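A minimal Python sketch of the modified joint probability, computed in log space, is given below. The poster does not say which pairs receive the weight; this sketch assumes w applies only to phrase pairs that share common Chinese characters and is 1 otherwise. All argument names are hypothetical, and the length and dependency terms are taken as precomputed log probabilities.

```python
import math


def log_joint(log_p_len, log_p_dep, pairs, theta_t, shares_common, w):
    """log P = log P(l) + log P(D | pairs) + sum over pairs of log(w * theta_T(pair)).

    pairs: iterable of (e, f) phrase pairs.
    theta_t: dict mapping (e, f) -> phrase translation probability.
    shares_common: hypothetical predicate (e, f) -> bool.
    """
    total = log_p_len + log_p_dep
    for e, f in pairs:
        # Assumption: the weight is 1 for pairs without common characters.
        weight = w if shares_common(e, f) else 1.0
        total += math.log(weight * theta_t[(e, f)])
    return total
```

Working in log space avoids underflow when many phrase pairs are multiplied together.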
• Japanese-Chinese corpus we used
Item | Ja | Zh
# of sentences | 680k
# of words | 21.8M | 18.2M
# of Chinese characters | 14.0M | 24.2M
# of Chinese characters (+Kana-Kanji) | 14.6M | 24.2M
average sentence length | 32.9 | 22.7
• Coverage of common Chinese characters detection
[Bar chart: coverage (0% to 80%) for Char(Ja), Char(Zh), Word(Ja), Word(Zh); series: Category 1, +Category 2, +Category 3, +Kana-Kanji]
• Alignment
Simplified Chinese: 开发
• We also perform Kana-Kanji conversion for common Chinese characters detection
Meaning | Kana | Kanji | Traditional Chinese | Simplified Chinese
envious | うらやましい | 羨ましい | 羡慕 | 羡慕
Experiments
• To detect common Chinese characters between Japanese and Simplified Chinese, we convert Kanji into Simplified Chinese
• Freely available resources used for Category 2 and 3 Kanji conversion:
Kanji
Traditional Chinese
Pe, f , D   P   PD | e, f    T e, f  
e
• Three categories of Kanji:
Meaning
Kanji
Traditional Chinese
Simplified Chinese
• Bayesian subtree alignment model on dependency trees (Nakazawa et al. 2011)
Meaning | Kana | Kanji | Traditional Chinese | Simplified Chinese
wonton | ワンタン | 饂飩 | 餛飩 | 馄饨
self | おのれ | 己 | 自己 | 自己
Alignment Error Rate (bar chart, axis from 17 to 22)
grow-diag-final-and: 21.6
BerkeleyAligner: 20.39
Baseline: 18.14
Base Distribution: 17.93
Model Modification: 17.41
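For reference, Alignment Error Rate is commonly computed from sure and possible gold links as sketched below in Python. The poster does not state how its scores were calculated, so this sketch assumes the standard definition, AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|), with sure links also counted as possible. The function and argument names are illustrative.

```python
def alignment_error_rate(predicted, sure, possible):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|); lower is better.

    predicted, sure, possible: iterables of (source_index, target_index) links.
    """
    a = set(predicted)
    s = set(sure)
    p = set(possible) | s  # every sure link is also a possible link
    if not a and not s:
        return 0.0  # nothing to align and nothing predicted
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))
```

A system that predicts exactly the sure links scores 0, and extra links outside the possible set raise the score.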