Japanese-Chinese Phrase Alignment Using Common Chinese Characters Information

Chenhui Chu, Toshiaki Nakazawa and Sadao Kurohashi
Graduate School of Informatics, Kyoto University

Introduction

• Common Chinese characters information may be valuable in word/phrase alignment between Japanese and Chinese
  – Chinese characters are used in both Japanese (Kanji) and Chinese (Hanzi)
  – Kanji and Hanzi share common Chinese characters
  – Parallel sentences carry equivalent meanings in each language, so we can assume that common Chinese characters appear in both sentences
• Three categories of Kanji:
  – Category 1: identical to Simplified Chinese
  – Category 2: identical to Traditional Chinese but different from Simplified Chinese
  – Category 3: visual variations

  Meaning | Kanji | Traditional Chinese | Simplified Chinese
  snow    | 雪 | 雪 | 雪
  country | 国 | 國 | 国
  love    | 愛 | 愛 | 爱
  begin   | 発 | 發 | 发
  hair    | 髪 | 髮 | 发

Common Chinese Characters Detection

• To detect common Chinese characters between Japanese and Simplified Chinese, we convert Japanese into Chinese
• Freely available resources used for Category 2 and 3 Kanji conversion:
  – Unihan database

  Kanji               | 説 | 発 | 銭 | 検 | 経 | 焼 | ・・・
  Traditional Chinese | 說 | 發 | 錢 | 檢 | 經 | 燒 | ・・・

  Simplified Chinese  | 说 | 钱 | 干 | 故 | 仿 | ・・・
  Traditional Chinese | 說 | 錢 | 干,幹,乾 | 故 | 仿,彷,倣 | ・・・

• Example of common Chinese characters detection
  – Kanji: 開発 (develop)
  – Category 3 Kanji: 発; Category 2 Kanji: 開
  – Category 3 Kanji to Traditional Chinese: 発→發 ・・・ (result is Category 2 Kanji: 發)
  – Category 2 Kanji to Simplified Chinese: 開→开, 發→发 ・・・
  – Simplified Chinese: 开发
• We also do Kana-Kanji conversion for common Chinese characters detection

  Meaning | Kana | Kanji | Traditional Chinese | Simplified Chinese
  envious | うらやましい | 羨ましい | 羨慕 | 羡慕
  wonton  | ワンタン | 饂飩 | 餛飩 | 馄饨
  self    | おのれ | 己 | 自己 | 自己

Alignment Model

• Bayesian subtree alignment model on dependency trees (Nakazawa et al. 2011)

\[ P(\{\langle e, f \rangle\}, D) = P(\ell) \cdot P(D \mid \{\langle e, f \rangle\}) \cdot \prod_{\langle e, f \rangle} \theta_T(\langle e, f \rangle) \qquad (1) \]

\[ \theta_T(\langle e, f \rangle) = p_{\phi}\, \theta_N(\langle e, f \rangle) + (1 - p_{\phi})\, \theta_A(\langle e, f \rangle) \qquad (2) \]

\[ \theta_A(\langle e, f \rangle) \sim DP(M_A, \alpha_A) \qquad (3) \]

\[ M_A(\langle e, f \rangle) = \left[ P_f(f)\, P_{WA}(e \mid f) \cdot P_e(e)\, P_{WA}(f \mid e) \right]^{1/2} \qquad (4) \]

\[ P_{WA}(e \mid f) = \prod_{j=1}^{l_e} \frac{1}{l_f} \sum_{i=1}^{l_f} t(e_j \mid f_i) \qquad (5) \]

• Common Chinese characters information incorporation
  – Base distribution adjustment

\[ t'(e_j \mid f_i) = t(e_j \mid f_i) \cdot w \qquad (6) \]

\[ \hat{t}(e_j \mid f_i) = \frac{t'(e_j \mid f_i)}{\sum_{i=1}^{n} t'(e_j \mid f_i)} \qquad (7) \]

  – Model modification

\[ P(\{\langle e, f \rangle\}, D) = P(\ell) \cdot P(D \mid \{\langle e, f \rangle\}) \cdot \prod_{\langle e, f \rangle} w \cdot \theta_T(\langle e, f \rangle) \qquad (8) \]

Experiments

• Japanese-Chinese corpus we used

                                          | Ja    | Zh
  # of sentences                          | 680k  | 680k
  # of words                              | 21.8M | 18.2M
  # of Chinese characters                 | 14.0M | 24.2M
  # of Chinese characters (+Kana-Kanji)   | 14.6M | 24.2M
  average sentence length                 | 32.9  | 22.7

• Coverage of common Chinese characters detection
  [Figure: coverage (0% to 80%) after Category 1, +Category 2, +Category 3 and +Kana-Kanji, plotted for Char(Ja), Char(Zh), Word(Ja) and Word(Zh)]
• Alignment

  System              | Alignment Error Rate
  grow-diag-final-and | 21.6
  BerkeleyAligner     | 20.39
  Baseline            | 18.14
  Base Distribution   | 17.93
  Model Modification  | 17.41
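The detection procedure (Category 3 Kanji to Traditional Chinese, then Category 2 Kanji to Simplified Chinese) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the two mapping tables are toy dictionaries built only from the example characters on this poster, and the function names `kanji_to_simplified`, `convert_japanese` and `common_characters` are hypothetical. A real system would presumably load full tables from resources such as the Unihan database, and Kana-Kanji conversion is not covered here.

```python
# Toy mapping tables (assumption: built only from the poster's examples).
# Category 3 Kanji -> Traditional Chinese
KANJI_TO_TRADITIONAL = {
    "説": "說", "発": "發", "銭": "錢", "検": "檢", "経": "經", "焼": "燒",
}
# Traditional Chinese -> Simplified Chinese (many-to-one, so a dict suffices)
TRADITIONAL_TO_SIMPLIFIED = {
    "說": "说", "錢": "钱", "開": "开", "發": "发", "愛": "爱", "國": "国",
}


def is_chinese_character(ch):
    """True if ch lies in the basic CJK Unified Ideographs block (excludes Kana)."""
    return "\u4e00" <= ch <= "\u9fff"


def kanji_to_simplified(ch):
    """Convert one Kanji to Simplified Chinese via its Traditional form.

    Characters missing from a table are passed through unchanged, which
    handles Category 1 Kanji (already identical to Simplified Chinese).
    """
    traditional = KANJI_TO_TRADITIONAL.get(ch, ch)
    return TRADITIONAL_TO_SIMPLIFIED.get(traditional, traditional)


def convert_japanese(text):
    """Convert every character of a Japanese string; Kana stay as they are."""
    return "".join(kanji_to_simplified(ch) for ch in text)


def common_characters(ja_sentence, zh_sentence):
    """Return (Kanji, Hanzi) pairs whose converted Kanji occurs in zh_sentence."""
    zh_chars = set(zh_sentence)
    pairs = []
    for ja_char in ja_sentence:
        zh_char = kanji_to_simplified(ja_char)
        if is_chinese_character(zh_char) and zh_char in zh_chars:
            pairs.append((ja_char, zh_char))
    return pairs


print(convert_japanese("開発"))               # 开发
print(common_characters("開発する", "开发"))  # [('開', '开'), ('発', '发')]
```

Chaining the two dictionaries mirrors the two-step conversion in the 開発 example: 発 first becomes 發, which then falls through the same Traditional-to-Simplified table as 開.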