Exploiting Shared Chinese Characters in
Chinese Word Segmentation Optimization
for Chinese-Japanese Machine Translation
Chenhui Chu, Toshiaki Nakazawa, Daisuke Kawahara,
Sadao Kurohashi
Graduate School of Informatics, Kyoto University
EAMT2012 (2012/05/28)
Outline
• Word Segmentation Problems
• Common Chinese Characters
• Chinese Word Segmentation Optimization
• Experiments
• Discussion
• Related Work
• Conclusion and Future Work
Word Segmentation for
Chinese-Japanese MT
Zh: 小坂先生是日本临床麻醉学会的创始者。
Ja: 小坂先生は日本臨床麻酔学会の創始者である。
Zh: 小/坂/先生/是/日本/临床/麻醉/学会/的/创始者/。
Ja: 小坂/先生/は/日本/臨床/麻酔/学会/の/創始/者/である/。
Ref: Mr. Kosaka is the founder of The Japan Society for Clinical Anesthesiologists.
Word Segmentation Problems in
Chinese-Japanese MT
Zh: 小/坂/先生/是/日本/临床/麻醉/学会/的/创始者/。
Ja: 小坂/先生/は/日本/臨床/麻酔/学会/の/創始/者/である/。
Ref: Mr. Kosaka is the founder of The Japan Society for Clinical Anesthesiologists.
• Unknown words (e.g. 小坂, wrongly split into 小/坂)
  – Affect segmentation accuracy and consistency
• Word segmentation granularity (e.g. 创始者 vs. 創始/者)
  – Affects word alignment
Chinese Characters
• Chinese characters are used both in Chinese
(Hanzi) and Japanese (Kanji)
• There are many Chinese characters common to Hanzi and Kanji
• We built a common Chinese character mapping table for the 6,355 JIS Kanji (Chu et al., 2012)
Related Studies on Common Chinese Characters
• Automatic sentence alignment (Tan et al., 1995)
• Dictionary construction (Goh et al., 2005)
• Investigation of word-level semantic relations (Huang et al., 2008)
• Phrase alignment (Chu et al., 2011)
• This study exploits common Chinese characters to optimize Chinese word segmentation for MT
Reason for Chinese Word
Segmentation Optimization
• Segmenting Japanese is easier than segmenting Chinese, because Japanese uses Kana in addition to Chinese characters
• The F-score of Japanese segmentation is nearly 99% (Kudo et al., 2004), while that of Chinese is still about 95% (Wang et al., 2011)
• Therefore, we optimize word segmentation only for Chinese, and keep the Japanese segmentation results unchanged
Optimization Pipeline (diagram)
① Chinese Lexicons Extraction: Parallel Training Corpus + Common Chinese Characters → Chinese Lexicons
② Chinese Lexicons Incorporation: Chinese Lexicons + System Dictionary of Chinese Segmenter → System Dictionary with Chinese Lexicons
③ Short Unit Transformation: Chinese Lexicons + Chinese Annotated Corpus for Chinese Segmenter → Short Unit Chinese Corpus for Chinese Segmenter
Training on the short unit corpus with the new dictionary → Optimized Chinese Segmenter
① Chinese Lexicons Extraction
• Step 1: Segment Chinese and Japanese
sentences in the parallel training corpus
• Step 2: Convert Japanese Kanji tokens into Chinese using the mapping table we built (Chu et al., 2012)
• Step 3: Extract the converted tokens as Chinese lexicons if they appear in the corresponding Chinese sentence
Extraction Example
Ja: 小坂/先生/は/日本/臨床/麻酔/学会/の/創始/者/である/。
  ↓ Kanji token conversion
Ja: 小坂/先生/は/日本/临床/麻醉/学会/の/创始/者/である/。
  ↓ Check against the Chinese sentence
Zh: 小/坂/先生/是/日本/临床/麻醉/学会/的/创始者/。
  ↓ Extraction
Chinese lexicons: 小坂, 先生, 日本, 临床, 麻醉, 学会, 创始, 者
Ref: Mr. Kosaka is the founder of The Japan Society for Clinical Anesthesiologists.
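The extraction steps above can be sketched in Python. This is a minimal sketch under assumptions: `kanji2hanzi` is a tiny illustrative subset of the real mapping table (which covers 6,355 JIS Kanji; Chu et al., 2012), and the function names are ours, not from the paper.

```python
# Sketch of step (1), Chinese lexicon extraction. kanji2hanzi is a tiny
# illustrative subset of the full Kanji-to-Hanzi mapping table.
kanji2hanzi = {
    "日": "日", "本": "本", "学": "学", "会": "会",
    "臨": "临", "床": "床", "創": "创", "始": "始",
}

def convert_token(token, mapping):
    """Step 2: convert a Japanese Kanji token to Chinese character by
    character; return None if any character has no Hanzi counterpart."""
    try:
        return "".join(mapping[ch] for ch in token)
    except KeyError:
        return None

def extract_lexicons(ja_tokens, zh_sentence, mapping):
    """Step 3: keep converted tokens that occur in the corresponding
    Chinese sentence."""
    lexicons = set()
    for token in ja_tokens:
        converted = convert_token(token, mapping)
        if converted is not None and converted in zh_sentence:
            lexicons.add(converted)
    return lexicons
```

For an abridged version of the slide's example pair, `extract_lexicons(["日本", "臨床", "学会"], "日本临床麻醉学会", kanji2hanzi)` returns `{"日本", "临床", "学会"}`: each Kanji token converts to a string that occurs in the Chinese sentence.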
② Chinese Lexicons Incorporation
• Using a system dictionary is helpful for
Chinese word segmentation (Low et al., 2005;
Wang et al., 2011)
• We incorporate the extracted lexicons into the
system dictionary of a Chinese segmenter
• POS tags are assigned by converting the POS tags given by the Japanese segmenter, using a POS tag mapping table between Chinese and Japanese
CTB (Penn Chinese Treebank) | JUMAN (Kurohashi et al., 1994)
AD (adverb) | 副詞 (adverb)
CC (coordinating conjunction) | 接続詞 (conjunction)
CD (cardinal number) | 名詞 (noun) [数詞 (numeral noun)]
FW (foreign words) | 未定義語 (undefined word) [アルファベット (alphabet)]
IJ (interjection) | 感動詞 (interjection)
M (measure word) | 接尾辞 (suffix) [名詞性名詞助数辞 (measure word suffix)]
NN (common noun) | 名詞 (noun) [普通名詞 (common noun) / サ変名詞 (sahen noun) / 形式名詞 (formal noun) / 副詞的名詞 (adverbial noun)], 接尾辞 (suffix) [名詞性名詞接尾辞 (noun suffix) / 名詞性特殊接尾辞 (special noun suffix)]
NR (proper noun) | 名詞 (noun) [固有名詞 (proper noun) / 地名 (place name) / 人名 (person name) / 組織名 (organization name)]
NT (temporal noun) | 名詞 (noun) [時相名詞 (temporal noun)]
PU (punctuation) | 特殊 (special word)
VA (predicative adjective) | 形容詞 (adjective)
VV (other verb) | 動詞 (verb) / 名詞 (noun) [サ変名詞 (sahen noun)]
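As an illustration of how such a table can be applied during incorporation, here is a hedged sketch of a JUMAN-to-CTB lookup keyed by tag and optional sub-tag. The dictionary holds only a few rows of the table above, and the NN fallback is our assumption, not stated in the slides.

```python
# A few rows of the POS mapping table above, keyed as
# (JUMAN tag, JUMAN sub-tag); None means "any sub-tag".
juman2ctb = {
    ("副詞", None): "AD",         # adverb
    ("接続詞", None): "CC",       # conjunction
    ("名詞", "数詞"): "CD",       # numeral noun -> cardinal number
    ("名詞", "時相名詞"): "NT",   # temporal noun
    ("感動詞", None): "IJ",       # interjection
    ("形容詞", None): "VA",       # predicative adjective
    ("特殊", None): "PU",         # special word -> punctuation
}

def to_ctb(juman_tag, sub_tag=None):
    """Try the specific (tag, sub-tag) entry first, then the generic
    (tag, None) entry; fall back to NN (an assumption of this sketch)."""
    return (juman2ctb.get((juman_tag, sub_tag))
            or juman2ctb.get((juman_tag, None))
            or "NN")
```

The sub-tag lookup mirrors the bracketed JUMAN sub-categories in the table, which decide cases such as 名詞 mapping to CD, NT, or a noun tag.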
③ Short Unit Transformation
• Adjusting Chinese word segmentation so that tokens map 1-to-1 as often as possible between parallel sentences can improve alignment accuracy (Bai et al., 2008)
• Wang et al. (2010) proposed a short unit standard for Chinese word segmentation, which reduces the number of 1-to-n alignments and improves MT performance
Our Method
• We transform the annotated training data of the Chinese segmenter using the extracted lexicons
CTB: 从_P/ 有效性_NN /高_VA/的_DEC/ 格要素_NN /…
Lexicons: 有效 (effective), 要素 (element)
Short: 从_P/ 有效_NN/性_NN /高_VA/的_DEC/ 格_NN/要素_NN /…
Ref: From case element with high effectiveness …
Constraints
• We do not use extracted lexicons that consist of only one Chinese character
  Long token: 歌颂 (praise)
  Extracted lexicons: 歌 (song), …
  Without this constraint, the single-character lexicon would wrongly split the long token into the short unit tokens 歌 (song) / 颂 (praise)
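A minimal greedy sketch of the short unit transformation under this constraint: a long token is split by longest-match against the multi-character extracted lexicons, and left untouched when no lexicon matches. The names are ours, and the paper's actual procedure may resolve matches differently (see the transformation ambiguity problem discussed later).

```python
def short_unit_transform(token, lexicons):
    """Split a long annotated token into short units via longest-match
    against the extracted lexicons; single-character lexicons are
    excluded per the constraint above."""
    lexicons = {w for w in lexicons if len(w) >= 2}
    if not any(w in token for w in lexicons):
        return [token]  # no lexicon matches: leave the token whole
    units, i = [], 0
    while i < len(token):
        match = None
        for j in range(len(token), i + 1, -1):  # longest substring first
            if token[i:j] in lexicons:
                match = token[i:j]
                break
        if match:
            units.append(match)
            i += len(match)
        else:
            units.append(token[i])  # leftover single character
            i += 1
    return units
```

On the slide's examples this yields 有效性 → 有效/性 and 格要素 → 格/要素, while 歌颂 stays whole because the single-character lexicon 歌 is excluded. Note that greedy longest-match simply picks one split for ambiguous cases such as 充电器.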
Two Kinds of Experiments
• Experiments on Moses
• Experiments on the Kyoto example-based machine translation (EBMT) system (Nakazawa and Kurohashi, 2011)
  – A dependency tree-based decoder
Experimental Settings on MOSES (1/2)
Parallel Training Corpus: Zh-Ja paper abstract corpus (680k sentences)
Chinese Annotated Corpus: NICT Chinese Treebank (same domain as the parallel corpus, 9,792 sentences); CTB 7 (31,131 sentences)
Chinese Segmenter: in-house Chinese segmenter
Japanese Segmenter: JUMAN (Kurohashi et al., 1994)
MT System: Moses with default options, except for the distortion limit (6→20)
Experimental Settings on MOSES (2/2)
• Baseline: only use the lexicons extracted from the Chinese annotated corpus
• Incorporation: incorporate the extracted Chinese lexicons
• Short unit: incorporate the extracted Chinese lexicons and train the Chinese segmenter on the short unit training data
Results of Chinese-to-Japanese
Translation Experiments on MOSES
BLEU           NICT Chinese Treebank   CTB 7
Baseline       34.92                   36.64
Incorporation  36.19                   37.39
Short unit     36.82                   38.59

• CTB 7 shows better performance because its size is more than three times that of the NICT Chinese Treebank
• Lexicons extracted from the paper abstract domain also work well on other domains (i.e., CTB 7)
Results of Japanese-to-Chinese
Translation Experiments on MOSES
BLEU           NICT Chinese Treebank   CTB 7
Baseline       31.42                   31.83
Incorporation  31.24                   32.34
Short unit     31.95                   31.95

• Improvements are not significant compared with Zh-to-Ja, because our proposed approach does not change the segmentation results of the input Japanese sentences
Experimental Settings on EBMT (1/2)
Parallel Training Corpus: Zh-Ja paper abstract corpus (680k sentences)
Chinese Annotated Corpus: CTB 7 (31,131 sentences)
Chinese Segmenter: in-house Chinese segmenter
Japanese Segmenter: JUMAN (Kurohashi et al., 1994)
MT System: Kyoto example-based machine translation (EBMT) system
Experimental Settings on EBMT (2/2)
• Baseline: only use the lexicons extracted from the Chinese annotated corpus
• Short unit: incorporate the extracted Chinese lexicons and train the Chinese segmenter on the short unit training data
Results of Translation Experiments on
EBMT
BLEU       Baseline   Short unit
Zh-to-Ja   22.84      23.36
Ja-to-Zh   19.10      19.15

• Translation performance is worse than with MOSES because EBMT suffers from the low accuracy of the Chinese parser
• The improvement from short unit is not significant because the Chinese parser is not trained on short unit segmented training data
Short Unit Effectiveness on MOSES
Baseline (BLEU=49.38)
Input: 本/论文/中/,/提议/考虑/现存/实现/方式/的/功能/适应性/决定/对策/目标/的/保密/基本/设计法/。
Output: 本/論文/で/は/,/提案/する/適応/的/対策/を/決定/する/セキュリティ/基本/設計/法/を/考える/現存/の/実現/方式/の/機能/を/目標/と/して/いる/.
Short unit (BLEU=56.33)
Input: 本/论文/中/,/提议/考虑/现存/实现/方式/的/功能/适应/性/决定/对策/目标/的/保密/基本/设计/法/。
Output: 本/論文/で/は/,/提案/する/考え/現存/の/実現/方式/の/機能/的/適応/性/を/決定/する/対策/目標/の/セキュリティ/基本/設計/法/を/提案/する/.
Reference
本/論文/で/は/,/対策/目標/を/現存/の/実現/方式/の/機能/的/適合/性/も/考慮/して/決定/する/セキュリティ/基本/設計/法/を/提案/する/.
(In this paper, we propose a basic security design method that also considers the functional suitability of the existing implementation method when determining countermeasure targets.)
Number of Extracted Lexicons
Source                      Count
Parallel training corpus    18,584

Source                      NICT Treebank   CTB 7
Chinese annotated corpus    13,471          26,202
Short unit Chinese corpus   12,627          25,490

• The number of extracted lexicons decreased after short unit transformation because the number of duplicated lexicons increased
Short Unit Transformation Percentage
• NICT Chinese Treebank
  – 6,623 out of 257,825 tokens were transformed into 13,469 short unit tokens (2.57%)
• CTB 7
  – 19,983 out of 718,716 tokens were transformed into 41,336 short unit tokens (2.78%)
Short Unit Transformation Problems (1/3)
• Improper transformation problem
  Long token: 不好意思 (sorry)
  Extracted lexicons: 好意 (favor), …
  Short unit tokens: 不 (not) / 好意 (favor) / 思 (think)
Short Unit Transformation Problems (2/3)
• Transformation ambiguity problem
  Long token: 充电器 (charger)
  Extracted lexicons: 充电 (charge), 电器 (electric equipment), …
  Possible short unit tokens: 充电 (charge) / 器 (device), or 充 (charge) / 电器 (electric equipment)
Short Unit Transformation Problems (3/3)
• POS tag assignment problem
  Long token: 被实验者_NN (test subject)
  Extracted lexicons: 实验 (test), …
  Short unit tokens: 被_NN (be) / 实验_NN (test) / 者_NN (person)
  The correct POS tag for 被 (be) should be LB (long bei-construction)
Bai et al., 2008
• Proposed a method of learning affix rules from an aligned Chinese-English bilingual terminology bank, and using them to adjust Chinese word segmentation in the parallel corpus directly
Wang et al., 2010
• Proposed a method based on transfer rules and a transfer database
  – The transfer rules are extracted from the alignment results of annotated Chinese and segmented Japanese training data
  – The transfer database is constructed from external lexicons and manually modified
Conclusion
• We proposed an approach that exploits common Chinese characters to optimize Chinese word segmentation for Chinese-Japanese MT
• Experimental results of Chinese-Japanese MT on a phrase-based SMT system indicate that our approach can significantly improve MT performance
Future Work
• Solve the short unit transformation problems
• Evaluate the proposed approach on parallel corpora from other domains