Web-based Acquisition of
Japanese Katakana Variants
Hiroshi Nakagawa (University of Tokyo, Japan)
Takeshi Masuyama (University of Tokyo, Japan; now Yahoo! Japan)
1
• We are very sorry for the Katakana font printing problem in the proceedings. We could not check the final printing.
• Please read the English transliterations of the Katakana parts that are printed like %-%c…..
2
Cooperation with
Satoshi Sekine
Computer Science, New York University
And
Language Craft Co.
3
Katakana word variants
mew, mew
ニャア, ニャア (nyaa, nyaa)
ニャアー, ニャアー (nyah, nyah)
The mapping from sound to spelling differs from language to language.
4
History of Katakana
Every country and every language has its own history of meanings, codes and fonts.
Phonogram vs. Ideogram
5
漢字 (Hanji)
Kanji (Hanji) characters (= ideograms) were imported to Japan 1300 years ago.
6
Almost 1000 years ago, women writers worked out phonograms (Hiragana and Katakana) from Kanji (ideograms) to express the Japanese people’s mentality.
紫式部 (Murasaki Shikibu)
世 (Kanji): ideogram
せ (Hiragana) and セ (Katakana): phonograms
7
Modern history of Katakana
• Japanese Katakana and Hiragana have a one-to-one mapping.
• After the Meiji Restoration (1868), Japanese people used
  • Katakana to express function words
  • Hiragana to express words imported from Western countries.
• After World War II (1945), the two roles were exchanged.
  • Hiragana came to be used for function words such as case markers.
  • Katakana came to be used for words imported from Western countries.
• Thus the majority of Katakana words are transliterations of English words.
8
However,
• Japanese Katakana has only five vowels (a, i, u, e, o) and 19 consonants (k, g, s, z, j, t, d, n, h, b, m, y, r, w, c, sh, ch, ny, my).
• Pronunciations are always C+V or V.
  • No C+C.
  • No distinction between (b, v), (h, f), (l, r), ...
• There is no orthographic way to express English sounds with the Katakana character set.
• Thus the Japanese language has accepted several Katakana spellings for one English word.
  • Katakana variants
9
“detail” is transliterated into the Katakana variants
ディテール (dhiteeru)
ディティール (dhithiiru)
ディテェール (dhitheeru)
ディテイル (dhiteiru)
“Cameron Diaz” is transliterated into the Katakana variants
キャメロン・ディアス (kyameronn・dhiasu)
キャメロン・ディアズ (kyameronn・dhiazu)
キャメロンディアス (kyameronndhiasu)
10
An example of search result hits for “spaghetti” with Google
The + and - search options were used to avoid overlap between the hits of distinct Katakana variants.
Katakana variant: hits of Google search (%)
スパゲッティ (supagettuthi): 187,000 (32.7%)
スパゲッティー (supagettuthii): 57,600 (10.1%)
スパゲッテイ (supagettutei): 6,850 (1.2%)
スパゲティ (supagetuthi): 240,000 (41.9%)
スパゲティー (supagethii): 77,400 (13.5%)
スパゲテイ (supagetei): 3,800 (0.7%)
total: 572,650 (100%)
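The percentages in this table are simply each variant's share of the total hit count. A minimal Python sketch of that arithmetic follows; the hit counts are copied from the slide, while the variable names are illustrative and not part of the original work.

```python
# Hit counts copied from the slide; names such as `hits` are illustrative.
hits = {
    "スパゲッティ": 187_000,
    "スパゲッティー": 57_600,
    "スパゲッテイ": 6_850,
    "スパゲティ": 240_000,
    "スパゲティー": 77_400,
    "スパゲテイ": 3_800,
}

# The + and - options make the counts disjoint, so they can be summed.
total = sum(hits.values())

# Share of each variant in percent, rounded to one decimal as on the slide.
shares = {variant: round(100 * count / total, 1) for variant, count in hits.items()}

for variant, share in shares.items():
    print(f"{variant}: {hits[variant]:,} ({share}%)")
print(f"total: {total:,}")
```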
11
A Katakana variant extraction system is
needed to enhance
the cross-language ability of:
Information Retrieval
Search engine
Machine translation
Information Extraction
Summarization
Question Answering
12
Previous research 1:
• Manually constructed rewriting rules to generate and/or extract Katakana variants from a given Katakana word (Shishibori et al., 1993, 1994; Kubota, 1994)
• Samples of rewrite rules
  • ベ (Be) ⇔ ヴェ (Ve)
  • チ (chi) ⇔ ツィ (thi)
• Input: ベネチア (Benechia)
• Output: ベネツィア (Benethia), ヴェネチア (Venechia), ヴェネツィア (Venethia)
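How rewrite rules expand one input into several variants can be sketched in Python as below. This is only an illustration under stated assumptions: the two rules are the samples from the slide, and the function name `generate_variants` and the exhaustive search strategy are hypothetical, not the cited authors' implementation.

```python
# The two sample rules from the slide, applied in both directions (⇔).
RULES = [("ベ", "ヴェ"), ("チ", "ツィ")]


def generate_variants(word, rules=RULES):
    """Return every spelling reachable from `word` by applying the rules.

    Assumes no rule's replacement contains its own source string
    (true for the two sample rules), so the search terminates.
    """
    seen = {word}
    frontier = [word]
    while frontier:
        current = frontier.pop()
        for a, b in rules:
            for src, dst in ((a, b), (b, a)):
                start = current.find(src)
                while start != -1:
                    candidate = current[:start] + dst + current[start + len(src):]
                    if candidate not in seen:
                        seen.add(candidate)
                        frontier.append(candidate)
                    start = current.find(src, start + 1)
    seen.discard(word)  # report only the new spellings
    return sorted(seen)


variants = generate_variants("ベネチア")
print(variants)
```

For the input ベネチア this yields the three outputs listed on the slide.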
13
Previous research 2:
• Extract Katakana variants with weighted edit distance (Magari et al., 2004; Ohtake et al., 2004)
• Edit distance is defined as
  • the number of operations needed to transform one Katakana word into another Katakana word
  • Operations: insert, delete, replace
  • Ex.: レポート (Repooto) → リポート (Ripooto): edit dist. = 1
• Weighted edit distance
  • The weight of each operation is given manually
  • Ex.: weight of edit dist. (レポート → リポート) → 0.8
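The two examples above can be reproduced with a standard dynamic-programming edit distance. The sketch below is a generic Levenshtein implementation, not the cited authors' code; the `weight` callback and the single 0.8 weight for レ → リ are assumptions chosen only to match the slide's example.

```python
def edit_distance(a, b, weight=None):
    """Edit distance between strings a and b.

    `weight(op, x, y)` returns the cost of one operation, where op is
    "insert", "delete" or "replace"; by default every operation costs 1.
    """
    cost = weight or (lambda op, x, y: 1.0)
    # prev[j] = distance between the empty prefix of a and b[:j]
    prev = [0.0]
    for cb in b:
        prev.append(prev[-1] + cost("insert", "", cb))
    for ca in a:
        cur = [prev[0] + cost("delete", ca, "")]
        for j, cb in enumerate(b, start=1):
            replace = prev[j - 1] + (0.0 if ca == cb else cost("replace", ca, cb))
            delete = prev[j] + cost("delete", ca, "")
            insert = cur[j - 1] + cost("insert", "", cb)
            cur.append(min(replace, delete, insert))
        prev = cur
    return prev[-1]


def sample_weight(op, x, y):
    # Hypothetical manual weight: replacing レ with リ costs 0.8, all else 1.
    return 0.8 if (op, x, y) == ("replace", "レ", "リ") else 1.0


unit = edit_distance("レポート", "リポート")
weighted = edit_distance("レポート", "リポート", sample_weight)
print(unit, weighted)
```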
14
Previous research 3: a more direct way
• String penalty to extract Katakana variants (Masuyama et al., 2004)
• String penalty: SP
  • Based on weighted edit distance, but extended to treat strings of two or three characters
  • Weights are given manually to combinations of edit operations (= string replacing operations)
  • Ex.: SP(ボイス ’Boisu’, ヴォイス ’Voisu’) = 4 … replace and insert
15
Previous research 4:
Combination method
(Masuyama, Nakagawa, Sekine, COLING 2004)
• Combination of string penalty and context
• String penalty: SP
  • The SP value is given by an expert
• Similarity of the contexts in which each Katakana variant appears
  • Vector space model (automatically calculated)
  • If the words around two Katakana words are similar, then the Katakana words are variants of each other
16
Problems of previous research
• Low coverage
• Needs intensive human intellectual work for
  • working out rewrite rules
  • determining the weights of weighted edit distance
  • determining the string penalty of each Katakana string pair
• Depends on the specific corpus that is used to calculate the weights of
  • weighted edit distance
  • string penalty
17
Purpose of this work
The problems of manually given string penalties:
Labor intensive (even in the combination of SP and context)
Low coverage
Determine string penalties mechanically
and
automatically build Katakana variants
for each Katakana word
18
Calculating string penalties mechanically
For this, we need an accurate, high-quality Katakana variants database!
19
Process
English word and its Katakana variant: (idea, アイデア), (report, レポート), …
  ↓
WWW: … レポート … report … リポート …
  ↓
Pairs of variant candidates: (レポート ’repooto’, リポート ’ripooto’), (レポート ’repooto’, サポート ’sapooto’), …
  ↓
Pairs of variants: (レポート ’repooto’, リポート ’ripooto’), (レファレンス ’refarensu’, リファレンス ’rifarensu’), (アーキテクト ’aakitekuto’, アーキテクツ ’aakitekutu’), …
  ↓
String Penalty: レ ’re’ ⇔ リ ’ri’ : 1, ト ’to’ ⇔ ッ ’ttu’ : 3, …
20
Process
English word and its Katakana variant → (Web search) → WWW → pairs of variant candidates → pairs of variants → String Penalty
21
How to find candidates of Katakana
variant pairs (1/3)
1. To collect English words and their Katakana variants, e.g. (vodka, ウォッカ),
• we used four Web sites from which we collected a number of English words and their Japanese translations:
  • http://homepage2.nifty.com/katakanaEnglish/
  • http://www.hoshi.cis.ibaraki.ac.jp/usefull/usefull15.html
  • http://ke.ics.saitama-u.ac.jp/jsgs/keywords.html
  • http://smalltown.ne.jp/~uasa/pub/distfiles/skk-extra200307/SKK-JISYO.edit
• This gave 14,958 distinct pairs of English words and their Katakana translations.
22
How to find candidates of Katakana
variant pairs (2/3)
1. Extract many English words and their Katakana variants
• 14,958 English-Katakana pairs
2. To collect more Katakana variants for each English word, we use Google search to get pages that include the English word and the Katakana word of its translation, with two kinds of query:
• “English word” with language = Japanese
• “English word + 「英和」 (“English to Japanese”)”, in order to search English-Japanese dictionary sites
3. Gather Katakana words from the search results
23
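Google search with the English word “vodka” among pages written in Japanese
[Screenshot: Google results for the query “vodka”]
24
Add the query term 「英和」 (English-Japanese) and search with Google
[Screenshot: Google results for the query “英和 (e-j) vodka”]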
25
Process
English word and its Katakana variant → (Web search: “英和 (e-j) report”) → WWW → pairs of variant candidates (edit dist. = 1) → pairs of variants → String Penalty
26
How to find candidates of Katakana
variant pairs (3/3)
4. Extract Katakana word pairs whose edit distance = 1 as promising candidates of Katakana variants
• Ex. (vodka, ウォッカ)
• (ウォッカ ’Uottuka’、ウォトカ ’Uotoka’)
  (ウォッカ ’Uottuka’、ウオッカ ’UOttuka’)
  (ウォッカ ’Uottuka’、ヴォッカ ’Vuottuka’)
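Step 4 can be sketched as a filter over all pairs of Katakana words gathered for one English word. The Python below is a minimal illustration, assuming character-level edit distance on Unicode strings; the helper name `within_one_edit` and the sample word list (which adds サポート as an unrelated word) are hypothetical.

```python
from itertools import combinations


def within_one_edit(a, b):
    """True if a and b differ by exactly one insert, delete or replace."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1  # skip the common prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one replacement at position i
    return a[i:] == b[i + 1:]  # one extra character in b at position i


# Katakana words assumed to have been gathered from search results for "vodka".
words = ["ウォッカ", "ウォトカ", "ウオッカ", "ヴォッカ", "サポート"]
pairs = [(a, b) for a, b in combinations(words, 2) if within_one_edit(a, b)]
for pair in pairs:
    print(pair)
```

Only the three pairs that contain ウォッカ survive, matching the slide's example.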
27
Process
English word and its Katakana variant → (Web search: “英和 (e-j) report”) → WWW → pairs of variant candidates (edit dist. = 1) → pairs of variant candidates by context (cosine sim > 0.00006) → String Penalty
28
How to extract documents in which
context similarity is calculated
• Google search with a query of a Katakana word that is a candidate Katakana variant.
• Extract the context of the Katakana variant from the search result pages.
29
Search “vodka” with Google
[Screenshot: the query +ウォッカ (‘+vodka’) retrieves all pages including ウォッカ; ウオッカ’s contexts]
30
1. Calculate the context similarity of a candidate Katakana variant pair as the cosine similarity between contexts such as
   “drink vodka (Vuottka) with a main dish and plate of caviar in the restaurants”
   “eat some main dish plate after vodka (Uotoka) in that restaurants”
• The 50 words around a candidate Katakana variant are used as its context
2. Identify and extract Katakana variants if the cosine similarity is greater than the threshold of 0.00006.
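A minimal Python sketch of this context comparison follows. It assumes whitespace-tokenized English glosses of the two example contexts and uses the log(freq(t) + 1) term weight given on the detail slide; the function names are illustrative, and real contexts would be the 50 Japanese words around each occurrence.

```python
import math
from collections import Counter


def context_vector(context_words):
    """Weight each context word t by log(freq(t) + 1)."""
    counts = Counter(context_words)
    return {t: math.log(f + 1) for t, f in counts.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# The two example contexts from the slide, with the target word itself removed.
u = context_vector("drink with a main dish and plate of caviar in the restaurants".split())
v = context_vector("eat some main dish plate after in that restaurants".split())
sim = cosine(u, v)
print(round(sim, 3))
```

The two toy contexts share "main", "dish", "plate", "in" and "restaurants", so the similarity is well above zero; values on real Web contexts are much smaller, which is why the slide's threshold is as low as 0.00006.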
31
Detail of context similarity calculation
• Context = 50 words around the Katakana word
• Weight of word t in the context
  • log(freq(t) + 1)
• Context similarity = cosine
• Selection from candidates by
  • cosine similarity ≧ 0.00006 (threshold)
• Threshold optimization
  • threshold = argmax over thresholds of the F-value on positive pairs (347 pairs) and negative pairs (111 pairs)
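The threshold optimization can be sketched as a sweep over candidate thresholds that keeps the one with the highest F-value. The Python below is an illustration only: the similarity scores are small made-up lists, not the 347 positive and 111 negative pairs of the original experiment, and the function names are hypothetical.

```python
def f_value(threshold, positives, negatives):
    """F-value when pairs with similarity >= threshold are accepted."""
    tp = sum(s >= threshold for s in positives)  # correct variants accepted
    fp = sum(s >= threshold for s in negatives)  # non-variants accepted
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / len(positives)
    return 2 * precision * recall / (precision + recall)


def best_threshold(candidates, positives, negatives):
    """argmax over candidate thresholds of the F-value."""
    return max(candidates, key=lambda t: f_value(t, positives, negatives))


# Hypothetical similarity scores for labelled pairs (toy data).
positives = [0.9, 0.8, 0.4, 0.3]
negatives = [0.5, 0.2, 0.1]
candidates = [0.05, 0.25, 0.45, 0.6]

chosen = best_threshold(candidates, positives, negatives)
print(chosen, round(f_value(chosen, positives, negatives), 3))
```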
32
Results of context similarity vs cosine threshold
[Figure: F-value (%) on the y-axis (ticks from 80 to 90) against the threshold of cosine similarity on the x-axis (ticks from 0.00001 to 0.0001)]
33
Process
English word and its Katakana variant → (Web search: “英和 (e-j) report”) → WWW → pairs of variant candidates (edit dist. = 1) → pairs of variants (cosine sim > 0.00006) → String Penalty
The next step is to calculate SP based on statistics, e.g. レ ’re’ ⇔ リ ’ri’ : SP = 1, ト ’to’ ⇔ ッ ’ttu’ : SP = 3, …
34
2nd stage:
Calculation of string penalty: SP
• String penalty of the operation x → y (x is replaced with y)
• We focus on
  • the high correlation between replaced strings and their character context, which is composed of several characters around the target string.
Example:
(ウインブルドン、ウィンブルドン)
(ウインドウズ、ウィンドウズ)
(ウインク、ウィンク)
Replacing イ ’I’ with ィ ’i’ co-occurs with ウ ’U’ before it and ン ’n’ after it.
35
Character level context: CLC1..CLC5
used to calculate SP
• x : target character
• α, β, γ, δ : characters around x
CLC    String context around x
CLC1   αβx    preceding two characters of x
CLC2   βx     preceding one character of x
CLC3   xγ     succeeding one character of x
CLC4   xγδ    succeeding two characters of x
CLC5   βxγ    preceding and succeeding characters of x
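The five contexts can be read off a word by slicing around the target position. The Python sketch below is illustrative: the function name `character_contexts` is hypothetical, and the handling of word boundaries (a context is simply shorter when fewer characters exist) is an assumption that the slide does not specify.

```python
def character_contexts(word, start, length=1):
    """Return CLC1..CLC5 for the target x = word[start:start + length]."""
    x = word[start:start + length]
    before = word[max(0, start - 2):start]           # up to two preceding chars (α, β)
    after = word[start + length:start + length + 2]  # up to two succeeding chars (γ, δ)
    return {
        "CLC1": before + x,                      # αβx
        "CLC2": before[-1:] + x,                 # βx
        "CLC3": x + after[:1],                   # xγ
        "CLC4": x + after,                       # xγδ
        "CLC5": before[-1:] + x + after[:1],     # βxγ
    }


# Target イ at index 1 of ウインク; the replacement would be イ → ィ.
ctx = character_contexts("ウインク", 1)
for name, context in ctx.items():
    print(name, context)
```

For ウインク, ウインドウズ and ウインブルドン alike, CLC5 around イ is ウイン, which is the kind of shared context the second stage relies on.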
36
Calculation of string penalty: SP
$$P(x \to y \mid CLC_i) = \frac{f(CLC_i,\ x \to y) + 1}{f(CLC_i) + 2}, \qquad i = 1, 2, 3, 4, 5$$
f(CLC_i) = frequency of pairs in which CLC_i occurs
f(CLC_i, x → y) = frequency of pairs in which both CLC_i and x → y occur
37
Calculation of string penalty: SP
Identify the character context CLC_i that most probably co-occurs with the operation x → y:
$$CLC = \arg\max_{CLC_i\ (i = 1, \dots, 5)} P(x \to y \mid CLC_i)$$
Then
$$SP(x \to y) = \left[ \frac{1}{P(x \to y \mid CLC)} \right]$$
Rank of occurrence ≈ C × (Prob. of occurrence)^{-1} (Zipf’s law)
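The calculation can be sketched in Python as below. Two points are assumptions rather than facts from the slides: the frequency counts are made up for illustration, and the integer conversion of 1/P is taken to be a floor. The floor is chosen here only because the smoothed probability is always below 1, so a ceiling could never produce the SP = 1 values that appear in the examples; rounding to nearest would be another possibility.

```python
import math
from fractions import Fraction


def string_penalty(counts):
    """Pick the best character context and derive SP from it.

    counts maps a context name to (f(CLC_i, x->y), f(CLC_i)).
    Returns (best context name, P(x->y | CLC), SP).
    """
    # Smoothed conditional probability, as on the slide: (f + 1) / (f + 2).
    probs = {
        name: Fraction(f_joint + 1, f_clc + 2)
        for name, (f_joint, f_clc) in counts.items()
    }
    best = max(probs, key=probs.get)
    # ASSUMPTION: the integer part of 1/P is taken with floor.
    sp = math.floor(1 / probs[best])
    return best, probs[best], sp


# Hypothetical counts for one operation x -> y under three contexts.
frequent = {"CLC1": (3, 10), "CLC3": (5, 20), "CLC5": (8, 9)}
rare = {"CLC2": (1, 8)}

print(string_penalty(frequent))
print(string_penalty(rare))
```

An operation that almost always occurs in its best context gets a small penalty, and a rarely observed one gets a large penalty, in line with the rank-versus-probability reading of Zipf's law on the slide.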
38
Examples of string penalties
Operation: SP (example)
Insertion and deletion of ‘・’: 1 (ラストシーン、ラスト・シーン)
Insertion and deletion of macron ‘ー’: 1 (エネルギー、エネルギ)
Replace オ ‘O’ and ォ ‘o’: 1 (ウオッカ、ウォッカ)
Replace グ ‘gu’ and ク ‘ku’: 2 (バック、バッグ)
Replace ヴ ‘vu’ and ブ ‘bu’: 2 (ジュネーヴ、ジュネーブ)
Replace ヴ ‘vu’ and ウ ‘U’: 3 (ヴォッカ、ウォッカ)
39
Comparison of SP by hand and SP by
the proposed method
SP by hand, proposed by Masuyama et al. (2004)
An expert worked out the SP by hand
Gold standard Katakana variants:
682 pairs of Katakana variant candidates extracted from a newspaper corpus, whose string penalties are between 1 and 12
We found no correct variants whose SPs are between 10 and 12. Thus, the above gold standard probably covers all correct variants.
40
Comparison of SPs
Each cell: correct variants / candidate pairs with that SP
SP   SP by hand        SP by proposed mechanical method
1    216/221 (97.7%)   262/286 (91.6%)
2    162/207 (78.3%)   133/148 (89.9%)
3    70/99 (70.7%)     51/90 (56.7%)
4    2/14 (14.3%)      2/26 (7.7%)
5    0/29 (0.0%)       0/16 (0.0%)
6    0/13 (0.0%)       2/34 (5.9%)
7    1/20 (5.0%)       1/39 (2.6%)
8    0/13 (0.0%)       1/15 (6.7%)
9    1/12 (8.3%)       0/8 (0.0%)
10   0/16 (0.0%)       0/5 (0.0%)
11   0/17 (0.0%)       0/12 (0.0%)
12   0/21 (0.0%)       0/3 (0.0%)
41
Comparison of SPs: correlation
[Table: 12 × 12 cross-tabulation of the 682 gold-standard pairs, SP by hand (rows, 1 to 12) against SP by the proposed mechanical method (columns, 1 to 12); the individual cell counts did not survive extraction.]
Row totals (SP by hand, 1 to 12): 221, 207, 99, 14, 29, 13, 20, 13, 12, 16, 17, 21
Column totals (SP by proposed mechanical method, 1 to 12): 286, 148, 90, 26, 16, 34, 39, 15, 8, 5, 12, 3
Total: 682
Correlation: 0.76
42
Building Katakana variants
DB automatically
43
Summary of comparison and next?
COLING 2004: SP by hand + context similarity → extracted variants: accurate
SIGIR 2005: SP by mechanical method + context similarity → extracted variants: accurate?
Correlation between SP by hand and SP by mechanical method: 0.76
44
Variants DB
Newspaper corpus: … レポート … ラポート … リポート … サポート …
  ↓
Candidates of Katakana variants: (レポート, ラポート), (レポート, リポート), (レポート, サポート), …
  ↓
Candidates of Katakana variants: (レポート, ラポート), (レポート, リポート), …
  ↓
Katakana variants DB: (レポート, リポート), …
45
Variants DB
Newspaper corpus → (extract Katakana words) → candidates of Katakana variants → candidates of Katakana variants → Katakana variants DB
46
Variants DB
Newspaper corpus → (extract Katakana words) → candidates of Katakana variants → (SP ≤ 3) → candidates of Katakana variants → Katakana variants DB
47
Variants DB
Newspaper corpus → (extract Katakana words) → candidates of Katakana variants → (SP ≤ 3) → candidates of Katakana variants → (context similarity ≥ 0.05, optimized threshold) → Katakana variants DB
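The two filtering stages of the diagram can be sketched in Python as follows. The SP and similarity values attached to the three example pairs are made up for illustration, the function name `build_variant_db` is hypothetical, and the default similarity threshold of 0.05 follows the value stated on the comparison slide.

```python
def build_variant_db(candidates, sp_max=3, sim_min=0.05):
    """Keep pairs with SP <= sp_max and context similarity >= sim_min.

    candidates: iterable of (word_a, word_b, sp, similarity) tuples.
    Returns a symmetric mapping from each word to its set of variants.
    """
    db = {}
    for a, b, sp, sim in candidates:
        if sp <= sp_max and sim >= sim_min:
            db.setdefault(a, set()).add(b)
            db.setdefault(b, set()).add(a)
    return db


# Candidate pairs from the diagram, with HYPOTHETICAL SP and similarity values.
candidates = [
    ("レポート", "リポート", 1, 0.30),  # passes both filters
    ("レポート", "ラポート", 2, 0.01),  # passes SP, fails context similarity
    ("レポート", "サポート", 5, 0.20),  # fails the SP filter
]

db = build_variant_db(candidates)
print(db)
```

As in the diagram, only (レポート, リポート) reaches the variants DB.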
48
Comparison of variants DB
SP ≦ 3, context similarity ≧ 0.05
            SP by hand (expert)   SP by the proposed mechanical method
recall      417/420 (99.3%)       415/420 (98.8%)
precision   417/480 (86.9%)       415/480 (86.5%)
F-value     92.7%                 92.2%
cf. The whole DB contains 3 million Katakana
variants for 1 million distinct Katakana words.
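The recall, precision and F-value figures follow directly from the counts in the table. A short Python check is given below; the counts are taken from the slide, reading 420 as the number of correct variants and 480 as the number of extracted pairs, and the function name `prf` is illustrative.

```python
def prf(correct, gold, extracted):
    """Recall, precision and F-value (harmonic mean) in percent."""
    recall = correct / gold
    precision = correct / extracted
    f = 2 * precision * recall / (precision + recall)
    return tuple(round(100 * value, 1) for value in (recall, precision, f))


by_hand = prf(correct=417, gold=420, extracted=480)
mechanical = prf(correct=415, gold=420, extracted=480)
print("SP by hand:      ", by_hand)
print("SP by mechanical:", mechanical)
```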
49
Conclusions
• Mechanical method of calculating SP
  • Using a Web search engine to extract variant candidates
  • SP by character context
  • Almost the same accuracy as SP given by hand by an expert
• Katakana variants DB with SP by the mechanical method
  • recall: 98.8%
  • precision: 86.5%
  • F-value: 92.2%
50
Future of our research
• Other languages, such as German
  • Arbeit -- アルバイト
• Application of our methodology (Web resources + statistical string penalty) to other language pairs
  • Londres ⇔ London
  • München ⇔ Munich
• Our hope is a cross-language automatic spelling variant generator for any language pair, based on the proposed method.
51
Thank you!
サンキュー(sankyuh)
サンキュウ(sankyuu)
Questions or comments are welcome.
52
Error analysis
• grizzly bear
  • グリーズリーベア (gurihzurihbea) vs. グリーズリー・ベア (gurihzurih・bea) are not regarded as variants.
  • One appears in contexts about the animal, the other in contexts about Norman Schwarzkopf: totally different contexts!
• sign pole (barber shop) vs. sign ball (baseball)
  • サインポール (sainpohru) vs. サインボール (sainbohru) are regarded as variants.
  • Both appear with customer, shop, sales (very similar contexts).
53
The threshold of SP vs. F-value
[Figure: F-value (%) on the y-axis (0 to 100) against the threshold of SP on the x-axis (1 to 12)]
54
cosine similarity vs. F-value
[Figure: F-value (%) on the y-axis (25 to 85) against the threshold on the x-axis (0 to 0.4, labelled “Threshold of SP” on the slide)]
55
If you search for a Katakana variant with Google, …
In the case of spaghetti
Katakana variant: found or not
スパゲッティ (supagettuthi): ○
スパゲッティー (supagettuthii): ○
スパゲッテイ (supagettutei): ×
スパゲティ (supagetuthi): ○
スパゲティー (supagethii): ○
スパゲテイ (supagetei): ×
56
How to find candidates of Katakana
• How to extract documents in which context similarity is calculated
  • Google search with a query of a Katakana word that is a candidate Katakana variant.
  • Extract the context of the Katakana variant from the search result pages and calculate context similarity to identify Katakana variants.
57
Example of similarity calculation
(ウォッカ ’Uottuka’、ウォトカ ’Uotoka’)
ウォッカ: liquor: 1.1, strong: 1.4, alcohol: 1.6, western liquor: 0.7, ・・・
ウォトカ: liquor: 0.7, strong: 0.7, alcohol: 3.4, western liquor: 1.4, ・・・
$$\cos(\mathrm{Uottuka}, \mathrm{Uotoka}) = \frac{1.1 \times 0.7 + 1.4 \times 0.7 + \cdots}{\sqrt{1.1^2 + 1.4^2 + 1.6^2 + \cdots}\ \sqrt{0.7^2 + 0.7^2 + 3.4^2 + \cdots}} = 0.00157$$
58