Web-based Acquisition of Japanese Katakana Variants
Hiroshi Nakagawa (University of Tokyo, Japan)
Takeshi Masuyama (University of Tokyo, Japan; now Yahoo! Japan)

• We are very sorry for the Katakana font printing problem in the proceedings. We could not check the final printing.
• Please read the English transliterations of the Katakana parts that were printed as strings like %-%c…

In cooperation with Satoshi Sekine (Computer Science, New York University, and Language Craft Co.)

Katakana word variants
mew, mew
• ニャア, ニャア (nyaa, nyaa)
• ニャアー, ニャアー (nyah, nyah)
The way sounds are mapped to spellings differs from language to language.

History of Katakana
Every country and every language has its own history of meanings, codes and fonts.
Phonogram vs. ideogram

漢字 (Kanji)
Kanji characters (ideograms) were imported to Japan 1300 years ago.

Almost 1000 years ago, women writers (e.g. 紫式部, Murasaki Shikibu) worked out phonograms (Hiragana and Katakana) from Kanji (ideograms) to express the Japanese people's mentality.
Example: 世 (Kanji, ideogram) → せ (Hiragana, phonogram), セ (Katakana, phonogram)

Modern history of Katakana
• Japanese Katakana and Hiragana have a one-to-one mapping.
• After the Meiji Restoration (1868), Japanese people used Katakana to express function words and Hiragana to express words imported from Western countries.
• After World War II (1945), the two roles were exchanged: Hiragana came to be used for function words such as case markers, and Katakana for words imported from Western countries.
• Thus the majority of Katakana words are transliterations of English words.

However, Japanese Katakana has only five vowels (a, i, u, e, o) and 19 consonants (k, g, s, z, j, t, d, n, h, b, m, y, r, w, c, sh, ch, ny, my).
• Pronunciations are always C+V or V. There is no C+C.
• There is no distinction between (b, v), (h, f), (l, r), …
• There is no single orthographic way to express English sounds with the Katakana character set.
Thus the Japanese language has accepted several Katakana spellings for one English word.

Katakana variants
One English word is transliterated into several Katakana variants.
• detail: ディテール (dhiteeru), ディティール (dhithiiru), ディテェール (dhitheeru), ディテイル (dhiteiru)
• Cameron Diaz: キャメロン・ディアス (kyameronn・dhiasu), キャメロン・ディアズ (kyameronn・dhiazu), キャメロンディアス (kyameronndhiasu)

An example of search results: hits for “spaghetti” with Google
The + and - search options were used to avoid overlap between distinct Katakana variants.
Katakana variant | Google hits | Share
スパゲッティ (supagettuthi) | 187,000 | 32.7%
スパゲッティー (supagettuthii) | 57,600 | 10.1%
スパゲッテイ (supagettutei) | 6,850 | 1.2%
スパゲティ (supagetuthi) | 240,000 | 41.9%
スパゲティー (supagethii) | 77,400 | 13.5%
スパゲテイ (supagetei) | 3,800 | 0.7%
Total | 572,650 | 100%

A Katakana variants extraction system is needed to enhance the cross-language ability of:
• Information retrieval
• Search engines
• Machine translation
• Information extraction
• Summarization
• Question answering

Previous research 1: manually constructed rewriting rules
Rewriting rules generate and/or extract Katakana variants from a given Katakana word (Shishibori et al., 1993, 1994; Kubota, 1994).
Samples of rewrite rules: ベ (Be) ⇔ ヴェ (Ve), チ (chi) ⇔ ツィ (thi)
Input: ベネチア (Benechia)
Output: ベネツィア (Benethia), ヴェネチア (Venechia), ヴェネツィア (Venethia)

Previous research 2: extract Katakana variants with weighted edit distance (Magari et al., 2004; Ohtake et al., 2004)
• Edit distance is defined as the number of operations needed to transform one Katakana word into another Katakana word.
• Operations: insert, delete, replace.
• Ex.: レポート (Repooto) → リポート (Ripooto): edit distance = 1
• Weighted edit distance: the weight of each operation is given manually.
• Ex.: weighted edit distance (レポート → リポート) = 0.8

Previous research 3: a more direct way
String penalty (SP) to extract Katakana variants (Masuyama et al., 2004)
• Based on weighted edit distance, but extended to treat strings of two or three characters.
• Weights are given manually to combinations of edit operations, i.e. string-replacing operations.
• Ex.: SP(ボイス, ヴォイス) = 4 (replace and insert), i.e. Boisu vs. Voisu

Previous research 4: combination method (Masuyama, Nakagawa and Sekine, COLING 2004)
Combination of string penalty and context
• String penalty (SP): the SP value is given by an expert.
• Similarity of the contexts in which each Katakana variant appears: vector space model (calculated automatically).
• If the words around two Katakana words are similar, then the Katakana words are variants of each other.

Problems of previous research
• Low coverage.
• Intensive intellectual work by humans is needed for:
  • working out rewrite rules;
  • determining the weights of weighted edit distance;
  • determining the string penalty of each pair of Katakana strings.
• Dependence on the specific corpus that is used to calculate:
  • the weights of weighted edit distance;
  • the string penalty.

Purpose of this work
The problems of a manually given string penalty:
• Labor intensive (even in the combination of SP and context)
• Low coverage
Our goals: determine the string penalty mechanically, and automatically build Katakana variants for each Katakana word.

Calculating string penalties mechanically
For this, we need an accurate and high-quality Katakana variants database!

Process
1. English words and their Katakana variants: (idea, アイデア), (report, レポート), …
2. Web search on the WWW: … レポート … report … リポート …
3. Pairs of variant candidates: (レポート 'repooto', リポート 'ripooto'), (レポート 'repooto', サポート 'sapooto'), …
4. Pairs of variants: (レポート 'repooto', リポート 'ripooto'), (レファレンス 'refarensu', リファレンス 'rifarensu'), (アーキテクト 'aakitekuto', アーキテクツ 'aakitekutu'), …
5. String penalty: レ 're' ⇔ リ 'ri': 1; ト 'to' ⇔ ッ 'ttu': 3; …

How to find candidates of Katakana variant pairs (1/3)
1.
To collect English words and their Katakana variants, e.g. (vodka, ウォッカ), we used four Web sites that list a number of English words and their Japanese translations:
   http://homepage2.nifty.com/katakanaEnglish/
   http://www.hoshi.cis.ibaraki.ac.jp/usefull/usefull15.html
   http://ke.ics.saitama-u.ac.jp/jsgs/keywords.html
   http://smalltown.ne.jp/~uasa/pub/distfiles/skk-extra200307/SKK-JISYO.edit
   This gave 14,958 distinct pairs of English words and their Katakana translations.

How to find candidates of Katakana variant pairs (2/3)
1. Extract many pairs of an English word and its Katakana variant: 14,958 English-Katakana pairs.
2. To collect more Katakana variants for each English word, use Google search to get pages that include both the English word and a Katakana word of its translation. Two queries are used:
   • “English word” with language = Japanese
   • “English word” + 「英和」 (“English to Japanese”), in order to find English-Japanese dictionary sites
3. Gather Katakana words from the search results.

[Screenshot: Google search with the English word “vodka” among pages written in Japanese]
[Screenshot: Google search after adding 「英和」 (English-Japanese) to the query “vodka”]

Process (current step): Web search with the query “英和 (e-j) report”, then keep the candidate pairs whose edit distance = 1.

How to find candidates of Katakana variant pairs (3/3)
4. Extract, as promising candidates of Katakana variants, the Katakana word pairs whose edit distance = 1.
   Ex. (vodka, ウォッカ):
   • (ウォッカ 'Uottuka', ウォトカ 'Uotoka')
   • (ウォッカ 'Uottuka', ウオッカ 'UOttuka')
   • (ウォッカ 'Uottuka', ヴォッカ 'Vuottuka')

Process (next step): from the candidate pairs with edit distance = 1, keep the pairs whose cosine similarity is > 0.00006
by context; the string penalty is then calculated from the pairs that remain.

How to extract the documents in which context similarity is calculated
• Run a Google search with a query consisting of a Katakana word that is a candidate Katakana variant.
• Extract the context of the Katakana variant from the search result pages.

Search “vodka” with Google
The query +ウォッカ ('+vodka') retrieves all pages that include ウォッカ. The contexts of the Katakana word are taken from these pages.

Context similarity of a candidate pair
1. Calculate the context similarity of a candidate Katakana variant pair.
   • Context of vodka (Vuottka): “drink vodka with a main dish and plate of caviar in the restaurants”
   • Context of vodka (Uotoka): “eat some main dish plate after vodka in that restaurants”
   • The cosine similarity between the two contexts is calculated. The 50 words around a candidate Katakana variant are used as its context.
2. Identify and extract the pair as Katakana variants if the cosine similarity is greater than the threshold of 0.00006.

Detail of context similarity calculation
• Context = 50 words around the Katakana word
• Weight of word t in a context: log(freq(t) + 1)
• Context similarity = cosine
• Selection from candidates: cosine similarity ≥ 0.00006 (threshold)
• Threshold optimization: the threshold is the argmax of the F-value on positive pairs (347 pairs) and negative pairs (111 pairs).

[Figure: Results of context similarity vs. cosine threshold. Y-axis: F-value (%); x-axis: threshold of cosine similarity.]

Process (status so far): Web search with the query “英和 (e-j) report”, then candidate pairs with edit distance = 1,
then pairs of variants with cosine similarity > 0.00006. The next thing to do is to calculate SP based on statistics over these pairs, e.g. レ 're' ⇔ リ 'ri': SP = 1; ト 'to' ⇔ ッ 'ttu': SP = 3; …

2nd stage: calculation of string penalty (SP)
String penalty of the operation x → y (x is replaced with y)
We focus on the high correlation between replaced strings and their character context, which is composed of several characters around the target string.
Example: (ウインブルドン, ウィンブルドン), (ウインドウズ, ウィンドウズ), (ウインク, ウィンク)
Replacing イ 'I' with ィ 'i' co-occurs with ウ 'U' before it and ン 'n' after it.

Character-level contexts CLC1..CLC5 used to calculate SP
x: target character
α, β, γ, δ: characters around x
CLC | String | Context around x
CLC1 | αβx | preceding two characters of x
CLC2 | βx | preceding one character of x
CLC3 | xγ | succeeding one character of x
CLC4 | xγδ | succeeding two characters of x
CLC5 | βxγ | preceding and succeeding characters of x

Calculation of string penalty (SP)
$$P(x \to y \mid CLC_i) = \frac{f(CLC_i,\, x \to y) + 1}{f(CLC_i) + 2}, \qquad i = 1, 2, 3, 4, 5$$
• f(CLC_i) = frequency of pairs in which CLC_i occurs
• f(CLC_i, x → y) = frequency of pairs in which both CLC_i and x → y occur

Identify the character context that most probably co-occurs with the operation x → y:
$$\widehat{CLC} = \arg\max_{CLC_i\ (i = 1, \dots, 5)} P(x \to y \mid CLC_i)$$
Then
$$SP(x \to y) = \frac{1}{P(x \to y \mid \widehat{CLC})}$$
Rank of occurrence ≈ C * (Prob.
of occurrence)^-1 (Zipf's law)

Examples of string penalties
Operation | SP | Example
Insertion and deletion of '・' | 1 | ラストシーン, ラスト・シーン
Insertion and deletion of macron 'ー' | 1 | エネルギー, エネルギ
Replace オ 'O' and ォ 'o' | 1 | ウオッカ, ウォッカ
Replace グ 'gu' and ク 'ku' | 2 | バック, バッグ
Replace ヴ 'vu' and ブ 'bu' | 2 | ジュネーヴ, ジュネーブ
Replace ヴ 'vu' and ウ 'U' | 3 | ヴォッカ, ウォッカ

Comparison of SP by hand and SP by the proposed method
• SP by hand: proposed by Masuyama et al. (2004). An expert worked out the SPs by hand.
• Gold standard Katakana variants: 682 pairs of Katakana variant candidates that were extracted from a newspaper corpus and whose string penalties are between 1 and 12.
• We found no correct variants whose SPs by hand are between 10 and 12. Thus, the above gold standard probably covers all correct variants.

Comparison of SPs (correct variants / candidate pairs with that SP)
SP | SP by hand | SP by proposed mechanical method
1 | 216/221 (97.7%) | 262/286 (91.6%)
2 | 162/207 (78.3%) | 133/148 (89.9%)
3 | 70/99 (70.7%) | 51/90 (56.7%)
4 | 2/14 (14.3%) | 2/26 (7.7%)
5 | 0/29 (0.0%) | 0/16 (0.0%)
6 | 0/13 (0.0%) | 2/34 (5.9%)
7 | 1/20 (5.0%) | 1/39 (2.6%)
8 | 0/13 (0.0%) | 1/15 (6.7%)
9 | 1/12 (8.3%) | 0/8 (0.0%)
10 | 0/16 (0.0%) | 0/5 (0.0%)
11 | 0/17 (0.0%) | 0/12 (0.0%)
12 | 0/21 (0.0%) | 0/3 (0.0%)

Comparison of SPs: correlation
[Table: cross-tabulation of SP by hand (rows, 1 to 12) against SP by the proposed mechanical method (columns, 1 to 12) for the 682 pairs. Row totals: 221, 207, 99, 14, 29, 13, 20, 13, 12, 16, 17, 21. Column totals: 286, 148, 90, 26, 16, 34, 39, 15, 8, 5, 12, 3.]
Correlation: 0.76

Building Katakana variants DB automatically

Summary of comparison and next?
• COLING 2004: SP by hand + context similarity → extracted variants: accurate
• SIGIR 2005: SP by mechanical method + context similarity → extracted variants: accurate?
• Correlation between SP by hand and SP by the mechanical method: 0.76
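
As a worked check on the comparison of SPs, the per-SP counts can be turned into an F-value for each SP threshold. This is only a sketch under stated assumptions: each table cell is read as (correct variants / candidate pairs), a threshold k keeps every pair with SP ≤ k, and recall is measured against all correct pairs among the 682 gold-standard pairs. The names BY_HAND, MECHANICAL, f_values and best_threshold are illustrative and do not come from the slides.

```python
# (SP, correct variants, candidate pairs) per SP value, copied from the
# "Comparison of SPs" table.
BY_HAND = [(1, 216, 221), (2, 162, 207), (3, 70, 99), (4, 2, 14),
           (5, 0, 29), (6, 0, 13), (7, 1, 20), (8, 0, 13),
           (9, 1, 12), (10, 0, 16), (11, 0, 17), (12, 0, 21)]
MECHANICAL = [(1, 262, 286), (2, 133, 148), (3, 51, 90), (4, 2, 26),
              (5, 0, 16), (6, 2, 34), (7, 1, 39), (8, 1, 15),
              (9, 0, 8), (10, 0, 5), (11, 0, 12), (12, 0, 3)]


def f_values(rows):
    """Return {threshold k: F-value when all pairs with SP <= k are kept}."""
    # Assumption: recall is measured against every correct pair in the table.
    all_correct = sum(correct for _, correct, _ in rows)
    scores = {}
    kept_correct = 0
    kept_pairs = 0
    for sp, correct, pairs in rows:
        kept_correct += correct
        kept_pairs += pairs
        precision = kept_correct / kept_pairs
        recall = kept_correct / all_correct
        if precision + recall == 0:
            scores[sp] = 0.0
        else:
            scores[sp] = 2 * precision * recall / (precision + recall)
    return scores


def best_threshold(rows):
    """Return the SP threshold with the highest F-value."""
    scores = f_values(rows)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    for name, rows in (("by hand", BY_HAND), ("mechanical", MECHANICAL)):
        k = best_threshold(rows)
        print(f"{name}: best SP threshold = {k}, F = {f_values(rows)[k]:.3f}")
```

Under these assumptions both columns reach their highest F-value at SP ≤ 3, which agrees with the SP ≤ 3 cut-off used when building the variants DB.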

Variants DB
1. Newspaper corpus: … レポート … ラポート … リポート … サポート …
2. Extract Katakana words → candidates of Katakana variants: (レポート, ラポート), (レポート, リポート), (レポート, サポート), …
3. Keep pairs with SP ≤ 3 → candidates of Katakana variants: (レポート, ラポート), (レポート, リポート), …
4. Keep pairs with context similarity ≥ 0.005 (optimized threshold) → Katakana variants DB: (レポート, リポート), …

Comparison of variants DB (SP ≦ 3, context similarity ≧ 0.05)
Measure | SP by hand (expert) | SP by the proposed mechanical method
Recall | 417/420 (99.3%) | 415/420 (98.8%)
Precision | 417/480 (86.9%) | 415/480 (86.5%)
F-value | 92.7% | 92.2%
cf. The whole DB contains 3 million Katakana variants for 1 million distinct Katakana words.

Conclusions
• A mechanical method of calculating SP:
  • uses a Web search engine to extract variant candidates;
  • calculates SP from character context;
  • gives almost the same accuracy as SP worked out by hand by an expert.
• Katakana variants DB built with SP by the mechanical method: recall 98.8%, precision 86.5%, F-value 92.2%.

Future of our research
• Other languages such as German: Arbeit → アルバイト
• Application of our methodology (Web resources + statistical string penalty) to other language pairs, e.g. Londres ↔ London, München ↔ Munich
• Our hope is a cross-language automatic spelling-variant generator for any language pair, based on the proposed method.

Thank you!
サンキュー (sankyuh), サンキュウ (sankyuu)
Questions or comments are welcome.

Error analysis
• grizzly bear: グリーズリーベア (gurihzurihbea) vs. グリーズリー・ベア (gurihzurih・bea) are not regarded as variants. One appears in contexts about the animal, the other in contexts about Norman Schwarzkopf: totally different contexts!
• sign pole vs. sign ball: サインポール (sainpohru) vs. サインボール (sainbohru) are regarded as variants. One belongs to barber shops and the other to baseball, but both appear with words such as customer, shop and sales (very similar contexts).

[Figure: The threshold of SP vs. F-value. Y-axis: F-value (%), 0 to 100; x-axis: the threshold of SP, 1 to 12.]

[Figure: Cosine similarity vs. F-value. Y-axis: F-value (%), 25 to 85; x-axis: threshold, 0 to 0.4.]

If you search for a Katakana variant with Google… (the case of spaghetti)
Katakana variant | Found or not
スパゲッティ (supagettuthi) | ○
スパゲッティー (supagettuthii) | ○
スパゲッテイ (supagettutei) | ×
スパゲティ (supagetuthi) | ○
スパゲティー (supagethii) | ○
スパゲテイ (supagetei) | ×

How to extract the documents in which context similarity is calculated
• Run a Google search with a query consisting of a Katakana word that is a candidate Katakana variant.
• Extract the context of the Katakana variant from the search result pages and calculate the context similarity to identify Katakana variants.

Example of similarity calculation: (ウォッカ 'Uottuka', ウォトカ 'Uotoka')
• ウォッカ: liquor: 1.1, strong: 1.4, alcohol: 1.6, western liquor: 0.7, …
• ウォトカ: liquor: 0.7, strong: 0.7, alcohol: 3.4, western liquor: 1.4, …
$$\cos(\text{Uottuka}, \text{Uotoka}) = \frac{1.1 \times 0.7 + 1.4 \times 0.7 + \cdots}{\sqrt{1.1^2 + 1.4^2 + 1.6^2 + \cdots}\ \sqrt{0.7^2 + 0.7^2 + 3.4^2 + \cdots}} = 0.00157$$
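
The context-similarity calculation can be sketched as follows. The weight log(freq(t) + 1), the cosine measure and the 0.00006 threshold come from the slides. The whitespace tokenization, the reading of "50 words around" as up to 50 words on each side of the target, and the function names context_vector and cosine are assumptions made for illustration only.

```python
import math
from collections import Counter


def context_vector(tokens, target, window=50):
    """Build a weighted context vector for `target`.

    Every word within `window` positions of an occurrence of `target`
    is counted, and word t gets the weight log(freq(t) + 1).
    Assumption: the window is applied on each side of the target.
    """
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == target:
            start = max(0, i - window)
            counts.update(tokens[start:i])
            counts.update(tokens[i + 1:i + 1 + window])
    return {word: math.log(freq + 1) for word, freq in counts.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[word] * v[word] for word in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Toy contexts based on the English glosses used in the slides.
    page_a = "drink ウォッカ with a main dish and plate of caviar in the restaurants".split()
    page_b = "eat some main dish plate after ウォトカ in that restaurants".split()
    sim = cosine(context_vector(page_a, "ウォッカ"),
                 context_vector(page_b, "ウォトカ"))
    THRESHOLD = 0.00006  # threshold quoted in the slides
    print(sim, sim >= THRESHOLD)
```

Real contexts gathered from search result pages are far sparser than this toy pair, which is consistent with the very small similarity values and threshold reported in the slides.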