Large Scale Crawling the Web for Parallel Texts Chikayama Taura lab. M1 Dai Saito 2006/12/08 1 Parallel Texts Parallel texts : Translated pair of multilingual texts Parallel corpus : a set of parallel texts English One thing was certain, --it was the black kitten's that the WHITE kitten had had fault entirely. nothing to do with it. 日本語 一つ確実なのは、 ――もうなにもかも、 白い子ネコはなんの関係も 黒い子ネコのせいだったのです。 なかったということ。 2006/12/08 2 Parallel Texts Useful resource for Statistical machine translation Dictionary construction But… existing corpora are small Language • English-French 2006/12/08 Genre • Public Document • Software Manual Number • Not enough • Need human resource 3 Parallel Texts from the Web Crawling parallel texts from the Web Very large number of texts exist Varied languages are used Low human resource Problems - How to detect parallel texts automatically - Calculation cost : O(n2 ) : n 108~9 2006/12/08 4 Parallel Texts from the Web Maybe parallel Web Not parallel ① ① ② Not parallel 2006/12/08 Not parallel ② Parallel Texts Parallel texts 5 Agenda Introduction Related work Proposal Detecting parallel texts Large scale crawling Experiment Conclusion 2006/12/08 6 STRAND [Resnik et al. 03] URL Matching http://www.hostname.com/index.html.en http://www.hostname.com/index.html.ja 1. 2. 3. 2006/12/08 Removing language-specific substrings[LSSs] (Japanese : ja, jp, jpn, euc, sjis,…) Matching LSSs-removed URLs Making a detailed comparison 7 URL Matching Experiment URL Matching for URLs of crawled pages 90,000,000URLs English⇔Japanese Seeing only URL 90,000,000 →4,000 japanese english 1833 /ja/ /en/ 73 _ja _en 3 ja_ en_ 488 .ja .en 1405 ja. en. 271 Total 4073 pairs • Too strict? • Useless pages are included 2006/12/08 japanese.php english.php index.html.ja index.html.en 8 DOM Tree Alignment [Lei et al. 06] Searching linked pages “alt” tag link name “English version” “In English” etc… HTML→DOM Tree link link Parallel link: a pair of the same hyperlinks in parallel texts 2006/12/08 9 Pros and Cons URL Matching ○ High speed and Easy to implement × Small number of pages DOM Tree ○ High accuracy and Small storage × Execution speed is slow 2006/12/08 10 Agenda Introduction Related work Proposal Detecting parallel texts Large scale crawling Experiment Conclusion 2006/12/08 11 Detecting Parallel Texts [Fukushima 06] Reducing comparison cost without HTML Information word(noun)→semantic ID→comparison 2006/12/08 12 Semantic ID Conversion Constructing a graph from dictionaries Sense Treating Japanese and English texts on same level # of Semantic ID: about 10,000 2006/12/08 Movie 1 感覚 意味 2 Film 映画 Hobby 趣味 Taste 味 3 13 Texts to Vector テキスト 955 辞書を使ってテキストを数列に変える。 … 辞書 1704 1704 … 数列 3173 955 3173 sort (955, 1704, 3173) +position information 2006/12/08 14 Comparison tscore (translation score) T1:(106, 335, 455, 567, 1704, 3173, 7421) T2:(335, 567, 567, 1704, 4014, 5449, 7421) score= 0 2 1 tscore score#T1#T 2 O(#T1#T 2) 2006/12/08 15 tscore threshold Fry Corpus[05 Fry] F-measure presicion recall 2 presicion recall tscore threshold 0.102 Speed 250,000 pairs/sec 2006/12/08 16 Agenda Introduction Related work Proposal Detecting parallel texts Large scale crawling Experiment Conclusion 2006/12/08 17 Large Scale Crawling Calculation cost of each comparison Calculation cost of entire crawling Number of comparisons: O(n2 ) • URL matching is too strict • Alt tag or link name are not applied for all parallel pages 2006/12/08 18 HTML on the Web to Natural Language Guess language English, SJIS, EUC-JP, UTF-8 Convert character code Remove HTML Tag For crawling, <a> or <link> tag are used <title>, <Hn> tag may be useful 2006/12/08 19 Calculation Cost Reduction Distance score of vectors Distance score of two texts is far, then, they are not parallel texts. Compare only near vectors distance score : tscore Set a label of the nearest sample text for all texts 2006/12/08 20 Calculation Cost Reduction Flow 1. 2. 3. 4. 2006/12/08 Select sample texts (<<n) When crawling, calculate distance score with sample texts Classify top m score Compare only for texts in the same group 21 Sampling Number of sample Accuracy (risk of miss labeling) Calculation cost Size of the group should be equal Large group are divided into small recursively 2006/12/08 22 Crawling link pages Same links from parallel texts will be parallel texts Evaluation of same links DOM Tree [Lei et al. 06] Evaluate function • Position of <A> tag • Pages in same host • Diff of URLs – hoge.html.en -> fuga.html.en : hoge - fuga – hoge.html.ja -> fuga.html.ja : hoge – fuga 2006/12/08 23 Agenda Introduction Related work Proposal Detecting parallel texts Large scale crawling Experiment Conclusion 2006/12/08 24 Evaluation of tscore Fry Corpus [Fry 05] 200(japanese) x 200(english) Flow 1. 2. 3. Convert all texts to vector Calculate distance score for all pairs(40000) Check scores of real parallel texts are high Score of parallel texts should be top 2006/12/08 25 (1,1,1,2,4,4,…) Evaluation of tscore (3,1,0,2,…) Other distance score NOT XOR AND (3,1,0,2,0) (3,1,0,2,0) 2 EUCLID 2 ( x y ) i i AND - XOR (3,0,0,1,2) 2006/12/08 3 (3,0,0,1,2) (3,0,0,1,2) (3,1,0,2,0) sparse 0 COS x y x y 26 Evaluation of tscore Number of miss score ([200+200]texts) 2006/12/08 Not Top Not in Top3 AND 32 11 NOT XOR 347 188 AND – XOR 64 9 EUCLID 274 191 COS 93 32 TSCORE 2 2 27 Calculation Time Fry Corpus NORMAL n^2 sampling 50 execution time [sec] 200, 400, 800, 1600, 3200 60 40 30 20 10 tscore(Top3) 0 0 1000 2000 3000 4000 # of pairs # of samples : √(# of All) Miss labeling : 11 (in 200 pairs) 2006/12/08 28 Agenda Introduction Related work Proposal Detecting parallel texts Large scale crawling Experiment Conclusion 2006/12/08 29 Conclusion and Future work Parallel texts from the Web Detecting parallel texts Large scale crawling Future work Crawling many texts from the Web • Crawling with parallel link structure • Detecting parallel in real HTML texts • Proper sampling 2006/12/08 30 Thank you for your attention! 2006/12/08 31
© Copyright 2024 ExpyDoc