Document

Large Scale Crawling the
Web for Parallel Texts
Chikayama Taura lab.
M1 Dai Saito
2006/12/08
1
Parallel Texts
Parallel texts :
Translated pair of
multilingual texts
Parallel corpus : a set of parallel texts
English
One thing was certain,
--it was the black kitten's
that the WHITE kitten had had
fault entirely.
nothing to do with it.
日本語
一つ確実なのは、
――もうなにもかも、
白い子ネコはなんの関係も
黒い子ネコのせいだったのです。
なかったということ。
2006/12/08
2
Parallel Texts
Useful resource for
Statistical machine translation
Dictionary construction
But… existing corpora are small
Language
• English-French
2006/12/08
Genre
• Public Document
• Software Manual
Number
• Not enough
• Need human
resource
3
Parallel Texts from the Web
Crawling parallel texts from the Web
Very large number of texts exist
Varied languages are used
Low human resource
Problems
- How to detect parallel texts automatically
- Calculation cost : O(n2 ) : n  108~9
2006/12/08
4
Parallel Texts from the Web
Maybe
parallel
Web
Not
parallel
①
①
②
Not parallel
2006/12/08
Not parallel
②
Parallel
Texts
Parallel texts
5
Agenda
Introduction
Related work
Proposal
Detecting parallel texts
Large scale crawling
Experiment
Conclusion
2006/12/08
6
STRAND [Resnik et al. 03]
URL Matching
http://www.hostname.com/index.html.en
http://www.hostname.com/index.html.ja
1.
2.
3.
2006/12/08
Removing language-specific substrings[LSSs]
(Japanese : ja, jp, jpn, euc, sjis,…)
Matching LSSs-removed URLs
Making a detailed comparison
7
URL Matching Experiment
URL Matching for URLs of crawled pages
90,000,000URLs
English⇔Japanese
Seeing only URL
90,000,000 →4,000
japanese english
1833
/ja/ /en/
73
_ja _en
3
ja_ en_
488
.ja .en
1405
ja. en.
271
Total
4073 pairs
• Too strict?
• Useless pages are included
2006/12/08
japanese.php
english.php
index.html.ja
index.html.en
8
DOM Tree Alignment [Lei et al. 06]
Searching linked pages
“alt” tag
link name
“English version”
“In English” etc…
HTML→DOM Tree
link
link
Parallel link: a pair of the same hyperlinks
in parallel texts
2006/12/08
9
Pros and Cons
URL Matching
○ High speed and Easy to implement
× Small number of pages
DOM Tree
○ High accuracy and Small storage
× Execution speed is slow
2006/12/08
10
Agenda
Introduction
Related work
Proposal
Detecting parallel texts
Large scale crawling
Experiment
Conclusion
2006/12/08
11
Detecting Parallel Texts
[Fukushima 06]
Reducing comparison cost
without HTML Information
word(noun)→semantic ID→comparison
2006/12/08
12
Semantic ID Conversion
Constructing a graph from dictionaries
Sense
Treating Japanese
and English texts
on same level
# of Semantic ID:
about 10,000
2006/12/08
Movie
1
感覚
意味
2
Film
映画
Hobby
趣味
Taste
味
3
13
Texts to Vector
テキスト 955
辞書を使ってテキストを数列に変える。
…
辞書
1704
1704
…
数列
3173
955
3173
sort
(955, 1704, 3173)
+position information
2006/12/08
14
Comparison
tscore (translation score)
T1:(106, 335, 455, 567, 1704, 3173, 7421)
T2:(335, 567, 567, 1704, 4014, 5449, 7421)
score= 0
2
1
tscore  score#T1#T 2
O(#T1#T 2)
2006/12/08
15
tscore threshold
Fry Corpus[05 Fry]
F-measure
presicion recall
2
presicion recall
tscore threshold
0.102
Speed
250,000 pairs/sec
2006/12/08
16
Agenda
Introduction
Related work
Proposal
Detecting parallel texts
Large scale crawling
Experiment
Conclusion
2006/12/08
17
Large Scale Crawling
Calculation cost of each comparison
Calculation cost of entire crawling
Number of comparisons: O(n2 )
• URL matching is too strict
• Alt tag or link name are not applied for all parallel
pages
2006/12/08
18
HTML on the Web to Natural Language
Guess language
English, SJIS, EUC-JP, UTF-8
Convert character code
Remove HTML Tag
For crawling, <a> or <link> tag are used
<title>, <Hn> tag may be useful
2006/12/08
19
Calculation Cost Reduction
Distance score of vectors
Distance score of two texts is far, then,
they are not parallel texts.
Compare only near vectors
distance score : tscore
Set a label of the nearest sample text
for all texts
2006/12/08
20
Calculation Cost Reduction
Flow
1.
2.
3.
4.
2006/12/08
Select sample texts (<<n)
When crawling, calculate distance score
with sample texts
Classify top m score
Compare only for texts in the same group
21
Sampling
Number of sample
Accuracy (risk of miss labeling)
Calculation cost
Size of the group
should be equal
Large group are divided into small recursively
2006/12/08
22
Crawling link pages
Same links from parallel texts will be parallel texts
Evaluation of same links
DOM Tree [Lei et al. 06]
Evaluate function
• Position of <A> tag
• Pages in same host
• Diff of URLs
– hoge.html.en -> fuga.html.en : hoge - fuga
– hoge.html.ja -> fuga.html.ja : hoge – fuga
2006/12/08
23
Agenda
Introduction
Related work
Proposal
Detecting parallel texts
Large scale crawling
Experiment
Conclusion
2006/12/08
24
Evaluation of tscore
Fry Corpus [Fry 05]
200(japanese) x 200(english)
Flow
1.
2.
3.
Convert all texts to vector
Calculate distance score for all pairs(40000)
Check scores of real parallel texts are high
Score of parallel texts should be top
2006/12/08
25
(1,1,1,2,4,4,…)
Evaluation of tscore
(3,1,0,2,…)
Other distance score
NOT XOR
AND
(3,1,0,2,0)
(3,1,0,2,0)
2
EUCLID
2
(
x

y
)
 i i
AND - XOR
(3,0,0,1,2)
2006/12/08
3
(3,0,0,1,2)
(3,0,0,1,2)
(3,1,0,2,0)
sparse
0
COS
x y x y
26
Evaluation of tscore
Number of miss score ([200+200]texts)
2006/12/08
Not Top
Not in Top3
AND
32
11
NOT XOR
347
188
AND – XOR
64
9
EUCLID
274
191
COS
93
32
TSCORE
2
2
27
Calculation Time
Fry Corpus
NORMAL
n^2
sampling
50
execution time [sec]
200, 400, 800,
1600, 3200
60
40
30
20
10
tscore(Top3)
0
0
1000
2000
3000
4000
# of pairs
# of samples : √(# of All)
Miss labeling : 11 (in 200 pairs)
2006/12/08
28
Agenda
Introduction
Related work
Proposal
Detecting parallel texts
Large scale crawling
Experiment
Conclusion
2006/12/08
29
Conclusion and Future work
Parallel texts from the Web
Detecting parallel texts
Large scale crawling
Future work
Crawling many texts from the Web
• Crawling with parallel link structure
• Detecting parallel in real HTML texts
• Proper sampling
2006/12/08
30
Thank you for your attention!
2006/12/08
31