
Chapter 10
Cross-Language Information Retrieval
Hsin-Hsi Chen (陳信希)
Department of Computer Science and
Information Engineering
National Taiwan University
Outline
• Multilingual Environments
• What is Cross-Language Information Retrieval?
• Interdisciplinary Relationship in CLIR
• Major Problems in CLIR
• Major Approaches in CLIR
• Summary
Multilingual Collections
• There are 6,703 languages listed in the Ethnologue.
• Digital libraries
  – OCLC Online Computer Library Center serves more than 17,000 libraries in 52 countries and contains over 30 million bibliographic records, with over 500 million ownership records attached, in more than 370 languages.
• World Wide Web
  – Around 40% of Internet users do not speak English; however, 80% of Web sites are still in English.
Language Populations in the Real World
(http://www.g11n.com/faq.htm)
[Bar chart: native speakers in millions (scale 0-800) for Chinese, English, Hindi/Urdu, Spanish, Portuguese, Bengali, Russian, Arabic, and Japanese]
[Chart comparing language populations: Dutch, Portuguese, Italian, Korean, Spanish, Swedish, Chinese, French, German, Japanese]
(Statistics from Euro-Marketing Associates, 1998)
Chinese-speaking population share (6.1%) < French-speaking population share (8.8%) (1998)
(Statistics from Euro-Marketing Associates, 1999)
http://www.glreach.com/globstats/
Internet Population by Language
Internet Content
[Bar chart: Internet hosts in thousands, by language estimated by domain (Network Wizards Jan. 99 Internet Domain Survey); English leads with 33,878, far ahead of Japanese, German, French, Dutch, Finnish, Spanish, Chinese, and Swedish]
40% of Internet users do not understand English, but 80% of Internet content is in English.
(Additional statistics: http://www.emarketer.com)
What is Cross-Language Information Retrieval?
• Definition: select information in one language based on queries in another.
• Terminology
  – Cross-Language Information Retrieval (ACM SIGIR 96 Workshop on Cross-Linguistic Information Retrieval)
  – Translingual Information Retrieval (Defense Advanced Research Projects Agency, DARPA)
Generalization:
Multi- & Cross-Lingual Information Access
MLIR Applications
• Multilingual information access in a multilingual country, organization, enterprise, etc.
• Cross-language information retrieval for users who read a second language (large passive vocabulary) but are not able to formulate good queries (small active vocabulary).
• Monolingual users may retrieve images by taking advantage of multilingual captions.
• Monolingual users may retrieve documents and have them translated (automatically or manually) into their own language.
Why is Cross-Language Information Retrieval Important?
• More information workers with less time require fast access to global resources
• Global B2B interactions (virtual enterprises)
• Global B2C interactions (online trading, travelling)
• Time-critical information (translation comes too late)
History
• 1970: Salton runs retrieval experiments with a small English/German dictionary
• 1972: Pevzner shows for English and Russian that a controlled thesaurus can be used effectively for query term translation
• 1978: ISO Standard 5964 for developing multilingual thesauri (revised in 1985)
• 1990: Latent Semantic Indexing (LSI) applied to CLIR
History (Continued)
• 1994: first PhD thesis on CLIR, by Khaled Radwan
• 1996: similarity thesaurus applied to CLIR (ETH Zurich)
• 1996: dictionary-based retrieval applied to CLIR (UMass & XEROX Grenoble)
• 1997: Generalized Vector Space Model (GVSM) applied to CLIR (CMU)
History (Continued)
• 1997: CLIR (Cross-Language Information Retrieval) track starts within TREC
• 1998: NTCIR starts in Japan
• 1999: TIDES (Translingual Information Detection, Extraction, and Summarization) starts in the U.S.
• 2000: CLEF starts in Europe
An Architecture of Multilingual Information Access
[Diagram: multilingual resources in multiple languages pass through language identification (LI) into text processing components (information extraction, information filtering, information retrieval, text classification, text summarization), while language translation (query translation and document translation) mediates between these components and a user interface (UI) in the user's native language(s)]
An Architecture of Cross-Language Information Retrieval
Building Blocks for CLIR
[Diagram: CLIR draws on Information Retrieval, Information Science, Artificial Intelligence, Speech Recognition, and Computational Linguistics]
Information Science
• User interface
• Interactive search technique
• Thesaurus construction
• Evaluation
Computational Linguistics
• Language identification
• Morphological analysis
• Stylistic analysis
• Part-of-speech tagging
• Identifying occurrences of phrases
• Using parallel corpora
• Using comparable corpora
Computational Linguistics (Continued)
• Aligning documents
• Identifying occurrences of geographic and temporal concepts
• Stochastic language models
• Word disambiguation
• Lexicons (morphology, part-of-speech)
• Bilingual dictionaries (terms and possible translations)
Information Retrieval (w/o CL)
• Filtering
• Relevance feedback
• Document representation
• Latent semantic indexing
• Generalized vector space model
• Collection fusion
• Passage retrieval
Information Retrieval (Continued)
• Similarity thesaurus
• Local context analysis
• Automatic query expansion
• Fuzzy term matching
• Adapting retrieval methods to the collection
• Building cheap test collections
• Evaluation
Artificial Intelligence
• Machine translation
• Machine learning
• Template extraction and matching
• Building large knowledge bases
• Semantic networks
Speech Recognition
• Signal processing
• Pattern matching
• Phone lattices
• Background noise elimination
• Speech segmentation
• Modeling speech prosody
• Building test databases
• Evaluation
Building Blocks Dealing with Term Dependencies
• IS: ISO-Thesaurus
• CL: word disambiguation, bilingual dictionaries
• AI: semantic networks
• SR: stochastic language models
• IR: LSI, GVSM, similarity thesaurus, local context analysis, (weighted) Boolean filters
Major Problems of CLIR
• Queries and documents are in different languages.
  – translation
• Words in a query may be ambiguous.
  – disambiguation
• Queries are usually short.
  – expansion
Major Problems of CLIR (Continued)
• Queries may have to be segmented.
  – segmentation
• A document may be written in several languages.
  – language identification
Enhancing Traditional Information Retrieval Systems
• Which part(s) should be modified for CLIR?
[Diagram: documents (1) are mapped to a document representation (2), queries (3) are mapped to a query representation (4), and the two representations are compared]
Enhancing Traditional Information Retrieval Systems (Continued)
• (1): text translation
• (2): vector translation
• (3): query translation
• (4): term vector translation
• (1) and (2), (3) and (4): interlingual form
What are the Problems?
• Ambiguous terms (e.g., performance)
• Multiword phrases may correspond to single-word phrases (e.g., South Africa => 南非, Südafrika)
• Coverage of the vocabulary
• There is not a one-to-one mapping between two languages
• Translating queries automatically (lack of syntax)
• Translating documents automatically (performance, …)
• Computing mixed result lists
Cross-Language Information Retrieval
[Taxonomy of approaches]
• Query Translation
  – Controlled Vocabulary
  – Free Text
    • Knowledge-based: Ontology-based, Dictionary-based, Thesaurus-based
    • Corpus-based: Term-aligned, Sentence-aligned, Document-aligned (Parallel, Comparable), Unaligned
    • Hybrid
• Document Translation
  – Text Translation
  – Vector Translation
• No Translation
Query Translation Based CLIR
[Diagram: English query > translation device > Chinese query > monolingual Chinese retrieval system > retrieved Chinese documents]
Translating the 400 Million Non-English Pages of the WWW
• ... would take 100,000 days (about 300 years) on one fast PC, or one month on 3,600 PCs.
Controlled Vocabulary
• Sublanguage chosen by human indexers
• National Library of Medicine
  – Unified Medical Language System (UMLS)
  – Integrating medical coverage of many thesauri
    • English, French, German, Portuguese
Knowledge-Based
• Examples
  – Subject thesaurus
    • Hierarchical and associative relations.
    • Unique term assigned to each node.
  – Concept list
    • Term space partitioned into concept spaces.
  – Term list
    • List of cross-language synonyms.
  – Lexicon
    • Machine-readable syntax and/or semantics.
Ontology-Based Approaches
• Exploit complex knowledge representations, e.g., EuroWordNet
• A proposal for conceptual indexing using EuroWordNet
Ontology-Based Approaches (Continued)
• The indexing process
Dictionary-Based Approaches
• Exploit machine-readable dictionaries.
• Problems
  – translation ambiguity + target polysemy
  – coverage (unknown words, abbreviations, ...)
Dictionary-Based Approaches (Continued)
• Issue 1: selection strategy (a small sketch follows below)
  – Select all.
  – Select N randomly.
  – Select best N.
• Issue 2: which level
  – word
  – phrase
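The following minimal Python sketch contrasts the "select all" and "select best N" strategies. The tiny bilingual dictionary and target-corpus frequency table are hypothetical, invented only to make the two strategies concrete; a real system would plug in a full machine-readable dictionary and corpus statistics.

from collections import Counter

BILINGUAL_DICT = {            # hypothetical English-to-Chinese dictionary entries
    "mountain": ["山", "山岳"],
    "stream":   ["溪", "溪流", "河"],
}
TARGET_FREQ = Counter({"山": 120, "山岳": 15, "溪": 80, "溪流": 30, "河": 60})  # assumed corpus counts

def translate_query(terms, strategy="all", n=1):
    """Replace each source term by its dictionary translations.
    strategy 'all'  : keep every candidate translation
    strategy 'best' : keep the n candidates most frequent in a target corpus"""
    target_terms = []
    for term in terms:
        candidates = BILINGUAL_DICT.get(term.lower(), [])
        if not candidates:                     # out-of-vocabulary: pass the term through
            target_terms.append(term)
        elif strategy == "all":
            target_terms.extend(candidates)
        else:
            ranked = sorted(candidates, key=lambda c: -TARGET_FREQ[c])
            target_terms.extend(ranked[:n])
    return target_terms

print(translate_query(["mountain", "stream"], strategy="all"))
print(translate_query(["mountain", "stream"], strategy="best", n=1))

"Select all" maximizes recall at the cost of extraneous terms, which is exactly the ambiguity problem quantified in the next slides; "select best N" trades some coverage for precision.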
Selection Strategy: Select All
• Hull and Grefenstette 1996
  – Take the concatenation of all term translations.
    E: politically motivated civil disturbances
    F: troubles civils à caractère politique
    trouble: turmoil, discord, trouble, unrest, disturbance, disorder
    civil: civil, civilian, courteous
    caractère: character, nature
    politique: political, diplomatic, politician, policy
  – Original English (0.393) vs. automatic word-based transfer dictionary (0.235): 59.8% of monolingual performance.
  – Errors: multi-word expressions and ambiguity
Selection Strategy: Select All (Continued)
• Davis 1997 (TREC-5)
  – Replace each English query term with all of its Spanish equivalent terms from the Collins bilingual dictionary.
  – Monolingual (0.2895) vs. all-equivalent substitution (0.1422): 49.12% of monolingual performance.
Evaluation Method
• Average precision (5-, 9-, and 11-point)
• Model
[Diagram: three runs over the TREC Spanish corpus:
 (1) Spanish query > monolingual IR engine (baseline);
 (2) English query > bilingual dictionary > all Spanish equivalents > monolingual IR engine;
 (3) English query > POS tagging > bilingual dictionary > Spanish equivalents selected by POS > monolingual IR engine]
Selection Strategy: Select N
• Simple word-by-word translation
  – Each query term is replaced by the word or group of words given for the first sense of the term's definition.
    • 50-60% drop in performance (average precision)
Selection Strategy: Select N (Continued)
• Word/phrase translation
  – Take at most three translations of each word, one from each of the first three senses. Take the phrase translation if it appears in the dictionary.
    • 30-50% worse than a good translation
  – Well-translated phrases can greatly improve effectiveness, but poorly translated phrases may negate the improvements.
    • WBW (0.0244) vs. phrasal (0.0148, -39.3%) vs. good phrasal (0.0610, +150.3%)
Selection Strategy: Select Best N
• Hayashi, Kikui and Susaki 1997
  – Search for a dictionary entry corresponding to the longest sequence of words from left to right.
  – Choose the most frequently used word (or phrase) in a text corpus collected from the WWW.
  – No results reported for this query translation approach.
• Davis 1997 (TREC-5)
  – POS disambiguation
  – Monolingual (0.2895) vs. all-equivalent substitution (0.1422) vs. POS disambiguation (0.1949): nearly 67.3% of monolingual performance.
Corpus-Based Approaches
• Categorization
  – Term-aligned
  – Sentence-aligned
  – Document-aligned (parallel, comparable)
  – Unaligned
• Usage
  – Setup thesaurus
  – Vector mapping
Term-Aligned Corpora
• Fine-grained alignment in parallel corpora
• Oard 1996
  – Term alignment is a challenging problem.
[Diagram: co-occurrence statistics over a parallel corpus produce bilingual translation tables, which a machine translation system uses to turn the English query into a Spanish query]
Sentence-Aligned Corpora
• Davis & Dunning 1996 (TREC-4)
  – High-frequency terms
Sentence-Aligned Corpora (Continued)
  – Statistically significant terms
Sentence-Aligned Corpora (Continued)
• Precision-recall averages
Document-Aligned Corpora
• Exploit parallel or comparable corpora
• Parallel: linked translation equivalents
  – LSI mate retrieval achieves 99% effectiveness
• Comparable: separate authorship, same topic
  – Easier to find, harder to link the documents
Query Term Disambiguation
Comparable Document-Aligned Corpora
• Sheridan & Ballerini 1996 (a sketch of the thesaurus step follows below)
  – Create a comparable corpus: align news stories in German and Italian by topic label and date, and merge them to create pseudo-parallel documents.
  – Generate a co-occurrence thesaurus.
  – Perform translations using the thesaurus.
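A rough sketch of the co-occurrence-thesaurus idea, assuming toy pseudo-parallel German/Italian document pairs (invented here). It only illustrates how aligned documents can induce cross-language term associations; it does not reproduce the Sheridan & Ballerini weighting.

from collections import defaultdict

# Each pair: (German document terms, Italian document terms) on the same topic and date.
aligned_pairs = [
    (["wahl", "regierung"],  ["elezione", "governo"]),
    (["wahl", "partei"],     ["elezione", "partito"]),
    (["regierung", "krise"], ["governo", "crisi"]),
]

cooc = defaultdict(lambda: defaultdict(int))   # source term -> target term -> count
src_count = defaultdict(int)
for de_terms, it_terms in aligned_pairs:
    for s in set(de_terms):
        src_count[s] += 1
        for t in set(it_terms):
            cooc[s][t] += 1

def translate(term, k=2):
    """Return the k target-language terms that co-occur most strongly with `term`."""
    scored = {t: c / src_count[term] for t, c in cooc[term].items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]

print(translate("wahl"))    # e.g. ['elezione', ...] with this toy data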
Unaligned Corpora
• No document links
• Used in conjunction with dictionaries
  – Pre-translation local feedback (Ballesteros & Croft 1997)
Brief Summary
• Dictionary-based methods
  – Specialized vocabulary not in the dictionaries will not be translated.
  – Ambiguities will add extraneous terms to the query.
• Parallel/comparable corpus-based methods
  – Parallel corpora are not always available.
  – Available corpora tend to be relatively small or to cover only a small number of subjects.
  – Performance depends on how well the corpora are aligned.
Brief Summary (Continued)
• Dictionaries are very useful.
  – They achieve about 50% of monolingual performance on their own.
• Parallel corpora have limitations.
  – Domain shifts
  – Term alignment accuracy
• Dictionaries and corpora are complementary.
  – Dictionaries provide broad and shallow coverage.
  – Corpora provide narrow (domain-specific) but deep (more terminology) coverage of the language.
Hybrid Methods
• What knowledge can be employed?
  – lexical knowledge
  – corpus knowledge
  – ...
Hybrid Methods (Continued)
• Query expansion
  – Issue 1: context (a sketch of local feedback follows below)
    • Pseudo-relevance feedback (local feedback): a query is modified by the addition of terms found in the top retrieved documents.
    • Local context analysis: queries are expanded by the addition of the top-ranked concepts from the top passages.
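A minimal sketch of pseudo-relevance feedback. The search function and the toy documents are placeholders standing in for a real retrieval engine.

from collections import Counter

def search(query_terms, collection, k=3):
    """Rank documents by a simple term-overlap score (stand-in for a real IR engine)."""
    scored = [(sum(doc.count(t) for t in query_terms), doc) for doc in collection]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]

def pseudo_relevance_feedback(query_terms, collection, top_k=3, add_m=5):
    """Expand the query with the most frequent new terms from the top-k retrieved documents."""
    top_docs = search(query_terms, collection, k=top_k)
    counts = Counter(t for doc in top_docs for t in doc if t not in query_terms)
    expansion = [t for t, _ in counts.most_common(add_m)]
    return query_terms + expansion

docs = [["civil", "unrest", "protest", "police"],
        ["civil", "disturbance", "riot", "police"],
        ["weather", "forecast", "rain"]]
print(pseudo_relevance_feedback(["civil", "unrest"], docs))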
Hybrid Methods (Continued)
  – Issue 2: when
    • before query translation
    • after query translation
Pseudo-Relevance Feedback Illustrated
Query Expansion through Local Context Analysis
• Local analysis
  – Based on the set of documents retrieved for the original query
  – Based on term co-occurrence inside documents
  – Terms closest to individual query terms are selected
• Global analysis
  – Based on the whole document collection
  – Based on term co-occurrence inside small contexts and phrase structures
  – Terms closest to the whole query are selected
Query Expansion through Local Context Analysis (Continued)
• Candidates
  – noun groups instead of simple keywords
  – a single noun, two adjacent nouns, or three adjacent nouns
• Query expansion
  – Concepts are selected from the top-ranked documents (as in local analysis)
  – Passages are used for determining co-occurrence (as in global analysis)
Query Expansion through Local Context Analysis (Continued)
• Algorithm (a sketch follows below)
  – Retrieve the top n ranked passages using the original query.
  – For each concept in the top-ranked passages, compute the similarity sim(q, c) between the whole query q and the concept c using a variant of tf-idf ranking.
  – Add the top m ranked concepts to the original query q.
    • Each expansion concept is assigned a weight 1 - 0.9 * i / m (i: rank).
    • Each term in the original query is assigned a weight of 2 x its original weight.
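The toy Python sketch below mirrors the weighting scheme just described. The passage ranking, the "concepts" (single terms instead of noun groups), and the idf values are deliberately simplified assumptions, not the exact sim(q, c) formula.

from collections import Counter

def lca_expand(query, passages, idf, n=10, m=5):
    """Expand `query` with the m concepts that co-occur best with it in the top-n passages."""
    top = passages[:n]                          # assume passages are already ranked
    concepts = {t for p in top for t in p if t not in query}
    def sim(c):                                 # simplified tf-idf-flavoured co-occurrence score
        score = 1.0
        for q in query:
            co = sum(min(p.count(c), p.count(q)) for p in top)
            score *= (0.1 + co * idf.get(c, 1.0) * idf.get(q, 1.0))
        return score
    ranked = sorted(concepts, key=sim, reverse=True)[:m]
    weights = {q: 2.0 for q in query}           # original terms: 2 x original weight
    for i, c in enumerate(ranked, start=1):     # expansion concepts: 1 - 0.9 * i / m
        weights[c] = 1.0 - 0.9 * i / m
    return weights

passages = [["civil", "unrest", "protest"], ["civil", "riot", "police"]]
print(lca_expand(["civil"], passages, idf={"unrest": 2.0, "riot": 1.5}))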
Hybrid Methods (Continued)
• Ballesteros & Croft 1997
[Diagram: original Spanish TREC queries are human-translated into English queries (BASE); the English queries then undergo query expansion and/or automatic dictionary translation (expansion before translation, after translation, or both), and the resulting Spanish queries are run with INQUERY]
Hybrid Methods (Continued)
  – Performance evaluation
    • Pre-translation expansion: MRD (0.0823) vs. LF (0.1099, +33.5%) vs. LCA10 (0.1139, +38.5%)
    • Post-translation expansion: MRD (0.0823) vs. LF (0.0916, +11.3%) vs. LCA20 (0.1022, +24.1%)
    • Combined pre- and post-translation: MRD (0.0823) vs. LF (0.1242, +51.0%) vs. LCA20 (0.1358, +65.0%)
    • Still 32% below a monolingual baseline
Hybrid Methods (Continued)
• Davis 1997 (TREC-5)
[Diagram: the English query is expanded into Spanish equivalents through a bilingual dictionary (optionally filtered by POS); document vectors from the UN English/Spanish parallel corpus are compared to reduce the equivalent set, and the reduced set is run by a monolingual IR engine against the TREC Spanish corpus]
Hybrid Methods (Continued)
  – Corpus-based disambiguation vs. POS-based disambiguation
  – MONO (0.2895) vs. ALL (0.1422, 49.12%) vs. CORP (0.1153, 39.83%) vs. POS (0.1949, 67.32%) vs. BOTH (0.2127, 73.47% of monolingual)
Document Translation
• Translate the documents, not the query
[Diagram: documents pass through MT before document representation; queries are represented directly; the two representations are compared]
• Issues
  (1) Efficiency problem
  (2) Retrieval effectiveness??? (word order, stop words)
  (3) Cross-language mate finding using MT-LSI (Dumais et al., 1997)
Vector Translation
• Translate document vectors
[Diagram: documents are first converted to document representations, and the representation vectors, rather than the texts, are translated before comparison with the query representation]
No Translation
• Latent Semantic Indexing (Dumais et al., 1997)
No Translation (Continued)
• Cross-language retrieval using LSI (a sketch follows below)
• Resource: a document-aligned parallel corpus
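A toy illustration of the cross-language LSI idea with NumPy: dual-language training documents (an English text concatenated with its aligned Spanish translation) define the latent space, and monolingual queries and documents are then folded into that space. The six-word vocabulary and the counts are invented for this sketch.

import numpy as np

vocab = ["dog", "cat", "house", "perro", "gato", "casa"]   # English + Spanish terms
# Columns = aligned training document pairs (English text + its Spanish translation).
X = np.array([[2, 0, 1],    # dog
              [0, 2, 0],    # cat
              [1, 0, 2],    # house
              [2, 0, 1],    # perro
              [0, 2, 0],    # gato
              [1, 0, 2]],   # casa
             dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
Uk, sk = U[:, :k], s[:k]

def fold_in(term_counts):
    """Project a (possibly monolingual) term vector into the latent space."""
    return term_counts @ Uk / sk        # standard LSI fold-in: v = x^T U_k S_k^-1

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# English query "dog house" vs. a Spanish-only document about "perro casa".
q = fold_in(np.array([1, 0, 1, 0, 0, 0], dtype=float))
d = fold_in(np.array([0, 0, 0, 2, 0, 1], dtype=float))
print(round(cosine(q, d), 3))           # high similarity despite no shared surface terms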
No Translation (Continued)
• Yellow Page cross-language retrieval

             Top 1    Top 10
    CL-LSI   63.8%    86.9%
    MT       57.5%    74.8%
A Comparative Evaluation
• Carbonell, Yang, Frederking, et al. (CMU, LTI)
  – Corpus-driven term translation (TMT)
  – Pseudo-relevance feedback (PRF)
  – Generalized Vector Space Model (GVSM)
  – Latent Semantic Indexing (LSI)
  – GVSM slightly outperforms LSI, which in turn outperforms PRF and TMT.
Research Directions
• Comparable corpus techniques
  – Automatic document linking
• Dictionary-based approaches
  – Word sense disambiguation
• Evaluation
  – Side-by-side tests
  – Controllable domain shift
CLIR System Using Query Translation
Generating Mixed Ranked Lists of Documents
• Normalizing scales of relevance (a sketch follows below)
  – using aligned documents
  – using ranks
  – interleaving according to given ratios
• Mapping documents into the same space
  – LSI
  – document translations
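Two of the merging strategies listed above, sketched in Python with invented result lists: rank-based normalization and interleaving according to given ratios.

def merge_by_rank(lists):
    """Score each document by 1/rank within its own list, then merge the lists."""
    scored = [(1.0 / rank, doc)
              for results in lists
              for rank, doc in enumerate(results, start=1)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]

def interleave(lists, ratios):
    """Take documents from each list in proportion to the given (positive) ratios."""
    merged, cursors = [], [0] * len(lists)
    while any(c < len(l) for c, l in zip(cursors, lists)):
        for i, r in enumerate(ratios):
            take = min(r, len(lists[i]) - cursors[i])
            merged.extend(lists[i][cursors[i]:cursors[i] + take])
            cursors[i] += take
    return merged

english = ["e1", "e2", "e3"]            # hypothetical per-language result lists
chinese = ["c1", "c2", "c3", "c4"]
print(merge_by_rank([english, chinese]))
print(interleave([english, chinese], ratios=[1, 2]))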
Tools
Types of Tools
• Mark-up tools
• Language identification
• Stemming/normalization
• Entity recognition
• Part-of-speech taggers
• Indexing tools
• Text alignment
• Speech recognition / OCR
• Visualization
• Character set/font handling
• Word segmentation
• Phrase/compound handling
• Terminology extraction
• Parsers/linguistic processors
• Lexicon acquisition
• MT systems
• Summarization
Character Set/Font Handling
• Input and display support
  – Special input modules, e.g., for Asian languages
  – Out-of-the-box support much improved thanks to modern web browsers
• Character set/file format
  – Unicode/UTF-8
  – XML
Language Identification
• Different levels of multilingual data
  – In different sub-collections
  – Within sub-collections
  – Within items
• Different approaches (a tri-gram sketch follows below)
  – Tri-grams
  – Stop words
  – Linguistic analysis
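A minimal sketch of tri-gram language identification using the classic out-of-place rank distance. The two training snippets are far too small for real use and serve only as an illustration of the technique.

from collections import Counter

def trigram_profile(text, top=300):
    text = " " + " ".join(text.lower().split()) + " "
    grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return [g for g, _ in grams.most_common(top)]

TRAINING = {   # toy training texts; real profiles need far more data
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "de": "der schnelle braune fuchs springt ueber den faulen hund und die katze",
}
PROFILES = {lang: trigram_profile(txt) for lang, txt in TRAINING.items()}

def identify(text):
    """Pick the language whose trigram ranking is closest (out-of-place measure)."""
    doc = trigram_profile(text)
    def distance(profile):
        penalty = len(profile)
        return sum(abs(i - profile.index(g)) if g in profile else penalty
                   for i, g in enumerate(doc))
    return min(PROFILES, key=lambda lang: distance(PROFILES[lang]))

print(identify("the dog jumps"))       # -> 'en' with these toy profiles
print(identify("der hund springt"))    # -> 'de' with these toy profiles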
Stemming/Normalization
• Reduction of words to their root form
• Important for languages with rich morphology
• Rule-based or dictionary-based
• Case normalization
• Handling of diacritics (French, …); a small sketch follows below
• Vowel (re-)substitution (e.g., Semitic languages, …)
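A small normalization sketch: case folding, diacritic stripping with unicodedata, and a few made-up suffix rules standing in for a real rule-based stemmer.

import unicodedata

SUFFIXES = ["ements", "ement", "ations", "ation", "ings", "ing", "es", "s"]  # toy rules only

def strip_diacritics(word):
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize(word):
    word = strip_diacritics(word.lower())          # case + diacritic normalization
    for suffix in SUFFIXES:                        # crude rule-based suffix stripping
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(normalize("Éléments"))       # -> 'element'
print(normalize("caractères"))     # -> 'caracter'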
Entity Recognition / Terminology Extraction
• Proper names, locations, ...
  – Critical, since often missing from dictionaries
  – Special problems in languages such as Chinese
• Domain-specific vocabulary, technical terms
  – Critical for effectiveness and accuracy
Phrase/Compound Handling
• Collocations ("Hong Kong")
  – Important for dictionary lookup
  – Improves retrieval accuracy
• Compounds ("Bankangestelltenlohn": bank employee salary); a splitting sketch follows below
  – A big problem in German
  – An infinite number of compounds, so a dictionary is no viable solution
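A toy greedy splitter illustrating dictionary-based decompounding. The mini lexicon is hypothetical, and real German splitting also has to handle linking elements (Fugenelemente), which this sketch ignores.

LEXICON = {"bank", "angestellten", "angestellte", "lohn"}   # assumed word-list entries

def split_compound(word, lexicon=LEXICON, min_len=3):
    """Return a list of known parts covering `word`, or [word] if no split is found."""
    word = word.lower()
    if word in lexicon or len(word) < min_len:
        return [word]
    # Try the longest known prefix first, then split the remainder recursively.
    for cut in range(len(word) - min_len, min_len - 1, -1):
        head, tail = word[:cut], word[cut:]
        if head in lexicon:
            rest = split_compound(tail, lexicon, min_len)
            if all(part in lexicon for part in rest):
                return [head] + rest
    return [word]

print(split_compound("Bankangestelltenlohn"))   # -> ['bank', 'angestellten', 'lohn']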
Lexicon Acquisition / Text Alignment
• Goal: automatic construction of data structures such as dictionaries and thesauri
  – Work on parallel and comparable corpora
  – Terminology extraction
  – Similarity thesauri
• Prerequisite: training data, usually aligned
  – Document-, sentence-, or word-level alignment
CLIR Evaluation at TREC
Too Many Factors in CLIR System Evaluation
• translation
• automatic relevance feedback
• term expansion
• disambiguation
• result merging
• test collection
• These factors need to be controlled or simplified to see what actually happened.
TREC-6 Cross-Language Track
• In cooperation with the Swiss Federal Institute of Technology (ETH)
• Task summary: retrieval of English, French, and German documents, in both a monolingual and a cross-lingual mode
• Documents
  – SDA (1988-1990): French (250 MB), German (330 MB)
  – Neue Zürcher Zeitung (1994): German (200 MB)
  – AP (1988-1990): English (759 MB)
• 13 participating groups
TREC-7 Cross-Language Track
• Task summary: retrieval of English, French, German and Italian documents
• Results to be returned as a single multilingual ranked list
• Addition of Italian SDA (1989-1990), 90 MB
• Addition of a subtask of 31,000 structured German social science documents (GIRT)
• 9 participating groups
TREC-8 Cross-Language Track
• Tasks, documents and topic creation similar to TREC-7
• 12 participating groups
CLIR in TREC-9
• Documents
  – Hong Kong Commercial Daily, Hong Kong Daily News, Takungpao: all from 1999, about 260 MB total
• 25 new topics built in English, with translations made into Chinese
Cross-Language Evaluation Forum
• A collaboration between the DELOS Network of Excellence for Digital Libraries and the US National Institute of Standards and Technology (NIST)
• Extension of the CLIR track at TREC (1997-1999)
Main Goals
• Promote research in cross-language system development for European languages by providing an appropriate infrastructure for:
  – CLIR system evaluation, testing and tuning
  – Comparison and discussion of results
CLEF 2000 Task Description
• Four evaluation tracks in CLEF 2000
  – multilingual information retrieval
  – bilingual information retrieval
  – monolingual (non-English) information retrieval
  – domain-specific IR
CLEF 2000 Document Collection
• Multilingual comparable corpus
  – English: Los Angeles Times
  – French: Le Monde
  – German: Frankfurter Rundschau + Der Spiegel
  – Italian: La Stampa
• Similar in genre, content, and time period
Case Study: CLIR for NPDM
3M in Digital Libraries/Museums
• Multi-media
  – Selecting suitable media to represent contents
• Multi-linguality
  – Decreasing the language barriers
• Multi-culture
  – Integrating multiple cultures
NPDM Project
• The Palace Museum, Taipei, is one of the most famous museums in the world.
• The NSC supports a pioneer study of the digital museum project NPDM, starting from 2000:
  – Enamels from the Ming and Ch'ing Dynasties
  – Famous Album Leaves of the Sung Dynasty
  – Illustrations in Buddhist Scriptures with Relative Drawings
Design Issues
• Standardization
  – A standard metadata protocol is indispensable for the interchange of resources with other museums.
• Multimedia
  – A suitable presentation scheme is required.
• Internationalization
  – To share the valuable resources of NPDM with users of different languages
  – To utilize knowledge presented in a foreign language
Translingual Issue
• CLIR
  – Allows users to issue queries in one language to access documents in another language
  – Here the query language is English and the document language is Chinese
• Two common approaches
  – Query translation
  – Document translation
Resources in the NPDM Pilot
• Each object is an enamel, a calligraphy, a painting, or an illustration
• MICI-DC
  – Metadata Interchange for Chinese Information
  – Fields accessible to users
    • Short descriptions vs. full texts
    • Bilingual versions vs. Chinese only
  – Fields for maintenance only
Search Modes
• Free search
  – Users describe their information need in natural language (Chinese or English)
• Specific topic search
  – Users fill in specific fields denoting authors, titles, dates, and so on
Example
• Information need
  – Retrieve "Travelers Among Mountains and Streams, Fan K'uan" ("范寬谿山行旅圖")
• Possible queries
  – Author: Fan Kuan; Kuan, Fan
  – Time: Sung Dynasty
  – Title: Mountains and Streams; Travel among mountains; Travel among streams; Mountain and stream painting
  – Free search: landscape painting; travelers; huge mountain; Nature; scenery; Shensi province
ECIR in NPDM
[Diagram: an English query is converted to a Chinese query through query translation: English names are handled by name search with machine transliteration and a specific bilingual dictionary (yielding Chinese names), English titles by title search (yielding Chinese titles), and the remaining terms by query disambiguation with a generic bilingual dictionary; the Chinese query is then run by a Chinese IR system against the NPDM collection, with document translation covering the reverse direction]
Results: Specific Topic Search
• Proper names are important query terms
  – Creators such as "林逋" (Lin P'u), "李建中" (Li Chien-chung), "歐陽脩" (Ou-yang Hsiu), etc.
  – Emperors such as "康熙" (K'ang-hsi), "乾隆" (Ch'ien-lung), "徽宗" (Hui-tsung), etc.
  – Dynasties such as "宋" (Sung), "明" (Ming), "清" (Ch'ing), etc.
Name Transliteration
• The alphabets of Chinese and English are totally different.
• Wade-Giles (WG) and Pinyin are two well-known systems for romanizing Chinese in libraries.
• Backward transliteration
  – Transliterate target-language terms back to source-language ones
  – Chen, Huang, and Tsai (COLING, 1998)
  – Lin and Chen (ROCLING, 2000)
Name Mapping Table
• Divide a name into a sequence of Chinese characters, and transform each character into phonemes.
• Look up the phoneme-to-WG (Pinyin) mapping table, and derive a canonical form for the name (a sketch follows below).
• Example
  – "林逋" → "ㄌㄧㄣ ㄆㄨ" → "Lin P'u" (WG)
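A tiny sketch of the table-lookup step; both mapping tables below contain only a couple of hypothetical entries, and real names need a much larger table plus handling of characters with multiple readings.

CHAR_TO_PHONEME = {"林": "ㄌㄧㄣ", "逋": "ㄆㄨ", "朱": "ㄓㄨ", "熹": "ㄒㄧ"}   # hypothetical fragment
PHONEME_TO_WG  = {"ㄌㄧㄣ": "Lin", "ㄆㄨ": "P'u", "ㄓㄨ": "Chu", "ㄒㄧ": "Hsi"}  # hypothetical fragment

def canonical_wg(name):
    """Map each character to its phonemes, then to Wade-Giles romanization."""
    syllables = []
    for ch in name:
        phoneme = CHAR_TO_PHONEME.get(ch)
        if phoneme is None:
            return None                         # character missing from the table
        syllables.append(PHONEME_TO_WG[phoneme])
    # Surname first, then given-name syllables joined with a hyphen (e.g. Lin P'u).
    return syllables[0] + " " + "-".join(syllables[1:]) if len(syllables) > 1 else syllables[0]

print(canonical_wg("林逋"))   # -> "Lin P'u"
print(canonical_wg("朱熹"))   # -> "Chu Hsi"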
Name Similarity
• Extract the named entity from the query.
• Select the most similar named entity from the name mapping table (a matching sketch follows below).
• Naming sequence/scheme
  – LastName FirstName1, e.g., Chu Hsi (朱熹)
  – FirstName1 LastName, e.g., Hsi Chu (朱熹)
  – LastName FirstName1-FirstName2, e.g., Hsu Tao-ning (許道寧)
  – FirstName1-FirstName2 LastName, e.g., Tao-ning Hsu (許道寧)
  – Any order, e.g., Tao Ning Hsu (許道寧)
  – Any transliteration, e.g., Ju Shi (朱熹)
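A small order-insensitive matching sketch (Jaccard overlap of romanized syllables) that tolerates the ordering and hyphenation variants above; handling alternative transliterations such as "Ju Shi" would need the phonetic similarity models cited earlier. The name table is hypothetical.

NAME_TABLE = {"Hsu Tao-ning": "許道寧", "Chu Hsi": "朱熹"}   # canonical WG form -> Chinese name

def syllables(name):
    """Split on spaces and hyphens and lower-case, giving a bag of syllables."""
    return sorted(part.lower() for part in name.replace("-", " ").split())

def match_name(query_name):
    """Return the Chinese name whose syllable bag best overlaps the query name."""
    q = set(syllables(query_name))
    def overlap(entry):
        e = set(syllables(entry))
        return len(q & e) / len(q | e)          # Jaccard similarity
    best = max(NAME_TABLE, key=overlap)
    return NAME_TABLE[best] if overlap(best) > 0 else None

print(match_name("Tao-ning Hsu"))   # -> 許道寧
print(match_name("Tao Ning Hsu"))   # -> 許道寧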
Title
• "谿山行旅圖" → "Travelers among Mountains and Streams"
• "travelers", "mountains", and "streams" are the basic components.
• Users can express their information need through descriptions of a desired artwork.
• The system will measure the similarity between art titles (descriptions) and a query.
Free Search
• A query is composed of several concepts.
• Concepts are either transliterated or translated.
• The query translation process works like a small-scale IR system.
• Resources
  – Name mapping table
  – Title mapping table
  – Specific English-Chinese dictionary
  – Generic English-Chinese dictionary
  – …
Algorithm
• (1) For each resource, select the Chinese translations whose scores are larger than a specified threshold.
• (2) Merge the Chinese translations identified from the different resources and sort them by score.
• (3) Consider the Chinese translation with the highest score in the sorted sequence.
  – If the intersection of the corresponding English description and the query is not empty, select the translation and delete the English terms shared by the query and the English description from the query.
  – Otherwise, skip the Chinese translation.
Algorithm (Continued)

(4) Repeat step (3) until query is empty or
all the Chinese translations in the sorting
sequence are considered.
 (5) If the query is not empty, then these
words are looked up from the general
dictionary. A Chinese query is composed of
all the translated results.
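A hedged Python sketch of steps (1)-(5). The resource contents, scores, threshold, and general dictionary are all invented, only to make the control flow concrete.

RESOURCES = [                                   # each: Chinese translation -> (score, English description terms)
    {"谿山行旅圖": (0.9, {"travelers", "mountains", "streams"})},   # hypothetical title-mapping table
    {"范寬":       (0.8, {"fan", "kuan"})},                          # hypothetical name-mapping table
]
GENERAL_DICT = {"painting": "畫", "landscape": "山水"}               # hypothetical general dictionary
THRESHOLD = 0.5

def translate_free_query(query_terms):
    query = {t.lower() for t in query_terms}
    # Steps (1)+(2): collect candidates above the threshold and sort them by score.
    candidates = [(score, zh, desc) for res in RESOURCES
                  for zh, (score, desc) in res.items() if score > THRESHOLD]
    candidates.sort(key=lambda c: c[0], reverse=True)
    translation = []
    # Steps (3)+(4): greedily take candidates whose English description still overlaps the query.
    for score, zh, desc in candidates:
        if not query:
            break
        if query & desc:
            translation.append(zh)
            query -= desc                       # remove the covered English terms
    # Step (5): translate any leftover terms with the general dictionary.
    translation += [GENERAL_DICT[t] for t in query if t in GENERAL_DICT]
    return translation

print(translate_free_query(["Travelers", "mountains", "streams", "painting"]))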