ACML 2010 Tutorial
Web People Search: Person Name Disambiguation and Other Problems

Hiroshi Nakagawa: Introduction, Feature Extraction (Phrase Extraction)
Minoru Yoshida: Feature Extraction (Information Extraction Approach) to End
(University of Tokyo)
Contents
1. Introduction
2. Feature Extraction
3. Feature Weighting / Similarity Calculation
4. Clustering
5. Evaluation Issues
Introduction
1. Motivation
2. Problem Settings
3. Differences from other problems
4. History
Motivation
• Web search for person names: over 10% of all queries
  – A study of the query logs of the AllTheWeb and AltaVista search sites gives an idea of the relevance of the people search task: 11-17% of the queries were composed of a person name with additional terms, and 4% were identified simply as person names (Artiles+, 09, WePS2).
• The "same-name" problem in person names
  – When different real-world entities have the same name, the reference from the name to the entity can be ambiguous.
  – Many different persons have the same name (e.g., John Smith).
  – Some persons have the same name as a famous one (e.g., Bill Gates), and the famous one dominates the results: with ordinary search engines, it is tough to find a Bill Gates who is not the Microsoft founder!
  → Difficult to access the target person
Problem in People Search
Query → Search engine → Results
Which pages are for which persons?

Person Name Clustering
Query → Search engine → Search result → Clusters of Web pages
Each page in a cluster refers to the same entity.
Sample System
Query = Ichiro Suzuki (a famous Japanese baseball player)
The system outputs keywords about each person and groups together the documents about the same person.

Output Example (Ichiro Suzuki)
Clusters found for other people with the same name: a painter, a dentist, a lawyer, ...
(Used as an example name because Ichiro is so famous.)
Introduction
1. Motivation
2. Problem Settings
3. Differences from other problems
4. History
Problem Setting
• Given: a set of Web pages returned from a search engine when entering person name queries
• Goal: to cluster the Web pages
  – One cluster for one entity
  – Possibly with related information (e.g., a biography and/or related words)
• Another usage: if a person has many aspects, like scientist and poet, these aspects are grouped together, making it easy to grasp who he/she is.
Example: Sakai Shuichi
Sakai Shuichi is a professor at the University of Tokyo in the field of computer architecture: some pages are about his books on computer architecture.
He is a Japanese poet too: other pages are about his collections of poems.

Example: Famous car maker "TOYOTA"
Some pages are about TOYOTA's retailer network.
Other pages are about TOYOTA HOME, which is a house maker and one of the TOYOTA group enterprises.
Introduction
1. Motivation
2. Problem Settings
3. Differences from other problems
4. History
Difference from Other Tasks

                WSD / Categorization   Person Name Clustering             Document Clustering
Goal            Categorize             Cluster documents about the        Cluster similar
                                       same entity (= person)             documents
Answers         Definite y/n           Definite y/n                       Not definite;
                                                                          task dependent
# of clusters   # of categories        # of entities in the real          Unknown
                                       world (unknown, but exact)
Training data   Yes (supervised)       Difficult to use (unsupervised)    No (unsupervised)

• Cluster documents for the same person
• Difficult to use training data for other person names
WSD: Word Sense Disambiguation
I was strolling along the bank.
Do you use a bank card there?
Did you go to the bank?
→ Which sense of "bank" is meant in each case?
Challenges
• Noisy Web data
  (1) Heavy, sophisticated NLP tools such as HPSG parsers are not suitable for this purpose.
  (2) The system should work at tolerable speed on noisy Web data.
  → Lightweight tools are needed
  – Light linguistic tools: POS taggers, stemmers, NE taggers
  – Pattern-based information extraction
• How to use "training data"
  – Most systems use an unsupervised clustering approach
  – Some systems assume "background knowledge"
• How to determine K (the number of clusters)
  – Remember that this K does not depend on the user's intention: it is exact and fixed in real use. Different from usual clustering!
Introduction
1.
2.
3.
4.
Motivation
Problem Settings
Differences from other problems
History
History
(Roots: Word Sense Disambiguation, Coreference Resolution)
1998: Cross-document coreference resolution [Bagga+, 98]: naive VSM
2003: Disambiguation for Web search results [Mann+, 03]: biographic data
2007: Web People Search Workshop (WePS) [Artiles+, 07][Artiles+, 09]
History
• Web People Search Workshop
  – 1st: SemEval-2007
  – 2nd: WWW-2009
    • Document Clustering
    • Attribute Extraction
  – 3rd: CLEF-2010 (Conference on Multilingual and Multimodal Information Access Evaluation), 20-23 September 2010, Padua
    • Document Clustering & Attribute Extraction
    • Organization Name Disambiguation
WePS2 Data Source: 30 names (Artiles+, 09)
[Slides with the WePS2 data tables and summary report]
Contents
1. Introduction
2. Feature Extraction
3. Feature Weighting / Similarity Calculation
4. Clustering
5. Evaluation Issues
Main Steps
1. Preprocessing
2. Feature extraction
3. Feature weighting / Similarity calculation
4. Clustering
5. (Related Information Extraction)
PREPROCESSING

Preprocessing
• Filter out useless pages ("junk pages")
  – The name is matched, but the matched string does not refer to a person (e.g., a company name)
  – In addition, alphabetically ordered name-list pages (Ono+, 08)
• Data cleaning
  – HTML tag removal
  – Sentence (snippet) extraction
  – Coreference resolution (used by Bagga+); in fact, a very difficult NLP task
Junk Page Filtering
• SVM-based classification (Wan+, 05); a toy sketch follows this list
  – Features: words related or not related to the person name
    • Simple lexical features
    • Stylistic features (fonts / tags), i.e., how many and which words are in bold font
    • Query-relevant features (next-to-query words)
    • Linguistic features (NE counts), e.g., how many person, organization, and location names appear
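Below is a minimal sketch of such a filter (not the actual system of (Wan+, 05)), using scikit-learn. The features are simplified to unigrams plus pseudo-tokens for next-to-query words, and the pages and labels are made up for illustration.

```python
# Hedged sketch of an SVM-based junk-page filter; toy data only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def featurize(page_text, query_name):
    # Add NEXTQ_* pseudo-tokens for words adjacent to the query name.
    tokens = page_text.split()
    extra = []
    for i, tok in enumerate(tokens):
        if query_name.lower() in tok.lower():
            if i > 0:
                extra.append("NEXTQ_" + tokens[i - 1].lower())
            if i + 1 < len(tokens):
                extra.append("NEXTQ_" + tokens[i + 1].lower())
    return page_text + " " + " ".join(extra)

pages = ["Bill Gates founded Microsoft and is a philanthropist",
         "gates and fences for sale at the best prices"]
labels = [1, 0]  # 1 = the name refers to a person, 0 = junk
clf = make_pipeline(CountVectorizer(), LinearSVC())
clf.fit([featurize(p, "Gates") for p in pages], labels)
print(clf.predict([featurize("steel gates catalog and fences", "Gates")]))
```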
FEATURE EXTRACTION

Feature Extraction
• How to characterize each name appearance
  – The name itself cannot be used for disambiguation!
• Each name appearance can be characterized by its contexts.
• Possible contexts
  – Surrounding words, adjacent strings, syntactically related words, etc.
  – Which to use?
Basic Approach
• Use all words in documents
  – Or snippets (texts around the name)
  – Or titles/summaries (first sentence, etc.)
• Use the TFIDF weighting scheme (sketched below)
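A small sketch of this basic approach: represent each page (or snippet) by a TFIDF vector and compare pages with cosine similarity. The three toy documents are invented for illustration.

```python
# TFIDF bag-of-words plus cosine similarity over toy documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Bill Gates is the founder of Microsoft.",
    "Microsoft chairman Bill Gates announced a new product.",
    "Bill Gates, a dentist in Ohio, opened a new clinic.",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
print(cosine_similarity(X))  # 3x3 pairwise similarity matrix
```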
Problem
• There exist relatively useful features and relatively useless features (especially for person name disambiguation)
  – Useful: NEs, biography, noun phrases, etc.
  – Useless: general words, boilerplate, etc.
• How to distinguish useful features from the others
• How to give a weight to each feature
Named Entities
• Documents about Bill Gates: related person names and related organization names are highlighted.

Noun Phrases
• Documents about Bill Gates: related key words are highlighted.

Other Words
• Documents about Bill Gates: the remaining words, some of which are more important than others.
Extracting Useful Features
• Thresholding
  – Based on a score related to our purpose: TFIDF, etc.
• Tool-based approach
  – POS tagging, NE tagging
• Information extraction approach (described later by Yoshida)
• Meta-data approach
  – Link structures, meta tags
Thresholding
• Calculate the TFIDF scores of words
• Discard the words with low TFIDF scores (a sketch follows)
• Unigrams, bigrams, and even N-grams can be used (Chen+, 09), where the Google 5-gram corpus (built from 1T words) is used to calculate the TFIDF scores
• Other scores: log-likelihood ratio, mutual information, KL divergence, ...
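A sketch of TFIDF thresholding: keep only the terms of a document whose TFIDF weight exceeds a cutoff. The documents and the cutoff 0.3 are arbitrary toy values.

```python
# Keep only high-TFIDF terms of a document.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Bill Gates is the founder of Microsoft.",
        "Microsoft chairman Bill Gates announced a new product.",
        "Bill Gates, a dentist in Ohio, opened a new clinic."]
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)
terms = vec.get_feature_names_out()
weights = X[0].toarray().ravel()
print([t for t, w in zip(terms, weights) if w > 0.3])
```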
Tool-Based Approach
• Available tools:
  – POS tagging, NE extraction (sophisticated)
    • High-performance POS taggers have been developed for many languages; for western languages, stemmers have also been developed.
  – Bigrams, N-grams (unsophisticated but simple)
  – Keyword extraction (in the middle, between NE extraction and bigrams/N-grams)
Part of Speech (POS) Tagging
• Detect the grammatical categories of the words
  – Nouns, verbs, prepositions, adverbs, adjectives, ...
  – Typically nouns are used as features

  William Henry "Bill" Gates III [NOUNS] (born [VERB] October 28, 1955 [NOUNS]) is [VERB] an [DETERMINER] American [ADJECTIVE] business magnate, philanthropist [NOUNS], ...

  – Noun phrases can be extracted with some simple rules
  – Many available tools (e.g., TreeTagger)
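A sketch of noun extraction via POS tagging, here with NLTK (TreeTagger or any other tagger would serve equally well); the required NLTK models must be downloaded once beforehand.

```python
# Extract nouns as features using NLTK's POS tagger.
import nltk
# One-time setup:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

sent = 'William Henry "Bill" Gates III is an American business magnate.'
tagged = nltk.pos_tag(nltk.word_tokenize(sent))
nouns = [w for w, tag in tagged if tag.startswith("NN")]
print(nouns)  # nouns and proper nouns, usable as features
```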
Named Entity (NE) Extraction
• Find "proper names" in texts
  – e.g., names of persons, organizations, locations, ...
  – Time expressions are also included in many cases

  William Henry "Bill" Gates III [PERSON] (born October 28, 1955 [DATE]) is an American business magnate, philanthropist, ...

  – Many available tools (Stanford NER, OpenNLP, ESpotter, ...)
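A sketch of NE extraction with NLTK's ne_chunk; Stanford NER or OpenNLP would be used the same way conceptually. Requires the 'maxent_ne_chunker' and 'words' models.

```python
# Extract named entities (PERSON, ORGANIZATION, ...) with NLTK.
import nltk

sent = "Bill Gates is the chairman of Microsoft in Redmond."
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)))
for node in tree:
    if isinstance(node, nltk.Tree):  # an NE chunk, e.g., PERSON
        print(node.label(), " ".join(w for w, tag in node.leaves()))
```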
Key Phrase Extraction
• Noun phrases consisting of 2 or more words
  – Likely to be topic-related concepts
  – Term-extraction tool "Gensen" (Nakagawa+, 05)
    • Scores noun phrases by "term-likelihood"
    • Topic-related terms get higher scores

  Gates held the positions of CEO and chief software architect [SCORE=45.2], and remains the largest individual shareholder [SCORE=22.4] ...
Gensen (言選) Web Score
From a corpus we extract compound nouns such as:
  信息処理 (information processing), 計算機処理能力 (computer processing capacity), 処理段階 (processing step), 信息処理学会 (information processing society)
For the simple noun 処理 (processing):
  L = # of distinct left-adjacent words = 2 (信息, 計算機), so L(処理) = 2 + 1 = 3
  R = # of distinct right-adjacent words = 3 (能力, 段階, 学会), so R(処理) = 3 + 1 = 4
  LR(処理) = 3 × 4 = 12
Calculation of LR and FLR
• Compound word: W = w1 ... wn, where each wi is a simple noun.
• L(wi) = (# of left-side connections of wi) + 1
• R(wi) = (# of right-side connections of wi) + 1
• The LR score of a compound word W = w1 ... wn, like 信息処理学会, is defined as:

    LR(W) = [ L(w1)×R(w1) × ... × L(wn)×R(wn) ]^(1/2n)    (normalized by length)

• Example: LR(信息処理) = [L(信息)×R(信息) × L(処理)×R(処理)]^(1/4)
  i.e., LR(information processing) = [L(info.)×R(info.) × L(proc.)×R(proc.)]^(1/4)
Calculation of LR and FLR (cont.)

    LR(W) = [ L(w1)×R(w1) × ... × L(wn)×R(wn) ]^(1/2n)    (normalized by length)

• F(W) is the independent frequency of the compound word W, where "independent" means that W is not a part of a longer compound word.
• Then FLR(W), the score used to rank term candidates, is defined as:

    FLR(W) = F(W) × LR(W)

• F(W) has a similar effect to TF; hence, if the corpus is big, F(W) affects FLR(W) more.
• Example: FLR(信息処理) = F(信息処理) × [L(信息)×R(信息) × L(処理)×R(処理)]^(1/4)
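A sketch of the LR and FLR scores defined above. The compound words and their (independent) frequencies are toy English stand-ins for the Japanese examples; left/right connection counts are collected from the compound-word list itself.

```python
# LR(W) = [prod_i L(w_i) R(w_i)]^(1/2n); FLR(W) = F(W) * LR(W).
from collections import defaultdict

freq = {("info", "proc"): 4, ("computer", "proc", "capacity"): 1,
        ("proc", "step"): 2, ("info", "proc", "society"): 1}

left, right = defaultdict(set), defaultdict(set)
for w in freq:
    for a, b in zip(w, w[1:]):
        right[a].add(b)  # b occurs to the right of a
        left[b].add(a)   # a occurs to the left of b

def lr(w):
    prod = 1.0
    for noun in w:
        prod *= (len(left[noun]) + 1) * (len(right[noun]) + 1)
    return prod ** (1.0 / (2 * len(w)))  # normalized by length

def flr(w):
    return freq[w] * lr(w)  # F(W) x LR(W)

print(lr(("info", "proc")), flr(("info", "proc")))
```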
Example of term extraction by Gensen Web
English article: SVM on Wikipedia
Support vector machines (SVMs) are a set of related supervised learning methods that
analyze data and recognize patterns, used for classification and regression analysis. The
original SVM algorithm was invented by Vladimir Vapnik and the current standard
incarnation (soft margin) was proposed by Corinna Cortes and Vladimir Vapnik[1]. The
standard SVM is a non-probabilistic binary linear classifier, i.e. it predicts, for each given
input, which of two possible classes the input is a member of. Since an SVM is a classifier,
then given a set of training examples, each marked as belonging to one of two categories,
an SVM training algorithm builds a model that predicts whether a new example falls into
one category or the other. Intuitively, an SVM model is a representation of the examples as
points in space, mapped so that the examples of the separate categories are divided by a
clear gap that is as wide as possible. New examples are then mapped into that same space
and predicted to belong to a category based on which side of the gap they fall on.
…….
Another approach is to use an interior point method that uses Newton-like iterations to
find a solution of the Karush-Kuhn-Tucker conditions of the primal and dual problems.[10]
Instead of solving a sequence of broken down problems, this approach directly solves the
problem as a whole. To avoid solving a linear system involving the large kernel matrix, a
low rank approximation to the matrix is often used to use the kernel trick.
Extracted terms (term score)

Top 1-17:
hyperplane 116.65, margin 109.54, SVM 74.08, vector 56.12, point 52.85, support vector 49.34, training data 48.12, data 47.83, problem 44.27, space 44.09, data point 38.01, classifier 30.59, classification 29.58, optimization problem 26.05, set 25.30, support vector machine 24.66, kernel 21.00

Top 18-38:
set of point 20.73, linear classifier 19.99, maximum-margin hyperplane 19.92, example 19.60, one 17.32, Vladimir Vapnik 15.87, parameter 14.70, linear SVM 14.40, training set 14.00, optimization 13.42, model 12.25, training vector 12.04, support vector classification 11.70, two classe 11.57, normal vector 11.38, kernel trick 11.22, maximum margin classifier 11.22, ...

Top 408-426 (last):
Vandewalle 1.00, derive 1.00, it 1.00, Leisch 1.00, 2.3 1.00, H1 1.00, c 1.00, Hornik 1.00, mean 1.00, testing 1.00, transformation 1.00, unconstrained 1.00, homogeneous 1.00, need 1.00, learner 1.00, grid-search 1.00, convex 1.00, See 1.00, trade 1.00
Contents
1. Introduction
2. Feature Extraction
3. Feature Weighting / Similarity Calculation
4. Clustering
5. Evaluation Issues
Information Extraction Approach
• Information extraction:
  – The task of extracting specific types of information
  – e.g., a person and his/her working place

  William Henry "Bill" Gates III [NAME] (born October 28, 1955 [DATE OF BIRTH]) is an American [NATIONALITY] business magnate, philanthropist [OCCUPATION], ...
Information Extraction Approach
• Useful features for disambiguation (Wan+, 05) (Mann+, 03) (Niu+, 04)
• Also used as "summaries" of clusters
  – To help users find their objective clusters
  – WePS-2 "attribute extraction task"
Information Extraction Approach
• Different methods for different attributes
  – Simple patterns (hand-crafted / automatically obtained)
    • Phone, FAX, URL, E-mail
  – Syntactic rules (hand-crafted / automatically generated)
    • Date of birth, titles, positions
  – Dictionary match (from Wikipedia, etc.)
    • Occupation, major, degree, nationality
  – Keywords extracted by NER tools
    • Birth place (LOCATION), affiliation (ORGANIZATION), schools (ORGANIZATION)
Hand-Crafted Patterns
• Typically written with regular expressions (illustrated below)
  – Phone, FAX: +## (#) ####-####
  – URLs: http://www.xxx.xxx.xxx/...
  – E-mails: xxx@xxx.xxx
• Needs some classification (Phone or FAX?)
  – Supervised learning
  – Keyword-based approach (e.g., "born" for date of birth)
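Illustrative hand-crafted patterns as Python regular expressions; these are simplified stand-ins, not the exact patterns of the systems discussed above.

```python
# Simple regex patterns for phone/FAX numbers, URLs, and e-mails.
import re

PATTERNS = {
    "phone_or_fax": re.compile(r"\+\d{1,3}\s*\(\d+\)\s*\d{3,4}-\d{4}"),
    "url": re.compile(r"https?://[\w.-]+(?:/\S*)?"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}

text = ("Contact: taro@example.com, +81 (3) 1234-5678, "
        "http://www.example.com/staff/taro")
for attr, pattern in PATTERNS.items():
    print(attr, pattern.findall(text))
```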
Automatically Generated Patterns
• Patterns for birth years (Mann+, 03)
<name> (<birth year> - ####)
<name> <name> ( <birth year>
<name> was born in <birth year>
• Patterns for titles (Wan+, 05)
<name> is a <title>
Automatically Generated Patterns
• Approach by (Mann+, 03): a bootstrapping method (a toy sketch follows)
  – Start with seed facts
    • (e.g., (Mozart, 1756))
  – Find sentences (from the Web) that contain both elements
    • (e.g., "Mozart was born in 1756")
  – Perform some generalization
    • (e.g., "<name> was born in <birth year>")
  – Extract substrings with high scores (measured using the current facts)
  – Extract new facts
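A toy sketch of this bootstrapping loop: from a seed fact, find sentences containing both elements, generalize them into patterns, and extract new facts. Real systems search the Web and score candidate substrings; the two-sentence corpus here is invented, and the generalization step is reduced to keeping the text between the two slots.

```python
# Minimal bootstrapping over a made-up two-sentence corpus.
import re

seeds = {("Mozart", "1756")}
corpus = ["Mozart was born in 1756 in Salzburg.",
          "Beethoven was born in 1770 in Bonn."]

patterns = set()
for name, year in seeds:
    for sent in corpus:
        if name in sent and year in sent:
            # Generalize: keep only the text between the two slots.
            patterns.add(sent.split(name)[1].split(year)[0])

facts = set()
for middle in patterns:
    regex = r"(\w+)" + re.escape(middle) + r"(\d{4})"
    for sent in corpus:
        m = re.search(regex, sent)
        if m:
            facts.add(m.groups())
print(facts)  # now also contains ('Beethoven', '1770')
```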
Dictionary Matching
• Construct a list of occupations, nations (for "nationality" attributes), etc. from existing dictionaries
  – Wikipedia, WordNet, etc.
  – e.g., a list of countries
Link Structure Approach
• It is difficult to find correct network structures
  – Difficulty in finding "in-links"
  – Needs some approximation
• (Bekkerman+, 05): "socially linked persons tend to link to similar pages"
  – Determine whether two pages are linked or not
  – MaxEnt classification with "linked-page" (URLs in pages) features
FEATURE WEIGHTING / SIMILARITY CALCULATION

Feature Weighting
• Knowledge-based approach
  – US Census data, WordNet
• Web-query approach
• SVD
• Bootstrapping
• Determination of link/non-link by supervised classifiers
Knowledge-Based Approach
• US Census data
  – Frequent name -> ambiguous (Fleischman+, 04)
• WordNet
  – Semantic similarity for concept words
    • WordNet distance
WordNet
• Publicly available “dictionary” (thesaurus)
– Hierarchical structures between words
– We can find “synonyms”, “hyponyms”,
“hypernyms” of words
• Many “semantic distance” measures between
two words
– Path length
– Depth of common hypernyms
–…
Web-Query Approach
• Name-concept relation (Fleischman+, 04)
• Validate relations between context NEs by Web search counts (Kalashnikov+, 08) (Nuray-Turan+, 09)
• Use the query "name + bigram", concatenating the snippets into a new document (Chen+, 09)
• Obtaining reliable counts (google_df) (Bekkerman+, 05)
Name-concept relation (Fleischman+, 04)
• Task: distinguish (name, concept) pairs
  – (Paul Simon, pop star) ; (Paul Simon, singer)
  – (Paul Simon, pop star) ; (Paul Simon, politician)
• MaxEnt classifier
• Features using Web counts (N: name, c: concept, +: AND operation); see the sketch below
  – Q(N + c1 + c2): Intersection
  – |Q(N + c1) - Q(N + c2)|: Difference
  – Q(N + c1 + c2) / (Q(N + c1) + Q(N + c2)): Ratio
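A sketch of these Web-count features. Q stands for a search-engine hit-count function; here it is a stub with fabricated counts, since a real system would issue live queries.

```python
# Web-count features in the style of (Fleischman+, 04); Q is a stub.
def Q(*terms):
    fake_hits = {("Paul Simon", "pop star"): 9000,
                 ("Paul Simon", "politician"): 4000,
                 ("Paul Simon", "pop star", "politician"): 50}
    return fake_hits.get(terms, 1)

def count_features(n, c1, c2):
    intersection = Q(n, c1, c2)
    difference = abs(Q(n, c1) - Q(n, c2))
    ratio = Q(n, c1, c2) / (Q(n, c1) + Q(n, c2))
    return intersection, difference, ratio

print(count_features("Paul Simon", "pop star", "politician"))
```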
Validate relations between context NEs by Web search counts
(Kalashnikov+, 08) (Nuray-Turan+, 09)
• NE-based document similarity calculated using Web counts
  – NEs: persons or organizations
• WebDice (C: context set ... [c1] OR [c2] OR ...)
  – 2Q(N + C1 + C2) / (Q(N + C1) + Q(N + C2))
  – 2Q(N + C1 + C2) / (Q(N) + Q(C1 + C2))
  – The second one was better
Use the query "name + bigram", concatenating the snippets into a new document (Chen+, 09)
• Obtain additional features for similarity calculation
  – Web page -> b: maximal-weight bigram
  – Snippets100(N + b) -> one new document
  – New document -> additional features (tokens)
Obtaining reliable counts (google_df) (Bekkerman+, 05)
• Google_tfidf(w) = tf(w) / log(Q(w))
• Some recent systems use Google N-grams (Chen+, 09)
Dimension Reduction by SVD
(Pedersen+, 05)
• Reduce sparseness of context vectors
• More semantic-level representations (can use
word similarities in contexts)
• Bigram features (contexts)
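A sketch of SVD-based dimension reduction over (bigram) context vectors, in the spirit of (Pedersen+, 05), via scikit-learn's TruncatedSVD. The toy documents and n_components=2 are arbitrary choices.

```python
# Reduce sparse bigram context vectors to a dense low-dim space.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["bill gates microsoft software",
        "gates microsoft windows software",
        "gates dental clinic teeth"]
X = TfidfVectorizer(ngram_range=(2, 2)).fit_transform(docs)
Z = TruncatedSVD(n_components=2).fit_transform(X)
print(Z)  # dense, low-dimensional context representations
```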
Cluster Refinement by Bootstrapping (1/4)
(Yoshida+, 10)
• Strong features can identify a person
  – High precision, but not always observed
  – e.g., NEs, CKWs, ...: "Paul Allen", "Steve Ballmer", "Microsoft" in documents about Bill Gates
• Weak features, e.g., "program"
  – Not useful in general, but useful for this name: documents about Bill Gates that share only weak features may still refer to the same person
Cluster Refinement by Bootstrapping (2/4)
• Document set D = {d1, ..., dn}; feature set F = {f1, ..., fm}
• Document-feature matrix P (which documents contain which features)
• Document-cluster relation r_{D,C}; feature-cluster relation r_{F,C}
• An initial clustering of the documents gives the initial r_{D,C}
Cluster Refinement by Bootstrapping (3/4)
• Propagate cluster relations between documents and features through P:

    r_{F,C}^(t) = P^T r_{D,C}^(t)
    r_{D,C}^(t+1) = P r_{F,C}^(t)

• Combining both steps:

    r_{D,C}^(t+1) = P P^T r_{D,C}^(t)

• The iteration starts from the initial cluster assignments r_{D,C}^(0).
Cluster Refinement by Bootstrapping (4/4)
• Each document is taken into the cluster with the largest relation value.
• The slide shows a 5-document, 3-cluster numerical example: multiplying the initial one-hot document-cluster values by P P^T yields the refined values (sketched below).
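A numpy sketch of the refinement step r_DC <- P P^T r_DC, with each document finally assigned to the cluster of largest relation value. The matrices are loosely modeled on the slide's 5-document, 3-cluster example; the row normalization is added here for readability and may differ from the paper's exact update.

```python
# One bootstrapping refinement iteration over toy matrices.
import numpy as np

P = np.array([[0.8, 0.2, 0.3],   # document-feature matrix (5 docs, 3 features)
              [1.0, 0.1, 0.2],
              [0.15, 0.85, 0.2],
              [0.15, 0.85, 0.2],
              [0.5, 0.4, 0.3]])
r_dc = np.array([[1, 0, 0],      # initial document-cluster assignments
                 [1, 0, 0],
                 [0, 1, 0],
                 [0, 1, 0],
                 [0, 0, 1]], dtype=float)

r_fc = P.T @ r_dc                # feature-cluster relation
r_dc = P @ r_fc                  # refined document-cluster relation
r_dc /= r_dc.sum(axis=1, keepdims=True)
print(r_dc.argmax(axis=1))       # hard assignments after refinement
```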
Determination of "linked" or "not-linked" by supervised classifiers
• MaxEnt classification (Fleischman+, 04)
  – Features: name features, Web features, etc.
• SkyLine-based classification (Kalashnikov+, 08)
  – Features: search engine hit counts
CLUSTERING

Problem: How to Determine K
• Hierarchical clustering with thresholds
• Online clustering (single pass clustering)
• Building "core" clusters (2-stage clustering)
• Variable-component-number clustering (e.g., Dirichlet Process Mixture)
Hierarchical clustering with thresholds
• Used in many systems
• Popular settings:
  – Agglomerative clustering
  – Group-average method (or, sometimes, the single-link method)
  – Predetermined threshold (or, sometimes, determined by cross-validation)
Hierarchical clustering with thresholds
• Cutting the dendrogram (leaves: document IDs 5 2 3 9 1 8 7 6 4) at a low cluster-similarity threshold gives 2 clusters: {1,2,3,5,9}, {4,6,7,8}
• Cutting at a higher threshold gives 4 clusters: {2,5}, {1,3,9}, {6,7,8}, {4}
• Cluster similarity (group-average method):

    sim_C(C_x, C_y) = (1 / (|C_x| |C_y|)) Σ_{d_x ∈ C_x, d_y ∈ C_y} sim(d_x, d_y)
Cluster-Distance Calculation
• Single-linkage method: distance between the closest pair of documents
• Complete-linkage method: distance between the farthest pair of documents
• Centroid method: distance between the cluster centroids
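A sketch of threshold-based agglomerative clustering with the group-average criterion, via scipy. The 2-D document vectors and the distance threshold 0.5 are toy values; a real system would use TFIDF vectors and a tuned threshold.

```python
# Group-average agglomerative clustering, cut at a fixed threshold.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
Z = linkage(pdist(docs, metric="cosine"), method="average")
labels = fcluster(Z, t=0.5, criterion="distance")  # cut at threshold
print(labels)  # two clusters: the first two docs vs. the last two
```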
Online Clustering
• Single Pass Clustering (Balog+, 08); a sketch follows
  – Take pages from the 1st in the search results
  – For each page, find the most similar cluster
  – If the similarity is below the threshold, create a new cluster
  – Similarity: Naive Bayes or cosine with TFIDF
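A sketch of single-pass clustering: scan pages in rank order, attach each to the best-matching existing cluster, or start a new one if no similarity reaches the threshold. Cosine over sparse dict vectors is used here; the threshold 0.3 is arbitrary.

```python
# Single-pass (online) clustering over toy term-weight vectors.
import math

def cos(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def single_pass(vectors, threshold=0.3):
    clusters, assignment = [], []
    for v in vectors:
        best, best_sim = None, threshold
        for i, members in enumerate(clusters):
            sim = max(cos(v, m) for m in members)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            clusters.append([v])          # open a new cluster
            assignment.append(len(clusters) - 1)
        else:
            clusters[best].append(v)      # join the most similar cluster
            assignment.append(best)
    return assignment

pages = [{"microsoft": 1.0, "windows": 1.0},
         {"microsoft": 1.0, "software": 1.0},
         {"dentist": 1.0, "clinic": 1.0}]
print(single_pass(pages))  # -> [0, 0, 1]
```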
Building Core Clusters
• 1st stage clustering: high-precision clusters
  – Relatively high threshold (Mann+, 03)
  – Use strong features only (Ikeda+, 09)
• 2nd stage clustering: treat the remaining documents (schematic sketch below)
  – Add them to the most similar 1st-stage clusters (Mann+, 03) (Ikeda+, 09)
  – Feature weighting by 1st-stage clusters (Yoshida+, 10)
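A schematic sketch of the two-stage idea: build high-precision cores from strong-feature matches only, then attach the remaining documents to their most similar core. strong_features (returning a set) and similarity (returning a float) are hypothetical helpers supplied by the caller, and at least one core is assumed to emerge.

```python
# Two-stage clustering skeleton with caller-supplied helpers.
def two_stage(docs, strong_features, similarity):
    cores, rest = [], []
    for d in docs:
        merged = False
        for core in cores:  # 1st stage: strong-feature match only
            if any(strong_features(d) & strong_features(m) for m in core):
                core.append(d)
                merged = True
                break
        if not merged:
            if strong_features(d):
                cores.append([d])
            else:
                rest.append(d)  # no strong features: defer to stage 2
    for d in rest:  # 2nd stage: attach to the most similar core
        best = max(cores, key=lambda c: max(similarity(d, m) for m in c))
        best.append(d)
    return cores
```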
Query Expansion Approach
(Ikeda+, 09)
• Re-extract key phrases by using the 1st-stage clusters
  – Key phrases for documents -> key phrases for clusters
  – More reliable than those from one document
• Procedure:
  1. Extract the top CKWs (cluster key words) from the current cluster (e.g., "home runs", "major leagues", "all stars")
  2. Search for the CKWs in documents outside the cluster
  3. If such documents exist, copy them into the cluster (soft clustering)
  4. Remove 1-element clusters
Feature weighting by 1st-stage clusters
(Yoshida+, 10)
1. Make clusters by strong features
2. Weight weak features using the clusters, and refine similarities
3. Refine the clusters by using the new similarities
Using Dirichlet Process Mixture
(Ono+, 08)
• Topic = word distribution
  – Topic "economics" = word distribution {"dollar": 0.03, "stock": 0.05, "share": 0.01, ...}
• Document = mixture of topics
  – {economics: 0.3, politics: 0.2, ...}
• A document's topic = the topic with the highest weight
• Modeling by DPUM (Dirichlet Process Unigram Mixture)
  – The # of topics is automatically determined (see the toy sketch below)
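A toy illustration of why a Dirichlet-process model can leave the number of clusters open: sampling cluster assignments from the Chinese restaurant process (the DP's partition prior) yields a variable number of "tables". DPUM additionally weights each choice by the unigram likelihood of the document under each topic, which this prior-only sketch omits.

```python
# Sample a random partition from the Chinese restaurant process.
import random

def crp(n_docs, alpha=1.0, seed=0):
    rng = random.Random(seed)
    tables = []  # tables[k] = number of documents at table k
    for _ in range(n_docs):
        r = rng.uniform(0, sum(tables) + alpha)
        acc = 0.0
        for k, size in enumerate(tables):
            acc += size
            if r < acc:        # join table k, prob. size / (n + alpha)
                tables[k] += 1
                break
        else:
            tables.append(1)   # open a new table, prob. alpha / (n + alpha)
    return tables

print(crp(100))  # e.g., a handful of tables: the # of clusters emerges
```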
Example: Estimation of Latent Topics
[Figure: documents are points in word space (word-1, word-2, word-3); each (red) bar is a latent entity]
Dirichlet Process Unigram Mixture
[Graphical models of UM (M documents, N_d words w_dn each) and DPUM]
• G0 = distribution for θ (Dirichlet distribution)
• θ = multinomial distribution over words
• DPUM allows a countable number of multinomial distributions
DPUM Parameter Estimation
• Start from an initial entity distribution
• Estimate the entity distribution by iteratively maximizing the likelihood
• Merge clusters with the same topic (e.g., politics, economics, sports, society, arts, entertainment)
EVALUATION ISSUES
Evaluation Issues
• Evaluation Measures
• Available Corpus
• WePS Workshop
Evaluation Measures
• Precision / Recall / F-measure
• Purity / Inverse Purity
• B-cubed Precision / Recall / F-measure
– Extended B-cubed
Recall and Precision for Clustering
• Features and recall/precision
• First-stage clusters = high precision
• Example: A = 5, B = 8, C = 3
  – A: size of the cluster
  – B: # of correct documents
  – C: # of correct documents in the cluster
  – Precision = C / A = 0.6
  – Recall = C / B = 0.375
Recall and Precision [Larsen and Aone 1999]
• Precision (P), recall (R), and F-measure are calculated for each pair of a correct cluster L and a machine-made cluster C:
    P(L, C) = |L ∩ C| / |C|,  R(L, C) = |L ∩ C| / |L|,  F(L, C) = 2PR / (P + R)
• For each correct cluster L, take the machine-made cluster that maximizes F.
• Total F-measure, weighted by cluster size (n = total # of documents):
    F = Σ_L (|L| / n) max_C F(L, C)
• Note: total precision and recall are calculated in the same way.
Example
Correct clusters C: {A, A, A, A, A}, {B, B}, {C}
Machine-made clusters D: {A, A, B} and {A, A, A, B, C}
• For {A, A, B} against the A-cluster: P = 2/3, R = 2/5, F = 1/2
• For {A, A, A, B, C} against the A-cluster: P = 3/5, R = 3/5, F = 3/5
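A sketch of this clustering F-measure: each correct cluster is matched with the machine-made cluster that maximizes F, and the total is the size-weighted sum. The data reproduces the example above (documents given unique ids); the per-cluster values for the A-cluster match the slide's numbers.

```python
# Size-weighted best-match clustering F-measure.
def total_f(correct, system):
    n = sum(len(l) for l in correct)
    total = 0.0
    for l in correct:
        best = 0.0
        for c in system:
            overlap = len(l & c)
            if overlap:
                p, r = overlap / len(c), overlap / len(l)
                best = max(best, 2 * p * r / (p + r))
        total += (len(l) / n) * best
    return total

A = {"a1", "a2", "a3", "a4", "a5"}
B = {"b1", "b2"}
C = {"c1"}
system = [{"a1", "a2", "b1"}, {"a3", "a4", "a5", "b2", "c1"}]
print(total_f([A, B, C], system))
```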
Purity / Inverse Purity
• Similar to precision / recall (n = total # of documents)
  – L: manually annotated categories (clusters)
  – C: clusters output by systems
  – Purity = (1/n) Σ_C max_L |C ∩ L|; Inverse Purity = (1/n) Σ_L max_C |C ∩ L|
B-Cubed Precision/Recall
• Entity-wise accuracy calculation: for each entity e,
  – C(e): the cluster (by the system) containing e
  – L(e): the cluster (by a human) containing e
  – B-cubed precision of e = |C(e) ∩ L(e)| / |C(e)|; B-cubed recall of e = |C(e) ∩ L(e)| / |L(e)|
  – Both are averaged over all entities (illustration borrowed from (Amigo+, 09)); a sketch follows
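A sketch of B-cubed precision/recall: compare, for every element, the system cluster and the gold cluster containing it, then average over elements. The two toy clusterings are invented, and this version assumes hard (non-overlapping) clusters.

```python
# Element-wise B-cubed precision, recall, and F over hard clusters.
def b_cubed(system, gold):
    sys_of = {e: c for c in system for e in c}
    gold_of = {e: l for l in gold for e in l}
    elems = list(gold_of)
    p = sum(len(sys_of[e] & gold_of[e]) / len(sys_of[e])
            for e in elems) / len(elems)
    r = sum(len(sys_of[e] & gold_of[e]) / len(gold_of[e])
            for e in elems) / len(elems)
    return p, r, 2 * p * r / (p + r)

gold = [{"d1", "d2", "d3"}, {"d4", "d5"}]
system = [{"d1", "d2"}, {"d3", "d4", "d5"}]
print(b_cubed(system, gold))  # (0.733..., 0.733..., 0.733...)
```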
Other Metrics
• Counting pairs
  – Given a pair of documents, label it "link" or "unlink"
  – Problem: the # of pairs is quadratic in the cluster size
• Entropy
  – Low entropy in a cluster -> pure
• Edit distance
  – Distance from the system output to the correct output
Which Metrics to Use
• Constraints (borrowed from (Amigo+, 09)):
  – Homogeneity: the purer, the better
  – Completeness: the more complete, the better
  – Rag bag: putting noise into an already-noisy cluster is better than putting it into a pure cluster
  – Cluster size vs. quantity: a small error in a big cluster is better than a large number of small errors in small clusters
[Figures comparing the evaluation metrics against these constraints, borrowed from (Amigo+, 09)]
Baselines
[Table of baseline clustering results]
P-IP vs. B-Cubed: for Practical Data
• The Purity/Inverse-Purity measure is not appropriate in the soft-clustering case
  – It gives very high scores to a "cheat" baseline clustering (COMBINED in the table)
• The B-cubed measure is appropriate in this case
Available Corpus
• John Smith corpus (Bagga+, 98)
• 12 different people (Bekkerman+, 05)
• WePS corpus (Artiles+, 07) (Artiles+, 09)
  – WePS-1: 79 person names (49 training + 30 test), top 100 pages for each
  – WePS-2: 30 person names, top 150 pages for each
WePS (Web People Search) Workshops
(Artiles+, 07) (Artiles+, 09)
• Evaluation campaigns for person name disambiguation (along with person attribute extraction)
• WePS-1
  – With SemEval-2007
  – 16 teams participated
• WePS-2
  – With WWW-2009
  – 17 teams participated
References
• (Amigo+, 09) Enrique Amigó, Julio Gonzalo, Javier Artiles, Felisa Verdejo, A comparison of extrinsic clustering evaluation metrics based on formal constraints, Information Retrieval, v.12 n.4, pp. 461-486, August 2009
• (Artiles+, 07) Javier Artiles, Julio Gonzalo, Satoshi Sekine, The SemEval-2007 WePS evaluation: establishing a benchmark for the web people search task, Proceedings of the 4th International Workshop on Semantic Evaluations, pp. 64-69, June 23-24, 2007, Prague, Czech Republic
• (Artiles+, 09) J. Artiles, J. Gonzalo, and S. Sekine. WePS 2 Evaluation Campaign: overview of the Web People Search Clustering Task. 2nd Web People Search Evaluation Workshop (WePS 2009), 2009.
• (Bagga+, 98) Amit Bagga, Breck Baldwin, Entity-based cross-document coreferencing using the Vector Space Model, Proceedings of the 17th international conference on Computational linguistics, August 10-14, 1998, Montreal, Quebec, Canada
• (Balog+, 08) K. Balog, L. Azzopardi, and M. de Rijke. Personal name resolution of web people search. In WWW2008 Workshop: NLP Challenges in the Information Explosion Era (NLPIX 2008), 2008.
• (Balog+, 09) Krisztian Balog, Jiyin He, Katja Hofmann, Valentin Jijkoun, Christof Monz, Manos Tsagkias, Wouter Weerkamp and Maarten de Rijke, The University of Amsterdam at WePS2. 2nd Web People Search Evaluation Workshop (WePS 2009), 2009.
References
• (Bekkerman+, 05) Ron Bekkerman, Andrew McCallum, Disambiguating Web appearances of people in a social network, Proceedings of the 14th international conference on World Wide Web, May 10-14, 2005, Chiba, Japan
• (Bollegala+, 06) Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka, Extracting key phrases to disambiguate personal name queries in web search, Proceedings of the Workshop on How Can Computational Linguistics Improve Information Retrieval?, July 23, 2006, Sydney, Australia
• (Bunescu+, 06) R. Bunescu and M. Pasca. Using encyclopedic knowledge for named entity disambiguation. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL-06), 2006.
• (Chen+, 09) Ying Chen, Sophia Yat Mei Lee and Chu-Ren Huang, PolyUHK: A Robust Information Extraction System for Web Personal Names, 2nd Web People Search Evaluation Workshop (WePS 2009), 2009.
• (Chen+, 07) Ying Chen, James Martin, Towards Robust Unsupervised Personal Name Disambiguation, EMNLP-CoNLL 2007, pp. 190-198, 2007
• (Elmacioglu+, 07) Ergin Elmacioglu, Yee Fan Tan, Su Yan, Min-Yen Kan, Dongwon Lee, PSNUS: web people name disambiguation by simple clustering with rich features, Proceedings of the 4th International Workshop on Semantic Evaluations, pp. 268-271, June 23-24, 2007, Prague, Czech Republic
References
• (Fleischman+, 04) Fleischman, M.B. and E.H. Hovy, Multi-Document Person Name Resolution. Proceedings of the Reference Resolution Workshop at the 42nd Annual Meeting of the Association for Computational Linguistics (ACL). Barcelona, Spain, 2004
• (Gooi+, 04) Chung H. Gooi, James Allan, Cross-Document Coreference on a Large Scale Corpus, HLT-NAACL 2004: Main Proceedings, pp. 9-16, 2004
• (Han+, 04) Hui Han, C. Lee Giles, Hongyuan Zha, Cheng Li, Kostas Tsioutsiouliklis, Two supervised learning approaches for name disambiguation in author citations, JCDL 2004, pp. 296-305, 2004
• (Ikeda+, 09) M. Ikeda, S. Ono, I. Sato, M. Yoshida, and H. Nakagawa. Person Name Disambiguation on the Web by Two-Stage Clustering. 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference, 2009.
• (Kalashnikov+, 08) Dmitri V. Kalashnikov, Rabia Nuray-Turan, Sharad Mehrotra, Towards breaking the quality curse: a web-querying approach to web people search, Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, July 20-24, 2008, Singapore
• (Li+, 04) X. Li, P. Morie and D. Roth, Robust Reading: Identification and Tracing of Ambiguous Names. Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), pp. 17-24, 2004
References
• (Malin, 05) Bradley Malin, Unsupervised name disambiguation via social network similarity, In Workshop on Link Analysis, Counterterrorism, and Security, with SDM 2005
• (Murakami, 10) Hiroshi Ueda, Harumi Murakami, and Shoji Tatsumi, Suggesting Subject Headings using Web Information Sources, ... Conference on Agents and Artificial Intelligence (ICAART 2010), Volume 1: Artificial Intelligence, pp. 640-643, 2010.
• (Mann+, 03) Gideon S. Mann, David Yarowsky, Unsupervised personal name disambiguation, Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003, pp. 33-40, May 31, 2003, Edmonton, Canada
• (Nakagawa+, 03) H. Nakagawa and T. Mori. Automatic term recognition based on statistics of compound nouns and their components. Terminology, 9(2):201-219, 2003.
• (Niu+, 04) Cheng Niu, Wei Li, Rohini K. Srihari, Weakly supervised learning for cross-document person name disambiguation supported by information extraction, Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, p.597-es, July 21-26, 2004, Barcelona, Spain
• (Nuray-Turan+, 09) R. Nuray-Turan, Z. Chen, D. Kalashnikov, and S. Mehrotra. Exploiting web querying for web people search in WePS2. 2nd Web People Search Evaluation Workshop (WePS 2009), 2009.
References
• (On+, 07) B.-W. On and D. Lee. Scalable name disambiguation using multi-level graph partition. In Proc. of the SIAM SDM Conf., Minneapolis, Minnesota, USA, 2007
• (Ono+, 08) Shingo Ono, Issei Sato, Minoru Yoshida, Hiroshi Nakagawa, Person name disambiguation in web pages using social network, compound words and latent topics, Proceedings of the 12th Pacific-Asia conference on Advances in knowledge discovery and data mining, May 20-23, 2008, Osaka, Japan
• (Pedersen+, 05) Ted Pedersen, Amruta Purandare, Anagha Kulkarni, Name Discrimination by Clustering Similar Contexts, CICLing 2005, pp. 226-237, 2005
• (Resnick+, 94) Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, John Riedl, GroupLens: an open architecture for collaborative filtering of netnews, Proceedings of the 1994 ACM conference on Computer supported cooperative work, pp. 175-186, October 22-26, 1994, Chapel Hill, North Carolina, United States
• (Yoshida+, 10) Minoru Yoshida, Masaki Ikeda, Shingo Ono, Issei Sato, Hiroshi Nakagawa, Person name disambiguation by bootstrapping, In SIGIR '10: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pp. 10-17, 2010
• (Wan+, 05) Xiaojun Wan, Jianfeng Gao, Mu Li, Binggong Ding, Person resolution in person search results: WebHawk, Proceedings of the 14th ACM international conference on Information and knowledge management, October 31-November 05, 2005, Bremen, Germany