A Graph-based Approach to Named Entity Categorization
in Wikipedia Using Conditional Random Fields
Yotaro Watanabe, Masayuki Asahara and Yuji Matsumoto
Nara Institute of Science and Technology
EMNLP-CoNLL 2007
June 29, 2007
Prague, Czech Republic
Background

Named Entity
- Proper nouns (e.g. Shinzo Abe (Person), Prague (Location)), time/date expressions (e.g. June 29 (Date)) and numerical expressions (e.g. 10%)
- Named Entities play an important role in many NLP applications (e.g. IE, QA)

Named Entity Recognition (NER) task
- Treated as a sequential tagging problem
- Machine learning methods have been proposed
- Recall is usually low

A large-scale NE dictionary is useful for NER
- Semi-automatic methods to compile NE dictionaries are in demand
Resource for NE dictionary construction

Wikipedia
- Multilingual encyclopedia on the Web
- 382,613 gloss articles (Japanese version, as of June 20, 2007)
- Gloss indices are composed of nouns or proper nouns
- HTML (semi-structured text): lists (<LI>) and tables (<TABLE>) can be used as clues for NE type categorization
- Linked articles are glossed by anchor texts in articles
- Each article has one or more categories

Wikipedia has useful information for NE categorization and can be considered a suitable resource
Objective

Extract Named Entities by assigning proper NE labels to the gloss indices of Wikipedia (e.g. Person, Natural Object, Location, Organization, Product)
Use of Wikipedia features

Features of Wikipedia articles
- Anchors of an article refer to other related articles
- Anchors in list elements have dependencies on each other
=> Make 3 assumptions about the dependencies between anchors

[figure: an example of a list structure — "Burt Bacharach ... composer" (PERSON, VOCATION); "Dillard & Clark" and "Carpenters" (ORGANIZATION); nested under "Carpenters": "Karen Carpenter" (PERSON)]

Assumption 1: The latter element in a list item tends to be in an attribute relation to the former element
Assumption 2: The elements in the same itemization tend to be in the same NE category
Assumption 3: The nested element tends to be in a part-of relation to the upper element
Overview of our approach

Focus on the HTML list structure in Wikipedia
- Make 3 assumptions about the dependencies between anchors
- Formalize the NE categorization problem as labeling anchors in lists with NE classes
- Define 3 kinds of cliques (edges: Sibling, Cousin and Relative) between anchors based on the 3 assumptions
- Construct graphs based on the 3 defined cliques

CRFs for NE categorization in Wikipedia
- Define potential functions over the 3 edge types (and nodes) to provide a conditional distribution over the graphs
- Estimate the MAP label assignment over the graphs using Conditional Random Fields
Conditional Random Fields (CRFs)

Conditional Random Fields [Lafferty et al. 2001]
- Discriminative, undirected models
- Define the conditional distribution p(y|x)

Features
- Arbitrary features can be used
- Globally optimized over all possible label assignments
- Can deal with label dependencies by defining potential functions for cliques (2 or more nodes)

[figure: undirected graph over labels y1, y2, y3, ..., yn, all conditioned on x]

p(y | x) = (1/Z(x)) exp( Σ_{c ∈ C} Σ_k λ_k f_k(x_c, y_c) )

Z(x): partition function, C: cliques, f_k: feature function, λ_k: model parameter
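As a concrete illustration of the conditional distribution above, here is a toy CRF over two binary labels with a single clique, computed by brute-force enumeration. The potential values are invented for illustration; they are not learned parameters.

```python
import itertools
import math

# Toy CRF: two binary labels y1, y2 joined by a single clique, conditioned on
# a fixed observation. Potential values below are invented, not learned.
LABELS = [0, 1]

def clique_score(y1, y2):
    # exp(sum_k lambda_k f_k): here a single feature rewarding agreement
    return math.exp(1.0 if y1 == y2 else 0.0)

def node_score(y):
    # observation-dependent node potential: a fixed preference for label 1
    return math.exp(0.5 if y == 1 else 0.0)

def unnormalized(y1, y2):
    return clique_score(y1, y2) * node_score(y1) * node_score(y2)

# Z(x): partition function summing over all possible label assignments
Z = sum(unnormalized(a, b) for a, b in itertools.product(LABELS, repeat=2))

def p(y1, y2):
    return unnormalized(y1, y2) / Z

total = sum(p(a, b) for a, b in itertools.product(LABELS, repeat=2))
print(round(total, 6))  # 1.0 -- a proper conditional distribution
```

Enumerating all assignments is exponential in the number of nodes, which is why the deck later switches to approximate inference for larger graphs.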
Use of dependencies for categorization

The NE categorization problem as labeling anchors with NE classes

[figure: list with anchors "Dillard & Clark ... country rock", "Carpenters" and nested "Karen Carpenter"; filled nodes mark anchors whose article exists, open nodes anchors whose article does not exist]

- The edges of the constructed graphs correspond to particular dependencies
- Estimate the MAP label assignment over the constructed graphs using Conditional Random Fields
- Our formulation can extract anchors without gloss articles
Clique definition based on HTML tree structure

[figure: HTML tree for the list — a <UL> with <LI> items "Dillard & Clark ... country rock" and "Carpenters", the latter containing a nested <UL>/<LI> with "Karen Carpenter"; a Sibling edge joins the <A> anchors "Dillard & Clark" and "country rock", a Cousin edge joins "Dillard & Clark" and "Carpenters", and a Relative edge joins "Carpenters" and "Karen Carpenter"]

Sibling: the latter element tends to be an attribute or a concept of the former element
Cousin: the elements tend to have a common NE category (e.g. ORGANIZATION)
Relative: the latter element tends to be a constituent part of the former element

Use these 3 relations as cliques of CRFs
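The three relations can be read directly off the nested list structure. A minimal sketch follows; the list is modeled as plain Python data rather than parsed HTML, and the helper `extract_edges` is an assumed illustration, not the authors' code.

```python
# Each list item is modeled as (anchors_in_item, child_items); this mirrors
# the <UL>/<LI>/<A> tree of the running example from the slides.
toc = [
    (["Dillard & Clark", "country rock"], []),
    (["Carpenters"], [(["Karen Carpenter"], [])]),
]

def extract_edges(items):
    sibling, cousin, relative = [], [], []
    # Cousin: anchors heading the items of one itemization
    heads = [anchors[0] for anchors, _ in items if anchors]
    cousin += [(a, b) for i, a in enumerate(heads) for b in heads[i + 1:]]
    for anchors, children in items:
        # Sibling: consecutive anchors inside a single list item
        sibling += list(zip(anchors, anchors[1:]))
        # Relative: an item's head anchor -> heads of its nested items
        for child_anchors, _ in children:
            if anchors and child_anchors:
                relative.append((anchors[0], child_anchors[0]))
        s, c, r = extract_edges(children)
        sibling += s
        cousin += c
        relative += r
    return sibling, cousin, relative

siblings, cousins, relatives = extract_edges(toc)
print(siblings)   # [('Dillard & Clark', 'country rock')]
print(cousins)    # [('Dillard & Clark', 'Carpenters')]
print(relatives)  # [('Carpenters', 'Karen Carpenter')]
```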
A graph constructed from the 3 clique definitions

[figure: graph over the anchors of two list items — "Burt Bacharach ... 'On my own' ... 1986" and "Carpenters ... 'As Time Goes By' ... 2000", with nested items "Dillard & Clark", "Gene Clark" and "Karen Carpenter"; S (Sibling) edges link anchors within an item, C (Cousin) edges link anchors across items, R (Relative) edges link an item to its nested items]

Sibling: the latter element tends to be an attribute or a concept of the former element
Cousin: the elements tend to have a common attribute (e.g. ORGANIZATION)
Relative: the latter element tends to be a constituent part of the former element

Estimate the MAP label assignment over the graph
Model

p(y | x) = (1/Z(x)) ∏_{(v_i, v_j) ∈ E_S ∪ E_C ∪ E_R} Φ_SCR(y_i, y_j) ∏_{v_i ∈ V} Φ_V(y_i, x)

Φ_SCR(y_i, y_j) = exp( Σ_k λ_k f_k(y_i, y_j) )

Φ_V(y_i, x) = exp( Σ_{k'} λ_{k'} f_{k'}(y_i, x) )

Φ_SCR: potential function for Sibling, Cousin and Relative cliques
Φ_V: potential function for nodes

- The constructed graphs include cycles: exact inference is computationally expensive
- => Introduce Tree-based Reparameterization (TRP) [Wainwright 2003] for approximate inference

[figure: example graph with S, C and R edges]
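Under this factorization, the MAP assignment can be found by exhaustive enumeration on a tiny graph (the slides resort to TRP precisely because the real graphs are larger and cyclic). All potentials, labels and node preferences below are invented for illustration, not learned parameters.

```python
import itertools
import math

# MAP label assignment over a 3-node graph with one Relative and one Cousin
# edge, mirroring the "Carpenters" example; potentials are assumptions.
LABELS = ["PERSON", "ORGANIZATION"]
NODES = ["Carpenters", "Karen Carpenter", "Dillard & Clark"]
EDGES = [           # (i, j, edge type)
    (0, 1, "R"),    # Relative: "Karen Carpenter" nested under "Carpenters"
    (0, 2, "C"),    # Cousin: items of the same itemization
]

def phi_edge(yi, yj, etype):
    if etype == "C":  # Cousin: same NE category preferred
        return math.exp(1.0 if yi == yj else 0.0)
    if etype == "R":  # Relative: e.g. a PERSON as part of an ORGANIZATION
        return math.exp(1.0 if (yi, yj) == ("ORGANIZATION", "PERSON") else 0.0)
    return 1.0

def phi_node(node, y):
    # assumed local evidence standing in for the node feature functions
    prefs = {"Karen Carpenter": "PERSON", "Dillard & Clark": "ORGANIZATION"}
    return math.exp(1.0 if prefs.get(node) == y else 0.0)

def score(assign):  # unnormalized product of edge and node potentials
    s = 1.0
    for i, j, t in EDGES:
        s *= phi_edge(assign[i], assign[j], t)
    for i, node in enumerate(NODES):
        s *= phi_node(node, assign[i])
    return s

best = max(itertools.product(LABELS, repeat=len(NODES)), key=score)
print(best)  # ('ORGANIZATION', 'PERSON', 'ORGANIZATION')
```

Note that the Relative potential lets the part-of assumption pull "Karen Carpenter" toward PERSON even without local evidence on "Carpenters" itself.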
Experiments

The aims of the experiments are:
1. Compare the graph-based approach (relational) to the node-wise approach (independent) to investigate how relational classification improves classification accuracy
2. Investigate the effect of the defined cliques
3. Compare the CRF models to baseline models based on SVMs
4. Show the effectiveness of using marginal probability for filtering NE candidates
Dataset

- Randomly sampled 2300 articles (Japanese version as of October 2005)
- Anchors in list elements (<LI>) are hand-annotated with NE class labels
- We used the Extended Named Entity Hierarchy (Sekine et al. 2002)
- We reduced the number of classes to 13 from the original 200+ in order to avoid data sparseness
- Classification target: 16136 anchors (14285 of those are NEs)

NE Class          #
EVENT             121
PERSON            3315
UNIT              15
LOCATION          1480
FACILITY          2449
TITLE             42
ORGANIZATION      991
VOCATION          303
NATURAL_OBJECT    1132
PRODUCT           1664
NAME_OTHER        24
TIMEX/NUMEX       2749
OTHER             1851
ALL               16136
Experiments (CRFs)

To investigate which clique type contributes to classification accuracy:
- We construct models from all possible combinations of the defined cliques
- 8 models (SCR, SC, SR, CR, S, C, R, I)
- Classification is performed on each connected subgraph

[figure: example graphs for the SCR, SC, SR, CR, S, C, R and I models, each keeping only the corresponding edge types]
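Splitting a graph into connected subgraphs for a given combination of active edge types can be sketched with a small union-find; the node count and edges below are hypothetical.

```python
from collections import defaultdict

# For a chosen combination of active edge types (e.g. the SC model keeps only
# Sibling and Cousin edges), split the nodes into connected subgraphs so that
# each component can be classified independently.
def components(n_nodes, edges, active_types):
    parent = list(range(n_nodes))  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j, etype in edges:
        if etype in active_types:
            parent[find(i)] = find(j)

    groups = defaultdict(list)
    for v in range(n_nodes):
        groups[find(v)].append(v)
    return sorted(groups.values())

edges = [(0, 1, "S"), (1, 2, "C"), (3, 4, "R")]
print(components(5, edges, {"S", "C", "R"}))  # [[0, 1, 2], [3, 4]]  (SCR model)
print(components(5, edges, {"S"}))            # [[0, 1], [2], [3], [4]]  (S model)
```

The I model is the degenerate case with no active edge types: every node becomes its own component and is classified independently.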
Experimental settings (Baseline), Evaluation

Baseline: Support Vector Machines (SVMs) [Vapnik 1998]

We evaluate two models:
- I model: each anchor text is classified independently
- P model: anchor texts are ordered by their linear position in the HTML, and history-based classification is performed (the (j-1)-th classification result is used in the j-th classification)

For multi-class classification: one-versus-rest

Evaluation
- 5-fold cross validation, measured by F1-value
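The F1-value used for evaluation is the harmonic mean of per-class precision and recall; a minimal sketch with invented gold and predicted labels:

```python
# Per-class F1-value; the gold and predicted label sequences are invented
# for illustration only.
def f1(gold, pred, cls):
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = ["PER", "PER", "ORG", "LOC", "ORG"]
pred = ["PER", "ORG", "ORG", "LOC", "PER"]
print(f1(gold, pred, "PER"))  # 0.5
```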
Results (F1-value)

                     CRFs                                                          SVMs
             #       SCR     SC      SR      CR      S       C       R       I     P       I
ALL          14285   .7854   .7855   .7822   .7862   .7817   .7845   .7813   .7805  .7798   .7790
no article    3898   .5465   .5484   .5223   .5495   .5271   .5475   .5273   .5249  .5386   .5278

ALL: whole dataset; no article: anchors without articles

[figure: example graphs for the SCR, SC, SR, CR, S, C and R models (CRFs) and the P and I models (SVMs)]
Results (F1-value)
1. Graph-based vs. Node-wise

[results table as on the previous slide]

Performed a McNemar paired test on the labeling disagreements => the difference was significant (p < 0.01)
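The McNemar paired test compares the two models' disagreement counts; a sketch with the standard continuity-corrected statistic (the counts below are hypothetical, not the paper's numbers):

```python
# McNemar paired test on labeling disagreements, using the standard
# continuity-corrected chi-square statistic (1 degree of freedom).
# The disagreement counts below are hypothetical.
def mcnemar_statistic(b, c):
    """b: items only model A labels correctly; c: items only model B does."""
    return (abs(b - c) - 1) ** 2 / (b + c)

stat = mcnemar_statistic(120, 60)
# the chi-square critical value for p < 0.01 at 1 d.o.f. is about 6.635
print(stat > 6.635)  # True -> the difference is significant at the 1% level
```

Only the disagreements matter: items both models label correctly (or both incorrectly) cancel out of the statistic.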
Results (F1-value)
2. Which clique contributed most? => the Cousin clique

[results table as on the previous slide]

Cousin cliques provided the highest accuracy improvements compared to the Sibling and Relative cliques
Results (F1-value)
3. CRFs vs. SVMs

Significance test: McNemar paired test on labeling disagreements

[results table as on the previous slide]
Filtering NE candidates using marginal probability

Construct dictionaries from the extracted NE candidates
- Methods with lower cost are desirable
- Extract only confident NE candidates -> use the marginal probability provided by CRFs

Marginal probability
- The probability of a particular label assignment y_i for a node v_i ∈ V
- Can be regarded as the "confidence" of the classifier

p(y_i | x) = Σ_{y \ y_i} p(y | x)
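The marginalization above can be made concrete on a toy two-node model, summing the joint over all labelings of the other node; the potentials are invented assumptions, not the paper's model.

```python
import itertools
import math

# Marginal p(y_i | x) on a toy two-node model, obtained by summing the joint
# distribution over all labelings of the other node.
LABELS = ["PER", "ORG"]

def joint_unnorm(y):
    agree = math.exp(1.0 if y[0] == y[1] else 0.0)   # edge potential
    node = math.exp(0.7 if y[0] == "PER" else 0.0)   # node potential
    return agree * node

Z = sum(joint_unnorm(y) for y in itertools.product(LABELS, repeat=2))

def marginal(i, label):
    # p(y_i = label | x): sum over assignments that fix node i to `label`
    return sum(joint_unnorm(y) for y in itertools.product(LABELS, repeat=2)
               if y[i] == label) / Z

confidence = marginal(0, "PER")
print(confidence > 0.5)  # True -- node 0 is confidently labeled PER
```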
Precision-Recall Curve

Precision-Recall curve obtained by thresholding the marginal probability of the MAP estimation in the CR model of CRFs

[figure: precision-recall curve; at the marked operating point, recall is about 0.57 and precision is about 0.97]

With a proper threshold on the marginal probability, an NE dictionary can be constructed at lower cost
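Thresholding the marginal probability to trade recall for precision can be sketched as follows; the confidence values and gold labels are invented for illustration.

```python
# Keep only candidates whose confidence (marginal probability) exceeds a
# threshold; raising the threshold trades recall for precision.
def precision_recall(cands, threshold):
    kept = [ok for conf, ok in cands if conf >= threshold]
    tp = sum(kept)
    total_pos = sum(ok for _, ok in cands)
    prec = tp / len(kept) if kept else 1.0
    rec = tp / total_pos if total_pos else 0.0
    return prec, rec

# (confidence, is_correct_NE) pairs -- invented example data
cands = [(0.95, True), (0.9, True), (0.8, False), (0.6, True), (0.4, False)]
for t in (0.5, 0.85):
    p, r = precision_recall(cands, t)
    print(t, round(p, 2), round(r, 2))
```

Sweeping the threshold from low to high traces exactly the precision-recall curve shown on the slide.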
Summary and future work

Summary
- Proposed a method for categorizing NEs in Wikipedia
- Defined 3 kinds of cliques (Sibling, Cousin and Relative) over the HTML tree
- The graph-based model achieved significant improvements compared to the node-wise model and the baseline methods (SVMs)
- NEs can be extracted at lower cost by exploiting the marginal probability
Summary and future work

Future work
- Use fine-grained NE classes
  - For many NLP applications (e.g. QA, IE), an NE dictionary with fine-grained label sets will be a useful resource
  - Classification with statistical methods becomes difficult when the label set is large, because of insufficient positive examples
- Incorporate the hierarchical structure of label sets into our models (Hierarchical Classification)
  - Previous work suggests that exploiting the hierarchical structure of label sets improves classification accuracy

Thank you.