Graph-based Analysis of Semantic Drift in
Espresso-like Bootstrapping Algorithms
Mamoru Komachi†, Taku Kudo‡, Masashi Shimbo† and Yuji Matsumoto†
†Nara Institute of Science and Technology, Japan
‡Google Inc.
EMNLP 2008
2015/10/1
Bootstrapping and semantic drift
Bootstrapping grows a small set of seed instances into a larger set

[Diagram: the seed instance "iPhone" co-occurs with patterns such as "Buy # at Apple Store" and "# for sale"; these patterns in turn extract instances such as "MacBook Air" and "iPod" — but also irrelevant ones like "car" and "house".]

Generic patterns
= patterns that co-occur with many (irrelevant) instances
Main contributions of this work
1. Suggest a parallel between semantic drift in Espresso-like bootstrapping and topic drift in HITS (Kleinberg, 1999)
2. Solve semantic drift by graph kernels used in the link analysis community
Espresso Algorithm [Pantel and Pennacchiotti, 2006]
Repeat
  - Pattern extraction
  - Pattern ranking
  - Pattern selection
  - Instance extraction
  - Instance ranking
  - Instance selection
Until a stopping criterion is met
Espresso uses a pattern-instance matrix A for ranking patterns and instances
A is a |P|×|I|-dimensional matrix holding the (normalized) pointwise mutual information (pmi) between patterns and instances, with rows indexed by patterns 1..|P| and columns by instances 1..|I|:

  [A]_{p,i} = pmi(p,i) / max_{p,i} pmi(p,i)
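The normalized pmi matrix above can be sketched in a few lines of NumPy. The counts below are hypothetical toy values, not data from the paper:

```python
import numpy as np

# Toy co-occurrence counts (hypothetical): rows = patterns, columns = instances.
counts = np.array([[8.0, 5.0, 0.0],
                   [2.0, 0.0, 6.0]])

total = counts.sum()
p_marg = counts.sum(axis=1, keepdims=True) / total   # P(pattern)
i_marg = counts.sum(axis=0, keepdims=True) / total   # P(instance)
joint = counts / total                               # P(pattern, instance)

# pmi(p, i) = log P(p, i) / (P(p) P(i)); set to 0 where the pair never co-occurs.
with np.errstate(divide="ignore"):
    pmi = np.where(joint > 0, np.log(joint / (p_marg * i_marg)), 0.0)

# Espresso normalizes by the maximum pmi over all pattern-instance pairs.
A = pmi / pmi.max()
```

After normalization the largest entry of A is exactly 1, and pairs that never co-occur stay at 0.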
Pattern/instance ranking in Espresso
p = pattern score vector, i = instance score vector, A = pattern-instance matrix
Reliable instances are supported by reliable patterns, and vice versa:

  p ← (1/|I|) A i      ... pattern ranking
  i ← (1/|P|) Aᵀ p     ... instance ranking

|P| = number of patterns, |I| = number of instances;
1/|I| and 1/|P| are normalization factors that keep the score vectors from growing too large
Espresso Algorithm [Pantel and Pennacchiotti, 2006]
Repeat
  - Pattern extraction
  - Pattern ranking
  - Pattern selection
  - Instance extraction
  - Instance ranking
  - Instance selection
Until a stopping criterion is met

For graph-theoretic analysis, we will introduce 3 simplifications to Espresso
First simplification
Simplification 1: Remove the pattern/instance extraction steps.
Instead, pre-compute all patterns and instances once at the beginning of the algorithm.

Compute the pattern-instance matrix
Repeat
  - Pattern ranking
  - Pattern selection
  - Instance ranking
  - Instance selection
Until a stopping criterion is met
Second simplification
Simplification 2: Remove the pattern/instance selection steps, which retain only the highest-scoring k patterns / m instances for the next iteration (i.e., reset the scores of all other items to 0).
Instead, retain the scores of all patterns and instances.

Compute the pattern-instance matrix
Repeat
  - Pattern ranking
  - Instance ranking
Until a stopping criterion is met
Third simplification
Simplification 3: No early stopping, i.e., run until convergence.

Compute the pattern-instance matrix
Repeat
  - Pattern ranking
  - Instance ranking
Until score vectors p and i converge
Simplified Espresso
Input
  - Initial score vector of seed instances, e.g. i = (0, 1, 0, 0)
  - Pattern-instance co-occurrence matrix A
Main loop
Repeat
  p ← (1/|I|) A i      ... pattern ranking
  i ← (1/|P|) Aᵀ p     ... instance ranking
Until i and p converge
Output
  - Instance and pattern score vectors i and p
Simplified Espresso is HITS
Simplified Espresso = HITS in a bipartite graph whose adjacency matrix is A

Problem: the ranking vector i tends to the principal eigenvector of AᵀA as the iteration proceeds, regardless of the seed instances!
No matter which seed you start with, the same instance is always ranked topmost
→ Semantic drift (also called topic drift in HITS)
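The seed-independence just described is easy to verify numerically. The sketch below (using a hypothetical toy matrix A, not data from the paper) runs the simplified update from two different seeds and checks that both converge to the same ranking, the principal eigenvector of AᵀA:

```python
import numpy as np

# Hypothetical toy pattern-instance matrix A (|P| x |I|) of normalized pmi values.
A = np.array([[1.0, 0.8, 0.1],
              [0.2, 0.9, 0.7]])
n_p, n_i = A.shape

def simplified_espresso(seed, iters=200):
    """Run the simplified update p <- A i / |I|, i <- A^T p / |P|.
    The score vector is renormalized each round so that only its
    direction (i.e., the ranking) matters."""
    i = seed / np.linalg.norm(seed)
    for _ in range(iters):
        p = A @ i / n_i            # pattern ranking
        i = A.T @ p / n_p          # instance ranking
        i = i / np.linalg.norm(i)  # keep scores from vanishing
    return i

# Two different seed instances...
i1 = simplified_espresso(np.array([1.0, 0.0, 0.0]))
i2 = simplified_espresso(np.array([0.0, 0.0, 1.0]))

# ...converge to the same ranking vector (semantic drift).
assert np.allclose(i1, i2, atol=1e-6)
```

This is exactly power iteration on AᵀA, which is why the seed's contribution washes out.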
How about the original Espresso?
Original Espresso has heuristics not present in Simplified Espresso:
  - Early stopping
  - Pattern and instance selection
Do these heuristics really help reduce semantic drift?
Experiments on semantic drift
Do the heuristics in the original Espresso help reduce drift?
Word sense disambiguation task of Senseval-3 English Lexical Sample
Predict the sense of "bank":
  - "... the financial benefits of the bank (finance)'s employee package (cheap mortgages and pensions, etc), bring this up to ..."
  - "In that same year I was posted to South Shields on the south bank (bank of the river) of the River Tyne and quickly became aware that I had an enormous burden."
  - "Possibly aligned to water a sort of bank (???) by a rushing river."
Training instances are annotated with their sense; predict the sense of the target word in the test set.
Word sense disambiguation by Original Espresso
Seed instance = the instance whose sense is to be predicted
System output = k-nearest neighbors (k = 3)
Heuristics of Espresso:
  - Pattern and instance selection
    - # of patterns to retain: p = 20 (increase p by 1 on each iteration)
    - # of instances to retain: m = 100 (increase m by 100 on each iteration)
  - Early stopping
Convergence process of Espresso
[Figure: recall (0.5-1.0) vs. iteration (1-26) for Original Espresso, Simplified Espresso, and the most-frequent-sense baseline. Simplified Espresso drifts: it always outputs the most frequent sense regardless of input.]
The heuristics in Espresso help reduce semantic drift
(however, early stopping is required for optimal performance)
Learning curve of Original Espresso: per-sense breakdown
[Figure: recall (0.3-1.0) vs. iteration (1-26). The number of most-frequent-sense predictions increases over iterations, while recall for the other, infrequent senses worsens even with the original Espresso.]
Summary: Espresso and semantic drift
Semantic drift happens because
  - Espresso is designed like HITS
  - HITS gives the same ranking list regardless of seeds
Some heuristics reduce semantic drift
  - Early stopping is crucial for optimal performance
Still, these heuristics require many parameters to be calibrated, and calibration is difficult
Main contributions of this work
1. Suggest a parallel between semantic drift in Espresso-like bootstrapping and topic drift in HITS (Kleinberg, 1999)
2. Solve semantic drift by graph kernels used in the link analysis community
Q. What caused drift in Espresso?
A. Espresso's resemblance to HITS
HITS is an importance computation method (it gives a single ranking list for any seeds).
Why not use another type of link analysis measure, one which takes seeds into account?
→ a "relatedness" measure (it gives different rankings for different seeds)
The regularized Laplacian kernel
  - A relatedness measure
  - Takes higher-order relations into account
  - Has only one parameter
Graph Laplacian:
  L = D − A
  (A: adjacency matrix of the graph; D: (diagonal) degree matrix)
Regularized Laplacian matrix:
  R_β = Σ_{n=0}^{∞} β^n (−L)^n = (I + βL)^{−1}
  (β: parameter)
Each column of R_β gives the rankings relative to a node
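As a sketch (on a hypothetical toy graph, not one from the paper), the kernel and its seed-relative rankings can be computed directly from the closed form:

```python
import numpy as np

# Hypothetical small undirected graph given by its adjacency matrix.
A = np.array([[0.0, 1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
D = np.diag(A.sum(axis=1))   # (diagonal) degree matrix
L = D - A                    # graph Laplacian

beta = 1e-2
# Closed form of the series sum_{n>=0} beta^n (-L)^n: the series converges
# when beta * (largest eigenvalue of L) < 1 (true here), and the closed
# form (I + beta L)^{-1} is valid for any beta > 0 since L is PSD.
R = np.linalg.inv(np.eye(len(A)) + beta * L)

# Column j of R ranks every node by relatedness to seed node j,
# so different seeds yield different rankings.
ranking_from_node0 = R[:, 0]
```

Unlike the HITS-style iteration, each node is most related to itself, and the ranking changes with the chosen seed column.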
Evaluation of regularized Laplacian
Comparison to (Agirre et al. 2006)
WSD on all nouns in Senseval-3
Espresso needs optimal stopping to achieve equivalent performance

  Algorithm                          F measure
  Most frequent sense (baseline)     54.5
  HyperLex                           64.6
  PageRank                           64.6
  Simplified Espresso                44.1
  Espresso (after convergence)       46.9
  Espresso (optimal stopping)        66.5
  Regularized Laplacian (β=10⁻²)     67.1

The regularized Laplacian outperforms the other graph-based methods
Conclusions
  - Semantic drift in Espresso is a parallel form of topic drift in HITS
  - The regularized Laplacian reduces semantic drift
    - it is inherently a relatedness measure (≠ importance measure)
Future work
  - Investigate if a similar analysis is applicable to a wider class of bootstrapping algorithms (including co-training)
  - Try other popular tasks of bootstrapping, such as named entity extraction
    - Selection of seed instances matters
Label prediction of "bank" (F measure)
Espresso suffers from semantic drift (unless stopped at the optimal stage)

  Algorithm                          Most frequent sense   Other senses
  Simplified Espresso                100.0                 0.0
  Espresso (after convergence)       100.0                 30.2
  Espresso (optimal stopping)        94.4                  67.4
  Regularized Laplacian (β=10⁻²)     92.1                  62.8

The regularized Laplacian keeps high recall for infrequent senses
Regularized Laplacian is mostly stable across values of its parameter β
Pattern/instance ranking in Espresso
Score for pattern p:
  r(p) = (1/|I|) Σ_{i∈I} [ pmi(i,p) / max pmi ] · r(i)
Score for instance i:
  r(i) = (1/|P|) Σ_{p∈P} [ pmi(i,p) / max pmi ] · r(p)
where
  p: pattern, i: instance
  P: set of patterns, I: set of instances
  pmi: pointwise mutual information
  max pmi: max of pmi over all patterns and instances
  pmi(i,p) = log( |x, p, y| / (|x, *, y| |*, p, *|) )
  (instance i is the pair (x, y))
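The pmi formula above can be evaluated directly from co-occurrence counts; the counts below are purely illustrative, not data from the paper:

```python
import math

# Pointwise mutual information between instance i = (x, y) and pattern p,
# from hypothetical corpus counts:
joint_count = 5.0       # |x, p, y|: times x and y co-occur via pattern p
instance_count = 20.0   # |x, *, y|: times the pair (x, y) occurs with any pattern
pattern_count = 50.0    # |*, p, *|: times pattern p occurs with any instance pair

pmi = math.log(joint_count / (instance_count * pattern_count))
```

A higher joint count relative to the two marginal counts gives a higher pmi, i.e., a stronger pattern-instance association.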