Graph-based Analysis of Semantic Drift in Espresso-like Bootstrapping Algorithms Mamoru Komachi†, Taku Kudo‡, Masashi Shimbo† and Yuji Matsumoto† †Nara Institute of Science and Technology, Japan ‡Google Inc. EMNLP 2008 2015/10/1 Bootstrapping and semantic drift Bootstrapping grows small amount of seed instances Seed instance iPhone Co-occurrence pattern Extracted instance MacBook Air Buy # at Apple Store iPod iPod # for sale car house Generic patterns = Patterns which co-occur with many (irrelevant) instances 2 10/1/2015 Main contributions of this work 1. Suggest a parallel between semantic drift in Espressolike bootstrapping and topic drift in HITS (Kleinberg, 1999) 2. Solve semantic drift by graph kernels used in link analysis community 3 10/1/2015 Espresso Algorithm [Pantel and Pennacchiotti, 2006] Repeat Pattern extraction Pattern ranking Pattern selection Instance extraction Instance ranking Instance selection Until a stopping criterion is met 4 10/1/2015 Espresso uses pattern-instance matrix A for ranking patterns and instances |P|×|I|-dimensional matrix holding the (normalized) pointwise mutual information (pmi) between patterns and instances instance indices 1 2 ... i ... |I| 1 2 pattern indices : p : |P| [A]p,i = pmi(p,i) / maxp,i pmi(p,i) 5 10/1/2015 Pattern/instance ranking in Espresso p = pattern score vector Reliable instances are supported by i = instance score vector reliable patterns, and vice versa A = pattern-instance matrix 1 p Ai ... pattern ranking I 1 T i A p ... instance ranking P |P| = number of patterns |I| =number of instances 6 normalization factors to keep score vectors not too large 10/1/2015 Espresso Algorithm [Pantel and Pennacchiotti, 2006] Repeat Pattern extraction Pattern ranking Pattern selection Instance extraction Instance ranking Instance selection For graph-theoretic analysis, we will introduce 3 simplifications to Espresso Until a stopping criterion is met 7 10/1/2015 First simplification Compute the pattern-instance matrix Repeat Pattern extraction Pattern ranking Pattern selection Instance extraction Instance ranking Instance selection Simplification 1 Remove pattern/instance extraction steps Instead, pre-compute all patterns and instances once in the beginning of the algorithm Until a stopping criterion is met 8 10/1/2015 Second simplification Compute the pattern-instance matrix Repeat Simplification 2 Remove pattern/instance selection steps which retain only highest scoring k patterns Pattern ranking Pattern selection / m instances for the next iteration i.e., reset the scores of other items to 0 Instance ranking Instead, Instance selection retain scores of all patterns and instances Until a stopping criterion is met 9 10/1/2015 Third simplification Compute the pattern-instance matrix Repeat Pattern ranking Instance ranking Simplification 3 No early stopping i.e., run until convergence Untilscore a stopping criterion met Until vectors p and iisconverge 10 10/1/2015 Simplified Espresso Input Initial score vector of seed instances i 0,1,0,0 Pattern-instance co-occurrence matrix A Main loop Repeat p 1 Ai ranking ... pattern I 1 i AT p ... instance ranking P Until i and p converge Output Instance and pattern score vectors i and p 11 10/1/2015 Simplified Espresso is HITS Simplified Espresso = HITS in a bipartite graph whose adjacency matrix is A Problem The ranking vector i tends to the principal eigenvector of ATA as the iteration proceeds regardless of the seed instances! No matter which seed you start with, the same instance is always ranked topmost Semantic drift (also called topic drift in HITS) 12 10/1/2015 How about the original Espresso? Original Espresso has heuristics not present in Simplified Espresso Early stopping Pattern and instance selection Do these heuristics really help reduce semantic drift? 13 10/1/2015 Experiments on semantic drift Does the heuristics in original Espresso help reduce drift? 14 10/1/2015 Word sense disambiguation task of Senseval-3 English Lexical Sample Predict the sense of “bank” … the financial benefits of the bank (finance) 's employee package ( cheap mortgages and pensions, etc ) , bring this up to … In that same year I was posted to South Shields on the south bank (bank of the river) of the River Tyne and quickly became aware that I had an enormous burden Possibly aligned to water a sort of bank(???) by a rushing river. 15 Training instances are annotated with their sense Predict the sense of target word in the test set 10/1/2015 Word sense disambiguation by Original Espresso Seed instance Seed instance = the instance to predict its sense System output = k-nearest neighbor (k=3) Heuristics of Espresso Pattern and instance selection # of patterns to retain p=20 (increase p by 1 on each iteration) # of instance to retain m=100 (increase m by 100 on each iteration) Early stopping 16 10/1/2015 Convergence process of Espresso 1 Heuristics in Espresso helps reducing semantic drift (However, early stopping is required for optimal performance) Recall 0.9 Original Espresso 0.8 Simplified Espresso 0.7 Most frequent sense (baseline) Semantic drift occurs (always outputs the most frequent sense) Output the most frequent sense regardless of input 0.6 0.5 1 17 6 11 16 Iteration 21 10/1/2015 26 Learning curve of Original Espresso: per-sense breakdown 1 Most frequent sense 0.9 # of most frequent sense predictions increases Recall 0.8 0.7 Recall for infrequent senses worsens even with original Espresso 0.6 0.5 0.4 Other senses 0.3 1 18 6 11 16 Iteration 21 10/1/2015 26 Summary: Espresso and semantic drift Semantic drift happens because Espresso is designed like HITS HITS gives the same ranking list regardless of seeds Some heuristics reduce semantic drift Early stopping is crucial for optimal performance Still, these heuristics require many parameters to be calibrated but calibration is difficult 19 10/1/2015 Main contributions of this work 1. Suggest a parallel between semantic drift in Espressolike bootstrapping and topic drift in HITS (Kleinberg, 1999) 2. Solve semantic drift by graph kernels used in link analysis community 20 10/1/2015 Q. What caused drift in Espresso? A. Espresso's resemblance to HITS HITS is an importance computation method (gives a single ranking list for any seeds) Why not use a method for another type of link analysis measure - which takes seeds into account? "relatedness" measure (it gives different rankings for different seeds) 21 10/1/2015 The regularized Laplacian kernel A relatedness measure Takes higher-order relations into account Has only one parameter Graph Laplacian Regularized Laplacian matrix L D A A: adjacency matrix of the graph D: (diagonal) degree matrix n n 1 R ( L ) ( I L ) n 0 β: parameter Each column of Rβ gives the rankings relative to a node 22 10/1/2015 Evaluation of regularized Laplacian Comparison to (Agirre et al. 2006) 23 10/1/2015 WSD on all nouns in Senseval-3 Espresso needs optimal stopping to achieve an equivalent performance algorithm F measure Most frequent sense (baseline) 54.5 HyperLex 64.6 PageRank 64.6 Simplified Espresso 44.1 Espresso (after convergence) 46.9 Espresso (optimal stopping) 66.5 Regularized Laplacian (β=10-2) 67.1 Outperforms other graph-based methods 24 10/1/2015 Conclusions Semantic drift in Espresso is a parallel form of topic drift in HITS The regularized Laplacian reduces semantic drift 25 inherently a relatedness measure ( importance measure) 10/1/2015 Future work Investigate if a similar analysis is applicable to a wider class of bootstrapping algorithms (including co-training) Try other popular tasks of bootstrapping such as named entity extraction 26 Selection of seed instances matters 10/1/2015 27 10/1/2015 Label prediction of “bank” (F measure) Espresso suffers from semantic drift (unless stopped at optimal stage) Algorithm Most frequent sense Other senses Simplified Espresso 100.0 0.0 Espresso (after convergence) 100.0 30.2 Espresso (optimal stopping) 94.4 67.4 Regularized Laplacian (β=10-2) 92.1 62.8 The regularized Laplacian keeps high recall for infrequent senses 28 10/1/2015 Regularized Laplacian is mostly stable across a parameter 29 10/1/2015 Pattern/instance ranking in Espresso Score for pattern p pmi ( i,p ) r i) ( max pmi i I r ( p ) I Score for instance i pmi (i,p ) r ) (p max pmi p P r i) ( P p: pattern i: instance P: set of patterns I: set of instances pmi: pointwise mutual information max pmi: max of pmi in all the patterns and instances x ,p ,y pmi ( i ,p ) log x ,*, y *, p ,* 30 08.10.24
© Copyright 2025 ExpyDoc