A Graph-based Approach to Named Entity Categorization in Wikipedia Using Conditional Random Fields
Yotaro Watanabe, Masayuki Asahara and Yuji Matsumoto
Nara Institute of Science and Technology
EMNLP-CoNLL 2007, June 29, Prague, Czech Republic

Background
- Named Entities (NEs): proper nouns (e.g. Shinzo Abe (Person), Prague (Location)), time/date expressions (e.g. June 29 (Date)) and numerical expressions (e.g. 10%)
- NEs play an important role in many NLP applications (e.g. IE, QA)
- The Named Entity Recognition (NER) task is usually treated as a sequential tagging problem; many machine learning methods have been proposed, but recall is usually low
- A large-scale NE dictionary is useful for NER, so semi-automatic methods to compile NE dictionaries are in demand

Resource for NE dictionary construction: Wikipedia
- A multi-lingual encyclopedia on the Web with 382,613 gloss articles (Japanese version, as of June 20, 2007)
- Gloss indices are composed of nouns or proper nouns
- Articles are HTML (semi-structured text): lists (<LI>) and tables (<TABLE>) can be used as clues for NE type categorization
- Linked articles are glossed by anchor texts in articles, and each article has one or more categories
=> Wikipedia has useful information for NE categorization and can be considered a suitable resource

Objective
- Extract Named Entities by assigning proper NE labels (e.g. Person, Location, Organization, Natural Object, Product) to the gloss indices of Wikipedia

Use of Wikipedia features
- Anchors in an article refer to other related articles, and anchors in list elements have dependencies on each other
- Example list structure: an item "Burt Bacharach ... composer" (PERSON followed by VOCATION), sibling items "Dillard & Clark" and "Carpenters" (both ORGANIZATION), and "Karen Carpenter" (PERSON) nested under "Carpenters"
=> We make 3 assumptions about the dependencies between anchors:
- Assumption 1: the latter element in a list item tends to be in an attribute relation to the former element
- Assumption 2: the elements in the same itemization tend to be in the same NE category
- Assumption 3: a nested element tends to be in a part-of relation to the upper element

Overview of our approach
- Focus on the HTML list structure in Wikipedia and make the 3 assumptions above about dependencies between anchors
- Formalize the NE categorization problem as labeling anchors in lists with NE classes
- Define 3 kinds of cliques (edges: Sibling, Cousin and Relative) between anchors based on the 3 assumptions, and construct graphs from these cliques
- CRFs for NE categorization in Wikipedia: define potential functions over the 3 edge types (and the nodes) to provide a conditional distribution over the graphs, then estimate the MAP label assignment over the graphs using Conditional Random Fields

Conditional Random Fields (CRFs)
- Conditional Random Fields [Lafferty 2001] are discriminative, undirected models that define the conditional distribution p(y|x)
- Arbitrary features can be used, and training globally optimizes over all possible label assignments
- Label dependencies can be handled by defining potential functions over cliques (of 2 or more nodes):

    p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\Big( \sum_{c \in C} \sum_{k} \lambda_k f_k(\mathbf{x}_c, \mathbf{y}_c) \Big)

  where Z(x) is the partition function, C the set of cliques, f_k a feature function and \lambda_k a model parameter

Use of dependencies for categorization
- The NE categorization problem is one of labeling anchors with NE classes; some anchors have gloss articles (e.g. "Dillard & Clark ... country rock", "Carpenters"), while others do not (e.g. "Karen Carpenter")
- Each edge of the constructed graphs corresponds to a particular dependency
- We estimate the MAP label assignment over the constructed graphs using Conditional Random Fields
- Our formulation can also extract anchors that have no gloss article

Clique definition based on HTML tree structure
- Example subtree: a <UL> containing <LI> "Dillard & Clark ... country rock" and <LI> "Carpenters", with a nested <UL>/<LI> "Karen Carpenter" under "Carpenters"
- Sibling: the latter element tends to be an attribute or a concept of the former element
- Cousin: the elements tend to have a common NE category (e.g. ORGANIZATION)
- Relative: the latter element tends to be a constituent part of the former element
=> Use these 3 relations as the cliques of the CRFs
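To make these clique definitions concrete, here is a small Python sketch (not the authors' code) that derives Sibling, Cousin and Relative edges from a parsed itemization; the item representation and the choice to link consecutive cousin pairs are illustrative assumptions, not details taken from the paper.

```python
# A sketch of extracting Sibling, Cousin and Relative edges from a parsed
# HTML itemization. Assumed (hypothetical) representation: an itemization is
# a list of items, each item a dict with its anchor texts in document order
# and any nested itemizations.

def extract_edges(itemization):
    edges = []
    for item in itemization:
        anchors = item["anchors"]
        # Sibling: consecutive anchors within one <LI> (attribute relation)
        edges += [(a, b, "SIBLING") for a, b in zip(anchors, anchors[1:])]
        for child in item.get("children", []):
            # Relative: the item's head anchor vs. the head anchors of the
            # elements nested under it (part-of relation)
            if anchors:
                edges += [(anchors[0], c["anchors"][0], "RELATIVE")
                          for c in child if c["anchors"]]
            edges += extract_edges(child)
    # Cousin: head anchors of items in the same itemization (same category);
    # linking consecutive heads is an assumption made for this sketch
    heads = [item["anchors"][0] for item in itemization if item["anchors"]]
    edges += [(a, b, "COUSIN") for a, b in zip(heads, heads[1:])]
    return edges

# the <UL> from the slide: two <LI> items, one with a nested itemization
ul = [
    {"anchors": ["Dillard & Clark", "country rock"], "children": []},
    {"anchors": ["Carpenters"],
     "children": [[{"anchors": ["Karen Carpenter"], "children": []}]]},
]
for edge in extract_edges(ul):
    print(edge)
# -> Sibling(Dillard & Clark, country rock),
#    Relative(Carpenters, Karen Carpenter),
#    Cousin(Dillard & Clark, Carpenters)
```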
A graph constructed from the 3 clique definitions
[Figure: an example graph over the anchors "Burt Bacharach ... 'On My Own' ... 1986", "Dillard & Clark", "Gene Clark", "Carpenters ... 'As Time Goes By' ... 2000" and "Karen Carpenter", with edges marked S (Sibling), C (Cousin) and R (Relative)]
- Sibling: the latter element tends to be an attribute or a concept of the former element
- Cousin: the elements tend to have a common attribute (e.g. ORGANIZATION)
- Relative: the latter element tends to be a constituent part of the former element
- Estimate the MAP label assignment over the graph

Model

    p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{(v_i, v_j) \in E_S \cup E_C \cup E_R} \Phi_{SCR}(y_i, y_j) \prod_{v_i \in V} \Phi_V(y_i, \mathbf{x})

    \Phi_{SCR}(y_i, y_j) = \exp\Big( \sum_{k} \lambda_k f_k(y_i, y_j) \Big), \qquad \Phi_V(y_i, \mathbf{x}) = \exp\Big( \sum_{k'} \lambda_{k'} f_{k'}(y_i, \mathbf{x}) \Big)

- \Phi_{SCR}: potential function for the Sibling, Cousin and Relative cliques; \Phi_V: potential function for the nodes
- The constructed graphs include cycles, so exact inference is computationally expensive => we introduce Tree-based Reparameterization (TRP) [Wainwright 2003] for approximate inference (a toy enumeration sketch of this distribution follows the evaluation settings below)

Experiments
The aims of the experiments are:
1. Compare the graph-based (relational) approach to the node-wise (independent) approach, to investigate how relational classification improves classification accuracy
2. Investigate the effect of the defined cliques
3. Compare the CRF models to baseline models based on SVMs
4. Show the effectiveness of using marginal probability for filtering NE candidates

Dataset
- 2,300 randomly sampled articles (Japanese version as of October 2005); anchors in list elements (<LI>) are hand-annotated with NE class labels
- NE classes follow the Extended Named Entity Hierarchy (Sekine et al. 2002), reduced from the original 200+ classes to 13 to avoid data sparseness
- Classification targets: 16,136 anchors (14,285 of those are NEs)

    NE class          #
    EVENT             121
    PERSON            3315
    UNIT              15
    LOCATION          1480
    FACILITY          2449
    TITLE             42
    ORGANIZATION      991
    VOCATION          303
    NATURAL_OBJECT    1132
    PRODUCT           1664
    NAME_OTHER        24
    TIMEX/NUMEX       2749
    OTHER             1851
    ALL               16136

Experiments (CRFs)
- To investigate which clique type contributes to classification accuracy, we construct models from all possible combinations of the defined cliques: 8 models (SCR, SC, SR, CR, S, C, R, I)
- Classification is performed on each connected subgraph
[Figure: example graphs for the SCR, SC, SR, CR, S, C, R and I models]

Experimental settings (baseline) and evaluation
- Baseline: Support Vector Machines (SVMs) [Vapnik 1998], in two configurations:
  - I model: each anchor text is classified independently
  - P model: anchor texts are ordered by their linear position in the HTML and classified history-based (the (j-1)-th classification result is used in the j-th classification)
- Multi-class classification: one-versus-rest
- Evaluation: 5-fold cross validation, measured by F1-value
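As a concrete illustration of the model above (not the authors' implementation), the following Python sketch scores label assignments for a tiny anchor graph using hand-set node and edge log-potentials, then computes p(y|x) and the MAP assignment by brute-force enumeration; exactly this exhaustive computation is what becomes intractable on larger cyclic graphs, which is why the paper uses TRP.

```python
import itertools
import math

# Toy pairwise CRF over three anchors (a sketch, not the paper's code).
# Log-potentials are hand-set here; in the real model they are weighted
# feature sums, log Phi = sum_k lambda_k * f_k(...).

LABELS = ["PERSON", "ORGANIZATION", "VOCATION"]

# log Phi_V(y_i, x): node potentials for "Burt Bacharach", "Carpenters",
# "Karen Carpenter" (all values illustrative only)
LOG_NODE = [
    {"PERSON": 1.2, "ORGANIZATION": 0.1, "VOCATION": 0.3},
    {"PERSON": 0.2, "ORGANIZATION": 0.9, "VOCATION": 0.4},
    {"PERSON": 0.8, "ORGANIZATION": 0.5, "VOCATION": 0.1},
]

# edges (i, j, type); log Phi_SCR(y_i, y_j) depends on the clique type
EDGES = [(0, 1, "COUSIN"), (1, 2, "RELATIVE")]

def log_edge(edge_type, yi, yj):
    if edge_type == "COUSIN":    # same itemization -> same NE category
        return 1.0 if yi == yj else 0.0
    if edge_type == "RELATIVE":  # nested element -> part-of (e.g. band member)
        return 0.8 if (yi, yj) == ("ORGANIZATION", "PERSON") else 0.0
    if edge_type == "SIBLING":   # latter element is an attribute of the former
        return 0.8 if (yi, yj) == ("PERSON", "VOCATION") else 0.0
    return 0.0

def log_score(y):
    s = sum(LOG_NODE[i][yi] for i, yi in enumerate(y))
    return s + sum(log_edge(t, y[i], y[j]) for i, j, t in EDGES)

# p(y|x) = exp(log_score(y)) / Z(x), Z(x) summed over all assignments.
# Enumeration is exponential in the number of nodes: fine for a toy graph,
# intractable in general, hence approximate inference (TRP) in the paper.
assignments = list(itertools.product(LABELS, repeat=len(LOG_NODE)))
Z = sum(math.exp(log_score(y)) for y in assignments)
y_map = max(assignments, key=log_score)
print("MAP:", y_map, " p(MAP|x) = %.3f" % (math.exp(log_score(y_map)) / Z))
```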
Results (F1-value)

                        CRFs                                                     SVMs
                #      SCR    SC     SR     CR     S      C      R      I       P      I
    ALL        14285  .7854  .7855  .7822  .7862  .7817  .7845  .7813  .7805   .7798  .7790
    no article  3898  .5465  .5484  .5223  .5495  .5271  .5475  .5273  .5249   .5386  .5278

(ALL: whole dataset; no article: anchors without gloss articles)

1. Graph-based vs. node-wise: we performed a McNemar paired test on labeling disagreements; the difference was significant (p < 0.01)
2. Which clique contributed most? The Cousin clique: Cousin cliques provided the highest accuracy improvement compared to the Sibling and Relative cliques
3. CRFs vs. SVMs: the CRF models score above both SVM baselines (significance again assessed with the McNemar paired test on labeling disagreements)

Filtering NE candidates using marginal probability
- We construct dictionaries from the extracted NE candidates, and lower-cost methods are desirable, so we extract only the confident NE candidates, using the marginal probability provided by the CRFs
- The marginal probability of a particular label assignment for a node v_i \in V can be regarded as the "confidence" of the classifier:

    p(y_i \mid \mathbf{x}) = \sum_{\mathbf{y} \setminus y_i} p(\mathbf{y} \mid \mathbf{x})

Precision-Recall curve
[Figure: precision-recall curve obtained by thresholding the marginal probability of the MAP estimation in the CR model of the CRFs; at the marked point, recall is about 0.57 and precision is about 0.97]
- With a proper threshold on the marginal probability, an NE dictionary can be constructed at lower cost
Summary
- Proposed a method for categorizing NEs in Wikipedia
- Defined 3 kinds of cliques (Sibling, Cousin and Relative) over the HTML tree
- The graph-based model achieved significant improvements compared to the node-wise model and the baseline methods (SVMs)
- NEs can be extracted at lower cost by exploiting the marginal probability

Future work
- Use fine-grained NE classes: for many NLP applications (e.g. QA, IE), an NE dictionary with a fine-grained label set will be a useful resource, but classification with statistical methods becomes difficult when the label set is large, because of insufficient positive examples
- Incorporate the hierarchical structure of the label set into our models (hierarchical classification): previous work suggests that exploiting the hierarchical structure of label sets improves classification accuracy

Thank you.