Probabilistic Semantic Similarity Measurements for Noisy Short Texts Using Wikipedia Entities
Masumi Shirakawa (1), Kotaro Nakayama (2), Takahiro Hara (1), Shojiro Nishio (1)
(1) Osaka University, Osaka, Japan  (2) University of Tokyo, Tokyo, Japan

1-3 Challenge in short text analysis
Statistics are not always enough. Consider two short texts:
- "A year and a half after Google pulled its popular search engine out of mainland China"
- "Baidu and Microsoft did not disclose terms of the agreement"
Both are talking about search engines and China, yet they share no common words. How can machines know that the two sentences mention a similar topic?

4 Reasonable solution
Use external knowledge, e.g., the Wikipedia Thesaurus [Nakayama06], to bridge the vocabulary gap between the two sentences.

5 Related work
ESA: Explicit Semantic Analysis [Gabrilovich07] adds Wikipedia articles (entities) to a text as its semantic representation:
1. Key term extraction: extract terms t (e.g., "Apple", "product", "sells", "new") from the input text T ("Apple sells a new product").
2. Related entity finding: get the Wikipedia search ranking for each term, i.e., Wikipedia articles c with scores (e.g., pear, iPhone, pricing, iPad, Apple Inc., business).
3. Aggregation: simply sum up the scores, producing a ranked list of entities c (e.g., Apple Inc., iPhone, pear).

6 Problems in real-world noisy short texts
"Noisy" means semantically noisy in this work (we do not handle informal or casual surface forms, or misspellings).
- Term ambiguity: Apple (the fruit) should not be related to Microsoft.
- Fluctuation of term dominance: a term is not always important in a text.
We explore a more effective aggregation method.

7 Probabilistic method
We propose extended naïve Bayes to aggregate related entities. From an input text T ("Apple sells a new product") to related entities c:
1. Key term extraction: estimate P(t ∈ T), the probability that term t is a key term of T [Mihalcea07] [Milne08].
2. Related entity finding: estimate P(e|t) and P(c|e) [Nakayama06], yielding P(c|t) for candidate entities (pear, iPhone, pricing, iPad, Apple Inc., ...).
3. Aggregation [Song11, extended]:
   P(c|T) = Π_{k=1}^{K} [ P(t_k ∈ T) P(c|t_k) + (1 − P(t_k ∈ T)) P(c) ] / P(c)^{K−1}
The output is a ranked list of entities c (Apple Inc., iPhone, iPad, ...).

8-9 When input is multiple terms
Apply naïve Bayes [Song11] to multiple terms t_1, ..., t_K to obtain related entity c using each probability P(c|t_k):
   P(c|t_1, ..., t_K) = P(t_1, ..., t_K|c) P(c) / P(t_1, ..., t_K)
                      = P(c) Π_k P(t_k|c) / P(t_1, ..., t_K)
                      = Π_k P(c|t_k) / P(c)^{K−1}
For example, with t_1 = "Apple", t_2 = "product", ..., t_K = "new", we compute P(c|t_1, ..., t_K) for each related entity c, such as c = "iPhone". By using naïve Bayes, entities that are related to multiple terms are boosted.

10 When input is text
The input is not "multiple terms" but "text," i.e., we do not know which terms are key terms. We developed extended naïve Bayes to solve this problem.
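The naïve Bayes aggregation over known key terms can be sketched as follows. This is a minimal illustration, not the authors' implementation; the probability tables are toy numbers, and backing off to the prior P(c) for entities missing from a term's table is an assumption for keeping the product well-defined.

```python
import math

def naive_bayes_aggregate(term_entity_probs, entity_prior):
    """Rank entities c by P(c|t_1, ..., t_K) = prod_k P(c|t_k) / P(c)^(K-1),
    computed in log space for numerical stability."""
    K = len(term_entity_probs)
    ranked = {}
    for c, prior in entity_prior.items():
        log_score = -(K - 1) * math.log(prior)
        for probs in term_entity_probs:
            # Assumption: back off to the prior P(c) when a term carries
            # no information about c (i.e., c is missing from its table).
            log_score += math.log(probs.get(c, prior))
        ranked[c] = log_score
    return sorted(ranked.items(), key=lambda kv: kv[1], reverse=True)

# Toy probabilities (illustrative numbers, not from the paper):
terms = [{"iPhone": 0.3, "pear": 0.2},   # P(c|"Apple")
         {"iPhone": 0.4}]                # P(c|"product")
priors = {"iPhone": 0.01, "pear": 0.01}  # P(c)
print(naive_bayes_aggregate(terms, priors))
```

Because both terms support "iPhone" but only one supports "pear", dividing by P(c)^{K−1} boosts the entity shared across terms, which is exactly the effect described on the slide.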
In a text we cannot observe which terms are the key terms (t_1 = "Apple", t_2 = "product", ..., t_K = "new").

11-13 Extended naïve Bayes
Every term in the text (Apple, product, ..., new) is a candidate key term. Let T' denote a possible state of the set of key terms T, e.g., T' = {t_1}, T' = {t_1, t_2}, ..., T' = {t_1, ..., t_K}, and let P(T = T') be the probability that the set of key terms T is the state T'. Applying naïve Bayes to each state T' and marginalizing over all states:
   P(c|T) = Σ_{T'} P(c|T') P(T = T')
          = Π_{k=1}^{K} [ P(t_k ∈ T) P(c|t_k) + (1 − P(t_k ∈ T)) P(c) ] / P(c)^{K−1}
In this way, term dominance is incorporated into naïve Bayes.

14 Experiments on short text similarity datasets
[Datasets] Four datasets derived from word similarity datasets using a dictionary.
[Comparative methods] Original ESA [Gabrilovich07]; ESA with 16 parameter settings.
[Metric] Spearman's rank correlation coefficient.
Finding: ESA with a well-adjusted parameter is superior to our method for "clean" texts.
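The closed-form product above can be sketched directly. This is an illustrative sketch under toy numbers, not the paper's code; each term is paired with a dominance value P(t_k ∈ T), and entities absent from a term's table are assumed to back off to the prior P(c).

```python
import math

def extended_naive_bayes(terms, entity_prior):
    """Rank entities c by the extended naive Bayes score
    P(c|T) = prod_k [P(t_k in T) P(c|t_k) + (1 - P(t_k in T)) P(c)] / P(c)^(K-1).
    `terms` is a list of (dominance, table) pairs, where dominance = P(t_k in T)
    and table maps each entity c to P(c|t_k)."""
    K = len(terms)
    ranked = {}
    for c, prior in entity_prior.items():
        log_score = -(K - 1) * math.log(prior)
        for dominance, table in terms:
            p_c_t = table.get(c, prior)  # assumption: back off to P(c) if unseen
            log_score += math.log(dominance * p_c_t + (1.0 - dominance) * prior)
        ranked[c] = log_score
    return sorted(ranked.items(), key=lambda kv: kv[1], reverse=True)
```

Note the behavior at the extremes: with dominance 1.0 for every term the bracket reduces to P(c|t_k) and the score equals plain naïve Bayes, while a term with low dominance contributes roughly P(c) and thus barely shifts the ranking, which is how semantically noisy terms are damped.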
15 Tweet clustering
K-means clustering using the vector of related entities for measuring distance.
[Dataset] 12,385 tweets covering 13 topics: #MacBook (1,251), #Silverlight (221), #VMWare (890), #MySQL (1,241), #Ubuntu (988), #Chrome (1,018), #NFL (1,044), #NHL (1,045), #MLB (752), #MLS (981), #NASCAR (878), #NBA (1,085), #UFC (991).
[Comparative methods] Bag-of-words (BOW), ESA with the same parameter, ESA with a well-adjusted parameter.
[Metric] Normalized Mutual Information (NMI), averaged over 20 runs.

16-17 Results
(Chart: NMI score vs. number of related entities, from 10 to 2,000, for BOW, ESA-same, ESA-adjusted, and our method; best scores 0.567, 0.524, 0.429, 0.421.) Our method achieved the highest NMI (0.567) and outperformed ESA with a well-adjusted parameter for noisy short texts (p-value < 0.01).

18 Conclusion
We proposed extended naïve Bayes to derive related Wikipedia entities given a real-world noisy short text.
[Future work] Tackle multilingual short texts; develop applications of the method.
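The clustering and similarity experiments both compare texts through their related-entity vectors. The deck does not name the distance function, so the cosine measure below is an assumption shown for illustration, with hypothetical entity vectors as input.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse entity vectors, represented as
    {entity: weight} dicts (e.g., P(c|T) scores for each text)."""
    dot = sum(w * vec_b.get(c, 0.0) for c, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical entity vectors for the two example sentences:
a = {"Google": 0.5, "China": 0.3}
b = {"Baidu": 0.4, "China": 0.3}
print(cosine_similarity(a, b))
```

Even though the two sentences share no surface words, their entity vectors overlap (here on "China"), so the similarity is nonzero; this is the mechanism that lets entity-based representations beat bag-of-words on such pairs.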