Minimally Supervised Learning of Semantic Knowledge from Query Logs Mamoru Komachi(†) and Hisami Suzuki(‡) (†) Nara Institute of Science and Technology, Japan (‡) Microsoft Research, USA IJCNLP-08, Hyderabad, India 2015/10/1 Task similar Darjeeling similar Chai (Indian tea) Kombucha (Japanese tea) Learn semantic categories from web search query logs by bootstrapping with minimal supervision Semantic category: a set of words which are interrelated Named entities, technical terms, paraphrases, … Can be useful for search ads, etc… 2 2015/10/1 Our Contribution First to use the Japanese query logs for the task of learning of named entities Propose an efficient method suited for query logs, based on the general-purpose Espresso (Pantel and Pennacchiotti 2006) algorithm 4 2015/10/1 Table of Contents Related work 1. The Tchai algorithm 2. Problems of Espresso Extension to Espresso Experiment 3. 5 Bootstrapping techniques for relation extraction Scoring metrics System performance and comparison to other algorithms Samples of extracted instances and patterns 2015/10/1 Bootstrapping Iteratively conduct pattern induction and instance extraction starting from seed instances Can fertilize small set of seed instances Instances vaio Query log (Corpus) Contextual patterns Compare vaio laptop Compare # laptop Toshiba satellite HP xb3000 6 Compare toshiba satellite laptop #:slot Compare HP xb3000 laptop 10/1/2015 Instance lookup and pattern induction ANA instance ANA 予約 # 予約 query log extracted pattern Instance Query log Pattern ANA (All Nippon Airways) ANA 予約 (reservation) 644 # 予約 Restaurant reservation? (reservation) ラスベガス (Las Vegas) 140 ラスベガス格安航空券 #格安航空券 Flight reservation? (discount flight ticket) (discount flight ticket) ラスベガス ラスベガスホテル … Semantic drift Computational efficency 7 114 #ホテル(hotel) Use all strings but instances =Require no segmentation Count Broad coverage, Noisy patterns Generic patterns 2015/10/1 Instance/Pattern Scoring Metrics Sekine & Suzuki (2007) Starts from a large named entity dictionary Assign low scores to generic patterns and ignore Basilisk (Thelen and Riloff, 2002) Balance the recall and precision of generic patterns Espresso (Pantel and Pennacchiotti, 2006) pm i(i, p) r (i) iI max pm i r ( p) I pm i(i, p) r ( p) pP max pm i r (i) P PMI is normalized by the maximum of all P and I 8 Reliability of an instance and a pattern is mutually defined P: patterns in corpus I: instances in corpus PMI: pointwise mutual information r: reliability score 2015/10/1 The Tchai Algorithm Filter generic patterns/instances Not to select generic patterns and instances Replace scaling factor in reliability scores Take the maximum PMI for a given instance/pattern rather than the maximum for all instances and patterns This modification shows a large impact on the effectiveness of our algorithm pm i(i, p) max pm i r (i ) iI iI r ( p) I pm i(i, p ) r ( p) pP max pP pm i r (i ) P Only induce patterns at the beginning 10 Tchai runs 400X faster than Espresso 2015/10/1 Experiments Japanese query logs from 2007/01-02 Unique one million (166 millions in token) Target categories Manually classified 10,000 most frequent search words (in the log of 2006/12) -- hereafter referred to as 10K list Travel: the largest category (712 words) Finance: the smallest category (240 words) Category Travel Finance 12 Seeds (with English translation) JAL(Japan Airlines), ANA(All Nippon Airways), JR(Japan Railways), じゃらん(Travel information site), HIS(Travel agency) み ず ほ 銀 行 (Mizuho Bank), 三 井 住 友 銀 行 (Sumitomo Mitsui Banking Corporation), JCB, 新生銀行 (Shinsei Bank), 野村證券 (Nomura Securities) 2015/10/1 Results Travel Travel Not Travel High precision (92.1%) Learned 251 novel words 10K list Not in 10K list Not Travel Travel 280 17 251 0 7 125 Finance Due to the ambiguity of hand Include common nouns labeling related Travel 10K list Not into10K list (e.g. Tokyo Disney Land) (e.g. Rental car) Not Finance Finance Finance 41 30 30 Not Finance 0 5 99 13 2015/10/1 Sample of Instances (Travel category) Type Place Examples (with translation) トルコ (Turkey), ラスベガス (Las Vegas), バリ島 (Bali Island) Travel agency Jtb, トクー (www.tocoo.jp), yahoo (Yahoo ! Travel), net cruiser Able to learn severalusj subAttraction ディズニーランド (Disneyland), categories (Universal Studio Japan) in which no seed words given Hotel 帝国ホテル (Imperial Hotel), リッツ (Ritz Hotel) Transportation 京浜急行(Keihin Express), 奈良交通 (Nara Kotsu Bus Lines) 14 2015/10/1 System Performance Travel Finance System Instances Precision Rel. Recall System Instances Precision Rel. Recall Basilisk 651 63.4% 1.26 Basilisk 278 27.3% 0.70 Espresso 500 65.6% 1.00 Espresso 704 15.2% 1.00 Tchai 80.6% 1.67 Tchai 35.0% 0.73 680 223 High precision but low relative High precision and recall Relative Recall (Pantel et al., 2004) recall due to strict filtering RA|B 17 RA C A C C A PA A RB CB / C CB PB B 2015/10/1 Cumulative precision: Travel Tchai achieved the best precision 18 2015/10/1 Basilisk and Espresso extracted location namesPatterns as context patterns, which may be Sample Extracted too generic for Travel domain System Sample Patterns (with English translation) Basilisk #東日本(east_japan), #西日本(west_japan), p#sonic, # 時 刻 表 (timetable), # 九 州 (Kyushu), #+ マ イ レ ー ジ (mileage), #バス(bus), google+#lytics, #+料金(fare), #+国内(domestic), #ホテル(hotel) Espresso #バス(bus), 日本#(Japan), #ホテル(hotel), #道路(road), #イン(inn), フジ#(Fuji), #東京(Tokyo), #料金(fare), #九 州(Kyushu), #時刻表(timetable), #+旅行(travel), #+名 古屋(Nagoya) Tchai #+ホテル(hotel), #+ツアー(tour), #+旅行(travel), #予約 (reserve), #+ 航 空 券 (flight_ticket), #+ 格 安 航 空 券 (discount_flight_titcket), #マイレージ(mileage), 羽田空 港+#(Haneda Airport) Tchai found context patterns that are characteristic to the domain 20 2015/10/1 Conclusion and future work Conclusion Use of query logs for semantic category learning Improved Espresso algorithm in both precision and performance Future work 21 Generalize bootstrapping method by graph-based matrix calculation 2015/10/1 Tchai Thank you for listening! 22 2015/10/1
© Copyright 2024 ExpyDoc