Weekly 1:1 meeting 6/20/2007

Minimally Supervised Learning of Semantic
Knowledge from Query Logs
Mamoru Komachi(†) and Hisami Suzuki(‡)
(†) Nara
Institute of Science and Technology, Japan
(‡) Microsoft Research, USA
IJCNLP-08, Hyderabad, India
2015/10/1
Task
similar
Darjeeling
similar
Chai
(Indian tea)
Kombucha
(Japanese tea)
Learn semantic categories from web search query logs by
bootstrapping with minimal supervision
Semantic category: a set of words which are interrelated
 Named entities, technical terms, paraphrases, …
Can be useful for search ads, etc…



2
2015/10/1
Our Contribution

First to use the Japanese query logs for the task of
learning of named entities

Propose an efficient method suited for query logs, based
on the general-purpose Espresso (Pantel and Pennacchiotti
2006) algorithm
4
2015/10/1
Table of Contents
Related work
1.


The Tchai algorithm
2.


Problems of Espresso
Extension to Espresso
Experiment
3.


5
Bootstrapping techniques for relation extraction
Scoring metrics
System performance and comparison to other algorithms
Samples of extracted instances and patterns
2015/10/1
Bootstrapping
Iteratively conduct pattern induction and instance
extraction starting from seed instances
Can fertilize small set of seed instances


Instances
vaio
Query log
(Corpus)
Contextual
patterns
Compare vaio laptop
Compare # laptop
Toshiba satellite
HP xb3000
6
Compare toshiba satellite laptop
#:slot
Compare HP xb3000 laptop
10/1/2015
Instance lookup and pattern induction
ANA
instance
ANA 予約
# 予約
query log
extracted pattern
Instance
Query log
Pattern
ANA
(All Nippon Airways)
ANA 予約
(reservation)
644
# 予約
Restaurant reservation?
(reservation)
ラスベガス
(Las Vegas)
140
ラスベガス格安航空券 #格安航空券 Flight reservation?
(discount flight ticket) (discount flight ticket)
ラスベガス
ラスベガスホテル
…
Semantic drift
Computational efficency

7
114
#ホテル(hotel)
Use all strings but instances
=Require no segmentation

Count
Broad coverage,
Noisy patterns
Generic patterns
2015/10/1
Instance/Pattern Scoring Metrics
Sekine & Suzuki (2007)



Starts from a large named entity dictionary
Assign low scores to generic patterns and ignore
Basilisk (Thelen and Riloff, 2002)


Balance the recall and precision of generic patterns
Espresso (Pantel and Pennacchiotti, 2006)

 pm i(i, p)


 r (i) 

iI  max pm i

r ( p) 
I
 pm i(i, p)


 r ( p) 

pP  max pm i

r (i) 
P
PMI is normalized by
the maximum of all P and I
8
Reliability of an instance and
a pattern is mutually defined
P: patterns in corpus
I: instances in corpus
PMI: pointwise mutual information
r: reliability score
2015/10/1
The Tchai Algorithm

Filter generic patterns/instances


Not to select generic patterns and instances
Replace scaling factor in reliability scores


Take the maximum PMI for a given instance/pattern rather than
the maximum for all instances and patterns
This modification shows a large impact on the effectiveness of
our algorithm
 pm i(i, p)



 max pm i r (i ) 
iI
iI

r ( p)  
I

 pm i(i, p )


 r ( p) 



pP  max pP pm i

r (i ) 
P
Only induce patterns at the beginning

10
Tchai runs 400X faster than Espresso
2015/10/1
Experiments

Japanese query logs from 2007/01-02


Unique one million (166 millions in token)
Target categories



Manually classified 10,000 most frequent search words (in the
log of 2006/12) -- hereafter referred to as 10K list
Travel: the largest category (712 words)
Finance: the smallest category (240 words)
Category
Travel
Finance
12
Seeds (with English translation)
JAL(Japan Airlines), ANA(All Nippon Airways), JR(Japan
Railways), じゃらん(Travel information site), HIS(Travel agency)
み ず ほ 銀 行 (Mizuho Bank), 三 井 住 友 銀 行 (Sumitomo Mitsui
Banking Corporation), JCB, 新生銀行 (Shinsei Bank), 野村證券
(Nomura Securities)
2015/10/1
Results
Travel
Travel
Not Travel
High precision
(92.1%)
Learned 251 novel words
10K list
Not in 10K list
Not Travel
Travel
280
17
251
0
7
125
Finance
Due to the ambiguity of hand
Include common nouns
labeling
related
Travel
10K list
Not into10K
list
(e.g. Tokyo Disney Land)
(e.g. Rental car)
Not Finance
Finance
Finance
41
30
30
Not Finance
0
5
99
13
2015/10/1
Sample of Instances (Travel category)
Type
Place
Examples (with translation)
トルコ (Turkey), ラスベガス (Las
Vegas), バリ島 (Bali Island)
Travel agency Jtb, トクー (www.tocoo.jp), yahoo
(Yahoo ! Travel), net cruiser
Able to
learn severalusj
subAttraction
ディズニーランド
(Disneyland),
categories
(Universal Studio
Japan) in which no
seed words given
Hotel
帝国ホテル (Imperial Hotel), リッツ
(Ritz Hotel)
Transportation 京浜急行(Keihin Express), 奈良交通
(Nara Kotsu Bus Lines)
14
2015/10/1
System Performance
Travel
Finance
System
Instances Precision
Rel.
Recall
System
Instances Precision
Rel.
Recall
Basilisk
651
63.4%
1.26
Basilisk
278
27.3%
0.70
Espresso 500
65.6%
1.00
Espresso 704
15.2%
1.00
Tchai
80.6%
1.67
Tchai
35.0%
0.73
680
223
High precision but low relative
High precision and recall
Relative Recall (Pantel et al., 2004) recall due to strict filtering
RA|B
17
RA C A C C A PA  A




RB CB / C CB PB  B
2015/10/1
Cumulative precision: Travel
Tchai achieved the
best precision
18
2015/10/1
Basilisk and Espresso extracted location
namesPatterns
as context patterns, which may be
Sample Extracted
too generic for Travel domain
System
Sample Patterns (with English translation)
Basilisk #東日本(east_japan), #西日本(west_japan), p#sonic, #
時 刻 表 (timetable), # 九 州 (Kyushu),
#+ マ イ レ ー ジ
(mileage), #バス(bus), google+#lytics, #+料金(fare),
#+国内(domestic), #ホテル(hotel)
Espresso #バス(bus), 日本#(Japan), #ホテル(hotel), #道路(road),
#イン(inn), フジ#(Fuji), #東京(Tokyo), #料金(fare), #九
州(Kyushu), #時刻表(timetable), #+旅行(travel), #+名
古屋(Nagoya)
Tchai
#+ホテル(hotel), #+ツアー(tour), #+旅行(travel), #予約
(reserve), #+ 航 空 券 (flight_ticket), #+ 格 安 航 空 券
(discount_flight_titcket), #マイレージ(mileage), 羽田空
港+#(Haneda Airport)
Tchai found context patterns that
are characteristic
to the domain
20
2015/10/1
Conclusion and future work

Conclusion



Use of query logs for semantic category learning
Improved Espresso algorithm in both precision and performance
Future work

21
Generalize bootstrapping method by graph-based matrix
calculation
2015/10/1
Tchai
Thank you for listening!
22
2015/10/1