x - Extreme Big Data

Establishment of Knowledge‐Intensive Structural Natural Language Processing and Construction of Knowledge Infrastructure
Sadao Kurohashi
Kyoto University
The Japanese Extreme Big Data Projects Workshop, Fukuoka JAPAN, Feb 26, 2014
Texts are the Basis of Human
Knowledge Representation
• Data analysis results and interpretation by experts
• criticisms and opinions
• procedures and instructions
Information Access, Organization, Analysis
Improvement of NLP
Knowledge Acquisition
TEXTs
Japanese
林檎
りんご
リンゴ
Synonymy Problem
• Head Final
Japanese
私 が 本 を 買った
I‐NOM book‐ACC bought
• Scrambling
本 を 私 が 買った
book‐ACC I‐NOM bought
本は 私も
book‐TOP I‐also
買った
bought
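The first two examples above (canonical and scrambled order) can be read as the same predicate-argument structure once the case particles are used instead of word order. A minimal sketch, assuming whitespace-tokenized input and a small hand-written particle table; the function name `parse_clause` and the role labels are illustrative, not part of JUMAN/KNP:

```python
# Case particles assumed for this sketch (not an exhaustive list).
CASE_MARKERS = {"が": "NOM", "を": "ACC", "に": "DAT"}

def parse_clause(tokens):
    """Return (predicate, {case: argument}) for a tokenized clause.

    Assumes the clause is a sequence of noun/particle pairs followed by
    the predicate, which reflects the head-final property.
    """
    *body, predicate = tokens
    args = {}
    for word, particle in zip(body[0::2], body[1::2]):
        args[CASE_MARKERS[particle]] = word
    return predicate, args

canonical = parse_clause("私 が 本 を 買った".split())
scrambled = parse_clause("本 を 私 が 買った".split())
print(canonical == scrambled)
```

Topic-marked arguments such as 本は and 私も carry no case particle, so this sketch does not cover the third example; that case needs the lexical knowledge discussed in the following slides.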
Japanese
• Lots of Omission (Zero Pronoun)
私は 本を 買った
I book bought
私は 本を 買って、
私は それを 読んだ
I book bought (and) I it read
TSUBAKI (Info‐plosion 2005‐2010)
WISDOM (NICT 2006‐2010)
Information Access, Organization, Analysis
Improvement of NLP
Case Analysis
Synonym Recognition
Knowledge Acquisition
TEXTs
Caseframe
Distributional Similarity
Improving NLP by exploiting Big Data
[Figure: Accuracy, Coverage (0 to 1) against Corpus Size (# of sentences: 1.5M, 6M, 25M, 100M, 400M, 1.6G, 6.4G, 15G). Curves: Accuracy of Syntactic Analysis, Accuracy of Case Structure Analysis, Coverage of Case Frames, Accuracy of Zero Anaphora Resolution, Accuracy of Synonym Recognition]
Predicate‐Argument Structure
?
クロールで 泳いでいる 女の子を 見た
crawl
swim
girl
saw
?
望遠鏡で 泳いでいる 女の子を 見た
telescope
swim
girl
saw
Case frame
泳ぐ swim
{人 person, 子 child,…}が
{クロール crawl, 平泳ぎ,…}で
{海 sea, 大海,…}を
見る see
{人 person, 者,…}が
{望遠鏡 telescope, 双眼鏡,…}で
{姿 figure, 人 person,…}を
[Kawahara and Kurohashi, HLT2001, COLING2002, LREC2006]
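The two example sentences differ only in which predicate the で-phrase attaches to, and the case frames above are enough to decide it. A minimal sketch, assuming the case frames are stored as plain sets copied from the slide; the real case frames are acquired automatically and are presumably scored rather than matched exactly, and the function name `attach` is hypothetical:

```python
# Case-frame slots copied from the slide (the "…" entries are left out).
CASE_FRAMES = {
    "泳ぐ": {"が": {"人", "子"}, "で": {"クロール", "平泳ぎ"}, "を": {"海", "大海"}},
    "見る": {"が": {"人", "者"}, "で": {"望遠鏡", "双眼鏡"}, "を": {"姿", "人"}},
}

def attach(noun, particle, candidate_predicates):
    """Return the single candidate whose case slot lists the noun, else None."""
    matches = [
        pred for pred in candidate_predicates
        if noun in CASE_FRAMES[pred].get(particle, set())
    ]
    return matches[0] if len(matches) == 1 else None

# "クロールで 泳いでいる 女の子を 見た" versus "望遠鏡で 泳いでいる 女の子を 見た"
crawl_head = attach("クロール", "で", ["泳ぐ", "見る"])
telescope_head = attach("望遠鏡", "で", ["泳ぐ", "見る"])
print(crawl_head, telescope_head)
```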
WEB: 15G sentences (3G pages)
Parsing & Filtering
Predicate‐argument structures
Clustering
Case frames for 120K predicates
2‐day computation using 1,000 CPU cores!
Distributional Similarity
(noun)
医者 (doctor) ≒ 医師 (doctor)
shared contexts: に 相談 (consult), の 診察 (observation), が 手術 (operation), ‥
(predicate phrase)
景気が冷え込む (the market cools down) ≒ 景気が悪化する (the market becomes bad)
shared contexts: 低迷し (hover around), 株価が下がる (stock prices go down), 増税し (increase taxes), ‥
(idiom)
心をとらえる (capture one’s heart) ≒ 魅了 (charm)
shared contexts: 感動的で (impressive), 離さない (keep), ‥
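The idea behind distributional similarity is that two expressions are likely synonymous when they occur with the same contexts. A minimal sketch, assuming cosine similarity over raw context counts; the slides do not name the similarity measure, and the counts below are made up for illustration:

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse context-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Toy context counts (hypothetical numbers, contexts taken from the slide).
contexts = {
    "医者": Counter({"に 相談": 3, "の 診察": 2, "が 手術": 1}),
    "医師": Counter({"に 相談": 2, "の 診察": 3, "が 手術": 1}),
    "株価": Counter({"が 下がる": 4}),
}

sim_doctor = cosine(contexts["医者"], contexts["医師"])
sim_unrelated = cosine(contexts["医者"], contexts["株価"])
print(round(sim_doctor, 3), sim_unrelated)
```

With real Web-scale counts the vectors are much larger and are usually reweighted (for example with pointwise mutual information) before comparison; that step is omitted here.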
Knowledge Acquisition Flow
[Figure: flow diagram. Inputs: Web Corpus (100M pages), Web Corpus (1 billion pages), Wikipedia. Tools and resources: JUMAN, JUMAN/KNP, Lexical DB, variant DB, Suffix POS Classifier, Parsed Web Corpus. Steps: unknown word candidate enumeration, unknown word detection, POS classification, variant recognition, semantic classification, distributional similarity. Outputs: Wikipedia auto dic (entry : hypernym), Web auto dic (variant merged, w/ semantic), Case frames. Step durations shown in the figure: half day, 1 day, 1 week, 2 weeks]
JUMAN/KNP
・ detect unknown word
・ assign POS/hypernym
・ assign repform
TSUBAKI
『農業の再生を推進する人材の養成』
(Develop Human Resources Promoting Agriculture Regeneration)
Parse of Query
【再生する】 (Regenerate)
ga
wo
【推進する】 (Promote)
network,introduction,
policy, human resource, …
ga
government,citizen, …
wo
activity, regeneration, …
education, agriculture,…
de
area,plan,development, …
Develop human resources promoting agriculture regeneration
Agriculture regeneration … develop human resources who promote cooperation …
農業の再生を推進する人材を育成する。 (Develop human resources who promote agriculture regeneration.)
育成=養成 (the two words for "develop" are recognized as synonyms)
Develop “IT leaders of food and agriculture” who have agricultural technologies and …
Develop human resources promoting forestry regeneration
Human resource development of forest maintenance
Bag of Words
Regeneration plan … agricultural IT training, filling up human resources of IT trainers
Produce new generation engineering agriculture creators
Sightseeing regeneration … human resources
Use of IT to agriculture … human resources
先進的農業地帯であるが … 生産額が停滞し、再生が切望され … (Although it is an advanced agricultural area … production value has stagnated and regeneration is eagerly awaited …) Develop human resources of creation … agriculture regeneration
WISDOM
[Akamine+ ACL‐IJCNLP2009 Demo]
Important remarks of the analysis result
“Electric toothbrush is good for teeth?”
Definition of “electric toothbrush”
Distribution of posi/nega opinions over info. sender classes
Positive opinions
Major/Contradictory statements
[Kawahara+ COLING 2010]
Major keywords
[Shibata+ Web Intelligence 2009]
Negative opinions
[Nakagawa+ HLT‐NAACL 2010]
Info. sender class distribution
[Kato+ Web Intelligence 2009]
Major info. senders and their opinions
CREST “Advanced Core Technologies for Big Data Integration”
Establishment of
Knowledge-Intensive Structural NLP
and Construction of Knowledge Infrastructure
(2013/10 – 2019/3)
Kurohashi, Kawahara, Shibata (Kyoto Univ.)
Bekki (Ochanomizu Univ.)
Miyao (NII)
Inui, Okazaki, Watanabe (Tohoku Univ.)
Knowledge Infrastructure
TSUBAKI (Info‐plosion 2005‐2010)
WISDOM (NICT 2006‐2010)
Information Access, Organization, Analysis
Improvement of NLP
Case Analysis
Synonym Recognition
Anaphora Resolution
Textual Inference
Knowledge Acquisition
TEXTs
Caseframe
Distributional Similarity
Event Schema
Event Schema Acquisition
Events
A search B
A arrest B
C convict B
Role
A: Police
B: Suspect
C: Jury
[Chambers+ 09]
• Automatic acquisition methods have been proposed [Chambers+ 08, 09]
– extract two events that share an argument
• In Japanese, arguments are often omitted, and it’s hard to extract shared arguments
• We proposed a two‐stage event pairs extraction method [Shibata+ 11]
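The shared-argument idea in the first bullet can be sketched as follows. This is a simplified illustration, not the procedure of [Chambers+ 08, 09]: it assumes each document is already a list of (verb, {role: filler}) tuples and counts verb-role pairs whose fillers are identical strings. The role names `subj` and `obj` and the function name are hypothetical.

```python
from collections import Counter
from itertools import combinations

def shared_argument_pairs(documents):
    """Count pairs of (verb, role) slots that are filled by the same argument."""
    counts = Counter()
    for events in documents:
        for (v1, args1), (v2, args2) in combinations(events, 2):
            for role1, filler1 in args1.items():
                for role2, filler2 in args2.items():
                    if filler1 == filler2:
                        counts[((v1, role1), (v2, role2))] += 1
    return counts

# One toy document following the search / arrest / convict example.
docs = [[
    ("search", {"subj": "police", "obj": "suspect"}),
    ("arrest", {"subj": "police", "obj": "suspect"}),
    ("convict", {"subj": "jury", "obj": "suspect"}),
]]

counts = shared_argument_pairs(docs)
print(counts[(("arrest", "obj"), ("convict", "obj"))])
```

When an argument is omitted, as is common in Japanese, its slot is simply missing from the dictionary and no pair is counted, which is the difficulty the second bullet points out.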
Two‐stage Event Pairs Acquisition [Shibata+11]
Web text
he‐ga(nom) purse‐wo(acc) pick up, and police‐ni(dat) bring
since driver‐ga(nom) purse‐wo(acc) pick up, bring
police‐ni(dat) bring after purse‐wo(acc) pick up
⇒ extraction of PA pairs
Event1 (PA1): pick up
Event2 (PA2): bring
PA co‐occurring statistics, calculation using Apriori
purse‐wo(acc) pick up ⇒ police‐ni(dat) bring
argument alignment based on case frames
nom A1: {man, person, …}
acc A2: {purse, …}
dat A3: {police, …}
(A1 pick up A2, and then A1 bring A2 to A3)
case frames
pick up: 1
ga(nom) man, person, …
wo(acc) dust, cigarette, …
pick up: 3
ga(nom) man, girl, …
wo(acc) purse, phone, …
bring: 1
ga(nom) shop, company, …
wo(acc) goods, item, …
ni(dat) you, customer, …
bring: 2
ga(nom) man, person, …
wo(acc) purse, money, …
ni(dat) police, …
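The co-occurrence step can be illustrated with the two quantities that association-rule mining relies on: support and confidence. The sketch below is not the full Apriori algorithm (there is no level-wise candidate generation); it only scores rules of the form PA1 ⇒ PA2 over pairs that are assumed to be already extracted. The toy pairs other than the purse example and the threshold are made up.

```python
from collections import Counter

def rule_stats(pa_pairs, min_support=2):
    """Return {(pa1, pa2): (support, confidence)} for rules pa1 => pa2."""
    first_counts = Counter()
    pair_counts = Counter()
    for pa1, pa2 in pa_pairs:
        first_counts[pa1] += 1
        pair_counts[(pa1, pa2)] += 1
    return {
        pair: (support, support / first_counts[pair[0]])
        for pair, support in pair_counts.items()
        if support >= min_support
    }

# Toy PA pairs; only the first pattern comes from the slide.
pa_pairs = [
    ("purse-wo pick up", "police-ni bring"),
    ("purse-wo pick up", "police-ni bring"),
    ("purse-wo pick up", "pocket-ni put"),
    ("dust-wo pick up", "trash-ni throw"),
]

rules = rule_stats(pa_pairs)
print(rules)
```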
Web text → case frames (120K pred.), distributional similarity → events (0.3M event pairs)
automation ga progress
technology ga progress
needs ga increase
price ga decrease
cost ga decrease
improved, utilized, downsized, become widespread, recognized, well‐known
X is downsized → X become widespread
Winograd Schema Challenge [Levesque 11]
• A dataset for the task of resolving definite pronouns (2000 problems)
• Require the use of world knowledge and reasoning
The red team defeated the blue team because they (?) made the last penalty kick.
X makes a penalty kick → X defeats Y
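The way the rule resolves the pronoun can be sketched as a lookup: the subject of the because-clause is mapped onto the role that the rule assigns it in the main clause. This is an illustration only, assuming the clauses are already reduced to (agent, predicate, patient) triples with lemmatized predicates; the rule table holds just the one rule from the slide, and the function name is hypothetical.

```python
# Rule from the slide: X makes a penalty kick -> X defeats Y.
# Stored as: cause predicate -> (effect predicate, role that X fills in the effect).
RULES = {"make a penalty kick": ("defeat", "agent")}

def resolve_pronoun(main_clause, because_predicate):
    """Return the antecedent of the pronoun subject of the because-clause, or None."""
    agent, predicate, patient = main_clause
    rule = RULES.get(because_predicate)
    if rule is None or rule[0] != predicate:
        return None
    return agent if rule[1] == "agent" else patient

antecedent = resolve_pronoun(
    ("the red team", "defeat", "the blue team"), "make a penalty kick"
)
print(antecedent)
```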
Very Fast and Trainable Abductive Reasoning on First‐Order Logic [Inoue&Inui 12, 13]
Input
∃x, y q(y) ∧ r(A) ∧ s(x)
Background Knowledge
r(x) → s(x)
s(x) ∧ t(y) → q(x)
:
Step 1. Backward chaining
P: Potential Elemental Hypotheses
r(x)
s(y) ∧ t(u)
A=x
y=x
ILP representation of search space:
ILP variables: h_{q(y)}, h_{r(A)}, h_{s(x)}, h_{s(y)}, h_{t(u)}, h_{r(x)}, s_{x,A}, s_{x,y}, s_{y,A}, u_{r(x),r(A)}, u_{s(x),s(y)}
Candidate hypothesis:
H1: q(y) ∧ r(A) ∧ s(x)
1 1 1 0 0 0 0 0 0 0 0
H2: q(y) ∧ r(A) ∧ s(x) ∧ r(x)
1 1 1 1 0 1 0 0 0 0 0
H3: q(y) ∧ r(A) ∧ s(x) ∧ r(x) ∧ x=A
1 1 1 1 0 1 1 0 0 0 1
H4: q(y) ∧ r(A) ∧ s(x) ∧ r(x) ∧ s(y) ∧ t(u) ∧ x=A ∧ x=y
1 1 1 1 1 1 1 1 1 1 1
:
ILP constraints:
C1: h_{q(y)} = 1
C2: r_{s(x)} ≤ h_{r(x)}; h_{s(y)} = h_{t(u)}
C3: 2 u_{r(x),r(A)} ≤ h_{r(x)} + h_{r(A)}
C4: u_{r(x),r(A)} ≤ s_{x,A}
C5: s_{y,A} − s_{x,A} − s_{x,y} ≥ −1
Step 2. Solve ILP optimization problem
Output: Most-likely hypothesis
H*: ∃x, y q(y) ∧ r(A) ∧ s(x) ∧ r(x) ∧ x=A
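To make the role of the constraints concrete, the sketch below checks whether a 0/1 assignment of the ILP variables is feasible. It covers C1, the equality part of C2, and C3 to C5 as listed on the slide; the first part of C2 is left out because the variable r_{s(x)} is not among the ILP variables listed. The variable names are plain strings chosen for this sketch, and no ILP solver or objective function is involved.

```python
NAMES = [
    "h_q(y)", "h_r(A)", "h_s(x)", "h_s(y)", "h_t(u)", "h_r(x)",
    "s_x,A", "s_x,y", "s_y,A", "u_r(x),r(A)",
]
ZERO = dict.fromkeys(NAMES, 0)

def feasible(v):
    """Check a 0/1 assignment against C1, h_s(y) = h_t(u), C3, C4 and C5."""
    return (
        v["h_q(y)"] == 1                                          # C1: observation must hold
        and v["h_s(y)"] == v["h_t(u)"]                            # C2 (second part): conjunction hypothesized together
        and 2 * v["u_r(x),r(A)"] <= v["h_r(x)"] + v["h_r(A)"]     # C3: unify only hypothesized literals
        and v["u_r(x),r(A)"] <= v["s_x,A"]                        # C4: unification requires x = A
        and v["s_y,A"] - v["s_x,A"] - v["s_x,y"] >= -1            # C5: transitivity of equality
    )

# Assignment corresponding to H*: q(y), r(A), s(x), r(x) with x = A.
h_star = {**ZERO, "h_q(y)": 1, "h_r(A)": 1, "h_s(x)": 1, "h_r(x)": 1,
          "s_x,A": 1, "u_r(x),r(A)": 1}
# Same assignment, but r(x) and r(A) are unified without setting x = A.
no_equality = {**h_star, "s_x,A": 0}
# x = A and x = y are set, but y = A is not: violates transitivity.
no_transitivity = {**ZERO, "h_q(y)": 1, "s_x,A": 1, "s_x,y": 1}

print(feasible(h_star), feasible(no_equality), feasible(no_transitivity))
```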
Very Fast and Trainable Abductive Reasoning on First‐Order Logic [Inoue&Inui 12, 13]
• Benchmark performance
– Background knowledge: 370,000 rules
– Input: 1,600 texts (20 literals on avg.)
• # of potential elemental hypotheses: 1,000 on avg.
• # of candidate hypotheses: # of combinations of potential elemental hypotheses (i.e., approx. O(2^1000))
average runtime
Mulkar‐Mehta et al. [07]: Heuristics‐based Abduction: ≥ 30 min.
Blythe et al. [11]: Markov Logic Network‐based Abduction: 7 min.
Inoue & Inui [12]: ILP‐based Abduction: 2.6 sec. (×150)
Anaphora Resolution with Abductive Reasoning
B (background knowledge)
PK(z) ∧ 決められる(x, y, z) → 決める(y, z)
「xがyにPKを決められる」=「yがPKを決める」 (x is scored a PK by y = y score a PK)
PK(z) ∧ 決められる(x, y, z) ∧ α → 負ける(x, y)
「xがyにPKを決められる+α」→「xがyに負ける」 (x is scored a PK by y → x is defeated by y)
負ける(x, y) → 負かす(y, x)
「xがyに負ける」=「yがxを負かす」 (x is defeated by y = y defeat x)
PK(z) ∧ 決める(x, z) → 成功させる(x, z)
「PKを決める」=「PKを成功させる」 (score a PK = make a PK)
PK(z) → ペナルティーキック(z)
「PK」=「ペナルティーキック」 (PK = penalty kick)
H (hypothesis)
PK(z) ∧ 決められる(x, y, z) (be scored)
負ける(x, y) (be defeated): x=x1, y=x0
決める(y, z) (score): y=x2, z=x3
make a PK = score a PK
penalty kick = PK
O (observation)
赤チーム(x0) ∧ 青チーム(x1) ∧ 負かす(x0, x1) ∧ 成功させる(x2, x3) ∧ ペナルティーキック(x3)
(red team) (blue team) (defeat) (make) (penalty kick)
赤チームは青チームを負かした。彼らが最後のペナルティーキックを成功させたからだ。
(The red team defeated the blue team because they made the last penalty kick.)
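Once the best hypothesis is found, the pronoun is resolved by reading off the variable equalities it contains: y=x0 and y=x2 together imply x2=x0, so 彼ら (they) refers to the red team. A minimal sketch of that last step, assuming the equalities x=x1, y=x0, y=x2, z=x3 from the slide and using a small union-find; the helper names are hypothetical and this is not the ILP-based reasoner itself:

```python
def find(parent, v):
    """Return the representative of v's equivalence class."""
    parent.setdefault(v, v)
    while parent[v] != v:
        v = parent[v]
    return v

def union(parent, a, b):
    """Merge the equivalence classes of a and b."""
    parent[find(parent, a)] = find(parent, b)

# Equalities taken from the hypothesis on the slide.
equalities = [("x", "x1"), ("y", "x0"), ("y", "x2"), ("z", "x3")]
# Observed entities: x0 is the red team, x1 is the blue team.
entities = {"x0": "red team", "x1": "blue team"}

parent = {}
for a, b in equalities:
    union(parent, a, b)

def referent(variable):
    """Return the observed entity that the variable is unified with, if any."""
    for var, name in entities.items():
        if find(parent, var) == find(parent, variable):
            return name
    return None

they = referent("x2")  # x2 is the subject of 成功させる (make)
print(they)
```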
Knowledge Infrastructure: “Lithium‐ion battery”
Chemical Properties
Light, small, big capacity
Low memory effect
Manufacturing process is complex and costly
Overheated by overvoltage/overcharge
Long‐term use under high temperature
Began to spread in home electric appliances (1990‐)
Current (2010) performance is insufficient for vehicles
Fire, liquid leakage
Risks of explosion
Battery pressure increases
Import all of lithium raw materials
Producing regions are maldistributed and production is low (ore in Chile)
Have biggest market among rechargeable batteries (2012)
Toyota launched Priusα with lithium‐ion battery (2011.5)
Supply a "battery pack" with a mechanism of preventing overcharge
Past Trends
It is very unlikely to cause ignition or explosion
Trading
Concern the gap of commercial supply and demand
Recent Trends
APU battery from a B787 aircraft (JAL) caught fire at Boston Logan Airport (2013.1)
Suspended B787 flights
A B787 aircraft (ANA) made an emergency landing at Takamatsu Airport due to a battery problem (2013.1)
An issue is to develop a technology of recycling used battery
JX, Waseda Univ. and Nagoya Univ. will put recycling technology to practical use
An issue is to develop a technology of recovering Li from seawater
Battery melting in a Mitsubishi "Outlander PHEV" was discovered (2013.3)
Discontinued production and sales
A lithium‐ion battery of a Mitsubishi "i‐MiEV" caught fire at the Mizushima plant (2013.3)
Suspended production
Perspective
Merit: 230 billion tons exist in seawater against 14 million tons onshore
GS Yuasa: it takes 10 years to supply mass‐produced products (2013.3+10 yrs.)
Airbus abandoned its plans to use lithium‐ion batteries for A350 airplanes (2013.4)
Demerit: concentration is low (0.1‐0.2 mg per 1 liter of seawater)
Many engineers expect that it needs at least 10 years for the spread to transportation industries (2013.3+10 yrs.)
Technology
“Practical recovery of lithium from seawater,” Journal of Ion Exchange 2011 Academic Award
Energy Power Systems has been improving lead‐acid batteries from 2 years ago (2013.4‐2 yrs.)
First elucidation of detailed charge‐discharge mechanism of lithium‐ion batteries, Journal of the American Chemical Society Online (2013.4)
Legend: positive / negative
Application to a Real‐World Problem
• Hitachi Contact Center: 2000 calls/day
http://www.hitachi‐systems.com/cloud/ccs/
Thank you!