Establishment of Knowledge-Intensive Structural Natural Language Processing and Construction of Knowledge Infrastructure
Sadao Kurohashi, Kyoto University
The Japanese Extreme Big Data Projects Workshop, Fukuoka, Japan, Feb 26, 2014

Texts are the Basis of Human Knowledge Representation
• Data analysis results and interpretation by experts
• Criticisms and opinions
• Procedures and instructions
[Diagram labels: TEXTs; Knowledge Acquisition; Improvement of NLP; Information Access, Organization, Analysis]

Japanese
• Synonymy problem: 林檎, りんご, リンゴ (three spellings of "apple")
• Head final
  私 が 本 を 買った (I-NOM book-ACC bought)
• Scrambling
  本 を 私 が 買った (book-ACC I-NOM bought)
  本は 私も 買った (book-TOP I-also bought)
• Lots of omission (zero pronouns)
  私は 本を 買った (I book bought)
  私は 本を 買って、 私は それを 読んだ (I book bought (and) I it read)

[Diagram: overview of the research cycle
  Information Access, Organization, Analysis: TSUBAKI (Info-plosion 2005-2010), WISDOM (NICT 2006-2010)
  Improvement of NLP: Case Analysis, Synonym Recognition
  Knowledge Acquisition from TEXTs: Case Frames, Distributional Similarity]

Improving NLP by exploiting Big Data
[Figure: accuracy and coverage (0 to 1) versus corpus size (# of sentences: 1.5M, 6M, 25M, 100M, 400M, 1.6G, 6.4G, 15G). Curves: accuracy of syntactic analysis, accuracy of case structure analysis, coverage of case frames, accuracy of zero anaphora resolution, accuracy of synonym recognition.]

Predicate-Argument Structure
  ? クロールで 泳いでいる 女の子を 見た (crawl swim girl saw)
  ? 望遠鏡で 泳いでいる 女の子を 見た (telescope swim girl saw)

Case frames [Kawahara and Kurohashi, HLT2001, COLING2002, LREC2006]
• 泳ぐ (swim)
  {人 person, 子 child, …}が
  {クロール crawl, 平泳ぎ breaststroke, …}で
  {海 sea, 大海 ocean, …}を
• 見る (see)
  {人 person, 者 person, …}が
  {望遠鏡 telescope, 双眼鏡 binoculars, …}で
  {姿 figure, 人 person, …}を
[Diagram: Web, 15G sentences (3G pages) → Parsing & Filtering → Predicate-argument structures → Clustering → Case frames for 120K predicates. 2-day computation using 1,000 CPU cores!]

Distributional Similarity
• Noun: 医者 (doctor) ≒ 医師 (doctor)
  shared contexts: に 相談 (consult), の 診察 (observation), が 手術 (operation), ‥
• Predicate phrase: 景気が冷え込む (the market cools down) ≒ 景気が悪化する (the market becomes bad)
  shared contexts: 低迷し (hover around), 株価が下がる (stock prices go down), 増税し (increase taxes), ‥
• Idiom: 心をとらえる (capture one's heart) ≒ 魅了 (charm)
  shared contexts: 感動的で (impressive), 離さない (keep), ‥

Knowledge Acquisition Flow
[Diagram, shown in three build steps. Recoverable elements:
  – Resources: Web Corpus (100M pages; later 1 billion pages), Wikipedia, Lexical DB, variant DB
  – Tools: JUMAN, JUMAN/KNP, Suffix POS Classifier
  – From Wikipedia: entry : hypernym → Wikipedia auto dic
  – From the Web corpus: unknown word detection, unknown word candidate enumeration, POS classification → Web auto dic; variant recognition and semantic classification → Web auto dic (variant merged, w/ semantic)
  – JUMAN/KNP: detect unknown word, assign POS/hypernym, assign repform
  – Outputs: Parsed Web Corpus, distributional similarity, Case frames
  – Running times shown: 1 week, 2 weeks, 1 day, half day, 1 week]

TSUBAKI
Query: 『農業の再生を推進する人材の養成』 (Develop Human Resources Promoting Agriculture Regeneration)
Parse of Query
• 【再生する】
  (Regenerate): ga, wo
• 【推進する】 (Promote): ga, wo, de
[Figure: case-frame slots of the two predicates with example fillers, in slide order: network, introduction, policy, human resource, …; ga: government, citizen, …; wo: activity, regeneration, …; education, agriculture, …; de: area, plan, development, …]
Retrieved snippets
• Develop human resources promoting agriculture regeneration
• Agriculture regeneration … develop human resources who promote cooperation …
• 農業の再生を推進する 人材を育成する。 (Train human resources who promote agriculture regeneration.) 育成=養成 (the two words for "training" are treated as synonyms)
• Develop "IT leaders of food and agriculture" who have agricultural technologies and …
• Develop human resources promoting forestry regeneration
• Human resource development of forest maintenance
Bag of Words
• Regeneration plan … agricultural IT training, filling up human resources of IT trainers
• Produce new generation engineering agriculture creators
• Sightseeing regeneration … human resources
• Use of IT to agriculture … human resources
• 先進的農業地帯であるが … 生産額が停滞し、再生が切望され … (Although it is an advanced agricultural area … production value has stagnated and regeneration is eagerly awaited …)
• Develop human resources of creation … agriculture regeneration

WISDOM [Akamine+ ACL-IJCNLP2009 Demo]
Query: "Electric toothbrush is good for teeth?"
[Screenshot panels:]
• Important remarks of the analysis result
• Definition of "electric toothbrush"
• Distribution of posi/nega opinions over info. sender classes
• Positive opinions
• Major/Contradictory statements [Kawahara+ COLING 2010]
• Major keywords [Shibata+ Web Intelligence 2009]
• Negative opinions [Nakagawa+ HLT-NAACL 2010]
• Info. sender class distribution [Kato+ Web Intelligence 2009]
• Major info.
  senders and their opinions

[Diagram: overview of the research cycle, shown again
  Information Access, Organization, Analysis: TSUBAKI (Info-plosion 2005-2010), WISDOM (NICT 2006-2010)
  Improvement of NLP: Case Analysis, Synonym Recognition
  Knowledge Acquisition from TEXTs: Case Frames, Distributional Similarity]

CREST "Advanced Core Technologies for Big Data Integration"
Establishment of Knowledge-Intensive Structural NLP and Construction of Knowledge Infrastructure (2013/10 – 2019/3)
• Kurohashi, Kawahara, Shibata (Kyoto Univ.)
• Bekki (Ochanomizu Univ.)
• Miyao (NII)
• Inui, Okazaki, Watanabe (Tohoku Univ.)

[Diagram: the research cycle extended for the new project
  Knowledge Infrastructure
  Information Access, Organization, Analysis: TSUBAKI (Info-plosion 2005-2010), WISDOM (NICT 2006-2010)
  Improvement of NLP: Case Analysis, Synonym Recognition, Anaphora Resolution, Textual Inference
  Knowledge Acquisition from TEXTs: Case Frames, Distributional Similarity, Event Schema]

Improving NLP by exploiting Big Data
[Figure: the accuracy/coverage versus corpus size plot (1.5M to 15G sentences), shown again]

Event Schema Acquisition
Example [Chambers+ 09]
  Events: A search B; A arrest B; C convict B
  Roles: A: Police; B: Suspect; C: Jury
• Automatic acquisition methods have been proposed [Chambers+ 08, 09]
  – extract two events that share an argument
• In Japanese, arguments are often omitted, and it's hard to extract shared arguments
• We proposed a two-stage event pairs extraction method [Shibata+ 11]

Two-stage Event Pairs Acquisition [Shibata+ 11]
Example Web text:
  he-ga(nom)
  purse-wo(acc) pick up, and police-ni(dat) bring
[Figure: two-stage pipeline]
• Stage 1: extract predicate-argument (PA) pairs PA1, PA2 from Web text, e.g.
  – since driver-ga(nom) purse-wo(acc) pick up, police-ni(dat) bring
  – police-ni(dat) bring after purse-wo(acc) pick up
  then calculate PA co-occurrence statistics using Apriori:
  purse-wo(acc) pick up ⇒ police-ni(dat) bring
• Stage 2: argument alignment based on case frames
  – pick up:1  ga(nom) man, person, …; wo(acc) dust, cigarette, …
  – pick up:3  ga(nom) man, girl, …; wo(acc) purse, phone, …
  – bring:1  ga(nom) shop, company, …; wo(acc) goods, item, …; ni(dat) you, customer, …
  – bring:2  ga(nom) man, person, …; wo(acc) purse, money, …; ni(dat) police, …
• Extracted event pair
  Event1 (PA1) P1: pick up; Event2 (PA2) P2: bring
  nom A1: {man, person, …}; acc A2: {purse, …}; dat A3: {police, …}
  (A1 pick up A2, and then A1 bring A2 to A3)

[Diagram: Web text → case frames (120K pred.) and distributional similarity → events (0.3M event pairs).
  Example predicate-argument items: automation ga progress; technology ga progress; needs ga increase; price ga decrease; cost ga decrease; utilized; improved; downsized; become widespread; recognized; well-known.
  Example event pair: X is downsized → X becomes widespread]

Winograd Schema Challenge [Levesque 11]
• A dataset for the task of resolving definite pronouns (2000 problems)
• Requires the use of world knowledge and reasoning
Example:
  The red team defeated the blue team
  because they made the last penalty kick.
  X makes a penalty kick → X defeats Y

Very Fast and Trainable Abductive Reasoning on First-Order Logic [Inoue&Inui 12, 13]
• Input: ∃x, y q(y) ∧ r(A) ∧ s(x)
• Background knowledge: r(x) → s(x); s(x) ∧ t(y) → q(x); …
• Step 1. Backward chaining
  P: potential elemental hypotheses, e.g. r(x), s(y) ∧ t(u), A=x, y=x
• Step 2. Solve ILP optimization problem
  ILP constraints:
  C1: $h_{q(y)} = 1$
  C2: $r_{s(x)} \le h_{r(x)}$; $h_{s(y)} = h_{t(u)}$
  C3: $2u_{r(x),r(A)} \le h_{r(x)} + h_{r(A)}$
  C4: $u_{r(x),r(A)} \le s_{x,A}$
  C5: $s_{y,A} - s_{x,A} - s_{x,y} \ge -1$
• Output: most-likely hypothesis H*: ∃x, y q(y) ∧ r(A) ∧ s(x) ∧ r(x) ∧ x=A
ILP representation of search space
  ILP variables: $h_{q(y)}$, $h_{r(A)}$, $h_{s(x)}$, $h_{s(y)}$, $h_{t(u)}$, $h_{r(x)}$, $s_{x,A}$, $s_{x,y}$, $s_{y,A}$, $u_{r(A),r(x)}$, $u_{s(x),s(y)}$
  Candidate hypotheses (each row is a 0/1 assignment to the ILP variables):
  H1: q(y) ∧ r(A) ∧ s(x)                                      1 1 1 0 0 0 0 0 0 0 0
  H2: q(y) ∧ r(A) ∧ s(x) ∧ r(x)                               1 1 1 1 0 1 0 0 0 0 0
  H3: q(y) ∧ r(A) ∧ s(x) ∧ r(x) ∧ x=A                         1 1 1 1 0 1 1 0 0 0 1
  H4: q(y) ∧ r(A) ∧ s(x) ∧ r(x) ∧ s(y) ∧ t(u) ∧ x=A ∧ x=y     1 1 1 1 1 1 1 1 1 1 1
  :
• Benchmark performance
  – Background knowledge: 370,000 rules
  – Input: 1,600 texts (20 literals on avg.)
    • # of potential elemental hypotheses: 1,000 on avg.
    • # of candidate hypotheses: # of combinations of potential elemental hypotheses (i.e., approx. $O(2^{1000})$)
  – Average runtime (values shown on the chart: ≥ 30 min., 7 min., 2.6 sec.; ×150)
    • Mulkar-Mehta et al. [07]: heuristics-based abduction
    • Blythe et al. [11]: Markov Logic Network-based abduction
    • Inoue & Inui [12]: ILP-based abduction

Anaphora Resolution with Abductive Reasoning
B (background knowledge)
• PK(z) ∧ 決められる(x, y, z) → 決める(y, z)
  「xがyにPKを決められる」 = 「yがPKを決める」 (x is scored a PK by y = y scores a PK)
• PK(z) ∧ 決められる(x, y, z) ∧ α → 負ける(x, y)
  「xがyにPKを決められる+α」 → 「xがyに負ける」 (x is scored a PK by y, plus α → x is defeated by y)
• 負ける(x, y) → 負かす(y, x)
  「xがyに負ける」 = 「yがxを負かす」 (x is defeated by y = y defeats x)
• PK(z) ∧ 決める(y, z) → 成功させる(y, z)
  「PKを決める」 = 「PKを成功させる」 (score a PK = make a PK)
• PK(z) → ペナルティーキック(z)
  「PK」 = 「ペナルティーキック」 (PK = penalty kick)
H (hypothesis)
• PK(z) ∧ 決められる(x, y, z) (be scored)
• 負ける(x, y) (be defeated), with x=x1, y=x0
• 決める(y, z) (score), with y=x2, z=x3 (make a PK = score a PK; penalty kick = PK)
O (observation)
• 赤チーム(x0) ∧ 青チーム(x1) ∧ 負かす(x0, x1) ∧ 成功させる(x2, x3) ∧ ペナルティーキック(x3)
  (red team, blue team, defeat, make, penalty kick)
• 赤チームは青チームを負かした。彼らが最後のペナルティーキックを成功させたからだ。
  (The red team defeated the blue team because they made the last penalty kick.)
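
The penalty-kick example can be traced with a small program. The following is a minimal sketch, not the ILP-based system of [Inoue&Inui 12, 13]: it backward-chains over the five background rules and then greedily unifies literals that share a predicate, instead of selecting the best hypothesis by solving an ILP. The English predicate names, the helper names (`backchain`, `explain`, `unify`) and the omission of the unspecified extra condition α are all assumptions made for illustration.

```python
from itertools import count

# Rules "body implies head"; terms starting with "?" are variables.
# Predicate names are English stand-ins for the slide's Japanese predicates.
# The unspecified condition "alpha" of the second rule is omitted (assumption).
RULES = [
    (("score", "?y", "?z"), [("PK", "?z"), ("be_scored", "?x", "?y", "?z")]),
    (("be_defeated", "?x", "?y"), [("PK", "?z"), ("be_scored", "?x", "?y", "?z")]),
    (("defeat", "?y", "?x"), [("be_defeated", "?x", "?y")]),
    (("make", "?y", "?z"), [("PK", "?z"), ("score", "?y", "?z")]),
    (("penalty_kick", "?z"), [("PK", "?z")]),
]

# Observation O from the slide: x2 is the unresolved pronoun "they".
OBSERVATIONS = [
    ("red_team", "x0"), ("blue_team", "x1"), ("defeat", "x0", "x1"),
    ("make", "x2", "x3"), ("penalty_kick", "x3"),
]

_fresh = count()


def backchain(literal):
    """Yield the instantiated body of each rule whose head matches `literal`."""
    for head, body in RULES:
        if head[0] != literal[0] or len(head) != len(literal):
            continue
        binding = dict(zip(head[1:], literal[1:]))
        tag = next(_fresh)
        # Variables that occur only in the body become fresh unnamed entities.
        yield [
            (lit[0], *(binding.get(t, f"_u{tag}{t[1:]}") for t in lit[1:]))
            for lit in body
        ]


def explain(observations, depth=3):
    """Return the observations plus all literals hypothesized within `depth` steps."""
    literals, frontier = list(observations), list(observations)
    for _ in range(depth):
        frontier = [lit for obs in frontier for body in backchain(obs) for lit in body]
        literals.extend(frontier)
    return literals


def unify(literals):
    """Greedily merge the arguments of literals sharing predicate and arity.

    This replaces the cost-based ILP selection with a crude heuristic.
    """
    parent = {}

    def find(term):
        parent.setdefault(term, term)
        while parent[term] != term:
            term = parent[term]
        return term

    first = {}
    for lit in literals:
        seen = first.setdefault((lit[0], len(lit)), lit)
        for a, b in zip(seen[1:], lit[1:]):
            parent[find(a)] = find(b)
    return find


find = unify(explain(OBSERVATIONS))
antecedent = "x0" if find("x2") == find("x0") else "x1" if find("x2") == find("x1") else None
print(antecedent)  # x0: "they" is resolved to the red team
```

Both observed events are explained by the same hypothesized "be scored" literal, which forces x2 = x0. Greedy unification happens to suffice for this one example; in general, competing unifications have to be weighed against each other, which is the role of the ILP objective in the cited work.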
Knowledge Infrastructure: "Lithium-ion battery"
[Figure: knowledge map; each statement is marked as positive or negative on the slide. Items grouped under the category labels:]
Chemical Properties
• Light, small, big capacity
• Low memory effect
• Manufacturing process is complex and costly
• Overheated by overvoltage/overcharge
• Long-term use under high temperature
• Battery pressure increases
• Fire, liquid leakage
• Risks of explosion
• Supply a "battery pack" with a mechanism of preventing overcharge
• It is very unlikely to cause ignition or explosion
Past Trends
• Began to spread in home electric appliances (1990-)
• Current (2010) performance is insufficient for vehicles
Trading
• Import all of lithium raw materials
• Producing regions are maldistributed and production is low (ore in Chile)
• Concern about the gap between commercial supply and demand
Recent Trends
• Have biggest market among rechargeable batteries (2012)
• Toyota launched Priusα with lithium-ion battery (2011.5)
• APU battery from a B787 aircraft (JAL) caught fire at Boston Logan Airport (2013.1)
• A B787 aircraft (ANA) made an emergency landing at Takamatsu Airport due to a battery problem (2013.1)
• Suspended B787 flights
• Battery melting in a Mitsubishi "Outlander PHEV" was discovered (2013.3)
• Discontinued production and sales
• A lithium-ion battery of a Mitsubishi "i-MiEV" caught fire at the Mizushima plant (2013.3)
• Suspended production
• Airbus abandoned its plans to use lithium-ion batteries for A350 airplanes (2013.4)
Perspective
• GS Yuasa: it takes 10 years to supply mass-produced products (2013.3+10 yrs.)
• Many engineers expect that it needs at least 10 years for the spread to transportation industries (2013.3+10 yrs.)
• Energy Power Systems has been improving lead-acid batteries from 2 years ago (2013.4-2 yrs.)
Technology
• An issue is to develop a technology of recycling used batteries
• JX, Waseda Univ. and Nagoya Univ. will put recycling technology to practical use
• An issue is to develop a technology of recovering Li from seawater
  – Merit: 230 billion tons exist in seawater against 14 million tons onshore
  – Demerit: concentration is low (0.1-0.2 mg per 1 liter of seawater)
• "Practical recovery of lithium from seawater," Journal of Ion Exchange 2011 Academic Award
• First elucidation of detailed charge-discharge mechanism of lithium-ion batteries, Journal of the American Chemical Society Online (2013.4)

Application to a Real-World Problem
• Hitachi Contact Center: 2000 calls/day
  http://www.hitachi‐systems.com/cloud/ccs/

[Diagram: the extended research cycle, shown again
  Knowledge Infrastructure
  Information Access, Organization, Analysis: TSUBAKI (Info-plosion 2005-2010), WISDOM (NICT 2006-2010)
  Improvement of NLP: Case Analysis, Synonym Recognition, Anaphora Resolution, Textual Inference
  Knowledge Acquisition from TEXTs: Case Frames, Distributional Similarity, Event Schema]

Thank you!