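Research on Natural Language Processing and Understanding

Jun'ichi Tsujii
Department of Information Science, Graduate School of Science, The University of Tokyo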
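Project Goals
1. Academic goals
Integrating structural language processing with probabilistic and statistical language processing
Contributing to engineering through theory-driven approaches
Language processing and knowledge processing
2. Social impact
Language processing for the network era
Knowledge acquisition from text, information retrieval, dialogue systems
3. Disseminating results internationally
Active international collaborative research
Focused international workshops with concrete goals
Building a close framework for research cooperation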
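Grammar frameworks that are valid from the standpoint of theoretical linguistics
Grammar frameworks based on typed feature structures
Processing efficiency
Robustness
Bias in grammar descriptions: application to real-world text
Systematic extension of grammars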
Processing Efficiency
Abstract Machine for Unification (T. Makino et al.)
Prolog with Typed Feature Structures (LiLFeS)
Coling 98, JNE-00
CFG Approximation (K. Torisawa et al.)
Multi-staged Parsing (TNT)
Coling 98, JNE-00
Preventing Combinatorial Explosion (Y. Miyao)
Packing of FSs
ACL 99
Abstract Machine
(Carpenter and Qu, 1995)
[Figure, repeated over successive slides: abstract machine code of a TFS (instructions PUSH, ADDNEW list, UNIFYVAR 1, POP over the features FIRST and REST) executed against TFS data on memory (cells 1 to 6 tagged STR, VAR, or PTR, holding the types nelist, list, bot, and foo).]
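As a rough illustration of what the abstract machine computes, the following is a minimal Python sketch of destructive unification over typed feature structures. It is not the LiLFeS implementation: the `Node` class, the `PARENT` table, and the helper names are assumptions made for this example, the type hierarchy is restricted to a tree containing only the types shown in the figure (`bot`, `list`, `nelist`, `foo`), and appropriateness conditions on features are ignored. The `ptr` field plays the role of the PTR cells in the figure.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Toy type hierarchy (assumed for illustration): child -> parent, bot is most general.
PARENT = {"list": "bot", "nelist": "list", "foo": "bot"}


def supertypes(t):
    """Return t followed by all of its supertypes."""
    chain = [t]
    while t in PARENT:
        t = PARENT[t]
        chain.append(t)
    return chain


def glb(a, b):
    """Most general common subtype in a tree-shaped hierarchy, or None if incompatible."""
    if b in supertypes(a):
        return a
    if a in supertypes(b):
        return b
    return None


@dataclass(eq=False)
class Node:
    type: str = "bot"
    feats: Dict[str, "Node"] = field(default_factory=dict)
    ptr: Optional["Node"] = None  # forwarding pointer set by unification


def deref(n):
    """Follow forwarding pointers to the representative node."""
    while n.ptr is not None:
        n = n.ptr
    return n


def unify(a, b):
    """Destructively unify two structures; returns False on a type clash.

    On failure the inputs may be left partially merged (no undo trail here).
    """
    a, b = deref(a), deref(b)
    if a is b:
        return True
    t = glb(a.type, b.type)
    if t is None:
        return False
    b.ptr = a          # b now forwards to a
    a.type = t
    for name, sub in b.feats.items():
        if name in a.feats:
            if not unify(a.feats[name], sub):
                return False
        else:
            a.feats[name] = sub
    return True


# nelist[FIRST foo] unified with nelist[FIRST bot, REST list]
x = Node("nelist", {"FIRST": Node("foo")})
rest = Node("list")
z = Node("nelist", {"FIRST": Node("bot"), "REST": rest})
ok = unify(x, z)
```

After the call, `x` carries both features, with `FIRST` specialised to `foo`. A compiled abstract machine avoids the generic traversal above by emitting type-specific instructions ahead of time, which is where the speed-up reported on the following slides would come from.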
LiLFeS: Performance (2/2)
Comparison with other inference engines for typed feature structures
[Figure: bar chart on a scale of 0 to 20 with an arrow labelled FASTER, for an HPSG grammar. Systems compared: LiLFeS Native Code Compiler, LiLFeS Byte Code Emulator, ProFIT on SICStus Emulator, ALE 3.1 on SICStus Emulator. The bar values did not survive extraction.]
Machine: Intel Pentium II 400 MHz
Grammar: a small grammar distributed with ALE
Filtering with CFG (1/5)
• 2-phased parsing
– Approximate the HPSG with a CFG while keeping important constraints.
– The resulting CFG may over-generate, but it can be used for filtering.
– Rewriting in a CFG is far less expensive than applying rule schemata, principles, and so on.
[Diagram: Compile: HPSG → CFG. Parsing: input sentences → built-in CFG parser + feature structures (LiLFeS unification) → output: complete parse trees.]
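The two-phase idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not the TNT parser: the grammar, the lexicon, and the function names are invented for the example, and the "expensive" second phase is reduced to a subject-verb number agreement check standing in for feature structure unification. Phase 1 runs a CKY recognizer over the over-generating CFG; phase 2 checks constraints only on edges that belong to a complete CFG parse.

```python
# Toy CFG approximation (assumed): agreement is ignored, so it over-generates.
BINARY = {("NP", "VP"): "S", ("V", "NP"): "VP"}
LEXICON = {
    "dog": [("NP", "sg")], "dogs": [("NP", "pl")], "cats": [("NP", "pl")],
    "chases": [("V", "sg")], "chase": [("V", "pl")],
}


def cfg_chart(words):
    """Phase 1: CKY over categories only; each edge keeps its backpointers."""
    n = len(words)
    chart = {}
    for i, w in enumerate(words):
        chart[(i, i + 1)] = {cat: [] for cat, _ in LEXICON[w]}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            cell = {}
            for k in range(i + 1, j):
                for left in chart[(i, k)]:
                    for right in chart[(k, j)]:
                        parent = BINARY.get((left, right))
                        if parent is not None:
                            cell.setdefault(parent, []).append((k, left, right))
            chart[(i, j)] = cell
    return chart


def agreement(chart, words, i, j, cat, stats):
    """Phase 2: number values an edge can carry, computed only inside the CFG forest."""
    if j - i == 1:
        return {num for c, num in LEXICON[words[i]] if c == cat}
    result = set()
    for k, left, right in chart[(i, j)][cat]:
        lv = agreement(chart, words, i, k, left, stats)
        rv = agreement(chart, words, k, j, right, stats)
        stats["checks"] += 1           # stands in for one costly unification
        if cat == "S":
            result |= lv & rv          # subject and verb phrase must agree
        else:
            result |= lv               # VP inherits number from its head verb
    return result


def parse(words, root="S"):
    """Return (accepted, number of phase-2 checks performed)."""
    chart = cfg_chart(words)
    stats = {"checks": 0}
    if root not in chart[(0, len(words))]:
        return False, 0                # rejected by the CFG filter alone
    return bool(agreement(chart, words, 0, len(words), root, stats)), stats["checks"]
```

For example, `parse("chase dogs cats".split())` is rejected in phase 1 without any constraint check, while `parse("dog chase cats".split())` passes the CFG filter and is rejected only in phase 2.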
Evaluation of HPSG Parsers
DFKI, Stanford, U-Tokyo
Processing time per sentence (sec)
Grammar   Corpus (average length: words)   Naïve parser   TNT parser   LKB Parser (Stanford: DFKI)
LinGO     csli (5.8)                        0.68           0.12         0.23
LinGO     aged (8.4)                        1.72           0.31         0.61
LinGO     blend (11)                       14.71           1.90         3.10
XHPSG     ATIS (7.42)                      14.27           0.30         -
SLUNG     EDR (20.5)                        0.88           0.38         -
Sun UltraSparc, 336 MHz, 6 GB main memory
Packed Feature Structure
• Each dependency function corresponds to one of the input feature structures
A set of feature structures:
  verb [VMODE indicative, PASSIVE false, TENSE past]
  verb [VMODE past_part, PASSIVE true, TENSE tense]
  verb [VMODE past_part, PASSIVE false, TENSE tense]
Packed feature structure:
  verb [VMODE 1, PASSIVE 2, TENSE 3]
  1: indicative or past_part; 2: false or true; 3: past or tense
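The packing on this slide can be imitated with a short Python sketch. It is a simplification under stated assumptions, not the ACL 99 algorithm: structures are flat dictionaries with identical feature sets, the names `pack` and `unpack` are invented here, and no unification is performed on the packed form. Features whose value is the same everywhere stay in the shared part; the others become numbered variables, and each input structure is recorded as one dependency function from variables to values.

```python
def pack(structures):
    """Split flat feature structures into a shared part and dependency functions."""
    shared, variables = {}, {}
    for feat in structures[0]:
        values = [s[feat] for s in structures]
        if all(v == values[0] for v in values):
            shared[feat] = values[0]             # identical everywhere: keep inline
        else:
            variables[feat] = len(variables) + 1
            shared[feat] = ("var", variables[feat])
    # One dependency function per input structure: variable index -> value.
    deps = [{idx: s[feat] for feat, idx in variables.items()} for s in structures]
    return shared, deps


def unpack(shared, deps):
    """Rebuild the original structures from the packed representation."""
    out = []
    for dep in deps:
        out.append({
            feat: dep[val[1]] if isinstance(val, tuple) else val
            for feat, val in shared.items()
        })
    return out


ENTRIES = [
    {"type": "verb", "VMODE": "indicative", "PASSIVE": "false", "TENSE": "past"},
    {"type": "verb", "VMODE": "past_part", "PASSIVE": "true", "TENSE": "tense"},
    {"type": "verb", "VMODE": "past_part", "PASSIVE": "false", "TENSE": "tense"},
]
shared, deps = pack(ENTRIES)
```

Here `shared` holds the type `verb` once and the variables 1 to 3 for VMODE, PASSIVE, and TENSE, matching the slide. The benefit in a real parser comes from operating on the shared part once rather than once per lexical entry.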
Experimental Results (1)
• Execution time for unification
Test data   # of LEs   Unpacked (msec.)   Packed (msec.)   Improvement
credited    37         36.5               5.7              6.4
walked      79         77.2               9.2              8.4
• Packing achieved a considerable speed-up in unification
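The Improvement column is consistent with the ratio of unpacked to packed time. The check below uses only the figures from the table; the variable names are arbitrary.

```python
# (unpacked msec, packed msec) per test word, as reported in the table.
TIMES = {"credited": (36.5, 5.7), "walked": (77.2, 9.2)}

speedup = {word: round(unpacked / packed, 1) for word, (unpacked, packed) in TIMES.items()}
```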
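Building Large-Scale Grammars
English grammars
Joint work with Stanford University and DFKI: LinGO grammar (HPSG)
Joint work with the University of Pennsylvania: conversion of the XTAG grammar
Conversion involving manual work (XHPSG)
Automatic conversion between the two grammar frameworks
Japanese grammars
SLUNG: an underspecified Japanese grammar
KNP: dependency analysis, a highly robust Japanese grammar (Kyoto University)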
Overview of GENIA Project
[Diagram: ① a researcher with a question sends ② a query to ③ GENIA. GENIA combines Information Retrieval and Information Extraction (pre-processing, named entity, template element, scenario template) with Learning, drawing on terminology, databases, corpora, ontology, thesaurus, and WWW links. ④ The information extracted is returned as ⑤ an answer to the question.]
CSNDB
(National Institute of Health Sciences)
• A data- and knowledge-base for signaling pathways of human cells.
– It compiles information on biological molecules, sequences, structures, functions, and the biological reactions that transfer cellular signals.
– Signaling pathways are compiled as binary relationships between biomolecules and represented by automatically drawn graphs.
– CSNDB is built on ACEDB and the inference engine CLIPS, and is linked to TRANSFAC.
– The final goal is to build a computerized model of various biological phenomena.
Example 3
• A Polymerization Reaction
Signal_Reaction: “Ah receptor + HSP90”
  Component    “Ah receptor” “HSP90”
  Effect       “activation dissociation”
  Interaction  “PAS domain” “of Ah receptor”
  Activity     “inactivation of Ah receptor”
  Reference    [Powell-Coffman_1998]
  Excerpted    @[Takai98]
Syntax/Semantics
An active phorbol ester must therefore, presumably by activation of protein kinase C, cause dissociation of a cytoplasmic complex of NF-kappa B and I kappa B by modifying I kappa B.
E1: An active phorbol ester activates protein kinase C.
E2: The active phorbol ester modifies I kappa B.
E3: It dissociates a cytoplasmic complex of NF-kappa B and I kappa B.
Part-Whole
Language and Knowledge Processing: Toward Understanding
Revised Tag Set and Underlying Ontology
(relations shown in the figure: Is-a, part-of)
+-name-+-source-+-natural-+-organism-+-multi-cell organism
       |        |         |          +-mono-cell organism
       |        |         |          +-virus
       |        |         +-tissue
       |        |         +-cell type
       |        |         +-sub-location of cells
       |        +-artificial-+-cell line
       +-substance-+-compound-+-inorganic
                              +-organic-+-amino-+-protein-+-protein family/group
                                        |       |         +-protein complex
                                        |       |         +-Individual molecule
                                        |       |         +-UnitOfProteinComplex
                                        |       |         +-SubstructureOfProtein
                                        |       |         +-Domain/RegionOfProtein
                                        |       +-peptide
                                        |       +-amino acid monomer
                                        +-DNA-+-DNA family or group
                                        |     +-individual DNA molecule
                                        |     +-domain or region of DNA
                                        +-RNA-+-RNA family or group
                                              +-individual RNA molecule
                                              +-domain or region of RNA
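An ontology of this shape is often stored as a child-to-parent table so that is-a questions become a walk toward the root. The Python sketch below encodes part of the tree this way. The table layout and the function names are assumptions for illustration, and only the source branch and the protein branch are included.

```python
# Child -> parent links for part of the tag ontology (source and protein branches).
PARENT = {
    "source": "name", "substance": "name",
    "natural": "source", "artificial": "source",
    "organism": "natural", "tissue": "natural",
    "cell type": "natural", "sub-location of cells": "natural",
    "multi-cell organism": "organism", "mono-cell organism": "organism",
    "virus": "organism",
    "cell line": "artificial",
    "compound": "substance",
    "inorganic": "compound", "organic": "compound",
    "amino": "organic",
    "protein": "amino", "peptide": "amino", "amino acid monomer": "amino",
    "protein family/group": "protein", "protein complex": "protein",
}


def path_to_root(cls):
    """Return cls followed by its ancestors up to the root class."""
    path = [cls]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def is_a(cls, ancestor):
    """True if cls equals ancestor or lies below it in the hierarchy."""
    return ancestor in path_to_root(cls)
```

For instance, `is_a("virus", "source")` holds while `is_a("protein complex", "source")` does not, which is the kind of test a semantic pattern such as "substance ACTIVATE protein" needs when it is matched against tagged terms.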
Event Ontology
[Figure: event frames REACTION1 to REACTION5, each listing attribute1, attribute2, and further attributes.]
• substance ACTIVATE substance
• substance ACTIVATE protein
• protein ACTIVATE pathway
• PHOSPHORYLATE
• INHIBIT
• REGULATE
Example of NE Annotation
UI - 85146267
TI - Characterization of <NE ti="3" class="protein" nm="aldosterone binding site" mt="SV" subclass="family_or_group" unsure="Class"
cmt="">aldosterone binding sites</NE ti="3"> in circulating <NE ti="2" class="cell_type" nm="human mononuclear leukocyte"
mt="SV" unsure="OK" cmt="">human mononuclear leukocytes</NE ti="2">.
AB - <NE ti="4" class="protein" nm="Aldosterone binding sites" mt="SV" subclass="family_or_group" unsure="Class"
cmt="">Aldosterone binding sites</NE ti="4"> in <NE ti="1" class="cell_type" nm="human mononuclear leukocyte" mt="SV"
unsure="OK" cmt="">human mononuclear leukocytes</NE ti="1"> were characterized after separation of cells from blood by a Percoll
gradient. After washing and resuspension in <NE ti="5" class="other_organic_compounds" nm="RPMI-1640 medium" mt="SV"
unsure="OK" cmt="">RPMI-1640 medium</NE ti="5">, cells were incubated at 37 degrees C for 1 h with different concentrations of
<NE ti="6" class="other_organic_compounds" nm="[3H]aldosterone" mt="SV" unsure="OK" cmt="">[3H]aldosterone</NE ti="6">
plus a 100-fold concentration of <NE ti="7" class="other_organic_compounds" nm="RU-26988" mt="SV" unsure="OK" cmt="">RU-26988</NE ti="7"> (<NE ti="17" class="other_organic_compounds" nm="11 alpha, 17 alpha-dihydroxy-17 beta-propynylandrost-1,4,6-trien-3-one" mt="SV" unsure="OK" cmt="">11 alpha, 17 alpha-dihydroxy-17 beta-propynylandrost-1,4,6-trien-3-one</NE ti="17">),
with or without an excess of unlabeled <NE ti="8" class="other_organic_compounds" nm="aldosterone" mt="SV" unsure="OK"
cmt="">aldosterone</NE ti="8">. <NE ti="9" class="other_organic_compounds" nm="Aldosterone" mt="SV" unsure="OK"
cmt="">Aldosterone</NE ti="9"> binds to a single class of <NE ti="10" class="protein" nm="receptor" mt="SV"
subclass="family_or_group" unsure="OK" cmt="">receptors</NE ti="10"> with an affinity of 2.7 +/- 0.5 nM (means +/- SD, n = 14)
and a capacity of 290 +/- 108 sites/cell (n = 14). The specificity data show a hierarchy of affinity of <NE ti="11"
class="other_organic_compounds" nm="desoxycorticosterone" mt="SV" unsure="OK" cmt="">desoxycorticosterone</NE ti="11"> =
<NE ti="12" class="other_organic_compounds" nm="corticosterone" mt="SV" unsure="OK" cmt="">corticosterone</NE ti="12"> =
<NE ti="13" class="other_organic_compounds" nm="aldosterone" mt="SV" unsure="OK" cmt="">aldosterone</NE ti="13"> greater
than <NE ti="14" class="other_organic_compounds" nm="hydrocortisone" mt="SV" unsure="OK" cmt="">hydrocortisone</NE
ti="14"> greater than <NE ti="15" class="other_organic_compounds" nm="dexamethasone" mt="SV" unsure="OK"
cmt="">dexamethasone</NE ti="15">. The results indicate that <NE ti="17" class="cell_type" nm="mononuclear leukocyte" mt="SV"
unsure="OK" cmt="">mononuclear leukocytes</NE ti="17"> could be useful for studying the physiological significance of these <NE
ti="16" class="protein" nm="mineralocorticoid receptor" mt="SV" subclass="family_or_group" unsure="OK"
cmt="">mineralocorticoid receptors</NE ti="16"> and their regulation in humans.
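Because the closing tags in this annotation repeat the `ti` attribute (`</NE ti="3">`), the text is not well-formed XML and a standard XML parser would reject it. The Python sketch below reads it with regular expressions instead. The function names and the returned dictionary layout are assumptions for illustration, nested NE elements are not handled, and the sample is the title line of the record above.

```python
import re

# Opening tag with ti, remaining attributes, entity text, closing tag repeating ti.
NE_PATTERN = re.compile(
    r'<NE\s+ti="(?P<ti>\d+)"(?P<attrs>[^>]*)>(?P<text>.*?)</NE\s+ti="(?P=ti)">',
    re.DOTALL,
)
ATTR_PATTERN = re.compile(r'(\w+)="([^"]*)"')


def extract_entities(annotated):
    """Return one dict per NE element: ti, surface text, and the other attributes."""
    entities = []
    for m in NE_PATTERN.finditer(annotated):
        record = {"ti": int(m.group("ti")), "text": m.group("text")}
        record.update(ATTR_PATTERN.findall(m.group("attrs")))
        entities.append(record)
    return entities


def strip_tags(annotated):
    """Recover the plain abstract text by deleting all NE tags."""
    return re.sub(r'</?NE\b[^>]*>', '', annotated)


SAMPLE = (
    'Characterization of <NE ti="3" class="protein" nm="aldosterone binding site" '
    'mt="SV" subclass="family_or_group" unsure="Class" cmt="">aldosterone binding sites'
    '</NE ti="3"> in circulating <NE ti="2" class="cell_type" '
    'nm="human mononuclear leukocyte" mt="SV" unsure="OK" cmt="">'
    'human mononuclear leukocytes</NE ti="2">.'
)
entities = extract_entities(SAMPLE)
plain = strip_tags(SAMPLE)
```

Each record keeps the `class`, `nm`, `subclass`, and `unsure` attributes, so counts such as those in the tagging statistics can be produced by grouping on `class` and `subclass`.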
TIMS: Tag Information Management System
[Diagram: TIMS (SXML document management over an XML database) exchanges XML data with the following tools:
  Will/TreeEdit: XML tree viewer / XML tree editor
  JTAG: manual tagging aid interface
  LiLFeS/XHPSG: HPSG-based syntactic/semantic parser
  Mining
  VTAG: automatic tagging workbench]
Tagging of 400 Abstracts
• Sentences: about 4,000
• Words: about 100,000
• Tagged items
– About 12,000 in total
• SOURCE: 3123
• DNA: 945
• RNA: 100
• PROTEIN: 2639
• Other: 5180
36 semantic subclasses
TAG NAME                    sub class             Count
organism                    multi-cell organism     477
organism                    mono-cell organism       20
organism                    virus                   153
tissue                      -                       213
cell type                   -                      1478
sub-location of cells       -                        79
other (natural source)      -                         1
cell line                   -                       695
other (artificial source)   -                         7
protein                     family or group        1172
protein                     complex                 170
protein                     molecule               1181
protein                     subunit                  65
protein                     substructure             29
protein                     domain or region         77
protein                     N/A                      98
peptide                     -                        40
amino acid monomer          -                        27
DNA                         family or group          29
DNA                         complex                   0
DNA                         molecule                 81
DNA                         subunit                   0
DNA                         substructure             41
DNA                         domain or region        770
DNA                         N/A                      24
RNA                         family or group          13
RNA                         complex                   0
RNA                         molecule                 80
RNA                         subunit                   0
RNA                         substructure              1
RNA                         domain or region          2
RNA                         N/A                       4
other polymer               -                        43
nucleic acid monomer        -                        47
lipid                       -                      1113
carbohydrate                -                        10
other organic compounds     -                       829
inorganic                   -                        29
atom                        -                        29
other name                  -                      2850
Frequency Distribution of Classes
[Figure: chart of tag frequency by class. Classes shown: organism, tissue, cell type, other name, sub-location of cells, other (natural source), atom, cell line, inorganic, artificial source, protein, other organic compounds, peptide, lipid, carbohydrate, amino acid monomer, nucleic acid monomer, DNA, other polymer, RNA.]
Verbs That Occur Frequently in Abstracts
• Number of occurrences in 925 CSNDB (NIHS) abstracts (excluding have and be)
show       375
bind       226
indicate   195
suggest    183
induce     162
inhibit    148
mediate    140
report     139
activate   135
require    130
show
• NP show that-clause
– researcher show conclusion
– experiment show conclusion
• NP show NP
– structure show property
• NP be shown to-infinitive
– substance be shown to reaction
• it is shown that-clause
– it is shown conclusion
inhibit
• NP inhibit NP
– substance inhibit reaction
– substance inhibit pathway
– substance inhibit substance
– substance inhibit source
– reaction inhibit substance
– reaction inhibit reaction
– structure inhibit pathway
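Patterns like these can be held in a small lookup table and used to test whether a subject and object class are licensed for a verb. The Python sketch below covers only the "NP verb NP" frames listed on these slides; the table layout and the `licensed` function are assumptions for illustration, not the project's lexicon format.

```python
# Transitive (NP verb NP) frames: verb -> set of (subject class, object class).
TRANSITIVE_FRAMES = {
    "show": {("structure", "property")},
    "inhibit": {
        ("substance", "reaction"),
        ("substance", "pathway"),
        ("substance", "substance"),
        ("substance", "source"),
        ("reaction", "substance"),
        ("reaction", "reaction"),
        ("structure", "pathway"),
    },
}


def licensed(verb, subject_class, object_class):
    """True if the verb has a transitive frame for this pair of semantic classes."""
    return (subject_class, object_class) in TRANSITIVE_FRAMES.get(verb, set())
```

With this table, `licensed("inhibit", "substance", "pathway")` holds and `licensed("inhibit", "pathway", "substance")` does not. The number of frames per verb is what drives the count of dictionary entries needed for each verb.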
Syntactic and Semantic Patterns of Frequent Verbs
• How many kinds of dictionary entries are needed?
show       5
bind       4
indicate   5
suggest    5
induce     4
inhibit    7
mediate    6
report     5
activate   9
require    4
Example semantic representations of “indicate” (LiLFeS)
% the structure indicates mechanisms
semantic_primitive(Tnx0Vnx1, indicate,
    SYNSEM\LOCAL\CONX\IND\(transitive &
        ARG1\chem_struct &
        ARG2\mechanism)).
% these findings indicate an unexpected role of …
semantic_primitive(Tnx0Vnx1, indicate,
    SYNSEM\LOCAL\CONX\IND\(transitive &
        ARG1\research &
        ARG2\$OBJ)).
% the data indicate that …
semantic_primitive(Tnx0Vs1, indicate,
    SYNSEM\LOCAL\CONX\IND\(transitive &
        ARG1\research &
        ARG2\$OBJ)).
Experiment
(A. Yakushiji et al., PSB2001)
XHPSG: an HPSG-like grammar translated from the XTAG grammar of U-Penn (Y. Tateishi, TAG+ workshop 98)
Terms (compound nouns) are chunked beforehand.
180 sentences from abstracts in MEDLINE
Average parse time per sentence: 2.7 sec with a naïve parser
(The multi-stage parser improves on this by a factor of 10)
[Figure: parse time (msec, logarithmic scale from 1 to 100000) against sentence length in words (0 to 40), plotted before and after the introduction of TNT.]
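Argument Frame Extractor
133 argument structures, marked by a domain specialist, in 97 of the 180 sentences
[Figure: breakdown of the 133 argument structures into five categories: extracted uniquely; extracted with ambiguity; extractable from PPs; not extractable; parsing failures (memory limitation, etc.). Values shown: 31, 32, 26, 27, 17, and 68%. The pairing of values with categories did not survive extraction.]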
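KNP
[Figure: KNP dependency analysis of the sentence 「企業や金融機関に不良債権の早急な処理を促し、特に金融機関には「この過程で従来のような横並びの決算や配当が維持されるのではなく、経営格差を顕在化させる覚悟を求めたい」としている。」 (roughly: "It urges companies and financial institutions to dispose of bad loans promptly, and asks financial institutions in particular to be prepared for differences in management performance to surface in this process, rather than for the conventional lockstep settlements and dividends to be maintained."). Conjuncts of coordinate structures are marked <P> and the coordination nodes PARA.]
Accuracy: about 90%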
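System Overview
[Diagram: the user asks a question such as "How do I send mail with Mew?" through the user interface (WWW browser). Components: input analysis module, dialogue management module, knowledge database, mail sending module.]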
Evaluation of Dialogue Data
Success: 38%
Failure (knowledge): about 30%, decreasing
Failure (dialogue management): about 5%, increasing
[Figure: stacked chart on a scale of 0 to 200, one bar per week from Jul 05 - Jul 11 to Jan 17 - Jan 23. Categories: meaningless, out of scope, failure: difficult, failure: knowledge, failure: dialogue management, failure: input analysis, success.]
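Research Results (Tokyo Institute of Technology)
• Improving recall
– Query expansion using multiple thesauri (1998-1999)
– Cluster-based information retrieval (1997)
– Large-scale text clustering (1996-1997)
• Improving precision
– Information retrieval using case frames (1996)
– Refinement and selective use of index terms (1999-2000)
• Balancing recall and precision
– Multi-stage retrieval model (1999-2000)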
Scenario (Tokyo Institute of Technology)
[Diagram: Query + Thesauri → Query Expansion → Expanded Query → Initial Retrieval (over the Document Collection) → Intermediate Result → Index Term Refinement → Revised Query → Second Retrieval → Final Result]
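The staged flow in this scenario can be outlined in Python as follows. Everything concrete here is an assumption made for illustration: the toy thesaurus and documents, the function names, and in particular the refinement step, which simply down-weights terms added by expansion. The slides do not specify how index terms are refined, so that step is a stand-in rather than the method used in the project.

```python
# Hypothetical toy data.
THESAURUS = {"car": ["automobile", "vehicle"]}
DOCS = {
    "d1": "car engine repair",
    "d2": "automobile insurance",
    "d3": "vehicle tax and insurance",
    "d4": "cooking recipes",
}


def expand(query, thesaurus):
    """Query expansion: add thesaurus entries for each query term."""
    expanded = list(query)
    for term in query:
        for related in thesaurus.get(term, []):
            if related not in expanded:
                expanded.append(related)
    return expanded


def initial_retrieval(terms, docs):
    """Recall-oriented stage: keep any document containing at least one term."""
    return {doc_id for doc_id, text in docs.items() if set(terms) & set(text.split())}


def refine(expanded, original, expansion_weight=0.5):
    """Stand-in refinement: original terms weigh 1.0, expanded terms less."""
    return {t: (1.0 if t in original else expansion_weight) for t in expanded}


def second_retrieval(weights, docs, candidates):
    """Precision-oriented stage: rank the intermediate result by weighted term overlap."""
    def score(doc_id):
        words = set(docs[doc_id].split())
        return sum(w for t, w in weights.items() if t in words)
    return sorted(candidates, key=lambda d: (-score(d), d))


query = ["car", "insurance"]
expanded = expand(query, THESAURUS)
intermediate = initial_retrieval(expanded, DOCS)
ranking = second_retrieval(refine(expanded, query), DOCS, intermediate)
```

The first stage trades precision for recall by matching any expanded term; the second stage reorders only the intermediate result, which keeps the more selective scoring cheap.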
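Workshops
Year 1: closed workshop to launch the project
Year 2: focus on applications such as IR (co-sponsored with Hitachi Advanced Research Laboratory)
Year 3: the relationship between theory and applications (co-sponsored with Hitachi Central Research Laboratory)
Year 4: Parsing Strategy (Germany) (co-sponsored with DFKI and Stanford University) (special issue of a journal, book from CSLI)
Tutorials
NLP for Bio-Informatics: jointly with the Eureka Group (PSB2001)
Jointly with Eureka and TIDES (ISMB2001)
Collaborative research
Stanford, DFKI, Pennsylvania, UMIST, University of Rome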
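Future Research Topics
1. Structural processing and probabilistic processing
Processing in a rich probability space that extends to the semantic space
2. Theoretical foundations for inter-conversion and equivalence of grammar descriptions
Sharing of language resources, contributions to theoretical linguistics
3. Databases of large-scale feature structures
Interrelation with XML databases
4. Controlled mechanisms for unsupervised learning
Identifying semantic classes, learning grammars from data
5. Inter-textual context processing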