Research on Natural Language Processing and Understanding
Jun'ichi Tsujii
Department of Information Science, Graduate School of Science, University of Tokyo

Project Goals
1. Academic goals
 – Integrating structural language processing with probabilistic and statistical language processing
 – Contributing to engineering from a theory-driven approach
 – Language processing and knowledge processing
2. Social impact
 – Language processing for the network era
 – Knowledge acquisition from text, information retrieval, dialogue systems
3. International dissemination
 – Active international joint research
 – Focused international workshops with concrete goals
 – Building a close framework for research cooperation

A grammar framework that is sound from the standpoint of theoretical linguistics
• A grammar framework based on typed feature structures
 – Processing efficiency
 – Robustness
• Bias in grammar descriptions
 – Application to real-world text
 – Systematic extension of the grammar

Processing efficiency
• Abstract Machine for Unification (T. Makino et al.)
 – Prolog with Typed Feature Structures (LiLFeS); Coling 98, JNE-00
• CFG Approximation (K. Torisawa et al.)
 – Multi-staged Parsing (TNT); Coling 98, JNE-00
• Preventing Combinatorial Explosion (Y. Miyao)
 – Packing of FSs; ACL 99

Abstract Machine (Carpenter and Qu, 1995)
[Figure: a typed feature structure with types nelist, list, foo and bot, features FIRST and REST, and a shared tag 1, shown in two forms. "Abstract machine code of a TFS" uses the instructions PUSH, ADDNEW, UNIFYVAR and POP. "TFS data on memory" shows heap cells numbered 1 to 6 with the tags STR, VAR and PTR.]

LiLFeS: Performance (2/2)
Comparison with other inference engines for typed feature structures
[Figure: bar chart, axis from 0 to 20 with an arrow labeled FASTER, benchmark label HPSG. Systems compared: LiLFeS (Native Code Compiler), LiLFeS (Byte Code Emulator), ProFIT on SICStus Emulator, ALE 3.1 on SICStus Emulator.]
Machine: Intel Pentium II 400 MHz
Grammar: a small grammar distributed with ALE
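The operation that the abstract machine and LiLFeS accelerate is unification of typed feature structures. The following is a minimal sketch of that operation in Python, not the LiLFeS algorithm itself. The toy type hierarchy in `PARENT` (with `elist` added for illustration), the `(type, features)` tuple layout, and the function names are all assumptions, and structure sharing (the tag 1 and the PTR cells in the figure) is deliberately left out.

```python
# Toy type hierarchy, an assumption for illustration: child -> parent.
# "bot" is the most general type, as in the slide's figure.
PARENT = {"list": "bot", "nelist": "list", "elist": "list", "foo": "bot"}


def ancestors(t):
    """Return t followed by all of its supertypes, most specific first."""
    chain = [t]
    while t in PARENT:
        t = PARENT[t]
        chain.append(t)
    return chain


def unify_types(a, b):
    """In a tree-shaped hierarchy two types unify only if one subsumes
    the other; the result is the more specific of the two."""
    if b in ancestors(a):
        return a
    if a in ancestors(b):
        return b
    return None


def unify(x, y):
    """Unify two feature structures given as (type, {feature: value}).
    Returns a new structure, or None if the types clash anywhere.
    Reentrancy (shared substructures) is not modelled."""
    t = unify_types(x[0], y[0])
    if t is None:
        return None
    feats = dict(x[1])
    for name, val in y[1].items():
        if name in feats:
            sub = unify(feats[name], val)
            if sub is None:
                return None
            feats[name] = sub
        else:
            feats[name] = val
    return (t, feats)


general = ("list", {})
specific = ("nelist", {"FIRST": ("foo", {}), "REST": ("list", {})})
result = unify(general, specific)
clash = unify(("elist", {}), ("nelist", {}))
```

Here `result` carries the more specific type `nelist` together with both features, while `clash` is `None` because `elist` and `nelist` have no common subtype in this toy hierarchy. A real engine works destructively on heap cells rather than copying dictionaries, which is where the compiled abstract machine code gains its speed.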
Filtering with CFG (1/5)
• 2-phased parsing
 – Approximate the HPSG grammar with a CFG while keeping the important constraints.
 – The resulting CFG may over-generate, but it can still be used for filtering.
 – Rewriting with a CFG is far less expensive than applying rule schemata, principles and so on.
[Figure: processing flow. HPSG is compiled into a CFG; input sentences are parsed by the built-in CFG parser; feature structures are then added by LiLFeS unification; the output is the set of complete parse trees.]

Evaluation of HPSG Parsers (DFKI, Stanford, U-Tokyo)
Processing time per sentence (sec)

Grammar   Corpus (average length in words)   Naïve parser   TNT parser   LKB Parser (Stanford: DFKI)
LinGO     csli (5.8)                          0.68           0.12         0.23
LinGO     aged (8.4)                          1.72           0.31         0.61
LinGO     blend (11)                         14.71           1.90         3.10
XHPSG     ATIS (7.42)                        14.27           0.30         -
SLUNG     EDR (20.5)                          0.88           0.38         -

Machine: Sun UltraSparc, 336 MHz, 6 GB main memory

Packed Feature Structure
• Each dependency function corresponds to one of the input feature structures.
A set of feature structures:
 verb  VMODE indicative  PASSIVE false  TENSE past
 verb  VMODE past_part   PASSIVE true   TENSE tense
 verb  VMODE past_part   PASSIVE false  TENSE tense
Packed feature structure:
 verb  VMODE 1  PASSIVE 2  TENSE 3
 1: indicative, past_part
 2: false, true
 3: past, tense

Experimental Results (1)
• Execution time for unification

Test data   # of LEs   Unpacked (msec.)   Packed (msec.)   Improvement
credited    37         36.5               5.7              6.4
walked      79         77.2               9.2              8.4

• Packing achieved a considerable speed-up in unification.

Building large-scale grammars
• English grammars
 – Joint work with Stanford University and DFKI: the LinGO grammar (HPSG)
 – Joint work with the University of Pennsylvania: conversion of the XTAG grammar
   conversion that involves manual work (XHPSG)
   automatic conversion between the two grammar frameworks
• Japanese grammars
 – SLUNG: an underspecified Japanese grammar
 – KNP: dependency analysis, a highly robust Japanese grammar (Kyoto University)

Overview of GENIA Project
[Figure: ① a researcher with a question; ② query; ③ GENIA; ④ information extracted; ⑤ answer to the question. Components shown: Information Retrieval; Information Extraction (pre-processing, named entity, template element, scenario template); Learning; Terminology; Databases; Corpora; Ontology; Thesaurus; WWW; Links.]

CSNDB (National Institute of Health Sciences)
• A data- and knowledge-base for signaling pathways of human cells.
 – It compiles information on biological molecules, sequences, structures, functions, and the biological reactions that transfer cellular signals.
 – Signaling pathways are compiled as binary relationships between biomolecules and represented by automatically drawn graphs.
 – CSNDB is built on ACEDB and the inference engine CLIPS, and is linked to TRANSFAC.
 – The final goal is a computerized model of various biological phenomena.

Example 3
• A Polymerization Reaction
 Signal_Reaction: "Ah receptor + HSP90"
  Component: "Ah receptor" "HSP90"
  Effect: "activation dissociation"
  Interaction: "PAS domain" "of Ah receptor"
  Activity: "inactivation of Ah receptor"
  Reference: [Powell-Coffman_1998]
 Excerpted @[Takai98]

Syntax/Semantics
An active phorbol ester must therefore, presumably by activation of protein kinase C, cause dissociation of a cytoplasmic complex of NF-kappa B and I kappa B by modifying I kappa B.
E1: An active phorbol ester activates protein kinase C.
E2: The active phorbol ester modifies I kappa B.
E3: It dissociates a cytoplasmic complex of NF-kappa B and I kappa B.
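CSNDB stores pathways as binary relationships between biomolecules, and the events E1 to E3 above can be written in the same shape. The sketch below is one way to hold them, under stated assumptions: the `(agent, relation, target)` tuple layout and the function names are hypothetical, and reading "It" in E3 as the phorbol ester is an interpretation of the example sentence, not something the slide states.

```python
from collections import defaultdict

# Events E1-E3 from the example sentence as (agent, relation, target).
# Assumption: "It" in E3 is resolved to the active phorbol ester.
EVENTS = [
    ("active phorbol ester", "activate", "protein kinase C"),
    ("active phorbol ester", "modify", "I kappa B"),
    ("active phorbol ester", "dissociate",
     "cytoplasmic complex of NF-kappa B and I kappa B"),
]


def build_graph(events):
    """Index binary relations by agent: agent -> [(relation, target), ...]."""
    graph = defaultdict(list)
    for agent, relation, target in events:
        graph[agent].append((relation, target))
    return dict(graph)


def targets_of(graph, agent, relation):
    """Return every target that `agent` is linked to by `relation`."""
    return [t for r, t in graph.get(agent, []) if r == relation]


graph = build_graph(EVENTS)
activated = targets_of(graph, "active phorbol ester", "activate")
```

With these three events, `activated` holds the single entry for protein kinase C. Keeping each event as one edge is what lets a database of this kind draw pathway graphs automatically; richer attributes such as Effect or Reference would hang off the edge.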
Part-Whole

Language and knowledge processing: toward understanding

Revised Tag Set and Underlying Ontology
(relation labels in the figure: Is-a, part-of)
+-name-+-source-+-natural-+-organism-+-multi-cell organism
       |        |         |          +-mono-cell organism
       |        |         |          +-virus
       |        |         +-tissue
       |        |         +-cell type
       |        |         +-sub-location of cells
       |        +-artificial-+-cell line
       +-substance-+-compound-+-inorganic
                              +-organic-+-amino-+-protein-+-protein family/group
                                        |       |         +-protein complex
                                        |       |         +-Individual molecule
                                        |       |         +-UnitOfProteinComplex
                                        |       |         +-SubstructureOfProtein
                                        |       |         +-Domain/RegionOfProtein
                                        |       +-peptide
                                        |       +-amino acid monomer
                                        +-DNA-+-DNA family or group
                                        |     +-individual DNA molecule
                                        |     +-domain or region of DNA
                                        +-RNA-+-RNA family or group
                                              +-individual RNA molecule
                                              +-domain or region of RNA

Event Ontology
REACTION1 to REACTION5, each with attribute1, attribute2, ...
• substance ACTIVATE substance
• substance ACTIVATE protein
• protein ACTIVATE pathway
• PHOSPHORYLATE
• INHIBIT
• REGULATE

Example of NE Annotation
UI - 85146267
TI - Characterization of <NE ti="3" class="protein" nm="aldosterone binding site" mt="SV" subclass="family_or_group" unsure="Class" cmt="">aldosterone binding sites</NE ti="3"> in circulating <NE ti="2" class="cell_type" nm="human mononuclear leukocyte" mt="SV" unsure="OK" cmt="">human mononuclear leukocytes</NE ti="2">.
AB - <NE ti="4" class="protein" nm="Aldosterone binding sites" mt="SV" subclass="family_or_group" unsure="Class" cmt="">Aldosterone binding sites</NE ti="4"> in <NE ti="1" class="cell_type" nm="human mononuclear leukocyte" mt="SV" unsure="OK" cmt="">human mononuclear leukocytes</NE ti="1"> were characterized after separation of cells from blood by a Percoll gradient.
After washing and resuspension in <NE ti="5" class="other_organic_compounds" nm="RPMI-1640 medium" mt="SV" unsure="OK" cmt="">RPMI-1640 medium</NE ti="5">, cells were incubated at 37 degrees C for 1 h with different concentrations of <NE ti="6" class="other_organic_compounds" nm="[3H]aldosterone" mt="SV" unsure="OK" cmt="">[3H]aldosterone</NE ti="6"> plus a 100-fold concentration of <NE ti="7" class="other_organic_compounds" nm="RU-26988" mt="SV" unsure="OK" cmt="">RU26988 </NE ti="7">(<NE ti="17" class="other_organic_compounds" nm="11 alpha, 17 alpha-dihydroxy-17 beta-propynylandrost-1,4,6trien-3-one" mt="SV" unsure="OK" cmt="">11 alpha, 17 alpha-dihydroxy-17 beta-propynylandrost-1,4,6-trien-3-one</NE ti="17">), with or without an excess of unlabeled <NE ti="8" class="other_organic_compounds" nm="aldosterone" mt="SV" unsure="OK" cmt="">aldosterone</NE ti="8">. <NE ti="9" class="other_organic_compounds" nm="Aldosterone" mt="SV" unsure="OK" cmt="">Aldosterone</NE ti="9"> binds to a single class of <NE ti="10" class="protein" nm="receptor" mt="SV" subclass="family_or_group" unsure="OK" cmt="">receptors</NE ti="10"> with an affinity of 2.7 +/- 0.5 nM (means +/- SD, n = 14) and a capacity of 290 +/- 108 sites/cell (n = 14). The specificity data show a hierarchy of affinity of <NE ti="11" class="other_organic_compounds" nm="desoxycorticosterone" mt="SV" unsure="OK" cmt="">desoxycorticosterone</NE ti="11"> = <NE ti="12" class="other_organic_compounds" nm="corticosterone" mt="SV" unsure="OK" cmt="">corticosterone</NE ti="12"> = <NE ti="13" class="other_organic_compounds" nm="aldosterone" mt="SV" unsure="OK" cmt="">aldosterone</NE ti="13"> greater than <NE ti="14" class="other_organic_compounds" nm="hydrocortisone" mt="SV" unsure="OK" cmt="">hydrocortisone</NE ti="14"> greater than <NE ti="15" class="other_organic_compounds" nm="dexamethasone" mt="SV" unsure="OK" cmt="">dexamethasone</NE ti="15">.
The results indicate that <NE ti="17" class="cell_type" nm="mononuclear leukocyte" mt="SV" unsure="OK" cmt="">mononuclear leukocytes</NE ti="17"> could be useful for studying the physiological significance of these <NE ti="16" class="protein" nm="mineralocorticoid receptor" mt="SV" subclass="family_or_group" unsure="OK" cmt="">mineralocorticoid receptors</NE ti="16"> and their regulation in humans.

TIMS (Tag Information Management System)
[Figure: tools that exchange XML data through TIMS, the XML database.]
• Will/TreeEdit: XML Tree Viewer / XML Tree Editor
• JTAG: Manual Tagging Aid Interface
• LiLFeS/XHPSG: HPSG-based Syntactic/Semantic Parser
• XML Data Mining
• VTAG: Automatic Tagging Workbench
• SXML: Document Management

Tagging of 400 abstracts
• Sentences: about 4,000
• Words: about 100,000
• Tagged items: about 12,000 in total
 – SOURCE 3123
 – DNA 945
 – RNA 100
 – PROTEIN 2639
 – Other 5180

36 semantic subclasses

TAG NAME                   sub class            Count
organism                   multi-cell organism    477
organism                   mono-cell organism      20
organism                   virus                  153
tissue                     -                      213
cell type                  -                     1478
sub-location of cells      -                       79
other (natural source)     -                        1
cell line                  -                      695
other (artificial source)  -                        7
protein                    family or group       1172
protein                    complex                170
protein                    molecule              1181
protein                    subunit                 65
protein                    substructure            29
protein                    domain or region        77
protein                    N/A                     98
DNA                        family or group         29
DNA                        complex                  0
DNA                        molecule                81
DNA                        subunit                  0
DNA                        substructure            41
DNA                        domain or region       770
DNA                        N/A                     24
RNA                        family or group         13
RNA                        complex                  0
RNA                        molecule                80
RNA                        subunit                  0
RNA                        substructure             1
RNA                        domain or region         2
RNA                        N/A                      4
other polymer              -                       43
peptide                    -                       40
nucleic acid monomer       -                       47
amino acid monomer         -                       27
lipid                      -                     1113
carbohydrate               -                       10
other organic compounds    -                      829
inorganic                  -                       29
atom                       -                       29
other name                 -                     2850

Frequency distribution of classes
[Figure: chart of class frequencies. Legend: organism, tissue, cell type, other name, sub-location of cells, other (natural source), atom, cell line, inorganic, artificial source, protein, other organic compounds, peptide, lipid, carbohydrate, amino acid monomer, nucleic acid monomer, DNA, other polymer, RNA.]

Verbs that occur frequently in abstracts
• Occurrence counts in 925 abstracts from CSNDB (National Institute of Health Sciences), excluding have and be:
 show 375
 bind 226
 indicate 195
 suggest 183
 induce 162
 inhibit 148
 mediate 140
 report 139
 activate 135
 require 130

show
• NP show that-clause
 – researcher show conclusion
 – experiment show conclusion
• NP show NP
 – structure show property
• NP be shown to-infinitive
 – substance be shown to reaction
• it is shown that-clause
 – it is shown conclusion

inhibit
• NP inhibit NP
 – substance inhibit reaction
 – substance inhibit pathway
 – substance inhibit substance
 – substance inhibit source
 – reaction inhibit substance
 – reaction inhibit reaction
 – structure inhibit pathway

Syntactic and semantic patterns of frequent verbs
• How many kinds of dictionary entries are needed?
 show 5
 bind 4
 indicate 5
 suggest 5
 induce 4
 inhibit 7
 mediate 6
 report 5
 activate 9
 require 4

Examples of semantic representations for "indicate" (LiLFeS)
Example sentence: the structure indicates mechanisms
 semantic_primitive(Tnx0Vnx1, indicate, SYNSEM\LOCAL\CONX\IND\(transitive & ARG1\chem_struct & ARG2\mechanism)).
Example sentence: these findings indicate an unexpected role of …
 semantic_primitive(Tnx0Vnx1, indicate, SYNSEM\LOCAL\CONX\IND\(transitive & ARG1\research & ARG2\$OBJ)).
Example sentence: the data indicate that …
 semantic_primitive(Tnx0Vs1, indicate, SYNSEM\LOCAL\CONX\IND\(transitive & ARG1\research & ARG2\$OBJ)).

Experiment (A. Yakushiji et al., PSB2001)
• XHPSG: an HPSG-like grammar translated from the XTAG grammar of U-Penn (Y. Tateishi, TAG+ workshop 98)
• Terms (compound nouns) are chunked beforehand.
• 180 sentences from abstracts in MEDLINE
• Average parse time per sentence: 2.7 sec with a naïve parser (the multi-stage parser improves this by a factor of 10)
[Figure: parse time (msec, logarithmic axis from 1 to 100000) against number of words (0 to 40), before and after introducing TNT.]

Argument Frame Extractor
133 argument structures, marked by a domain specialist, in 97 of the 180 sentences
 Extracted uniquely                          31
 Extracted with ambiguity                    32
 Extractable from pp's                       26
 Not extractable                             27
 Parsing failures (memory limitation, etc.)  17
 Figure annotation: 68%

KNP
[Figure: KNP dependency analysis of the Japanese sentence below, annotated with <P> and PARA markers.]
企業や <P> 金融機関に <P> 不良債権の 早急な 処理を <P> 促し、 特に 金融機関には 「この 過程で 従来のような 横並びの 決算や <P> 配当が <P> 維持されるのではなく、 <P> 経営格差を 顕在化させる <P> 覚悟を 求めたい」と している。 <P>
(Translation: "It urges companies and financial institutions to dispose of bad loans promptly and, for financial institutions in particular, says that 'in this process we want them to be prepared to let differences in management performance become visible, rather than to maintain the uniform settlements and dividends of the past.'")
Accuracy: about 90%

System overview
User: "How do I send mail with Mew?"
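[Figure: system components. User interface (WWW browser); mail sending module; dialogue management module; knowledge database; input analysis module.]

Evaluation of dialogue data
• Success: 38%
• Failure (knowledge): about 30%, decreasing
• Failure (dialogue management): about 5%, increasing
[Figure: chart with a vertical axis from 0 to 200, by period from Jul05-Jul11 to Jan17-Jan23. Categories: meaningless, out of scope, failure (difficult), failure (knowledge), failure (dialogue management), failure (input analysis), success.]

Research results (Tokyo Institute of Technology)
• Improving recall
 – Query expansion using multiple thesauri (1998-1999)
 – Cluster-based information retrieval (1997)
 – Large-scale text clustering (1996-1997)
• Improving precision
 – Information retrieval using case frames (1996)
 – Refinement and selective use of index terms (1999-2000)
• Achieving both recall and precision
 – Multi-stage retrieval model (1999-2000)

Scenario (Tokyo Institute of Technology)
[Figure: retrieval flow. Query → Query Expansion (using Thesauri) → Expanded Query → Initial Retrieval (over the Document Collection) → Intermediate Result → Index Term Refinement → Revised Query → Second Retrieval → Final Result.]

Workshops
 Year 1: closed workshop to launch the project
 Year 2: focus on applications such as IR (co-sponsored with Hitachi Advanced Research Laboratory)
 Year 3: the relationship between theory and applications (co-sponsored with Hitachi Central Research Laboratory)
 Year 4: Parsing Strategy, held in Germany (co-sponsored with DFKI and Stanford University); journal special issue and a book from CSLI

Tutorials
 NLP for Bio-Informatics:
 – jointly with the Eureka Group (PSB2001)
 – jointly with Eureka and TIDES (ISMB2001)

Joint research
 Stanford, DFKI, Pennsylvania, UMIST, University of Rome

Future research topics
1. Structural processing and probabilistic processing
 – Processing in a rich probability space that extends to the semantic space
2. Inter-conversion between grammar descriptions and the theoretical basis of their equivalence
 – Sharing of language resources, contribution to theoretical linguistics
3. Databases of large-scale feature structures
 – Relationship with XML databases
4. Mechanisms for controlled, unsupervised learning
 – Identification of semantic classes, learning grammars from data
5. Context processing across texts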