Information Extraction from Scientific Texts
Junichi Tsujii
Graduate School of Science, University of Tokyo, Japan

Texts are one of the major sources of information and knowledge. However, they are not transparent. They have to be systematically integrated with other sources such as databases, numerical data, etc.
Natural Language Processing--IE

Overview of GENIA System
[Diagram of modules and data flow. Modules and their tasks:]
Retrieval Module
• Request enhancement
• Spawn request
• Classify documents
Corpus Module
• Markup generation / compilation
• Annotated corpus construction
Interface Module
• GUI
• HTML conversion
• System integration
Concept Module
• BK design / construction / compilation
Information Extraction Module
• Identify & classify terms
• Identify events
Database Module
• DB design / access / management
• DB construction
Other labels in the diagram: User, IR Request, MEDLINE (Abstract, Full Paper), Raw (OCR) Corpus, Annotated Corpus, Markup language, Data model, Text Structure, Document, Named-Entity, Event, Security, Background Knowledge, Ontology, Database.

Plan
1. What is IE ?
2. General Framework of NLP
3. Basic IE techniques
4. IE in Biology
Automatic Term Recognition (S. Ananiadou)

What is IE ?

Application Tasks of NLP
(1) Information Retrieval/Detection: To search and retrieve documents in response to queries for information
(2) Passage Retrieval: To search and retrieve parts of documents in response to queries for information
(3) Information Extraction: To extract information that fits pre-defined database schemas or templates, specifying the output formats
(4) Question/Answering Tasks: To answer general questions by using texts as a knowledge base: fact retrieval, combination of IR and IE
(5) Text Understanding: To understand texts as people do: Artificial Intelligence

Ranges of Queries
(1) Information Retrieval/Detection
(2) Passage Retrieval
(3) Information Extraction
(4) Question/Answering Tasks
(5) Text Understanding
Pre-defined: fixed aspects of information carried in texts

Example of IE: FASTUS(1993)
Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month.

TIE-UP-1
  Relationship: TIE-UP
  Entities: “Bridgestone Sports Co.”
            “a local concern”
            “a Japanese trading house”
  Joint Venture Company: “Bridgestone Sports Taiwan Co.”
  Activity: ACTIVITY-1
  Amount: NT$20000000

ACTIVITY-1
  Activity: PRODUCTION
  Company: “Bridgestone Sports Taiwan Co.”
  Product: “iron and ‘metal wood’ clubs”
  Start Date: DURING: January 1990

FASTUS
Based on finite state automata (FSA)
1. Complex Words: Recognition of multi-words and proper names
   Examples: set up; new Taiwan dollars
2. Basic Phrases: Simple noun groups, verb groups and particles
   Examples: a Japanese trading house; had set up
3. Complex Phrases: Complex noun groups and verb groups
   Example: production of 20,000 iron and metal wood clubs
4. Domain Events: Patterns for events of interest to the application. Basic templates are to be built.
   Example: [company] [set up] [Joint-Venture] with [company]
5. Merging Structures: Templates from different parts of the texts are merged if they provide information about the same entity or event.

Information Extraction
………. Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the second floor of his Nanjing home early on Sunday.
The deputy general manager of Yaxing Benz, a Sino-German joint venture that makes buses and bus chassis in nearby Yangzhou, was hacked to death with 45 cm watermelon knives. ……….

Name of the Venture: Yaxing Benz
Products: buses and bus chassis
Location: Yangzhou, China
Companies involved:
(1) Name: X?  Country: Germany
(2) Name: Y?  Country: China

Information Extraction: a different template for crimes
A German vehicle-firm executive was stabbed to death …. ………. Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the second floor of his Nanjing home early on Sunday. The deputy general manager of Yaxing Benz, a Sino-German joint venture that makes buses and bus chassis in nearby Yangzhou, was hacked to death with 45 cm watermelon knives. ……….

Crime-Type: Murder
Type: Stabbing
The killed:
  Name: Jurgen Pfrang
  Age: 51
  Profession: Deputy general manager
Location: Nanjing, China

Interpretation of Texts (who interprets)
(1) Information Retrieval/Detection: User
(2) Passage Retrieval: User
(3) Information Extraction: System
(4) Question/Answering Tasks: System
(5) Text Understanding: System

[Diagrams, built up over several slides: an IR system and a passage IR system match queries against a characterization of the collection of texts, and interpretation is left to the user's knowledge; an IE system uses NLP (structures of sentences) and knowledge to map texts to predefined templates.]
IE as compromise NLP
General Framework of NLP/NLU

Performance Evaluation
(1) Information Retrieval/Detection: Rather clear
(2) Passage Retrieval: A bit vague
(3) Information Extraction: Rather clear
(4) Question/Answering Tasks: A bit vague
(5) Text Understanding: Very vague

For a query over a collection of documents:
N: Correct documents
M: Retrieved documents
C: Correct documents that are actually retrieved
Precision: $P = \frac{C}{M}$
Recall: $R = \frac{C}{N}$
F-Value: $F = \frac{2 P R}{P + R}$

For IE the same measures are computed over templates:
N: Correct templates
M: Retrieved templates
C: Correct templates that are actually retrieved
More complicated due to partially filled templates

General Framework of NLP
Example: John runs.
Morphological and Lexical Processing: John run+s. (John: P-N; run+s: V 3-pre, or N plu)
Syntactic Analysis: S → NP (P-N John) VP (V run)
Semantic Analysis: Pred: RUN, Agent: John
Context processing, Interpretation: John is a student. He runs.

General Framework of NLP
Morphological and Lexical Processing: Tokenization, Part of Speech Tagging, Inflection/Derivation, Compounding, Term recognition (Ananiadou)
Syntactic Analysis
Semantic Analysis
Context processing, Interpretation: Domain Analysis (Appelt: 1999)

Difficulties of NLP
(1) Robustness: Incomplete Knowledge
  – Morphological and Lexical Processing: incomplete lexicons; open class words; terms (term recognition); named entities such as company names, locations, numerical expressions
  – Syntactic Analysis: incomplete grammar; syntactic coverage; domain specific constructions; ungrammatical constructions
  – Semantic Analysis, Context processing, Interpretation: incomplete domain knowledge; interpretation rules; predefined aspects of information
(2) Ambiguities: Combinatorial Explosion
  – Morphological and Lexical Processing: most words in English are ambiguous in terms of their parts of speech.
    runs: v/3pre, n/plu
    clubs: v/3pre, n/plu, and two meanings
  – Syntactic Analysis: structural ambiguities
  – Semantic Analysis: predicate-argument ambiguities

Structural Ambiguities
(1) Attachment Ambiguities
  John bought a car with large seats.
  John bought a car with $3000.
  The manager of Yaxing Benz, a Sino-German joint venture
  The manager of Yaxing Benz, Mr. John Smith
(2) Scope Ambiguities
  young women and men in the room
(3) Analytical Ambiguities
  Visiting relatives can be boring.
Semantic Ambiguities (1)
  John bought a car with Mary.
  $3000 can buy a nice car.
Semantic Ambiguities (2)
  Every man loves a woman.
Co-reference Ambiguities

Note: Ambiguities vs Robustness
More comprehensive knowledge (big dictionaries, comprehensive grammar): more robust
More comprehensive knowledge: more ambiguities
Adaptability: Tuning, Learning

Framework of IE
IE as compromise NLP

Techniques in IE
(1) Domain Specific Partial Knowledge: Knowledge relevant to information to be extracted
(2) Ambiguities: Ignoring irrelevant ambiguities; simpler NLP techniques
(3) Robustness: Coping with incomplete dictionaries (open class words); ignoring irrelevant parts of sentences
(4) Adaptation Techniques: Machine Learning, trainable systems

Morphological and Lexical Processing in IE
Part of Speech Tagger: FSA rules, statistical taggers (95 %)
Open class words: Named entity recognition (e.g. locations, persons, companies, organizations, position names)
  – Local context, statistical bias
  – Domain specific rules: <Word><Word>, Inc.; Mr. <Cpt-L>. <Word>
  – Machine Learning: HMM, Decision Trees
  – Rules + Machine Learning
  – F-Value 90, domain dependent

FASTUS and the General Framework of NLP
Based on finite state automata (FSA)
1. Complex Words: corresponds to Morphological and Lexical Processing
2. Basic Phrases and 3. Complex Phrases: correspond to Syntactic Analysis
4. Domain Events: corresponds to Semantic Analysis
5. Merging Structures: corresponds to Context processing, Interpretation

Chomsky Hierarchy
Hierarchy of Grammar          Hierarchy of Automata
Regular Grammar               Finite State Automata
Context Free Grammar          Push Down Automata
Context Sensitive Grammar     Linear Bounded Automata
Type 0 Grammar                Turing Machine
Going down the table: computationally more complex, less efficient.
$A^n B^n$ cannot be recognized by a finite state automaton; it requires a context free grammar.

[Diagram: finite state automaton with states 0 to 4 and arcs labelled PN, ’s, Art, ADJ, N and P, stepped through the noun group "John’s interesting book with a nice cover".]
Pattern-matching: {PN ’s / Art} (ADJ)* N (P Art (ADJ)* N)*
Matched instance: PN ’s (ADJ)* N P Art (ADJ)* N

Example of IE: FASTUS(1993)
Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month.

1. Complex Words
2. Basic Phrases:
  Bridgestone Sports Co.: Company name
  said: Verb Group
  Friday: Noun Group
  it: Noun Group
  had set up: Verb Group
  a joint venture: Noun Group
  in: Preposition
  Taiwan: Location
Attachment ambiguities are not made explicit.
Structural ambiguities of NP are ignored, as in "a Japanese trading house". Compare "a Japanese tea house": a [Japanese tea] house vs. a Japanese [tea house].

3. Complex Phrases
[COMPANY] said Friday it [SET-UP] [JOINT-VENTURE] in [LOCATION] with [COMPANY] and [COMPANY] to produce [PRODUCT] to be supplied to [LOCATION]. [JOINT-VENTURE], [COMPANY], capitalized at 20 million [CURRENCY-UNIT] [START] production in [TIME] with production of 20,000 [PRODUCT] a month.

After complex phrases are recognized:
[COMPANY] said Friday it [SET-UP] [JOINT-VENTURE] in [LOCATION] with [COMPANY] to produce [PRODUCT] to be supplied to [LOCATION]. [JOINT-VENTURE] capitalized at [CURRENCY] [START] production in [TIME] with production of [PRODUCT] a month.
Syntactic structures relevant to information to be extracted are dealt with.

Syntactic variations
GM set up a joint venture with Toyota.
GM announced it was setting up a joint venture with Toyota.
GM signed an agreement setting up a joint venture with Toyota.
GM announced it was signing an agreement to set up a joint venture with Toyota.
GM plans to set up a joint venture with Toyota.
GM expects to set up a joint venture with Toyota.
[Tree diagram: S → NP (GM) VP. The verb group "signed (an) agreement setting up" is collapsed into a single [SET-UP] verb group, the same as for plain "set up".]
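The syntactic variations above can all be reduced to one [SET-UP] verb group before domain-event patterns are applied. A minimal sketch in Python, assuming a hand-written regular expression that covers only the six GM/Toyota sentences on the slide; the names VERB_GROUP, EVENT and extract_tie_up are hypothetical, and this is not the FASTUS implementation:

```python
import re

# Hypothetical verb-group pattern: optional reporting, agreement and
# intention prefixes followed by a form of "set up".
VERB_GROUP = (
    r"(?:announced it was )?"
    r"(?:sign(?:ed|ing) an agreement (?:to )?)?"
    r"(?:(?:plans|expects) to )?"
    r"(?:set|setting) up"
)

# Domain-event pattern: [COMPANY] [SET-UP] a joint venture with [COMPANY].
# Company names are approximated by a single capitalized token (assumption).
EVENT = re.compile(
    r"(?P<a>[A-Z]\w*) (?P<vg>" + VERB_GROUP + r") a joint venture with (?P<b>[A-Z]\w*)"
)

SENTENCES = [
    "GM set up a joint venture with Toyota.",
    "GM announced it was setting up a joint venture with Toyota.",
    "GM signed an agreement setting up a joint venture with Toyota.",
    "GM announced it was signing an agreement to set up a joint venture with Toyota.",
    "GM plans to set up a joint venture with Toyota.",
    "GM expects to set up a joint venture with Toyota.",
]


def extract_tie_up(sentence):
    """Return a TIE-UP template for the sentence, or None if no pattern matches."""
    match = EVENT.search(sentence)
    if match is None:
        return None
    return {
        "Relationship": "TIE-UP",
        "Entities": [match.group("a"), match.group("b")],
        "verb_group": match.group("vg"),
    }


if __name__ == "__main__":
    for sentence in SENTENCES:
        print(extract_tie_up(sentence))
```

A fuller system would apply such patterns to recognized noun and verb groups rather than to raw strings, so that company names of any length are handled by the earlier stages.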
Example of IE: FASTUS(1993)
[COMPANY] [SET-UP] [JOINT-VENTURE] in [LOCATION] with [COMPANY] to produce [PRODUCT] to be supplied to [LOCATION]. [JOINT-VENTURE] capitalized at [CURRENCY] [START] production in [TIME] with production of [PRODUCT] a month.

3. Complex Phrases
4. Domain Events
  [COMPANY] [SET-UP] [JOINT-VENTURE] with [COMPANY]
  [COMPANY] [SET-UP] [JOINT-VENTURE] (others)* with [COMPANY]
The attachment positions of PP are determined at this stage.
Irrelevant parts of sentences are ignored.

Complications caused by syntactic variations
Relative clause: The mayor, who was kidnapped yesterday, was found dead today.
  [NG] Relpro {NG/others}* [VG] {NG/others}* [VG]
  [NG] Relpro {NG/others}* [VG]
[Diagram: Basic patterns → Surface Pattern Generator (relative clause construction, passivization, etc.) → Patterns used by Domain Event]

FASTUS
Based on finite state automata (FSA)
Example: NP, who was kidnapped, was found.
1. Complex Words
2. Basic Phrases
3. Complex Phrases
4. Domain Events: Patterns for events of interest to the application. Basic templates are to be built.
   Piece-wise recognition of basic templates
5. Merging Structures: Templates from different parts of the texts are merged if they provide information about the same entity or event.
   Reconstructing information carried via syntactic structures by merging basic templates

Current state of the art of IE
1. Carefully constructed IE systems: F-60 level (interannotator agreement: 60-80%)
   Domains:
   – Telegraphic messages about naval operations (MUC-1: 87, MUC-2: 89)
   – News articles and transcriptions of radio broadcasts about Latin American terrorism (MUC-3: 91, MUC-4: 92)
   – News articles about joint ventures (MUC-5, 93)
   – News articles about management changes (MUC-6, 95)
   – News articles about space vehicles (MUC-7, 97)
2. Handcrafted rules (named entity recognition, domain events, etc.)
   Automatic learning from texts:
   – Supervised learning: corpus preparation
   – Non-supervised, or controlled learning

IE in Biology

CSNDB (National Institute of Health Sciences)
• A data- and knowledge-base for signaling pathways of human cells.
  – It compiles the information on biological molecules, sequences, structures, functions, and biological reactions which transfer the cellular signals.
  – Signaling pathways are compiled as binary relationships of biomolecules and represented by graphs drawn automatically.
  – CSNDB is constructed on ACEDB and the inference engine CLIPS, and has a linkage to TRANSFAC.
  – The final goal is to make a computerized model for various biological phenomena.

Example. 1
• A Standard Reaction
Signal_Reaction: “EGF receptor Grb2”
  From_molecule “EGF receptor”
  To_molecule “Grb2”
  Tissue “liver”
  Effect “activation”
  Interaction “SH2+phosphorylated Tyr”
  Reference [Yamauchi_1997]
Excerpted @[Takai98]

Example. 3
• A Polymerization Reaction
Signal_Reaction: “Ah receptor + HSP90”
  Component “Ah receptor” “HSP90”
  Effect “activation dissociation”
  Interaction “PAS domain of Ah receptor”
  Activity “inactivation of Ah receptor”
  Reference [Powell-Coffman_1998]
Excerpted @[Takai98]

FASTUS: Is separation of stages possible ?
Based on finite state automata (FSA)
1. Complex Words: Recognition of multi-words and proper names
2. Basic Phrases: Simple noun groups, verb groups and particles
3. Complex Phrases: Complex noun groups and verb groups
4. Domain Events: Patterns for events of interest to the application. Basic templates are to be built.
5. Merging Structures: Templates from different parts of the texts are merged if they provide information about the same entity or event.
Problems in biology texts:
• Open word classes: technical terms (term recognition)
  – very long
  – specific formation rules
  – many semantic classes
  – acronyms, variants
  – fairly ambiguous
• Coordination across word formation: A or B and C D

Syntax/Semantics
An active phorbol ester must therefore, presumably by activation of protein kinase C, cause dissociation of a cytoplasmic complex of NF-kappa B and I kappa B by modifying I kappa B.
E1: An active phorbol ester activates protein kinase C.
E2: The active phorbol ester modifies I kappa B.
E3: It dissociates a cytoplasmic complex of NF-kappa B and I kappa B.
Part-Whole relation: a cytoplasmic complex of NF-kappa B and I kappa B

Full parser based on good grammar formalisms
1. Several attempts at using full parsers: to improve the precision
2. Systematic treatment of interaction of the different phases: unification-based grammar formalisms
The two papers in the NLP session of PSB 2001

Experiment (A. Yakushiji et al., PSB 2001)
XHPSG: HPSG-like grammar translated from XTAG of U-Penn (Y. Tateishi, TAG+ workshop 98)
Automatic conversion: detailed, empirical comparison of grammars of different formalisms (+LFG)
Terms (compound nouns) are chunked beforehand.
180 sentences from abstracts in MEDLINE
The average parse time per sentence: 2.7 sec by a naïve parser (this can be improved by the multi-stage parser by 50 times)

Argument Frame Extractor
133 argument structures, marked by a domain specialist in 97 sentences among the 180 sentences
  Extracted uniquely                     31
  Extracted with ambiguity               32
  Extractable from pp’s                  26
  Not extractable (parsing failures)     27
  Memory limitation, etc.                17
  68%

Ontology: Knowledge of the Domain
Open class words: Named entity recognition (e.g. locations, persons, companies, organizations, position names)
More refined semantic classes with part-whole relationships, properties, etc.
Acronyms, variants, etc.
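The refined semantic classes, part-whole relationships and variants listed above can be held in a small lookup structure. A minimal sketch, assuming that spelling variants differ only in case, spaces and hyphens; the class TermIndex, the function normalize, the variant spellings and the class assignments below are illustrative assumptions, not the contents of any existing term database:

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


def normalize(surface: str) -> str:
    """Lower-case and drop spaces and hyphens so that spelling variants collide."""
    return re.sub(r"[\s\-]+", "", surface.lower())


@dataclass
class Term:
    name: str                          # canonical name
    semantic_class: str                # class label (assigned here for illustration)
    variants: Tuple[str, ...] = ()     # acronyms and spelling variants
    part_of: Optional[str] = None      # canonical name of the whole, if any


class TermIndex:
    """Hypothetical index from surface forms to canonical terms."""

    def __init__(self) -> None:
        self.terms: Dict[str, Term] = {}
        self.by_surface: Dict[str, str] = {}

    def add(self, term: Term) -> None:
        self.terms[term.name] = term
        for surface in (term.name,) + term.variants:
            self.by_surface[normalize(surface)] = term.name

    def lookup(self, surface: str) -> Optional[Term]:
        name = self.by_surface.get(normalize(surface))
        return self.terms[name] if name is not None else None

    def wholes(self, name: str) -> List[str]:
        """Follow part-whole links upward from the named term."""
        chain = []
        current = self.terms[name].part_of
        while current is not None:
            chain.append(current)
            current = self.terms[current].part_of
        return chain


COMPLEX = "complex of NF-kappa B and I kappa B"
index = TermIndex()
index.add(Term(COMPLEX, "protein complex"))
index.add(Term("NF-kappa B", "subunit of protein complex", ("NF-kappaB",), COMPLEX))
index.add(Term("I kappa B", "subunit of protein complex", ("IkappaB",), COMPLEX))
index.add(Term("protein kinase C", "protein family or group", ("PKC",)))

if __name__ == "__main__":
    print(index.lookup("nf kappa b"))
    print(index.wholes("I kappa B"))
```

Normalizing surface forms is only one way to handle variants; acronyms such as PKC still have to be listed explicitly, which is why the sketch keeps a variants field.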
Bio Term Bank (TBB)
A database for all sorts of biological terms collected from genome databases and biological texts.
It will contain 2 million terms in 2001 and 5 million terms by 2005.
Terms are classified by biochemical and terminological attributes, grounded on their resources.
Biological ontology committee Japan, organized by T. Takagi and T. Takai, U. Tokyo, in Genome Projects of MESSC (2000.4~2005.3)

GENIA ontology (current version)
+-name-+-source-+-natural-+-organism-+-multi-cell organism
       |        |         |          +-mono-cell organism
       |        |         |          +-virus
       |        |         +-tissue
       |        |         +-cell type
       |        |         +-sub-location of cells
       |        +-artificial-+-cell line
       |
       +-substance-+-compound-+-organic-+-amino-+-protein-+-protein family or group
                                        |       |         +-protein complex
                                        |       |         +-individual protein molecule
                                        |       |         +-subunit of protein complex
                                        |       |         +-substructure of protein
                                        |       |         +-domain or region of protein
                                        |       +-peptide
                                        |       +-amino acid monomer
                                        +-nucleic-+-DNA-+-DNA family or group
                                                  |     +-individual DNA molecule
                                                  |     +-domain or region of DNA
                                                  +-RNA-+-RNA family or group
                                                        +-individual RNA molecule
                                                        +-domain or region of RNA

Expansion of GENIA Ontology
• Try to tag all NPs in some MEDLINE abstracts and find the classes that appear in abstracts but not in the current ontology
• Find frequent verbs and what class of arguments they take

Expansion of GENIA Ontology
• Chemical class of substances and their substructures
• Sources
• Biological role, or function, of substances
• Reaction
– Biological reaction
– Pathway
– Disease
• Structure themselves
• Experiment, experimental results, and researchers
• Measure

Example of Entities in Expanded
• Biological role, or function, of substances
– receptor, inhibitor, …
• Biological reaction
– activation, binding, inhibition, apoptosis, G2 arrest
– pathway,
signal
– immune dysfunction, Ataxia telangiectasia (AT)
• Structure themselves
– alpha-helix,
• Experiment, experimental results, researchers
– our results, these studies, we

Verbs Related to Biological Events
Frequent Verbs in 100 MEDLINE Abstracts

Verb         Count   Verb           Count   Verb        Count   Verb          Count
be           255     involve        16      determine   9       explain       6
induce       56      identify       16      construct   9       exert         6
bind         50      act            15      associate   9       enhance       6
show         49      stimulate      14      reduce      8       display       6
suggest      42      provide        14      prevent     8       characterize  6
activate     42      express        13      locate      8       participate   5
factor       36      affect         13      line        8       localize      5
demonstrate  35      type           12      differ      8       investigate   5
inhibit      26      report         12      trigger     7       imply         5
have         25      form           12      synergize   7       establish     5
reveal       21      contribute     12      examine     7       conclude      5
require      21      study          11      block       7       compare       5
regulate     21      observe        11      become      7       use           4
indicate     21      lead           11      analyze     7       transform     4
find         21      function       11      target      6       transfect     4
result       20      assay          11      signal      6       test          4
play         19      appear         11      remain      6       suppress      4
interact     18      occur          10      produce     6       support       4
mediate      17      increase       10      present     6       substitute    4
contain      17      phosphorylate  9       possess     6       share         4

Verbs Related to Biological Events
Verbs that take biological entities as arguments
• induce
– noun BE INDUCED BY noun: activation of these PROTEIN was induced by PROTEIN
– noun INDUCE noun: PROTEIN induced the tyrosine phosphorylation
• bind
– noun BIND TO noun: the drugs bind to two different PROTEIN
– noun BIND noun: motifs previously found to bind the cellular factors
– noun BINDING noun: the TATA-box binding protein
– the BINDING of noun: the binding of PROTEIN
semantic class: substance structure source experiment fact reaction

Verbs Related to Biological Events
Verbs that take description entities
• report
– noun REPORT that-clause: we report here that PROTEIN is activated by PROTEIN
– noun REPORT noun: we report the characterization of PROTEIN
– noun REPORT noun: we report a novel structure of PROTEIN
semantic class: substance structure source experiment fact
reaction

Verbs Related to Biological Events
Verbs whose arguments depend on syntactic patterns
• show
– noun BE SHOWN to-infinitive: PROTEIN has been shown to trigger cellular PROTEIN activity
– noun SHOW that-clause: the data show that PROTEIN stimulation is also not sufficient
– noun SHOW noun: SOURCE showed a dose-dependent inhibition of PROTEIN activity
semantic class: substance source experiment fact

Verbs Related to Biological Events
Verbs that take both entities
• indicate
– noun INDICATE that-clause: the data indicate that PROTEIN is required in CELL proliferation
– noun INDICATE noun: these findings indicate an unexpected role of DNA
– noun INDICATE that-clause: the structure indicates that it represents a unique class of PROTEIN
– noun INDICATE noun: the structure indicates mechanisms for allosteric effector action
semantic class: substance structure source experiment fact reaction role

Example of NE Annotation
UI - 85146267
TI - Characterization of <NE ti="3" class="protein" nm="aldosterone binding site" mt="SV" subclass="family_or_group" unsure="Class" cmt="">aldosterone binding sites</NE ti="3"> in circulating <NE ti="2" class="cell_type" nm="human mononuclear leukocyte" mt="SV" unsure="OK" cmt="">human mononuclear leukocytes</NE ti="2">.
AB - <NE ti="4" class="protein" nm="Aldosterone binding sites" mt="SV" subclass="family_or_group" unsure="Class" cmt="">Aldosterone binding sites</NE ti="4"> in <NE ti="1" class="cell_type" nm="human mononuclear leukocyte" mt="SV" unsure="OK" cmt="">human mononuclear leukocytes</NE ti="1"> were characterized after separation of cells from blood by a Percoll gradient.
After washing and resuspension in <NE ti="5" class="other_organic_compounds" nm="RPMI-1640 medium" mt="SV" unsure="OK" cmt="">RPMI-1640 medium</NE ti="5">, cells were incubated at 37 degrees C for 1 h with different concentrations of <NE ti="6" class="other_organic_compounds" nm="[3H]aldosterone" mt="SV" unsure="OK" cmt="">[3H]aldosterone</NE ti="6"> plus a 100-fold concentration of <NE ti="7" class="other_organic_compounds" nm="RU-26988" mt="SV" unsure="OK" cmt="">RU26988 </NE ti="7">(<NE ti="17" class="other_organic_compounds" nm="11 alpha, 17 alpha-dihydroxy-17 beta-propynylandrost-1,4,6trien-3-one" mt="SV" unsure="OK" cmt="">11 alpha, 17 alpha-dihydroxy-17 beta-propynylandrost-1,4,6-trien-3-one</NE ti="17">), with or without an excess of unlabeled <NE ti="8" class="other_organic_compounds" nm="aldosterone" mt="SV" unsure="OK" cmt="">aldosterone</NE ti="8">. <NE ti="9" class="other_organic_compounds" nm="Aldosterone" mt="SV" unsure="OK" cmt="">Aldosterone</NE ti="9"> binds to a single class of <NE ti="10" class="protein" nm="receptor" mt="SV" subclass="family_or_group" unsure="OK" cmt="">receptors</NE ti="10"> with an affinity of 2.7 +/- 0.5 nM (means +/- SD, n = 14) and a capacity of 290 +/- 108 sites/cell (n = 14). The specificity data show a hierarchy of affinity of <NE ti="11" class="other_organic_compounds" nm="desoxycorticosterone" mt="SV" unsure="OK" cmt="">desoxycorticosterone</NE ti="11"> = <NE ti="12" class="other_organic_compounds" nm="corticosterone" mt="SV" unsure="OK" cmt="">corticosterone</NE ti="12"> = <NE ti="13" class="other_organic_compounds" nm="aldosterone" mt="SV" unsure="OK" cmt="">aldosterone</NE ti="13"> greater than <NE ti="14" class="other_organic_compounds" nm="hydrocortisone" mt="SV" unsure="OK" cmt="">hydrocortisone</NE ti="14"> greater than <NE ti="15" class="other_organic_compounds" nm="dexamethasone" mt="SV" unsure="OK" cmt="">dexamethasone</NE ti="15">.
The results indicate that <NE ti="17" class="cell_type" nm="mononuclear leukocyte" mt="SV" unsure="OK" cmt="">mononuclear leukocytes</NE ti="17"> could be useful for studying the physiological significance of these <NE ti="16" class="protein" nm="mineralocorticoid receptor" mt="SV" subclass="family_or_group" unsure="OK" cmt="">mineralocorticoid receptors</NE ti="16"> and their regulation in humans.

Available from our website:
• Definition of ontological classes
• Manual of GMPL: extension of XML to annotate texts
• Manual of Text Annotation
• Soon: Annotated texts (1000 abstracts) by the end of March

1. IE can contribute to Bio-informatics significantly.
2. However, the domains in Bio-chemistry seem more structurally rich than the domains we have dealt with so far: term formation, rich ontologies, complex syntactic structures.
3. It requires substantial efforts in resource building.
4. However, those resources can contribute to other applications: knowledge sharing, intelligent IR, knowledge discovery.
One of the crucial techniques is ATR ….
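The NE annotation example shown earlier can be read back into (class, name, text) triples with a small amount of code. The sketch below is an illustration only and assumes that every span follows the shape seen in the sample: `ti`, `class` and `nm` come first in the opening tag, and the closing tag repeats the same `ti` value. It is not a GMPL parser, and the names `NE_PATTERN`, `extract_entities` and `group_by_class` are hypothetical.

```python
import re
from collections import defaultdict

# One annotated span. The closing tag repeats the ti attribute, as in the sample,
# so a backreference pairs each opening tag with its own close.
NE_PATTERN = re.compile(
    r'<NE ti="(?P<ti>\d+)" class="(?P<cls>[^"]+)" nm="(?P<nm>[^"]*)"[^>]*>'
    r'(?P<text>.*?)</NE ti="(?P=ti)">',
    re.DOTALL,
)

def extract_entities(annotated):
    """Return (class, normalized name, surface text) for each NE span found."""
    return [
        (m["cls"], m["nm"], m["text"].strip())
        for m in NE_PATTERN.finditer(annotated)
    ]

def group_by_class(entities):
    """Collect the normalized names under their ontology class."""
    grouped = defaultdict(list)
    for cls, nm, _text in entities:
        grouped[cls].append(nm)
    return dict(grouped)

# The title line of the sample abstract, copied from the annotation example.
sample = (
    'TI - Characterization of <NE ti="3" class="protein" '
    'nm="aldosterone binding site" mt="SV" subclass="family_or_group" '
    'unsure="Class" cmt="">aldosterone binding sites</NE ti="3"> in circulating '
    '<NE ti="2" class="cell_type" nm="human mononuclear leukocyte" mt="SV" '
    'unsure="OK" cmt="">human mononuclear leukocytes</NE ti="2">.'
)

entities = extract_entities(sample)
by_class = group_by_class(entities)
print(by_class)
```

A regular expression is used rather than an XML parser because the closing tags in the sample carry an attribute, which standard XML does not allow.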