Natural Language Processing and Information

Information Extraction
from
Scientific Texts
Junichi Tsujii
Graduate School of Science
University of Tokyo
Japan
Texts are one of the major sources of information and knowledge.
However, they are not transparent.
They have to be systematically integrated with
other sources such as databases, numerical data,
etc.
Natural Language Processing--IE
Overview of GENIA System
[Figure: system architecture diagram. Modules and tasks, as far as they can be recovered from the diagram labels:]
Retrieval Module
•IR Request
•Request enhancement
•Spawn request
•Classify documents
Corpus Module
•Markup generation / compilation
•Annotated corpus construction
Concept Module
•BK design / construction / compilation
Information Extraction Module
•Identify & classify terms
•Identify events
Interface Module
•GUI
•HTML conversion
•System integration
Database Module
•DB design / access / management
•DB construction
Other labels in the diagram: User, MEDLINE, Abstract, Full Paper, Raw (OCR), Corpus, Text, Structure, Annotated, Event, Document, Named-Entity, Markup language, Background Knowledge, Ontology, Data model, Security, Database
Plan
1. What is IE ?
2. General Framework of NLP
3. Basic IE techniques
4. IE in Biology
Automatic Term Recognition
(S. Ananiadou)
What is IE ?
Application Tasks of NLP
(1)Information Retrieval/Detection
To search and retrieve documents in response to queries
for information
(2)Passage Retrieval
To search and retrieve part of documents in response
to queries for information
(3)Information Extraction
To extract information that fits pre-defined database schemas
or templates, specifying the output formats
(4) Question/Answering Tasks
To answer general questions by using texts as knowledge
base: Fact retrieval, combination of IR and IE
(5)Text Understanding
To understand texts as people do: Artificial Intelligence
Ranges of Queries
(1)Information Retrieval/Detection
(2)Passage Retrieval
(3)Information Extraction
(4) Question/Answering Tasks
(5)Text Understanding
Pre-Defined: Fixed aspects
of information carried in texts
Example of IE: FASTUS(1993)
Bridgestone Sports Co. said Friday it had set up a joint venture
in Taiwan with a local concern and a Japanese trading house to
produce golf clubs to be supplied to Japan.
The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20
million new Taiwan dollars, will start production in January 1990
with production of 20,000 iron and “metal wood” clubs a month.
TIE-UP-1
Relationship: TIE-UP
Entities: “Bridgestone Sports Co.”
          “a local concern”
          “a Japanese trading house”
Joint Venture Company: “Bridgestone Sports Taiwan Co.”
Activity: ACTIVITY-1
Amount: NT$20000000
ACTIVITY-1
Activity: PRODUCTION
Company: “Bridgestone Sports Taiwan Co.”
Product: “iron and ‘metal wood’ clubs”
Start Date: DURING: January 1990
FASTUS
Based on finite-state automata (FSA)
1. Complex Words:
Recognition of multi-words and proper names
(e.g. “set up”, “new Taiwan dollars”)
2. Basic Phrases:
Simple noun groups, verb groups and particles
(e.g. “a Japanese trading house”, “had set up”)
3. Complex Phrases:
Complex noun groups and verb groups
(e.g. “production of 20,000 iron and metal wood clubs”)
4. Domain Events:
Patterns for events of interest to the application
Basic templates are to be built.
(e.g. [company] [set up] [Joint-Venture] with [company])
5. Merging Structures:
Templates from different parts of the texts are
merged if they provide information about the
same entity or event.
Information Extraction
……….
Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the
second floor of his Nanjing home early on Sunday.
The deputy general manager of Yaxing Benz, a Sino-German
joint venture that makes buses and bus chassis in nearby Yangzhou,
was hacked to death with 45 cm watermelon knives.
……….
Name of the Venture: Yaxing Benz
Products: buses and bus chassis
Location: Yangzhou, China
Companies involved:
(1) Name: X?
    Country: Germany
(2) Name: Y?
    Country: China
Information Extraction
A German vehicle-firm executive was stabbed to death ….
……….
Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the
second floor of his Nanjing home early on Sunday.
The deputy general manager of Yaxing Benz, a Sino-German
joint venture that makes buses and bus chassis in nearby Yangzhou,
was hacked to death with 45 cm watermelon knives.
……….
Different template for crimes
Crime-Type: Murder
Type: Stabbing
The killed:
    Name: Jurgen Pfrang
    Age: 51
    Profession: Deputy general manager
Location: Nanjing, China
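The two templates filled from the same news text can be pictured as two predefined record types. The following is only a minimal sketch: the class names, field names and the `unfilled_slots` helper are hypothetical, chosen to mirror the slots on the slides, and are not taken from any actual IE system.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Company:
    name: Optional[str] = None      # None marks a slot the text does not fill
    country: Optional[str] = None


@dataclass
class JointVentureTemplate:
    venture_name: Optional[str] = None
    products: List[str] = field(default_factory=list)
    location: Optional[str] = None
    companies: List[Company] = field(default_factory=list)


@dataclass
class CrimeTemplate:
    crime_type: Optional[str] = None
    method: Optional[str] = None
    victim_name: Optional[str] = None
    victim_age: Optional[int] = None
    victim_profession: Optional[str] = None
    location: Optional[str] = None


def unfilled_slots(template):
    """List the slots of a template instance that are still empty (None)."""
    return [slot for slot, value in vars(template).items() if value is None]


# The same passage fills two different predefined templates.
venture = JointVentureTemplate(
    venture_name="Yaxing Benz",
    products=["buses", "bus chassis"],
    location="Yangzhou, China",
    companies=[Company(country="Germany"), Company(country="China")],
)
crime = CrimeTemplate(
    crime_type="Murder",
    method="Stabbing",
    victim_name="Jurgen Pfrang",
    victim_age=51,
    victim_profession="Deputy general manager",
    location="Nanjing, China",
)

# The partner company names (X?, Y?) are not stated in the text.
print(unfilled_slots(venture.companies[0]))
```

Which template is filled depends on the schema chosen in advance, not on the text alone.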
Interpretation of Texts
(1)Information Retrieval/Detection
User
(2)Passage Retrieval
User
(3)Information Extraction
System
(4) Question/Answering Tasks
System
(5)Text Understanding
System
Characterization of Texts
IR System
Queries
Collection of Texts
Knowledge
Interpretation
Characterization of Texts
Passage
IR System
Collection of Texts
Queries
Knowledge
Characterization of Texts
Interpretation
Passage
IR System
IE System
Queries
Structures
of
Sentences
NLP
Collection of Texts
Texts
Templates
Knowledge
Interpretation
IE System
Texts
Templates
IE as
compromise NLP
Knowledge
Interpretation
IE System
General Framework
of
NLP/NLU
Texts
Templates
Predefined
Performance Evaluation
(1)Information Retrieval/Detection
Rather clear
(2)Passage Retrieval
A bit vague
(3)Information Extraction
Rather clear
(4) Question/Answering Tasks
A bit vague
(5)Text Understanding
Very vague
Query
N: Correct Documents
M: Retrieved Documents
C: Correct Documents that are actually retrieved
[Figure: within the Collection of Documents, the sets N and M overlap in C.]
Precision: $P = \frac{C}{M}$
Recall: $R = \frac{C}{N}$
F-Value: $F = \frac{2 P \cdot R}{P + R}$
Query
N: Correct Templates
M: Retrieved Templates
C: Correct Templates that are actually retrieved
[Figure: within the Collection of Documents, the sets N and M overlap in C.]
Precision: $P = \frac{C}{M}$
Recall: $R = \frac{C}{N}$
F-Value: $F = \frac{2 P \cdot R}{P + R}$
More complicated due to partially filled templates
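As a small worked illustration of the three measures, the sketch below computes precision, recall and F-value for sets of item identifiers. The function name and the example identifiers are made up for illustration. It treats every item as either fully correct or not, so it does not address the partially filled templates mentioned above.

```python
def precision_recall_f(correct, retrieved):
    """Return (P, R, F) for a set of correct items and a set of retrieved items."""
    correct, retrieved = set(correct), set(retrieved)
    c = len(correct & retrieved)                 # C: correct items actually retrieved
    p = c / len(retrieved) if retrieved else 0.0  # P = C / M
    r = c / len(correct) if correct else 0.0      # R = C / N
    f = 2 * p * r / (p + r) if p + r else 0.0     # F = 2PR / (P + R)
    return p, r, f


# Hypothetical example: N = 4 correct documents, M = 3 retrieved, C = 2 in common.
p, r, f = precision_recall_f({"d1", "d2", "d3", "d4"}, {"d2", "d3", "d5"})
print(p, r, f)
```

Here P = 2/3, R = 1/2 and F = 4/7.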
General Framework of NLP
General Framework of NLP
Input: John runs.
Morphological and Lexical Processing
    John run+s.
    John: P-N
    run: V or N
    +s: 3-pre (with V) or plu (with N)
Syntactic Analysis
    [S [NP [P-N John]] [VP [V run]]]
Semantic Analysis
    Pred: RUN
    Agent: John
Context processing / Interpretation
    John is a student. He runs.
General Framework of NLP
Morphological and Lexical Processing
    Tokenization
    Part of Speech Tagging
    Inflection/Derivation
    Compounding
    Term recognition (Ananiadou)
Syntactic Analysis
Semantic Analysis
Context processing / Interpretation
    Domain Analysis
(Appelt: 1999)
Difficulties of NLP
(1) Robustness: Incomplete Knowledge
Morphological and Lexical Processing: Incomplete Lexicons
    Open class words
    Terms (term recognition)
    Named Entities: company names, locations, numerical expressions
Syntactic Analysis: Incomplete Grammar
    Syntactic coverage
    Domain-specific constructions
    Ungrammatical constructions
Semantic Analysis, Context processing, Interpretation: Incomplete Domain Knowledge, Interpretation Rules
    Predefined Aspects of Information
(2) Ambiguities: Combinatorial Explosion
Morphological and Lexical Processing:
    Most words in English are ambiguous in terms of their parts of speech.
    runs: v/3pre, n/plu
    clubs: v/3pre, n/plu and two meanings
Syntactic Analysis: Structural Ambiguities
Semantic Analysis: Predicate-argument Ambiguities
Context processing / Interpretation
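To make the combinatorial explosion concrete, the sketch below enumerates every tag sequence a tiny lexicon allows. The lexicon is a toy assumption built from the two ambiguous words on the slide; a real lexicon is far larger and more ambiguous.

```python
from itertools import product

# Toy lexicon (assumption): each word maps to its possible readings.
TAGS = {
    "John": ["P-N"],
    "runs": ["V/3pre", "N/plu"],
    "clubs": ["V/3pre", "N/plu"],
}


def readings(tokens):
    """Enumerate every combination of per-word readings for the token list."""
    return list(product(*(TAGS[t] for t in tokens)))


for tokens in (["John", "runs"], ["John", "runs", "clubs"]):
    print(tokens, len(readings(tokens)))
```

With k two-way ambiguous words the number of candidate sequences is 2^k before any structural or semantic ambiguity is added, which is why later stages cannot simply try every combination.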
Structural Ambiguities
(1) Attachment Ambiguities
John bought a car with large seats.
John bought a car with $3000.
The manager of Yaxing Benz, a Sino-German joint venture
The manager of Yaxing Benz, Mr. John Smith
(2) Scope Ambiguities
young women and men in the room
(3) Analytical Ambiguities
Visiting relatives can be boring.
Semantic Ambiguities (1)
John bought a car with Mary.
$3000 can buy a nice car.
Semantic Ambiguities (2)
Every man loves a woman.
Co-reference Ambiguities
Note:
Ambiguities vs Robustness
More comprehensive knowledge: More Robust
big dictionaries
comprehensive grammar
More comprehensive knowledge: More ambiguities
Adaptability: Tuning, Learning
Framework of IE
IE as compromise NLP
Difficulties of NLP
(1) Robustness: Incomplete Knowledge
General Framework of NLP
    Morphological and Lexical Processing
    Syntactic Analysis
    Semantic Analysis
    Context processing / Interpretation: Incomplete Domain Knowledge, Interpretation Rules
Predefined Aspects of Information
Techniques in IE
(1) Domain Specific Partial Knowledge:
Knowledge relevant to information to be extracted
(2) Ambiguities:
Ignoring irrelevant ambiguities
Simpler NLP techniques
(3) Robustness:
Coping with Incomplete dictionaries
(open class words)
Ignoring irrelevant parts of sentences
(4) Adaptation Techniques:
Machine Learning, Trainable systems
General Framework of NLP
Morphological and Lexical Processing
    Part of Speech Tagger: FSA rules, statistical taggers (95 %)
    Open class words: Named entity recognition
        (ex) Locations, Persons, Companies, Organizations, Position names
        Local Context, Statistical Bias
        Domain specific rules:
            <Word><Word>, Inc.
            Mr. <Cpt-L>. <Word>
        Machine Learning: HMM, Decision Trees
        Rules + Machine Learning
        F-Value 90, Domain Dependent
Syntactic Analysis
Semantic Analysis
Context processing / Interpretation
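The two domain-specific rules can be approximated with regular expressions. This is only a sketch under stated assumptions: `<Word>` is read as a capitalized word and `<Cpt-L>` as a single capital letter, and the example sentence is invented for illustration.

```python
import re

# Assumed readings: <Word> = capitalized word, <Cpt-L> = one capital letter.
RULES = [
    ("COMPANY", re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+, Inc\.")),  # <Word><Word>, Inc.
    ("PERSON", re.compile(r"\bMr\. [A-Z]\. [A-Z][a-z]+")),         # Mr. <Cpt-L>. <Word>
]


def tag_entities(text):
    """Return (label, string) pairs for every rule match, in text order."""
    found = []
    for label, pattern in RULES:
        for m in pattern.finditer(text):
            found.append((m.start(), label, m.group()))
    return [(label, s) for _, label, s in sorted(found)]


# Hypothetical sentence, not taken from the slides.
entities = tag_entities("Mr. J. Smith joined Acme Widgets, Inc. last year.")
print(entities)
```

Rules of this kind rely only on local context, which is why they are usually combined with statistical evidence or learned models.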
FASTUS
Based on finite-state automata (FSA)
The FASTUS stages, aligned with the General Framework of NLP:
Morphological and Lexical Processing
    1. Complex Words:
    Recognition of multi-words and proper names
Syntactic Analysis
    2. Basic Phrases:
    Simple noun groups, verb groups and particles
    3. Complex Phrases:
    Complex noun groups and verb groups
Semantic Analysis
    4. Domain Events:
    Patterns for events of interest to the application
    Basic templates are to be built.
Context processing / Interpretation
    5. Merging Structures:
    Templates from different parts of the texts are
    merged if they provide information about the
    same entity or event.
Chomsky Hierarchy of Grammar        Hierarchy of Automata
Regular Grammar                     Finite State Automata
Context Free Grammar                Push Down Automata
Context Sensitive Grammar           Linear Bounded Automata
Type 0 Grammar                      Turing Machine
Moving down the hierarchy: computationally more complex, less efficient.
Example: $A^n B^n$ is context-free but not regular.
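The language a^n b^n shows the gap between the first two levels. In the sketch below, the recognizer keeps a counter, which plays the role of a one-symbol stack, while the regular pattern `a+b+` can only approximate the language. Both names are illustrative.

```python
import re


def is_anbn(s):
    """Accept a^n b^n (n >= 1); the counter acts as a one-symbol stack."""
    count = 0
    i = 0
    while i < len(s) and s[i] == "a":   # push phase
        count += 1
        i += 1
    if count == 0:
        return False
    while i < len(s) and s[i] == "b":   # pop phase
        count -= 1
        i += 1
    return i == len(s) and count == 0


# A regular approximation: it cannot check that the two counts are equal.
REGULAR_APPROX = re.compile(r"a+b+")

for s in ("aaabbb", "aabbb"):
    print(s, is_anbn(s), REGULAR_APPROX.fullmatch(s) is not None)
```

FSA-based systems accept this kind of over-generation in exchange for speed.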
[Figure: finite-state automaton with states 0-4 and arcs labeled PN, ’s, Art, ADJ, N and P, shown stepping through the noun group “John’s interesting book with a nice cover”.]
Pattern-matching
{PN ’s / Art} (ADJ)* N (P Art (ADJ)* N)*
PN ’s (ADJ)* N P Art (ADJ)* N
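The noun-group pattern can be tried out on a tag sequence with an ordinary regular expression, which is equivalent in power to a finite-state automaton. This sketch does not reproduce the transitions of the diagram. The toy lexicon and the tag name `POS` (standing for ’s) are assumptions for illustration.

```python
import re

# Toy lexicon (assumption); tags follow the slide, with POS standing for 's.
LEXICON = {
    "John": "PN", "'s": "POS", "a": "Art", "the": "Art",
    "interesting": "ADJ", "nice": "ADJ",
    "book": "N", "cover": "N", "with": "P",
}

# {PN 's / Art} (ADJ)* N (P Art (ADJ)* N)*  over space-separated tags
NOUN_GROUP = re.compile(r"(?:PN POS|Art) (?:ADJ )*N(?: P Art (?:ADJ )*N)*")


def tag(tokens):
    """Look up one tag per token; unknown words raise KeyError in this sketch."""
    return [LEXICON[t] for t in tokens]


def is_noun_group(tokens):
    """True if the whole tag sequence matches the noun-group pattern."""
    return NOUN_GROUP.fullmatch(" ".join(tag(tokens))) is not None


phrase = ["John", "'s", "interesting", "book", "with", "a", "nice", "cover"]
print(tag(phrase))
print(is_noun_group(phrase))
```

The match says nothing about where “with a nice cover” attaches. The pattern only delimits the span.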
Example of IE: FASTUS(1993)
Bridgestone Sports Co. said Friday it had set up a joint venture
in Taiwan with a local concern and a Japanese trading house to
produce golf clubs to be supplied to Japan.
The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20
million new Taiwan dollars, will start production in January 1990
with production of 20,000 “metal wood” clubs a month.
1. Complex words
Attachment ambiguities are not made explicit.
Structural ambiguities of NP are ignored, e.g. for “a Japanese tea house”:
a [Japanese tea] house
a Japanese [tea house]
2. Basic Phrases:
Bridgestone Sports Co. : Company name
said                   : Verb Group
Friday                 : Noun Group
it                     : Noun Group
had set up             : Verb Group
a joint venture        : Noun Group
in                     : Preposition
Taiwan                 : Location
3. Complex Phrases
Example of IE: FASTUS(1993)
[COMPANY] said Friday it [SET-UP] [JOINT-VENTURE]
in [LOCATION] with [COMPANY] and [COMPANY] to
produce [PRODUCT] to be supplied to [LOCATION].
[JOINT-VENTURE], [COMPANY], capitalized at 20 million
[CURRENCY-UNIT] [START] production in [TIME]
with production of 20,000 [PRODUCT] a month.
3. Complex Phrases
Some syntactic structures
like …
Example of IE: FASTUS(1993)
[COMPANY] said Friday it [SET-UP] [JOINT-VENTURE]
in [LOCATION] with [COMPANY] to
produce [PRODUCT] to be supplied to [LOCATION].
[JOINT-VENTURE] capitalized at [CURRENCY] [START]
production in [TIME]
with production of [PRODUCT] a month.
3. Complex Phrases
Syntactic structures relevant
to the information to be extracted
are dealt with.
Syntactic variations
GM set up a joint venture with Toyota.
GM announced it was setting up a joint venture with Toyota.
GM signed an agreement setting up a joint venture with Toyota.
GM announced it was signing an agreement to set up a joint
venture with Toyota.
GM plans to set up a joint venture with Toyota.
GM expects to set up a joint venture with Toyota.
The verb complex of each variant is reduced to [SET-UP]:
[S [NP GM] [VP [V signed] [NP [N agreement] [VP [V setting up]]]]]
is treated like
[S [NP GM] [VP [V set up]]]
Example of IE: FASTUS(1993)
[COMPANY] [SET-UP] [JOINT-VENTURE]
in [LOCATION] with [COMPANY] to
produce [PRODUCT] to be supplied to [LOCATION].
[JOINT-VENTURE] capitalized at [CURRENCY] [START]
production in [TIME]
with production of [PRODUCT] a month.
3. Complex Phrases
4. Domain Events
[COMPANY] [SET-UP] [JOINT-VENTURE] with [COMPANY]
[COMPANY] [SET-UP] [JOINT-VENTURE] (others)* with [COMPANY]
The attachment positions of PPs are determined at this stage.
Irrelevant parts of sentences are ignored.
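A domain-event pattern with an `(others)*` gap can be sketched as a regular expression over the sequence of chunk labels. The encoding, the function name and the convention that a preposition is labeled by the word itself are assumptions for illustration, not the FASTUS implementation.

```python
import re

# [COMPANY][SET-UP][JOINT-VENTURE] (others)* with [COMPANY], over encoded labels.
TIE_UP = re.compile(
    r"(?P<a><COMPANY>)<SET-UP><JOINT-VENTURE>(?:<[^>]+>)*?<with>(?P<b><COMPANY>)"
)


def find_tie_up(chunks):
    """chunks: list of (label, text). Return a basic TIE-UP template or None."""
    encoded = "".join(f"<{label}>" for label, _ in chunks)
    m = TIE_UP.search(encoded)
    if m is None:
        return None
    # A chunk's index equals the number of '<' before its start in the encoding.
    first = encoded[: m.start("a")].count("<")
    second = encoded[: m.start("b")].count("<")
    return {"Relationship": "TIE-UP",
            "Entities": [chunks[first][1], chunks[second][1]]}


chunks = [
    ("COMPANY", "Bridgestone Sports Co."),
    ("SET-UP", "had set up"),
    ("JOINT-VENTURE", "a joint venture"),
    ("in", "in"),
    ("LOCATION", "Taiwan"),
    ("with", "with"),
    ("COMPANY", "a local concern"),
]
print(find_tie_up(chunks))
```

The gap skips “in Taiwan”, so the PP “with a local concern” is attached to the set-up event by the pattern itself rather than by a parser.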
Complications caused by syntactic variations
Relative clause
The mayor, who was kidnapped yesterday, was found dead today.
[NG] Relpro {NG/others}* [VG] {NG/others}*[VG]
[NG] Relpro {NG/others}* [VG]
Basic patterns → Surface Pattern Generator → Patterns used by Domain Event
The Surface Pattern Generator handles relative clause construction, passivization, etc.
FASTUS
Based on finite-state automata (FSA)
Example: NP, who was kidnapped, was found.
1. Complex Words:
2. Basic Phrases:
3. Complex Phrases:
4. Domain Events:
Patterns for events of interest to the application
Basic templates are to be built.
(Piece-wise recognition of basic templates)
5. Merging Structures:
Templates from different parts of the texts are
merged if they provide information about the
same entity or event.
(Reconstructing information carried via syntactic structures by merging basic templates)
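The merging step can be illustrated as a simple unification of flat templates: two templates merge when no slot filled in both carries conflicting values. This is a hedged sketch. The `merge` function and the dictionary representation are assumptions, and deciding that two templates describe the same entity or event (e.g. by co-reference) is left out.

```python
def merge(t1, t2):
    """Merge two flat templates; return None if a slot filled in both conflicts."""
    merged = dict(t1)
    for slot, value in t2.items():
        current = merged.get(slot)
        if current is not None and value is not None and current != value:
            return None                  # incompatible: different entity or event
        if current is None:
            merged[slot] = value         # fill an empty or missing slot
    return merged


# Basic templates from the two sentences of the joint-venture example.
first = {"Relationship": "TIE-UP",
         "Joint Venture Company": None,
         "Amount": None}
second = {"Relationship": "TIE-UP",
          "Joint Venture Company": "Bridgestone Sports Taiwan Co.",
          "Amount": "NT$20000000"}
other = {"Relationship": "TIE-UP", "Joint Venture Company": "Yaxing Benz"}

combined = merge(first, second)
print(combined)
print(merge(combined, other))
```

The second call returns None because the two templates name different joint venture companies.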
Current state of the art of IE
1. Carefully constructed IE systems
F-60 level (inter-annotator agreement: 60-80%)
Domains:
Telegraphic messages about naval operations
(MUC-1: 1987, MUC-2: 1989)
News articles and transcriptions of radio broadcasts about
Latin American terrorism (MUC-3: 1991, MUC-4: 1992)
News articles about joint ventures (MUC-5: 1993)
News articles about management changes (MUC-6: 1995)
News articles about space vehicles (MUC-7: 1997)
2. Handcrafted rules (named entity recognition, domain events, etc.)
Automatic learning from texts:
Supervised learning: corpus preparation
Non-supervised, or controlled learning
IE in Biology
CSNDB
(National Institute of Health Sciences)
• A data- and knowledge- base for signaling
pathways of human cells.
– It compiles the information on biological molecules,
sequences, structures, functions, and biological
reactions which transfer the cellular signals.
– Signaling pathways are compiled as binary
relationships of biomolecules and represented by
graphs drawn automatically.
– CSNDB is constructed on ACEDB and inference
engine CLIPS, and has a linkage to TRANSFAC.
– Final goal is to make a computerized model for various
biological phenomena.
Example. 1
• A Standard Reaction
Signal_Reaction:
“EGF receptor → Grb2”
From_molecule “EGF receptor”
To_molecule “Grb2”
Tissue “liver”
Effect “activation”
Interaction
“SH2+phosphorylated Tyr”
Reference [Yamauchi_1997]
Excerpted @[Takai98]
Example. 3
• A Polymerization Reaction
Signal_Reaction:
“Ah receptor + HSP90 ”
Component “Ah receptor” “HSP90”
Effect “activation dissociation”
Interaction
“PAS domain”
“of Ah receptor”
Activity
“inactivation of Ah receptor”
Reference [Powell-Coffman_1998]
Excerpted @[Takai98]
FASTUS
Based on finite-state automata (FSA)
Is separation of stages possible?
1. Complex Words:
Recognition of multi-words and proper names
    Open word classes: technical terms
        very long
        specific formation rules
        many semantic classes
        acronyms
        variants
        fairly ambiguous
    Term recognition
    Coordination across word formation: A or B and C D
2. Basic Phrases:
Simple noun groups, verb groups and particles
3. Complex Phrases:
Complex noun groups and verb groups
4. Domain Events:
Patterns for events of interest to the application
Basic templates are to be built.
5. Merging Structures:
Templates from different parts of the texts are
merged if they provide information about the
same entity or event.
Syntax/Semantics
An active phorbol ester must therefore, presumably
by activation of protein kinase C, cause dissociation
of a cytoplasmic complex of NF-kappa B and I kappa B
by modifying I kappa B.
E1: An active phorbol ester activates protein kinase C.
E2: The active phorbol ester modifies I kappa B.
E3: It dissociates a cytoplasmic complex of NF-kappa B
and I kappa B.
Part-Whole
Full parser based on good grammar formalisms
1. Several attempts of using full parsers :
To improve the Precision
2. Systematic treatment of interaction of the
different phases :
Unification-based grammar formalisms
The two papers in the NLP session of PSB 2001
Experiment
(A. Yakushiji et al., PSB 2001)
XHPSG: HPSG-like grammar translated from
XTAG of U-Penn (Y. Tateishi, TAG+ workshop 98)
Automatic conversion: Detailed, empirical comparison of
grammars of different formalisms (+LFG)
Terms (compound nouns) are chunked beforehand.
180 sentences from abstracts in MEDLINE
Average parse time per sentence: 2.7 sec with a naïve parser
(the multi-stage parser can improve this 50-fold)
Argument Frame Extractor
133 argument structures, marked by a domain specialist,
in 97 of the 180 sentences

Result                          Count
Extracted uniquely                 31
Extracted with ambiguity           32
Parsing failures:
  Extractable from pp's            26
  Not extractable                  27
Memory limitation, etc.            17

68%
Ontology: Knowledge of the Domain
Open class words:
Named entity recognition
(ex) Locations
Persons
Companies
Organizations
Position names
More refined semantic
classes with part-whole
relationships, properties,
Etc.
Acronyms, variants,
Etc.
Bio Term Bank
• A database for all sorts of biological terms collected
from genome databases and biological texts.
• It will contain 2 million terms in 2001 and 5 million
terms by 2005.
• Terms are classified by biochemical and
terminological attributes, grounded on their
resources.
Biological ontology committee Japan,
organized by T. Takagi and T. Takai, U. Tokyo,
in Genome Projects of MESSC (2000.4~2005.3)
GENIA ontology
(current version)
+-name-+-source-+-natural-+-organism-+-multi-cell organism
       |        |         |          +-mono-cell organism
       |        |         |          +-virus
       |        |         +-tissue
       |        |         +-cell type
       |        |         +-sub-location of cells
       |        +-artificial-+-cell line
       +-substance-+-compound-+-organic-+-amino-+-protein-+-protein family or group
                                        |       |         +-protein complex
                                        |       |         +-individual protein molecule
                                        |       |         +-subunit of protein complex
                                        |       |         +-substructure of protein
                                        |       |         +-domain or region of protein
                                        |       +-peptide
                                        |       +-amino acid monomer
                                        +-nucleic-+-DNA-+-DNA family or group
                                                  |     +-individual DNA molecule
                                                  |     +-domain or region of DNA
                                                  +-RNA-+-RNA family or group
                                                        +-individual RNA molecule
                                                        +-domain or region of RNA
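A class hierarchy like the one above can be stored as a nested mapping and queried for ancestry. The sketch below is an illustration only: the nesting follows one reading of the tree on the slide, and the names `GENIA`, `path_to`, and `is_a` are hypothetical, not part of any GENIA tool.

```python
# Nested dict: each key is a class, each value holds its subclasses.
GENIA = {
    "name": {
        "source": {
            "natural": {
                "organism": {
                    "multi-cell organism": {},
                    "mono-cell organism": {},
                    "virus": {},
                },
                "tissue": {},
                "cell type": {},
                "sub-location of cells": {},
            },
            "artificial": {"cell line": {}},
        },
        "substance": {"compound": {"organic": {
            "amino": {
                "protein": {
                    "protein family or group": {},
                    "protein complex": {},
                    "individual protein molecule": {},
                    "subunit of protein complex": {},
                    "substructure of protein": {},
                    "domain or region of protein": {},
                },
                "peptide": {},
                "amino acid monomer": {},
            },
            "nucleic": {
                "DNA": {
                    "DNA family or group": {},
                    "individual DNA molecule": {},
                    "domain or region of DNA": {},
                },
                "RNA": {
                    "RNA family or group": {},
                    "individual RNA molecule": {},
                    "domain or region of RNA": {},
                },
            },
        }}},
    },
}

def path_to(node, tree=GENIA, prefix=()):
    """Return the root-to-node path as a tuple, or None if the class is unknown."""
    for name, children in tree.items():
        here = prefix + (name,)
        if name == node:
            return here
        found = path_to(node, children, here)
        if found:
            return found
    return None

def is_a(child, ancestor):
    """True if `ancestor` lies on the path from the root to `child`."""
    path = path_to(child)
    return path is not None and ancestor in path

cell_line_path = path_to("cell line")
```

A lookup such as `is_a("protein complex", "substance")` then answers whether an annotated class falls under a more general one.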
Expansion of GENIA Ontology
• Try to tag all NPs in some MEDLINE abstracts
and find the classes that appear in the abstracts
but not in the current ontology
• Find frequent verbs and the classes of arguments
they take
Expansion of GENIA Ontology
• Chemical class of substances and their substructures
• Sources
• Biological role, or function, of substances
• Reaction
– Biological reaction
– Pathway
– Disease
• Structures themselves
• Experiment, experimental results, and researchers
• Measure
Examples of Entities in the Expanded Ontology
• Biological role, or function, of substances
– receptor, inhibitor, …
• Biological reaction
– activation, binding, inhibition, apoptosis, G2 arrest
– pathway, signal
– immune dysfunction, Ataxia telangiectasia (AT)
• Structure themselves
– alpha-helix,
• Experiment, experimental results, researchers
– our results, these studies, we
Verbs Related to Biological Events
Frequent Verbs in 100 MEDLINE Abstracts
Verb          Count   Verb          Count   Verb          Count   Verb          Count
be              255   involve          16   determine         9   explain           6
induce           56   identify         16   construct         9   exert             6
bind             50   act              15   associate         9   enhance           6
show             49   stimulate        14   reduce            8   display           6
suggest          42   provide          14   prevent           8   characterize      6
activate         42   express          13   locate            8   participate       5
factor           36   affect           13   line              8   localize          5
demonstrate      35   type             12   differ            8   investigate       5
inhibit          26   report           12   trigger           7   imply             5
have             25   form             12   synergize         7   establish         5
reveal           21   contribute       12   examine           7   conclude          5
require          21   study            11   block             7   compare           5
regulate         21   observe          11   become            7   use               4
indicate         21   lead             11   analyze           7   transform         4
find             21   function         11   target            6   transfect         4
result           20   assay            11   signal            6   test              4
play             19   appear           11   remain            6   suppress          4
interact         18   occur            10   produce           6   support           4
mediate          17   increase         10   present           6   substitute        4
contain          17   phosphorylate     9   possess           6   share             4
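A verb frequency list of this kind could be produced along the following lines. This is a sketch under stated assumptions, not the procedure used for the slide: it assumes the abstracts have already been lemmatized and part-of-speech tagged with Penn Treebank-style tags (verbs start with `VB`), and the sample tokens are made up.

```python
from collections import Counter

def verb_counts(tagged_tokens):
    """Count lemmas whose tag marks them as verbs (Penn Treebank VB* tags assumed)."""
    return Counter(lemma for lemma, tag in tagged_tokens if tag.startswith("VB"))

# Hypothetical (lemma, tag) pairs standing in for tagged MEDLINE text.
tagged = [
    ("PROTEIN", "NN"), ("induce", "VBD"), ("activation", "NN"),
    ("be", "VBZ"), ("bind", "VBP"), ("induce", "VBN"), ("be", "VBD"),
]
counts = verb_counts(tagged)
```

Because the counts rest entirely on the tagger's output, any word the tagger labels as a verb will appear in the list.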
Verbs Related to Biological Events
Verbs that take biological entities as arguments
• induce
– noun BE INDUCED BY noun
    activation of these PROTEIN was induced by PROTEIN
– noun INDUCE noun
    PROTEIN induced the tyrosine phosphorylation
• bind
– noun BIND TO noun
    the drugs bind to two different PROTEIN
– noun BIND noun
    motifs previously found to bind the cellular factors
– noun BINDING noun
    the TATA-box binding protein
– the BINDING of noun
    the binding of PROTEIN
semantic class: substance structure source experiment fact reaction
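The two "induce" patterns can be approximated with regular expressions over text in which entity mentions have already been replaced by class labels such as PROTEIN. This is a rough sketch, not the extractor used in the project: the pattern table, the group names `agent` and `theme`, and the function `match_induce` are hypothetical, and real clauses would need parsing rather than whole-string matching.

```python
import re

# Passive is listed first so that "was induced by" is not read as an active clause.
PATTERNS = {
    "noun BE INDUCED BY noun": re.compile(
        r"(?P<theme>.+?)\s+(?:is|are|was|were)\s+induced\s+by\s+(?P<agent>.+)"
    ),
    "noun INDUCE noun": re.compile(
        r"(?P<agent>.+?)\s+induc(?:e|es|ed)\s+(?P<theme>.+)"
    ),
}

def match_induce(clause):
    """Return (pattern name, agent, theme) for the first matching pattern, else None."""
    for name, pattern in PATTERNS.items():
        m = pattern.fullmatch(clause)
        if m:
            return name, m.group("agent"), m.group("theme")
    return None

passive = match_induce("activation of these PROTEIN was induced by PROTEIN")
active = match_induce("PROTEIN induced the tyrosine phosphorylation")
```

Both surface forms map to the same agent/theme roles, which is the point of listing the syntactic variants of each verb.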
Verbs Related to Biological Events
Verbs that take description entities
• report
– noun REPORT that-clause
    we report here that PROTEIN is activated by PROTEIN
– noun REPORT noun
    we report the characterization of PROTEIN
– noun REPORT noun
    we report a novel structure of PROTEIN
semantic class: substance structure source experiment fact reaction
Verbs Related to Biological Events
Verbs whose arguments depend on syntactic patterns
• show
– noun BE SHOWN to-infinitive
    PROTEIN has been shown to trigger cellular PROTEIN activity
– noun SHOW that-clause
    the data show that PROTEIN stimulation is also not sufficient
– noun SHOW noun
    SOURCE showed a dose-dependent inhibition of PROTEIN activity
semantic class: substance source experiment fact
Verbs Related to Biological Events
Verbs that take both kinds of entities
• indicate
– noun INDICATE that-clause
    the data indicate that PROTEIN is required in CELL proliferation
– noun INDICATE noun
    these findings indicate an unexpected role of DNA
– noun INDICATE that-clause
    the structure indicates that it represents a unique class of PROTEIN
– noun INDICATE noun
    the structure indicates mechanisms for allosteric effector action
semantic class: substance structure source experiment fact reaction role
Example of NE Annotation
UI - 85146267
TI - Characterization of <NE ti="3" class="protein" nm="aldosterone binding site" mt="SV" subclass="family_or_group" unsure="Class"
cmt="">aldosterone binding sites</NE ti="3"> in circulating <NE ti="2" class="cell_type" nm="human mononuclear leukocyte"
mt="SV" unsure="OK" cmt="">human mononuclear leukocytes</NE ti="2">.
AB - <NE ti="4" class="protein" nm="Aldosterone binding sites" mt="SV" subclass="family_or_group" unsure="Class"
cmt="">Aldosterone binding sites</NE ti="4"> in <NE ti="1" class="cell_type" nm="human mononuclear leukocyte" mt="SV"
unsure="OK" cmt="">human mononuclear leukocytes</NE ti="1"> were characterized after separation of cells from blood by a Percoll
gradient. After washing and resuspension in <NE ti="5" class="other_organic_compounds" nm="RPMI-1640 medium" mt="SV"
unsure="OK" cmt="">RPMI-1640 medium</NE ti="5">, cells were incubated at 37 degrees C for 1 h with different concentrations of
<NE ti="6" class="other_organic_compounds" nm="[3H]aldosterone" mt="SV" unsure="OK" cmt="">[3H]aldosterone</NE ti="6">
plus a 100-fold concentration of <NE ti="7" class="other_organic_compounds" nm="RU-26988" mt="SV" unsure="OK" cmt="">RU26988 </NE ti="7">(<NE ti="17" class="other_organic_compounds" nm="11 alpha, 17 alpha-dihydroxy-17 beta-propynylandrost-1,4,6trien-3-one" mt="SV" unsure="OK" cmt="">11 alpha, 17 alpha-dihydroxy-17 beta-propynylandrost-1,4,6-trien-3-one</NE ti="17">),
with or without an excess of unlabeled <NE ti="8" class="other_organic_compounds" nm="aldosterone" mt="SV" unsure="OK"
cmt="">aldosterone</NE ti="8">. <NE ti="9" class="other_organic_compounds" nm="Aldosterone" mt="SV" unsure="OK"
cmt="">Aldosterone</NE ti="9"> binds to a single class of <NE ti="10" class="protein" nm="receptor" mt="SV"
subclass="family_or_group" unsure="OK" cmt="">receptors</NE ti="10"> with an affinity of 2.7 +/- 0.5 nM (means +/- SD, n = 14)
and a capacity of 290 +/- 108 sites/cell (n = 14). The specificity data show a hierarchy of affinity of <NE ti="11"
class="other_organic_compounds" nm="desoxycorticosterone" mt="SV" unsure="OK" cmt="">desoxycorticosterone</NE ti="11"> =
<NE ti="12" class="other_organic_compounds" nm="corticosterone" mt="SV" unsure="OK" cmt="">corticosterone</NE ti="12"> =
<NE ti="13" class="other_organic_compounds" nm="aldosterone" mt="SV" unsure="OK" cmt="">aldosterone</NE ti="13"> greater
than <NE ti="14" class="other_organic_compounds" nm="hydrocortisone" mt="SV" unsure="OK" cmt="">hydrocortisone</NE
ti="14"> greater than <NE ti="15" class="other_organic_compounds" nm="dexamethasone" mt="SV" unsure="OK"
cmt="">dexamethasone</NE ti="15">. The results indicate that <NE ti="17" class="cell_type" nm="mononuclear leukocyte" mt="SV"
unsure="OK" cmt="">mononuclear leukocytes</NE ti="17"> could be useful for studying the physiological significance of these <NE
ti="16" class="protein" nm="mineralocorticoid receptor" mt="SV" subclass="family_or_group" unsure="OK"
cmt="">mineralocorticoid receptors</NE ti="16"> and their regulation in humans.
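Annotations in this format can be read back with a small amount of code. Because the closing tags repeat the `ti` attribute, a standard XML parser would reject the markup, so the sketch below uses regular expressions instead. It is an illustration only: the function name `parse_entities` is hypothetical, and the code assumes straight double quotes and no nested NE elements.

```python
import re

# Opening tag with attributes, shortest possible content, closing tag that may carry ti="..".
NE_RE = re.compile(r'<NE\s+(?P<attrs>[^>]*)>(?P<text>.*?)</NE[^>]*>', re.DOTALL)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_entities(marked_up):
    """Return one dict per NE element: its attributes plus the annotated text."""
    entities = []
    for m in NE_RE.finditer(marked_up):
        attrs = dict(ATTR_RE.findall(m.group("attrs")))
        attrs["text"] = " ".join(m.group("text").split())  # normalize line breaks
        entities.append(attrs)
    return entities

# A fragment copied from the annotated abstract above.
sample = (
    '<NE ti="9" class="other_organic_compounds" nm="Aldosterone" mt="SV" unsure="OK" '
    'cmt="">Aldosterone</NE ti="9"> binds to a single class of <NE ti="10" class="protein" '
    'nm="receptor" mt="SV" subclass="family_or_group" unsure="OK" cmt="">receptors</NE ti="10">'
)
entities = parse_entities(sample)
```

Each resulting record carries the ontology class (and subclass, where given) next to the surface string, which is what a term classifier would be trained or evaluated on.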
Available from our website:
Definition of ontological classes
Manual of GMPL: extension of XML to annotate texts
Manual of Text Annotation
Soon: Annotated texts (1000 abstracts) by the end of March
1. IE can contribute significantly to bioinformatics.
2. However, the domains in biochemistry seem
structurally richer than the domains we have dealt with so
far:
term formation, rich ontologies, complex syntactic
structures.
3. This requires substantial effort in resource building.
4. However, those resources can contribute to other
applications:
knowledge sharing, intelligent IR, knowledge discovery
One of the crucial techniques is ATR (Automatic Term Recognition) …
Overview of GENIA System
(Diagram: Retrieval, Corpus, Interface, Concept, Information Extraction, and Database modules; MEDLINE; corpus; background knowledge and ontology; database; user)