Abbreviation Extraction Using Conditional Random Fields

Naoaki Okazaki 1, Sophia Ananiadou 2, 3
Jun’ichi Tsujii 1, 2, 3
1 The University of Tokyo
2 The University of Manchester
3 The National Centre for Text Mining
Abbreviation recognition (AR)
To extract abbreviations and their expanded
forms appearing in actual text
The PCR-RFLP technique was applied to
analyze the distribution of estrogen
receptor (ESR) gene, follicle-stimulating
hormone beta subunit (FSH beta) gene and
prolactin receptor (PRLR) gene in three
lines of Jinhua pigs.
PMID: 14986424
Definitions: an abbreviation (short form) paired with its expanded form (long form; full form) in the text, e.g., "estrogen receptor (ESR)"; expanded forms are subject to term variation.
AR for disambiguating abbreviations
Sense inventories (abbreviation dictionaries)
Training corpora for disambiguation (context
information of expanded forms)
Local definitions
AR for disambiguating abbreviations
Sense inventories (abbreviation dictionaries)

What can ‘CT’ stand for?
- Acromine
http://www.nactem.ac.uk/software/acromine/
AR for disambiguating abbreviations
Sense inventories (abbreviation dictionaries)
Training corpora for disambiguation (context
information of expanded forms)
Sentences (contexts) in which CT is defined:
  ... evaluated using preoperative computed tomography (CT) scan, ...
  ... by oral administration with the adjuvant cholera toxin (CT), ...
Training a classifier on these contexts disambiguates a new occurrence:
  Biopsies from bone metastatic lesions were performed under CT scan, ...
  → CT = computed tomography
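The talk does not prescribe a particular classifier; as a rough illustration of how contexts collected by AR could drive disambiguation, here is a toy bag-of-words sketch (the function names and the overlap heuristic are ours, not from the paper):

```python
from collections import Counter

# Contexts gathered by AR, keyed by the expanded form they define (from the slide).
training_contexts = {
    "computed tomography": ["evaluated using preoperative computed tomography scan"],
    "cholera toxin": ["by oral administration with the adjuvant cholera toxin"],
}

def bag(text: str) -> Counter:
    return Counter(text.lower().split())

# One bag-of-words profile per sense of 'CT'.
profiles = {sense: sum((bag(s) for s in sents), Counter())
            for sense, sents in training_contexts.items()}

def disambiguate(sentence: str) -> str:
    """Pick the sense whose defining contexts share the most words with the sentence."""
    words = bag(sentence)
    return max(profiles, key=lambda sense: sum(min(words[w], profiles[sense][w]) for w in words))

print(disambiguate("Biopsies from bone metastatic lesions were performed under CT scan"))
# -> 'computed tomography' (the word 'scan' overlaps with the first context)
```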
AR for disambiguating abbreviations
Sense inventories (abbreviation dictionaries)
Training corpora for disambiguation (context
information of expanded forms)
Local definitions (one-sense-per-discourse assumption)
Mice can be sensitized to food proteins
by oral administration with the adjuvant
cholera toxin (CT), ...
BALB/c mice were fed with CT or PBS. The
impact of CT on DC subsets ...
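The one-sense-per-discourse assumption can be made concrete with a small sketch (illustrative only; the abstract text is from the slide, while the regular expression and the five-word window are our own simplification):

```python
import re

# Under the one-sense-per-discourse assumption, a local definition fixes the
# sense of the short form for the rest of the abstract.
abstract = ("Mice can be sensitized to food proteins by oral administration "
            "with the adjuvant cholera toxin (CT). BALB/c mice were fed with "
            "CT or PBS. The impact of CT on DC subsets was examined.")

definitions = {}
for m in re.finditer(r"\(([A-Za-z0-9-]{2,10})\)", abstract):
    # Keep the words preceding the parenthesis; the alignment model presented
    # later decides which of them form the true expanded form.
    definitions[m.group(1)] = abstract[:m.start()].split()[-5:]

for short, preceding in definitions.items():
    mentions = len(re.findall(rf"\b{re.escape(short)}\b", abstract))
    print(f"{short}: defined once, mentioned {mentions} times; "
          f"candidate words: {' '.join(preceding)}")
```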
AR for disambiguating abbreviations
Sense inventories (abbreviation dictionaries)
Training corpora for disambiguation (context
information of expanded forms)
Local definitions
AR plays a key role in managing abbreviations in
text
Outline of this presentation
• Introduction (done)
• Methodologies
  – Abbreviation candidate / region for definitions (common to previous work and this study)
  – Previous work and unsolved problems
  – This study: abbreviation alignment, computing features, maximum entropy modeling
• Experiments
• Conclusion
Step 0: Sample text
We investigate the effect of thyroid
transcription factor 1 (TTF-1) in human C
cells ...
The task
Extract an abbreviation definition from this text
We do not extract one if no abbreviation definition is
found in the text


TTF-1: thyroid transcription factor 1
Step 1: Abbreviation candidates in parentheses
We investigate the effect of thyroid
transcription factor 1 (TTF-1) in human C
cells ...
• Parenthetical expressions as clues for abbreviations
• Requirements for abbreviation candidates (Schwartz and Hearst, 03):
  – the inner expression consists of two words at most
  – the length is between two and ten characters
  – the expression contains at least one alphabetic letter
  – the first character is alphanumeric
• Abbreviation candidate: y = ‘TTF-1’
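As a concrete reading of these requirements, a minimal candidate filter might look like the following sketch (function names are ours; this is not the authors' code):

```python
import re

def is_abbreviation_candidate(inner: str) -> bool:
    """Schwartz & Hearst (2003) style requirements for a parenthesized expression."""
    inner = inner.strip()
    return (
        len(inner.split()) <= 2                  # at most two words
        and 2 <= len(inner) <= 10                # between two and ten characters
        and any(c.isalpha() for c in inner)      # at least one alphabetic letter
        and inner[:1].isalnum()                  # first character is alphanumeric
    )

text = ("We investigate the effect of thyroid transcription factor 1 (TTF-1) "
        "in human C cells ...")
candidates = [m.group(1) for m in re.finditer(r"\(([^()]+)\)", text)
              if is_abbreviation_candidate(m.group(1))]
print(candidates)   # ['TTF-1']
```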
Step 2: Region for extracting abbreviation definitions
We investigate the effect of thyroid
transcription factor 1 (TTF-1) in human C
cells ...
Heuristic for the region in which to search for the expanded form (Schwartz and Hearst, 03):
  take min(m + 5, 2m) words before the abbreviation, where m is the number of alphanumeric letters in the abbreviation
  here, take 8 words before the parentheses (m = 4)
The remaining task is to extract a true expanded form (if any) from this region
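The window computation is small enough to show directly (a sketch under the same assumptions as above):

```python
def definition_window(text: str, paren_index: int, abbreviation: str) -> str:
    """Return the min(m + 5, 2m) words preceding the opening parenthesis,
    where m counts the alphanumeric letters of the abbreviation."""
    m = sum(c.isalnum() for c in abbreviation)
    words = text[:paren_index].split()
    return " ".join(words[-min(m + 5, 2 * m):])

text = ("We investigate the effect of thyroid transcription factor 1 (TTF-1) "
        "in human C cells ...")
print(definition_window(text, text.index("("), "TTF-1"))
# 8 words: 'investigate the effect of thyroid transcription factor 1'
```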
Previous studies: Finding expanded forms
Rule-based
  Deterministic algorithm (Schwartz & Hearst, 03)
    Maps every alphanumeric letter of the abbreviation to a letter in the candidate region, scanning both the abbreviation and the candidate expanded form from the end, right to left.
We investigate the effect of thyroid
transcription factor 1 (TTF-1) in human C
cells ...
TTF-1: transcription factor 1
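A faithful sketch of this right-to-left matching (our transcription of the published pseudocode; variable names are ours). On the sample text it returns 'transcription factor 1', reproducing the result shown above:

```python
def find_best_long_form(short_form: str, region: str) -> str | None:
    """Schwartz & Hearst (2003): match the abbreviation's alphanumeric letters
    against the candidate region, right to left."""
    s = len(short_form) - 1    # index into the abbreviation
    l = len(region) - 1        # index into the candidate region
    while s >= 0:
        c = short_form[s].lower()
        if not c.isalnum():    # skip '-', '.', etc. in the abbreviation
            s -= 1
            continue
        # Scan left for a matching character; the abbreviation's first character
        # must additionally sit at the start of a word in the region.
        while l >= 0 and (region[l].lower() != c or
                          (s == 0 and l > 0 and region[l - 1].isalnum())):
            l -= 1
        if l < 0:
            return None        # no expanded form in this region
        s -= 1
        l -= 1
    return region[region.rfind(" ", 0, l + 1) + 1:]

region = "investigate the effect of thyroid transcription factor 1"
print(find_best_long_form("TTF-1", region))   # 'transcription factor 1'
```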
Previous studies: Finding expanded forms
Rule-based
Deterministic algorithm (Schwartz & Hearst, 03)
Four scoring rules for multiple candidates (Adar, 04)
  +1: for every abbreviation letter matched at the head of a word
  -1: for every extra word between the definition and the parentheses
  +1: for definitions immediately followed by the parentheses
  -1: for every extra word
Example scores for TTF-1:
  +4  transcription factor 1 (TTF-1)
  +5  thyroid transcription factor 1 (TTF-1)
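A toy scorer in the spirit of these rules (our simplification, not SaRAD's exact rules; it assumes candidates run right up to the parenthesis, so the gap penalty never fires here). It reproduces the +4 / +5 example above:

```python
def score_candidate(abbreviation: str, candidate_words: list[str]) -> int:
    """Score a candidate expanded form in the spirit of Adar (2004)."""
    letters = [c.lower() for c in abbreviation if c.isalnum()]
    heads = [w[0].lower() for w in candidate_words]
    score = 0
    for c in letters:
        if c in heads:           # +1 per abbreviation letter matched at a word head
            heads.remove(c)      # each word head licenses at most one letter
            score += 1
    score += 1                   # definition immediately followed by the parentheses
    score -= len(heads)          # -1 for every word whose head letter goes unused
    return score

print(score_candidate("TTF-1", "transcription factor 1".split()))          # 4
print(score_candidate("TTF-1", "thyroid transcription factor 1".split()))  # 5
```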
Previous studies: Finding expanded forms
Rule-based
  Deterministic algorithm (Schwartz & Hearst, 03)
  Four scoring rules for multiple candidates (Adar, 04)
  Detailed rules (Ao & Takagi, 05)
    Two pages long (!) in their paper
Previous studies: Finding expanded forms
Rule-based
  Deterministic algorithm (Schwartz & Hearst, 03)
  Four scoring rules for multiple candidates (Adar, 04)
  Detailed rules (Ao & Takagi, 05)
Machine-learning based (Nadeau & Turney, 05; Chang & Schutze, 06)
  Aimed at obtaining an optimal set of rules through training
  Uses 10-20 features that roughly correspond to the rules proposed by the earlier work
    "# of abbreviation letters matching the first letter of a word"
    "# of abbreviation letters that are capitalized in the definition"
Problems in previous studies
• Difficult to tweak the extraction rules by hand
  – of blood lactate accumulation (OBLA) or onset of blood lactate accumulation (OBLA)
• Difficult to handle non-definitions (negatives)
  – of postoperative AF in patients submitted to CABG without cardiopulmonary bypass (off-pump)
• Difficult to recognize shuffled abbreviations
  – receptor of estrogen (ER)
• No breakthrough was reported by applying ML
  – Previous studies used only a few features that are reproductions of the rule-based methods
This study
Predict origins of abbreviation letters (alignment)
Choose the most probable alignment a between the abbreviation candidate y and its surrounding expression x:

    \hat{a} = \operatorname*{argmax}_{a \in C(x, y)} P(a \mid x, y)

where C(x, y) is the set of possible alignments given by Steps 1 and 2.

Maximum Entropy Modeling
Discriminative training of the abbreviation alignment model
  A large number of features that directly express the events where letters in an expanded form produce or do not produce abbreviation letters
  A corpus annotated with abbreviation alignments
Step 3: C(x, y): Alignment candidates
• Mark the positions of letters in x that also appear in the abbreviation
• Associate every alphanumeric letter of y with exactly one letter in x
• Always include a negative alignment that assigns no letters at all (in case no definition is appropriate)
• Letters (or letter sequences) of the abbreviation may be reordered up to d times
  – e.g., distortion = 1: swap 'thyroid' and 'transcription'
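A simplified enumeration of C(x, y) for the d = 0 case (no reordering); the function and its monotonicity constraint are our own sketch of the construction described above:

```python
def alignment_candidates(x: str, y: str):
    """Enumerate candidate alignments between the definition region x and the
    abbreviation candidate y (d = 0: mapped positions strictly increase).
    Each candidate maps every alphanumeric letter of y to a position in x;
    the all-None candidate is the negative ('no definition') alignment."""
    letters = [c.lower() for c in y if c.isalnum()]

    def extend(k: int, last: int):
        if k == len(letters):
            yield []
            return
        for pos in range(last + 1, len(x)):
            if x[pos].lower() == letters[k]:
                for rest in extend(k + 1, pos):
                    yield [pos] + rest

    return [[None] * len(letters)] + list(extend(0, -1))

region = "thyroid transcription factor 1"
print(len(alignment_candidates(region, "TTF-1")))   # 4 candidates for this region
```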
Step 4: Abbreviation alignment
[Figure: the selected alignment for the running example. Letters 't' and 'f' at the marked positions did not produce abbreviation letters; among multiple possibilities, the non-null mappings that are the most adjacent are preferred.]
Atomic feature functions
• Atomic functions for x
  – letter type: x_ctype
  – letter position: x_position
  – lower-cased letter: x_char
  – lower-cased word: x_word
  – part-of-speech code: x_pos
• Atomic functions for y
  – letter type: y_ctype
  – letter position: y_position
  – lower-cased letter: y_char
• Atomic function for a
  – a_state: SKIP, MATCH, ABBR
• Atomic functions for adjacent x
  – distance in letters: x_diff
  – distance in words: x_diff_wd
• Atomic function for adjacent y
  – distance in letters: y_diff
◦ Offset parameter δ
◦ Features are expressed by combinations of atomic functions
◦ Refer to Table 2 for the complete list of combination rules
Step 5: Computing features
Unigram features fired at one alignment position (weights sum to 9.1):
  x_ctype0=L/MATCH/ 0.1
  x_position0=H/MATCH/ 1.9
  x_ctype0=L;x_position0=H/MATCH/ 1.2
  x_ctype1=L/MATCH/ 0.3
  x_position1=I/MATCH/ 0.2
  y_ctype0=U/MATCH/ 0.7
  y_position0=H/MATCH/ 0.9
  y_ctype0=U;y_position0=H/MATCH/ 1.6
  y_char0=t/MATCH/ -0.1
  x_ctype0=L/y_ctype0=U/MATCH/ 0.5
  x_position0=H/y_position0=H/MATCH/ 0.4
  x_ctype0=L/y_position0=H/MATCH/ 0.3
  x_position0=H/y_ctype0=U/MATCH/ 1.1
Bigram features fired at one transition (weights sum to 5.5):
  MATCH/MATCH/ 1.4
  x_diff_wd=1/MATCH/MATCH/ 3.3
  y_diff=1/MATCH/MATCH/ 0.8
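To make the feature naming concrete, here is a sketch of a few atomic functions and how they combine into the unigram feature strings above (the letter-type and position codes beyond those visible on the slides, and the particular subset of combinations, are our guesses; Table 2 of the paper lists the real combination rules):

```python
def ctype(c: str) -> str:
    """Letter type: U(pper), L(ower), D(igit, assumed), otherwise S(ymbol/space)."""
    if c.isupper(): return "U"
    if c.islower(): return "L"
    if c.isdigit(): return "D"
    return "S"

def position(token: str, i: int) -> str:
    """Letter position within its token: H = head, I = otherwise (simplified)."""
    return "H" if i == 0 else "I"

def unigram_features(x_word: str, x_i: int, y: str, y_i: int, state: str) -> list[str]:
    """Combine atomic functions of an aligned letter pair into feature strings."""
    ax = {"x_ctype0": ctype(x_word[x_i]), "x_position0": position(x_word, x_i)}
    ay = {"y_ctype0": ctype(y[y_i]), "y_position0": position(y, y_i),
          "y_char0": y[y_i].lower()}
    feats = [f"{k}={v}/{state}/" for k, v in {**ax, **ay}.items()]
    feats.append(f"x_ctype0={ax['x_ctype0']};x_position0={ax['x_position0']}/{state}/")
    feats += [f"{kx}={vx}/{ky}={vy}/{state}/"
              for kx, vx in ax.items()
              for ky, vy in ay.items() if ky != "y_char0"]
    return feats

# The letter 't' of 'thyroid' aligned to the first 'T' of 'TTF-1', state MATCH:
print(unigram_features("thyroid", 0, "TTF-1", 0, "MATCH"))
# includes 'x_ctype0=L/MATCH/', 'y_position0=H/MATCH/', 'x_ctype0=L/y_ctype0=U/MATCH/', ...
```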
Step 6: Probabilities of alignments
[Figure: three candidate alignments, with the weights of the features fired at each position and transition]
The sum of the feature weights on each alignment: 95.5, 83.2, 37.5
Take exponents of these values and normalize them as probabilities: 0.99, 4.55e-6, 6.47e-26
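In other words, the alignment probabilities are a softmax over the summed feature weights; a quick check with the slide's numbers:

```python
import math

scores = [95.5, 83.2, 37.5]            # sums of feature weights from the slide
m = max(scores)                        # subtract the max for numerical stability
z = sum(math.exp(s - m) for s in scores)
print([f"{math.exp(s - m) / z:.3g}" for s in scores])   # ['1', '4.55e-06', '6.47e-26']
```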
Maximum entropy modeling
Conditional probability modeled by MaxEnt
    P(a \mid x, y) = \frac{1}{Z(x, y)} \exp\bigl(\Lambda \cdot F(a, x, y)\bigr),
    \qquad
    Z(x, y) = \sum_{\bar{a} \in C(x, y)} \exp\bigl(\Lambda \cdot F(\bar{a}, x, y)\bigr)

F(a, x, y): vector of the numbers of occurrences of features on the alignment a
Λ: vector of feature weights; Λ · F(a, x, y) is the sum of feature weights on the alignment a
C(x, y): possible alignments

Parameter estimation (training)
Maximize the log-likelihood of the probability distribution by applying maximum a posteriori (MAP) estimation:

    \mathcal{L}_{L1} = \sum_{n=1}^{N} \log P(a^{(n)} \mid x^{(n)}, y^{(n)}) - \frac{\|\Lambda\|_1}{\sigma_1}
    \quad \text{(L1 regularization; solved by the OWL-QN method (Andrew, 07))}

    \mathcal{L}_{L2} = \sum_{n=1}^{N} \log P(a^{(n)} \mid x^{(n)}, y^{(n)}) - \frac{\|\Lambda\|_2^2}{2\sigma_2^2}
    \quad \text{(L2 regularization; solved by the L-BFGS method (Nocedal, 80))}
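A compact numeric sketch of this objective with the L2 (Gaussian) prior; the toy instances, plain gradient ascent, and all names are ours (the authors use OWL-QN / L-BFGS on the real feature space):

```python
import numpy as np

rng = np.random.default_rng(0)
num_features, sigma = 6, 3.0
# Each instance: the feature-count matrix F(a, x, y) of its candidate alignments
# (one row per candidate in C(x, y)) and the index of the annotated alignment.
instances = [
    (rng.integers(0, 3, size=(4, num_features)).astype(float), 0),
    (rng.integers(0, 3, size=(3, num_features)).astype(float), 1),
]

def objective_and_gradient(weights: np.ndarray):
    """L2-regularized log-likelihood and its gradient (to be maximized)."""
    ll, grad = 0.0, np.zeros_like(weights)
    for counts, gold in instances:
        scores = counts @ weights                      # Λ·F(a, x, y) per candidate
        scores -= scores.max()                         # numerical stability
        probs = np.exp(scores) / np.exp(scores).sum()  # P(a | x, y)
        ll += np.log(probs[gold])
        grad += counts[gold] - probs @ counts          # observed minus expected counts
    ll -= weights @ weights / (2 * sigma**2)           # Gaussian prior (L2 penalty)
    grad -= weights / sigma**2
    return ll, grad

weights = np.zeros(num_features)
for _ in range(200):                     # plain gradient ascent stands in for L-BFGS
    ll, grad = objective_and_gradient(weights)
    weights += 0.1 * grad
print(round(float(ll), 3), weights.round(2))
```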
Training corpus
• Corpus for training/evaluation
  – Selected 1,000 abstracts randomly from MEDLINE
  – Annotated 1,420 parenthetical expressions manually, and obtained 864 positive instances (aligned abbreviation definitions) and 556 negative instances (non-definitions)
Positive example (the letters of HCV are aligned to "hepatitis C virus"):
  Measurement of hepatitis C virus (HCV) RNA may be beneficial in managing the treatment of …
Negative example (no expanded form for RRMS):
  The mean reduction of UCCA at month 48 was 5.7% for patients initially on placebo who received treatment at 24 months (RRMS) or …
Experiments
• Experimental settings (parameters)
  – L1 regularization or L2 regularization (σ = 3)
  – No distortion (d = 0) or distortion (d = 1)
  – Average number of alignment candidates per instance: 8.46 (d = 0) and 69.1 (d = 1)
  – Total number of features generated (d = 0): 850,009
• Baseline systems
  – Schwartz & Hearst (03), SaRAD (Adar, 04), ALICE (Ao, 05)
  – Chang & Schutze (06), Nadeau & Turney (05)
• Test data
  – Our abbreviation alignment corpus (10-fold cross validation)
  – Medstract corpus (our method is trained on our corpus, and tested on this)
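The tables that follow report precision, recall, and F1 over extracted abbreviation definitions; for reference, the standard computation (the counts in the example are hypothetical, not from the paper):

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall and F1 over extracted definitions: an extraction counts
    as a true positive only if it matches the annotated definition."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# Hypothetical counts: 840 correct, 24 spurious, 28 missed definitions.
print(tuple(round(v, 3) for v in precision_recall_f1(840, 24, 28)))  # (0.972, 0.968, 0.97)
```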
Performance on our corpus
System                                              P      R      F1
Schwartz & Hearst (2003)   rule-based (simple)      .978   .940   .959
SaRAD (Adar, 04)           rule-based (complex)     .891   .919   .905
ALICE (Ao, 05)             rule-based (complex)     .961   .920   .940
Chang & Schutze (2006)     machine learning         .942   .900   .921
Nadeau & Turney (2005)     machine learning         .954   .871   .910
Proposed (d = 0; L1 regularization)                 .973   .969   .971
Proposed (d = 0; L2 regularization)                 .964   .968   .966
Proposed (d = 1; L1 regularization)                 .960   .981   .971
Proposed (d = 1; L2 regularization)                 .957   .976   .967
• The proposed method achieved the best F1 score
• The inclusion of distorted abbreviations (d = 1) gained the highest recall
• L1 regularization performed better than L2, probably because the number of features is far larger than that of instances
• Baseline systems with refined heuristics (SaRAD and ALICE) could not outperform the simplest system (S&H)
• Previous approaches with machine learning (C&S and N&T) were roughly comparable to the rule-based methods
Performance on Medstract corpus
System                                              P      R      F1
Schwartz & Hearst (2003)   rule-based (simple)      .942   .891   .916
SaRAD (Adar, 04)           rule-based (complex)     .909   .859   .884
ALICE (Ao, 05)             rule-based (complex)     .960   .945   .953
Chang & Schutze (2006)     machine learning         .858   .852   .855
Nadeau & Turney (2005)     machine learning         .889   .875   .882
Proposed (d = 1; L1 regularization)                 .976   .945   .960
• The proposed method was trained on our corpus, and applied to the Medstract corpus
  – It still outperformed the baseline systems
• ALICE delivered much better results than S&H
  – The rules in ALICE might be tuned for this corpus
Alignment examples (1)
 Shuffled abbreviations were successfully recognized
Alignment examples (2)
There are some confusing cases
  The proposed method failed to choose the third alignment
Top seven features with high weights
Rank  Feature                                        Weight
1     U: x_position0=H;y_ctype0=U;y_position0=H/M    1.7370
2     B: y_position0=I/y_position0=I/x_diff=1/M-M    1.3470
3     U: x_ctype-1=L;x_ctype0=L/S                    0.96342
4     B: x_ctype0=L/x_ctype0=L/x_diff_wd=0/M-M       0.94009
5     U: x_position0=I;x_char1='t'/S                 0.91645
6     U: x_position0=H;x_pos0=NN;y_ctype0=U/M        0.86786
7     U: x_ctype-1=S;x_ctype0=L/M                    0.86474
#1: “Associate a head letter in a definition with an uppercase head letter of the
abbreviation”
#2: “Produce two abbreviation letters from two consecutive letters in the definition”
#3: “Do not produce an abbreviation letter from a lowercase letter whose preceding
letter is also lowercase.”
#4: “Produce two abbreviation letters from two lowercase letters in the same word”
Conclusion
• Abbreviation recognition successfully formalized as a sequential alignment problem
  – Showed remarkable improvements over previous methods
  – Obtained fine-grained features that express the events wherein an expanded form produces an abbreviation letter
• Future work
  – To handle different patterns (e.g., "aka", "abbreviated as")
  – To combine with the statistical approach (Okazaki, 06)
    • Construct a comprehensive abbreviation dictionary based on the n-best solutions and statistics of occurrences
  – To train the alignment model from a non-aligned corpus, inducing abbreviation alignments simultaneously
More and more abbreviations produced
SaRAD (Adar, 04) extracted 6,574,953 abbreviation definitions
from the whole of the MEDLINE database released in 2006