A Discriminative Alignment Model for Abbreviation Recognition
Naoaki Okazaki (1), Sophia Ananiadou (2,3), Jun'ichi Tsujii (1,2,3)
1: The University of Tokyo   2: The University of Manchester   3: The National Centre for Text Mining

Abbreviation recognition (AR)
• The task of extracting abbreviations and their expanded forms as they appear in actual text:
  "The PCR-RFLP technique was applied to analyze the distribution of estrogen receptor (ESR) gene, follicle-stimulating hormone beta subunit (FSH beta) gene and prolactin receptor (PRLR) gene in three lines of Jinhua pigs." (PMID: 14986424)
• Each abbreviation (short form) paired with its expanded form (long form; full form) constitutes a definition, e.g., ESR = estrogen receptor.
• Expanded forms are subject to term variation.

AR for disambiguating abbreviations
AR provides three kinds of resources for abbreviation disambiguation:
• Sense inventories (abbreviation dictionaries)
  – What can 'CT' stand for? See Acromine: http://www.nactem.ac.uk/software/acromine/
• Training corpora for disambiguation (context information of expanded forms)
  – Sentences (contexts) in which CT is defined serve as training data for a classifier:
    "... evaluated using preoperative computed tomography (CT) scan, ..."
    "... by oral administration with the adjuvant cholera toxin (CT), ..."
  – The trained classifier then resolves occurrences without a local definition, e.g., "Biopsies from bone metastatic lesions were performed under CT scan, ..." → CT = computed tomography.
• Local definitions (one-sense-per-discourse assumption)
  – "Mice can be sensitized to food proteins by oral administration with the adjuvant cholera toxin (CT), ... BALB/c mice were fed with CT or PBS. The impact of CT on DC subsets ..."
AR thus plays a key role in managing abbreviations in text.
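To make the "training corpora for disambiguation" point above concrete, here is a minimal, illustrative Python sketch of such a context classifier; the bag-of-words features and the logistic regression classifier are our own assumptions for illustration, not part of this work.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Contexts in which 'CT' is locally defined, labeled with the expanded form.
train_contexts = [
    "evaluated using preoperative computed tomography (CT) scan",
    "by oral administration with the adjuvant cholera toxin (CT)",
]
train_senses = ["computed tomography", "cholera toxin"]

# Bag-of-words features over the defining sentences.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_contexts)
classifier = LogisticRegression().fit(X, train_senses)

# Disambiguate an occurrence of 'CT' that has no local definition.
test = ["Biopsies from bone metastatic lesions were performed under CT scan"]
print(classifier.predict(vectorizer.transform(test))[0])  # expected: computed tomography
```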
Outline of this presentation
• Introduction (done)
• Methodologies
  – Common steps: abbreviation candidates, regions for definitions
  – Previous work and unsolved problems
  – This study: abbreviation alignment, computing features, maximum entropy modeling
• Experiments
• Conclusion

Step 0: Sample text
  "We investigate the effect of thyroid transcription factor 1 (TTF-1) in human C cells ..."
The task: extract an abbreviation definition from this text, e.g., TTF-1 = thyroid transcription factor 1; extract nothing if the text contains no abbreviation definition.

Step 1: Abbreviation candidates in parentheses
  "We investigate the effect of thyroid transcription factor 1 (TTF-1) in human C cells ..."
• Parenthetical expressions serve as clues for abbreviations.
• Requirements for abbreviation candidates (Schwartz and Hearst, 03):
  – the inner expression consists of at most two words
  – its length is between two and ten characters
  – it contains at least one alphabetic letter
  – its first character is alphanumeric
• Abbreviation candidate: y = 'TTF-1'

Step 2: Region for extracting abbreviation definitions
• Heuristic for the region in which expanded forms are sought (Schwartz and Hearst, 03): take min(m + 5, 2m) words before the abbreviation, where m is the number of alphanumeric letters in the abbreviation.
• For TTF-1 (m = 4), take the 8 words before the parentheses.
• The remaining task is to extract a true expanded form (if any) from this region.

Previous studies: Finding expanded forms (rule-based)
• Deterministic algorithm (Schwartz & Hearst, 03)
  – Maps every alphanumeric letter of the abbreviation onto the expanded form, scanning both the abbreviation and the expanded form from their ends, right to left.
  – "We investigate the effect of thyroid transcription factor 1 (TTF-1) in human C cells ..." → TTF-1: transcription factor 1
• Four scoring rules for choosing among multiple candidates (Adar, 04)
  – +1 for every abbreviation letter matched at the head of a word
  – -1 for every extra word between the definition and the parentheses
  – +1 for definitions immediately followed by the parentheses
  – -1 for every extra word
  – e.g., "transcription factor 1 (TTF-1)" scores +4, while "thyroid transcription factor 1 (TTF-1)" scores +5.
• Detailed rules (Ao & Takagi, 05): the rules run to two pages (!) in their paper.
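As a rough illustration of Steps 1-2 and the Schwartz & Hearst matching rule, the sketch below reimplements them in Python; the function names and details are ours, and the original algorithm is the one described in their 2003 paper.

```python
def is_valid_short_form(sf):
    """Candidate requirements of Schwartz & Hearst (2003): at most two words,
    2-10 characters, at least one alphabetic letter, alphanumeric first character."""
    return (len(sf.split()) <= 2
            and 2 <= len(sf) <= 10
            and any(c.isalpha() for c in sf)
            and sf[0].isalnum())

def search_window(words_before_paren, sf):
    """Take min(m + 5, 2m) words before the parentheses, where m is the
    number of alphanumeric letters in the short form."""
    m = sum(c.isalnum() for c in sf)
    return words_before_paren[-min(m + 5, 2 * m):]

def find_best_long_form(sf, lf):
    """Right-to-left matching: map every alphanumeric letter of the short form
    onto the long form; the first letter must start a word of the long form."""
    s, l = len(sf) - 1, len(lf) - 1
    while s >= 0:
        c = sf[s].lower()
        if not c.isalnum():
            s -= 1
            continue
        # scan leftward for a letter matching c (word-start check for the first letter)
        while l >= 0 and (lf[l].lower() != c
                          or (s == 0 and l > 0 and lf[l - 1].isalnum())):
            l -= 1
        if l < 0:
            return None        # no expanded form found in this region
        s -= 1
        l -= 1
    return lf[lf.rfind(' ', 0, l + 1) + 1:]

words = "We investigate the effect of thyroid transcription factor 1".split()
sf = "TTF-1"
if is_valid_short_form(sf):
    window = " ".join(search_window(words, sf))
    print(find_best_long_form(sf, window))   # -> transcription factor 1
```

On the slide's example this right-to-left rule returns "transcription factor 1", missing "thyroid", which is exactly the kind of error the scoring rules and the alignment model discussed next try to address.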
Previous studies: Finding expanded forms (machine-learning based)
• Machine-learning approaches (Nadeau & Turney, 05; Chang & Schutze, 06)
  – Aim to obtain an optimal set of rules through training.
  – Use 10-20 features that roughly correspond to the rules proposed in earlier work, e.g., "# of abbreviation letters matching the first letter of a word" and "# of abbreviation letters that are capitalized in the definition".

Problems in previous studies
• Difficult to tweak the extraction rules by hand:
  "of blood lactate accumulation (OBLA)" vs. "onset of blood lactate accumulation (OBLA)"
• Difficult to handle non-definitions (negatives):
  "of postoperative AF in patients submitted to CABG without cardiopulmonary bypass (off-pump)"
• Difficult to recognize shuffled abbreviations:
  "receptor of estrogen (ER)"
• No breakthrough was reported by applying machine learning: previous studies used only a few features that merely reproduce the rule-based methods.

This study
• Predict the origins of abbreviation letters (alignment):
    â = argmax_{a ∈ C(x, y)} P(a | x, y)
  where a is an alignment, x the surrounding expression and y the abbreviation candidate given by steps 1 and 2, and C(x, y) the set of possible alignments.
• Maximum entropy modeling
  – Discriminative training of the abbreviation alignment model.
  – A large number of features that directly express the events in which letters of an expanded form produce, or do not produce, abbreviation letters.
  – A corpus annotated with abbreviation alignments.

Step 3: C(x, y): Alignment candidates
• Mark the positions of letters in x that also appear in the abbreviation.
• Associate every alphanumeric letter of the abbreviation y with exactly one letter in x.
• Always include the empty (negative) alignment, in case no definition is appropriate.
• Letters (or letter sequences) of the abbreviation may be reordered at most d times, e.g., distortion d = 1 allows swapping 'thyroid' and 'transcription'.

Step 4: Abbreviation alignment
(figure: the chosen alignment between "thyroid transcription factor 1" and TTF-1)
• The letters 't' and 'f' at the unmarked positions did not produce abbreviation letters.
• Prefer non-null mappings that are the most adjacent.

Atomic feature functions
• Atomic functions for x: letter type (x_ctype), letter position (x_position), lower-cased letter (x_char), lower-cased word (x_word), part-of-speech code (x_pos)
• Atomic functions for y: letter type (y_ctype), letter position (y_position), lower-cased letter (y_char)
• Atomic function for a: a_state (SKIP, MATCH, ABBR)
• Atomic functions for adjacent x: distance in letters (x_diff), distance in words (x_diff_wd)
• Atomic function for adjacent y: distance in letters (y_diff)
• Features are expressed by combinations of atomic functions with an offset parameter δ; refer to Table 2 for the complete list of combination rules.
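The complete combination rules are listed in Table 2 of the paper; the following Python sketch only illustrates, with a hypothetical data layout of our own, how atomic functions might be combined into unigram and bigram feature strings along an alignment.

```python
# Hypothetical layout: each aligned position carries the atomic observations
# for the x side and the y side plus an alignment state (SKIP, MATCH, ABBR).
# This is our own simplification for illustration, not the paper's code.

def atomic_x(xi, offset=0):
    """A few atomic functions on the x side at a relative offset."""
    return [f"x_ctype{offset}={xi['ctype']}",
            f"x_position{offset}={xi['position']}",
            f"x_char{offset}={xi['char']}"]

def atomic_y(yi, offset=0):
    """A few atomic functions on the y side at a relative offset."""
    return [f"y_ctype{offset}={yi['ctype']}",
            f"y_position{offset}={yi['position']}",
            f"y_char{offset}={yi['char']}"]

def unigram_features(xi, yi, state):
    """Combine atomic functions with the alignment state, e.g. 'y_ctype0=U/MATCH'."""
    feats = [f"{a}/{state}" for a in atomic_x(xi) + atomic_y(yi)]
    feats += [f"{ax}/{ay}/{state}"               # x-side and y-side atoms combined
              for ax in atomic_x(xi) for ay in atomic_y(yi)]
    return feats

def bigram_features(prev_state, state, x_diff_wd, y_diff):
    """State-transition features, e.g. 'x_diff_wd=1/MATCH/MATCH'."""
    return [f"{prev_state}/{state}",
            f"x_diff_wd={x_diff_wd}/{prev_state}/{state}",
            f"y_diff={y_diff}/{prev_state}/{state}"]

# The 'T' of TTF-1 aligned to the 't' heading 'thyroid':
xi = {"ctype": "L", "position": "H", "char": "t"}
yi = {"ctype": "U", "position": "H", "char": "t"}
print(unigram_features(xi, yi, "MATCH")[:4])
print(bigram_features("MATCH", "MATCH", x_diff_wd=1, y_diff=1))
```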
Step 5: Computing features
(figure: feature strings and their weights attached to an alignment)
• Unigram features are attached to each aligned position, e.g.:
    y_ctype0=U/MATCH/ 0.7
    y_position0=H/MATCH/ 0.9
    y_ctype0=U;y_position0=H/MATCH/ 1.6
    y_char0=t/MATCH/ -0.1
    x_ctype0=L/MATCH/ 0.1
    x_position0=H/MATCH/ 1.9
    x_ctype0=L;x_position0=H/MATCH/ 1.2
    x_ctype1=L/MATCH/ 0.3
    x_position1=I/MATCH/ 0.2
    x_ctype0=L/y_ctype0=U/MATCH/ 0.5
    x_position0=H/y_position0=H/MATCH/ 0.4
    x_ctype0=L/y_position0=H/MATCH/ 0.3
    x_position0=H/y_ctype0=U/MATCH/ 1.1
    ... (weights at this position sum to 9.1)
• Bigram features are attached to each state transition, e.g.:
    MATCH/MATCH/ 1.4
    x_diff_wd=1/MATCH/MATCH/ 3.3
    y_diff=1/MATCH/MATCH/ 0.8
    ... (weights at this transition sum to 5.5)

Step 6: Probabilities of alignments
(figure: three candidate alignments with per-position feature weights)
• Sum the feature weights on each candidate alignment (e.g., 95.5, 83.2, 37.5).
• Take the exponents of these sums and normalize them into probabilities (e.g., 0.99, 4.55e-6, 6.47e-26).

Maximum entropy modeling
• Conditional probability modeled by MaxEnt:
    P(a | x, y) = (1 / Z(x, y)) exp(Λ · F(a, x, y)),   Z(x, y) = Σ_{a ∈ C(x, y)} exp(Λ · F(a, x, y))
  where F(a, x, y) is the vector of occurrence counts of features on the alignment a, Λ is the vector of feature weights, Λ · F(a, x, y) is the sum of feature weights on the alignment a, and C(x, y) is the set of possible alignments.
• Parameter estimation (training): maximize the log-likelihood of the training data with maximum a posteriori (MAP) estimation:
    L_L1 = Σ_{n=1..N} log P(a(n) | x(n), y(n)) - ||Λ||_1 / σ         (L1 regularization; solved by the OWL-QN method (Andrew, 07))
    L_L2 = Σ_{n=1..N} log P(a(n) | x(n), y(n)) - ||Λ||_2^2 / (2σ^2)   (L2 regularization; solved by the L-BFGS method (Nocedal, 80))
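A minimal numerical sketch of Step 6 in Python; the three score sums below are the ones shown on the slide, while the feature lists and weights are placeholders of our own.

```python
import math
from collections import Counter

def alignment_probabilities(candidates, weights):
    """P(a | x, y) = exp(Λ·F(a, x, y)) / Z(x, y): sum the weights of the
    features fired on each candidate alignment, exponentiate, and normalize
    over all candidates in C(x, y)."""
    scores = []
    for feats in candidates:                     # feats: feature strings on alignment a
        counts = Counter(feats)                  # F(a, x, y)
        scores.append(sum(weights.get(f, 0.0) * n for f, n in counts.items()))
    z = sum(math.exp(s) for s in scores)         # partition function Z(x, y)
    return [math.exp(s) / z for s in scores]

# Three candidates whose summed weights are 95.5, 83.2 and 37.5 reproduce
# (approximately) the probabilities on the slide: 0.99, 4.55e-6, 6.47e-26.
print(alignment_probabilities([["f1"], ["f2"], ["f3"]],
                              {"f1": 95.5, "f2": 83.2, "f3": 37.5}))
```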
Training corpus
• Corpus for training/evaluation
  – Selected 1,000 abstracts randomly from MEDLINE.
  – Annotated 1,420 parenthetical expressions manually, obtaining 864 positive instances (aligned abbreviation definitions) and 556 negative instances (non-definitions).
• Positive example (aligned): "Measurement of hepatitis C virus (HCV) RNA may be beneficial in managing the treatment of ..."
• Negative example (no expanded form): "The mean reduction of UCCA at month 48 was 5.7% for patients initially on placebo who received treatment at 24 months (RRMS) or ..."

Experiments
• Experimental settings (parameters)
  – L1 regularization or L2 regularization (σ = 3)
  – No distortion (d = 0) or distortion (d = 1)
  – Average number of alignment candidates per instance: 8.46 (d = 0) and 69.1 (d = 1)
  – Total number of features generated (d = 0): 850,009
• Baseline systems
  – Schwartz & Hearst (03), SaRAD (Adar, 04), ALICE (Ao, 05), Chang & Schutze (06), Nadeau & Turney (05)
• Test data
  – Our abbreviation alignment corpus (10-fold cross-validation)
  – The Medstract corpus (our method is trained on our corpus and tested on Medstract)

Performance on our corpus
  System                                 P      R      F1
  Rule-based (simple)
    Schwartz & Hearst (2003)             .978   .940   .959
  Rule-based (complex)
    SaRAD (Adar, 04)                     .891   .919   .905
    ALICE (Ao, 05)                       .961   .920   .940
  Machine learning
    Chang & Schutze (2006)               .942   .900   .921
    Nadeau & Turney (2005)               .954   .871   .910
  Proposed (no distortion)
    d = 0; L1 regularization             .973   .969   .971
    d = 0; L2 regularization             .964   .968   .966
  Proposed (distortion)
    d = 1; L1 regularization             .960   .981   .971
    d = 1; L2 regularization             .957   .976   .967
• The proposed method achieved the best F1 score of all systems.
• Baseline systems with refined heuristics (SaRAD and ALICE) could not outperform the simplest system (S&H), and previous machine-learning approaches (C&S and N&T) were roughly comparable to the rule-based methods.
• L1 regularization performed better than L2, probably because the number of features is far larger than the number of instances.
• The inclusion of distorted abbreviations (d = 1) gained the highest recall and F1.

Performance on Medstract corpus
  System                                 P      R      F1
  Rule-based (simple)
    Schwartz & Hearst (2003)             .942   .891   .916
  Rule-based (complex)
    SaRAD (Adar, 04)                     .909   .859   .884
    ALICE (Ao, 05)                       .960   .945   .953
  Machine learning
    Chang & Schutze (2006)               .858   .852   .855
    Nadeau & Turney (2005)               .889   .875   .882
  Proposed
    d = 1; L1 regularization             .976   .945   .960
• The proposed method was trained on our corpus and applied to the Medstract corpus; it still outperformed the baseline systems.
• ALICE delivered much better results than S&H; its rules might be tuned for this corpus.
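For reference, the precision, recall, and F1 figures above can be computed from extracted definition pairs as in this generic Python sketch; it is not the authors' evaluation script.

```python
def precision_recall_f1(gold, predicted):
    """Scores over sets of (abbreviation, expanded form) pairs."""
    tp = len(gold & predicted)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = {("TTF-1", "thyroid transcription factor 1"), ("HCV", "hepatitis C virus")}
pred = {("TTF-1", "transcription factor 1"), ("HCV", "hepatitis C virus")}
print(precision_recall_f1(gold, pred))   # -> (0.5, 0.5, 0.5)
```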
Alignment examples (1)
(figure: alignments produced by the proposed method)
• Shuffled abbreviations were successfully recognized.

Alignment examples (2)
(figure: confusing cases)
• There are some confusing cases; the proposed method failed to choose the third alignment.

Top seven features with high weights
  Rank  Feature                                        Weight
  1     U: x_position0=H;y_ctype0=U;y_position0=H/M    1.7370
  2     B: y_position0=I/y_position0=I/x_diff=1/M-M    1.3470
  3     U: x_ctype-1=L;x_ctype0=L/S                    0.96342
  4     B: x_ctype0=L/x_ctype0=L/x_diff_wd=0/M-M       0.94009
  5     U: x_position0=I;x_char1='t'/S                 0.91645
  6     U: x_position0=H;x_pos0=NN;y_ctype0=U/M        0.86786
  7     U: x_ctype-1=S;x_ctype0=L/M                    0.86474
#1: "Associate a head letter in a definition with an uppercase head letter of the abbreviation."
#2: "Produce two abbreviation letters from two consecutive letters in the definition."
#3: "Do not produce an abbreviation letter from a lowercase letter whose preceding letter is also lowercase."
#4: "Produce two abbreviation letters from two lowercase letters in the same word."

Conclusion
• Abbreviation recognition was successfully formalized as a sequential alignment problem.
  – Showed remarkable improvements over previous methods.
  – Obtained fine-grained features that express the events wherein an expanded form produces an abbreviation letter.
• Future work
  – Handle different patterns (e.g., "aka", "abbreviated as").
  – Combine with the statistical approach (Okazaki, 06): construct a comprehensive abbreviation dictionary based on the n-best solutions and statistics of occurrences.
  – Train the alignment model from a non-aligned corpus, inducing abbreviation alignments simultaneously.

More and more abbreviations are produced
• SaRAD (Adar, 04) extracted 6,574,953 abbreviation definitions from the whole of the MEDLINE database released in 2006.