Graph-based Bilingual Phrase
Sense Disambiguation for
Statistical Machine Translation
Mamoru Komachi
[email protected]
2008-06-04
Background

Success of supervised ML methods depends on annotated corpora
  Hard to maintain
Weakly supervised methods require only a small amount of tagged data
  Can reduce the amount of human effort
Problems remain
  WSD is crucial to weakly supervised methods
    Cf. semantic drift
  Parallel corpora (and dictionaries) may help disambiguate word senses
WSD models in SMT systems are gaining much attention (Carpuat and Wu, 2007)
WSD and SMT

Improving Statistical Machine Translation using Word Sense Disambiguation (Carpuat and Wu, EMNLP 2007)
  SMT is known to suffer from inaccurate lexical choice
  Conventional WSD (based on a Senseval-style sense inventory) has limitations:
    Domain adaptation problem
    Input is typically a single word
    Limited contextual features
Phrase-based WSD models for SMT

Sense annotations are derived from the phrase alignment learned during SMT training
  WSD senses come from the SMT phrasal translation lexicon, the "phrase table"
  Not only words but also phrases are disambiguated
Supervised WSD (an ensemble of naïve Bayes, maximum entropy, boosting, and Kernel PCA classifiers)
Phrase table is highly ambiguous

Phrase table constructed from the NTCIR-7 J-E parallel corpus
  0.5~1 GB (gzip-compressed)
  2.53 candidates per phrase on average (3.24 candidates per phrase for phrases shorter than 5 words)
  Includes function words as well
Plant ||| 工場 (factory)
Plant ||| 植物 (botanical plant)
Plant ||| 設備 (equipment, facilities)
Plant ||| 発電 プラント (power plant)
Plant ||| 工場 内 (inside a factory)
Plant ||| 制御 対象 (controlled system)
Plant ||| 動植物 (animals and plants)
Plant ||| 供給 プラント (supply plant)
Plant has 120 translations in the phrase table! (a small counting sketch follows below)
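As a concrete illustration of these ambiguity figures, here is a minimal sketch of counting translation candidates per source phrase from a gzip-compressed, Moses-style phrase table ("source ||| target ||| ..."); the file name is hypothetical, and only the first two fields are assumed.

```python
import gzip
from collections import defaultdict

def count_translation_candidates(phrase_table_path):
    """Count distinct target candidates per source phrase in a
    Moses-style phrase table ("source ||| target ||| ...")."""
    candidates = defaultdict(set)
    with gzip.open(phrase_table_path, "rt", encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split(" ||| ")
            if len(fields) < 2:
                continue                      # skip malformed lines
            source, target = fields[0], fields[1]
            candidates[source].add(target)
    return candidates

# Example usage (hypothetical file name):
# cands = count_translation_candidates("phrase-table.gz")
# print(len(cands["plant"]))                               # number of translations of "plant"
# print(sum(len(v) for v in cands.values()) / len(cands))  # average candidates per phrase
```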
Motivation

Propose a novel graph-based approach to phrase sense disambiguation
  Can exploit bilingual contextual patterns
Evaluate phrase sense disambiguation within an SMT framework
Monolingual bootstrapping

Pioneered by Yarowsky (1995)
  Learn decision lists from a small set of seed instances (input: an instance; output: a classifier)
  'One sense per discourse' constraint
Bootstrapping


Iteratively conduct pattern induction and instance extraction, starting from seed instances
Can grow a small set of seed instances into a much larger set
[Example: starting from seed instances such as "vaio", "Toshiba satellite", and "HP xb3000", queries in a query log (corpus) like "Compare vaio laptop", "Compare toshiba satellite laptop", and "Compare HP xb3000 laptop" induce the contextual pattern "Compare # laptop" (#: slot); a small sketch of this step follows.]
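A toy sketch of the pattern-induction step illustrated above: every occurrence of a known instance in a query is replaced with the slot marker "#". The matching here is a simple lowercase substring search, an assumption made for brevity.

```python
def induce_patterns(queries, instances):
    """Replace any known instance occurring in a query with the slot
    marker '#' to obtain a contextual pattern, e.g.
    "Compare vaio laptop" -> "compare # laptop"."""
    patterns = set()
    for query in queries:
        q = query.lower()
        for inst in instances:
            if inst.lower() in q:
                patterns.add(q.replace(inst.lower(), "#"))
    return patterns

queries = ["Compare vaio laptop",
           "Compare toshiba satellite laptop",
           "Compare HP xb3000 laptop"]
instances = ["vaio", "Toshiba satellite", "HP xb3000"]
print(induce_patterns(queries, instances))   # {'compare # laptop'}
```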
Bilingual bootstrapping

Word Translation Disambiguation Using
Bilingual Bootstrapping (Li and Li, ACL-2002)
[Figure: the bilingual bootstrapping loop alternates between English words (Mill, Plant, Vegetable, ...) and their Japanese translations (工場 "factory", 植物 "plant", ...), training WSD classifiers on an English corpus and a Japanese corpus.]
Formalization of bootstrapping


Score vector of the seed instance: i_0 = (0, ..., 1, ..., 0)^T
Pattern-instance matrix P
  The (p, i) element of the matrix is the co-occurrence of pattern p and instance i
  [The slide shows a small binary pattern-instance matrix as an example]
Iterate
  p_n = P i_n
  i_{n+1} = P^T p_n
  Letting the instance similarity matrix be A = P^T P, applying this step recursively gives i_n = A^n i_0
Output ranked instances, in order of final score, when the stopping criterion is met (see the sketch below)
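To make the iteration concrete, here is a minimal numpy sketch of the update above (p_n = P i_n, i_{n+1} = P^T p_n, equivalently i_n = (P^T P)^n i_0). The toy matrix, the iteration count, and the added normalization step are illustrative assumptions, not taken from the slide.

```python
import numpy as np

def bootstrap(P, seed_index, n_iter=10):
    """Espresso-style bootstrapping on a pattern-instance matrix P
    (rows: patterns, columns: instances); equivalent to i_n = (P^T P)^n i_0."""
    n_patterns, n_instances = P.shape
    i = np.zeros(n_instances)
    i[seed_index] = 1.0                 # i_0 = (0, ..., 1, ..., 0)^T
    for _ in range(n_iter):
        p = P @ i                       # p_n = P i_n
        i = P.T @ p                     # i_{n+1} = P^T p_n
        i /= np.linalg.norm(i)          # normalization added to keep scores bounded
    return i                            # final instance scores, to be ranked

# Toy pattern-instance matrix (4 patterns x 3 instances)
P = np.array([[1., 0., 0.],
              [0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 1.]])
scores = bootstrap(P, seed_index=0)
print(np.argsort(-scores))              # instances ranked by final score
```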
Bilingual phrase sense
disambiguation

The (p, i) element of a pattern-instance matrix P is a co-occurrence between pattern p and instance i
  p: contextual features from both language sides, using the phrase alignment from GIZA++
  i: candidate (monolingual) phrase to disambiguate (see the matrix-construction sketch below)
A = P^T P
Similarity is given by the regularized Laplacian kernel
Predict the final sense by k-NN given the target instance
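A rough sketch of how such a pattern-instance matrix could be assembled, assuming word-aligned sentence pairs (e.g. from GIZA++) have already been reduced to (candidate phrase, English context words, Japanese context words) tuples; the feature scheme here is a simplified bag-of-words stand-in for the slide's bilingual contextual features.

```python
from collections import defaultdict
import numpy as np

def build_pattern_instance_matrix(examples):
    """examples: list of (candidate_phrase, en_context_words, ja_context_words).
    Instances are candidate phrases; patterns are context words from either
    language side, prefixed with 'en:' / 'ja:'."""
    inst_ids, pat_ids = {}, {}
    counts = defaultdict(float)
    for phrase, en_ctx, ja_ctx in examples:
        i = inst_ids.setdefault(phrase, len(inst_ids))
        for feat in [f"en:{w}" for w in en_ctx] + [f"ja:{w}" for w in ja_ctx]:
            p = pat_ids.setdefault(feat, len(pat_ids))
            counts[(p, i)] += 1.0        # (p, i) element: co-occurrence count
    P = np.zeros((len(pat_ids), len(inst_ids)))
    for (p, i), c in counts.items():
        P[p, i] = c
    return P, inst_ids, pat_ids

# Toy example: two translation candidates of "plant" with bilingual contexts
examples = [("工場", ["plant", "manufacturing"], ["製造"]),
            ("植物", ["plant", "species"], ["種"])]
P, inst_ids, pat_ids = build_pattern_instance_matrix(examples)
A = P.T @ P                              # instance similarity matrix A = P^T P
```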

Regularized Laplacian kernel

Laplacian L of graph G: L = D - A
  A: adjacency matrix
  D: degree diagonal matrix, whose i-th diagonal element is D(i, i) = Σ_j A(i, j)
  β: diffusion coefficient
Regularized Laplacian matrix R_β
  R_β = Σ_{n=0}^∞ β^n (-L)^n = (I + βL)^{-1}
Obtained from the von Neumann kernel matrix
  K_β = A Σ_{n=0}^∞ β^n A^n
by substituting -L for A and dropping the leading A on the right-hand side
(a numpy sketch of this kernel follows)
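A small numpy sketch of the kernel above, using the closed form R_β = (I + βL)^{-1} with L = D - A, followed by a simple k-NN vote over labeled instances. The adjacency matrix, β, k, and the sense labels are toy assumptions.

```python
import numpy as np

def regularized_laplacian(A, beta=0.01):
    """R_beta = sum_n beta^n (-L)^n = (I + beta L)^{-1}, with L = D - A."""
    D = np.diag(A.sum(axis=1))          # degree diagonal matrix
    L = D - A                           # graph Laplacian
    n = A.shape[0]
    return np.linalg.inv(np.eye(n) + beta * L)

def knn_sense(R, target, labels, k=3):
    """Predict the sense of `target` by majority vote of its k most
    similar labeled neighbours under the kernel similarity R."""
    sims = [(R[target, j], lab) for j, lab in labels.items() if j != target]
    top = sorted(sims, reverse=True)[:k]
    votes = {}
    for _, lab in top:
        votes[lab] = votes.get(lab, 0) + 1
    return max(votes, key=votes.get)

# A: toy instance similarity matrix (e.g. from A = P^T P)
A = np.array([[0., 2., 0.],
              [2., 0., 1.],
              [0., 1., 0.]])
R = regularized_laplacian(A, beta=0.1)
print(knn_sense(R, target=2, labels={0: "factory", 1: "factory"}, k=2))
```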
NTCIR-7 Patent Translation Task

Large-scale Japanese-English parallel corpus
  2M sentences (comparable in size to Arabic-English and Chinese-English MT corpora)
  Mainly technical documents
Timeline
  2008.01: dry run
  2008.05: formal run
  2008.12: final meeting
NAIST-NTT at NTCIR-7

Bilingual dictionary extracted (solely) from Wikipedia
  Used langlinks from the Wikipedia DB
  1:n translations are expanded into n bilingual phrase pairs (en, ja)
Extracted ~200,000 pairs
  12,000 pairs appear in the training corpus (8.8%)
  44.7% of word tokens are covered by the automatically constructed bilingual lexicon (GIZA++)
  Learned 1,193 (0.6%) novel translations
(a small extraction sketch follows)
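A minimal sketch of the langlinks-based extraction, assuming the langlinks table has already been joined with page titles and exported as a TSV of English title versus '|'-separated Japanese titles; the file layout and name are assumptions, not the actual NAIST-NTT pipeline.

```python
def load_wikipedia_dictionary(tsv_path):
    """Expand 1:n title translations into n (en, ja) phrase pairs.
    Assumed input: one line per English title, tab-separated from a
    '|'-joined list of linked Japanese titles."""
    pairs = []
    with open(tsv_path, encoding="utf-8") as f:
        for line in f:
            en, ja_field = line.rstrip("\n").split("\t")
            for ja in ja_field.split("|"):
                # Wikipedia titles use '_' for spaces; phrases should not
                pairs.append((en.replace("_", " "), ja.replace("_", " ")))
    return pairs

# pairs = load_wikipedia_dictionary("langlinks_en_ja.tsv")   # hypothetical file
# print(len(pairs))                                          # ~200,000 pairs in this setting
```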
Proposed method (training not yet finished...)

Extract translation pairs relevant to the given domain (the patent translation task)
Construct a pattern-instance matrix P
  Pattern features: bag-of-words features and link features extracted from Wikipedia ja-en abstracts
  Instance: a translation pair (en, ja)
  Seed instances: 40 translation pairs from the target domain
Apply the Laplacian kernel (a ranking sketch follows)
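Putting the pieces together, a sketch of how domain-relevant translation pairs could be ranked from seed pairs using the pattern-instance matrix and the regularized Laplacian kernel sketched earlier; β and the seed indices are placeholders.

```python
import numpy as np

def rank_by_seeds(P, seed_indices, beta=0.01):
    """Score instances (translation pairs) by kernel similarity to the
    seed instances and return their indices, most domain-relevant first."""
    A = P.T @ P                                         # instance similarity matrix
    D = np.diag(A.sum(axis=1))
    L = D - A
    R = np.linalg.inv(np.eye(A.shape[0]) + beta * L)    # regularized Laplacian kernel
    seed = np.zeros(A.shape[0])
    seed[list(seed_indices)] = 1.0                      # e.g. 40 seed translation pairs
    scores = R @ seed
    return np.argsort(-scores)

# ranking = rank_by_seeds(P, seed_indices=range(40))    # P from the matrix construction sketch
```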
Evaluation (BLEU score)
                                 -Wikipedia   +Wikipedia
Fmlrun-int.en.out                  28.24        27.28
Fmlrun-int.ja.out                  26.39        26.48
Fmlrun-int.ja.out.recased          25.34        25.47
Fmlrun-int.ja.out.detokenized      20.38        20.52

• E-J translation (en) gets worse with the Wikipedia dictionary
• J-E translation (ja) performs slightly better with the Wikipedia dictionary than without
Future work


Implement bilingual phrase sense disambiguation
Evaluate this method on the IWSLT 2006 J-E/E-J and NTCIR-7 J-E/E-J datasets
Future work (2)

Automatic extraction of a biomedical lexicon, starting from a life-science dictionary (mining from MEDLINE, etc.)
Summarization (Harendra's work)...