
Tree Automata for
Automatic Language Translation
Kevin Knight
Information Sciences Institute
University of Southern California
Outline
• History of the World (of Automata in NLP)
• Weighted string automata in NLP
  – Applications
    • transliteration
    • machine translation
    • language modeling
    • speech, lexical processing, tagging, summarization, optical character recognition, …
  – Generic algorithms and toolkits
• Weighted tree automata in NLP
  – Applications
  – Generic algorithms and toolkits
• Some connections with theory
History of the World
• consonant/vowel sequences in Pushkin novels [Markov 1913]
• noisy channel model, cryptography [Shannon 1948]
• context-free grammars [Chomsky 1956]
• transformational grammars [Chomsky 1957]
• tree transducers, to formalize transformational grammars [Rounds 1970; Thatcher 1970]
Transformational Grammar
A transformation maps one parse tree into another, e.g. the active sentence
  S( NP(DT the, N boy)  VP(V saw, NP(DT the, N door)) )          “the boy saw the door”
into its passive counterpart
  S( NP(DT the, N door)  VP(AUX was, V seen, PP(P by, NP(DT the, N boy))) )   “the door was seen by the boy”
History of the World
• consonant/vowel sequences in Pushkin novels [Markov 1913]
• noisy channel model, cryptography [Shannon 1948]
• context-free grammars [Chomsky 1956]
• transformational grammars [Chomsky 1957]
• tree transducers, to formalize transformational grammars [Rounds 1970; Thatcher 1970]
• tree automata survey article [Thatcher 1973]
History of the World
“The number one priority in the area [of tree automata theory] is a careful assessment of
the significant problems concerning natural language and programming language semantics
and translation. If such problems can be found and formulated, I am convinced that the
approach informally surveyed here can provide a unifying framework within which to study
them.”  [Thatcher 1973]
(Shown over the same timeline: Markov 1913, Shannon 1948, Chomsky 1956/1957,
Rounds 1970 & Thatcher 1970, Thatcher 1973.)
History of the World
(Diagram relating Tree Automata Theory, Linguistics, and Computers.)
History of the World
• THEORY: “Let’s prove theorems!”
• LINGUISTICS: “Let’s drop formalism until we understand things better!”
• NATURAL LANGUAGE PROCESSING: “Let’s build demo systems!”
Natural Language Processing
• 1970s–80s
  – models of English syntax, demonstration grammars
  – beyond CFG
    • augmented transition networks (ATN)
    • unification-based grammars (HPSG, LFG, ...)
      – mostly turned out to be formally equivalent to each other … and to Turing machines
    • tree-adjoining grammar (TAG), categorial grammar
      – mildly context-sensitive grammars
• Meanwhile, in speech recognition…
  – probabilistic finite-state grammars of English
  – built automatically from training data (corpus)
  – word n-grams
  – successful paradigm
Natural Language Processing
• 1993
– US agency DARPA presided over forced marriage of
speech and language research
• 1990s
– NLP dominated by probabilistic finite-state string
formalisms and automatic training
– Weighted FSA/FST toolkits
• 2000s
– Re-awakened interest in tree formalisms for modeling
syntax-sensitive operations
Back to the Outline
• History of the World (of Automata for NLP)
• Weighted string automata in NLP
  – Applications
    • transliteration
    • machine translation
    • language modeling
    • speech, lexical processing, tagging, summarization, optical character recognition, …
  – Generic algorithms and toolkits
• Weighted tree automata in NLP
  – Applications
  – Generic algorithms and toolkits
• Some connections with theory
Natural Language Transformations
• Machine Translation
• Name Transliteration
• Compression
• Question Answering
• Spelling Correction
• Speech Recognition
• Language Generation
• Text to Speech
Each maps an Input to an Output.
Finite-State (String) Transducer (FST)
Original input:  k n i g h t
FST rules:
  q k   → q2 *e*
  q i   → q  AY
  q g   → q3 *e*
  q2 n  → q  N
  q3 h  → q4 *e*
  q4 t  → qfinal T
(FST diagram: states q, q2, q3, q4, qfinal with arcs k : *e*, n : N, i : AY, g : *e*, h : *e*, t : T.)
Transformation: the input is consumed one symbol at a time, starting in state q.
(The following slides step through the transformation: k is read in state q and deleted (→ q2);
n in q2 outputs N (→ q); i in q outputs AY; g is deleted (→ q3); h is deleted (→ q4);
t in q4 outputs T (→ qfinal). Final output: N AY T.)
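
Below is a minimal sketch (not from the slides) of how the six rules above map “k n i g h t” to “N AY T”. The dictionary representation, function names, and the determinism assumption are illustrative only.

# Minimal sketch of applying the deterministic FST above to "knight".
RULES = {  # (state, input symbol) -> (next state, output symbol or None for *e*)
    ("q",  "k"): ("q2",     None),
    ("q",  "i"): ("q",      "AY"),
    ("q",  "g"): ("q3",     None),
    ("q2", "n"): ("q",      "N"),
    ("q3", "h"): ("q4",     None),
    ("q4", "t"): ("qfinal", "T"),
}

def transduce(symbols, start="q", final="qfinal"):
    state, output = start, []
    for sym in symbols:
        state, out = RULES[(state, sym)]   # deterministic: exactly one rule fires
        if out is not None:                # None stands for the empty output *e*
            output.append(out)
    assert state == final, "input not accepted"
    return output

print(transduce(list("knight")))           # ['N', 'AY', 'T']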
Transliteration
Angela Knight  →(transliteration)→  a n ji ra na i to
Frequently occurring translation problem across languages with different sound systems
and character sets (Japanese, Chinese, Arabic, Russian, English, …).
Can’t be solved by dictionary lookup.
Forward and Backward Transliteration
• forward transliteration (some variation allowed):
    Angela Knight  →  a n ji ra na i to
• backward transliteration (no variation allowed):
    a n ji ra na i to  →  Angela Knight
Practical Problem
(Diagram: a single transliteration WFST relating 7 input symbols to the 13 output symbols
of “Angela Knight”.)
Transliteration — noisy channel framework
• WFSA, P(e): generates/accepts well-formed English sequences (“Angela Knight”)
• WFST, P(k | e): makes transformations without worrying too much about context
Transliteration — noisy channel framework
• WFSA, P(e): generates/accepts well-formed English sequences
• WFST, P(k | e): makes transformations without worrying too much about context
DECODE:  argmax_e P(e | k)  =  argmax_e P(e) · P(k | e)
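
To make the decode rule concrete, here is a toy sketch of noisy-channel reranking over a tiny, invented candidate set; the probabilities and table structure are made up for illustration, not taken from any real model.

# Toy noisy-channel reranking: argmax_e P(e) * P(k | e). All numbers are invented.
p_e = {                      # language model P(e): well-formed English names
    "ANGELA KNIGHT": 1e-6,
    "ANGELA NIGHT":  2e-6,
}
def p_k_given_e(k, e):       # channel model P(k | e), here a stub lookup table
    table = {
        ("a n ji ra na i to", "ANGELA KNIGHT"): 1e-3,
        ("a n ji ra na i to", "ANGELA NIGHT"):  1e-5,
    }
    return table.get((k, e), 0.0)

def decode(k):
    return max(p_e, key=lambda e: p_e[e] * p_k_given_e(k, e))

print(decode("a n ji ra na i to"))   # ANGELA KNIGHT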
Transliteration — “generative story”
  WFSA A →  Angela Knight                  (+ millions more English names)
  WFST B →  AE N J EH L UH N AY T          (+ millions more, e.g. AE N J IH R UH N AY T,
                                            AH N J IH L UH N AY T OH)
  WFST C →  a n j i r a n a i t o          (+ millions more)
  WFST D →  …
DECODE: given the observed Japanese string, search back through the cascade
(millions of candidate names and phoneme sequences) for the best English name.
Machine Translation
美国关岛国际机场及其办公室均接获一
名自称沙地阿拉伯富商拉登等发出的电
子邮件,威胁将会向机场等公众地方发
动生化袭击後,关岛经保持高度戒备。
The U.S. island of Guam is maintaining a high
state of alert after the Guam airport and its offices
both received an e-mail from someone calling
himself the Saudi Arabian Osama bin Laden and
threatening a biological/chemical attack against
public places such as the airport.
Machine Translation
• direct model vs. noisy channel model
• “I see a Spanish sentence on the page. How did it get there?”
Machine Translation
Modeled as a cascade of weighted automata: WFSA A, WFSA B, WFSA C, WFSA D, WFSA E.
[Brown et al 93] [Knight & Al-Onaizan 98]
Other Applications of
Weighted String Automata in NLP
• speech recognition [Pereira, Riley, Sproat 94]
• lexical processing
– word segmentation [Sproat et al 96]
– morphological analysis/generation [Kaplan and Kay 94; Clark 02]
• tagging
– part of speech tagging [Church 88]
– name finding
• summarization [Zajic, Dorr, Schwartz 02]
• optical character recognition [Kolak, Byrne, Resnik 03]
• decipherment [Knight et al 06]
Algorithms for String Automata
• N-best: … paths through a WFSA (Viterbi, 1967; Eppstein, 1998)
• EM training: forward-backward EM (Baum & Welch, 1971; Eisner, 2001)
• Determinization: … of weighted string acceptors (Mohri, 1997)
• Intersection: WFSA intersection
• Application: string ∘ WFST → WFSA (applying a transducer to a string yields an acceptor)
• Transducer composition: WFST composition (Pereira & Riley, 1996)
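
As a small illustration of the “N-best paths” row (with k = 1), here is a toy best-path computation over an invented, acyclic WFSA; the automaton, state names, and relaxation-style search are a sketch, not any toolkit’s API.

# Toy best-path (Viterbi-style, k=1) over an invented acyclic WFSA.
ARCS = {   # state -> list of (next state, output symbol, probability)
    "s": [("a", "AE", 0.6), ("a", "AH", 0.4)],
    "a": [("f", "N", 1.0)],
}
FINAL = "f"

def best_path(start="s"):
    best = {start: (1.0, [])}          # state -> (best prob so far, symbols so far)
    agenda = [start]
    while agenda:
        state = agenda.pop()
        prob, path = best[state]
        for nxt, sym, p in ARCS.get(state, []):
            cand = (prob * p, path + [sym])
            if nxt not in best or cand[0] > best[nxt][0]:
                best[nxt] = cand
                agenda.append(nxt)
    return best[FINAL]

print(best_path())    # (0.6, ['AE', 'N'])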
String Automata Toolkits Used in NLP
• Unweighted
  – Xerox finite-state calculus
    • plus many children
• Weighted
  – AT&T FSM
  – plus many children
    • Google OpenFST, ISI Carmel, Aachen FSA, DFKI FSM toolkit, MIT FST toolkit, …
String Automata Toolkits Used in NLP

% echo 'a n ji ra ho re su te ru na i to' |
  carmel -rsi -k 5 -IEQ
    word.names.50000wds.transducer        (wfsa)
    word-epron.names.55000wds.transducer  (wfst)
    epron-jpron.1.transducer              (wfst)
    jpron.transducer                      (wfst)
    vowel-separator.transducer            (wfst)
    jpron-asciikana.transducer            (wfst)

ANGELA FORRESTAL KNIGHT   2.60e-20
ANGELA FORRESTER KNIGHT   6.00e-21
ANGELA FOREST EL KNIGHT   1.91e-21
ANGELA FORESTER KNIGHT    1.77e-21
ANGELA HOLLISTER KNIGHT   1.33e-21
The Beautiful World of Composable Transducers
(Diagram: a network of sequence types — English word sequence, English phoneme sequence,
long English word sequence, foreign word sequence, foreign phoneme sequence — connected by
weighted transducers such as P(e), P(p|e), P(e|p), P(r|e), P(r), P(f|r), P(f|e), P(e|f),
P(l|e). Chains through the network give cascades for translation, transliteration, speech,
and so on.)
Finite-State String Transducers
Nice properties → nice toolkits.
(Chart: translation accuracy of phrase substitution/transposition systems on the NIST
Common Evaluations, 2002–2006.)
Finite-State String Transducers
• Not expressive enough for many problems!
• For example, machine translation:
  – Arabic to English: move the verb from the beginning of the sentence to the middle
    (in between the subject and object)
  – Chinese to English: when translating noun-phrase “de” noun-phrase, flip the order of
    the noun-phrases and substitute “of” for “de”
Experimental Progress in Statistical Machine Translation
(Chart: translation accuracy on the NIST Common Evaluations, 2002–2006; phrase substitution
with no linguistic categories vs. tree transformation with linguistic categories.)
Syntax Started to Be Helpful in 2006
(Chart: Chinese/English translation accuracy, April 2005 – July 2007, rising past the
string-based baselines: sentences < 16 words (NIST-03/04) and all sentences (NIST-2003).)
String-Based Output
Input: 枪手 被 警方 击毙 .
Decoder hypotheses:
  #1       Gunman of police killed .
  #7       Gunman of police attack .
  #12      Gunman by police killed .
  #134     Killed gunman by police .
  #9,329   Gunman killed the police .
  #50,654  Gunman killed by police .    ← highest-scoring output, phrase-based model
Problematic:
• VBD “killed” needs a direct object
• VBN “killed” needs an auxiliary verb (“was”)
• countable “gunman” needs an article (“a”, “the”)
• “passive marker” in Chinese controls re-ordering
Can’t enforce/encourage any of this!
Tree-Based Output
Input: 枪手 被 警方 击毙 .
Decoder hypotheses (each comes with a full English parse over NPB, PP, NP-C, VP, S):
  #1      The gunman killed by police .        (DT NN VBD IN NN)
  #16     Gunman by police shot .              (NN IN NN VBD)
  #1923   The gunman was killed by police .    (DT NN AUX VBN IN NN)   ← highest-scoring output, syntax-based model
OK, so how does a Chinese string transform into an English tree, or vice-versa?
Back to the Outline
• History of the World (of Automata for NLP)
• Weighted string automata in NLP
  – Applications
    • transliteration
    • machine translation
    • language modeling
    • speech, lexical processing, tagging, summarization, optical character recognition, …
  – Generic algorithms and toolkits
• Weighted tree automata in NLP
  – Applications
  – Generic algorithms and toolkits
• Some connections with theory
Top-Down Tree Transducer
(W. Rounds 1970; J. Thatcher 1970)
Original input tree (English parse of “he enjoys listening to music”):
  S( NP(PRO he)  VP(VBZ enjoys  NP(SBAR(VBG listening)  VP(P to  NP music))) )
The transducer walks the tree top-down, rotating the S into Japanese order
(subject, wa, object, ga, verb) and translating the words.
Final output:
  kare , wa , ongaku , o , kiku , no , ga , daisuki , desu
Top-Down Tree Transducer
(W. Rounds 1970; J. Thatcher 1970)
The same transformation, now written with explicit states and rule probabilities:
  q S(x0:NP VP(x1:VBZ x2:NP))  →(0.2)  s x0 , wa , r x2 , ga , q x1
  s NP(PRO(he))                →(0.7)  kare
  …
Applying the first rule at the root rotates subject / object / verb into Japanese order;
applying the second rule translates “he” to “kare”; further rules handle the rest.
Final output:
  kare , wa , ongaku , o , kiku , no , ga , daisuki , desu
To get the total probability, multiply the probabilities of the individual steps.
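
A minimal sketch of the two weighted rules above, assuming trees are nested (label, children) tuples; the helper functions are illustrative and only cover this one derivation, not a general transducer engine.

# Trees are (label, [children]) tuples; leaves are plain strings.
tree = ("S", [("NP", [("PRO", ["he"])]),
              ("VP", [("VBZ", ["enjoys"]), ("NP", [])])])

def rule_qS(t):
    # q S(x0:NP VP(x1:VBZ x2:NP)) -> 0.2: s x0, "wa", r x2, "ga", q x1
    x0 = t[1][0]
    x1, x2 = t[1][1][1]
    return 0.2, [("s", x0), "wa", ("r", x2), "ga", ("q", x1)]

def rule_s_NP_he(t):
    # s NP(PRO(he)) -> 0.7: "kare"
    assert t == ("NP", [("PRO", ["he"])])
    return 0.7, ["kare"]

p1, items = rule_qS(tree)
p2, out = rule_s_NP_he(items[0][1])
print(out, round(p1 * p2, 2))   # ['kare'] 0.14 -- multiply step probabilities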
Top-Down Tree Transducer
• Introduced by Rounds (1970) & Thatcher (1970)
“Recent developments in the theory of automata have pointed
to an extension of the domain of definition of automata from
strings to trees … parts of mathematical linguistics can be
formalized easily in a tree-automaton setting … Our results
should clarify the nature of syntax-directed translations and
transformational grammars …”
(Rounds 1970, “Mappings on Grammars and Trees”, Math.
Systems Theory 4(3))
• Large theory literature
– e.g., Gécseg & Steinby (1984), Comon et al (1997)
• Once again re-connecting with NLP practice
– e.g., Knight & Graehl (2005), Galley et al (2004, 2006)
Tree Transducers Can be Extracted from Bilingual Data
(Galley, Hopkins, Knight, Marcu, 2004)
Aligned sentence pair:
  i felt obliged to do my part
  我 有 责任 尽 一份 力
English parse:
  S( NP-C(NPB(PRP i))
     VP(VBD felt
        VP-C(VBN obliged
             SG-C(VP(TO to
                     VP-C(VB do
                          NP-C(NPB(PRP$ my  NN part))))))) )
RULES ACQUIRED:
  VBD(felt)             → 有
  VBN(obliged)          → 责任
  VB(do)                → 尽
  NN(part)              → 一份
  NN(part)              → 一份 力
  VP-C(x0:VBN x1:SG-C)  → x0 x1
  VP(TO(to) x0:VP-C)    → x0
  S(x0:NP-C x1:VP)      → x0 x1
  …
Tree-to-string transducer, used (noisy-channel-wise) to do string-to-tree translation.
Additional extraction methods: (Galley et al, 2006), (Marcu et al, 2006).
Current systems learn ~500m rules.
Sample “said that” rules
LHS tree fragment: VP(VBD("said") SBAR-C(IN("that") x0:S-C))
  0.57   VP(VBD("said") SBAR-C(IN("that") x0:S-C)) -> 说 , x0
  0.09   VP(VBD("said") SBAR-C(IN("that") x0:S-C)) -> 说 x0
  0.02   VP(VBD("said") SBAR-C(IN("that") x0:S-C)) -> 他 说 , x0
  0.02   VP(VBD("said") SBAR-C(IN("that") x0:S-C)) -> 指出 , x0
  0.02   VP(VBD("said") SBAR-C(IN("that") x0:S-C)) -> x0
  0.01   VP(VBD("said") SBAR-C(IN("that") x0:S-C)) -> 表示 x0
  0.01   VP(VBD("said") SBAR-C(IN("that") x0:S-C)) -> 说 , x0 的
Sample Subject-Verb-Object Rules
LHS tree fragment: S(x0:NP-C VP(x1:VBD x2:NP-C) x3:.)
CHINESE / ENGLISH
  0.82   S(x0:NP-C VP(x1:VBD x2:NP-C) x3:.) -> x0 x1 x2 x3
  0.02   S(x0:NP-C VP(x1:VBD x2:NP-C) x3:.) -> x0 x1 , x2 x3
  0.01   S(x0:NP-C VP(x1:VBD x2:NP-C) x3:.) -> x0 , x1 x2 x3
ARABIC / ENGLISH
  0.54   S(x0:NP-C VP(x1:VBD x2:NP-C) x3:.) -> x0 x1 x2 x3
  0.44   S(x0:NP-C VP(x1:VBD x2:NP-C) x3:.) -> x1 x0 x2 x3
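
Rule tables like these are easy to hold as plain data. The sketch below stores the “said that” rules from the slide above and picks the most probable right-hand side; the variable names and data structure are illustrative only.

# One left-hand side, several weighted right-hand sides (numbers from the slide above).
SAID_THAT_LHS = 'VP(VBD("said") SBAR-C(IN("that") x0:S-C))'

SAID_THAT_RULES = [
    (0.57, "说 , x0"),
    (0.09, "说 x0"),
    (0.02, "他 说 , x0"),
    (0.02, "指出 , x0"),
    (0.02, "x0"),
    (0.01, "表示 x0"),
    (0.01, "说 , x0 的"),
]

def best_rhs(rules):
    # highest-probability right-hand side; tuples compare by probability first
    return max(rules)

print(best_rhs(SAID_THAT_RULES))   # (0.57, '说 , x0')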
Decoding
• argmax_etree P(etree | cstring)
• Difficult search problem
  – Bottom-up CKY parser
  – Builds English constituents on top of Chinese spans
  – Record of rule applications (the derivation) provides information to construct the English tree
  – Returns k-best trees
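
A toy sketch of the bottom-up matching idea behind this decoder: a rule fires over a span when its Chinese right-hand side covers that span, building an English constituent on top. The two-rule chart below is a drastic simplification of real CKY decoding; rules and span handling are invented for illustration.

# Toy bottom-up matching: lexical rules fill single-word spans, one combining
# rule builds a larger constituent. Real CKY iterates over all spans and splits.
input_words = ["这", "7人"]

LEXICAL_RULES = [
    ("DT(these)", ["这"]),
]
HIGHER_RULES = [
    # NP(x0:DT CD(7) NNS(people)) -> x0 7人
    ("NP(x0:DT CD(7) NNS(people))", ["x0", "7人"]),
]

chart = {}                                    # (start, end) -> English constituent
for i, w in enumerate(input_words):           # lexical pass
    for lhs, rhs in LEXICAL_RULES:
        if rhs == [w]:
            chart[(i, i + 1)] = lhs

for lhs, rhs in HIGHER_RULES:                 # one combining pass
    if rhs[0] == "x0" and (0, 1) in chart and rhs[1:] == input_words[1:]:
        chart[(0, len(input_words))] = lhs.replace("x0:DT", chart[(0, 1)])

print(chart[(0, 2)])   # NP(DT(these) CD(7) NNS(people))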
Syntax-Based Decoding
Rules apply when their right-hand sides (RHS) match some portion of the input.
Input: 这 7人 中包括 来自 法国 和 俄罗斯 的 宇航 员 .
  RULE 1:  DT(these)            → 这           “these”
  RULE 2:  VBP(include)         → 中包括        “include”
  RULE 4:  NNP(France)          → 法国          “France”
  RULE 5:  CC(and)              → 和            “and”
  RULE 6:  NNP(Russia)          → 俄罗斯        “Russia”
  RULE 8:  NP(NNS(astronauts))  → 宇航 , 员     “astronauts”
  RULE 9:  PUNC(.)              → .             “.”
Syntax-Based Decoding (continued)
Larger rules then fire, bottom-up, on top of the constituents already built:
  RULE 13:  NP(x0:NNP, x1:CC, x2:NNP)             → x0 , x1 , x2   “France and Russia”
  RULE 11:  VP(VBG(coming), PP(IN(from), x0:NP))  → 来自 , x0       “coming from France and Russia”
  RULE 16:  NP(x0:NP, x1:VP)                      → x1 , 的 , x0    “astronauts coming from France and Russia”
  RULE 14:  VP(x0:VBP, x1:NP)                     → x0 , x1         “include astronauts coming from France and Russia”
  RULE 10:  NP(x0:DT, CD(7), NNS(people))         → x0 , 7人        “these 7 people”
  RULE 15:  S(x0:NP, x1:VP, x2:PUNC)              → x0 , x1 , x2    “These 7 people include astronauts coming from France and Russia”
Derivation Tree
The record of which rule applied over which span (RULES 1–16 above) forms the derivation tree.
Derived English Tree
  S( NP(DT These  CD 7  NNS people)
     VP(VBP include
        NP( NP(NNS astronauts)
            VP(VBG coming  PP(IN from  NP(NNP France  CC and  NNP Russia)))))
     PUNC . )
“These 7 people include astronauts coming from France and Russia .”
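
A small sketch of how the derivation determines the derived English tree: each rule’s left-hand side is instantiated with the trees built by its children. The `subst` helper and string-based tree encoding are illustrative, not the system’s actual data structures.

def subst(lhs, children):
    # Replace x0, x1, ... in an LHS template with the children's trees.
    for i, c in enumerate(children):
        lhs = lhs.replace(f"x{i}", c)
    return lhs

# Fragment of the derivation from the slides (RULES 13, 11, 16):
france_russia = subst("NP(x0 x1 x2)", ["NNP(France)", "CC(and)", "NNP(Russia)"])
coming_from   = subst("VP(VBG(coming) PP(IN(from) x0))", [france_russia])
astronauts    = subst("NP(x0 x1)", ["NP(NNS(astronauts))", coming_from])

print(astronauts)
# NP(NP(NNS(astronauts)) VP(VBG(coming) PP(IN(from) NP(NNP(France) CC(and) NNP(Russia)))))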
Chinese/English Translation Examples
Chinese gloss:
six unit Iraq civilian today in Iraq south part
possessive protest in , police and UK troops
shot killed .
Machine Translation:
Police and British troops shot and killed six Iraqi
civilians in protests in southern Iraq today.
Chinese/English Translation Examples
Chinese:
印度 目前 共 有 74 种 控价 药 , 增加 后 的 控价
药品 将 占 印度 所售 药品 的 40% 以上 。
Machine Translation:
Currently, a total of 74 types of medicine
prices increased after the price of medicines
will account for more than 40 per cent of
medicines sold by India.
Arabic-English translation (error analysis)
• First, this is not a sentence. The VP below is not finite (e.g., “visited Iran”).
• Second, even if the S-C really were a sentence, the verb “discussed” doesn’t take an
  S argument. So this is a bogus VP.
• Third, even if the lower VP weren’t bogus, “confirms” only takes a certain type of VP,
  namely a gerund (“confirms discussing the idea”).
Tree Automata Operations for Machine Translation?
argmax_etree P(etree | cstring)

e = yield(best-tree(intersect(lm.rtg, b-apply(cstring, tm.tt))))

where lm.rtg is a weighted tree grammar that accepts/scores English trees, and
tm.tt is a weighted tree-to-string transducer that turns English trees into Chinese strings.
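
The formula reads as a pipeline. The toy sketch below writes it out as explicit function calls, using drastically simplified stand-ins: a plain list of weighted candidate trees plays the role of the forest, and lookup tables play the roles of the models. The trees, weights, and names are invented for illustration only.

def backward_apply(cstring, tm):
    # b-apply: the forest of (channel weight, English tree) pairs that transduce to cstring
    return tm[cstring]

def intersect(lm, forest):
    # multiply each tree's channel weight by its tree-grammar LM weight
    return [(w * lm.get(tree, 0.0), tree) for w, tree in forest]

def best_tree(forest):
    return max(forest, key=lambda pair: pair[0])[1]

def yield_of(tree):
    label, children = tree
    return [label] if not children else [w for c in children for w in yield_of(c)]

# Two hand-made candidate trees for one input (leaves only, for brevity):
BAD  = ("S", (("gunman", ()), ("killed", ()), ("the", ()), ("police", ())))
GOOD = ("S", (("the", ()), ("gunman", ()), ("was", ()), ("killed", ()), ("by", ()), ("police", ())))
TM   = {"枪手 被 警方 击毙 .": [(0.4, BAD), (0.3, GOOD)]}   # channel alone prefers BAD
LM   = {BAD: 0.01, GOOD: 0.20}                              # tree LM prefers GOOD

def decode(cstring):
    return yield_of(best_tree(intersect(LM, backward_apply(cstring, TM))))

print(" ".join(decode("枪手 被 警方 击毙 .")))   # the gunman was killed by police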
Tree Automata Algorithms
• N-best:
  – string: … paths through a WFSA (Viterbi, 1967; Eppstein, 1998)
  – tree:   … trees in a weighted forest (Jiménez & Marzal, 2000; Huang & Chiang, 2005)
• EM training:
  – string: forward-backward EM (Baum/Welch, 1971; Eisner, 2003)
  – tree:   tree transducer EM training (Graehl & Knight, 2004)
• Determinization:
  – string: of weighted string acceptors (Mohri, 1997)
  – tree:   of weighted tree acceptors (Borchardt & Vogler, 2003; May & Knight, 2005)
• Intersection:
  – string: WFSA intersection
  – tree:   tree acceptor intersection (even though CFGs are not closed under intersection)
• Applying transducers:
  – string: string ∘ WFST → WFSA
  – tree:   tree ∘ TT → weighted tree acceptor
• Transducer composition:
  – string: WFST composition (Pereira & Riley, 1996)
  – tree:   many tree transducers not closed under composition (Rounds 70; Engelfriet 75)
Tree Automata Toolkits Used in NLP
• Tiburon: weighted tree automata toolkit  [May & Knight 06]
  – Developed by Jonathan May, USC/ISI
  – First version distributed in April 2006
  – Includes tutorial
  – Inspired by string automata toolkits
  – www.isi.edu/licensed-sw/tiburon
Tree Automata Toolkits Used in NLP

% echo "A(B(C) B(B(C)))" | tiburon -k 1 - even.rtg three.rtg
A(B(C) B(B(C))): 3.16E-9

% echo "A(B(C) B(C))" | tiburon -k 1 - even.rtg three.rtg
Warning: returning fewer trees than requested
0
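
In the same spirit as the Tiburon call above, here is a toy weighted regular tree grammar scorer; the grammar below is invented (it is not the even.rtg / three.rtg shipped with Tiburon), and the scorer assumes at most one production per (state, label) pair.

# Toy weighted regular tree grammar (wRTG) scorer. A real wRTG can have several
# productions per (state, label) and would sum or max over them.
RULES = {            # (state, node label) -> (weight, child states)
    ("q",  "A"): (0.5, ("qb", "qb")),
    ("qb", "B"): (0.8, ("qc",)),
    ("qc", "C"): (1.0, ()),
}

def score(state, tree):
    # Weight of deriving `tree` (a (label, children) tuple) from `state`; 0 if underivable.
    label, children = tree
    if (state, label) not in RULES:
        return 0.0
    w, child_states = RULES[(state, label)]
    if len(child_states) != len(children):
        return 0.0
    for s, c in zip(child_states, children):
        w *= score(s, c)
    return w

t = ("A", [("B", [("C", [])]), ("B", [("C", [])])])     # A(B(C) B(C))
print(score("q", t))     # 0.5 * 0.8 * 0.8 ≈ 0.32
# Intersecting two wRTGs multiplies their scores; a zero score means "no trees",
# like Tiburon's "returning fewer trees than requested" warning above.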
Back to the Outline
• History of the World (of Automata for NLP)
• Weighted string automata in NLP
  – Applications
    • transliteration
    • machine translation
    • language modeling
    • speech, lexical processing, tagging, summarization, optical character recognition, …
  – Generic algorithms and toolkits
• Weighted tree automata in NLP
  – Applications
  – Generic algorithms and toolkits
• Some connections with theory
Desirable Properties of
Transducer Formalism
• Expressiveness
– Can express the knowledge needed to capture the
transformation & solve the linguistic problem
• Modularity
– Can integrate smaller components into bigger
systems, co-ordinate search
• Inclusiveness
– Encompasses simpler formalisms
• Teachability
– Can learn from input/output examples
Desirable Formal Properties of a Transformation Formalism
• Modularity:    be closed under composition
• Inclusiveness: capture any transformation that a string-based FST can
• Teachability:  given input/output tree pairs, find locally optimal rule probabilities
                 in low-polynomial time
• Expressiveness: see next few slides
Expressiveness
Some things machine translation needs:
• Phrasal translation                          (e.g. “está cantando” ↔ “is singing”)
• Non-constituent phrases                      (e.g. “hay X” ↔ “there are X”)
• Non-contiguous phrases                       (e.g. “poner X” ↔ “put X on”)
• Context-sensitive word insertion/deletion    (e.g. inserting “the” under NPB(DT …))
• Re-ordering
• Lexicalized re-ordering                      (e.g. around PP(P of …))
Expressiveness — Local Rotation
English
  S( NP(DT the, N boy)  VP(V saw, NP(DT the, N door)) )
maps to verb-initial Arabic
  S’( CONJ wa [and], V ra’aa [saw], NP(N atefl [the boy]), NP(N albab [the door]) )
Desirable Formal Properties of a Transformation Formalism
• Expressiveness: do local rotation
• Modularity:     be closed under composition
• Inclusiveness:  capture any transformation that a string-based FST can
• Teachability:   given input/output pairs, find locally optimal rule probabilities
                  in low-polynomial time
How do different tree formalisms fare?
Top-down Tree Transducers
Every rule has a one-level LHS and a multilevel RHS.
Example (Arabic verb x0, subject x1, object x2):
  S(x0 x1 x2)  →  S(x1 VP(x0 x2))
• LNT — linear (non-copying), non-deleting: every LHS variable appears exactly once on the RHS.
• LT  — linear, deleting: a variable may be dropped from the RHS (can delete subtrees).
• T   — top-down: a variable may also be repeated (can copy and delete subtrees),
        e.g.  S(x0 x1 x2) → S(x0 VP(x0 x2)).
• All of these employ states: RHS variables carry states (q, r, s, …) that control how each
  subtree is processed further, e.g.  q S(x0 x1 x2) → S(r x1 VP(s x0 s x2)).
(T – top-down; L – linear, non-copying; N – non-deleting.)
(Diagram: the class hierarchy T ⊃ LT ⊃ LNT, arranged along two axes:
copying vs. non-copying, deleting vs. non-deleting.)
Expressiveness:
Can these classes perform the local rotation  S(PRO VP(V NP)) → S(V PRO NP)?
A one-level LHS rule matching S sees only its two children, not the V and NP inside the VP.
A copying transducer (T) can simulate the rotation by copying the VP subtree and then
deleting the unwanted parts in each copy; the non-copying, non-deleting class (LNT) cannot.
Extended (x-) Transducers
Extended rules allow a multilevel LHS, so a rule can grab more structure at once.
Example (English subject x0, verb x1, object x2):
  S(x0 VP(x1 x2))  →  S(x1 x0 x2)        (xLNT)
(T – top-down; L – linear, non-copying; N – non-deleting; x – extended LHS.)
• possibility mentioned in [Rounds 70]
• defined in [Graehl & Knight 04]
• used for practical MT by [Galley et al 04, 06]
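
A sketch of the extended-LHS rotation described above, assuming trees are (label, children) tuples; the `rotate` function hard-codes the single rule q S(x0 VP(x1 x2)) → S(x1 x0 x2) and is illustrative only.

# One hard-coded xLNT-style rule: rotate subject-verb-object into verb-subject-object.
def rotate(tree):
    label, children = tree
    if label == "S" and len(children) == 2 and children[1][0] == "VP":
        subj = children[0]
        verb, obj = children[1][1]           # multilevel LHS grabbed the VP's children
        return ("S", [verb, subj, obj])
    return tree

english = ("S", [("PRO", ["he"]),
                 ("VP", [("V", ["saw"]), ("NP", ["the door"])])])

print(rotate(english))
# ('S', [('V', ['saw']), ('PRO', ['he']), ('NP', ['the door'])])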
(Diagram, shown in several stages: the transducer class hierarchy — T, LT, LNT
[Gécseg & Steinby ’84], their extended-LHS counterparts xT, xLT, xLNT
[Graehl & Knight ’04], bottom-up transducers B and LB = LTR, and xTR = TR — annotated
with which classes are “expressive enough for local rotation” and which are
“closed under composition”. Expressive power theorems in
[Maletti, Graehl, Hopkins, Knight, to appear, SIAM J. Comput.].)
Inclusiveness
• Tree transducers are described as generalizing FSTs (strings are “long skinny trees”).
    FST transition           Equivalent tree transducer rule
    q —A/B→ r                q A(x0)  →  B(r x0)
    q —A/*e*→ r              q A(x0)  →  r x0
    q —*e*/B→ r              q x      →  B(r x)
• But the *e*-input transition is not part of traditional tree transducers, which must
  consume a symbol at each step.
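
To make “strings are long skinny trees” concrete, here is a small sketch that encodes a string as a monadic tree and applies the rule q A(x0) → B(r x0) at the root; the end-of-string marker and the pass-through behavior in state r are illustrative assumptions.

# "Strings are long skinny trees": encode a string as a monadic tree, then apply
# the tree transducer rule corresponding to the FST transition q --A/B--> r.
END = ("$", None)                                 # explicit end-of-string marker (assumption)

def as_monadic_tree(s):
    tree = END
    for ch in reversed(s):
        tree = (ch, tree)                         # 'AB' -> ('A', ('B', ('$', None)))
    return tree

def apply_A_to_B(tree, state="q"):
    # Apply q A(x0) -> B(r x0) at the root; in state r no further rules are defined here,
    # so the rest of the tree passes through unchanged.
    label, child = tree
    if state == "q" and label == "A":
        return ("B", apply_A_to_B(child, "r"))
    return tree

print(as_monadic_tree("AB"))                      # ('A', ('B', ('$', None)))
print(apply_A_to_B(as_monadic_tree("AB")))        # ('B', ('B', ('$', None)))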
(Diagram: the desiderata “expressive enough”, “closed under composition”, and
“generalizes FST” checked against the hierarchy — xTR = TR, xT, T, B, MBOT, xLT,
LB = LTR, LT, and the xLNT / LNT family split by whether extended RHSs and input/output
epsilons are allowed — with FST and GSM placed at the bottom of the picture.)
More Theory Connections
• Other desirable properties
  – More expressivity
  – Other types of teachability
  – Process trees horizontally as well as vertically
  – Graph transduction
• Papers:
  – Overview of tree automata in NLP [Knight & Graehl 05]
  – MT Journal [Knight 07]
  – SIAM J. Comput. [Maletti et al, forthcoming]
  – CIAA, e.g., Tiburon paper [May & Knight 06]
  – WATA (Weighted Automata Theory and Applications)
  – FSMNLP (Finite-State Methods and Natural Language Processing),
    subworkshop “Tree Automata and Transducers” (papers due 4/13/09)
Conclusion
• Weighted string automata for NLP
– well understood and exploited
• Weighted tree automata for NLP
– just starting
• Some connections with theory
– of continuing interest
• Good news from the empirical front
– making good progress on machine translation