The Unsupervised Learning of Natural Language

Learning Language from
Distributional Evidence
Christopher Manning
Depts of CS and Linguistics
Stanford University
Workshop on Where Does Syntax Come From?, MIT, Oct 2007
There’s a lot to agree on!
[C. D. Yang. 2004. UG, statistics or both? Trends in CogSci 8(10)]



Both endowment (priors/biases) and learning from
experience contribute to language acquisition
“To be effective, a learning algorithm … must have an
appropriate representation of the relevant … data.”
Languages, and hence models of language, have intricate
structure, which must be modeled.
2
More points of agreement


Learning language structure requires priors or biases to work at all, and especially to work quickly
Yang is “in favor of probabilistic learning
mechanisms that may well be domain-general”



I am too.
Probabilistic methods (and especially Bayesian prior +
likelihood methods) are perfect for this!
Probabilistic models can achieve and explain:




gradual learning and robustness in acquisition
non-homogeneous grammars of individuals
gradual language change over time
[and also other stuff, like online processing]
3
The disagreements are important,
but two levels down

In his discussion of Saffran, Aslin, and Newport (1998), Yang contrasts “statistical learning” (of syllable bigram transition probabilities) with the use of UG principles such as one primary stress per word.

I would place both parts in the same probabilistic model



Stress is a cue for word learning, just as syllable transition probabilities are a cue (and stress is a very subtle cue in speech!)
Probability theory is an effective means of combining multiple,
often noisy, sources of information with prior beliefs
Yang keeps probabilities outside of the grammar, by
suggesting that the child maintains a probability
distribution over a collection of competing grammars

I would place the probabilities inside the grammar

It is more economical, explanatory, and effective.
4
The central questions
1. What representations are appropriate for human languages?
2. What biases are required to learn languages successfully?
   Linguistically informed biases – but perhaps fairly general ones are enough
3. How much of language structure can be acquired from the linguistic input?

This gives a lower bound on how much can be acquired from the input (and hence an upper bound on what must be innate).
5
1. A mistaken meme: language as a
homogeneous, discrete system

Joos (1950: 701–702):

“Ordinary mathematical techniques fall mostly into two
classes, the continuous (e.g., the infinitesimal calculus)
and the discrete or discontinuous (e.g., finite group
theory). Now it will turn out that the mathematics
called ‘linguistics’ belongs to the second class. It does
not even make any compromise with continuity as
statistics does, or infinite-group theory. Linguistics is a
quantum mechanics in the most extreme sense. All
continuities, all possibilities of infinitesimal gradation,
are shoved outside of linguistics in one direction or the
other.”

[cf. Chambers 1995]
6
The quest for homogeneity

Bloch (1948: 7):


“The totality of the possible utterances of one
speaker at one time in using a language to
interact with one other speaker is an idiolect.
… The phrase ‘with one other speaker’ is
intended to exclude the possibility that an
idiolect might embrace more than one style
of speaking.”
Sapir (1921: 147)

“Everyone knows that language is variable”
7
Variation is everywhere


The definition of an idiolect fails, as variation occurs even within the usage of a single speaker in a single style.
The blend “as least as”:

Black voters also turned out at least as well
as they did in 1996, if not better in some
regions, including the South, according to
exit polls. Gore was doing as least as well
among black voters as President Clinton did
that year.

(Associated Press, 2000)
8
Linguistic Facts vs. Linguistic
Theories


Weinreich, Labov and Herzog (1968) see 20th
century linguistics as having gone astray by
mistakenly searching for homogeneity in
language, on the misguided assumption that
only homogeneous systems can be
structured
Probability theory provides a method for
describing structure in variable systems!
9
The need for probability
models inside syntax
The motivation comes from two sides:

Categorical linguistic theories claim too much:
 They place a hard categorical boundary of
grammaticality, where really there is a fuzzy edge,
determined by many conflicting constraints and issues
of conventionality vs. human creativity

Categorical linguistic theories explain too little:
 They say nothing at all about the soft constraints
which explain how people choose to say things

Something that language educators, computational NLP
people – and historical linguists and sociolinguists
dealing with real language – usually want to know about
10
Clausal argument
subcategorization frames

Problem: in context, language is used more flexibly
than categorical constraints suggest:

E.g., most subcategorization frame ‘facts’ are wrong
Pollard and Sag (1994) inter alia [regard vs. consider]:

*We regard Kim to be an acceptable candidate

We regard Kim as an acceptable candidate
The New York Times:

As 70 to 80 percent of the cost of blood tests, like
prescriptions, is paid for by the state, neither physicians nor
patients regard expense to be a consideration.

Conservatives argue that the Bible regards homosexuality to
be a sin.
And the same pattern repeats for many verbs and frames….
11
Probability mass functions:
subcategorization of regard
[Bar chart: estimated probability mass function over the subcategorization frames of regard (0–80%): __ NP as NP, __ NP NP, __ NP as AdjP, __ NP as PP, __ NP as VP[ing], __ NP VP[inf]]
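To make the idea of a probability mass function over frames concrete, here is a minimal sketch of the relative-frequency estimate of P(frame | verb = regard); the counts are invented placeholders, not the figures behind the chart above.

```python
from collections import Counter

# Hypothetical frame observations for "regard" drawn from a parsed corpus;
# the counts below are illustrative placeholders, not real Treebank/NYT figures.
observations = (
    ["__ NP as NP"] * 73 + ["__ NP NP"] * 2 + ["__ NP as AdjP"] * 12 +
    ["__ NP as PP"] * 3 + ["__ NP as VP[ing]"] * 5 + ["__ NP VP[inf]"] * 5
)

counts = Counter(observations)
total = sum(counts.values())

# Relative-frequency estimate of P(frame | verb = regard)
pmf = {frame: n / total for frame, n in counts.items()}
for frame, p in sorted(pmf.items(), key=lambda kv: -kv[1]):
    print(f"{frame:20s} {p:.1%}")
```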
12
Bresnan and Nikitina (2003)
on the Dative Alternation

Pinker (1981), Krifka (2001): verbs of instantaneous force
allow dative alternation but not “verbs of continuous
imparting of force” like push

“As Player A pushed him the chips, all hell broke loose at the
table.”


Pinker (1981), Levin (1993), Krifka (2001): verbs of
instrument of communication allow dative shift but not
verbs of manner of speaking

“Hi baby.” Wade says as he stretches. You just mumble him
an answer. You were comfy on that soft leather couch.
Besides …


www.cardplayer.com/?sec=afeature&art_id=165
www.nsyncbitches.com/thunder/fic/break.htm
In context such usages are unremarkable!


It’s just the productivity and context-dependence of language.
Examples are rare precisely because these are gradient constraints; that is why the data here is gathered from a really huge corpus: the web.
13
The disappearing hard constraints of
categorical grammars

We see the same thing over and over

Another example is constraints on Heavy NP Shift
[cf. Wasow 2002]







You start with a strong categorical theory, which is
mostly right.
People point out exceptions and counterexamples
You either weaken it in the face of counterexamples
Or you exclude the examples from consideration
Either way you end up without an interesting theory
There’s little point in aiming for explanatory adequacy when the descriptive adequacy of the representations used [as opposed to particular descriptions] just isn’t there.
There is insight in the probability distributions!
14
Explaining more:
What do people say?

What people say has two parts:

Contingent facts about the world


People have been talking a lot about Iraq lately
The way speakers choose to express ideas
using the resources of the language

People don’t often put that-clauses preverbally:



It appears almost certain that we will have to take a
loss
That we will have to take a loss appears almost
certain
The latter is properly part of people’s
Knowledge of Language. Part of syntax.
15
Variation is part of competence
[Labov 1972: 125]

“The variable rules themselves require at so
many points the recognition of grammatical
categories, of distinctions between
grammatical boundaries, and are so closely
interwoven with basic categorical rules, that
it is hard to see what would be gained by
extracting a grain of performance from this
complex system. It is evident that [both the
categorical and the variable rules proposed]
are a part of the speaker’s knowledge of
language.”
16
What do people say?



Simply delimiting a set of grammatical sentences provides
only a very weak description of a language, and of the
ways people choose to express ideas in it
Probability densities over sentences and sentence
structures can give a much richer view of language
structure and use
In particular, we find that (apparently) categorical
constraints in one language often reappear as the same
soft generalizations and tendencies in other languages


[Givón 1979, Bresnan, Dingare, and Manning 2001]
Linguistic theory should be able to uniformly capture
these constraints, rather than only recognizing them when
they are categorical
17
Explaining more: what determines
ditransitive vs. NP PP for dative verb
[Bresnan, Cueni, Nikitina, and Baayen 2005]



Build a mixed-effects [logistic regression] model over a corpus of examples
The model can pull apart the correlations between the various predictive variables
Explanatory variables:






Discourse accessibility, definiteness, pronominality, animacy
(Thompson 1990, Collins 1995)
Differential length in words of recipient and theme (Arnold et al.
2000, Wasow 2002, Szmrecsanyi 2004b)
Structural parallelism in dialogue (Weiner and Labov 1983, Bock 1986, Szmrecsanyi 2004a)
Number, person (Aissen 1999, 2003; Haspelmath 2004; Bresnan and Nikitina 2003)
Concreteness of theme
Broad semantic class of verb (transfer, prevent, communicate, …)
18
Explaining more: what determines
ditransitive vs. NP PP for dative verb
[Bresnan, Cueni, Nikitina, and Baayen 2005]

What does one learn?
Almost all the predictive variables have independent significant effects
  Only a couple fail to, e.g., the number of the recipient
  This shows that reductionist theories of the phenomenon that try to reduce things to one or two factors are wrong
The first object NP is preferred to be: given, animate, definite, pronominal, and shorter
The model predicts whether to use a double object or NP PP construction correctly 94% of the time
  It captures much of what is going on in this choice
  These factors matter more than differences due to individual variation
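As a concrete, much-simplified illustration of this kind of model, here is a sketch of a plain logistic regression over a handful of the predictors above. Bresnan et al. fit a mixed-effects model (with random effects for verb, etc.); the feature encoding and toy rows below are invented for illustration.

```python
# Toy sketch of a dative-alternation classifier in the spirit of Bresnan et al. (2005).
# This is a plain logistic regression, not their full mixed-effects model, and the
# feature names / example rows below are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features per example: [recipient_pronoun, recipient_given, recipient_animate,
#                        theme_definite, len(theme) - len(recipient) in words]
X = np.array([
    [1, 1, 1, 0,  3],   # e.g. "gave him a really long boring book"
    [1, 1, 1, 1,  1],   # e.g. "gave her the report"
    [0, 0, 1, 1, -2],   # e.g. "gave the book to a friend of mine from school"
    [0, 0, 0, 1, -4],   # e.g. "gave the prize to whoever showed up first"
])
y = np.array([1, 1, 0, 0])  # 1 = double object (V NP NP), 0 = prepositional (V NP to NP)

model = LogisticRegression().fit(X, y)
print(model.coef_)          # per-feature weights: sign shows direction of preference
print(model.predict_proba([[1, 1, 1, 1, 2]]))  # [P(NP-PP), P(double object)] for a new case
```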
19
Statistical parsing models also give
insight into processing
[Levy 2004]

A realistic model of human sentence processing must explain:
 Robustness to arbitrary input, accurate disambiguation
 Inference on the basis of incomplete input [Tanenhaus et al
1995, Altmann and Kamide 1999, Kaiser and Trueswell 2004]



Sentence processing difficulty is differential and localized
On the traditional view, resource limitations, especially
memory, drive processing difficulty
Locality-driven processing [Gibson 1998, 2000]: multiple
and/or more distant dependencies are harder to process
the reporter who attacked the senator: easy to process
the reporter who the senator attacked: hard to process
20
Expectation-driven processing
[Hale 2001, Levy 2005]

Alternative paradigm: Expectation-based
models of syntactic processing




Expectations are weighted averages over
probabilities
Structures we expect are easy to process
Modern computational linguistics techniques of statistical parsing → a precise psycholinguistic model
Model matches empirical results of many
recent experiments better than traditional
memory-limitation models
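The usual quantitative statement of this idea is surprisal (Hale 2001; Levy 2005): the difficulty of a word is proportional to how unexpected it is given the preceding words, which a probabilistic grammar lets us compute from prefix probabilities:

$$\mathrm{difficulty}(w_i) \;\propto\; -\log P(w_i \mid w_1 \ldots w_{i-1}) \;=\; \log \frac{P(w_1 \ldots w_{i-1})}{P(w_1 \ldots w_i)}$$

Under a PCFG, the two prefix probabilities are sums over all parses whose yields begin with the given word strings, which is what lets a statistical parser double as a processing model.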
21
Example: Verb-final domains
Locality predictions and empirical results
[Konieczny 2000] looked at reading times at German final
verbs

Locality-based models (Gibson 1998) predict difficulty for
longer clauses

But Konieczny found that final verbs were read faster in
longer clauses
Sentence (locality prediction → observed reading time at the verb):
Er hat die Gruppe geführt (“He led the group”): predicted easy → read slowly
Er hat die Gruppe auf den Berg geführt (“He led the group to the mountain”): predicted hard → read fast
Er hat die Gruppe auf den sehr schönen Berg geführt (“He led the group to the very beautiful mountain”): predicted hard → read fastest
22
Deriving Konieczny’s results


Seeing more = having more information
More information = more accurate expectations
[Partial parse of “Er hat die Gruppe auf den Berg geführt”: as the clause unfolds, the comprehender’s expectations over what comes next include an NP?, a goal PP?, a locative PP?, an ADVP?, or the final verb?]
Once we’ve seen a goal PP, we’re unlikely to see another one
  So the expectation of the remaining continuations, including the final verb, goes up
Rigorously tested: word probabilities P(w_i | w_1 … w_{i−1}) were computed with a PCFG estimated from a syntactically annotated corpus of German (the NEGRA treebank)
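A toy numerical sketch of this argument (the conditional probabilities below are invented purely to illustrate the direction of the effect, not estimated from NEGRA): once a goal PP has been seen, its probability mass is redistributed over the remaining continuations, so the conditional probability of the final verb rises and its surprisal falls.

```python
import math

# Hypothetical next-constituent distributions after two different prefixes.
# The numbers are made up purely to illustrate the direction of the effect.
after_object_np = {           # "Er hat die Gruppe ..."
    "PP-goal": 0.35, "PP-loc": 0.15, "ADVP": 0.10, "NP": 0.05, "Verb": 0.35,
}
after_goal_pp = {             # "Er hat die Gruppe auf den Berg ..."
    "PP-goal": 0.02, "PP-loc": 0.13, "ADVP": 0.10, "NP": 0.05, "Verb": 0.70,
}

def surprisal(dist, outcome):
    """Surprisal in bits: -log2 P(outcome | context)."""
    return -math.log2(dist[outcome])

print(surprisal(after_object_np, "Verb"))  # higher: the verb is less expected
print(surprisal(after_goal_pp, "Verb"))    # lower: more material sharpened the expectation
```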


23
Predictions from the
Hale 2001/Levy 2004 model
[Chart for “Er hat die Gruppe (auf den (sehr schönen) Berg) geführt”: reading time at the final verb (ms) and the model’s negative log probability of the verb, across the No PP, Short PP, and Long PP conditions, compared with locality-based difficulty (ordinal)]
Locality-based models (e.g., Gibson 1998, 2000) would violate monotonicity
24
2.&3. Learning sentence structure
from distributional evidence


Start with raw language, learn syntactic structure
Some have argued that learning syntax from positive data alone is impossible:
  Gold, 1967: non-identifiability in the limit
  Chomsky, 1980: the poverty of the stimulus
Many others have felt it should be possible:
  Lari and Young, 1990
  Carroll and Charniak, 1992
  Alex Clark, 2001
  Mark Paskin, 2001
… but it is a hard problem
25
Language learning idea 1:
Lexical affinity models

Words select other words on syntactic grounds
congress narrowly passed the amended bill

Link up pairs with high mutual information



[Yuret, 1998]: Greedy linkage
[Paskin, 2001]: Iterative re-estimation with EM
Evaluation: compare linked pairs (undirected) to a gold standard
Method: Accuracy
Paskin, 2001: 39.7
Random: 41.7
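For reference, the affinity score behind this idea is (pointwise) mutual information between words that co-occur in a sentence. A minimal sketch over a made-up toy corpus (a real linker in the spirit of Yuret 1998 would add a greedy, non-crossing linking step):

```python
import math
from collections import Counter
from itertools import combinations

# Toy corpus; in the real setting these would be counts over a large corpus.
sentences = [
    "congress narrowly passed the amended bill".split(),
    "congress passed the bill".split(),
    "the committee narrowly rejected the amended proposal".split(),
]

word_counts = Counter(w for s in sentences for w in s)
pair_counts = Counter(frozenset(p) for s in sentences
                      for p in combinations(s, 2) if p[0] != p[1])
n_words = sum(word_counts.values())
n_pairs = sum(pair_counts.values())

def pmi(w1, w2):
    """Pointwise mutual information of two words co-occurring in a sentence."""
    p_pair = pair_counts[frozenset((w1, w2))] / n_pairs
    p1, p2 = word_counts[w1] / n_words, word_counts[w2] / n_words
    return math.log2(p_pair / (p1 * p2)) if p_pair > 0 else float("-inf")

print(pmi("amended", "bill"), pmi("the", "narrowly"))
```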
26
Idea: Word Classes

Mutual information between words does not necessarily
indicate syntactic selection.
expect brushbacks but no beanballs



Individual words like brushbacks are entwined with
semantic facts about the world.
Syntactic classes, like NOUN and ADVERB, are bleached of word-specific semantics.
We could build dependency models over word classes.
[cf. Carroll and Charniak, 1992]
congress/NOUN narrowly/ADVERB passed/VERB the/DET amended/PARTICIPLE bill/NOUN
27
Problems: Word Class Models
Method: Accuracy
Random: 41.7
Carroll and Charniak, 92: 44.7
Adjacent Words: 53.2
Issues:


Too simple a model – it doesn’t work much better even when trained supervised
No representation of the valence or distance of arguments
[Examples: congress narrowly passed the amended bill; stock prices fell (NOUN NOUN VERB), shown with two alternative dependency analyses]
28
Bias: Using better dependency
representations [Klein and Manning 2004]
Model: Local factor (h = head, a = argument, c(·) = word class, d = distance)
Paskin 01: P(a | h)
Carroll & Charniak 92: P(c(a) | c(h))
Klein/Manning (DMV): P(c(a) | c(h), d)

Accuracy:
Adjacent Words: 55.9
Klein/Manning (DMV): 63.6
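For the record, the DMV's head-outward generative story can be written roughly as follows (my paraphrase of Klein and Manning 2004, with direction dir and an adjacency flag adj standing in for the distance conditioning above):

$$P(D(h)) \;=\; \prod_{dir \in \{l,r\}} \Big[ \prod_{a \in \mathrm{deps}(h,dir)} P_{\mathrm{stop}}(\neg\mathrm{stop} \mid h, dir, adj)\, P_{\mathrm{choose}}(a \mid h, dir)\, P(D(a)) \Big]\, P_{\mathrm{stop}}(\mathrm{stop} \mid h, dir, adj)$$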
29
Idea: Can we learn phrase structure
constituency as distributional clustering
 the president said that the downturn was over 
Word: Context
president: the __ of
president: the __ said
governor: the __ of
governor: the __ appointed
said: sources __ 
said: president __ that
reported: sources __ 
Induced clusters: {president, governor}, {the, a}, {said, reported}
[Finch and Chater 92, Schütze 93, many others]
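A minimal sketch of this kind of distributional clustering: build a word-by-context count matrix and cluster it. The toy corpus, the one-word left/right context definition, and the use of scikit-learn's KMeans are all illustrative simplifications.

```python
# Toy distributional clustering: words with similar left/right contexts cluster together.
from collections import Counter, defaultdict
from sklearn.cluster import KMeans
from sklearn.feature_extraction import DictVectorizer

sentences = [
    "the president said that the downturn was over".split(),
    "the governor said that the vote was close".split(),
    "the president reported that the deficit grew".split(),
    "a governor reported that a recovery began".split(),
]

contexts = defaultdict(Counter)
for s in sentences:
    padded = ["<s>"] + s + ["</s>"]
    for i, w in enumerate(s, start=1):
        contexts[w][("L", padded[i - 1])] += 1   # word to the left
        contexts[w][("R", padded[i + 1])] += 1   # word to the right

words = sorted(contexts)
vec = DictVectorizer(sparse=False)
X = vec.fit_transform([{f"{side}:{c}": n for (side, c), n in contexts[w].items()}
                       for w in words])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
for k in set(labels):
    print(k, [w for w, l in zip(words, labels) if l == k])
```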
30
Distributional learning

There is much debate in the child language
acquisition literature about distributional learning
and the possibility of kids successfully using it:





[Maratsos & Chalkley 1980] suggest it
[Pinker 1984] suggests it’s impossible (too many correlations; the properties needed are too abstract)
[Redington et al. 1998] say it succeeds because there are dominant cues relevant to language
[Mintz et al. 2002] look at distributional structure of
input
Speaking in terms of engineering, let me just tell you
that it works really well!

It’s one of the most successful techniques that we have
– my group uses it everywhere for NLP.
31
Idea: Distributional Syntax?
[Klein and Manning NIPS 2003]

Can we use distributional clustering for
learning syntax?
[Tree: (S (NP factory payrolls) (VP fell (PP in september)))]
 factory payrolls fell in september 
Span: fell in september    Context: payrolls __ 
Span: payrolls fell in    Context: factory __ sept
32
Problem: Identifying Constituents
Distributional classes are easy to find…
[Scatter plot of frequent word sequences by their context distributions (principal components 1 and 2), split into constituent (+) and non-constituent (−) sequences. Clusters include: the final vote, two decades, most people; in the end, on time, for now; decided to, took most of, go with; the final, the initial, two of the; of the, with a, without many. The labels NP, VP, and PP mark the constituent clusters.]
… but figuring out which are constituents is hard.
33
A Nested Distributional Model:
Constituent-Context Model (CCM)
For the bracketing of  factory payrolls fell in september , each constituent span contributes a span term and a context term:
P(S|T) = P(factory payrolls fell in september | +) · P( __  | +)
       × P(factory payrolls | +) · P( __ fell | +)
       × P(fell in september | +) · P(payrolls __  | +)
       × P(in september | +) · P(fell __  | +)
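In its general form (my paraphrase of Klein and Manning's CCM, so treat the notation as a sketch), every span ⟨i,j⟩ of the sentence, constituent or not, contributes a yield term and a context term, conditioned on whether the span is bracketed:

$$P(S, B) \;\propto\; \prod_{\langle i,j \rangle} P(\alpha_{ij} \mid B_{ij})\; P(\beta_{ij} \mid B_{ij}), \qquad B_{ij} \in \{+,-\}$$

where \alpha_{ij} is the word span, \beta_{ij} is its one-word left/right context, and B_{ij} says whether ⟨i,j⟩ is a constituent in the bracketing B.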
34
Initialization: A little UG?
[Figure: two candidate initial distributions over bracketings, “Tree Uniform” and “Split Uniform”]
35
Results: Constituency
Method: Accuracy
van Zaanen, 00: 35.6
Right-Branch: 70.0
Our Model (CCM): 81.6
[Figure: an example Treebank parse alongside the corresponding CCM parse]
36
Combining the two models
[Klein and Manning ACL 2004]
Dependency evaluation
Random: 45.6
DMV: 62.7
CCM + DMV: 64.7

Constituency evaluation
Random: 39.4
CCM: 81.0
CCM + DMV: 88.0
(Supervised PCFG constituency recall is at 92.8!)

Qualitative improvements
Subject-verb groups gone, modifier placement improved
37
Beyond surface syntax…
[Levy and Manning, ACL 2004]




Syntactic category, parent, grandparent (subj vs. obj extraction; VP finiteness)
Head words (wanted vs. to vs. eat)
Presence of daughters (NP under S)
Syntactic path to the origin site of a dislocated element (Gildea & Jurafsky 2002): <SBAR,S,VP,S,VP>
Plus: feature conjunctions, specialized features for expletive
subject dislocations, passivizations, passing featural
information properly through coordinations, etc., etc.
 cf. Campbell (2004 ACL) – a lot of linguistic knowledge
38
Evaluation on dependency metric:
gold-standard input trees
[Bar chart: dependency accuracy (F1, 70–100) by category (ADVP, SBAR, ADJP, VP, S, NP, Overall), comparing Levy & Manning run on gold input trees with context-free gold-tree dependencies]
39
Might we also learn a linking to
argument structure distributionally?

Why it might be possible:

Instances of open:
  She opened the door with a key.
  The door opened as they approached.
  The bottle opened easily.
  Fortunately, the corkscrew opened the bottle.
  He opened the bottle with a corkscrew.
  She opened the bottle very carefully.
  This key opens the door of the cottage.
  He opened the door.

Word distributions:
  agent: she, he
  patient: door, bottle
  instrument: key, corkscrew

Allowed linkings:
  { agent=subject, patient=object }
  { patient=subject }
  { agent=subject, patient=object, instrument=obl_with }
  { instrument=subject, patient=object }
40
Probabilistic Model Learning
[Grenager and Manning 2006 EMNLP]






Given a set of observed verb instances, what are the most likely model parameters?
Use unsupervised learning in a structured probabilistic graphical model
A good application for EM!
  E-step: compute conditional distributions over the possible role vectors for each instance
  M-step: a trivial computation
  And we repeat
[Graphical model for one instance of give: verb v = give; linking l = {0=subj, 1=obj2, 2=obj1}; ordered mapping o = [0=subj, M=np, 1=obj2, 2=obj1]; argument slots with syntactic positions s1…s4 = subj, np, obj1, obj2, roles r1…r4 = 0, M, 2, 1, and head words w1…w4 = plunge, today, them, test]
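A heavily simplified sketch of the EM loop for this kind of role induction (this is not Grenager and Manning's actual model: here the hidden variable is just a mapping from syntactic positions to roles, with parameters P(linking | verb) and P(word | role); the data, role inventory, and candidate-linking search are toy inventions):

```python
# Toy EM for verb role/linking induction, loosely inspired by Grenager & Manning (2006)
# but NOT their actual model. All data and model choices here are illustrative.
from collections import defaultdict
import random

# Observed instances: (verb, {syntactic position: head word})
data = [
    ("give", {"subj": "he", "obj1": "them", "obj2": "power"}),
    ("give", {"subj": "bill", "obj1": "him", "obj2": "right"}),
    ("give", {"subj": "they", "obj1": "power", "to": "them"}),
    ("pay",  {"subj": "company", "obj1": "$"}),
    ("pay",  {"subj": "they", "obj1": "price", "for": "stake"}),
]
ROLES = ["0", "1", "2"]

def candidate_linkings(positions):
    """All injective assignments of roles to the observed positions (toy search space)."""
    positions = sorted(positions)
    def rec(i, used):
        if i == len(positions):
            yield {}
            return
        for r in ROLES:
            if r not in used:
                for rest in rec(i + 1, used | {r}):
                    yield {positions[i]: r, **rest}
    return list(rec(0, frozenset()))

random.seed(0)
p_word = defaultdict(lambda: defaultdict(lambda: random.uniform(0.5, 1.5)))  # ~P(word | role)
p_link = defaultdict(lambda: defaultdict(lambda: 1.0))                       # ~P(linking | verb)

for _ in range(25):
    word_counts = defaultdict(lambda: defaultdict(float))
    link_counts = defaultdict(lambda: defaultdict(float))
    for verb, args in data:
        # E-step: posterior over candidate linkings for this instance
        scored = []
        for link in candidate_linkings(args):
            key = tuple(sorted((pos, link[pos]) for pos in args))
            s = p_link[verb][key]
            for pos, w in args.items():
                s *= p_word[link[pos]][w]
            scored.append((link, key, s))
        z = sum(s for _, _, s in scored)
        # Accumulate expected counts
        for link, key, s in scored:
            link_counts[verb][key] += s / z
            for pos, w in args.items():
                word_counts[link[pos]][w] += s / z
    # M-step: relative-frequency re-estimation from expected counts
    for r, wc in word_counts.items():
        t = sum(wc.values())
        p_word[r] = defaultdict(float, {w: c / t for w, c in wc.items()})
    for v, lc in link_counts.items():
        t = sum(lc.values())
        p_link[v] = defaultdict(float, {k: c / t for k, c in lc.items()})

for v in ("give", "pay"):
    print(v, sorted(p_link[v].items(), key=lambda kv: -kv[1])[:2])
for r in ROLES:
    print(r, sorted(p_word[r].items(), key=lambda kv: -kv[1])[:3])
```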
41
Semantic role induction results



The model achieves some traction, but it’s hard
Learning becomes harder with greater abstraction
This is the right research frontier to explore!
Verb: give
Linkings:
  {0=subj, 1=obj2, 2=obj1}  0.46
  {0=subj, 1=obj1, 2=to}  0.19
  {0=subj, 1=obj1}  0.05
  …
Roles:
  0: it, he, bill, they, that, …
  1: power, right, stake, …
  2: them, it, him, dept., …

Verb: pay
Linkings:
  {0=subj, 1=obj1}  0.32
  {0=subj, 1=obj1, 2=for}  0.21
  {0=subj}  0.07
  {0=subj, 1=obj1, 2=to}  0.05
  {0=subj, 1=obj2, 2=obj1}  0.05
  …
Roles:
  0: it, they, company, he, …
  1: $, bill, price, tax, …
  2: stake, gov., share, amt., …
42
Conclusions







Probabilistic models give precise descriptions of a variable,
uncertain world
There are many phenomena in syntax that cry out for non-categorical or probabilistic representations of language
Probabilistic models can – and should – be used over rich linguistic representations
They support effective learning and processing
Language learning does require biases or priors
But a lot more can be learned from modest amounts of
input than people have thought
There’s not much evidence of a poverty of the stimulus
preventing such models being used in acquisition.
43