A Bayesian Approach to the Poverty of the Stimulus
Amy Perfors (MIT)
With Josh Tenenbaum (MIT) and Terry Regier (University of Chicago)
[2x2 diagram: approaches classified by whether the knowledge is innate or learned, and by whether the grammar has explicit structure or no explicit structure]
Language has hierarchical phrase structure
Why believe that language has hierarchical phrase structure?
Formal properties + information-theoretic, simplicity-based argument (Chomsky, 1956)
Dependency structure of language:
A finite-state grammar cannot capture the infinite sets of English sentences with dependencies like this
If we restrict ourselves to only a finite set of sentences, then in theory a finite-state grammar could account for them: “but this grammar will be so complex as to be of little use or interest.”
Data:
  Simple declarative: The girl is happy. They are eating.
  Simple interrogative: Is the girl happy? Are they eating?
Hypotheses:
  1. Linear: move the first "is" (auxiliary) in the sentence to the beginning
  2. Hierarchical: move the auxiliary in the main clause to the beginning
Test:
  Complex declarative: The girl who is sleeping is happy.
Result:
  Children say: Is the girl who is sleeping happy?
  NOT: *Is the girl who sleeping is happy?
Chomsky, 1965, 1980; Crain & Nakayama, 1987
Why believe that structure dependence is innate?
The Argument from the Poverty of the Stimulus (PoS):
Why believe it’s not innate?
There are actually enough complex interrogatives (Pullum & Scholz, 2002)
Children's behavior can be explained via statistical learning of natural language data (Lewis & Elman, 2001; Reali & Christiansen, 2005)
It is not necessary to assume a grammar with explicit structure
[2x2 diagram: these replies placed on the innate vs. learned and explicit structure vs. no explicit structure axes]
Our argument
We suggest that, contra the PoS claim: It is possible, given the nature of the input and certain
domain-general assumptions about the learning mechanism, that an ideal, unbiased learner can realize that language has a hierarchical phrase structure; therefore this knowledge need not be innate
The reason: grammars with hierarchical phrase structure offer an optimal tradeoff between simplicity and fit to natural language data
Plan
Model
  Data: corpus of child-directed speech (CHILDES)
  Grammars: linear & hierarchical; both hand-designed & the result of local search; linear also via automatic, unsupervised ML
  Evaluation: complexity vs. fit
Results
Implications
The model: Data
Corpus from the CHILDES database (Adam corpus; Brown)
55 files, age range 2;3 to 5;2; sentences spoken by adults to children
Each word replaced by its syntactic category: det, n, adj, prep, pro, prop, to, part, vi, v, aux, comp, wh, c
Ungrammatical sentences and the most grammatically complex sentence types were removed, keeping 21792 of 25876 utterances. Removed: topicalized sentences (66), sentences with serial verb constructions (459), subordinate phrases (845), sentential complements (1636), conjunctions (634), and ungrammatical sentences (444).
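For concreteness, here is a rough sketch of this preprocessing step. The tag lexicon, function names, and input format below are hypothetical illustrations, not the authors' actual pipeline:

    from collections import Counter

    def to_categories(utterance, lexicon):
        """Replace each word with its syntactic category; None if any word is unknown."""
        cats = [lexicon.get(word.lower()) for word in utterance.split()]
        return None if None in cats else " ".join(cats)

    def build_corpus(utterances, lexicon):
        """Count sentence types (category strings) and tokens over taggable utterances."""
        type_counts = Counter()
        n_tokens = 0
        for utt in utterances:
            cats = to_categories(utt, lexicon)
            if cats is None:          # skip utterances we cannot tag
                continue
            type_counts[cats] += 1
            n_tokens += 1
        return type_counts, n_tokens

    # Toy illustration with a hypothetical lexicon:
    lexicon = {"the": "det", "girl": "n", "is": "aux", "happy": "adj"}
    types, tokens = build_corpus(["the girl is happy", "the girl is happy"], lexicon)
    print(len(types), tokens)   # -> 1 2  (one sentence type, two tokens)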
Data
Final corpus contained 2336 individual sentence types corresponding to 21792 sentence tokens
Data: variation
Amount of evidence available at different points in development
Amount comprehended at different points in development
Data: amount available
Rough estimate, split by age:

Epoch   # Files   Age          # types   % types
0       1         2;3          173       7.4%
1       11        2;3 to 2;8   879       38%
2       22        2;3 to 3;1   1295      55%
3       33        2;3 to 3;5   1735      74%
4       44        2;3 to 4;2   2090      89%
5       55        2;3 to 5;2   2336      100%
Data: amount comprehended
Rough estimate, split by frequency:

Level   Frequency   # types   % types   % tokens
1       500+        8         0.3%      28%
2       100+        37        1.6%      55%
3       50+         67        2.9%      64%
4       25+         115       4.9%      71%
5       10+         268       12%       82%
6       1+ (all)    2336      100%      100%
The model
Data: child-directed speech (CHILDES)
Grammars: linear & hierarchical; both hand-designed & the result of local search; linear also via automatic, unsupervised ML
Evaluation: complexity vs. fit
Grammar types
Hierarchical:
  Context-free grammar: rules of the form NT → NT NT, NT → t NT, NT → NT, NT → t
Linear:
  Regular grammar: rules of the form NT → t NT, NT → t
  "Flat" grammar: a list of each sentence
  1-state grammar: anything accepted
Specific hierarchical grammars: Hand-designed
CFG-S: a standard CFG, designed to be as linguistically plausible as possible (77 rules, 15 non-terminals)
CFG-L: a larger CFG derived from CFG-S; contains additional productions corresponding to different expansions of the same NT, putting less probability mass on recursive productions (133 rules, 15 non-terminals)
Specific linear grammars: Hand-designed
REG-N: narrowest regular grammar derived from the CFG (289 rules, 85 non-terminals)
REG-M: mid-level regular grammar derived from the CFG (169 rules, 14 non-terminals)
REG-B: broadest regular grammar derived from the CFG (117 rules, 10 non-terminals)
FLAT: a list of each sentence (2336 rules, 0 non-terminals): exact fit, no compression
1-STATE: anything accepted (26 rules, 0 non-terminals): poor fit, high compression
Local search around hand-designed grammars
Automated search
  Linear: unsupervised, automatic HMM learning (Goldwater & Griffiths, 2007), a Bayesian model for acquisition of a trigram HMM (designed for POS tagging, but given a corpus of syntactic categories, it learns a regular grammar)
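For reference, the Goldwater & Griffiths (2007) model is, in outline, a trigram HMM with symmetric Dirichlet priors on its transition and emission distributions; the notation below is a sketch of that generative form, not necessarily the exact parameterization used here. In this setting the observations w_i are the corpus's syntactic categories, and the inferred hidden states t_i supply the nonterminals of the induced regular grammar:

\[
\tau^{(t,t')} \sim \mathrm{Dirichlet}(\alpha), \quad
\omega^{(t)} \sim \mathrm{Dirichlet}(\beta), \quad
t_i \mid t_{i-1}, t_{i-2} \sim \tau^{(t_{i-1},\,t_{i-2})}, \quad
w_i \mid t_i \sim \omega^{(t_i)}
\]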
The model
Data: child-directed speech (CHILDES)
Grammars: linear & hierarchical; hand-designed & result of local search; linear also via automatic, unsupervised ML
Evaluation: complexity vs. fit
Grammars
T: type of grammar (context-free, regular, flat, 1-state); prior over T is unbiased (uniform)
G: specific grammar; complexity measured by the prior
D: data; fit measured by the likelihood
Low prior probability = more complex
Low likelihood = poor fit to the data
Tradeoff: Complexity vs. Fit
[Diagram: grammars arranged along a spectrum, from fit: low / simplicity: high at one end, through fit: moderate / simplicity: moderate, to fit: high / simplicity: low at the other end]
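Written out with the variables from the previous slide (T: grammar type, G: specific grammar, D: data), the evaluation is just Bayes' rule; with a uniform prior over T, comparing grammars amounts to comparing log prior plus log likelihood, which is the log posterior reported in the results (up to a constant):

\[
P(G, T \mid D) \propto P(D \mid G)\, P(G \mid T)\, P(T),
\qquad
\log P(G, T \mid D) = \log P(D \mid G) + \log P(G \mid T) + \mathrm{const.}
\]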
Measuring complexity: prior
Designing a grammar (God’s eye view)
Grammars with more rules and non-terminals will have lower prior probability
n = number of nonterminals; N_i = number of items in production i; P_k = number of productions of nonterminal k; V = vocabulary size; Θ_k = production probability parameters for nonterminal k
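One way to assemble these quantities into a prior, given here only as a sketch (the slide does not spell out the distributions used for the number of nonterminals, productions per nonterminal, and production lengths; geometric distributions are a natural choice):

\[
P(G \mid T) = P(n) \prod_{k=1}^{n} \Bigl[ P(P_k)\, P(\Theta_k) \prod_{i=1}^{P_k} P(N_i)\, V^{-N_i} \Bigr]
\]

Each of the N_i items in a production is drawn from the vocabulary of size V, so grammars with more nonterminals, more productions, and longer productions receive lower prior probability, as the slide states.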
Measuring fit: likelihood
The probability of the grammar generating the data: the product, over the corpus, of the probability of each sentence's parse, where a parse's probability is the product of the probabilities of the productions it uses
Ex: pro aux det n: 0.5 * 0.25 * 1.0 * 0.25 * 0.5 ≈ 0.016
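A minimal sketch of that computation (the nonterminal names and tree shape below are invented for illustration; only the probabilities reproduce the product on the slide):

    from math import prod, log

    # Hypothetical production probabilities, P(rhs | lhs), for a toy grammar.
    toy_rules = {
        ("S",  ("NP", "VP")): 0.5,
        ("NP", ("pro",)):      0.25,
        ("VP", ("aux", "NP")): 1.0,
        ("NP", ("det", "N")):  0.25,
        ("N",  ("n",)):        0.5,
    }

    # One parse of the category string "pro aux det n": the productions it uses.
    parse = [("S", ("NP", "VP")), ("NP", ("pro",)), ("VP", ("aux", "NP")),
             ("NP", ("det", "N")), ("N", ("n",))]

    p = prod(toy_rules[r] for r in parse)   # 0.5 * 0.25 * 1.0 * 0.25 * 0.5
    print(p, log(p))                        # 0.015625 (~0.016) and its log
    # The corpus likelihood is the product of such sentence probabilities
    # (summing over parses when a sentence has more than one).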
Plan
Model
  Data: corpus of child-directed speech (CHILDES)
  Grammars: linear & hierarchical; hand-designed & result of local search; linear also via automated, unsupervised ML
  Evaluation: complexity vs. fit
Results
Implications
Results: data split by frequency levels (estimate of comprehension)
Log posterior probability (lower magnitude = better)

Corpus level   FLAT     REG-N    REG-M    REG-B    REG-AUTO   1-ST     CFG-S    CFG-L
1              -116     -119     -119     -119     -125       -135     -161     -176
2              -764     -537     -581     -538     -501       -476     -545     -586
3              -1480    -971     -905     -875     -841       -765     -835     -902
4              -7337    -3284    -2963    -2787    -3011      -3339    -2653    -2784
5              -13466   -5256    -4896    -4772    -5083      -6034    -4545    -4587
6              -85730   -29441   -27300   -27561   -28713     -40360   -27883   -26967
Results: data split by age (estimate of availability)
Log posterior probability (lower magnitude = better)

Corpus epoch   FLAT     REG-N    REG-M    REG-B    REG-AUTO   1-ST     CFG-S    CFG-L
0              -4849    -3181    -2671    -2488    -2422      -2443    -2187    -2312
1              -28778   -11608   -10209   -9891    -11127     -13379   -9673    -9522
2              -44158   -16346   -14972   -14557   -15643     -20594   -14541   -14194
3              -61365   -21757   -20182   -19775   -20332     -28765   -20109   -19527
4              -75570   -26201   -24507   -24193   -24786     -35547   -24706   -23904
5              -85730   -29441   -27300   -27561   -28713     -40360   -27883   -26967
Generalization: How well does each grammar predict sentences it hasn't seen?
[Table: for each sentence type below, whether it appears in the corpus, an example, and whether each of REG-N, REG-M, REG-B, AUTO, 1-ST, CFG-S, and CFG-L generalizes to it]
Simple declarative: Eagles do fly. (n aux vi)
Simple interrogative: Do eagles fly? (aux n vi)
Complex declarative: Eagles that are alive do fly. (n comp aux adj aux vi)
Complex interrogative: Do eagles that are alive fly? (aux n comp aux adj vi)
Complex interrogative (ungrammatical): Are eagles that alive do fly? (aux n comp adj aux vi)
Take-home messages
We have shown that, given reasonable domain-general assumptions, an unbiased rational learner could realize that language has a hierarchical structure based on typical child-directed input
This paradigm is valuable: it makes all assumptions explicit and enables us to rigorously evaluate how different representations capture the tradeoff between simplicity and fit to data
In some ways, "higher-order" knowledge may be easier to learn than specific details (the "blessing of abstraction")
Implications for innateness?
Strong(er) assumptions:
  Ideal learner: the learner can find the best grammar in the space of possibilities
Weak(er) assumptions:
  The learner has the ability to parse the corpus into syntactic categories
  The learner can represent both linear and hierarchical grammars
  A particular way of calculating complexity & data fit is assumed
Have we actually found representative grammars?
The End
Thanks also to the following for many helpful discussions: Virginia Savova, Jeff Elman, Danny Fox, Adam Albright, Fei Xu, Mark Johnson, Ken Wexler,
Ted Gibson, Sharon Goldwater, Michael Frank, Charles Kemp, Vikash Mansinghka, Noah Goodman
Why these results?
Natural language actually is generated from a grammar that looks more like a CFG
The other grammars overfit and therefore do not capture important language-specific generalizations
Computing the prior…
Flat grammar: a list of each sentence
Regular grammar: rules of the form NT → t NT, NT → t
Context-free grammar: rules of the form NT → NT NT, NT → t NT, NT → NT, NT → t
Likelihood, intuitively
Z is ruled out because it does not explain some of the data points
X and Y both “explain” the data points, but X is the more likely source
Possible empirical tests
Present people with data from which the model learns FLAT, REG, and CFG grammars; see which novel productions they generalize to. Non-linguistic stimuli? With small children?
Examples of learning regular grammars in real life: does the model do the same?
Do people learn regular grammars?
Children's Songs: line-level grammar
s1 s2 s3 w1 w1 w1:
"Miss Mary Mack, Mack, Mack / All dressed in black, black, black / With silver buttons, buttons, buttons / All down her back, back, back / She asked her mother, mother, mother, …"
X s1 s2 s3:
"Spanish dancer, do the splits. / Spanish dancer, give a kick. / Spanish dancer, turn around."
Do people learn regular grammars?
Children's Songs: song-level grammar: X X s1 s2 s3
"Teddy bear, teddy bear, turn around. / Teddy bear, teddy bear, touch the ground. / Teddy bear, teddy bear, show your shoe. / Teddy bear, teddy bear, that will do. / Teddy bear, teddy bear, go upstairs. / …"
"Bubble gum, bubble gum, chew and blow, / Bubble gum, bubble gum, scrape your toe, / Bubble gum, bubble gum, tastes so sweet, …"
"Dolly Dimple walks like this, / Dolly Dimple talks like this, / Dolly Dimple smiles like this, / Dolly Dimple throws a kiss."
Do people learn regular grammars?
Songs containing items represented as lists (where order matters):
"A my name is Alice / And my husband's name is Arthur, / We come from Alabama, / Where we sell artichokes. / B my name is Barney / And my wife's name is Bridget, / We come from Brooklyn, / Where we sell bicycles. / …"
"Dough, a thing I buy beer with / Ray, a guy who buys me beer / Me, the one who wants a beer / Fa, a long way to the beer / So, I think I'll have a beer / La, -gers great but so is beer! / Tea, no thanks I'll have a beer / …"
"Cinderella, dressed in yella, / Went upstairs to kiss a fella, / Made a mistake and kissed a snake, / How many doctors did it take? / 1, 2, 3, …"
Do people learn regular grammars?
"You put your [body part] in / You put your [body part] out / You put your [body part] in / And you shake it all about"
"You do the hokey pokey / And you turn yourself around / And that's what it's all about!"
Most of the song is a template, with a repeated (varying) element
"If I were the marrying kind / I thank the lord I'm not sir / The kind of rugger I would be / Would be a rugby [position/item] sir / Cos I'd [verb phrase] / And you'd [verb phrase] / We'd all [verb phrase] together / …"
"If you're happy and you know it / [verb] your [body part] / If you're happy and you know it then your face will surely show it / If you're happy and you know it / [verb] your [body part]"
Do people learn regular grammars?
"There was a farmer had a dog, / And Bingo was his name-O. / B-I-N-G-O! / B-I-N-G-O! / B-I-N-G-O! / And Bingo was his name-O!"
(each subsequent verse, replace a letter with a clap)
Other interesting structures…
"I know a song that never ends, / It goes on and on my friends, / I know a song that never ends, / And this is how it goes: (repeat)"
Oh, Sir Richard, do not touch me
(each subsequent verse, remove the last word at the end of the sentence)
New PRG: 1-state
Two states, S and End: from S, each of the 13 categories (det, n, pro, prop, prep, adj, aux, wh, comp, to, v, vi, part) can either loop back to S or transition to End
Log(prior) = 0; no free parameters
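For illustration only: if the 26 productions out of S are weighted equally (the slide's "no free parameters" suggests fixed weights, though not necessarily uniform ones), a sentence of \ell categories has probability

\[
P(w_1 \ldots w_\ell \mid \text{1-state}) = (1/26)^{\ell},
\qquad
\log P = -\ell \log 26,
\]

so this grammar achieves maximal simplicity (log prior 0) but spreads its probability mass over all category strings, which is why it fares poorly on the larger corpora in the results.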
Another PRG: standard + noise
For instance, a level-1 PRG + noise would be the best regular grammar for the corpus at level 1, plus the 1-state model. This could parse all levels of evidence; perhaps it would be better than a more complicated PRG at later levels of evidence.
Results: frequency levels (comprehension estimates)
Log prior / log likelihood (absolute values):

Corpus level   Flat            RG-L           RG-S           CFG-S          CFG-L
1              68 / 17         116 / 19       101 / 18       133 / 24       164 / 26
2              405 / 134       394 / 146      357 / 156      313 / 185      446 / 176
3              783 / 281       560 / 322      475 / 333      384 / 401      436 / 373
4              1509 / 548      783 / 627      607 / 653      491 / 490      596 / 709
5              4087 / 1499     1343 / 1758    858 / 1863     541 / 2078     778 / 1941
6              51489 / 18119   5084 / 24326   1559 / 25625   681 / 27289    1330 / 25754

Log posterior (smaller is better):

Corpus level   Flat     RG-L     RG-S     CFG-S    CFG-L
1              85       135      119      157      188
2              539      540      513      498      622
3              1064     882      808      785      809
4              2055     1410     1260     1260     1305
5              5586     3101     2721     2619     2719
6              69607    29410    27184    27970    27084
Results: availability by age
Log prior / log likelihood (absolute values):

Period   Flat            RG-L           RG-S           CFG-S          CFG-L
0        2839 / 891      1457 / 1260    843 / 1342     552 / 1498     808 / 1425
1        16831 / 5959    3360 / 7804    1349 / 8291    667 / 8879     1175 / 8373
2        26063 / 9272    3748 / 12168   1464 / 12891   674 / 13785    1273 / 13006
3        36575 / 12932   4313 / 17185   1493 / 18123   681 / 19406    1296 / 18280
4        45292 / 15969   4681 / 21376   1521 / 22536   681 / 24059    1296 / 22674
5        51489 / 18119   5084 / 24326   1559 / 25625   681 / 27289    1330 / 25754

Log posterior (smaller is better):

Period   Flat     RG-L     RG-S     CFG-S    CFG-L
0        3730     2717     2185     2050     2233
1        22790    11164    9640     9546     9548
2        35335    15916    14335    14459    14289
3        49507    21498    19616    20087    19576
4        61261    26057    24057    24740    23970
5        69607    29410    27184    27970    27084
Specific grammars of each type
One type of hand-designed grammar: 69 productions, 14 nonterminals; 390 productions, 85 nonterminals
The other type of hand-designed grammar: 126 productions, 14 nonterminals; 170 productions, 14 nonterminals
The Argument from the Poverty of the Stimulus (PoS)
P1. It is impossible to have made some generalization G simply on the basis of data D
P2. Children show behavior B
P3. Behavior B is not possible without having made G
C1. Some constraints T, which limit what type of generalizations G are possible, must be innate
Here: G = a specific grammar; D = typical child-directed speech input; B = children don't make certain mistakes (they don't seem to entertain structure-independent hypotheses); T = language has hierarchical phrase structure
#1: Children hear complex interrogatives
Well, a few, but not many (Legate & Yang, 2002):
Adam (CHILDES) – 0.048%: no yes-no questions, four wh-questions (e.g., "What is the music it's playing?")
Nina (CHILDES) – 0.068%: no yes-no questions, 14 wh-questions
In all, most estimates are << 1% of input
How much is "enough"?
#2: Can get the behavior without structure
Response: there is enough statistical information in the input to be able to conclude which type of complex interrogative (e.g., "Are eagles that alive can fly?") is ungrammatical (Lewis & Elman, 2001; Reali & Christiansen, 2004)
Rare: comp adj aux; Common: comp aux adj
But this sidesteps the question: it does not address the innateness of structure (knowledge X), and it is explanatorily opaque
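To make the flavor of that statistical-learning response concrete, here is a toy sketch (my own illustration, not Reali & Christiansen's or Lewis & Elman's actual model): score each candidate interrogative by the corpus frequency of its category trigrams, so that the form containing the rare comp adj aux sequence loses.

    from collections import Counter

    def trigram_score(seq, counts):
        """Product of (add-one smoothed) corpus counts for each trigram in seq."""
        score = 1.0
        for i in range(len(seq) - 2):
            score *= counts[tuple(seq[i:i + 3])] + 1
        return score

    # Toy corpus of category strings (hypothetical); real models used far more data.
    corpus = ["pro aux adj".split(),
              "aux pro vi".split(),
              "n comp aux adj aux vi".split()]
    counts = Counter()
    for sent in corpus:
        for i in range(len(sent) - 2):
            counts[tuple(sent[i:i + 3])] += 1

    grammatical   = "aux n comp aux adj vi".split()   # Do eagles that are alive fly?
    ungrammatical = "aux n comp adj aux vi".split()   # *Are eagles that alive do fly?
    print(trigram_score(grammatical, counts) > trigram_score(ungrammatical, counts))  # True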
Why do linguists believe that language has hierarchical phrase structure?
Formal properties + an information-theoretic, simplicity-based argument (Chomsky, 1956):
A sentence S has an (i,j) dependency if replacement of the ith symbol a_i of S by b_i requires a corresponding replacement of the jth symbol a_j of S by b_j
If S has an m-termed dependency set in L, at least 2^m states are necessary in the finite-state grammar that generates L. Therefore, if L is a finite-state language, then there is an m such that no sentence S of L has a dependency set of more than m terms in L
The "mirror language," made up of sentences consisting of a string X followed by X in reverse (e.g., aa, abba, babbab, aabbaa, etc.), has the property that for any m we can find a dependency set D = {(1,2m), (2,2m-1), ..., (m,m+1)}. Therefore it cannot be captured by any finite-state grammar
English has infinite sets of sentences with dependency sets of more than any fixed number of terms. E.g., in "the man who said that S5 is arriving today," there is a dependency between "man" and "is." Therefore English cannot be finite-state
There is the possible counterargument that, since any finite corpus could be captured by a finite-state grammar, English fails to be finite-state only in the limit; in practice it could be. Easy counterargument: simplicity considerations. Chomsky: "If the processes have a limit, then the construction of a finite-state grammar will not be literally impossible (since a list is a trivial finite-state grammar), but this grammar will be so complex as to be of little use or interest."
The big picture
[Diagram: positions arranged along the innate vs. learned dimension]
Grammar Acquisition (Chomsky)
The Argument from the Poverty of the Stimulus (PoS)
P1. Children show behavior B
P2. Behavior B is not possible without having some specific grammar or rule G
P3. It is impossible to have learned G simply on the basis of data D
C1. Some constraints T, which limit what type of grammars are possible, must be innate
Replies to the PoS argument
P1. It is impossible to have made some generalization G simply on the basis of data D
  Reply: there are enough complex interrogatives in D (e.g., Pullum & Scholz, 2002)
P2. Children show behavior B
P3. Behavior B is not possible without having made G
  Reply: there is a route to B other than G, via statistical learning (e.g., Lewis & Elman, 2001; Reali & Christiansen, 2005)
C1. Some constraints T, which limit what type of generalizations G are possible, must be innate
[Closing diagram: the 2x2 picture again, innate vs. learned by explicit structure vs. no explicit structure]