A Bayesian Approach to the Poverty of the Stimulus
Amy Perfors (MIT)
With Josh Tenenbaum (MIT) and Terry Regier (University of Chicago)
[2x2 diagram: approaches classified by whether the knowledge is innate or learned, and by whether the grammar has explicit structure or no explicit structure]
Language has hierarchical phrase structure
Why believe that language has hierarchical phrase structure?
Formal properties + information-theoretic, simplicity-based argument (Chomsky, 1956)
Dependency structure of language:
A finite-state grammar cannot capture the infinite sets of English sentences with dependencies like this
If we restrict ourselves to only a finite set of sentences, then in theory a finite-state grammar could account for them: “but this grammar will be so complex as to be of little use or interest.”
Data:
  Simple declarative: The girl is happy. They are eating.
  Simple interrogative: Is the girl happy? Are they eating?
Hypotheses:
  1. Linear: move the first "is" (auxiliary) in the sentence to the beginning
  2. Hierarchical: move the auxiliary in the main clause to the beginning
Test:
  Complex declarative: The girl who is sleeping is happy.
Result:
  Children say: Is the girl who is sleeping happy?
  NOT: *Is the girl who sleeping is happy?
Chomsky, 1965, 1980; Crain & Nakayama, 1987
Why believe that structure dependence is innate?
The Argument from the Poverty of the Stimulus (PoS):
Why believe it’s not innate?
There are actually enough complex interrogatives (Pullum & Scholz, 2002)
Children's behavior can be explained via statistical learning of natural language data (Lewis & Elman, 2001; Reali & Christiansen, 2005)
It is not necessary to assume a grammar with explicit structure
[2x2 diagram: these replies placed on the innate vs. learned and explicit structure vs. no explicit structure axes]
Our argument
We suggest that, contra the PoS claim: It is possible, given the nature of the input and certain
domain-general assumptions about the learning mechanism, that an ideal, unbiased learner can realize that language has a hierarchical phrase structure; therefore this knowledge need not be innate
The reason: grammars with hierarchical phrase structure offer an optimal tradeoff between simplicity and fit to natural language data
Plan
Model
  Data: corpus of child-directed speech (CHILDES)
  Grammars: linear & hierarchical; both hand-designed & the result of local search; linear also via automatic, unsupervised ML
  Evaluation: complexity vs. fit
Results
Implications
The model: Data
Corpus from the CHILDES database (Adam corpus; Brown)
55 files, age range 2;3 to 5;2; sentences spoken by adults to children
Each word replaced by its syntactic category: det, n, adj, prep, pro, prop, to, part, vi, v, aux, comp, wh, c
Ungrammatical sentences and the most grammatically complex sentence types were removed, keeping 21792 of 25876 utterances. Removed: topicalized sentences (66), sentences with serial verb constructions (459), subordinate phrases (845), sentential complements (1636), conjunctions (634), and ungrammatical sentences (444).
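For concreteness, here is a rough sketch of this preprocessing step. The tag lexicon, function names, and input format below are hypothetical illustrations, not the authors' actual pipeline:

    from collections import Counter

    def to_categories(utterance, lexicon):
        """Replace each word with its syntactic category; None if any word is unknown."""
        cats = [lexicon.get(word.lower()) for word in utterance.split()]
        return None if None in cats else " ".join(cats)

    def build_corpus(utterances, lexicon):
        """Count sentence types (category strings) and tokens over taggable utterances."""
        type_counts = Counter()
        n_tokens = 0
        for utt in utterances:
            cats = to_categories(utt, lexicon)
            if cats is None:          # skip utterances we cannot tag
                continue
            type_counts[cats] += 1
            n_tokens += 1
        return type_counts, n_tokens

    # Toy illustration with a hypothetical lexicon:
    lexicon = {"the": "det", "girl": "n", "is": "aux", "happy": "adj"}
    types, tokens = build_corpus(["the girl is happy", "the girl is happy"], lexicon)
    print(len(types), tokens)   # -> 1 2  (one sentence type, two tokens)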
Data
Final corpus contained 2336 individual sentence types corresponding to 21792 sentence tokens
Data: variation
Amount of evidence available at different points in development
Amount comprehended at different points in development
Data: amount available
Rough estimate, split by age:

Epoch   # Files   Age          # types   % types
0       1         2;3          173       7.4%
1       11        2;3 to 2;8   879       38%
2       22        2;3 to 3;1   1295      55%
3       33        2;3 to 3;5   1735      74%
4       44        2;3 to 4;2   2090      89%
5       55        2;3 to 5;2   2336      100%
Data: amount comprehended
Rough estimate, split by frequency:

Level   Frequency   # types   % types   % tokens
1       500+        8         0.3%      28%
2       100+        37        1.6%      55%
3       50+         67        2.9%      64%
4       25+         115       4.9%      71%
5       10+         268       12%       82%
6       1+ (all)    2336      100%      100%
The model
Data: child-directed speech (CHILDES)
Grammars: linear & hierarchical; both hand-designed & the result of local search; linear also via automatic, unsupervised ML
Evaluation: complexity vs. fit
Grammar types
Hierarchical:
  Context-free grammar: rules of the form NT → NT NT, NT → t NT, NT → NT, NT → t
Linear:
  Regular grammar: rules of the form NT → t NT, NT → t
  "Flat" grammar: a list of each sentence
  1-state grammar: anything accepted
Specific hierarchical grammars: Hand-designed
CFG-S: a standard CFG, designed to be as linguistically plausible as possible (77 rules, 15 non-terminals)
CFG-L: a larger CFG derived from CFG-S; contains additional productions corresponding to different expansions of the same NT, putting less probability mass on recursive productions (133 rules, 15 non-terminals)
Specific linear grammars: Hand-designed
REG-N: narrowest regular grammar derived from the CFG (289 rules, 85 non-terminals)
REG-M: mid-level regular grammar derived from the CFG (169 rules, 14 non-terminals)
REG-B: broadest regular grammar derived from the CFG (117 rules, 10 non-terminals)
FLAT: a list of each sentence (2336 rules, 0 non-terminals): exact fit, no compression
1-STATE: anything accepted (26 rules, 0 non-terminals): poor fit, high compression
Local search around hand-designed grammars
Automated search
  Linear: unsupervised, automatic HMM learning (Goldwater & Griffiths, 2007), a Bayesian model for acquisition of a trigram HMM (designed for POS tagging, but given a corpus of syntactic categories, it learns a regular grammar)
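For reference, the Goldwater & Griffiths (2007) model is, in outline, a trigram HMM with symmetric Dirichlet priors on its transition and emission distributions; the notation below is a sketch of that generative form, not necessarily the exact parameterization used here. In this setting the observations w_i are the corpus's syntactic categories, and the inferred hidden states t_i supply the nonterminals of the induced regular grammar:

\[
\tau^{(t,t')} \sim \mathrm{Dirichlet}(\alpha), \quad
\omega^{(t)} \sim \mathrm{Dirichlet}(\beta), \quad
t_i \mid t_{i-1}, t_{i-2} \sim \tau^{(t_{i-1},\,t_{i-2})}, \quad
w_i \mid t_i \sim \omega^{(t_i)}
\]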
The model
Data: child-directed speech (CHILDES)
Grammars: linear & hierarchical; hand-designed & result of local search; linear also via automatic, unsupervised ML
Evaluation: complexity vs. fit
Grammars
T: type of grammar (context-free, regular, flat, 1-state); prior over T is unbiased (uniform)
G: specific grammar; complexity measured by the prior
D: data; fit measured by the likelihood
Low prior probability = more complex
Low likelihood = poor fit to the data
Tradeoff: Complexity vs. Fit
[Diagram: grammars arranged along a spectrum, from fit: low / simplicity: high at one end, through fit: moderate / simplicity: moderate, to fit: high / simplicity: low at the other end]
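Written out with the variables from the previous slide (T: grammar type, G: specific grammar, D: data), the evaluation is just Bayes' rule; with a uniform prior over T, comparing grammars amounts to comparing log prior plus log likelihood, which is the log posterior reported in the results (up to a constant):

\[
P(G, T \mid D) \propto P(D \mid G)\, P(G \mid T)\, P(T),
\qquad
\log P(G, T \mid D) = \log P(D \mid G) + \log P(G \mid T) + \mathrm{const.}
\]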
Measuring complexity: prior
Designing a grammar (God’s eye view)
Grammars with more rules and non-terminals will have lower prior probability
n = number of nonterminals; N_i = number of items in production i; P_k = number of productions of nonterminal k; V = vocabulary size; Θ_k = production probability parameters for nonterminal k
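One way to assemble these quantities into a prior, given here only as a sketch (the slide does not spell out the distributions used for the number of nonterminals, productions per nonterminal, and production lengths; geometric distributions are a natural choice):

\[
P(G \mid T) = P(n) \prod_{k=1}^{n} \Bigl[ P(P_k)\, P(\Theta_k) \prod_{i=1}^{P_k} P(N_i)\, V^{-N_i} \Bigr]
\]

Each of the N_i items in a production is drawn from the vocabulary of size V, so grammars with more nonterminals, more productions, and longer productions receive lower prior probability, as the slide states.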
Measuring fit: likelihood
The probability of the grammar generating the data: the product, over the corpus, of the probability of each sentence's parse, where a parse's probability is the product of the probabilities of the productions it uses
Ex: pro aux det n: 0.5 * 0.25 * 1.0 * 0.25 * 0.5 ≈ 0.016
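A minimal sketch of that computation (the nonterminal names and tree shape below are invented for illustration; only the probabilities reproduce the product on the slide):

    from math import prod, log

    # Hypothetical production probabilities, P(rhs | lhs), for a toy grammar.
    toy_rules = {
        ("S",  ("NP", "VP")): 0.5,
        ("NP", ("pro",)):      0.25,
        ("VP", ("aux", "NP")): 1.0,
        ("NP", ("det", "N")):  0.25,
        ("N",  ("n",)):        0.5,
    }

    # One parse of the category string "pro aux det n": the productions it uses.
    parse = [("S", ("NP", "VP")), ("NP", ("pro",)), ("VP", ("aux", "NP")),
             ("NP", ("det", "N")), ("N", ("n",))]

    p = prod(toy_rules[r] for r in parse)   # 0.5 * 0.25 * 1.0 * 0.25 * 0.5
    print(p, log(p))                        # 0.015625 (~0.016) and its log
    # The corpus likelihood is the product of such sentence probabilities
    # (summing over parses when a sentence has more than one).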
Plan
Model
  Data: corpus of child-directed speech (CHILDES)
  Grammars: linear & hierarchical; hand-designed & result of local search; linear also via automated, unsupervised ML
  Evaluation: complexity vs. fit
Results
Implications
Results: data split by frequency levels (estimate of comprehension)
Log posterior probability (lower magnitude = better)

Corpus level   FLAT     REG-N    REG-M    REG-B    REG-AUTO   1-ST     CFG-S    CFG-L
1              -116     -119     -119     -119     -125       -135     -161     -176
2              -764     -537     -581     -538     -501       -476     -545     -586
3              -1480    -971     -905     -875     -841       -765     -835     -902
4              -7337    -3284    -2963    -2787    -3011      -3339    -2653    -2784
5              -13466   -5256    -4896    -4772    -5083      -6034    -4545    -4587
6              -85730   -29441   -27300   -27561   -28713     -40360   -27883   -26967
Results: data split by age (estimate of availability)
Log posterior probability (lower magnitude = better)

Corpus epoch   FLAT     REG-N    REG-M    REG-B    REG-AUTO   1-ST     CFG-S    CFG-L
0              -4849    -3181    -2671    -2488    -2422      -2443    -2187    -2312
1              -28778   -11608   -10209   -9891    -11127     -13379   -9673    -9522
2              -44158   -16346   -14972   -14557   -15643     -20594   -14541   -14194
3              -61365   -21757   -20182   -19775   -20332     -28765   -20109   -19527
4              -75570   -26201   -24507   -24193   -24786     -35547   -24706   -23904
5              -85730   -29441   -27300   -27561   -28713     -40360   -27883   -26967
Generalization: How well does each grammar predict sentences it hasn't seen?
[Table: for each sentence type below, whether it appears in the corpus, an example, and whether each of REG-N, REG-M, REG-B, AUTO, 1-ST, CFG-S, and CFG-L generalizes to it]
Simple declarative: Eagles do fly. (n aux vi)
Simple interrogative: Do eagles fly? (aux n vi)
Complex declarative: Eagles that are alive do fly. (n comp aux adj aux vi)
Complex interrogative: Do eagles that are alive fly? (aux n comp aux adj vi)
Complex interrogative (ungrammatical): Are eagles that alive do fly? (aux n comp adj aux vi)
Take-home messages
We have shown that, given reasonable domain-general assumptions, an unbiased rational learner could realize that language has a hierarchical structure based on typical child-directed input
This paradigm is valuable: it makes all assumptions explicit and enables us to rigorously evaluate how different representations capture the tradeoff between simplicity and fit to data
In some ways, "higher-order" knowledge may be easier to learn than specific details (the "blessing of abstraction")
Implications for innateness?
Strong(er) assumptions:
  Ideal learner: the learner can find the best grammar in the space of possibilities
Weak(er) assumptions:
  The learner has the ability to parse the corpus into syntactic categories
  The learner can represent both linear and hierarchical grammars
  A particular way of calculating complexity & data fit is assumed
Have we actually found representative grammars?
The End
Thanks also to the following for many helpful discussions: Virginia Savova, Jeff Elman, Danny Fox, Adam Albright, Fei Xu, Mark Johnson, Ken Wexler,
Ted Gibson, Sharon Goldwater, Michael Frank, Charles Kemp, Vikash Mansinghka, Noah Goodman
Why these results?
Natural language actually is generated from a grammar that looks more like a CFG
The other grammars overfit and therefore do not capture important language-specific generalizations
Computing the prior…
Flat grammar: a list of each sentence
Regular grammar: rules of the form NT → t NT, NT → t
Context-free grammar: rules of the form NT → NT NT, NT → t NT, NT → NT, NT → t
Likelihood, intuitively
Z is ruled out because it does not explain some of the data points
X and Y both “explain” the data points, but X is the more likely source
Possible empirical tests
Present people with data from which the model learns FLAT, REG, and CFG grammars; see which novel productions they generalize to. Non-linguistic stimuli? With small children?
Examples of learning regular grammars in real life: does the model do the same?
Do people learn regular grammars?
Children's Songs: line-level grammar
s1 s2 s3 w1 w1 w1:
"Miss Mary Mack, Mack, Mack / All dressed in black, black, black / With silver buttons, buttons, buttons / All down her back, back, back / She asked her mother, mother, mother, …"
X s1 s2 s3:
"Spanish dancer, do the splits. / Spanish dancer, give a kick. / Spanish dancer, turn around."
Do people learn regular grammars?
Children's Songs: song-level grammar: X X s1 s2 s3
"Teddy bear, teddy bear, turn around. / Teddy bear, teddy bear, touch the ground. / Teddy bear, teddy bear, show your shoe. / Teddy bear, teddy bear, that will do. / Teddy bear, teddy bear, go upstairs. / …"
"Bubble gum, bubble gum, chew and blow, / Bubble gum, bubble gum, scrape your toe, / Bubble gum, bubble gum, tastes so sweet, …"
"Dolly Dimple walks like this, / Dolly Dimple talks like this, / Dolly Dimple smiles like this, / Dolly Dimple throws a kiss."
Do people learn regular grammars?
Songs containing items represented as lists (where order matters):
"A my name is Alice / And my husband's name is Arthur, / We come from Alabama, / Where we sell artichokes. / B my name is Barney / And my wife's name is Bridget, / We come from Brooklyn, / Where we sell bicycles. / …"
"Dough, a thing I buy beer with / Ray, a guy who buys me beer / Me, the one who wants a beer / Fa, a long way to the beer / So, I think I'll have a beer / La, -gers great but so is beer! / Tea, no thanks I'll have a beer / …"
"Cinderella, dressed in yella, / Went upstairs to kiss a fella, / Made a mistake and kissed a snake, / How many doctors did it take? / 1, 2, 3, …"
Do people learn regular grammars?
"You put your [body part] in / You put your [body part] out / You put your [body part] in / And you shake it all about"
"You do the hokey pokey / And you turn yourself around / And that's what it's all about!"
Most of the song is a template, with a repeated (varying) element
"If I were the marrying kind / I thank the lord I'm not sir / The kind of rugger I would be / Would be a rugby [position/item] sir / Cos I'd [verb phrase] / And you'd [verb phrase] / We'd all [verb phrase] together / …"
"If you're happy and you know it / [verb] your [body part] / If you're happy and you know it then your face will surely show it / If you're happy and you know it / [verb] your [body part]"
Do people learn regular grammars?
"There was a farmer had a dog, / And Bingo was his name-O. / B-I-N-G-O! / B-I-N-G-O! / B-I-N-G-O! / And Bingo was his name-O!"
(each subsequent verse, replace a letter with a clap)
Other interesting structures…
"I know a song that never ends, / It goes on and on my friends, / I know a song that never ends, / And this is how it goes: (repeat)"
Oh, Sir Richard, do not touch me
(each subsequent verse, remove the last word at the end of the sentence)
New PRG: 1-state
Two states, S and End: from S, each of the 13 categories (det, n, pro, prop, prep, adj, aux, wh, comp, to, v, vi, part) can either loop back to S or transition to End
Log(prior) = 0; no free parameters
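For illustration only: if the 26 productions out of S are weighted equally (the slide's "no free parameters" suggests fixed weights, though not necessarily uniform ones), a sentence of \ell categories has probability

\[
P(w_1 \ldots w_\ell \mid \text{1-state}) = (1/26)^{\ell},
\qquad
\log P = -\ell \log 26,
\]

so this grammar achieves maximal simplicity (log prior 0) but spreads its probability mass over all category strings, which is why it fares poorly on the larger corpora in the results.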
Another PRG: standard + noise
For instance, a level-1 PRG + noise would be the best regular grammar for the corpus at level 1, plus the 1-state model. This could parse all levels of evidence; perhaps it would be better than a more complicated PRG at later levels of evidence.
Results: frequency levels (comprehension estimates)
Log prior / log likelihood (absolute values):

Corpus level   Flat            RG-L           RG-S           CFG-S          CFG-L
1              68 / 17         116 / 19       101 / 18       133 / 24       164 / 26
2              405 / 134       394 / 146      357 / 156      313 / 185      446 / 176
3              783 / 281       560 / 322      475 / 333      384 / 401      436 / 373
4              1509 / 548      783 / 627      607 / 653      491 / 490      596 / 709
5              4087 / 1499     1343 / 1758    858 / 1863     541 / 2078     778 / 1941
6              51489 / 18119   5084 / 24326   1559 / 25625   681 / 27289    1330 / 25754

Log posterior (smaller is better):

Corpus level   Flat     RG-L     RG-S     CFG-S    CFG-L
1              85       135      119      157      188
2              539      540      513      498      622
3              1064     882      808      785      809
4              2055     1410     1260     1260     1305
5              5586     3101     2721     2619     2719
6              69607    29410    27184    27970    27084
Results: availability by age
Log prior / log likelihood (absolute values):

Period   Flat            RG-L           RG-S           CFG-S          CFG-L
0        2839 / 891      1457 / 1260    843 / 1342     552 / 1498     808 / 1425
1        16831 / 5959    3360 / 7804    1349 / 8291    667 / 8879     1175 / 8373
2        26063 / 9272    3748 / 12168   1464 / 12891   674 / 13785    1273 / 13006
3        36575 / 12932   4313 / 17185   1493 / 18123   681 / 19406    1296 / 18280
4        45292 / 15969   4681 / 21376   1521 / 22536   681 / 24059    1296 / 22674
5        51489 / 18119   5084 / 24326   1559 / 25625   681 / 27289    1330 / 25754

Log posterior (smaller is better):

Period   Flat     RG-L     RG-S     CFG-S    CFG-L
0        3730     2717     2185     2050     2233
1        22790    11164    9640     9546     9548
2        35335    15916    14335    14459    14289
3        49507    21498    19616    20087    19576
4        61261    26057    24057    24740    23970
5        69607    29410    27184    27970    27084
Specific grammars of each type
One type of hand-designed grammar: 69 productions, 14 nonterminals; 390 productions, 85 nonterminals
The other type of hand-designed grammar: 126 productions, 14 nonterminals; 170 productions, 14 nonterminals
The Argument from the Poverty of the Stimulus (PoS)
P1. It is impossible to have made some generalization G simply on the basis of data D
P2. Children show behavior B
P3. Behavior B is not possible without having made G
C1. Some constraints T, which limit what type of generalizations G are possible, must be innate
Here: G = a specific grammar; D = typical child-directed speech input; B = children don't make certain mistakes (they don't seem to entertain structure-independent hypotheses); T = language has hierarchical phrase structure
#1: Children hear complex interrogatives
Well, a few, but not many (Legate & Yang, 2002):
Adam (CHILDES) – 0.048%: no yes-no questions, four wh-questions (e.g., "What is the music it's playing?")
Nina (CHILDES) – 0.068%: no yes-no questions, 14 wh-questions
In all, most estimates are << 1% of input
How much is "enough"?
#2: Can get the behavior without structure
Response: there is enough statistical information in the input to be able to conclude which type of complex interrogative (e.g., "Are eagles that alive can fly?") is ungrammatical (Lewis & Elman, 2001; Reali & Christiansen, 2004)
Rare: comp adj aux; Common: comp aux adj
But this sidesteps the question: it does not address the innateness of structure (knowledge X), and it is explanatorily opaque
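To make the flavor of that statistical-learning response concrete, here is a toy sketch (my own illustration, not Reali & Christiansen's or Lewis & Elman's actual model): score each candidate interrogative by the corpus frequency of its category trigrams, so that the form containing the rare comp adj aux sequence loses.

    from collections import Counter

    def trigram_score(seq, counts):
        """Product of (add-one smoothed) corpus counts for each trigram in seq."""
        score = 1.0
        for i in range(len(seq) - 2):
            score *= counts[tuple(seq[i:i + 3])] + 1
        return score

    # Toy corpus of category strings (hypothetical); real models used far more data.
    corpus = ["pro aux adj".split(),
              "aux pro vi".split(),
              "n comp aux adj aux vi".split()]
    counts = Counter()
    for sent in corpus:
        for i in range(len(sent) - 2):
            counts[tuple(sent[i:i + 3])] += 1

    grammatical   = "aux n comp aux adj vi".split()   # Do eagles that are alive fly?
    ungrammatical = "aux n comp adj aux vi".split()   # *Are eagles that alive do fly?
    print(trigram_score(grammatical, counts) > trigram_score(ungrammatical, counts))  # True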
Why do linguists believe that language has hierarchical phrase structure?
Formal properties + an information-theoretic, simplicity-based argument (Chomsky, 1956):
A sentence S has an (i,j) dependency if replacement of the ith symbol a_i of S by b_i requires a corresponding replacement of the jth symbol a_j of S by b_j
If S has an m-termed dependency set in L, at least 2^m states are necessary in the finite-state grammar that generates L. Therefore, if L is a finite-state language, then there is an m such that no sentence S of L has a dependency set of more than m terms in L
The "mirror language," made up of sentences consisting of a string X followed by X in reverse (e.g., aa, abba, babbab, aabbaa, etc.), has the property that for any m we can find a dependency set D = {(1,2m), (2,2m-1), ..., (m,m+1)}. Therefore it cannot be captured by any finite-state grammar
English has infinite sets of sentences with dependency sets of more than any fixed number of terms. E.g., in "the man who said that S5 is arriving today," there is a dependency between "man" and "is." Therefore English cannot be finite-state
There is the possible counterargument that, since any finite corpus could be captured by a finite-state grammar, English fails to be finite-state only in the limit; in practice it could be. Easy counterargument: simplicity considerations. Chomsky: "If the processes have a limit, then the construction of a finite-state grammar will not be literally impossible (since a list is a trivial finite-state grammar), but this grammar will be so complex as to be of little use or interest."
The big picture
[Diagram: positions arranged along the innate vs. learned dimension]
Grammar Acquisition (Chomsky)
The Argument from the Poverty of the Stimulus (PoS)
P1. Children show behavior B
P2. Behavior B is not possible without having some specific grammar or rule G
P3. It is impossible to have learned G simply on the basis of data D
C1. Some constraints T, which limit what type of grammars are possible, must be innate
Replies to the PoS argument
P1. It is impossible to have made some generalization G simply on the basis of data D
  Reply: there are enough complex interrogatives in D (e.g., Pullum & Scholz, 2002)
P2. Children show behavior B
P3. Behavior B is not possible without having made G
  Reply: there is a route to B other than G, via statistical learning (e.g., Lewis & Elman, 2001; Reali & Christiansen, 2005)
C1. Some constraints T, which limit what type of generalizations G are possible, must be innate
[Closing diagram: the 2x2 picture again, innate vs. learned by explicit structure vs. no explicit structure]