Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.


Page 1: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

Minimum Description Length

An Adequate Syntactic Theory?

Mike Dowman

3 June 2005

Page 2: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

Linguistic Theory

Chomsky’s Conceptualization of Language Acquisition

[Diagram: Primary Linguistic Data → Language Acquisition Device → Individual's Knowledge of Language]

Page 3: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

Diachronic Theories

Hurford’s Diachronic Spiral

[Diagram: a cycle running Primary Linguistic Data → Language Acquisition Device → Individual's Knowledge of Language → Arena of Language Use → Primary Linguistic Data]

Page 4: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

Learnability

Poverty of the stimulus

Language is really complex

Obscure and abstract rules constrain wh-movement, pronoun binding, passive formation, etc.

Examples of E-language don’t give sufficient information to determine this

Page 5: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

WH-movement

Who_i do you think Lord Emsworth will invite t_i?

Who_i do you think that Lord Emsworth will invite t_i?

Who_i do you think t_i will arrive first?

* Who_i do you think that t_i will arrive first?

Page 6: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

Negative Evidence

• Some constructions seem impossible to learn without negative evidence

John gave a painting to the museum

John gave the museum a painting

John donated a painting to the museum

* John donated the museum a painting

Page 7: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

Implicit Negative Evidence

If constructions don’t appear can we just assume they’re not grammatical?

No – we only see a tiny proportion of possible, grammatical sentences

People generalize from examples they have seen to form new utterances

‘[U]nder exactly what circumstances does a child conclude that a nonwitnessed sentence is ungrammatical?’ (Pinker, 1989)

Page 8: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

Learnability Proofs

Gold (1967): for languages to be learnable in the limit we must have:

• Negative evidence

• or a priori restrictions on possible languages

But learnable in the limit means being sure that we have determined the correct language

Page 9: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

Statistical Learnability

Horning (1969)

• If grammars are statistical

• so utterances are produced with frequencies corresponding to the grammar

Languages are learnable

• But we can never be sure when the correct grammar has been found

• This just gets more likely as we see more data

Page 10: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

Horning’s Proof

• Used Bayes’ rule

• More complex grammars are less probable a priori P(h)

• Statistical grammars can assign probabilities to data P(d | h)

• Search through all possible grammars, starting with the simplest

P(h | d) ∝ P(h) P(d | h)

Page 11: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

MDL

Horning’s evaluation method for grammars can be seen as a form of Minimum Description Length

Simplest is best (Occam’s Razor)

Simplest means specifiable with the least amount of information

Information theory (Shannon, 1948) allows us to link probability and information:

Amount of information = -log Probability
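The link between probability and information can be sketched in a few lines. This is only an illustration: the base-2 logarithm (bits), the function names and the example probabilities are assumptions, not taken from the slides.

```python
import math

def bits(probability: float) -> float:
    """Code length in bits of something that has the given probability."""
    return -math.log2(probability)

def total_description_length(p_grammar: float, p_data_given_grammar: float) -> float:
    """Bits needed for the grammar plus bits for the data given the grammar."""
    return bits(p_grammar) + bits(p_data_given_grammar)

# Hypothetical probabilities, chosen only so the arithmetic is easy to follow.
simple = total_description_length(p_grammar=1 / 4, p_data_given_grammar=1 / 64)    # 2 + 6 bits
complex_ = total_description_length(p_grammar=1 / 32, p_data_given_grammar=1 / 4)  # 5 + 2 bits
```

Here the more complex grammar wins (7 bits against 8) because it fits the data much better. The shorter total description is also the one with the larger product P(h) P(d | h), which is why MDL evaluation and Horning's Bayesian evaluation agree.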

Page 12: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

Encoding Grammars and Data

[Diagram: a bit string is passed to a decoder, which recovers a grammar and the data coded in terms of that grammar]

1010100111010100101101010001100111100011010110

Grammar:

A → B C
B → D E
E → {kangaroo, aeroplane, comedian}
D → {the, a, some}
C → {died, laughed, burped}

Data coded in terms of grammar:

The comedian died
A kangaroo burped
The aeroplane laughed
Some comedian burped

Page 13: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

Complexity and Probability

• More complex grammar

Longer coding length, so lower probability

• More restrictive grammar

Fewer choices for data, so each possibility has a higher probability

Page 14: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

• Most restrictive grammar just lists all possible utterances

Only the observed data is grammatical, so it has a high probability

• A simple grammar could be made that allowed any sentences

Grammar would have a high probability

But data a very low one

MDL finds a middle ground between always generalizing and never generalizing

Page 15: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

Rampant Synonymy?

• Inductive inference (Solomonoff, 1960a)

• Kolmogorov complexity (Kolmogorov, 1965)

• Minimum Message Length (Wallace and Boulton, 1968)

• Algorithmic Information Theory (Chaitin, 1969)

• Minimum Description Length (Rissanen, 1978)

• Minimum Coding Length (Ellison, 1992)

• Bayesian Learning (Stolcke, 1994)

• Minimum Representation Length (Brent, 1996)

Page 16: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

Evaluation and Search

• MDL principle gives us an evaluation criterion for grammars (with respect to corpora)

• But it doesn’t solve the problem of how to find the grammars in the first place

Search mechanism needed

Page 17: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

Two Learnability Problems

• How to determine which of two or more grammars is best given some data

• How to guide the search for grammars so that we can find the correct one, without considering every logically possible grammar

Page 18: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

MDL in Linguistics

• Solomonoff (1960b): ‘Mechanization of Linguistic Learning’

• Learning phrase structure grammars for simple ‘toy’ languages: Stolcke (1994), Langley and Stromsten (2000)

• Or real corpora: Chen (1995), Grünwald (1994)

• Or for language modelling in speech recognition systems: Starkie (2001)

Page 19: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

Not Just Syntax!

• Phonology: Ellison (1992), Rissanen and Ristad (1994)

• Morphology: Brent (1993), Goldsmith (2001)

• Segmenting continuous speech: de Marcken (1996), Brent and Cartwright (1997)

Page 20: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

MDL and Parameter Setting

• Briscoe (1999) and Rissanen and Ristad (1994) used MDL as part of parameter setting learning mechanisms

MDL and Iterated Learning

• Briscoe (1999) used MDL as part of an expression-induction model

• Brighton (2002) investigated effect of bottlenecks on an MDL learner

• Roberts et al (2005) modelled lexical exceptions to syntactic rules

Page 21: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

An Example: My Model

Learns simple phrase structure grammars

• Binary or non-branching rules:

A → B C
D → E
F → tomato

• All derivations start from special symbol S

• null symbol in 3rd position indicates non-branching rule

Page 22: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

Encoding Grammars

Grammars can be coded as lists of three symbols

• First symbol is the rule's left-hand side, second and third its right-hand side

A, B, C, D, E, null, F, tomato, null

• First we have to encode the frequency of each symbol

Page 23: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

Encoding Data

We must record the frequency of each rule

1 S → NP VP (3)
2 NP → john (2)
3 NP → mary (1)
4 VP → screamed (2)
5 VP → died (1)

Total frequency for S = 3

Total frequency for NP = 3

Total frequency for VP = 3

Data: 1, 2, 4, 1, 2, 5, 1, 3, 4

Probabilities: 1 → 3/3, 2 → 2/3, 4 → 2/3, 1 → 3/3, 2 → 2/3…
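The data coding length implied by these rule frequencies can be worked out directly. The sketch below is an illustration under two assumptions: lengths are measured in bits (base-2 logarithm), and only the rule choices are costed, not the cost of transmitting the frequencies themselves. The variable and function names are hypothetical.

```python
import math
from collections import Counter

# Rules from the slide: rule number -> (left-hand side, right-hand side).
rules = {
    1: ("S", ("NP", "VP")),
    2: ("NP", ("john",)),
    3: ("NP", ("mary",)),
    4: ("VP", ("screamed",)),
    5: ("VP", ("died",)),
}
data = [1, 2, 4, 1, 2, 5, 1, 3, 4]  # rule sequence for the three sentences

# Frequency of each rule, and total frequency of each left-hand-side symbol.
rule_freq = Counter(data)
lhs_total = Counter()
for rule, count in rule_freq.items():
    lhs_total[rules[rule][0]] += count

def rule_probability(rule: int) -> float:
    """Probability of a rule among all rules sharing its left-hand side."""
    return rule_freq[rule] / lhs_total[rules[rule][0]]

# Each rule use costs -log2 of its probability.
data_bits = sum(-math.log2(rule_probability(r)) for r in data)
print(round(data_bits, 2))  # 5.51
```

Rule 1 is the only way to expand S, so it has probability 3/3 and costs nothing. All of the roughly 5.5 bits are spent choosing between john and mary and between screamed and died.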

Page 24: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

Encoding in My Model

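[Diagram: a bit string is passed to a decoder, which recovers the symbol frequencies, the grammar, the rule frequencies and the data]

1010100111010100101101010001100111100011010110

Symbol Frequencies:

S (1)
NP (3)
VP (3)
john (1)
mary (1)
screamed (1)
died (1)
null (4)

Grammar:

1 S → NP VP
2 NP → john
3 NP → mary
4 VP → screamed
5 VP → died

Rule Frequencies:

Rule 1 → 3
Rule 2 → 2
Rule 3 → 1
Rule 4 → 2
Rule 5 → 1

Data:

John screamed
John died
Mary screamed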

Page 25: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

Search Strategy

• Start with simple grammar that allows all sentences

• Make simple change and see if it improves the evaluation (add a rule, delete a rule, change a symbol in a rule, etc.)

• Annealing search

• First stage: just look at data coding length

• Second stage: look at overall evaluation
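The general shape of such a search can be sketched as a generic simulated-annealing loop. This is not the model's actual search: the two-stage schedule is not reproduced, the cooling schedule is an assumption, and `evaluate`, `propose_change` and the toy problem are hypothetical stand-ins for a real MDL evaluation and real grammar edits.

```python
import math
import random

def anneal(initial, evaluate, propose_change, steps=2000, start_temp=5.0, seed=0):
    """Generic simulated annealing that minimises evaluate(candidate)."""
    rng = random.Random(seed)
    current, current_cost = initial, evaluate(initial)
    best, best_cost = current, current_cost
    for step in range(steps):
        # Linear cooling; the small constant avoids division by zero.
        temperature = start_temp * (1 - step / steps) + 1e-9
        candidate = propose_change(current, rng)
        cost = evaluate(candidate)
        # Always accept improvements; accept worse candidates with a
        # probability that shrinks as the temperature falls.
        if cost <= current_cost or rng.random() < math.exp((current_cost - cost) / temperature):
            current, current_cost = candidate, cost
            if cost < best_cost:
                best, best_cost = candidate, cost
    return best, best_cost

# Toy stand-in for a grammar: a tuple of integers whose "evaluation"
# is the sum of squares (lower is better).
def toy_evaluate(candidate):
    return sum(x * x for x in candidate)

def toy_propose(candidate, rng):
    """Make one small change, analogous to changing a symbol in a rule."""
    i = rng.randrange(len(candidate))
    changed = list(candidate)
    changed[i] += rng.choice((-1, 1))
    return tuple(changed)

start = (7, -4, 3)
best, best_cost = anneal(start, toy_evaluate, toy_propose)
```

In a grammar learner, `evaluate` would return the coding length in bits and `propose_change` would add a rule, delete a rule, or change a symbol in a rule.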

Page 26: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

Example: English

John hit Mary
Mary hit Ethel
Ethel ran
John ran
Mary ran
Ethel hit John
Noam hit John
Ethel screamed
Mary kicked Ethel
John hopes Ethel thinks Mary hit Ethel
Ethel thinks John ran
John thinks Ethel ran
Mary ran
Ethel hit Mary
Mary thinks John hit Ethel
John screamed
Noam hopes John screamed
Mary hopes Ethel hit John
Noam kicked Mary

Learned Grammar

S → NP VP
VP → ran
VP → screamed
VP → Vt NP
VP → Vs S
Vt → hit
Vt → kicked
Vs → thinks
Vs → hopes
NP → John
NP → Ethel
NP → Mary
NP → Noam
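One way to check that the learned grammar covers the training sentences is a small CYK-style recognizer. The sketch below is an illustration, not part of the model; the table names `BINARY` and `LEXICON` and the function `generates` are hypothetical.

```python
from functools import lru_cache

# Binary rules, indexed by their right-hand side.
BINARY = {("NP", "VP"): {"S"}, ("Vt", "NP"): {"VP"}, ("Vs", "S"): {"VP"}}
# Non-branching rules, indexed by the word they introduce.
LEXICON = {
    "ran": {"VP"}, "screamed": {"VP"},
    "hit": {"Vt"}, "kicked": {"Vt"},
    "thinks": {"Vs"}, "hopes": {"Vs"},
    "John": {"NP"}, "Ethel": {"NP"}, "Mary": {"NP"}, "Noam": {"NP"},
}

def generates(sentence: str) -> bool:
    """Return True if the grammar derives the sentence from S."""
    words = tuple(sentence.split())

    @lru_cache(maxsize=None)
    def categories(i, j):
        # Set of symbols that can span words[i:j].
        if j - i == 1:
            return frozenset(LEXICON.get(words[i], ()))
        found = set()
        for k in range(i + 1, j):
            for left in categories(i, k):
                for right in categories(k, j):
                    found |= BINARY.get((left, right), set())
        return frozenset(found)

    return bool(words) and "S" in categories(0, len(words))
```

The grammar accepts training sentences such as "John hopes Ethel thinks Mary hit Ethel", and also unseen ones such as "Noam thinks Noam ran", which shows the generalization beyond the observed data.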

Page 27: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

Evaluations

[Bar chart: Evaluation (bits), axis from 0 to 450, showing Overall Evaluation, Grammar and Data for the Initial Grammar and the Learned Grammar]

Page 28: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

Dative Alternation

• Children learn distinction between alternating and non-alternating verbs

• Previously unseen verbs are used productively in both constructions

New verbs follow regular pattern

• During learning children use non-alternating verbs in both constructions

U-shaped learning

Page 29: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

Training Data

• Three alternating verbs: gave, passed, lent

• One non-alternating verb: donated

• One verb seen only once: sent

The museum lent Sam a painting

John gave a painting to Sam

Sam donated John to the museum

The museum sent a painting to Sam

Page 30: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

Dative Evaluations

[Bar chart: Evaluation (bits), axis from 0 to 3500, showing Overall Evaluation, Grammar and Data for the Initial Grammar and the Learned Grammar]

Page 31: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

Grammar Properties

• Learned grammar distinguishes alternating and non-alternating verbs

• sent appears in alternating class

• With less data, only one class of verbs, so donated can appear in both constructions

• All sentences generated by the grammar are grammatical

• But structures are not right

Page 32: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

Learned Structures

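[Tree diagram over the sentence below. The node labels NP, X, Y, Z and S appear above the word categories, but the branching is not recoverable from the transcript]

John  gave  a    painting  to  Sam
NP    VA    DET  N         P   NP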

Page 33: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

Regular and Irregular Rules

• Why does the model place a newly seen verb in the regular class?

Y → VA NP
Y → VA Z
Y → VP Z
VA → passed
VA → gave
VA → lent
VP → donated
VA / VP → sent

                            sent doesn’t alternate   sent alternates
Overall Evaluation (bits)   1703.6                   1703.4
Grammar (bits)              322.2                    321.0
Data (bits)                 1381.4                   1382.3

Regular constructions are preferred because the grammar is coded statistically
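The size of this preference can be made concrete by converting the difference in overall evaluation into odds. The sketch assumes the evaluations are base-2 code lengths, so that a difference of d bits corresponds to a probability ratio of 2^d. The variable names are illustrative.

```python
# Overall evaluations (bits) from the table above.
overall_bits = {"sent doesn't alternate": 1703.6, "sent alternates": 1703.4}

# A shorter description is better; d bits shorter means 2**d times more probable.
difference = overall_bits["sent doesn't alternate"] - overall_bits["sent alternates"]
odds = 2 ** difference
print(f"{difference:.1f} bits, odds of about {odds:.2f} to 1")
```

On this reading the preference for placing sent in the alternating class is real but slight, which fits a verb that was seen only once.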

Page 34: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

Why use Statistical Grammars?

Statistics are a valuable source of information

They help to infer when absences are due to chance

The learned grammar predicted that sent should appear in the double object construction

• but in 150 sentences it was only seen in the prepositional dative construction

• With a non-statistical grammar we need an explanation as to why this is

• A statistical grammar knows that sent is rare, which explains the absence of double object occurrences

Page 35: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

Scaling Up: Onnis, Roberts and Chater (2003)

Causative alternation:

John cut the string

* The string cut

* John arrived the train

The train arrived

John bounced the ball

The ball bounced

Page 36: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

Onnis et al’s Data

• Two word classes: N and V

• NV and VN only allowable sentences

16 verbs alternate: NV + VN

10 verbs NV only

10 verbs VN only

Coding scheme marks non-alternating verbs as exceptional (cost in coding scheme)

Page 37: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

Onnis et al’s Results

< 16,000 sentences all verbs alternate

> 16,000 sentences non alternating verbs classified as exceptional

No search mechanism

Just looked at evaluations with and without exceptions

In expression-induction model quasi-regularities appear as a result of chance omissions

Page 38: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

MDL and MML Issues

• Numeric parameters - accuracy

• Bayes’ optimal classification (not MAP learning) – Monte Carlo methods

If we see a sentence, work out the probability of it for each grammar

Weighted sum gives probability of sentence

• Unseen data – zero probability?
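The weighted sum can be sketched as follows. This illustration enumerates a handful of candidate grammars rather than sampling them by Monte Carlo, assumes description lengths in bits so that each grammar's weight is proportional to 2 to the power of minus its length, and uses made-up numbers. The function names are hypothetical.

```python
def posterior_weights(description_lengths):
    """Turn total description lengths (bits) into normalised posterior weights."""
    shortest = min(description_lengths)
    # Subtracting the shortest length keeps the powers of two in a safe range.
    unnormalised = [2.0 ** -(bits - shortest) for bits in description_lengths]
    total = sum(unnormalised)
    return [w / total for w in unnormalised]

def predictive_probability(sentence_probs, description_lengths):
    """Posterior-weighted sum over grammars of P(sentence | grammar)."""
    weights = posterior_weights(description_lengths)
    return sum(w * p for w, p in zip(weights, sentence_probs))

# Hypothetical values for three candidate grammars.
lengths = [100.0, 101.0, 103.0]  # total bits; weights come out as 8/13, 4/13, 1/13
probs = [0.0, 0.02, 0.10]        # P(sentence | grammar) for one unseen sentence
p = predictive_probability(probs, lengths)
```

In this example the single best grammar assigns the sentence zero probability, while the weighted sum does not. That is the difference between MAP learning and Bayes' optimal classification.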

Page 39: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

One and Two Part Codes

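[Diagram, two-part code: a bit string containing the grammar followed by the data coded in terms of the grammar is passed to a decoder]

1010100111010100101101010001100111100011010110

[Diagram, one-part code: a bit string in which data and grammar are combined is passed to a decoder, which outputs the grammar and the data]

1010100111010100101101010001100111100011010110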

Page 40: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

Coding English Texts

Grammar is a frequency for each letter and for space

• Counts start at one

• We decode a series of letters and update the counts for each letter

• All letters coded in terms of their probabilities at that point in the decoding

• At end we have a decoded text and grammar

Page 41: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

Decoding Example

Letter           Count   Count       Count       Count
A                1       2           2           2
B                1       1           2           3
C                1       1           1           1
Space            1       1           1           1
Decoded string           A (P=1/4)   B (P=1/5)   B (P=2/6)
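The decoding example can be reproduced with a short adaptive coder. This is a minimal sketch: the function name `adaptive_code` is hypothetical, and measuring the total in bits (base-2 logarithm) is an assumption.

```python
import math
from fractions import Fraction

def adaptive_code(text, alphabet):
    """Return each symbol's probability at the moment it is coded, and the total bits."""
    counts = {symbol: 1 for symbol in alphabet}  # counts start at one
    steps = []
    for symbol in text:
        # Probability uses the counts as they stand before this symbol is seen.
        probability = Fraction(counts[symbol], sum(counts.values()))
        steps.append((symbol, probability))
        counts[symbol] += 1  # update the count after coding the symbol
    total_bits = sum(-math.log2(float(p)) for _, p in steps)
    return steps, total_bits

steps, total_bits = adaptive_code("ABB", alphabet=["A", "B", "C", " "])
```

The three probabilities come out as 1/4, 1/5 and 1/3. `Fraction` reduces the slide's 2/6 to 1/3. Because encoder and decoder update the same counts in step, no separate table of letter frequencies has to be transmitted.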

Page 42: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

One-Part Grammars

Grammars can also be coded using one-part codes

• Start with no grammar, but have a probability associated with adding a new rule

• Each time we decode data we first choose to add a new rule, or use an existing one

Examples are Dowman (2000) and Venkataraman (1997)

Page 43: Minimum Description Length An Adequate Syntactic Theory? Mike Dowman 3 June 2005.

Conclusions

MDL can solve the poverty of the stimulus problem

But it doesn’t solve the problem of constraining the search for grammars

Coding schemes create learning biases

Statistical grammars and statistical coding of grammars can help learning