
Minimum Description Length: An Adequate Syntactic Theory?

Mike Dowman

3 June 2005

Chomsky’s Conceptualization of Language Acquisition

[Diagram: Primary Linguistic Data feeds the Language Acquisition Device, which produces an Individual's Knowledge of Language; the Language Acquisition Device is characterized by Linguistic Theory.]

Diachronic Theories

Hurford’s Diachronic Spiral

[Diagram: a cycle in which Primary Linguistic Data feeds the Language Acquisition Device, producing an Individual's Knowledge of Language, which through the Arena of Language Use generates new Primary Linguistic Data.]

Learnability

Poverty of the stimulus

Language is really complex

Obscure and abstract rules constrain wh-movement, pronoun binding, passive formation, etc.

Examples of E-language don’t give sufficient information to determine these rules

WH-movement

Who_i do you think Lord Emsworth will invite t_i?

Who_i do you think that Lord Emsworth will invite t_i?

Who_i do you think t_i will arrive first?

* Who_i do you think that t_i will arrive first?

Negative Evidence

• Some constructions seem impossible to learn without negative evidence

John gave a painting to the museum

John gave the museum a painting

John donated a painting to the museum

* John donated the museum a painting

Implicit Negative Evidence

If constructions don’t appear, can we just assume they’re not grammatical?

No – we only see a tiny proportion of possible, grammatical sentences

People generalize from examples they have seen to form new utterances

‘[U]nder exactly what circumstances does a child conclude that a nonwitnessed sentence is ungrammatical?’ (Pinker, 1989)

Learnability Proofs

Gold (1967): for languages to be learnable in the limit we must have:

• Negative evidence

• or a priori restrictions on possible languages

But learnable in the limit means being sure that we have determined the correct language

Statistical Learnability

Horning (1969):

• If grammars are statistical, so that utterances are produced with frequencies corresponding to the grammar, then languages are learnable

• But we can never be sure when the correct grammar has been found

• This just gets more likely as we see more data

Horning’s Proof

• Used Bayes’ rule

• More complex grammars are less probable a priori P(h)

• Statistical grammars can assign probabilities to data P(d | h)

• Search through all possible grammars, starting with the simplest

P(h | d) ∝ P(h) P(d | h)

MDL

Horning’s evaluation method for grammars can be seen as a form of Minimum Description Length

Simplest is best (Occam’s Razor)

Simplest means specifiable with the least amount of information

Information theory (Shannon, 1948) allows us to link probability and information:

Amount of information = -log Probability
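To make the link concrete, here is a minimal Python sketch (illustrative only, not code from the talk) showing that maximizing P(h) × P(d | h) is the same as minimizing the total description length -log2 P(h) - log2 P(d | h); the probabilities used are made up for the example.

```python
import math

def description_length(p_grammar, p_data_given_grammar):
    """Total code length in bits: -log2 P(h) - log2 P(d | h)."""
    return -math.log2(p_grammar) - math.log2(p_data_given_grammar)

# Hypothetical numbers, purely for illustration:
# a simple but permissive grammar vs. a more complex but restrictive one.
simple = description_length(p_grammar=1/4, p_data_given_grammar=1/1024)   # 2 + 10 = 12 bits
complex_ = description_length(p_grammar=1/64, p_data_given_grammar=1/16)  # 6 + 4 = 10 bits
print(simple, complex_)  # the grammar with the smaller total wins under MDL
```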

Encoding Grammars and Data

[Diagram: a binary string (1010100111010100101101010001100111100011010110) split into two parts, the grammar and the data coded in terms of the grammar, which a decoder reconstructs.]

A → B C

B → D E

E → {kangaroo, aeroplane, comedian}

D → {the, a, some}

C → {died, laughed, burped}

The comedian died

A kangaroo burped

The aeroplane laughed

Some comedian burped

Complexity and Probability

• More complex grammar: longer coding length, so lower probability

• More restrictive grammar: fewer choices for data, so each possibility has a higher probability

• Most restrictive grammar just lists all possible utterances: only the observed data is grammatical, so it has a high probability

• A simple grammar could be made that allowed any sentences: the grammar would have a high probability, but the data a very low one

MDL finds a middle ground between always generalizing and never generalizing
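To put numbers on this tradeoff, here is a small illustrative calculation (not the coding scheme used in the model described later; it simply assumes a uniform code over the 14-symbol inventory of the toy grammar on the 'Encoding Grammars and Data' slide, and the 4-sentence and 20-sentence corpora are hypothetical).

```python
import math

log2 = math.log2
INVENTORY = 14                   # A B C D E plus the 9 words of the toy language
SENTENCE_CHOICES = 3 * log2(3)   # one of 3 determiners, 3 nouns, 3 verbs

def compact_total(n_sentences):
    grammar = 18 * log2(INVENTORY)          # roughly 18 symbol tokens in the 5 rules
    data = n_sentences * SENTENCE_CHOICES   # each sentence: three three-way choices
    return grammar + data

def listing_total(n_sentences, n_distinct):
    grammar = 3 * n_distinct * log2(INVENTORY)  # store every distinct sentence verbatim
    data = n_sentences * log2(n_distinct)       # each observation: pick one listed sentence
    return grammar + data

for n, d in [(4, 4), (20, 18)]:
    print(n, round(compact_total(n), 1), round(listing_total(n, d), 1))
# With only 4 sentences the listing grammar is cheaper (no generalization yet);
# with 20 sentences the compact, generalizing grammar wins.
```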

Rampant Synonymy?

• Inductive inference (Solomonoff, 1960a)
• Kolmogorov complexity (Kolmogorov, 1965)
• Minimum Message Length (Wallace and Boulton, 1968)
• Algorithmic Information Theory (Chaitin, 1969)
• Minimum Description Length (Rissanen, 1978)
• Minimum Coding Length (Ellison, 1992)
• Bayesian Learning (Stolcke, 1994)
• Minimum Representation Length (Brent, 1996)

Evaluation and Search

• MDL principle gives us an evaluation criterion for grammars (with respect to corpora)

• But it doesn’t solve the problem of how to find the grammars in the first place

Search mechanism needed

Two Learnability Problems

• How to determine which of two or more grammars is best given some data

• How to guide the search for grammars so that we can find the correct one, without considering every logically possible grammar

MDL in Linguistics

• Solomonoff (1960b): ‘Mechanization of Linguistic Learning’

• Learning phrase structure grammars for simple ‘toy’ languages: Stolcke (1994), Langley and Stromsten (2000)

• Or real corpora: Chen (1995), Grünwald (1994)

• Or for language modelling in speech recognition systems: Starkie (2001)

Not Just Syntax!

• Phonology: Ellison (1992), Rissanen and Ristad (1994)

• Morphology: Brent (1993), Goldsmith (2001)

• Segmenting continuous speech: de Marcken (1996), Brent and Cartwright (1997)

MDL and Parameter Setting

• Briscoe (1999) and Rissanen and Ristad (1994) used MDL as part of parameter-setting learning mechanisms

MDL and Iterated Learning

• Briscoe (1999) used MDL as part of an expression-induction model

• Brighton (2002) investigated the effect of bottlenecks on an MDL learner

• Roberts et al. (2005) modelled lexical exceptions to syntactic rules

An Example: My Model

Learns simple phrase structure grammars

• Binary or non-branching rules:
  A → B C
  D → E
  F → tomato

• All derivations start from the special symbol S

• A null symbol in the 3rd position indicates a non-branching rule

Encoding Grammars

Grammars can be coded as lists of symbols, three per rule

• The first symbol is the rule’s left-hand side, the second and third its right-hand side

A, B, C, D, E, null, F, tomato, null

• First we have to encode the frequency of each symbol

1  S → NP VP  (3)
2  NP → john  (2)
3  NP → mary  (1)
4  VP → screamed  (2)
5  VP → died  (1)

Data: 1, 2, 4, 1, 2, 5, 1, 3, 4
Probabilities: 1: 3/3, 2: 2/3, 4: 2/3, 1: 3/3, 2: 2/3, …

We must record the frequency of each rule

Encoding Data

Total frequency for NP = 3

Total frequency for VP = 3

Total frequency for S = 3
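A minimal sketch of this data-coding step (illustrative, mirroring the probabilities listed above rather than the model's exact implementation): each rule use costs -log2 of its frequency divided by the total frequency of its left-hand side.

```python
import math

# Rule number -> (left-hand side, frequency), as in the example grammar above.
rules = {1: ("S", 3), 2: ("NP", 2), 3: ("NP", 1), 4: ("VP", 2), 5: ("VP", 1)}

# Total frequency of all rules sharing each left-hand side.
totals = {}
for lhs, freq in rules.values():
    totals[lhs] = totals.get(lhs, 0) + freq      # S: 3, NP: 3, VP: 3

derivation = [1, 2, 4, 1, 2, 5, 1, 3, 4]         # the data, as a sequence of rule choices

bits = 0.0
for r in derivation:
    lhs, freq = rules[r]
    p = freq / totals[lhs]                       # e.g. rule 1: 3/3, rule 2: 2/3
    bits += -math.log2(p)
print(round(bits, 2), "bits to code the data given the grammar")
```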

Encoding in My Model

[Diagram: the binary string 1010100111010100101101010001100111100011010110 passes through a decoder; parts of the string encode the symbol frequencies and the rule frequencies, and the decoder outputs the grammar and the data shown below.]

Symbol frequencies: S (1), NP (3), VP (3), john (1), mary (1), screamed (1), died (1), null (4)

Rule frequencies: Rule 1: 3, Rule 2: 2, Rule 3: 1, Rule 4: 2, Rule 5: 1

Grammar:
1  S → NP VP
2  NP → john
3  NP → mary
4  VP → screamed
5  VP → died

Data:
John screamed
John died
Mary screamed

Search Strategy

• Start with simple grammar that allows all sentences

• Make simple change and see if it improves the evaluation (add a rule, delete a rule, change a symbol in a rule, etc.)

• Annealing search

• First stage: just look at data coding length

• Second stage: look at overall evaluation
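A hedged sketch of such a search loop in Python (the acceptance rule and cooling schedule here are generic simulated annealing; evaluate and mutate are assumed to be supplied by the caller, and the two-stage evaluation described above, which first scores only the data coding length, is collapsed into a single evaluate for brevity).

```python
import math
import random

def mdl_search(initial_grammar, data, evaluate, mutate, steps=10_000):
    """Simulated-annealing search over grammars.

    evaluate(grammar, data) -> total coding length in bits (lower is better).
    mutate(grammar)         -> a copy with one small change (add a rule,
                               delete a rule, change a symbol in a rule, ...).
    """
    current = initial_grammar
    current_bits = evaluate(current, data)
    for step in range(steps):
        temperature = max(0.01, 10.0 * (1 - step / steps))   # cools over time
        candidate = mutate(current)
        candidate_bits = evaluate(candidate, data)
        delta = candidate_bits - current_bits
        # Always accept improvements; occasionally accept worse grammars early on.
        if delta < 0 or random.random() < math.exp(-delta / temperature):
            current, current_bits = candidate, candidate_bits
    return current, current_bits
```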

John hit Mary
Mary hit Ethel
Ethel ran
John ran
Mary ran
Ethel hit John
Noam hit John
Ethel screamed
Mary kicked Ethel
John hopes Ethel thinks Mary hit Ethel
Ethel thinks John ran
John thinks Ethel ran
Mary ran
Ethel hit Mary
Mary thinks John hit Ethel
John screamed
Noam hopes John screamed
Mary hopes Ethel hit John
Noam kicked Mary

Example: English

Learned Grammar

S → NP VP
VP → ran
VP → screamed
VP → Vt NP
VP → Vs S
Vt → hit
Vt → kicked
Vs → thinks
Vs → hopes
NP → John
NP → Ethel
NP → Mary
NP → Noam

Evaluations

[Bar chart: evaluation in bits (0 to 450) for the initial grammar and the learned grammar, broken down into overall evaluation, grammar, and data.]

Dative Alternation

• Children learn the distinction between alternating and non-alternating verbs

• Previously unseen verbs are used productively in both constructions

New verbs follow the regular pattern

• During learning, children use non-alternating verbs in both constructions

U-shaped learning

Training Data

• Three alternating verbs: gave, passed, lent

• One non-alternating verb: donated

• One verb seen only once: sent

The museum lent Sam a painting

John gave a painting to Sam

Sam donated John to the museum

The museum sent a painting to Sam

Dative Evaluations

[Bar chart: evaluation in bits (0 to 3500) for the initial grammar and the learned grammar, broken down into overall evaluation, grammar, and data.]

Grammar Properties

• Learned grammar distinguishes alternating and non-alternating verbs

• sent appears in alternating class

• With less data, only one class of verbs, so donated can appear in both constructions

• All sentences generated by the grammar are grammatical

• But structures are not right

Learned Structures

[Parse tree for 'John gave a painting to Sam': the words are labelled NP VA DET N P NP and grouped under the learned nonterminals X, Y, Z and NP, with S at the root.]

• Why does the model place a newly seen verb in the regular class?

Y → VA NP
Y → VA Z
Y → VP Z
VA → passed
VA → gave
VA → lent
VP → donated

VA / VP → sent

Regular and Irregular Rules

                           sent doesn’t alternate    sent alternates
Overall Evaluation (bits)          1703.6                 1703.4
Grammar (bits)                      322.2                  321.0
Data (bits)                        1381.4                 1382.3

Regular constructions are preferred because the grammar is coded statistically

Why use Statistical Grammars?

Statistics are a valuable source of information

They help to infer when absences are due to chance

The learned grammar predicted that sent should appear in the double object construction

• But in 150 sentences it was only seen in the prepositional dative construction

• With a non-statistical grammar we need an explanation as to why this is

• A statistical grammar knows that sent is rare, which explains the absence of double object occurrences
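A back-of-the-envelope sketch of the "absence due to chance" reasoning (the 50/50 split between the two constructions for an alternating verb is an assumption made purely for illustration):

```python
import math

n_sentences = 150          # size of the training corpus
p_sent = 1 / 150           # sent occurred only once, so its estimated frequency is low
p_double_object = 0.5      # assumed: an alternating verb uses each construction half the time

# Probability of seeing no double-object use of 'sent' at all in 150 sentences,
# even though 'sent' really does alternate.
p_no_occurrence = (1 - p_sent * p_double_object) ** n_sentences
print(round(p_no_occurrence, 2))   # about 0.61: the absence is consistent with chance
```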

Scaling Up: Onnis, Roberts and Chater (2003)

Causative alternation:

John cut the string
* The string cut
* John arrived the train
The train arrived
John bounced the ball
The ball bounced

Onnis et al’s Data

• Two word classes: N and V
• NV and VN are the only allowable sentences
• 16 verbs alternate: NV + VN
• 10 verbs: NV only
• 10 verbs: VN only

The coding scheme marks non-alternating verbs as exceptional (at a cost in the coding scheme)

Onnis et al’s Results

With fewer than 16,000 sentences, all verbs alternate

With more than 16,000 sentences, non-alternating verbs are classified as exceptional

No search mechanism: they just looked at evaluations with and without exceptions

In an expression-induction model, quasi-regularities appear as a result of chance omissions

MDL and MML Issues

• Numeric parameters - accuracy

• Bayes’ optimal classification (not MAP learning) – Monte Carlo methods

If we see a sentence, work out the probability of it for each grammar

Weighted sum gives probability of sentence

• Unseen data – zero probability?

One and Two Part Codes

[Diagram: in a two-part code, the binary string 1010100111010100101101010001100111100011010110 is split into the grammar and the data coded in terms of the grammar, which a decoder reconstructs. In a one-part code, the data and grammar are combined in a single string, and the decoder recovers the grammar and the data together.]

Coding English Texts

The grammar is a frequency for each letter and for space

• Counts start at one

• We decode a series of letters, and update the counts for each letter

• All letters are coded in terms of their probabilities at that point in the decoding

• At the end we have a decoded text and grammar

Decoding Example

Letter   Count   Count   Count   Count
A          1       2       2       2
B          1       1       2       3
C          1       1       1       1
Space      1       1       1       1

Decoded string:   A (P=1/4)   B (P=1/5)   B (P=2/6)
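A minimal Python sketch of this adaptive scheme (illustrative; it reproduces the probabilities in the table above):

```python
import math

def adaptive_code_lengths(text, alphabet=("A", "B", "C", " ")):
    """Code each symbol with probability count/total, then increment its count."""
    counts = {symbol: 1 for symbol in alphabet}    # counts start at one
    total_bits = 0.0
    for symbol in text:
        total = sum(counts.values())
        p = counts[symbol] / total
        print(symbol, f"P={counts[symbol]}/{total}")
        total_bits += -math.log2(p)
        counts[symbol] += 1                        # update the count after coding
    return total_bits

adaptive_code_lengths("ABB")   # prints A P=1/4, B P=1/5, B P=2/6, as in the table
```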

One-Part Grammars

Grammars can also be coded using one-part codes

• Start with no grammar, but have a probability associated with adding a new rule

• Each time we decode data, we first choose whether to add a new rule or to use an existing one

Examples are Dowman (2000) and Venkataraman (1997)
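A minimal sketch of the one-part idea (a Chinese-restaurant-style scheme used purely for illustration; it is not the scheme of Dowman (2000) or Venkataraman (1997), and it omits the cost of coding the content of each new rule):

```python
import math

def one_part_code_length(rule_sequence, p_new_rule_weight=1.0):
    """Code a sequence of rule uses with no separate grammar part.

    At each step, with probability proportional to p_new_rule_weight we
    introduce a brand-new rule; otherwise we reuse an existing rule with
    probability proportional to how often it has been used so far.
    """
    counts = {}
    bits = 0.0
    for rule in rule_sequence:
        total = sum(counts.values()) + p_new_rule_weight
        if rule in counts:
            p = counts[rule] / total       # reuse an existing rule
        else:
            p = p_new_rule_weight / total  # pay to introduce a new rule
            counts[rule] = 0               # (a real scheme would also code the rule itself)
        bits += -math.log2(p)
        counts[rule] += 1
    return bits

print(round(one_part_code_length([1, 2, 4, 1, 2, 5, 1, 3, 4]), 2))
```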

Conclusions

MDL can solve the poverty of the stimulus problem

But it doesn’t solve the problem of constraining the search for grammars

Coding schemes create learning biases

Statistical grammars and statistical coding of grammars can help learning