
Probabilistic Learning: Computational Learning Theory and the APS

Shalom Lappin* (King's College London)
*Joint work with Alex Clark (Royal Holloway College London)

NASSLLI 2010, University of Indiana, Bloomington
June 23, 2010



Outline

1 Statistical Language Models and Grammars

2 PAC Learning

3 VC Dimension and Learnability

4 Problems with the PAC Framework

5 Conclusions



Chomsky on Statistical Modeling of Grammar

Chomsky (1957) rejects the use of statistical methods to represent the distinction between grammatical and ungrammatical strings.

1 Colourless green ideas sleep furiously.
2 Furiously sleep ideas green colourless.

(1) and (2) both have a probability approaching nil (in 1957) of appearing in a corpus or actual speech.
(1) is syntactically well formed, even if semantically anomalous, while (2) is not.



Chomsky (1957, p. 17):

If we rank the sequences of a given length in order of statistical approximation to English, we will find both grammatical and ungrammatical sequences scattered throughout the list; there appears to be no particular relation between order of approximation and grammaticalness. Despite the undeniable interest and importance of semantic and statistical studies of language, they appear to have no direct relevance to the problem of determining or characterizing the set of grammatical utterances. I believe that we are forced to conclude that grammar is autonomous and independent of meaning, and that probabilistic models give no particular insight into some of the basic problems of syntactic structure.


A Smoothed Bigram Model

Chomsky moves from the claim that information theoretic methods cannot identify the set of grammatical sentences in the PLD to the conclusion that they are irrelevant to characterizing syntactic structure.
This argument is not sound.
Chomsky assumes a bigram model in which the probability of a word in a string depends on the word that immediately precedes it.
Pereira (2000) constructs a smoothed bigram model in which the probability of a word depends on the class of the prior word.


Pereira's model computes the conditional probability of a word w_i in a string with the formula

P(w_i | w_{i-1}) ≈ Σ_c P(w_i | c) P(c | w_{i-1})

where c ranges over the possible classes of w_{i-1}.
We can use distributional patterns of words in a corpus to learn their classes from training data.
Other procedures allow us to compute the values of the parameters P(w_i | c) and P(c | w_{i-1}) from this data.
When applied to (1) and (2), this model yields a five order of magnitude difference between their probability values for a corpus of newspaper text.
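As an illustration, here is a minimal sketch of a class-based smoothed bigram model in this spirit. The corpus format, the hand-assigned word classes, and the relative-frequency estimation are assumptions made for the example; Pereira's own model learns its classes and parameters from data rather than taking them as given.

```python
# A minimal sketch of a class-based (aggregate) bigram model in the spirit of
# Pereira (2000). The corpus format, the hand-assigned word classes, and the
# relative-frequency estimation are illustrative assumptions.

from collections import defaultdict

def train_class_bigram(tagged_corpus):
    """tagged_corpus: a list of sentences, each a list of (word, word_class) pairs."""
    p_word_given_class = defaultdict(lambda: defaultdict(float))  # P(w | c)
    p_class_given_prev = defaultdict(lambda: defaultdict(float))  # P(c | w_prev)
    class_counts = defaultdict(int)
    prev_counts = defaultdict(int)

    for sentence in tagged_corpus:
        for (prev_w, _), (w, c) in zip(sentence, sentence[1:]):
            p_word_given_class[c][w] += 1
            class_counts[c] += 1
            p_class_given_prev[prev_w][c] += 1
            prev_counts[prev_w] += 1

    # Normalize raw counts into conditional probabilities.
    for c, words in p_word_given_class.items():
        for w in words:
            words[w] /= class_counts[c]
    for w, classes in p_class_given_prev.items():
        for c in classes:
            classes[c] /= prev_counts[w]
    return p_word_given_class, p_class_given_prev

def smoothed_bigram_prob(w, prev_w, p_word_given_class, p_class_given_prev):
    """P(w | prev_w) ~ sum over classes c of P(w | c) * P(c | prev_w)."""
    return sum(p_word_given_class[c].get(w, 0.0) * p_c
               for c, p_c in p_class_given_prev[prev_w].items())

# Example with a tiny hand-tagged corpus (the POS-like classes are assumed):
corpus = [[("colourless", "ADJ"), ("green", "ADJ"), ("ideas", "NOUN"),
           ("sleep", "VERB"), ("furiously", "ADV")]]
pwc, pcw = train_class_bigram(corpus)
print(smoothed_bigram_prob("ideas", "green", pwc, pcw))  # 1.0 in this one-sentence corpus
```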


Independence Assumptions for Statistical Learning

A basic assumption of most statistical learning theories is that the events which the theory models are generated randomly, and independently of each other.
This assumption is specified in the principle that events are Independently and Identically Distributed (IID).
The IID assumption is an idealization, and it is open to obvious challenges in the case of sentences uttered in a discourse.
Local dependencies clearly do exist among sentences in particular discourse contexts.


Sustaining the IID over large Corpora

While local dependencies among sentences violate the IID assumption, it is reasonable to assume that the effect of these dependencies dissipates as the corpus increases in size.
It seems plausible to assume that the IID assumption generally holds over a large number of events.
If this is the case, then the probability distributions for large corpora tend to converge on the IID.
We will adopt it in lieu of a more refined characterization of the conditional probability relations among sentences in a large corpus.


Languages and Distributions

A language model specifies a probability distribution for the strings of a language.
It is reasonable to ask whether learning a language L involves acquiring a distinct formal representation of L, or whether we can reduce knowledge of L to knowing its distribution.
In the former case the target of probabilistic learning is a (possibly non-probabilistic) grammar.
In the latter the language model is itself the target of learning, and languages are identified directly with their distributions.


Arguments for Reducing Languages to Distributions

There are strong arguments for both views.
Identifying languages with their distributions is motivated by the fact that there is a substantial amount of psycholinguistic evidence showing that frequency based learning is central to language acquisition (see Chapter 11, Section 11.3 of the monograph).
Assuming that knowledge of a language consists in mastering a language model provides a natural and direct explanation for our capacity to filter out the noise of ill formed sentences in the PLD.
Taking the target of acquisition to be a language model eliminates an additional formal object, and so simplifies the account of language learning.


Arguments against Reducing Languages to Distributions

It is not possible to identify grammaticality directly with high frequency of occurrence.
Some grammatical sentences may occur with very low (or nil) frequency, while ungrammatical sentences appear in the PLD.
Specifying a precise relation between probability and grammaticality is a non-trivial task.
It requires a stochastic model of indirect negative evidence (see Chapter 6 of the monograph for a proposal).
We will leave the issue of reducing languages to distributions open.


Problems with the Assumptions of the IIL Paradigm

The characterization of convergence in the limit is too demanding in that it does not allow learners to approximate the target within reasonable limits of probability and confidence.
Because IIL does not constrain the set of presentations and it requires learning for all presentations, it imposes far greater demands on learnability than apply in actual human language learning.
By allowing learners unbounded amounts of data, time, and computational resources, IIL is unrealistically permissive.


PAC Learning: Probably

Unlike IIL, the PAC framework (Valiant 1984) requires that learning of a target hypothesis be to a high degree of probability.
This approach allows for unusual and perverse data sets from which learning of the target is not possible.
A PAC model incorporates a confidence parameter δ whose value specifies the lower probability bound on successful learning of the target, in relation to the size of the data sample.
Learners must acquire the target with probability 1 − δ, where δ decreases in size (approaching 0) as the amount of data increases.


PAC Learning: Approximately

The PAC framework characterizes the learning process as convergence on the target.
It allows for both errors of undergeneralization (L \ H) and of overgeneralization (H \ L).
These errors should decline in proportion to the size of the data sample.
The model includes a parameter ε whose value specifies the error rate threshold for successful learning.

ε_D(H) = Σ_{w ∈ L\H} p_D(w) + Σ_{w ∈ H\L} p_D(w)
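A minimal Python sketch of this error measure follows; the toy target language L, hypothesis H, and distribution D are invented for illustration and do not come from the slides.

```python
# A minimal sketch of the PAC error measure. The target language L, the
# hypothesis H, and the distribution D are toy values invented for illustration.

def pac_error(target, hypothesis, distribution):
    """Probability mass where target and hypothesis disagree: strings in L but
    not H (undergeneralization) plus strings in H but not L (overgeneralization)."""
    under = sum(p for w, p in distribution.items() if w in target and w not in hypothesis)
    over = sum(p for w, p in distribution.items() if w in hypothesis and w not in target)
    return under + over

D = {"ab": 0.4, "aabb": 0.3, "ba": 0.2, "abab": 0.1}  # distribution over strings
L = {"ab", "aabb", "abab"}                             # target language
H = {"ab", "aabb", "ba"}                               # learner's hypothesis
print(pac_error(L, H, D))  # 0.1 (misses "abab") + 0.2 (admits "ba") = 0.3
```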


Efficiency of Learning in the Size of the Data

The PAC framework requires that learning be efficient relative to the size of the available data.
The size of the data set needed to ensure convergence can only grow polynomially in relation to the values of the two parameters ε and δ.
Therefore, if a hypothesis H is PAC learnable, then the learning algorithm will converge on H with a data set whose size is determined by a function that specifies a polynomial in 1/ε and 1/δ.
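For illustration (this particular bound is not stated in the slides), a standard sample complexity result for a finite hypothesis class in the realizable case is N ≥ (1/ε)(ln |H| + ln(1/δ)): the required sample size is polynomial in 1/ε and 1/δ, exactly as the framework demands.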


Distribution Free and Uniform Learning

The classical PAC framework requires that learning be achieved for all probability distributions on the data.
Therefore PAC learning is distribution free.
It is also uniform for the members of a learnable class.
Uniform learning specifies that for a given ε, δ, there is an N such that for every language L in the learnable class, we can learn L to an accuracy of ε, and with a probability of at least 1 − δ, on the basis of N examples, where N is constant across the class.
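Restated with the quantifiers explicit (a paraphrase for clarity, not a formula from the slides): for all ε, δ > 0 there is an N(ε, δ) such that, for every language L in the class and every distribution D, N(ε, δ) examples drawn from D suffice to learn L to accuracy ε with probability at least 1 − δ; crucially, N depends only on ε and δ, not on the particular L or D.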


Defining the PAC Model

Let A be a learning algorithm, and 𝓛 a class of languages.

If there is a fixed polynomial p_n such that, for all L from 𝓛, for all distributions D, and for all ε, δ > 0, when the number of samples N is greater than p_n(1/ε, 1/δ), the probability that the error

ε_D(H) = Σ_{w ∈ L\H} p_D(w) + Σ_{w ∈ H\L} p_D(w)

of the hypothesis H returned by A is greater than ε is less than δ, then A has learned the class 𝓛.
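The criterion can be probed by simulation. The sketch below, with a toy language, distribution, and memorizing learner all invented for illustration, estimates how often the hypothesis returned by A has error above ε and compares that frequency with δ; it illustrates the definition for one language and one distribution, whereas PAC learnability of a class quantifies over every language in the class and every distribution.

```python
# A minimal simulation of the PAC criterion for a single target language and
# distribution. The toy language, distribution, and memorizing learner are
# assumptions made for illustration; a real PAC argument quantifies over all
# targets in the class and all distributions.

import random

def run_trial(target, distribution, learner, n_samples):
    words, probs = zip(*distribution.items())
    sample = random.choices(words, weights=probs, k=n_samples)
    hypothesis = learner(sample)
    # Error: probability mass on strings where hypothesis and target disagree.
    return sum(p for w, p in distribution.items() if (w in target) != (w in hypothesis))

def pac_check(target, distribution, learner, n_samples, epsilon, delta, trials=2000):
    """Estimate P(error > epsilon) over repeated samples and compare it with delta."""
    failures = sum(run_trial(target, distribution, learner, n_samples) > epsilon
                   for _ in range(trials))
    return failures / trials <= delta

# Toy instance: D puts all of its mass on members of L; the learner memorizes its sample.
D = {"ab": 0.5, "aabb": 0.3, "aaabbb": 0.15, "aaaabbbb": 0.05}
L = set(D)
memorizer = set  # the hypothesis is just the set of observed strings
print(pac_check(L, D, memorizer, n_samples=100, epsilon=0.1, delta=0.05))  # almost always True here
```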


VC Dimension and Shattering

The Vapnik-Chervonenkis (VC) dimension of a hypothesis space H is a measure of H's complexity for PAC learning.
It expresses a relation between H and the samples of a data set in terms of the maximal number of data points in a sample that the elements of H can cover or shatter.
A set of data points is shattered by H if, for every possible labelling of those points, some hypothesis in H is consistent with that labelling.
The VC dimension of H is a crucial factor in determining the learnability of its elements.
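A brute-force check of shattering for a small illustrative hypothesis class, threshold functions on the real line (h_t(x) = 1 iff x ≥ t), whose VC dimension is known to be 1. The class and the sample points are assumptions made for the example, not material from the slides.

```python
# Brute-force shattering check. The hypothesis class (threshold functions on the
# real line) and the sample points are illustrative assumptions; the class is a
# textbook example with VC dimension 1.

from itertools import product

def shatters(hypotheses, points):
    """True iff every labelling of the points is realized by some hypothesis."""
    return all(
        any(all(h(x) == y for x, y in zip(points, labelling)) for h in hypotheses)
        for labelling in product([0, 1], repeat=len(points))
    )

# A few threshold hypotheses h_t(x) = 1 iff x >= t.
thresholds = [lambda x, t=t: int(x >= t) for t in (-1.5, -0.5, 0.5, 1.5, 2.5)]

print(shatters(thresholds, [0.0]))       # True: one point can be labelled either way
print(shatters(thresholds, [0.0, 1.0]))  # False: no threshold labels 0.0 as 1 and 1.0 as 0
# Hence the VC dimension of this class is 1: it shatters some one-point sample,
# but no two-point sample.
```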


Learning in Finite and Infinite Hypothesis Spaces

Finiteness of H does not ensure computational efficiency of PAC learning.
Conversely, tractable convergence is possible in certain cases of an infinite H.
Hence, in the PAC framework the finiteness assumptions of the P&P view of UG do not, in themselves, solve the learning theoretic problems posed by language acquisition.


Efficiency of Learning in a Finite Hypothesis Space

The number of training examples required scales with log |H|, where H is the hypothesis space.
Within the P&P framework, assuming n binary parameters, the size of the hypothesis space is |H| = 2^n.
The size of |H| grows exponentially with the number of parameters, and so learning becomes increasingly difficult.
Also, if the parameters are interdependent and difficult to estimate from observed data, then a finite class may still not be efficiently learnable.
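For illustration, a standard textbook-style sample-complexity bound for consistent learners over a finite hypothesis class, m ≥ (1/ε)(ln|H| + ln(1/δ)), makes the log |H| dependence concrete; the bound and the parameter values below are an assumption-laden sketch, not taken from the slides.

```python
# Sketch: a standard sample-complexity bound for a consistent learner over a
# finite hypothesis class, m >= (1/eps) * (ln|H| + ln(1/delta)).
# With |H| = 2^n (n binary parameters), ln|H| = n * ln 2, so the required
# sample size grows linearly in n.  The eps and delta values are illustrative.

import math

def samples_needed(n_params, eps=0.05, delta=0.05):
    log_H = n_params * math.log(2)           # ln|H| for |H| = 2^n
    return math.ceil((log_H + math.log(1 / delta)) / eps)

for n in (10, 30, 100):
    print(n, samples_needed(n))   # e.g. 10 -> 199, 30 -> 476, 100 -> 1447
```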


Efficiency of Learning and the Size of the Hypothesis Space

In general, the feasibility of learning in a finite hypothesis space does not necessarily depend on the number of parameters, but on the size of H.
Therefore, positing a UG with a finite set of parameters that yields a finite hypothesis space H of possible grammars does not, in itself, entail the efficient learnability of the class of grammars that it specifies.
If |H| is too large, too much data will be required to permit efficient learning, even where H is finite.


PAC Learning in an Infinite Hypothesis Space

The VC-dimension of the space is critical in determining the rate of convergence on a target for an infinite hypothesis space.
The VC-dimension of H is the largest value of m such that there is a training sample of size m that is shattered by H.
A training sample is shattered by H if, for each of the 2^m possible labelings of the sample (assignments from {0,1} to its elements), there is a hypothesis in H that assigns that labeling.
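As a small sketch (invented example, not from the slides), the definition can be checked directly when the hypotheses are given explicitly: a sample is shattered iff all 2^m labelings are realised by some hypothesis.

```python
# Sketch: a sample is shattered by H iff every one of the 2^m labelings of the
# sample is realised by some hypothesis in H.  Here H is a toy, explicitly
# finite set of hypotheses, each a function from points to {0, 1}.

from itertools import product

def shatters(H, sample):
    realised = {tuple(h(x) for x in sample) for h in H}
    return all(labels in realised for labels in product((0, 1), repeat=len(sample)))

# Toy hypothesis "class": threshold functions 1[x >= t] for a few thresholds.
H = [lambda x, t=t: int(x >= t) for t in (0.0, 1.5, 2.5, 10.0)]

print(shatters(H, [1.0]))        # True: a single point can be labelled 0 or 1
print(shatters(H, [1.0, 2.0]))   # False: thresholds never produce (1, 0)
```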


Shattering and VC Dimensions: An Example

Assume that each member of H is associated with n real-valued parameters, and so the hypothesis space is uncountably infinite.
Suppose, for example, that the function to be learned maps points in one-dimensional space onto 0 and 1.
A hypothesis space H is a subset of all possible functions of this kind.


H might, for instance, contain just those functions that assign 1s to all points within an interval int of a line and 0s to all points outside of int.
The VC dimension of H is the cardinality of the largest set of points for which all possible labelings of the points are expressed by elements of H (they are shattered by H).
The VC dimension of an H consisting only of int functions is 2, as the hypotheses in H shatter any set of 2 points in a line, but not all sets of 3 points.


Intervals in a Real Number Line (Kearns and Vazirani 1994)

<—–(X)—-(Y)——>

a. <—–(1)—-(1)——>
b. <—–(1)—-(0)——>
c. <—–(0)—-(1)——>
d. <—–(0)—-(0)——>

a. <—[-(1)—-(1)-]—->
b. <—[-(1)-]–(0)——>
c. <—–(0)–[-(1)-]—->
d. <—–(0)—-(0)-[-]–>


Intervals in a Real Number Line: VC dimension is 2

The pair of points in a-d can be covered by all possible labelings that the interval hypotheses in H specify.

e. <—(1)—(0)—(1)—>

The labeling of the triple in e cannot be expressed by any single bracketing.
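The claim that intervals shatter any pair of points but not the labeling in e can be checked mechanically; the brute-force sketch below (with arbitrary point positions) is an illustration, not part of the original slides.

```python
# Sketch: brute-force check that interval hypotheses on the real line shatter
# any 2 points but cannot realise the 3-point labeling (1, 0, 1) shown above.
# A hypothesis is an interval [a, b]; it labels a point 1 iff a <= x <= b.

from itertools import product

def interval_shatters(points):
    """True if every {0,1}-labeling of points is realised by some interval."""
    for labels in product((0, 1), repeat=len(points)):
        positives = [x for x, y in zip(points, labels) if y == 1]
        if not positives:
            continue    # an interval containing no sample point handles all-0s
        a, b = min(positives), max(positives)
        realised = tuple(1 if a <= x <= b else 0 for x in points)
        if realised != labels:
            return False
    return True

print(interval_shatters([1.0, 2.0]))        # True: any pair is shattered
print(interval_shatters([1.0, 2.0, 3.0]))   # False: (1, 0, 1) has no interval
```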


Tractable Learning in an Infinite Hypothesis Space

A hypothesis space H has infinite VC-dimension if for any value of m, there is a training sample of size m that is shattered by H.
PAC learning converges in the limit if and only if the VC-dimension of the hypothesis space is finite.
The number of training examples required is roughly linear in the VC-dimension of the hypothesis space, and so efficient PAC learning is possible in an infinite H if the VC-dimension of H is relatively small.
It is possible to improve the convergence rate for PAC learning by adding a distributional bias over the elements of H.
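As a rough illustration of the "roughly linear" claim, a standard VC-based sample-complexity bound of the form m ≥ (c/ε)(d ln(1/ε) + ln(1/δ)) can be tabulated for a few values of the VC dimension d; the constant c and the parameter settings below are assumptions made for this sketch, not figures from the slides.

```python
# Sketch of a standard VC-based sample-complexity bound,
#   m >= (c / eps) * (d * ln(1/eps) + ln(1/delta)),
# which grows roughly linearly in the VC dimension d.  The constant c and the
# eps / delta values are illustrative only.

import math

def vc_samples_needed(d, eps=0.05, delta=0.05, c=4):
    return math.ceil((c / eps) * (d * math.log(1 / eps) + math.log(1 / delta)))

for d in (2, 10, 100):
    print(d, vc_samples_needed(d))   # required data grows linearly with d
```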


PAC Learning and the Class of Finite Languages

The class of finite languages is not uniformly PAC learnable, because a hypothesis space H_L for this class has infinite VC dimension.
H_L contains all finite subsets of Σ∗, the set of strings formed from a vocabulary Σ.
This set of subsets is infinite.
A sample observation set O, of any size, of strings from this vocabulary will be shattered by H_L.


The members of H_L will yield all possible Boolean values (1 for inclusion in a language of L and 0 for exclusion) for the elements of O, regardless of how large it is.
Therefore, this class is unlearnable in the PAC framework.
By contrast, the class of finite languages is identifiable in the limit in Gold's positive evidence only model, where tractability in resources of time and data is not a requirement for learning.
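A small sketch (invented sample) of why the VC dimension is infinite: for any finite sample O, every labeling is realised by the finite language consisting of exactly the positively labelled strings, so O is shattered, whatever its size.

```python
# Sketch: any finite sample O of strings is shattered by the class of finite
# languages, because each {0,1}-labeling is realised by the finite language
# containing exactly the positively labelled strings.  Toy sample only.

from itertools import product

O = ["a", "ab", "ba", "abba"]            # a sample observation set

for labels in product((0, 1), repeat=len(O)):
    witness = {w for w, y in zip(O, labels) if y == 1}   # a finite language
    assert tuple(1 if w in witness else 0 for w in O) == labels

print(f"All {2 ** len(O)} labelings of O are realised, so O is shattered.")
```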


PAC Learning and the Class of Regular Languages

The class of regular languages is also not uniformly PAC learnable for similar reasons.
The VC dimension of this class is infinite.
If H_L consists of the infinite set of possible Finite State Automata that generate a regular language L, then any sample of strings from the vocabulary of L will be shattered by H_L.


Imposing an Upper Bound on the Size of the Language Class

As Nowak et al. (2002) observe, by imposing an upper bound k on the cardinality of the sets of finite languages in H_L, one achieves finite VC dimension for this hypothesis space.
The VC-dimension of such an H_L whose elements are bounded in size is at most k.
In this case the class of languages in H_L is uniformly PAC learnable.
Similarly, bounding the size of regular grammars and CFGs results in finite VC dimension and renders these classes uniformly learnable.


PAC Learnability and Learning Priors

The fact that uniform PAC learning requires a finite VC dimension for the target class might appear to support a version of the APS.
The hypothesis class for language acquisition must be restricted to a set of grammars that has finite VC-dimension to ensure acquisition.
Learners need to have prior knowledge of these bounds on the target language class.
This learning prior is clearly domain specific, and so it entails a form of linguistic nativism.


PAC Learnability and Bounded Language Classes

In fact, this argument does not go through when we distinguish the target class from the hypothesis space.
As in the case of IIL, learners can formulate hypotheses that fall outside a PAC learnable class.
Any class of finite, regular, or context free languages can be learned up to an arbitrary cardinality bound k.


This cardinality bound need not be specified as a prior of the learning algorithm.
The algorithm can test progressively larger representations of a language against the data until it arrives at the target hypothesis.
So for any k, the class of finite/regular/context free languages of cardinality k can be uniformly PAC learned.
The union of these bounded classes gives the full unbounded class.
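A highly schematic sketch of this strategy follows; the names learn_with_growing_bound, hypotheses_of_size, and consistent are placeholders invented for the illustration, not constructs from the slides. The learner increases the bound k and returns the first hypothesis consistent with the sample, so no bound needs to be fixed in advance.

```python
# Schematic sketch: rather than fixing a cardinality bound k as a prior, the
# learner tries progressively larger bounds and returns the first hypothesis
# consistent with the sample.  The hypothesis enumeration and consistency
# check below are toy placeholders.

from itertools import combinations

def learn_with_growing_bound(sample, hypotheses_of_size, consistent, max_k=100):
    for k in range(1, max_k + 1):             # bound grows with the data
        for h in hypotheses_of_size(k):
            if consistent(h, sample):
                return h, k                   # smallest bound that fits the data
    return None, None

# Toy instantiation: hypotheses of "size" k are finite languages of at most k
# strings drawn from a small fixed universe.
def finite_languages_of_size(k, universe=("a", "ab", "ba")):
    return (set(c) for c in combinations(universe, min(k, len(universe))))

def consistent(h, sample):
    return all((w in h) == label for w, label in sample)

sample = [("a", 1), ("ab", 1), ("ba", 0)]
print(learn_with_growing_bound(sample, finite_languages_of_size, consistent))
```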


Labeled Data Samples

In the classical PAC learning framework all data samples are labeled for (non) membership in the set identified by the target hypothesis.
As in the case of the negative evidence version of IIL, this poses a serious difficulty for modeling language acquisition.
The sentences of the PLD are not labeled for grammaticality.
Specifying a version of PAC learning that uses positive evidence only requires a major revision of the framework.


A Positive Evidence only PAC Model

Defining a viable positive evidence only PAC framework is a non-trivial task.
A naive approach, on which all and only grammatical strings are assigned non-zero probability and so distributions are effectively restricted to grammatical strings, will not succeed.
If the standard characterization of the error rate is sustained, then an algorithm which treats every string as grammatical will have a null error rate.
Therefore, the model will be vacuous, as every class will be PAC learnable.
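A toy illustration of the vacuity point (invented example): if the distribution D puts no probability mass on ungrammatical strings, then the hypothesis that accepts every string has zero error under the standard error measure.

```python
# Sketch of why the naive positive-only model is vacuous: when D assigns zero
# probability to ungrammatical strings, the hypothesis that accepts every
# string has zero error.  Toy distribution; the "accept everything" hypothesis
# (Sigma*) is represented by a predicate.

L = {"the dog barks", "the dogs bark"}                 # target language
D = {"the dog barks": 0.7, "the dogs bark": 0.3}       # support restricted to L

accept_all = lambda w: True                            # H = Sigma*

# error = mass of L \ H (empty, since H accepts everything)
#       + mass of H \ L (zero, since D puts no weight outside L)
error = sum(p for w, p in D.items() if (w in L) != accept_all(w))
print(error)   # 0.0: the trivial hypothesis "learns" every class
```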


Distribution Free Learning

The classical PAC framework condition that learning be possible on all probability distributions for the data samples is problematic.
It corresponds to the IIL requirement that learning for a class be achieved under all presentations.
Some distributions are, in effect, adversarial, in that they assign high probability to data that is eccentric for a language, and in this way they undermine learning.
Imposing appropriate restrictions on possible distributions for PAC learning can render certain interesting classes of languages learnable.


Uniform Learning

The PAC framework requires that learning be at a uniform rate for the elements of a class, in relation to the available data, as specified by a constant upper bound on the size of the data set.
As we have seen in the case of infinite VC dimension, this condition is problematic if the class contains representations of unbounded complexity.
By allowing for non-uniform learning, in which different elements of such a class can be acquired at rates expressed by distinct polynomial functions on data sets, it is possible to expand the set of learnable classes.


Uniform Acquisition of Natural Languages

It is generally agreed that natural languages exhibit roughly the same degree of complexity in their grammars.
Christiansen and Chater (2008), Kirby (2001), Kirby (2007), and Kirby and Hurford (2002) explain this property on the basis of information theoretic conditions on transmission and learning, which shape the evolution of language.
If this account is correct, then the common complexity of languages is not due to UG, but largely to domain general constraints on human learning and information processing.
On this view, even if learning is, in general, non-uniform for target classes, language acquisition will be uniform across languages because of their shared complexity properties.


Conclusions

Probabilistic learning models capture stochastic elements of language acquisition that IIL does not.
PAC learning allows for gradual convergence on a target hypothesis, and it permits learning within a specified error rate.
It also requires that learning be efficient in its relation to the amount of required data.
The fact that language classes with infinite VC dimension are not uniformly learnable does not entail that an upper bound on the size of a possible grammar must be specified as a learning prior for language acquisition.
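For reference, the success criterion just mentioned (learning within a specified error rate, from an efficient amount of data) can be stated in the standard textbook form; this formulation is ours, not the slides'. A learner PAC-learns a class just in case, for every target in the class, every distribution D over the data, and all ε, δ ∈ (0, 1), after m(ε, δ) examples drawn i.i.d. from D it returns a hypothesis h with

\[
\Pr\bigl[\operatorname{err}_D(h)\le\epsilon\bigr]\;\ge\;1-\delta,
\]

where m(ε, δ) is bounded by a polynomial in 1/ε and 1/δ (and, for efficient PAC learning, the learner runs in time polynomial in these quantities). Here ε is the permitted error rate and 1 − δ the required confidence, which is what "probably approximately correct" abbreviates.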



We can explain the fact that the grammars of all natural languages appear to exhibit roughly the same degree of complexity without positing a strong set of domain specific learning priors.
The fact that PAC learning requires labeled data renders it an inappropriate model for language acquisition.
Its assumptions that learning is distribution free and uniform for the elements of a class are also problematic.
In order to develop a version of PAC learning that is useful for modeling acquisition we need to revise these assumptions.
