
Probabilistic Learning: Computational Learning Theory and the APS

Shalom Lappin* (King's College London)
*Joint work with Alex Clark (Royal Holloway College London)

NASSLLI 2010, University of Indiana, Bloomington
June 23, 2010



Outline

1 Statistical Language Models and Grammars

2 PAC Learning

3 VC Dimension and Learnability

4 Problems with the PAC Framework

5 Conclusions



Chomsky on Statistical Modeling of Grammar

Chomsky (1957) rejects the use of statistical methods to represent the distinction between grammatical and ungrammatical strings.

1 Colourless green ideas sleep furiously.
2 Furiously sleep ideas green colourless.

(1) and (2) both have a probability approaching nil (in 1957) of appearing in a corpus or actual speech.
(1) is syntactically well formed, even if semantically anomalous, while (2) is not.



Chomsky (1957, p. 17):

If we rank the sequences of a given length in order of statistical approximation to English, we will find both grammatical and ungrammatical sequences scattered throughout the list; there appears to be no particular relation between order of approximation and grammaticalness. Despite the undeniable interest and importance of semantic and statistical studies of language, they appear to have no direct relevance to the problem of determining or characterizing the set of grammatical utterances. I believe that we are forced to conclude that grammar is autonomous and independent of meaning, and that probabilistic models give no particular insight into some of the basic problems of syntactic structure.


A Smoothed Bigram Model

Chomsky moves from the claim that information theoretic methods cannot identify the set of grammatical sentences in the PLD to the conclusion that they are irrelevant to characterizing syntactic structure.
This argument is not sound.
Chomsky assumes a bigram model in which the probability of a word in a string depends on the word that immediately precedes it.
Pereira (2000) constructs a smoothed bigram model in which the probability of a word depends on the class of the prior word.


Pereira's model computes the conditional probability of a word w_i in a string with the formula

P(w_i | w_{i-1}) ≈ Σ_c P(w_i | c) P(c | w_{i-1})

where c ranges over the possible classes of w_{i-1}.
We can use distributional patterns of words in a corpus to learn their classes from training data.
Other procedures allow us to compute the values of the parameters P(w_i | c) and P(c | w_{i-1}) from this data.
When applied to (1) and (2), this model yields a five order of magnitude difference between their probability values for a corpus of newspaper text.
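As an illustration, here is a minimal sketch of a class-based smoothed bigram model in this spirit. The corpus format, the hand-assigned word classes, and the relative-frequency estimation are assumptions made for the example; Pereira's own model learns its classes and parameters from data rather than taking them as given.

```python
# A minimal sketch of a class-based (aggregate) bigram model in the spirit of
# Pereira (2000). The corpus format, the hand-assigned word classes, and the
# relative-frequency estimation are illustrative assumptions.

from collections import defaultdict

def train_class_bigram(tagged_corpus):
    """tagged_corpus: a list of sentences, each a list of (word, word_class) pairs."""
    p_word_given_class = defaultdict(lambda: defaultdict(float))  # P(w | c)
    p_class_given_prev = defaultdict(lambda: defaultdict(float))  # P(c | w_prev)
    class_counts = defaultdict(int)
    prev_counts = defaultdict(int)

    for sentence in tagged_corpus:
        for (prev_w, _), (w, c) in zip(sentence, sentence[1:]):
            p_word_given_class[c][w] += 1
            class_counts[c] += 1
            p_class_given_prev[prev_w][c] += 1
            prev_counts[prev_w] += 1

    # Normalize raw counts into conditional probabilities.
    for c, words in p_word_given_class.items():
        for w in words:
            words[w] /= class_counts[c]
    for w, classes in p_class_given_prev.items():
        for c in classes:
            classes[c] /= prev_counts[w]
    return p_word_given_class, p_class_given_prev

def smoothed_bigram_prob(w, prev_w, p_word_given_class, p_class_given_prev):
    """P(w | prev_w) ~ sum over classes c of P(w | c) * P(c | prev_w)."""
    return sum(p_word_given_class[c].get(w, 0.0) * p_c
               for c, p_c in p_class_given_prev[prev_w].items())

# Example with a tiny hand-tagged corpus (the POS-like classes are assumed):
corpus = [[("colourless", "ADJ"), ("green", "ADJ"), ("ideas", "NOUN"),
           ("sleep", "VERB"), ("furiously", "ADV")]]
pwc, pcw = train_class_bigram(corpus)
print(smoothed_bigram_prob("ideas", "green", pwc, pcw))  # 1.0 in this one-sentence corpus
```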


Independence Assumptions for Statistical Learning

A basic assumption of most statistical learning theories is that the events which the theory models are generated randomly, and independently of each other.
This assumption is specified in the principle that events are Independently and Identically Distributed (IID).
The IID assumption is an idealization, and it is open to obvious challenges in the case of sentences uttered in a discourse.
Local dependencies clearly do exist among sentences in particular discourse contexts.


Sustaining the IID over large Corpora

While local dependencies among sentences violate the IID assumption, it is reasonable to assume that the effect of these dependencies dissipates as the corpus increases in size.
It seems plausible to assume that the IID assumption generally holds over a large number of events.
If this is the case, then the probability distributions for large corpora tend to converge on the IID.
We will adopt it in lieu of a more refined characterization of the conditional probability relations among sentences in a large corpus.


Languages and Distributions

A language model specifies a probability distribution for the strings of a language.
It is reasonable to ask whether learning a language L involves acquiring a distinct formal representation of L, or whether we can reduce knowledge of L to knowing its distribution.
In the former case the target of probabilistic learning is a (possibly non-probabilistic) grammar.
In the latter the language model is itself the target of learning, and languages are identified directly with their distributions.


Arguments for Reducing Languages to Distributions

There are strong arguments for both views.
Identifying languages with their distributions is motivated by the fact that there is a substantial amount of psycholinguistic evidence showing that frequency based learning is central to language acquisition (see Chapter 11, Section 11.3 of the monograph).
Assuming that knowledge of a language consists in mastering a language model provides a natural and direct explanation for our capacity to filter out the noise of ill formed sentences in the PLD.
Taking the target of acquisition to be a language model eliminates an additional formal object, and so simplifies the account of language learning.


Arguments against Reducing Languages to Distributions

It is not possible to identify grammaticality directly with high frequency of occurrence.
Some grammatical sentences may occur with very low (or nil) frequency, while ungrammatical sentences appear in the PLD.
Specifying a precise relation between probability and grammaticality is a non-trivial task.
It requires a stochastic model of indirect negative evidence (see Chapter 6 of the monograph for a proposal).
We will leave the issue of reducing languages to distributions open.


Problems with the Assumptions of the IIL Paradigm

The characterization of convergence in the limit is too demanding in that it does not allow learners to approximate the target within reasonable limits of probability and confidence.
Because IIL does not constrain the set of presentations and it requires learning for all presentations, it imposes far greater demands on learnability than apply in actual human language learning.
By allowing learners unbounded amounts of data, time, and computational resources, IIL is unrealistically permissive.


PAC Learning: Probably

Unlike IIL, the PAC framework (Valiant 1984) requires that learning of a target hypothesis be to a high degree of probability.
This approach allows for unusual and perverse data sets from which learning of the target is not possible.
A PAC model incorporates a confidence parameter δ whose value specifies the lower probability bound on successful learning of the target, in relation to the size of the data sample.
Learners must acquire the target with probability 1 − δ, where δ decreases in size (approaching 0) as the amount of data increases.


PAC Learning: Approximately

The PAC framework characterizes the learning process as convergence on the target.
It allows for both errors of undergeneralization (L \ H) and of overgeneralization (H \ L).
These errors should decline in proportion to the size of the data sample.
The model includes a parameter ε whose value specifies the error rate threshold for successful learning.

ε_D(H) = Σ_{w ∈ L\H} p_D(w) + Σ_{w ∈ H\L} p_D(w)
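A minimal Python sketch of this error measure follows; the toy target language L, hypothesis H, and distribution D are invented for illustration and do not come from the slides.

```python
# A minimal sketch of the PAC error measure. The target language L, the
# hypothesis H, and the distribution D are toy values invented for illustration.

def pac_error(target, hypothesis, distribution):
    """Probability mass where target and hypothesis disagree: strings in L but
    not H (undergeneralization) plus strings in H but not L (overgeneralization)."""
    under = sum(p for w, p in distribution.items() if w in target and w not in hypothesis)
    over = sum(p for w, p in distribution.items() if w in hypothesis and w not in target)
    return under + over

D = {"ab": 0.4, "aabb": 0.3, "ba": 0.2, "abab": 0.1}  # distribution over strings
L = {"ab", "aabb", "abab"}                             # target language
H = {"ab", "aabb", "ba"}                               # learner's hypothesis
print(pac_error(L, H, D))  # 0.1 (misses "abab") + 0.2 (admits "ba") = 0.3
```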


Efficiency of Learning in the Size of the Data

The PAC framework requires that learning be efficient relative to the size of the available data.
The size of the data set needed to ensure convergence can only grow polynomially in relation to the values of the two parameters ε and δ.
Therefore, if a hypothesis H is PAC learnable, then the learning algorithm will converge on H with a data set whose size is determined by a function that specifies a polynomial in 1/ε and 1/δ.
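For illustration (this particular bound is not stated in the slides), a standard sample complexity result for a finite hypothesis class in the realizable case is N ≥ (1/ε)(ln |H| + ln(1/δ)): the required sample size is polynomial in 1/ε and 1/δ, exactly as the framework demands.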


Distribution Free and Uniform Learning

The classical PAC framework requires that learning be achieved for all probability distributions on the data.
Therefore PAC learning is distribution free.
It is also uniform for the members of a learnable class.
Uniform learning specifies that for a given ε, δ, there is an N such that for every language L in the learnable class, we can learn L to an accuracy of ε, and with a probability of at least 1 − δ, on the basis of N examples, where N is constant across the class.
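Restated with the quantifiers explicit (a paraphrase for clarity, not a formula from the slides): for all ε, δ > 0 there is an N(ε, δ) such that, for every language L in the class and every distribution D, N(ε, δ) examples drawn from D suffice to learn L to accuracy ε with probability at least 1 − δ; crucially, N depends only on ε and δ, not on the particular L or D.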


Defining the PAC Model

Let A be a learning algorithm, and 𝓛 a class of languages.

If there is a fixed polynomial p_n such that, for all L from 𝓛, for all distributions D, and for all ε, δ > 0, when the number of samples N is greater than p_n(1/ε, 1/δ), the probability that the error

ε_D(H) = Σ_{w ∈ L\H} p_D(w) + Σ_{w ∈ H\L} p_D(w)

of the hypothesis H returned by A is greater than ε is less than δ, then A has learned the class 𝓛.
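The criterion can be probed by simulation. The sketch below, with a toy language, distribution, and memorizing learner all invented for illustration, estimates how often the hypothesis returned by A has error above ε and compares that frequency with δ; it illustrates the definition for one language and one distribution, whereas PAC learnability of a class quantifies over every language in the class and every distribution.

```python
# A minimal simulation of the PAC criterion for a single target language and
# distribution. The toy language, distribution, and memorizing learner are
# assumptions made for illustration; a real PAC argument quantifies over all
# targets in the class and all distributions.

import random

def run_trial(target, distribution, learner, n_samples):
    words, probs = zip(*distribution.items())
    sample = random.choices(words, weights=probs, k=n_samples)
    hypothesis = learner(sample)
    # Error: probability mass on strings where hypothesis and target disagree.
    return sum(p for w, p in distribution.items() if (w in target) != (w in hypothesis))

def pac_check(target, distribution, learner, n_samples, epsilon, delta, trials=2000):
    """Estimate P(error > epsilon) over repeated samples and compare it with delta."""
    failures = sum(run_trial(target, distribution, learner, n_samples) > epsilon
                   for _ in range(trials))
    return failures / trials <= delta

# Toy instance: D puts all of its mass on members of L; the learner memorizes its sample.
D = {"ab": 0.5, "aabb": 0.3, "aaabbb": 0.15, "aaaabbbb": 0.05}
L = set(D)
memorizer = set  # the hypothesis is just the set of observed strings
print(pac_check(L, D, memorizer, n_samples=100, epsilon=0.1, delta=0.05))  # almost always True here
```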


VC Dimension and Shattering

The Vapnik-Chervonenkis (VC) dimension of a hypothesis space H is a measure of H's complexity for PAC learning.
It expresses a relation between H and the samples of a data set in terms of the maximal number of data points in a sample that the elements of H can cover or shatter.
A set of data points is shattered by H if, for every possible labelling of those points, some hypothesis in H is consistent with that labelling.
The VC dimension of H is a crucial factor in determining the learnability of its elements.
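A brute-force check of shattering for a small illustrative hypothesis class, threshold functions on the real line (h_t(x) = 1 iff x ≥ t), whose VC dimension is known to be 1. The class and the sample points are assumptions made for the example, not material from the slides.

```python
# Brute-force shattering check. The hypothesis class (threshold functions on the
# real line) and the sample points are illustrative assumptions; the class is a
# textbook example with VC dimension 1.

from itertools import product

def shatters(hypotheses, points):
    """True iff every labelling of the points is realized by some hypothesis."""
    return all(
        any(all(h(x) == y for x, y in zip(points, labelling)) for h in hypotheses)
        for labelling in product([0, 1], repeat=len(points))
    )

# A few threshold hypotheses h_t(x) = 1 iff x >= t.
thresholds = [lambda x, t=t: int(x >= t) for t in (-1.5, -0.5, 0.5, 1.5, 2.5)]

print(shatters(thresholds, [0.0]))       # True: one point can be labelled either way
print(shatters(thresholds, [0.0, 1.0]))  # False: no threshold labels 0.0 as 1 and 1.0 as 0
# Hence the VC dimension of this class is 1: it shatters some one-point sample,
# but no two-point sample.
```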


Learning in Finite and Infinite Hypothesis Spaces

Finiteness of H does not ensure computational efficiency of PAC learning.
Conversely, tractable convergence is possible in certain cases of an infinite H.
Hence, in the PAC framework the finiteness assumptions of the P&P view of UG do not, in themselves, solve the learning theoretic problems posed by language acquisition.


Efficiency of Learning in a Finite Hypothesis Space

The number of training examples required scales with log |H|, where H is the hypothesis space.
Within the P&P framework, assuming n binary parameters, the size of the hypothesis space is |H| = 2^n.
The size of |H| grows exponentially with the number of parameters, and so learning becomes increasingly difficult.
Also, if the parameters are interdependent and difficult to estimate from observed data, then a finite class may still not be efficiently learnable.
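For illustration, a standard textbook-style sample-complexity bound for consistent learners over a finite hypothesis class, m ≥ (1/ε)(ln|H| + ln(1/δ)), makes the log |H| dependence concrete; the bound and the parameter values below are an assumption-laden sketch, not taken from the slides.

```python
# Sketch: a standard sample-complexity bound for a consistent learner over a
# finite hypothesis class, m >= (1/eps) * (ln|H| + ln(1/delta)).
# With |H| = 2^n (n binary parameters), ln|H| = n * ln 2, so the required
# sample size grows linearly in n.  The eps and delta values are illustrative.

import math

def samples_needed(n_params, eps=0.05, delta=0.05):
    log_H = n_params * math.log(2)           # ln|H| for |H| = 2^n
    return math.ceil((log_H + math.log(1 / delta)) / eps)

for n in (10, 30, 100):
    print(n, samples_needed(n))   # e.g. 10 -> 199, 30 -> 476, 100 -> 1447
```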


Efficiency of Learning and the Size of the Hypothesis Space

In general, the feasibility of learning in a finite hypothesis space does not necessarily depend on the number of parameters, but on the size of H.
Therefore, positing a UG with a finite set of parameters that yields a finite hypothesis space H of possible grammars does not, in itself, entail the efficient learnability of the class of grammars that it specifies.
If |H| is too large, too much data will be required to permit efficient learning, even where H is finite.


PAC Learning in an Infinite Hypothesis Space

The VC-dimension of the space is critical in determining the rate of convergence on a target for an infinite hypothesis space.
The VC-dimension of H is the largest value of m such that there is a training sample of size m that is shattered by H.
A training sample is shattered by H if, for each of the 2^m possible labelings of the sample (assignments from {0,1} to its elements), there is a hypothesis in H that assigns that labeling.
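As a small sketch (invented example, not from the slides), the definition can be checked directly when the hypotheses are given explicitly: a sample is shattered iff all 2^m labelings are realised by some hypothesis.

```python
# Sketch: a sample is shattered by H iff every one of the 2^m labelings of the
# sample is realised by some hypothesis in H.  Here H is a toy, explicitly
# finite set of hypotheses, each a function from points to {0, 1}.

from itertools import product

def shatters(H, sample):
    realised = {tuple(h(x) for x in sample) for h in H}
    return all(labels in realised for labels in product((0, 1), repeat=len(sample)))

# Toy hypothesis "class": threshold functions 1[x >= t] for a few thresholds.
H = [lambda x, t=t: int(x >= t) for t in (0.0, 1.5, 2.5, 10.0)]

print(shatters(H, [1.0]))        # True: a single point can be labelled 0 or 1
print(shatters(H, [1.0, 2.0]))   # False: thresholds never produce (1, 0)
```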


Shattering and VC Dimensions: An Example

Assume that each member of H is associated with n real-valued parameters, and so the hypothesis space is uncountably infinite.
Suppose, for example, that the function to be learned maps points in one-dimensional space onto 0 and 1.
A hypothesis space H is a subset of all possible functions of this kind.


H might, for instance, contain just those functions that assign 1s to all points within an interval int of a line and 0s to all points outside of int.
The VC dimension of H is the cardinality of the largest set of points for which all possible labelings of the points are expressed by elements of H (they are shattered by H).
The VC dimension of an H consisting only of int functions is 2, as the hypotheses in H shatter any set of 2 points in a line, but not all sets of 3 points.


Intervals in a Real Number Line (Kearns and Vazirani 1994)

<—–(X)—-(Y)——>

a. <—–(1)—-(1)——>
b. <—–(1)—-(0)——>
c. <—–(0)—-(1)——>
d. <—–(0)—-(0)——>

a. <—[-(1)—-(1)-]—->
b. <—[-(1)-]–(0)——>
c. <—–(0)–[-(1)-]—->
d. <—–(0)—-(0)-[-]–>


Intervals in a Real Number Line: VC dimension is 2

The pair of points in a-d can be covered by all possible labelings that the interval hypotheses in H specify.

e. <—(1)—(0)—(1)—>

The labeling of the triple in e cannot be expressed by any single bracketing.
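The claim that intervals shatter any pair of points but not the labeling in e can be checked mechanically; the brute-force sketch below (with arbitrary point positions) is an illustration, not part of the original slides.

```python
# Sketch: brute-force check that interval hypotheses on the real line shatter
# any 2 points but cannot realise the 3-point labeling (1, 0, 1) shown above.
# A hypothesis is an interval [a, b]; it labels a point 1 iff a <= x <= b.

from itertools import product

def interval_shatters(points):
    """True if every {0,1}-labeling of points is realised by some interval."""
    for labels in product((0, 1), repeat=len(points)):
        positives = [x for x, y in zip(points, labels) if y == 1]
        if not positives:
            continue    # an interval containing no sample point handles all-0s
        a, b = min(positives), max(positives)
        realised = tuple(1 if a <= x <= b else 0 for x in points)
        if realised != labels:
            return False
    return True

print(interval_shatters([1.0, 2.0]))        # True: any pair is shattered
print(interval_shatters([1.0, 2.0, 3.0]))   # False: (1, 0, 1) has no interval
```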


Tractable Learning in an Infinite Hypothesis Space

A hypothesis space H has infinite VC-dimension if for any value of m, there is a training sample of size m that is shattered by H.
PAC learning converges in the limit if and only if the VC-dimension of the hypothesis space is finite.
The number of training examples required is roughly linear in the VC-dimension of the hypothesis space, and so efficient PAC learning is possible in an infinite H if the VC-dimension of H is relatively small.
It is possible to improve the convergence rate for PAC learning by adding a distributional bias over the elements of H.
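As a rough illustration of the "roughly linear" claim, a standard VC-based sample-complexity bound of the form m ≥ (c/ε)(d ln(1/ε) + ln(1/δ)) can be tabulated for a few values of the VC dimension d; the constant c and the parameter settings below are assumptions made for this sketch, not figures from the slides.

```python
# Sketch of a standard VC-based sample-complexity bound,
#   m >= (c / eps) * (d * ln(1/eps) + ln(1/delta)),
# which grows roughly linearly in the VC dimension d.  The constant c and the
# eps / delta values are illustrative only.

import math

def vc_samples_needed(d, eps=0.05, delta=0.05, c=4):
    return math.ceil((c / eps) * (d * math.log(1 / eps) + math.log(1 / delta)))

for d in (2, 10, 100):
    print(d, vc_samples_needed(d))   # required data grows linearly with d
```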


PAC Learning and the Class of Finite Languages

The class of finite languages is not uniformly PAC learnable, because a hypothesis space H_L for this class has infinite VC dimension.
H_L contains all finite subsets of Σ∗, the set of strings formed from a vocabulary Σ.
This set of subsets is infinite.
A sample observation set O, of any size, of strings from this vocabulary will be shattered by H_L.


The members of H_L will yield all possible Boolean values (1 for inclusion in a language of L and 0 for exclusion) for the elements of O, regardless of how large it is.
Therefore, this class is unlearnable in the PAC framework.
By contrast, the class of finite languages is identifiable in the limit in Gold's positive evidence only model, where tractability in resources of time and data is not a requirement for learning.
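A small sketch (invented sample) of why the VC dimension is infinite: for any finite sample O, every labeling is realised by the finite language consisting of exactly the positively labelled strings, so O is shattered, whatever its size.

```python
# Sketch: any finite sample O of strings is shattered by the class of finite
# languages, because each {0,1}-labeling is realised by the finite language
# containing exactly the positively labelled strings.  Toy sample only.

from itertools import product

O = ["a", "ab", "ba", "abba"]            # a sample observation set

for labels in product((0, 1), repeat=len(O)):
    witness = {w for w, y in zip(O, labels) if y == 1}   # a finite language
    assert tuple(1 if w in witness else 0 for w in O) == labels

print(f"All {2 ** len(O)} labelings of O are realised, so O is shattered.")
```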


PAC Learning and the Class of Regular Languages

The class of regular languages is also not uniformly PAC learnable for similar reasons.
The VC dimension of this class is infinite.
If H_L consists of the infinite set of possible Finite State Automata that generate a regular language L, then any sample of strings from the vocabulary of L will be shattered by H_L.


Imposing an Upper Bound on the Size of the Language Class

As Nowak et al. (2002) observe, by imposing an upper bound k on the cardinality of the sets of finite languages in H_L, one achieves finite VC dimension for this hypothesis space.
The VC-dimension of such an H_L whose elements are bounded in size is at most k.
In this case the class of languages in H_L is uniformly PAC learnable.
Similarly, bounding the size of regular grammars and CFGs results in finite VC dimension and renders these classes uniformly learnable.


PAC Learnability and Learning Priors

The fact that uniform PAC learning requires a finite VC dimension for the target class might appear to support a version of the APS.
The hypothesis class for language acquisition must be restricted to a set of grammars that has finite VC-dimension to ensure acquisition.
Learners need to have prior knowledge of these bounds on the target language class.
This learning prior is clearly domain specific, and so it entails a form of linguistic nativism.


PAC Learnability and Bounded Language Classes

In fact, this argument does not go through when we distinguish the target class from the hypothesis space.
As in the case of IIL, learners can formulate hypotheses that fall outside a PAC learnable class.
Any class of finite, regular, or context free languages can be learned up to an arbitrary cardinality bound k.


This cardinality bound need not be specified as a prior of the learning algorithm.
The algorithm can test progressively larger representations of a language against the data until it arrives at the target hypothesis.
So for any k, the class of finite/regular/context free languages of cardinality k can be uniformly PAC learned.
The union of these bounded classes gives the full unbounded class.
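A highly schematic sketch of this strategy follows; the names learn_with_growing_bound, hypotheses_of_size, and consistent are placeholders invented for the illustration, not constructs from the slides. The learner increases the bound k and returns the first hypothesis consistent with the sample, so no bound needs to be fixed in advance.

```python
# Schematic sketch: rather than fixing a cardinality bound k as a prior, the
# learner tries progressively larger bounds and returns the first hypothesis
# consistent with the sample.  The hypothesis enumeration and consistency
# check below are toy placeholders.

from itertools import combinations

def learn_with_growing_bound(sample, hypotheses_of_size, consistent, max_k=100):
    for k in range(1, max_k + 1):             # bound grows with the data
        for h in hypotheses_of_size(k):
            if consistent(h, sample):
                return h, k                   # smallest bound that fits the data
    return None, None

# Toy instantiation: hypotheses of "size" k are finite languages of at most k
# strings drawn from a small fixed universe.
def finite_languages_of_size(k, universe=("a", "ab", "ba")):
    return (set(c) for c in combinations(universe, min(k, len(universe))))

def consistent(h, sample):
    return all((w in h) == label for w, label in sample)

sample = [("a", 1), ("ab", 1), ("ba", 0)]
print(learn_with_growing_bound(sample, finite_languages_of_size, consistent))
```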


Labeled Data Samples

In the classical PAC learning framework all data samples are labeled for (non) membership in the set identified by the target hypothesis.
As in the case of the negative evidence version of IIL, this poses a serious difficulty for modeling language acquisition.
The sentences of the PLD are not labeled for grammaticality.
Specifying a version of PAC learning that uses positive evidence only requires a major revision of the framework.


A Positive Evidence only PAC Model

Defining a viable positive evidence only PAC framework is a non-trivial task.
A naive approach, on which all and only grammatical strings are assigned non-zero probability and so distributions are effectively restricted to grammatical strings, will not succeed.
If the standard characterization of the error rate is sustained, then an algorithm which treats every string as grammatical will have a null error rate.
Therefore, the model will be vacuous, as every class will be PAC learnable.
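A toy illustration of the vacuity point (invented example): if the distribution D puts no probability mass on ungrammatical strings, then the hypothesis that accepts every string has zero error under the standard error measure.

```python
# Sketch of why the naive positive-only model is vacuous: when D assigns zero
# probability to ungrammatical strings, the hypothesis that accepts every
# string has zero error.  Toy distribution; the "accept everything" hypothesis
# (Sigma*) is represented by a predicate.

L = {"the dog barks", "the dogs bark"}                 # target language
D = {"the dog barks": 0.7, "the dogs bark": 0.3}       # support restricted to L

accept_all = lambda w: True                            # H = Sigma*

# error = mass of L \ H (empty, since H accepts everything)
#       + mass of H \ L (zero, since D puts no weight outside L)
error = sum(p for w, p in D.items() if (w in L) != accept_all(w))
print(error)   # 0.0: the trivial hypothesis "learns" every class
```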


Distribution Free Learning

The classical PAC framework condition that learning be possible on all probability distributions for the data samples is problematic.
It corresponds to the IIL requirement that learning for a class be achieved under all presentations.
Some distributions are, in effect, adversarial, in that they assign high probability to data that is eccentric for a language, and in this way they undermine learning.
Imposing appropriate restrictions on possible distributions for PAC learning can render certain interesting classes of languages learnable.


Uniform Learning

The PAC framework requires that learning be at a uniform rate for the elements of a class, in relation to the available data, as specified by a constant upper bound on the size of the data set.
As we have seen in the case of infinite VC dimension, this condition is problematic if the class contains representations of unbounded complexity.
By allowing for non-uniform learning, in which different elements of such a class can be acquired at rates expressed by distinct polynomial functions on data sets, it is possible to expand the set of learnable classes.


Uniform Acquisition of Natural Languages

It is generally agreed that natural languages exhibit roughly the same degree of complexity in their grammars.
Christiansen and Chater (2008), Kirby (2001), Kirby (2007), and Kirby and Hurford (2002) explain this property on the basis of information theoretic conditions on transmission and learning, which shape the evolution of language.
If this account is correct, then the common complexity of languages is not due to UG, but largely to domain general constraints on human learning and information processing.
On this view, even if learning is, in general, non-uniform for target classes, language acquisition will be uniform across languages because of their shared complexity properties.


Conclusions

Probabilistic learning models capture stochastic elements of language acquisition that IIL does not.
PAC learning allows for gradual convergence on a target hypothesis, and it permits learning within a specified error rate.
It also requires that learning be efficient in its relation to the amount of required data.
The fact that language classes with infinite VC dimension are not uniformly learnable does not entail that an upper bound on the size of a possible grammar must be specified as a learning prior for language acquisition.
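For reference, the success criterion just mentioned (learning within a specified error rate, from an efficient amount of data) can be stated in the standard textbook form; this formulation is ours, not the slides'. A learner PAC-learns a class just in case, for every target in the class, every distribution D over the data, and all ε, δ ∈ (0, 1), after m(ε, δ) examples drawn i.i.d. from D it returns a hypothesis h with

\[
\Pr\bigl[\operatorname{err}_D(h)\le\epsilon\bigr]\;\ge\;1-\delta,
\]

where m(ε, δ) is bounded by a polynomial in 1/ε and 1/δ (and, for efficient PAC learning, the learner runs in time polynomial in these quantities). Here ε is the permitted error rate and 1 − δ the required confidence, which is what "probably approximately correct" abbreviates.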



We can explain the fact that the grammars of all natural languages appear to exhibit roughly the same degree of complexity without positing a strong set of domain specific learning priors.
The fact that PAC learning requires labeled data renders it an inappropriate model for language acquisition.
Its assumptions that learning is distribution free and uniform for the elements of a class are also problematic.
In order to develop a version of PAC learning that is useful for modeling acquisition we need to revise these assumptions.
