
PAC-Bayes Analysis: Background and Applications

John Shawe-Taylor, University College London

Chicago/TTI Workshop, June 2009

Including joint work with John Langford, Amiran Ambroladze, Emilio Parrado-Hernández, Cédric Archambeau, Matthew Higgs, and Manfred Opper.


Aims

Hope to give you:
- The PAC-Bayes framework
- The core result
- How to apply it to Support Vector Machines
- Application to maximum entropy classification
- Application to Gaussian Processes and dynamical systems modeling


Outline

1. Background to Approach
2. PAC-Bayes Analysis
   - Definitions
   - PAC-Bayes Theorem
   - Applications
3. Linear Classifiers
   - General Approach
   - Learning the prior
4. Maximum entropy classification
   - Generalisation
   - Optimisation
5. GPs and SDEs
   - Gaussian Process regression
   - Variational approximation
   - Generalisation


General perspectives

The goal of different theories is to capture the key elements that enable an understanding and analysis of different phenomena.
- There are several theories of machine learning: notably Bayesian and frequentist.
- Different assumptions, and hence different ranges of applicability and ranges of results.
- Bayesian: able to make more detailed probabilistic predictions.
- Frequentist: makes only the i.i.d. assumption.


Historical notes: Frequentist approach

- Pioneered in Russia by Vapnik and Chervonenkis.
- Introduced in the West by Valiant under the name 'probably approximately correct'.
- Typical results state that, with probability at least 1 − δ (probably), any classifier from the hypothesis class which has low training error will have low generalisation error (approximately correct).
- Has the status of a statistical test: the confidence is denoted by δ, the probability that the sample is misleading/unusual.
- SVM bound using the luckiness framework by Shawe-Taylor et al. (1998).


Historical notes: Bayesian approach

- The name derives from Bayes' theorem: we assume a prior distribution over functions or classifiers and then use Bayes' rule to update the prior based on the likelihood of the data for each function.
- This gives the posterior distribution: a Bayesian will classify according to the expected classification under the posterior, the best strategy given that the prior is correct.
- Can be used for model selection by evaluating the 'evidence' for a model (see for example David MacKay); this is related to the volume of version space consistent with the data.
- Gaussian processes for regression are justified within this model.


Version space: evidence

[Figure: the version space, the region of weight vectors consistent with the training data, bounded by the hyperplanes f(x1,w) = 0, ..., f(x4,w) = 0; two consistent weight vectors w and w' are shown, together with inscribed spheres C1, C2, C3.]


Evidence and generalisation

- A link between evidence and generalisation was hypothesised by MacKay.
- The first formal link was obtained by Shawe-Taylor & Williamson (1997): PAC Analysis of a Bayes Estimator.
- Bound on generalisation in terms of the volume of the sphere that can be inscribed in the version space; it included a dependence on the dimensionality of the space.
- Used the luckiness framework, a data-dependent style of frequentist bound also used to bound the generalisation of SVMs, for which no dependence on the dimensionality is needed, just on the margin.


PAC-Bayes Theorem

- First version proved by McAllester in 1999.
- Improved proof and bound due to Seeger in 2002, with application to Gaussian processes.
- Application to SVMs by Langford and Shawe-Taylor, also in 2002.
- An excellent tutorial by Langford appeared in 2005 in JMLR.


Definitions for main result: Prior and posterior distributions

- The PAC-Bayes theorem involves a class of classifiers C together with a prior distribution P and a posterior Q over C.
- The distribution P must be chosen before learning, but the bound holds for all choices of Q; hence Q does not need to be the classical Bayesian posterior.
- The bound holds for all (prior) choices of P; hence its validity is not affected by a poor choice of P, though the quality of the resulting bound may be. Contrast with standard Bayesian analysis, which only holds if the prior assumptions are correct.


Definitions for main result: Error measures

- Being a frequentist (PAC) style result, we assume an unknown distribution D on the input space X.
- D is used to generate the labelled training samples i.i.d., i.e. S ∼ D^m.
- It is also used to measure the generalisation error c_D of a classifier c:

\[ c_D = \Pr_{(x,y)\sim D}\big(c(x) \neq y\big). \]

- The empirical generalisation error is denoted c_S:

\[ c_S = \frac{1}{m} \sum_{(x,y)\in S} I[c(x) \neq y], \]

where I[·] is the indicator function.
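As a quick illustration (not from the slides), a minimal Python sketch of the empirical error, assuming a classifier clf given as a vectorised function of the inputs:

import numpy as np

def empirical_error(clf, X, y):
    # c_S: the fraction of the m training points that c misclassifies.
    return np.mean(clf(X) != y)

The true error c_D is the same average taken under the unknown distribution D, so in practice it can only be estimated from a fresh sample.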


Definitions for main result: Assessing the posterior

- The result is concerned with bounding the performance of a probabilistic classifier that, given a test input x, chooses a classifier c ∼ Q (the posterior) and returns c(x).
- We are interested in the relation between two quantities:

\[ Q_D = \mathrm{E}_{c\sim Q}[c_D], \]

the true error rate of the probabilistic classifier, and

\[ Q_S = \mathrm{E}_{c\sim Q}[c_S], \]

its empirical error rate.
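A minimal Monte Carlo sketch (names hypothetical) of this Gibbs classifier and of estimating Q_S, assuming sample_Q() draws a classifier c ∼ Q as a vectorised callable:

import numpy as np

def gibbs_predict(sample_Q, x):
    # The stochastic classifier: draw c ~ Q afresh and return c(x).
    c = sample_Q()
    return c(x)

def estimate_QS(sample_Q, X, y, n_draws=1000):
    # Monte Carlo estimate of Q_S = E_{c~Q}[c_S].
    return np.mean([np.mean(sample_Q()(X) != y) for _ in range(n_draws)])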


Definitions for main result: Generalisation error

Note that this does not bound the posterior average, but we have

\[ \Pr_{(x,y)\sim D}\big(\mathrm{sgn}(\mathrm{E}_{c\sim Q}[c(x)]) \neq y\big) \;\le\; 2\,Q_D, \]

since for any point x misclassified by sgn(E_{c∼Q}[c(x)]), the probability of a random c ∼ Q misclassifying it is at least 0.5.


PAC-Bayes Theorem

Fix an arbitrary D, an arbitrary prior P, and a confidence δ. Then with probability at least 1 − δ over samples S ∼ D^m, all posteriors Q satisfy

\[ \mathrm{KL}(Q_S \,\|\, Q_D) \;\le\; \frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{m+1}{\delta}}{m}, \]

where KL is the KL divergence between distributions,

\[ \mathrm{KL}(Q\,\|\,P) = \mathrm{E}_{c\sim Q}\left[\ln\frac{Q(c)}{P(c)}\right], \]

with Q_S and Q_D considered as (Bernoulli) distributions on {0, 1}.
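The two sides of the theorem are easy to compute; a minimal sketch, treating Q_S and Q_D as Bernoulli parameters in [0, 1]:

import numpy as np

def kl_bernoulli(q, p):
    # KL(q || p) between Bernoulli(q) and Bernoulli(p), clipped for numerical safety.
    eps = 1e-12
    q, p = np.clip(q, eps, 1 - eps), np.clip(p, eps, 1 - eps)
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def pac_bayes_rhs(kl_QP, m, delta):
    # Right-hand side of the theorem: (KL(Q||P) + ln((m+1)/delta)) / m.
    return (kl_QP + np.log((m + 1) / delta)) / m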


Finite Classes

If we take a finite class of functions h_1, ..., h_N with prior distribution p_1, ..., p_N and assume that the posterior is concentrated on a single function h_i, the generalisation is bounded by

\[ \mathrm{KL}\big(\widehat{\mathrm{err}}(h_i) \,\|\, \mathrm{err}(h_i)\big) \;\le\; \frac{\ln(1/p_i) + \ln((m+1)/\delta)}{m}. \]

This is the standard result for finite classes, with the slight refinement that it involves the KL divergence between empirical and true error, and the extra ln(m + 1) term on the right-hand side.
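For a feel of the numbers, the right-hand side under a uniform prior p_i = 1/N (illustrative values only):

import numpy as np

N, m, delta = 1000, 10000, 0.05
rhs = (np.log(N) + np.log((m + 1) / delta)) / m
print(rhs)  # about 0.0019: the KL between empirical and true error is at most this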


Linear classifiers and SVMs

- Focus now on the linear function application (Langford & Shawe-Taylor).
- How the application is made.
- Extensions to learning the prior.
- Some results on UCI datasets to give an idea of what can be achieved.


Linear classifiers

- We will choose the prior and posterior distributions to be Gaussians with unit variance.
- The prior P will be centred at the origin with unit variance.
- The centre of the posterior Q(w, µ) will be specified by a unit vector w and a scale factor µ.


PAC-Bayes Bound for SVM (1/2)

[Figure: weight space W, showing the prior P, a Gaussian N(0, 1) centred at the origin, and the posterior Q, a Gaussian centred at distance µ from the origin in the direction w.]

- The prior P is Gaussian N(0, 1).
- The posterior Q is Gaussian, centred in the direction w at distance µ from the origin.

PAC-Bayes Bound for SVM (2/2)

The performance of linear classifiers may be bounded by

\[ \mathrm{KL}\big(Q_S(w,\mu) \,\|\, Q_D(w,\mu)\big) \;\le\; \frac{\mathrm{KL}\big(P\,\|\,Q(w,\mu)\big) + \ln\frac{m+1}{\delta}}{m}. \]

- Q_D(w, µ) is the true performance of the stochastic classifier.
- The SVM is a deterministic classifier that corresponds exactly to sgn(E_{c∼Q(w,µ)}[c(x)]), as the centre of the Gaussian gives the same classification as the halfspace with more weight. Hence its error is bounded by 2 Q_D(w, µ), since, as observed above, if x is misclassified then at least half of the classifiers c ∼ Q err.


- Q_S(w, µ) is a stochastic measure of the training error:

\[ Q_S(w,\mu) = \mathrm{E}_m[F(\mu\,\gamma(x,y))], \qquad \gamma(x,y) = \frac{y\,w^\top\phi(x)}{\|\phi(x)\|\,\|w\|}, \qquad F(t) = 1 - \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{t} e^{-x^2/2}\,dx. \]


- The prior P ≡ Gaussian centred on the origin; the posterior Q ≡ Gaussian along w at a distance µ from the origin; hence KL(P‖Q) = µ²/2.


- δ is the confidence: the bound holds with probability 1 − δ over the random i.i.d. selection of the training data.
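Putting the pieces together, a minimal sketch (assumptions: a learned weight vector w, a feature matrix Phi with one row per example φ(x_i), labels y in {−1, +1}) of Q_S(w, µ) and the right-hand side of the bound:

import numpy as np
from scipy.stats import norm

def stochastic_train_error(w, Phi, y, mu):
    # Q_S(w, mu) = E_m[F(mu * gamma(x, y))]; F(t) is the Gaussian upper tail, norm.sf.
    gamma = y * (Phi @ w) / (np.linalg.norm(Phi, axis=1) * np.linalg.norm(w))
    return np.mean(norm.sf(mu * gamma))

def bound_rhs(mu, m, delta):
    # (KL(P||Q(w, mu)) + ln((m+1)/delta)) / m, using KL(P||Q) = mu^2 / 2.
    return (mu**2 / 2 + np.log((m + 1) / delta)) / m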


Form of the SVM bound

- Note that the bound holds for all posterior distributions, so we can choose µ to optimise the bound.
- If we define the inverse of the KL by

\[ \mathrm{KL}^{-1}(q, A) = \max\{p : \mathrm{KL}(q\,\|\,p) \le A\}, \]

then with probability at least 1 − δ,

\[ \Pr\big(\langle w, \phi(x)\rangle \neq y\big) \;\le\; 2\,\min_{\mu}\,\mathrm{KL}^{-1}\!\left(\mathrm{E}_m[F(\mu\,\gamma(x,y))],\; \frac{\mu^2/2 + \ln\frac{m+1}{\delta}}{m}\right). \]
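KL^{-1} has no closed form, but KL(q‖p) is increasing in p for p ≥ q, so a bisection suffices; a minimal sketch:

import numpy as np

def kl_bernoulli(q, p):
    eps = 1e-12
    q, p = np.clip(q, eps, 1 - eps), np.clip(p, eps, 1 - eps)
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def kl_inverse(q, A, tol=1e-9):
    # KL^{-1}(q, A) = max{p in [q, 1] : KL(q||p) <= A}, found by bisection.
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(q, mid) <= A:
            lo = mid
        else:
            hi = mid
    return lo

The bound itself is then a simple line search over µ, taking the smallest value of 2 * kl_inverse(QS(mu), (mu**2 / 2 + np.log((m + 1) / delta)) / m).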


Gives SVM Optimisation

Primal form:

\[ \min_{w,\,\xi}\;\; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i \quad \text{s.t.}\quad y_i\,w^\top\phi(x_i) \ge 1 - \xi_i,\;\; \xi_i \ge 0,\;\; i = 1,\dots,m. \]

Dual form:

\[ \max_{\alpha}\;\; \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{m}\alpha_i\alpha_j y_i y_j\,\kappa(x_i,x_j) \quad \text{s.t.}\quad 0 \le \alpha_i \le C,\;\; i = 1,\dots,m, \]

where \( \kappa(x_i,x_j) = \langle\phi(x_i),\phi(x_j)\rangle \) and \( \langle w,\phi(x)\rangle = \sum_{i=1}^{m}\alpha_i y_i\,\kappa(x_i,x) \).
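As a usage note, the dual solution and the resulting function ⟨w, φ(x)⟩ are available from any standard SVM solver; for example with scikit-learn (illustrative data only):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=100))

clf = SVC(kernel="rbf", C=1.0).fit(X, y)
# clf.dual_coef_ holds alpha_i * y_i for the support vectors;
# clf.decision_function(X) evaluates <w, phi(x)> plus a bias term.
scores = clf.decision_function(X)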


Slack variable conversion

[Figure: plot of the slack variable conversion; horizontal axis (margin) from −2 to 2, vertical axis from 0 to 3.]


Learning the prior (1/3)

- The bound depends on the distance between prior and posterior.
- A better prior (closer to the posterior) would lead to a tighter bound.
- Learn the prior P with part of the data.
- Introduce the learnt prior in the bound.
- Compute the stochastic error with the remaining data.
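A minimal sketch of this recipe (all names hypothetical): learn a prior direction on one part of the sample and reserve the rest for the bound:

import numpy as np
from sklearn.svm import LinearSVC

def learn_prior_split(X, y, frac=0.5, seed=0):
    # Split the data: the first part learns a prior centre w_prior,
    # the second part is kept to compute the stochastic error in the bound.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    cut = int(frac * len(y))
    prior_idx, bound_idx = idx[:cut], idx[cut:]
    w_prior = LinearSVC().fit(X[prior_idx], y[prior_idx]).coef_.ravel()
    return w_prior, bound_idx

For unit-variance Gaussians the KL term in the bound then becomes ‖µw − ηw_prior‖²/2 (η a chosen scaling of the prior centre), in place of µ²/2.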


Tightness of the new bound

Problem              PAC-Bayes Bound   Prior-PAC-Bayes Bound
Wdbc                 0.346 ± 0.006     0.284 ± 0.021
Waveform             0.197 ± 0.002     0.143 ± 0.005
Ringnorm             0.211 ± 0.001     0.093 ± 0.004
Pima                 0.399 ± 0.007     0.374 ± 0.020
Landsat              0.035 ± 0.001     0.023 ± 0.002
Handwritten-digits   0.159 ± 0.001     0.084 ± 0.003
Spam                 0.243 ± 0.002     0.161 ± 0.006
Average              0.227             0.166


Model Selection with the new bound: results

Problem              PAC-SVM         Prior-PAC-Bayes   Ten-Fold XVal
Wdbc                 0.070 ± 0.024   0.070 ± 0.024     0.067 ± 0.024
Waveform             0.090 ± 0.008   0.091 ± 0.008     0.086 ± 0.008
Ringnorm             0.034 ± 0.003   0.024 ± 0.003     0.016 ± 0.003
Pima                 0.241 ± 0.031   0.236 ± 0.031     0.245 ± 0.040
Landsat              0.011 ± 0.002   0.007 ± 0.002     0.005 ± 0.002
Handwritten-digits   0.015 ± 0.002   0.016 ± 0.002     0.007 ± 0.002
Spam                 0.090 ± 0.009   0.088 ± 0.009     0.063 ± 0.008
Average              0.079           0.076             0.070

Test error achieved by the three settings.


Model selection with p-SVM

Problem       PAC-SVM         Prior-PAC-SVM   PriorSVM         η-PriorSVM
Wdbc          0.070 ± 0.024   0.070 ± 0.024   0.068 ± 0.0236   0.073 ± 0.023
Waveform      0.090 ± 0.008   0.091 ± 0.008   0.085 ± 0.0188   0.085 ± 0.007
Ringnorm      0.034 ± 0.003   0.024 ± 0.003   0.014 ± 0.0077   0.015 ± 0.003
Pima          0.241 ± 0.031   0.236 ± 0.031   0.237 ± 0.0323   0.242 ± 0.033
Landsat       0.011 ± 0.002   0.007 ± 0.002   0.006 ± 0.0019   0.006 ± 0.002
Hand-digits   0.015 ± 0.002   0.016 ± 0.002   0.011 ± 0.0028   0.011 ± 0.003
Spam          0.090 ± 0.009   0.088 ± 0.009   0.075 ± 0.0093   0.080 ± 0.009
Average       0.079           0.076           0.071            0.073


Tightness of the bound with p-SVM

Problem       PAC-SVM         Prior-PAC-SVM   PriorSVM         η-PriorSVM
Wdbc          0.346 ± 0.006   0.284 ± 0.021   0.308 ± 0.0252   0.271 ± 0.027
Waveform      0.197 ± 0.002   0.143 ± 0.005   0.156 ± 0.0054   0.136 ± 0.006
Ringnorm      0.211 ± 0.001   0.093 ± 0.004   0.054 ± 0.0038   0.049 ± 0.003
Pima          0.399 ± 0.007   0.374 ± 0.020   0.418 ± 0.0182   0.391 ± 0.021
Landsat       0.035 ± 0.001   0.023 ± 0.002   0.027 ± 0.0032   0.022 ± 0.002
Hand-digits   0.159 ± 0.001   0.084 ± 0.003   0.046 ± 0.0045   0.042 ± 0.004
Spam          0.243 ± 0.002   0.161 ± 0.006   0.171 ± 0.0065   0.145 ± 0.007
Average       0.227           0.166           0.169            0.151


Maximum entropy learning

Consider the function class, where X is a subset of the ℓ∞ unit ball,

\[ \mathcal{F} = \left\{ f_w : x \in X \mapsto \mathrm{sgn}\left(\sum_{i=1}^{N} w_i x_i\right) \;:\; \|w\|_1 \le 1 \right\}. \]

We want a posterior distribution Q(w) such that we can bound

\[ P_{(x,y)\sim D}(f_w(x) \neq y) \;\le\; 2\,e_Q(w)\;(=2\,Q_D(w)) \;=\; 2\,\mathrm{E}_{(x,y)\sim D,\,q\sim Q(w)}\big[I[q(x) \neq y]\big]. \]

Given a training sample S = {(x_1, y_1), ..., (x_m, y_m)}, we similarly define

\[ \hat{e}_Q(w)\;(=Q_S(w)) \;=\; \frac{1}{m}\sum_{i=1}^{m}\mathrm{E}_{q\sim Q(w)}\big[I[q(x_i) \neq y_i]\big]. \]


Posterior distribution Q(w)

The classifier q involves a random weight vector W ∈ R^N plus a random threshold Θ:

\[ q_{W,\Theta}(x) = \mathrm{sgn}\big(\langle W, x\rangle - \Theta\big). \]

The distribution Q(w) of W will be discrete, with

\[ W = \mathrm{sgn}(w_i)\,e_i \quad \text{with probability } |w_i|,\; i = 1,\dots,N, \]

where e_i is the i-th unit vector. The distribution of Θ is uniform on the interval [−1, 1].
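A minimal sketch of drawing from this posterior (assuming ‖w‖₁ = 1, so the coordinate probabilities sum to one):

import numpy as np

def sample_q(w, rng):
    # Draw one classifier q_{W,Theta} from Q(w):
    # coordinate i with probability |w_i|, threshold Theta uniform on [-1, 1].
    i = rng.choice(len(w), p=np.abs(w))
    theta = rng.uniform(-1.0, 1.0)
    return lambda x: np.sign(np.sign(w[i]) * x[i] - theta)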


Error expression

Proposition. With the above definitions we have, for w satisfying ‖w‖₁ = 1, that for any (x, y) ∈ X × {−1, +1},

\[ P_{q\sim Q(w)}(q(x) \neq y) = 0.5\,\big(1 - y\langle w, x\rangle\big). \]


Error expression proof

Proof.

\begin{align*}
P_{q\sim Q(w)}(q(x) \neq y) &= \sum_{i=1}^{N} |w_i|\, P_\Theta\big(\mathrm{sgn}(\mathrm{sgn}(w_i)\langle e_i, x\rangle - \Theta) \neq y\big) \\
&= \sum_{i=1}^{N} |w_i|\, P_\Theta\big(\mathrm{sgn}(\mathrm{sgn}(w_i)\,x_i - \Theta) \neq y\big) \\
&= 0.5 \sum_{i=1}^{N} |w_i|\,\big(1 - y\,\mathrm{sgn}(w_i)\,x_i\big) \\
&= 0.5\,\big(1 - y\langle w, x\rangle\big).
\end{align*}
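The identity is easy to check numerically; a sketch comparing a Monte Carlo estimate against 0.5(1 − y⟨w, x⟩):

import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=8); w /= np.abs(w).sum()   # ||w||_1 = 1
x = rng.uniform(-1, 1, size=8); y = 1.0        # x inside the l_inf unit ball

draws = 200_000
i = rng.choice(8, size=draws, p=np.abs(w))     # coordinate i with probability |w_i|
theta = rng.uniform(-1, 1, size=draws)         # Theta uniform on [-1, 1]
errs = np.sign(np.sign(w[i]) * x[i] - theta) != y
print(errs.mean(), 0.5 * (1 - y * np.dot(w, x)))  # the two should agree closely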


Generalisation error

Corollary.

\[ P_{(x,y)\sim D}\big(f_w(x) \neq y\big) \le 2\,e_Q(w). \]

Proof.

\[ P_{q\sim Q(w)}(q(x) \neq y) \ge 0.5 \iff f_w(x) \neq y. \]


Base result

Theorem. With probability at least 1 − δ over the draw of training sets of size m,

\[ \mathrm{KL}\big(\hat{e}_Q(w) \,\|\, e_Q(w)\big) \;\le\; \frac{\sum_{i=1}^{N} |w_i|\ln|w_i| + \ln(2N) + \ln((m+1)/\delta)}{m}. \]

Proof. Use a prior P uniform on the unit vectors ±e_i. The posterior is as described above, so KL(Q(w)‖P) equals ln(2N) minus the entropy of w.


Interpretation

- Suggests maximising the entropy as a means of minimising the bound.
- The problem is that the empirical error ê_Q(w) is too large:

\[ \hat{e}_Q(w) = \frac{1}{m}\sum_{i=1}^{m} 0.5\,\big(1 - y_i\langle w, x_i\rangle\big). \]

- It is a function of the margin, but just a linear function.


Boosting the bound

The trick to boost the power of the bound is to take T independent samples of the distribution Q(w) and vote for the classification:

\[ q_{\mathbf{W},\mathbf{\Theta}}(x) = \mathrm{sgn}\left(\sum_{t=1}^{T} \mathrm{sgn}\big(\langle W_t, x\rangle - \Theta_t\big)\right). \]

Now the empirical error becomes

\[ \hat{e}_{Q^T}(w) = \frac{0.5^T}{m}\sum_{i=1}^{m}\sum_{t=0}^{\lfloor T/2\rfloor} \binom{T}{t}\big(1 + y_i\langle w, x_i\rangle\big)^{t}\big(1 - y_i\langle w, x_i\rangle\big)^{T-t}, \]

giving a sigmoid-like loss as a function of the margin.
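A sketch of this voted empirical error, evaluated directly from the binomial expression above:

import numpy as np
from scipy.special import comb

def voted_empirical_error(w, X, y, T):
    # hat{e}_{Q^T}(w): probability that at most floor(T/2) of the T draws are
    # correct, i.e. that the majority vote errs, averaged over the sample.
    margin = y * (X @ w)                    # y_i <w, x_i>, in [-1, 1]
    ts = np.arange(T // 2 + 1)              # t = 0 .. floor(T/2)
    terms = (comb(T, ts)[None, :]
             * (1 + margin)[:, None] ** ts
             * (1 - margin)[:, None] ** (T - ts))
    return 0.5 ** T * terms.sum() / len(y)

With T = 1 this reduces to the linear loss 0.5(1 − y_i⟨w, x_i⟩) above; larger T sharpens it into the sigmoid-like shape.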


Full result

Theorem. With probability at least 1 − δ over the draw of training sets of size m,

\[ P_{(x,y)\sim D}\big(f_w(x) \neq y\big) \;\le\; 2\,\mathrm{KL}^{-1}\!\left(\hat{e}_{Q^T}(w),\; \frac{T\sum_{i=1}^{N}|w_i|\ln|w_i| + T\ln(2N) + \ln((m+1)/\delta)}{m}\right). \]

- Note the penalty factor of T applied to the KL term.
- The factor T behaves like the (inverse) margin in the usual bounds.


Algorithmics

The bound motivates the optimisation:

\min_{w, \rho, \xi}\; \sum_{j=1}^{N} |w_j| \ln|w_j| - C\rho + D \sum_{i=1}^{m} \xi_i

subject to: y_i \langle w, x_i \rangle \ge \rho - \xi_i,\; 1 \le i \le m; \quad \|w\|_1 \le 1; \quad \xi_i \ge 0,\; 1 \le i \le m.

This follows the SVM route of approximating the sigmoid-like loss by the (convex) hinge loss.
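
A small sketch evaluating this primal objective at a candidate (w, ρ), with the slacks at their optimal values ξ_i = max(0, ρ − y_i⟨w, x_i⟩); C and D are the trade-off parameters above:

```python
import numpy as np

def primal_objective(w, rho, X, y, C=1.0, D=1.0):
    """sum_j |w_j| ln|w_j| - C*rho + D*sum_i max(0, rho - y_i <w, x_i>),
    assuming ||w||_1 <= 1 (zero weights contribute 0 to the entropy term)."""
    a = np.abs(w)
    a = a[a > 0]
    entropy_term = np.sum(a * np.log(a))
    xi = np.maximum(0.0, rho - y * (X @ w))   # optimal slacks for fixed (w, rho)
    return entropy_term - C * rho + D * np.sum(xi)
```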

Dual optimisation

\max_{\alpha}\; L = -\sum_{j=1}^{N} \exp\!\left( \left| \sum_{i=1}^{m} \alpha_i y_i x_{ij} \right| - 1 - \lambda \right) - \lambda

subject to: \sum_{i=1}^{m} \alpha_i = C, \quad 0 \le \alpha_i \le D,\; 1 \le i \le m.

Similar to the SVM dual, but with an exponential function.
Surprisingly, it also gives dual sparsity.
Coordinate-wise descent works very well (cf. the SMO algorithm).
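
A sketch of the dual objective, together with a recovery of the primal weights from the stationarity condition of the Lagrangian, ln|w_j| = |s_j| − 1 − λ with s_j = Σ_i α_i y_i x_{ij} and sgn(w_j) = sgn(s_j); the recovery formula is our own reading of the KKT conditions, so treat it as an assumption:

```python
import numpy as np

def dual_objective(alpha, lam, X, y):
    """L = -sum_j exp(|sum_i alpha_i y_i x_ij| - 1 - lam) - lam,
    maximised subject to sum_i alpha_i = C, 0 <= alpha_i <= D."""
    s = X.T @ (alpha * y)                      # s_j = sum_i alpha_i y_i x_ij
    return -np.sum(np.exp(np.abs(s) - 1.0 - lam)) - lam

def primal_weights(alpha, lam, X, y):
    """Assumed recovery: |w_j| = exp(|s_j| - 1 - lam), sign(w_j) = sign(s_j),
    with lam the multiplier for the constraint ||w||_1 <= 1."""
    s = X.T @ (alpha * y)
    return np.sign(s) * np.exp(np.abs(s) - 1.0 - lam)
```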

Results: effect of varying T

[Figure: value of the bound on the Ionosphere data set plotted against T, for T from 0 to 40; the bound value ranges roughly from 0.9 to 1.15.]

Results

Bound and test errors:

Data        Bound   Error   SVM error
Ionosphere  0.63    0.28    0.24
Votes       0.78    0.35    0.35
Glass       0.69    0.46    0.47
Haberman    0.64    0.25    0.26
Credit      0.60    0.25    0.28

Gaussian Process Regression

A GP is a distribution over real-valued functions that is multivariate Gaussian when restricted to any finite subset of inputs.
It is characterised by a kernel that specifies the covariance function when marginalising on any finite subset.
If we have a finite set of input/output observations generated with additive Gaussian noise on the outputs, the posterior is also a Gaussian process.
The KL divergence between posterior and prior can be computed as (where K = RR' is a Cholesky decomposition of K):

2\,\mathrm{KL}(Q \| P) = \log\det\!\left( I + \frac{1}{\sigma^2} K \right) - \mathrm{tr}\!\left( \left( \sigma^2 I + K \right)^{-1} K \right) + \left\| R' \left( K + \sigma^2 I \right)^{-1} \mathbf{y} \right\|^2.
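
A direct numpy evaluation of this quantity (a sketch: the jitter added before the Cholesky factorisation is a numerical convenience, and we apply R' rather than R in the final term so that the quadratic form matches K = RR'):

```python
import numpy as np

def gp_kl(K, y, sigma2):
    """KL(Q||P) for GP regression, from
    2 KL = log det(I + K/sigma^2) - tr((sigma^2 I + K)^{-1} K)
           + || R'(K + sigma^2 I)^{-1} y ||^2   with K = R R'."""
    n = K.shape[0]
    R = np.linalg.cholesky(K + 1e-10 * np.eye(n))   # K = R R' (lower-triangular R)
    A = sigma2 * np.eye(n) + K
    _, logdet = np.linalg.slogdet(np.eye(n) + K / sigma2)
    trace_term = np.trace(np.linalg.solve(A, K))
    quad = R.T @ np.linalg.solve(A, y)
    return 0.5 * (logdet - trace_term + np.sum(quad ** 2))
```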

Applying PAC-Bayes theorem

This suggests we can use the PAC-Bayes theorem if we can create appropriate classifiers indexed by real-valued functions.
Consider, for some ε > 0, the classifiers

h^{\varepsilon}_{f}(x, y) = \begin{cases} 1 & \text{if } |y - f(x)| \le \varepsilon; \\ 0 & \text{otherwise.} \end{cases}

We can compute the expected value of h^{\varepsilon}_{f} under the posterior, with m(x) and v(x) the posterior mean and variance:

\mathbb{E}_{f \sim Q}\left[ h^{\varepsilon}_{f}(x, y) \right] = \frac{1}{2}\,\mathrm{erf}\!\left( \frac{y + \varepsilon - m(x)}{\sqrt{2 v(x)}} \right) - \frac{1}{2}\,\mathrm{erf}\!\left( \frac{y - \varepsilon - m(x)}{\sqrt{2 v(x)}} \right).
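
This expectation is immediate to evaluate with scipy (mean = m(x) and var = v(x) are the GP posterior mean and variance at the input x):

```python
import numpy as np
from scipy.special import erf

def expected_indicator(y, mean, var, eps):
    """E_{f~Q}[h^eps_f(x, y)]: posterior probability that |y - f(x)| <= eps."""
    s = np.sqrt(2.0 * var)
    return 0.5 * (erf((y + eps - mean) / s) - erf((y - eps - mean) / s))
```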

GP Result

Furthermore, we can lower bound the expected value of a point (x, y) under the posterior distribution: a Taylor expansion in ε, bounding the second derivative over τ ∈ [−ε, ε], gives

2\varepsilon\, \mathcal{N}(y \mid m(x), v(x)) \ge \mathbb{E}_{f \sim Q}\left[ h^{\varepsilon}_{f}(x, y) \right] - \frac{\varepsilon^2}{v(x)\sqrt{2e\pi}},

enabling an application of the PAC-Bayes theorem to give

\mathbb{E}\!\left[ \mathcal{N}(y \mid m(x), v(x)) + \frac{\varepsilon}{2 v(x) \sqrt{2e\pi}} \right] \ge \frac{1}{2\varepsilon}\, \mathrm{KL}^{-1}\!\left( \hat{E}(\varepsilon),\; \frac{D + \ln((m+1)/\delta)}{m} \right),

where \hat{E}(\varepsilon) is the empirical average of \mathbb{E}_{f \sim Q}[h^{\varepsilon}_{f}(x, y)] and D is the KL divergence between posterior and prior.

GP Experimental Results

The robot arm problem (R): 150 training points and 51 test points.
The Boston housing problem (H): 455 training points and 51 test points.
The forest fire problem (F): 450 training points and 67 test points.

Data  σ       ê       KL−1     etest    KL−1 (varGP)  etest (varGP)
R     0.0494  0.8903  0.4782   0.8419   –             –
H     0.1924  0.8699  0.4645   0.7155   0.8401        0.9416
F     1.0129  0.5694  0.4557   0.5533   –             –

GP Experimental Results

We can also plot the test accuracy and bound as a function of ε:

[Figure: Gaussian noise: plot of E_{(x,y)∼D}[1 − α(x)] against ε for varying noise level η; panels (a) η = 1, (b) η = 3, (c) η = 5.]

GP Experimental Results

With Laplace noise:

[Figure: Laplace noise: plot of E_{(x,y)∼D}[1 − α(x)] against ε for varying η; panels (a) η = 1, (b) η = 3, (c) η = 5.]

GP Experimental Results

Robot arm problem and Boston housing:

[Figure: Confidence levels for the robot arm problem; panels (a) robot arm, (b) Boston housing.]

Stochastic Differential Equation Models

Consider modelling a time-varying process with a (non-linear) stochastic differential equation:

dx = f(x, t)\, dt + \sqrt{\Sigma}\, dW.

Here f(x, t) is a non-linear drift term and dW is a Wiener process.
This is the limit of the discrete-time equation

\Delta x_k \equiv x_{k+1} - x_k = f(x_k)\, \Delta t + \sqrt{\Delta t\, \Sigma}\; \varepsilon_k,

where ε_k is zero-mean, unit-variance Gaussian noise.
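
A minimal Euler–Maruyama simulation of this discrete-time recursion; the Lorenz drift at the end anticipates the experiment later in the talk, with the standard (here merely illustrative) parameter values:

```python
import numpy as np

def euler_maruyama(f, x0, Sigma, dt, n_steps, rng=None):
    """Simulate dx = f(x, t) dt + sqrt(Sigma) dW via
    x_{k+1} = x_k + f(x_k, t_k) dt + sqrt(dt) sqrt(Sigma) eps_k."""
    rng = np.random.default_rng(0) if rng is None else rng
    d = len(x0)
    sqrt_Sigma = np.linalg.cholesky(Sigma)           # matrix square root
    path = np.empty((n_steps + 1, d))
    path[0] = x0
    for k in range(n_steps):
        eps = rng.standard_normal(d)                 # zero-mean, unit-variance noise
        path[k + 1] = (path[k] + f(path[k], k * dt) * dt
                       + np.sqrt(dt) * (sqrt_Sigma @ eps))
    return path

def lorenz(x, t, s=10.0, r=28.0, b=8.0 / 3.0):
    """Lorenz drift, used purely as an illustrative non-linear f."""
    return np.array([s * (x[1] - x[0]),
                     x[0] * (r - x[2]) - x[1],
                     x[0] * x[1] - b * x[2]])

path = euler_maruyama(lorenz, np.array([1.0, 1.0, 25.0]),
                      Sigma=np.eye(3), dt=0.005, n_steps=2000)
```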

Variational approximation

We use the Bayesian approach to data modelling with a noise model given by

p(y_n \mid x(t_n)) = \mathcal{N}(y_n \mid H x(t_n), R).

We consider a variational approximation of the posterior using a time-varying linear SDE:

dx = f_L(x, t)\, dt + \sqrt{\Sigma}\, dW, \qquad \text{where } f_L(x, t) = -A(t)\, x + b(t).

Girsanov change of measure

The measure for the drift f is denoted by P and the one for the drift f_L by Q.
The KL divergence in this infinite-dimensional setting is given by the Radon–Nikodym derivative of Q with respect to P:

\mathrm{KL}[Q \| P] = \int dQ\, \ln\frac{dQ}{dP} = \mathbb{E}_Q\!\left[ \ln\frac{dQ}{dP} \right],

which can be computed (by Girsanov's theorem) as

\frac{dQ}{dP} = \exp\left\{ -\int_{t_0}^{t_f} (f - f_L)^{\top} \Sigma^{-1/2}\, dW_t + \frac{1}{2} \int_{t_0}^{t_f} (f - f_L)^{\top} \Sigma^{-1} (f - f_L)\, dt \right\},

where W is a Wiener process with respect to Q.

KL divergence

Hence, the KL divergence is

\mathrm{KL}[Q \| P] = \frac{1}{2} \int_{t_0}^{t_f} \left\langle \left( f(x(t), t) - f_L(x(t), t) \right)^{\top} \Sigma^{-1} \left( f(x(t), t) - f_L(x(t), t) \right) \right\rangle_{q_t} dt,

where ⟨·⟩_{q_t} denotes the expectation with respect to the marginal density at time t of the measure Q.
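
The integrand can be estimated by Monte Carlo once the Gaussian marginal q_t = N(m(t), S(t)) of the approximating linear SDE is available (see the ODEs on the next slide); a sketch, with the drifts f and f_L passed in as callables:

```python
import numpy as np

def kl_rate(f, fL, m_t, S_t, Sigma_inv, t, rng, n_samples=1000):
    """Monte Carlo estimate of
    0.5 <(f(x,t) - fL(x,t))' Sigma^{-1} (f(x,t) - fL(x,t))>_{q_t}
    with q_t = N(m(t), S(t)); integrate over t (e.g. trapezoidal rule)
    to obtain KL[Q||P]."""
    x = rng.multivariate_normal(m_t, S_t, size=n_samples)
    diff = np.array([f(xi, t) - fL(xi, t) for xi in x])
    quad = np.einsum('ni,ij,nj->n', diff, Sigma_inv, diff)
    return 0.5 * np.mean(quad)
```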

Variational approximation

As the approximating SDE is linear, the marginal distribution q_t is Gaussian,

q_t(x) = \mathcal{N}(x \mid m(t), S(t)),

with the mean m(t) and covariance S(t) described by ordinary differential equations (ODEs):

\frac{dm}{dt} = -A m + b, \qquad \frac{dS}{dt} = -A S - S A^{\top} + \Sigma.
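
A forward-Euler sketch of integrating these moment ODEs (A(t) and b(t) are passed as callables; the step size and horizon are up to the caller):

```python
import numpy as np

def propagate_moments(A, b, Sigma, m0, S0, dt, n_steps):
    """dm/dt = -A(t) m + b(t);  dS/dt = -A(t) S - S A(t)' + Sigma."""
    m, S = m0.copy(), S0.copy()
    ms, Ss = [m.copy()], [S.copy()]
    for k in range(n_steps):
        t = k * dt
        At, bt = A(t), b(t)
        m = m + dt * (-At @ m + bt)
        S = S + dt * (-At @ S - S @ At.T + Sigma)
        ms.append(m.copy())
        Ss.append(S.copy())
    return np.array(ms), np.array(Ss)
```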

Algorithmics

Using Lagrangian methods we can derive an algorithm that finds the variational approximation by minimising the KL divergence between the posterior and the approximating distribution.
But the KL also appears in the PAC-Bayes bound – is it possible to define an appropriate loss over paths ω that captures the properties of interest?

Error estimation

For ω : [0, T] → R^D defining a trajectory ω(t) ∈ R^D, we define the classifier h_ω by

h_{\omega}(y, t) = \begin{cases} 1 & \text{if } \|y - H\omega(t)\| \le \varepsilon; \\ 0 & \text{otherwise,} \end{cases}

where the actual observations are linear functions of the state variable given by the operator H.
The prior and posterior distributions over functions are inherited from the distributions P and Q over paths ω.
Hence P = p_{sde} and Q = q, defined by the linear approximating SDE.

Generalisation analysis

For the PAC-Bayes analysis we must compute KL(Q‖P), \hat{e}_Q and e_Q. We have, as above,

\mathrm{KL}(Q \| P) = \int dq\, \ln\frac{dq}{dp_{sde}}.

If we now consider a fixed sample (y, t) we can estimate

\mathbb{E}_{\omega \sim Q}\left[ h_{\omega}(y, t) \right] = \int I\left[ \|Hx - y\| \le \varepsilon \right] dq_t(x).

For sufficiently small values of ε we can approximate this by

\frac{V_d\, \varepsilon^d}{(2\pi)^{d/2}\, |H S(t) H^{\top}|^{1/2}} \exp\!\left( -\frac{1}{2} \left( y - H m(t) \right)^{\top} \left( H S(t) H^{\top} \right)^{-1} \left( y - H m(t) \right) \right) = V_d\, \varepsilon^d\, \mathcal{N}\!\left( y \mid H m(t), H S(t) H^{\top} \right),

where V_d is the volume of the unit ball in R^d.
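
A sketch of this small-ball approximation using scipy (the closed form for V_d is the standard volume of the unit ball in R^d):

```python
import numpy as np
from scipy.special import gamma
from scipy.stats import multivariate_normal

def small_ball_probability(y, Hm, HSH, eps):
    """Approximate q_t(||Hx - y|| <= eps) by V_d eps^d N(y | Hm(t), HS(t)H'),
    valid for eps small relative to the covariance HS(t)H'."""
    d = len(y)
    V_d = np.pi ** (d / 2) / gamma(d / 2 + 1)    # volume of the unit ball in R^d
    return V_d * eps ** d * multivariate_normal.pdf(y, mean=Hm, cov=HSH)
```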

Error estimates

Note that e_Q is simply

e_Q = \mathbb{E}_{(y,t) \sim \mu}\, \mathbb{E}_{\omega \sim Q}\left[ h_{\omega}(y, t) \right] \propto \int \mathcal{N}\!\left( y \mid H m(t), H S(t) H^{\top} \right) d\mu(y, t),

while \hat{e}_Q is the empirical average of this quantity.
A tension arises in setting ε: if it is large, the approximation is inaccurate.
If \hat{e}_Q and e_Q are both small, the bound implied by KL(\hat{e}_Q \| e_Q) \le C becomes weak.

Refining the distributions

We overcome this weakness by taking K-fold product distributions and defining h_{(\omega_1, \dots, \omega_K)} as

h_{(\omega_1, \dots, \omega_K)}(y, t) = \begin{cases} 1 & \text{if there exists } 1 \le i \le K \text{ such that } \|y - H\omega_i(t)\| \le \varepsilon; \\ 0 & \text{otherwise.} \end{cases}

We now have

\mathbb{E}_{(\omega_1, \dots, \omega_K) \sim Q^K}\left[ h_{(\omega_1, \dots, \omega_K)}(y, t) \right] \approx 1 - \left( 1 - \int I\left[ \|Hx - y\| \le \varepsilon \right] dq_t(x) \right)^{K} \approx K V_d \varepsilon^d\, \mathcal{N}\!\left( y \mid H m(t), H S(t) H^{\top} \right).
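
A quick numeric check of the amplification step, 1 − (1 − p)^K ≈ Kp for small per-draw probability p (values hypothetical):

```python
# For p = 1e-3 and K = 50: 1 - (1 - p)**K = 0.0488..., K*p = 0.05.
p, K = 1e-3, 50
print(1 - (1 - p) ** K, K * p)
```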

Final result

Putting it all together gives the final bound:

\mathbb{E}_{(y,t) \sim \mu}\left[ \mathcal{N}\!\left( y \mid H m(t), H S(t) H^{\top} \right) \right] \ge \frac{1}{V_d \varepsilon^d K}\, \mathrm{KL}^{-1}\!\left( K V_d \varepsilon^d\, \hat{\mathbb{E}}\left[ \mathcal{N}\!\left( y \mid H m(t), H S(t) H^{\top} \right) \right],\; \frac{K \int_0^T E_{sde}(t)\, dt + \ln((m+1)/\delta)}{m} \right),

where

E_{sde}(t) = \frac{1}{2} \left\langle \left( f(x) - f_L(x, t) \right)^{\top} \Sigma^{-1} \left( f(x) - f_L(x, t) \right) \right\rangle_{q_t}.

Small scale experiment

We applied the analysis to the results of performing a variational Bayesian approximation to the Lorenz attractor in three dimensions. The quality of the fit with 49 examples was good.

[Figure: three-dimensional trajectory of the Lorenz attractor together with the fitted approximation.]

Small scale experiment

We chose V_d ε^d to optimise the bound – a fairly small ball, implying that our approximation should be reasonable.
We compared the bound with the left-hand side estimated on a random draw of 99 test points. The corresponding values are:

m    dt     ê_Q     A       e_Q     KL−1(·,·)/V
49   0.005  0.137   3.536   0.128   0.004

Conclusions

Overview of the theory and main result.
Application to bound the performance of an SVM.
Experiments show the new bound can be tighter ...
... and reliable for low-cost model selection.
Extended to maximum entropy classification.
Also considered lower bounding the accuracy of a posterior distribution for Gaussian processes (GPs).
Applied the theory to bound the performance of estimates made using approximate Bayesian inference for dynamical systems:
– prior determined by a non-linear stochastic differential equation (SDE);
– variational approximation results in a posterior given by an approximating linear SDE – hence a Gaussian process posterior.
