Data Dependent Priors in PAC-Bayes Bounds

John Shawe-Taylor, University College London

Joint work with Emilio Parrado-Hernández and Amiran Ambroladze

August, 2010

1 Links

2 PAC-Bayes Analysis: Definitions, PAC-Bayes Theorem, Proof outline, Applications

3 Linear Classifiers: General Approach, Learning the prior, New prior for linear functions, Prior-SVM

Evidence and generalisation

- Link between evidence and generalisation hypothesised by MacKay
- First formal link was obtained by S-T & Williamson (1997): PAC Analysis of a Bayes Estimator
- Bound on generalisation in terms of the volume of the sphere that can be inscribed in the version space; it included a dependence on the dimensionality of the space
- Used the Luckiness framework, a data-dependent style of frequentist bound also used to bound generalisation of SVMs, for which no dependence on the dimensionality is needed, just on the margin

PAC-Bayes Theorem

- First version proved by McAllester in 1999
- Improved proof and bound due to Seeger in 2002, with application to Gaussian processes
- Application to SVMs by Langford and S-T, also in 2002
- Excellent tutorial by Langford appeared in 2005 in JMLR

Definitions for main result: Prior and posterior distributions

- The PAC-Bayes theorem involves a class of classifiers C together with a prior distribution P and posterior Q over C
- The distribution P must be chosen before learning, but the bound holds for all choices of Q, hence Q does not need to be the classical Bayesian posterior
- The bound holds for all (prior) choices of P, hence its validity is not affected by a poor choice of P, though the quality of the resulting bound may be

Definitions for main result: Error measures

- Being a frequentist (PAC) style result, we assume an unknown distribution $D$ generating labelled examples $(x, y)$, with $x$ in the input space $X$.
- $D$ is used to generate the labelled training samples i.i.d., i.e. $S \sim D^m$
- It is also used to measure the generalisation error $c_D$ of a classifier $c$:

\[ c_D = \Pr_{(x,y) \sim D}\big(c(x) \neq y\big) \]

- The empirical error is denoted $c_S$:

\[ c_S = \frac{1}{m} \sum_{(x,y) \in S} I[c(x) \neq y], \]

where $I[\cdot]$ is the indicator function.
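The empirical error defined above is just a misclassification rate, which can be sketched in a few lines. This is a minimal illustration only: the one-dimensional inputs, the `threshold_at_zero` classifier and the four-point sample are hypothetical and do not come from the talk.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical toy setting: a 1-D input x and a label y in {-1, +1}.
Example = Tuple[float, int]


def empirical_error(c: Callable[[float], int], sample: Sequence[Example]) -> float:
    """Return c_S = (1/m) * sum over (x, y) in S of I[c(x) != y]."""
    m = len(sample)
    if m == 0:
        raise ValueError("sample must be non-empty")
    return sum(1 for x, y in sample if c(x) != y) / m


def threshold_at_zero(x: float) -> int:
    """Hypothetical classifier: predict +1 for x >= 0 and -1 otherwise."""
    return 1 if x >= 0 else -1


# Illustrative sample; only the second point is misclassified.
sample = [(-2.0, -1), (-0.5, 1), (0.3, 1), (1.7, 1)]
c_S = empirical_error(threshold_at_zero, sample)
```

The true error $c_D$ cannot be computed this way, since $D$ is unknown; the bounds that follow relate it to $c_S$.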

Definitions for main result: Assessing the posterior

- The result is concerned with bounding the performance of a probabilistic classifier that, given a test input $x$, chooses a classifier $c \sim Q$ (the posterior) and returns $c(x)$
- We are interested in the relation between two quantities:

\[ Q_D = E_{c \sim Q}[c_D], \]

the true error rate of the probabilistic classifier, and

\[ Q_S = E_{c \sim Q}[c_S], \]

its empirical error rate
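For a posterior supported on finitely many classifiers, the expectation defining $Q_S$ reduces to a weighted average of the individual empirical errors. The sketch below assumes such a finite posterior; the weights and error values are made up for illustration.

```python
from typing import Sequence


def gibbs_error(q: Sequence[float], errors: Sequence[float]) -> float:
    """Return E_{c ~ Q}[error of c] for a posterior Q on finitely many classifiers."""
    if len(q) != len(errors):
        raise ValueError("q and errors must have the same length")
    if min(q) < 0.0 or abs(sum(q) - 1.0) > 1e-9:
        raise ValueError("q must be a probability vector")
    return sum(qi * ei for qi, ei in zip(q, errors))


# Hypothetical posterior weights and empirical errors c_S of three classifiers.
posterior = [0.5, 0.3, 0.2]
empirical_errors = [0.10, 0.20, 0.40]
Q_S = gibbs_error(posterior, empirical_errors)
```

Passing true errors $c_D$ instead of empirical ones would give $Q_D$ by the same formula.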

Definitions for main result: Generalisation error

Note that this does not directly bound the error of the posterior average, but we have

\[ \Pr_{(x,y) \sim D}\big(\mathrm{sgn}(E_{c \sim Q}[c(x)]) \neq y\big) \le 2 Q_D, \]

since for any point $x$ misclassified by $\mathrm{sgn}(E_{c \sim Q}[c(x)])$ the probability of a random $c \sim Q$ misclassifying it is at least 0.5.
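The factor of 2 can be spelled out in one line. The following is a sketch assuming classifiers take values in $\{-1, +1\}$; the set $M$ of examples misclassified by the averaged classifier is shorthand introduced here, not notation from the slides.

```latex
% M = set of (x, y) on which sgn(E_{c~Q}[c(x)]) differs from y.
% On M, at least half of the Q-mass of classifiers is wrong.
\begin{align*}
Q_D &= E_{(x,y) \sim D}\, \Pr_{c \sim Q}\big(c(x) \neq y\big) \\
    &\ge E_{(x,y) \sim D}\Big[ I[(x,y) \in M] \cdot \Pr_{c \sim Q}\big(c(x) \neq y\big) \Big] \\
    &\ge \tfrac{1}{2} \Pr_{(x,y) \sim D}\big((x,y) \in M\big).
\end{align*}
```

Rearranging gives the stated inequality.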

PAC-Bayes Theorem

Fix an arbitrary $D$, arbitrary prior $P$, and confidence $\delta$. Then with probability at least $1 - \delta$ over samples $S \sim D^m$, all posteriors $Q$ satisfy

\[ KL(Q_S \| Q_D) \le \frac{KL(Q \| P) + \ln((m + 1)/\delta)}{m}, \]

where KL is the KL divergence between distributions,

\[ KL(Q \| P) = E_{c \sim Q}\left[ \ln \frac{Q(c)}{P(c)} \right], \]

with $Q_S$ and $Q_D$ considered as distributions on $\{0, +1\}$.
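The theorem bounds $Q_D$ only implicitly, through a KL divergence between two Bernoulli distributions. One common way to obtain an explicit number is to invert the KL numerically. The sketch below does this by bisection; the function names and the example values of `q_s`, `kl_qp`, `m` and `delta` are illustrative assumptions, not figures from the talk.

```python
import math


def kl_bernoulli(q: float, p: float) -> float:
    """KL(q || p) between Bernoulli distributions with means q and p."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
    total = 0.0
    if q > 0.0:
        total += q * math.log(q / p)
    if q < 1.0:
        total += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return total


def pac_bayes_upper_bound(q_s: float, kl_qp: float, m: int, delta: float) -> float:
    """Largest p >= q_s with KL(q_s || p) <= (kl_qp + ln((m + 1) / delta)) / m."""
    rhs = (kl_qp + math.log((m + 1) / delta)) / m
    lo, hi = q_s, 1.0
    # KL(q_s || p) is increasing in p on [q_s, 1], so bisection applies.
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if kl_bernoulli(q_s, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return lo


# Illustrative values only.
bound = pac_bayes_upper_bound(q_s=0.1, kl_qp=5.0, m=1000, delta=0.05)
```

With probability at least $1 - \delta$, the returned value upper-bounds $Q_D$ for the given $Q_S$ and $KL(Q \| P)$. A smaller $KL(Q \| P)$ tightens the bound, which is the motivation for choosing the prior carefully.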

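Ingredients of proof (1/3)

\[ \Pr_{S \sim D^m}\left( E_{c \sim P}\, \frac{1}{\Pr_{S' \sim D^m}(c_S = c_{S'})} \le \frac{m + 1}{\delta} \right) \ge 1 - \delta \]

This follows from splitting the expectation according to the probability of each particular empirical error value $k$, for any $c$:

\[ E_{S \sim D^m}\, \frac{1}{\Pr_{S' \sim D^m}(c_S = c_{S'})} = \sum_k \Pr_{S \sim D^m}(c_S = k)\, \frac{1}{\Pr_{S' \sim D^m}(c_{S'} = k)} = m + 1. \]

Taking expectations with respect to $c$ and reversing the order of the expectations gives

\[ E_{S \sim D^m} E_{c \sim P}\, \frac{1}{\Pr_{S' \sim D^m}(c_S = c_{S'})} = m + 1, \]

and the result follows from Markov's inequality.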
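The identity behind this step can be checked numerically. For a fixed classifier with true error $p$, the number of training errors is Binomial($m$, $p$), so the empirical error takes $m + 1$ possible values and each contributes exactly 1 to the sum. The sketch below assumes $0 < p < 1$; the values `m=20` and `p=0.3` are arbitrary.

```python
import math


def binomial_pmf(k: int, m: int, p: float) -> float:
    """Probability of exactly k errors in m i.i.d. trials with error rate p."""
    return math.comb(m, k) * p**k * (1.0 - p) ** (m - k)


def expected_inverse_probability(m: int, p: float) -> float:
    """E_S[ 1 / Pr_{S'}(c_{S'} = c_S) ] when the error count is Binomial(m, p)."""
    total = 0.0
    for k in range(m + 1):
        pk = binomial_pmf(k, m, p)
        if pk > 0.0:  # values outside the support contribute nothing
            total += pk * (1.0 / pk)
    return total


# Arbitrary illustrative values; the result should be m + 1 up to rounding.
value = expected_inverse_probability(m=20, p=0.3)
```

For a degenerate classifier ($p = 0$ or $p = 1$) the support shrinks and the sum is smaller than $m + 1$, which only helps the bound.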

Ingredients of proof (2/3)

\[ \frac{1}{m}\, E_{c \sim Q} \ln \frac{1}{\Pr_{S' \sim D^m}(c_S = c_{S'})} \ge KL(Q_S \| Q_D) \]

This follows by considering the probabilities that the two empirical estimates are equal, applying the relative entropy Chernoff bound, and then using the joint convexity of the KL divergence as a function of both arguments.
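The relative entropy Chernoff bound used here says that the probability of observing exactly $k$ errors out of $m$ is at most $\exp(-m\, KL(k/m \| p))$, where $p$ is the true error. The following sketch checks that inequality numerically for every $k$; the helper names and the tested values are illustrative, and $0 < p < 1$ is assumed.

```python
import math


def kl_bernoulli(q: float, p: float) -> float:
    """KL(q || p) between Bernoulli distributions, assuming 0 < p < 1."""
    total = 0.0
    if q > 0.0:
        total += q * math.log(q / p)
    if q < 1.0:
        total += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return total


def chernoff_holds(m: int, p: float) -> bool:
    """Check Pr(Binomial(m, p) = k) <= exp(-m * KL(k/m || p)) for every k."""
    for k in range(m + 1):
        pmf = math.comb(m, k) * p**k * (1.0 - p) ** (m - k)
        bound = math.exp(-m * kl_bernoulli(k / m, p))
        # Small relative tolerance: equality holds at k = 0 and k = m.
        if pmf > bound * (1.0 + 1e-9):
            return False
    return True


ok = chernoff_holds(m=30, p=0.2)
```

Taking logarithms gives $\ln\big(1/\Pr(c_{S'} = c_S)\big) \ge m\, KL(c_S \| c_D)$ for each $c$, and averaging over $c \sim Q$ with Jensen's inequality yields the displayed statement.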

Ingredients of proof (3/3)

Consider the distribution

\[ P_G(c) = \frac{\dfrac{1}{\Pr_{S' \sim D^m}(c_{S'} = c_S)}}{E_{d \sim P}\, \dfrac{1}{\Pr_{S' \sim D^m}(d_S = d_{S'})}}\; P(c) \]

Ingredients of proof (3/3)

\[ 0 \le KL(Q \| P_G) = KL(Q \| P) - E_{c \sim Q} \ln \frac{1}{\Pr_{S' \sim D^m}(c_{S'} = c_S)} + \ln E_{d \sim P}\, \frac{1}{\Pr_{S' \sim D^m}(d_S = d_{S'})} \]
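The expansion of $KL(Q \| P_G)$ follows directly from the definition of $P_G$. A short derivation, using the shorthand $w(c)$ and $Z$ (introduced here for readability, not taken from the slides):

```latex
% Shorthand (assumed notation):
%   w(c) = 1 / Pr_{S' ~ D^m}(c_{S'} = c_S),   Z = E_{d ~ P}[w(d)],
% so that P_G(c) = w(c) P(c) / Z.
\begin{align*}
KL(Q \| P_G)
  &= E_{c \sim Q} \ln \frac{Q(c)}{P_G(c)}
   = E_{c \sim Q} \ln \frac{Q(c)\, Z}{w(c)\, P(c)} \\
  &= E_{c \sim Q} \ln \frac{Q(c)}{P(c)} - E_{c \sim Q} \ln w(c) + \ln Z \\
  &= KL(Q \| P) - E_{c \sim Q} \ln w(c) + \ln Z .
\end{align*}
```

Non-negativity of the KL divergence then gives $E_{c \sim Q} \ln w(c) \le KL(Q \| P) + \ln Z$.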

Ingredients of proof (3/3)

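\[ m\, KL(Q_S \| Q_D) \le E_{c \sim Q} \ln \frac{1}{\Pr_{S' \sim D^m}(c_{S'} = c_S)} \le KL(Q \| P) + \ln E_{d \sim P}\, \frac{1}{\Pr_{S' \sim D^m}(d_S = d_{S'})} \le KL(Q \| P) + \ln \frac{m + 1}{\delta} \]

with probability at least $1 - \delta$.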

Finite Classes

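If we take a finite class of functions $h_1, \ldots, h_N$ with prior distribution $p_1, \ldots, p_N$ and assume that the posterior is concentrated on a single function $h_i$, the generalisation error is bounded by

\[ KL\big(\widehat{\mathrm{err}}(h_i) \,\|\, \mathrm{err}(h_i)\big) \le \frac{-\ln(p_i) + \ln((m + 1)/\delta)}{m}, \]

where $\widehat{\mathrm{err}}(h_i)$ is the empirical error and $\mathrm{err}(h_i)$ the true error.

This is the standard result for finite classes, with the slight refinement that it involves the KL divergence between empirical and true error, and the extra $\ln(m + 1)$ term on the right-hand side.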
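To see the size of the finite-class bound, the right-hand side can be evaluated directly and then relaxed to an explicit bound on the true error. The sketch below uses Pinsker's inequality, $KL(q \| p) \ge 2(q - p)^2$, as the relaxation; that step is a looser alternative chosen here for simplicity, not something from the slides, and the uniform prior and numerical values are hypothetical.

```python
import math


def finite_class_rhs(prior_mass: float, m: int, delta: float) -> float:
    """Right-hand side (-ln p_i + ln((m + 1) / delta)) / m of the finite-class bound."""
    return (-math.log(prior_mass) + math.log((m + 1) / delta)) / m


def pinsker_upper_bound(emp_err: float, rhs: float) -> float:
    """Explicit but looser bound on the true error via KL(q || p) >= 2 (q - p)^2."""
    return min(1.0, emp_err + math.sqrt(rhs / 2.0))


# Hypothetical setting: uniform prior over N = 1000 hypotheses.
N = 1000
rhs = finite_class_rhs(prior_mass=1.0 / N, m=5000, delta=0.05)
bound = pinsker_upper_bound(emp_err=0.08, rhs=rhs)
```

Inverting the KL divergence exactly, rather than through Pinsker's inequality, gives a tighter value, particularly when the empirical error is close to zero.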

Other extensions/applications

- Matthias Seeger developed the theory for bounding the error of a Gaussian process classifier.
- Olivier Catoni has extended the result to exchangeable distributions, enabling him to get a PAC-Bayes version of Vapnik-Chervonenkis bounds.
- Germain et al. have extended it to more general loss functions than just binary.
- David McAllester has extended the approach to structured output learning.

Linear classifiers and SVMs

- Focus in on the linear function application (Langford & S-T)
- How the application is made
- Extensions to learning the prior
- Some results on UCI datasets to give an idea of what can be achieved

Linear classifiers

- We will choose the prior and posterior distributions to be Gaussians with unit variance.
- The prior $P$ will be centred at the origin with unit variance.
- The centre of the posterior $Q(w, \mu)$ will be specified by a unit vector $w$ and a scale factor $\mu$.
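With this choice the KL term in the PAC-Bayes theorem has a closed form. The slides do not state it at this point, so the following is the standard identity for two Gaussians with identity covariance, written under that assumption:

```latex
% P = N(0, I),  Q(w, mu) = N(mu * w, I),  with ||w|| = 1.
% For equal covariances the KL divergence is half the squared distance
% between the means.
\[
KL\big(Q(w, \mu) \,\|\, P\big)
  = \tfrac{1}{2} \,\| \mu w - 0 \|^2
  = \tfrac{1}{2}\, \mu^2 .
\]
```

So moving the posterior further from the origin (larger $\mu$) costs $\mu^2 / 2$ in the complexity term, and because the covariances match the divergence is the same in either direction.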

PAC-Bayes Bound for SVM (1/2)

[Figure: weight space W, showing the prior P centred at the origin 0 and the posterior Q centred at distance µ from the origin in the direction w.]

- Prior P is Gaussian N(0, 1)
- Posterior is in the direction w, at distance µ from the origin
- Posterior Q is Gaussian


PAC-Bayes Bound for SVM (2/2)

The performance of linear classifiers may be bounded by

$$\mathrm{KL}\big(Q_S(w,\mu)\,\|\,Q_D(w,\mu)\big) \le \frac{\mathrm{KL}\big(P\,\|\,Q(w,\mu)\big) + \ln\frac{m+1}{\delta}}{m}$$

$Q_D(w,\mu)$ is the true performance of the stochastic classifier.
The SVM is a deterministic classifier that exactly corresponds to $\mathrm{sgn}\big(\mathbb{E}_{c\sim Q(w,\mu)}[c(x)]\big)$, as the centre of the Gaussian gives the same classification as the halfspace with more weight.
Hence its error is bounded by $2Q_D(w,\mu)$, since, as observed above, if $x$ is misclassified then at least half of the $c \sim Q$ err.


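$Q_S(w,\mu)$ is a stochastic measure of the training error:

$$Q_S(w,\mu) = \mathbb{E}_m\big[F(\mu\gamma(x,y))\big]$$

$$\gamma(x,y) = \frac{y\,w^T\phi(x)}{\|\phi(x)\|\,\|w\|}$$

$$F(t) = 1 - \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{t} e^{-x^2/2}\,dx$$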
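The stochastic training error can be evaluated directly from the normalised margins. The following is a minimal sketch, assuming a linear feature map $\phi(x) = x$ and data held in plain Python lists; the function names are illustrative and do not come from the slides.

```python
import math

def gaussian_tail(t):
    """F(t) = 1 - Phi(t): upper tail of the standard normal distribution."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def normalised_margin(w, x, y):
    """gamma(x, y) = y <w, x> / (||x|| ||w||), assuming phi(x) = x."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    return y * dot / (norm_x * norm_w)

def stochastic_error(w, mu, data):
    """Q_S(w, mu): average of F(mu * gamma(x, y)) over the sample."""
    total = sum(gaussian_tail(mu * normalised_margin(w, x, y)) for x, y in data)
    return total / len(data)
```

With $\mu = 0$ every point contributes $F(0) = 1/2$, so the posterior carries no information about $w$; increasing $\mu$ drives the contribution of correctly classified points towards zero.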


Prior P ≡ Gaussian centred on the origin
Posterior Q ≡ Gaussian along $w$ at a distance $\mu$ from the origin
$\mathrm{KL}(P\|Q) = \mu^2/2$
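The value $\mu^2/2$ follows from the standard closed form for two Gaussians with identity covariance. A short worked step, assuming $P = \mathcal{N}(0, I)$ and $Q = \mathcal{N}(\mu\hat{w}, I)$ with $\hat{w} = w/\|w\|$ (the unit-vector notation $\hat{w}$ is introduced here only for illustration):

$$
\mathrm{KL}\big(\mathcal{N}(a, I)\,\|\,\mathcal{N}(b, I)\big) = \tfrac{1}{2}\|a - b\|^2
\quad\Longrightarrow\quad
\mathrm{KL}(P\|Q) = \tfrac{1}{2}\|0 - \mu\hat{w}\|^2 = \frac{\mu^2}{2}
$$

Because the two covariances are equal, the divergence is symmetric in P and Q in this case.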


$\delta$ is the confidence
The bound holds with probability $1 - \delta$ over the random i.i.d. selection of the training data.
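The bound constrains $Q_D$ only implicitly, through a KL divergence between two error rates, so a numerical value has to be obtained by inverting it. A minimal sketch of one way to do this, assuming bisection on the Bernoulli KL divergence; the helper names are hypothetical.

```python
import math

def binary_kl(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
    total = 0.0
    if q > 0.0:
        total += q * math.log(q / p)
    if q < 1.0:
        total += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return total

def kl_upper_inverse(q_s, bound, tol=1e-9):
    """Largest p >= q_s with KL(q_s || p) <= bound, found by bisection."""
    lo, hi = q_s, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_kl(q_s, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

def pac_bayes_svm_bound(q_s, mu, m, delta):
    """Upper bound on Q_D from KL(Q_S || Q_D) <= (mu^2/2 + ln((m+1)/delta)) / m."""
    rhs = (0.5 * mu * mu + math.log((m + 1) / delta)) / m
    return kl_upper_inverse(q_s, rhs)
```

Bisection is valid because $\mathrm{KL}(q\|p)$ is increasing in $p$ for $p \ge q$. Doubling the returned value gives the corresponding bound on the deterministic SVM, following the factor of 2 argued earlier.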


Learning the prior (1/3)

Bound depends on the distance between prior and posterior
Better prior (closer to posterior) would lead to tighter bound
Learn the prior P with part of the data
Introduce the learnt prior in the bound
Compute stochastic error with remaining data


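New prior for the SVM (3/3)

[Figure: weight space W with origin 0, showing the prior P along $w_r$, the posterior Q along $w$, the scaling $\mu$, and the distance between the two distributions]

Solve SVM with subset of patterns
Prior in the direction $w_r$
Posterior like PAC-Bayes Bound
New bound proportional to $\mathrm{KL}(P\|Q)$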


New Bound for the SVM (2/3)

SVM performance may be tightly bounded by

$$\mathrm{KL}\big(Q_S(w,\mu)\,\|\,Q_D(w,\mu)\big) \le \frac{0.5\,\|\mu w - \eta w_r\|^2 + \ln\frac{(m-r+1)J}{\delta}}{m-r}$$

$Q_D(w,\mu)$ is the true performance of the classifier


$Q_S(w,\mu)$ is a stochastic measure of the training error on the remaining data:

$$Q_S(w,\mu) = \mathbb{E}_{m-r}\big[F(\mu\gamma(x,y))\big]$$


$0.5\,\|\mu w - \eta w_r\|^2$ is the distance between prior and posterior


Penalty term only dependent on the remaining data m − r
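The right-hand side of the new bound is simple to evaluate once $w$ and $w_r$ are available. A minimal sketch, assuming both vectors are passed as plain Python lists already scaled as in the bound, and that $J$, which the slides do not define, is supplied by the caller; the function name is illustrative.

```python
import math

def prior_bound_rhs(w, w_r, mu, eta, m, r, J, delta):
    """(0.5 * ||mu*w - eta*w_r||^2 + ln((m - r + 1) * J / delta)) / (m - r)."""
    sq_dist = sum((mu * a - eta * b) ** 2 for a, b in zip(w, w_r))
    return (0.5 * sq_dist + math.log((m - r + 1) * J / delta)) / (m - r)
```

When the posterior centre $\mu w$ coincides with the prior centre $\eta w_r$, the distance term vanishes and only the logarithmic penalty over the $m - r$ remaining points is left. Setting `eta = 0` recovers a prior centred on the origin.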


Prior-SVM

New bound proportional to $\|\mu w - \eta w_r\|^2$
Classifier that optimises the bound
Optimisation problem to determine the p-SVM:

$$\min_{w,\,\xi_i}\ \Big[\tfrac{1}{2}\|w - w_r\|^2 + C\sum_{i=1}^{m-r}\xi_i\Big]$$

$$\text{s.t.}\quad y_i\,w^T\phi(x_i) \ge 1 - \xi_i, \qquad i = 1,\dots,m-r$$

$$\xi_i \ge 0, \qquad i = 1,\dots,m-r$$

The p-SVM is only solved with the remaining points
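Eliminating the slack variables turns the p-SVM problem into minimising $\tfrac{1}{2}\|w - w_r\|^2 + C\sum_i \max(0, 1 - y_i w^T\phi(x_i))$. The sketch below minimises that form by plain subgradient descent, which is a stand-in for the quadratic-programming solver one would normally use and is not the method from the slides. It assumes $\phi(x) = x$; the function name, step size and iteration count are illustrative choices.

```python
import math

def p_svm_subgradient(data, w_r, C=1.0, steps=2000, lr=0.01):
    """Approximately minimise 0.5*||w - w_r||^2 + C * sum_i hinge(y_i <w, x_i>)."""
    w = list(w_r)  # start at the prior centre w_r
    for t in range(1, steps + 1):
        # Gradient of the regulariser that pulls w towards w_r
        grad = [wi - ri for wi, ri in zip(w, w_r)]
        # Subgradient of the hinge loss for points inside the margin
        for x, y in data:
            if y * sum(wi * xi for wi, xi in zip(w, x)) < 1.0:
                for k, xk in enumerate(x):
                    grad[k] -= C * y * xk
        step = lr / math.sqrt(t)  # decaying step size
        w = [wi - step * gi for wi, gi in zip(w, grad)]
    return w
```

If $w_r$ already classifies every remaining point with margin at least 1, the solution stays at $w_r$; otherwise $w$ moves away from the prior only as far as the hinge loss demands.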


Bound for p-SVM

1 Determine the prior with a subset of the training examples to obtain $w_r$
2 Solve p-SVM and obtain $w$
3 Margin for the stochastic classifier $Q_S$:

$$\gamma(x_j, y_j) = \frac{y_j\,w^T\phi(x_j)}{\|\phi(x_j)\|\,\|w\|}, \qquad j = 1,\dots,m-r$$

4 Linear search to obtain the optimal value of $\mu$. This introduces an insignificant extra penalty term
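Steps 3 and 4 can be sketched end to end as follows. This is a rough illustration under stated assumptions: $\phi(x) = x$, $w$ and $w_r$ are normalised to unit length before use, $\eta$, $J$ and the grid of $\mu$ values are supplied by the caller, and the extra penalty for searching over $\mu$ mentioned in step 4 is not modelled. All function names are hypothetical.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _unit(v):
    n = math.sqrt(_dot(v, v))
    return [x / n for x in v]

def _tail(t):
    """F(t) = 1 - Phi(t) for the standard normal."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def _bernoulli_kl(q, p):
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    out = 0.0
    if q > 0.0:
        out += q * math.log(q / p)
    if q < 1.0:
        out += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return out

def _kl_inverse(q, c):
    """Largest p >= q with KL(q || p) <= c, by bisection."""
    lo, hi = q, 1.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if _bernoulli_kl(q, mid) <= c:
            lo = mid
        else:
            hi = mid
    return lo

def search_mu(w, w_r, remaining, eta, J, delta, mus):
    """Return (mu, bound on Q_D) minimising the bound over the grid mus."""
    w_hat, wr_hat = _unit(w), _unit(w_r)
    n = len(remaining)  # n plays the role of m - r
    # Step 3: normalised margins on the remaining data
    margins = [y * _dot(w_hat, x) / math.sqrt(_dot(x, x)) for x, y in remaining]
    best = None
    for mu in mus:  # Step 4: search over mu
        q_s = sum(_tail(mu * g) for g in margins) / n
        kl_term = 0.5 * sum((mu * a - eta * b) ** 2 for a, b in zip(w_hat, wr_hat))
        rhs = (kl_term + math.log((n + 1) * J / delta)) / n
        bound = _kl_inverse(q_s, rhs)
        if best is None or bound < best[1]:
            best = (mu, bound)
    return best
```

The search trades off two effects: a larger $\mu$ lowers the stochastic training error $Q_S$, while moving $\mu w$ away from $\eta w_r$ raises the distance term.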


η-Prior-SVM

Consider using a prior distribution P that is elongated in the direction of $w_r$
This will mean that there is low penalty for large projections onto this direction
Translates into an optimisation:

$$\min_{v,\,\eta,\,\xi_i}\ \Big[\tfrac{1}{2}\|v\|^2 + C\sum_{i=1}^{m-r}\xi_i\Big]$$

subject to

$$y_i\,(v + \eta w_r)^T\phi(x_i) \ge 1 - \xi_i, \qquad i = 1,\dots,m-r$$

$$\xi_i \ge 0, \qquad i = 1,\dots,m-r$$


Bound for η-prior-SVM

Prior is elongated along the line of $w_r$ but spherical with variance 1 in other directions
Posterior again on the line of $w$ at a distance $\mu$ chosen to optimise the bound.
Resulting bound depends on a benign parameter $\tau$ determining the variance in the direction $w_r$

$$\mathrm{KL}\big(Q_{S\setminus R}(w,\mu)\,\|\,Q_D(w,\mu)\big) \le \frac{0.5\Big(\ln(\tau^2) + \tau^{-2} - 1 + P^{\parallel}_{w_r}(\mu w - w_r)^2/\tau^2 + P^{\perp}_{w_r}(\mu w)^2\Big) + \ln\frac{m-r+1}{\delta}}{m-r}$$
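The divergence term in the numerator can be computed by splitting $\mu w$ into its components along and orthogonal to $w_r$. A minimal sketch, assuming $P^{\parallel}_{w_r}$ and $P^{\perp}_{w_r}$ denote the lengths of the projections onto the direction of $w_r$ and onto its orthogonal complement; the function name is illustrative.

```python
import math

def elongated_prior_kl(w, w_r, mu, tau):
    """0.5*(ln(tau^2) + tau^-2 - 1 + par^2/tau^2 + perp^2) for posterior centre mu*w."""
    norm_wr = math.sqrt(sum(b * b for b in w_r))
    u = [b / norm_wr for b in w_r]  # unit vector along w_r
    post = [mu * a for a in w]  # posterior centre mu * w
    # Component of (mu*w - w_r) along w_r
    par = sum((p - b) * ui for p, b, ui in zip(post, w_r, u))
    # Squared length of the part of mu*w orthogonal to w_r
    along = sum(p * ui for p, ui in zip(post, u))
    perp_sq = sum(p * p for p in post) - along ** 2
    return 0.5 * (math.log(tau ** 2) + tau ** -2 - 1.0 + par ** 2 / tau ** 2 + perp_sq)
```

With $\tau = 1$ the expression reduces to $0.5\,\|\mu w - w_r\|^2$, the spherical-prior case. A larger $\tau$ discounts displacement along $w_r$ at the cost of the $\ln(\tau^2)$ term.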


Model Selection with the new bound: setup

Comparison of X-fold cross-validation (X-F XV), the PAC-Bayes Bound and the Prior PAC-Bayes Bound
UCI datasets
Select C and σ that lead to minimum Classification Error (CE)
For X-F XV, select the pair that minimises the validation error
For the PAC-Bayes Bound and the Prior PAC-Bayes Bound, select the pair that minimises the bound


Description of the Datasets

Problem              # samples   input dim.   Pos/Neg
Handwritten-digits   5620        64           2791 / 2829
Waveform             5000        21           1647 / 3353
Pima                 768         8            268 / 500
Ringnorm             7400        20           3664 / 3736
Spam                 4601        57           1813 / 2788

Table: Description of datasets in terms of number of patterns, number of input variables and number of positive/negative examples.


Results

Classifier           SVM                               ηPrior SVM
Problem              2FCV    10FCV   PAC     PrPAC     PrPAC   τ-PrPAC
digits     Bound     –       –       0.175   0.107     0.050   0.047
           CE        0.007   0.007   0.007   0.014     0.010   0.009
waveform   Bound     –       –       0.203   0.185     0.178   0.176
           CE        0.090   0.086   0.084   0.088     0.087   0.086
pima       Bound     –       –       0.424   0.420     0.428   0.416
           CE        0.244   0.245   0.229   0.229     0.233   0.233
ringnorm   Bound     –       –       0.203   0.110     0.053   0.050
           CE        0.016   0.016   0.018   0.018     0.016   0.016
spam       Bound     –       –       0.254   0.198     0.186   0.178
           CE        0.066   0.063   0.067   0.077     0.070   0.072


Concluding remarks

Frequentist (PAC) and Bayesian approaches to analysing learning lead to the introduction of the PAC-Bayes bound
Detailed look at the ingredients of the theory
Application to bound the performance of an SVM
Investigation of learning of the prior of the distribution of classifiers
Experiments show the new bound can be tighter ...
... and reliable for low-cost model selection
p-SVM and η-p-SVM: classifiers that optimise the new bound
