
PAC-Bayes Analysis: Background and Applications

John Shawe-Taylor, University College London

Chicago/TTI Workshop, June 2009

Including joint work with John Langford, Amiran Ambroladze, Emilio Parrado-Hernández, Cédric Archambeau, Matthew Higgs, and Manfred Opper.


Aims

Hope to give you:
- The PAC-Bayes framework
- The core result
- How to apply it to Support Vector Machines
- Application to maximum entropy classification
- Application to Gaussian Processes and dynamical systems modeling


Outline

1. Background to Approach
2. PAC-Bayes Analysis
   - Definitions
   - PAC-Bayes Theorem
   - Applications
3. Linear Classifiers
   - General Approach
   - Learning the prior
4. Maximum entropy classification
   - Generalisation
   - Optimisation
5. GPs and SDEs
   - Gaussian Process regression
   - Variational approximation
   - Generalisation


General perspectives

The goal of different theories is to capture the key elements that enable an understanding and analysis of different phenomena.
- There are several theories of machine learning: notably Bayesian and frequentist.
- Different assumptions, and hence different ranges of applicability and ranges of results.
- Bayesian: able to make more detailed probabilistic predictions.
- Frequentist: makes only the i.i.d. assumption.


Historical notes: Frequentist approach

- Pioneered in Russia by Vapnik and Chervonenkis.
- Introduced in the West by Valiant under the name 'probably approximately correct'.
- Typical results state that, with probability at least 1 − δ (probably), any classifier from the hypothesis class which has low training error will have low generalisation error (approximately correct).
- Has the status of a statistical test: the confidence is denoted by δ, the probability that the sample is misleading/unusual.
- SVM bound using the luckiness framework by Shawe-Taylor et al. (1998).


Historical notes: Bayesian approach

- The name derives from Bayes' theorem: we assume a prior distribution over functions or classifiers and then use Bayes' rule to update the prior based on the likelihood of the data for each function.
- This gives the posterior distribution: a Bayesian will classify according to the expected classification under the posterior, the best strategy given that the prior is correct.
- Can be used for model selection by evaluating the 'evidence' for a model (see for example David MacKay); this is related to the volume of version space consistent with the data.
- Gaussian processes for regression are justified within this model.


Version space: evidence

[Figure: the version space, the region of weight vectors consistent with the training data, bounded by the hyperplanes f(x1,w) = 0, ..., f(x4,w) = 0; two consistent weight vectors w and w' are shown, together with inscribed spheres C1, C2, C3.]


Evidence and generalisation

- A link between evidence and generalisation was hypothesised by MacKay.
- The first formal link was obtained by Shawe-Taylor & Williamson (1997): PAC Analysis of a Bayes Estimator.
- Bound on generalisation in terms of the volume of the sphere that can be inscribed in the version space; it included a dependence on the dimensionality of the space.
- Used the luckiness framework, a data-dependent style of frequentist bound also used to bound the generalisation of SVMs, for which no dependence on the dimensionality is needed, just on the margin.


PAC-Bayes Theorem

- First version proved by McAllester in 1999.
- Improved proof and bound due to Seeger in 2002, with application to Gaussian processes.
- Application to SVMs by Langford and Shawe-Taylor, also in 2002.
- An excellent tutorial by Langford appeared in 2005 in JMLR.


Definitions for main result: Prior and posterior distributions

- The PAC-Bayes theorem involves a class of classifiers C together with a prior distribution P and a posterior Q over C.
- The distribution P must be chosen before learning, but the bound holds for all choices of Q; hence Q does not need to be the classical Bayesian posterior.
- The bound holds for all (prior) choices of P; hence its validity is not affected by a poor choice of P, though the quality of the resulting bound may be. Contrast with standard Bayesian analysis, which only holds if the prior assumptions are correct.


Definitions for main result: Error measures

- Being a frequentist (PAC) style result, we assume an unknown distribution D on the input space X.
- D is used to generate the labelled training samples i.i.d., i.e. S ∼ D^m.
- It is also used to measure the generalisation error c_D of a classifier c:

\[ c_D = \Pr_{(x,y)\sim D}\big(c(x) \neq y\big). \]

- The empirical generalisation error is denoted c_S:

\[ c_S = \frac{1}{m} \sum_{(x,y)\in S} I[c(x) \neq y], \]

where I[·] is the indicator function.
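As a quick illustration (not from the slides), a minimal Python sketch of the empirical error, assuming a classifier clf given as a vectorised function of the inputs:

import numpy as np

def empirical_error(clf, X, y):
    # c_S: the fraction of the m training points that c misclassifies.
    return np.mean(clf(X) != y)

The true error c_D is the same average taken under the unknown distribution D, so in practice it can only be estimated from a fresh sample.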


Definitions for main result: Assessing the posterior

- The result is concerned with bounding the performance of a probabilistic classifier that, given a test input x, chooses a classifier c ∼ Q (the posterior) and returns c(x).
- We are interested in the relation between two quantities:

\[ Q_D = \mathrm{E}_{c\sim Q}[c_D], \]

the true error rate of the probabilistic classifier, and

\[ Q_S = \mathrm{E}_{c\sim Q}[c_S], \]

its empirical error rate.
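A minimal Monte Carlo sketch (names hypothetical) of this Gibbs classifier and of estimating Q_S, assuming sample_Q() draws a classifier c ∼ Q as a vectorised callable:

import numpy as np

def gibbs_predict(sample_Q, x):
    # The stochastic classifier: draw c ~ Q afresh and return c(x).
    c = sample_Q()
    return c(x)

def estimate_QS(sample_Q, X, y, n_draws=1000):
    # Monte Carlo estimate of Q_S = E_{c~Q}[c_S].
    return np.mean([np.mean(sample_Q()(X) != y) for _ in range(n_draws)])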


Definitions for main result: Generalisation error

Note that this does not bound the posterior average, but we have

\[ \Pr_{(x,y)\sim D}\big(\mathrm{sgn}(\mathrm{E}_{c\sim Q}[c(x)]) \neq y\big) \;\le\; 2\,Q_D, \]

since for any point x misclassified by sgn(E_{c∼Q}[c(x)]), the probability of a random c ∼ Q misclassifying it is at least 0.5.


PAC-Bayes Theorem

Fix an arbitrary D, an arbitrary prior P, and a confidence δ. Then with probability at least 1 − δ over samples S ∼ D^m, all posteriors Q satisfy

\[ \mathrm{KL}(Q_S \,\|\, Q_D) \;\le\; \frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{m+1}{\delta}}{m}, \]

where KL is the KL divergence between distributions,

\[ \mathrm{KL}(Q\,\|\,P) = \mathrm{E}_{c\sim Q}\left[\ln\frac{Q(c)}{P(c)}\right], \]

with Q_S and Q_D considered as (Bernoulli) distributions on {0, 1}.
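The two sides of the theorem are easy to compute; a minimal sketch, treating Q_S and Q_D as Bernoulli parameters in [0, 1]:

import numpy as np

def kl_bernoulli(q, p):
    # KL(q || p) between Bernoulli(q) and Bernoulli(p), clipped for numerical safety.
    eps = 1e-12
    q, p = np.clip(q, eps, 1 - eps), np.clip(p, eps, 1 - eps)
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def pac_bayes_rhs(kl_QP, m, delta):
    # Right-hand side of the theorem: (KL(Q||P) + ln((m+1)/delta)) / m.
    return (kl_QP + np.log((m + 1) / delta)) / m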


Finite Classes

If we take a finite class of functions h_1, ..., h_N with prior distribution p_1, ..., p_N and assume that the posterior is concentrated on a single function h_i, the generalisation is bounded by

\[ \mathrm{KL}\big(\widehat{\mathrm{err}}(h_i) \,\|\, \mathrm{err}(h_i)\big) \;\le\; \frac{\ln(1/p_i) + \ln((m+1)/\delta)}{m}. \]

This is the standard result for finite classes, with the slight refinement that it involves the KL divergence between empirical and true error, and the extra ln(m + 1) term on the right-hand side.
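For a feel of the numbers, the right-hand side under a uniform prior p_i = 1/N (illustrative values only):

import numpy as np

N, m, delta = 1000, 10000, 0.05
rhs = (np.log(N) + np.log((m + 1) / delta)) / m
print(rhs)  # about 0.0019: the KL between empirical and true error is at most this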


Linear classifiers and SVMs

- Focus now on the linear function application (Langford & Shawe-Taylor).
- How the application is made.
- Extensions to learning the prior.
- Some results on UCI datasets to give an idea of what can be achieved.


Linear classifiers

- We will choose the prior and posterior distributions to be Gaussians with unit variance.
- The prior P will be centred at the origin with unit variance.
- The centre of the posterior Q(w, µ) will be specified by a unit vector w and a scale factor µ.


PAC-Bayes Bound for SVM (1/2)

[Figure: weight space W, showing the prior P, a Gaussian N(0, 1) centred at the origin, and the posterior Q, a Gaussian centred at distance µ from the origin in the direction w.]

- The prior P is Gaussian N(0, 1).
- The posterior Q is Gaussian, centred in the direction w at distance µ from the origin.

PAC-Bayes Bound for SVM (2/2)

The performance of linear classifiers may be bounded by

\[ \mathrm{KL}\big(Q_S(w,\mu) \,\|\, Q_D(w,\mu)\big) \;\le\; \frac{\mathrm{KL}\big(P\,\|\,Q(w,\mu)\big) + \ln\frac{m+1}{\delta}}{m}. \]

- Q_D(w, µ) is the true performance of the stochastic classifier.
- The SVM is a deterministic classifier that corresponds exactly to sgn(E_{c∼Q(w,µ)}[c(x)]), as the centre of the Gaussian gives the same classification as the halfspace with more weight. Hence its error is bounded by 2 Q_D(w, µ), since, as observed above, if x is misclassified then at least half of the classifiers c ∼ Q err.


- Q_S(w, µ) is a stochastic measure of the training error:

\[ Q_S(w,\mu) = \mathrm{E}_m[F(\mu\,\gamma(x,y))], \qquad \gamma(x,y) = \frac{y\,w^\top\phi(x)}{\|\phi(x)\|\,\|w\|}, \qquad F(t) = 1 - \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{t} e^{-x^2/2}\,dx. \]


- The prior P ≡ Gaussian centred on the origin; the posterior Q ≡ Gaussian along w at a distance µ from the origin; hence KL(P‖Q) = µ²/2.


- δ is the confidence: the bound holds with probability 1 − δ over the random i.i.d. selection of the training data.
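Putting the pieces together, a minimal sketch (assumptions: a learned weight vector w, a feature matrix Phi with one row per example φ(x_i), labels y in {−1, +1}) of Q_S(w, µ) and the right-hand side of the bound:

import numpy as np
from scipy.stats import norm

def stochastic_train_error(w, Phi, y, mu):
    # Q_S(w, mu) = E_m[F(mu * gamma(x, y))]; F(t) is the Gaussian upper tail, norm.sf.
    gamma = y * (Phi @ w) / (np.linalg.norm(Phi, axis=1) * np.linalg.norm(w))
    return np.mean(norm.sf(mu * gamma))

def bound_rhs(mu, m, delta):
    # (KL(P||Q(w, mu)) + ln((m+1)/delta)) / m, using KL(P||Q) = mu^2 / 2.
    return (mu**2 / 2 + np.log((m + 1) / delta)) / m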


Form of the SVM bound

- Note that the bound holds for all posterior distributions, so we can choose µ to optimise the bound.
- If we define the inverse of the KL by

\[ \mathrm{KL}^{-1}(q, A) = \max\{p : \mathrm{KL}(q\,\|\,p) \le A\}, \]

then with probability at least 1 − δ,

\[ \Pr\big(\langle w, \phi(x)\rangle \neq y\big) \;\le\; 2\,\min_{\mu}\,\mathrm{KL}^{-1}\!\left(\mathrm{E}_m[F(\mu\,\gamma(x,y))],\; \frac{\mu^2/2 + \ln\frac{m+1}{\delta}}{m}\right). \]
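KL^{-1} has no closed form, but KL(q‖p) is increasing in p for p ≥ q, so a bisection suffices; a minimal sketch:

import numpy as np

def kl_bernoulli(q, p):
    eps = 1e-12
    q, p = np.clip(q, eps, 1 - eps), np.clip(p, eps, 1 - eps)
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def kl_inverse(q, A, tol=1e-9):
    # KL^{-1}(q, A) = max{p in [q, 1] : KL(q||p) <= A}, found by bisection.
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(q, mid) <= A:
            lo = mid
        else:
            hi = mid
    return lo

The bound itself is then a simple line search over µ, taking the smallest value of 2 * kl_inverse(QS(mu), (mu**2 / 2 + np.log((m + 1) / delta)) / m).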


Gives SVM Optimisation

Primal form:

\[ \min_{w,\,\xi}\;\; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i \quad \text{s.t.}\quad y_i\,w^\top\phi(x_i) \ge 1 - \xi_i,\;\; \xi_i \ge 0,\;\; i = 1,\dots,m. \]

Dual form:

\[ \max_{\alpha}\;\; \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{m}\alpha_i\alpha_j y_i y_j\,\kappa(x_i,x_j) \quad \text{s.t.}\quad 0 \le \alpha_i \le C,\;\; i = 1,\dots,m, \]

where \( \kappa(x_i,x_j) = \langle\phi(x_i),\phi(x_j)\rangle \) and \( \langle w,\phi(x)\rangle = \sum_{i=1}^{m}\alpha_i y_i\,\kappa(x_i,x) \).
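As a usage note, the dual solution and the resulting function ⟨w, φ(x)⟩ are available from any standard SVM solver; for example with scikit-learn (illustrative data only):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=100))

clf = SVC(kernel="rbf", C=1.0).fit(X, y)
# clf.dual_coef_ holds alpha_i * y_i for the support vectors;
# clf.decision_function(X) evaluates <w, phi(x)> plus a bias term.
scores = clf.decision_function(X)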


Slack variable conversion

[Figure: plot of the slack variable conversion; horizontal axis (margin) from −2 to 2, vertical axis from 0 to 3.]


Learning the prior (1/3)

- The bound depends on the distance between prior and posterior.
- A better prior (closer to the posterior) would lead to a tighter bound.
- Learn the prior P with part of the data.
- Introduce the learnt prior in the bound.
- Compute the stochastic error with the remaining data.
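A minimal sketch of this recipe (all names hypothetical): learn a prior direction on one part of the sample and reserve the rest for the bound:

import numpy as np
from sklearn.svm import LinearSVC

def learn_prior_split(X, y, frac=0.5, seed=0):
    # Split the data: the first part learns a prior centre w_prior,
    # the second part is kept to compute the stochastic error in the bound.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    cut = int(frac * len(y))
    prior_idx, bound_idx = idx[:cut], idx[cut:]
    w_prior = LinearSVC().fit(X[prior_idx], y[prior_idx]).coef_.ravel()
    return w_prior, bound_idx

For unit-variance Gaussians the KL term in the bound then becomes ‖µw − ηw_prior‖²/2 (η a chosen scaling of the prior centre), in place of µ²/2.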


Tightness of the new bound

Problem              PAC-Bayes Bound   Prior-PAC-Bayes Bound
Wdbc                 0.346 ± 0.006     0.284 ± 0.021
Waveform             0.197 ± 0.002     0.143 ± 0.005
Ringnorm             0.211 ± 0.001     0.093 ± 0.004
Pima                 0.399 ± 0.007     0.374 ± 0.020
Landsat              0.035 ± 0.001     0.023 ± 0.002
Handwritten-digits   0.159 ± 0.001     0.084 ± 0.003
Spam                 0.243 ± 0.002     0.161 ± 0.006
Average              0.227             0.166


Model Selection with the new bound: results

Problem              PAC-SVM         Prior-PAC-Bayes   Ten-Fold XVal
Wdbc                 0.070 ± 0.024   0.070 ± 0.024     0.067 ± 0.024
Waveform             0.090 ± 0.008   0.091 ± 0.008     0.086 ± 0.008
Ringnorm             0.034 ± 0.003   0.024 ± 0.003     0.016 ± 0.003
Pima                 0.241 ± 0.031   0.236 ± 0.031     0.245 ± 0.040
Landsat              0.011 ± 0.002   0.007 ± 0.002     0.005 ± 0.002
Handwritten-digits   0.015 ± 0.002   0.016 ± 0.002     0.007 ± 0.002
Spam                 0.090 ± 0.009   0.088 ± 0.009     0.063 ± 0.008
Average              0.079           0.076             0.070

Test error achieved by the three settings.


Model selection with p-SVM

Problem       PAC-SVM         Prior-PAC-SVM   PriorSVM         η-PriorSVM
Wdbc          0.070 ± 0.024   0.070 ± 0.024   0.068 ± 0.0236   0.073 ± 0.023
Waveform      0.090 ± 0.008   0.091 ± 0.008   0.085 ± 0.0188   0.085 ± 0.007
Ringnorm      0.034 ± 0.003   0.024 ± 0.003   0.014 ± 0.0077   0.015 ± 0.003
Pima          0.241 ± 0.031   0.236 ± 0.031   0.237 ± 0.0323   0.242 ± 0.033
Landsat       0.011 ± 0.002   0.007 ± 0.002   0.006 ± 0.0019   0.006 ± 0.002
Hand-digits   0.015 ± 0.002   0.016 ± 0.002   0.011 ± 0.0028   0.011 ± 0.003
Spam          0.090 ± 0.009   0.088 ± 0.009   0.075 ± 0.0093   0.080 ± 0.009
Average       0.079           0.076           0.071            0.073


Tightness of the bound with p-SVM

Problem       PAC-SVM         Prior-PAC-SVM   PriorSVM         η-PriorSVM
Wdbc          0.346 ± 0.006   0.284 ± 0.021   0.308 ± 0.0252   0.271 ± 0.027
Waveform      0.197 ± 0.002   0.143 ± 0.005   0.156 ± 0.0054   0.136 ± 0.006
Ringnorm      0.211 ± 0.001   0.093 ± 0.004   0.054 ± 0.0038   0.049 ± 0.003
Pima          0.399 ± 0.007   0.374 ± 0.020   0.418 ± 0.0182   0.391 ± 0.021
Landsat       0.035 ± 0.001   0.023 ± 0.002   0.027 ± 0.0032   0.022 ± 0.002
Hand-digits   0.159 ± 0.001   0.084 ± 0.003   0.046 ± 0.0045   0.042 ± 0.004
Spam          0.243 ± 0.002   0.161 ± 0.006   0.171 ± 0.0065   0.145 ± 0.007
Average       0.227           0.166           0.169            0.151


Maximum entropy learning

Consider the function class, where X is a subset of the ℓ∞ unit ball,

\[ \mathcal{F} = \left\{ f_w : x \in X \mapsto \mathrm{sgn}\left(\sum_{i=1}^{N} w_i x_i\right) \;:\; \|w\|_1 \le 1 \right\}. \]

We want a posterior distribution Q(w) such that we can bound

\[ P_{(x,y)\sim D}(f_w(x) \neq y) \;\le\; 2\,e_Q(w)\;(=2\,Q_D(w)) \;=\; 2\,\mathrm{E}_{(x,y)\sim D,\,q\sim Q(w)}\big[I[q(x) \neq y]\big]. \]

Given a training sample S = {(x_1, y_1), ..., (x_m, y_m)}, we similarly define

\[ \hat{e}_Q(w)\;(=Q_S(w)) \;=\; \frac{1}{m}\sum_{i=1}^{m}\mathrm{E}_{q\sim Q(w)}\big[I[q(x_i) \neq y_i]\big]. \]


Posterior distribution Q(w)

The classifier q involves a random weight vector W ∈ R^N plus a random threshold Θ:

\[ q_{W,\Theta}(x) = \mathrm{sgn}\big(\langle W, x\rangle - \Theta\big). \]

The distribution Q(w) of W will be discrete, with

\[ W = \mathrm{sgn}(w_i)\,e_i \quad \text{with probability } |w_i|,\; i = 1,\dots,N, \]

where e_i is the i-th unit vector. The distribution of Θ is uniform on the interval [−1, 1].
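A minimal sketch of drawing from this posterior (assuming ‖w‖₁ = 1, so the coordinate probabilities sum to one):

import numpy as np

def sample_q(w, rng):
    # Draw one classifier q_{W,Theta} from Q(w):
    # coordinate i with probability |w_i|, threshold Theta uniform on [-1, 1].
    i = rng.choice(len(w), p=np.abs(w))
    theta = rng.uniform(-1.0, 1.0)
    return lambda x: np.sign(np.sign(w[i]) * x[i] - theta)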


Error expression

Proposition. With the above definitions we have, for w satisfying ‖w‖₁ = 1, that for any (x, y) ∈ X × {−1, +1},

\[ P_{q\sim Q(w)}(q(x) \neq y) = 0.5\,\big(1 - y\langle w, x\rangle\big). \]


Error expression proof

Proof.

\begin{align*}
P_{q\sim Q(w)}(q(x) \neq y) &= \sum_{i=1}^{N} |w_i|\, P_\Theta\big(\mathrm{sgn}(\mathrm{sgn}(w_i)\langle e_i, x\rangle - \Theta) \neq y\big) \\
&= \sum_{i=1}^{N} |w_i|\, P_\Theta\big(\mathrm{sgn}(\mathrm{sgn}(w_i)\,x_i - \Theta) \neq y\big) \\
&= 0.5 \sum_{i=1}^{N} |w_i|\,\big(1 - y\,\mathrm{sgn}(w_i)\,x_i\big) \\
&= 0.5\,\big(1 - y\langle w, x\rangle\big).
\end{align*}
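The identity is easy to check numerically; a sketch comparing a Monte Carlo estimate against 0.5(1 − y⟨w, x⟩):

import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=8); w /= np.abs(w).sum()   # ||w||_1 = 1
x = rng.uniform(-1, 1, size=8); y = 1.0        # x inside the l_inf unit ball

draws = 200_000
i = rng.choice(8, size=draws, p=np.abs(w))     # coordinate i with probability |w_i|
theta = rng.uniform(-1, 1, size=draws)         # Theta uniform on [-1, 1]
errs = np.sign(np.sign(w[i]) * x[i] - theta) != y
print(errs.mean(), 0.5 * (1 - y * np.dot(w, x)))  # the two should agree closely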


Generalisation error

Corollary.

\[ P_{(x,y)\sim D}\big(f_w(x) \neq y\big) \le 2\,e_Q(w). \]

Proof.

\[ P_{q\sim Q(w)}(q(x) \neq y) \ge 0.5 \iff f_w(x) \neq y. \]


Base result

Theorem. With probability at least 1 − δ over the draw of training sets of size m,

\[ \mathrm{KL}\big(\hat{e}_Q(w) \,\|\, e_Q(w)\big) \;\le\; \frac{\sum_{i=1}^{N} |w_i|\ln|w_i| + \ln(2N) + \ln((m+1)/\delta)}{m}. \]

Proof. Use a prior P uniform on the unit vectors ±e_i. The posterior is as described above, so KL(Q(w)‖P) equals ln(2N) minus the entropy of w.


Interpretation

- Suggests maximising the entropy as a means of minimising the bound.
- The problem is that the empirical error ê_Q(w) is too large:

\[ \hat{e}_Q(w) = \frac{1}{m}\sum_{i=1}^{m} 0.5\,\big(1 - y_i\langle w, x_i\rangle\big). \]

- It is a function of the margin, but just a linear function.


Boosting the bound

The trick to boost the power of the bound is to take T independent samples of the distribution Q(w) and vote for the classification:

\[ q_{\mathbf{W},\mathbf{\Theta}}(x) = \mathrm{sgn}\left(\sum_{t=1}^{T} \mathrm{sgn}\big(\langle W_t, x\rangle - \Theta_t\big)\right). \]

Now the empirical error becomes

\[ \hat{e}_{Q^T}(w) = \frac{0.5^T}{m}\sum_{i=1}^{m}\sum_{t=0}^{\lfloor T/2\rfloor} \binom{T}{t}\big(1 + y_i\langle w, x_i\rangle\big)^{t}\big(1 - y_i\langle w, x_i\rangle\big)^{T-t}, \]

giving a sigmoid-like loss as a function of the margin.
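A sketch of this voted empirical error, evaluated directly from the binomial expression above:

import numpy as np
from scipy.special import comb

def voted_empirical_error(w, X, y, T):
    # hat{e}_{Q^T}(w): probability that at most floor(T/2) of the T draws are
    # correct, i.e. that the majority vote errs, averaged over the sample.
    margin = y * (X @ w)                    # y_i <w, x_i>, in [-1, 1]
    ts = np.arange(T // 2 + 1)              # t = 0 .. floor(T/2)
    terms = (comb(T, ts)[None, :]
             * (1 + margin)[:, None] ** ts
             * (1 - margin)[:, None] ** (T - ts))
    return 0.5 ** T * terms.sum() / len(y)

With T = 1 this reduces to the linear loss 0.5(1 − y_i⟨w, x_i⟩) above; larger T sharpens it into the sigmoid-like shape.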


Full result

Theorem. With probability at least 1 − δ over the draw of training sets of size m,

\[ P_{(x,y)\sim D}\big(f_w(x) \neq y\big) \;\le\; 2\,\mathrm{KL}^{-1}\!\left(\hat{e}_{Q^T}(w),\; \frac{T\sum_{i=1}^{N}|w_i|\ln|w_i| + T\ln(2N) + \ln((m+1)/\delta)}{m}\right). \]

- Note the penalty factor of T applied to the KL term.
- The factor T behaves like the (inverse) margin in the usual bounds.


Algorithmics

The bound motivates the optimisation:

\min_{w, \rho, \xi}\; \sum_{j=1}^{N} |w_j| \ln|w_j| - C\rho + D \sum_{i=1}^{m} \xi_i

subject to: y_i \langle w, x_i \rangle \ge \rho - \xi_i,\; 1 \le i \le m; \quad \|w\|_1 \le 1; \quad \xi_i \ge 0,\; 1 \le i \le m.

This follows the SVM route of approximating the sigmoid-like loss by the (convex) hinge loss.
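
A small sketch evaluating this primal objective at a candidate (w, ρ), with the slacks at their optimal values ξ_i = max(0, ρ − y_i⟨w, x_i⟩); C and D are the trade-off parameters above:

```python
import numpy as np

def primal_objective(w, rho, X, y, C=1.0, D=1.0):
    """sum_j |w_j| ln|w_j| - C*rho + D*sum_i max(0, rho - y_i <w, x_i>),
    assuming ||w||_1 <= 1 (zero weights contribute 0 to the entropy term)."""
    a = np.abs(w)
    a = a[a > 0]
    entropy_term = np.sum(a * np.log(a))
    xi = np.maximum(0.0, rho - y * (X @ w))   # optimal slacks for fixed (w, rho)
    return entropy_term - C * rho + D * np.sum(xi)
```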

Dual optimisation

\max_{\alpha}\; L = -\sum_{j=1}^{N} \exp\!\left( \left| \sum_{i=1}^{m} \alpha_i y_i x_{ij} \right| - 1 - \lambda \right) - \lambda

subject to: \sum_{i=1}^{m} \alpha_i = C, \quad 0 \le \alpha_i \le D,\; 1 \le i \le m.

Similar to the SVM dual, but with an exponential function.
Surprisingly, it also gives dual sparsity.
Coordinate-wise descent works very well (cf. the SMO algorithm).
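
A sketch of the dual objective, together with a recovery of the primal weights from the stationarity condition of the Lagrangian, ln|w_j| = |s_j| − 1 − λ with s_j = Σ_i α_i y_i x_{ij} and sgn(w_j) = sgn(s_j); the recovery formula is our own reading of the KKT conditions, so treat it as an assumption:

```python
import numpy as np

def dual_objective(alpha, lam, X, y):
    """L = -sum_j exp(|sum_i alpha_i y_i x_ij| - 1 - lam) - lam,
    maximised subject to sum_i alpha_i = C, 0 <= alpha_i <= D."""
    s = X.T @ (alpha * y)                      # s_j = sum_i alpha_i y_i x_ij
    return -np.sum(np.exp(np.abs(s) - 1.0 - lam)) - lam

def primal_weights(alpha, lam, X, y):
    """Assumed recovery: |w_j| = exp(|s_j| - 1 - lam), sign(w_j) = sign(s_j),
    with lam the multiplier for the constraint ||w||_1 <= 1."""
    s = X.T @ (alpha * y)
    return np.sign(s) * np.exp(np.abs(s) - 1.0 - lam)
```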

Results: effect of varying T

[Figure: value of the bound on the Ionosphere data set plotted against T, for T from 0 to 40; the bound value ranges roughly from 0.9 to 1.15.]

Results

Bound and test errors:

Data        Bound   Error   SVM error
Ionosphere  0.63    0.28    0.24
Votes       0.78    0.35    0.35
Glass       0.69    0.46    0.47
Haberman    0.64    0.25    0.26
Credit      0.60    0.25    0.28

Gaussian Process Regression

A GP is a distribution over real-valued functions that is multivariate Gaussian when restricted to any finite subset of inputs.
It is characterised by a kernel that specifies the covariance function when marginalising on any finite subset.
If we have a finite set of input/output observations generated with additive Gaussian noise on the outputs, the posterior is also a Gaussian process.
The KL divergence between posterior and prior can be computed as (where K = RR' is a Cholesky decomposition of K):

2\,\mathrm{KL}(Q \| P) = \log\det\!\left( I + \frac{1}{\sigma^2} K \right) - \mathrm{tr}\!\left( \left( \sigma^2 I + K \right)^{-1} K \right) + \left\| R' \left( K + \sigma^2 I \right)^{-1} \mathbf{y} \right\|^2.
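
A direct numpy evaluation of this quantity (a sketch: the jitter added before the Cholesky factorisation is a numerical convenience, and we apply R' rather than R in the final term so that the quadratic form matches K = RR'):

```python
import numpy as np

def gp_kl(K, y, sigma2):
    """KL(Q||P) for GP regression, from
    2 KL = log det(I + K/sigma^2) - tr((sigma^2 I + K)^{-1} K)
           + || R'(K + sigma^2 I)^{-1} y ||^2   with K = R R'."""
    n = K.shape[0]
    R = np.linalg.cholesky(K + 1e-10 * np.eye(n))   # K = R R' (lower-triangular R)
    A = sigma2 * np.eye(n) + K
    _, logdet = np.linalg.slogdet(np.eye(n) + K / sigma2)
    trace_term = np.trace(np.linalg.solve(A, K))
    quad = R.T @ np.linalg.solve(A, y)
    return 0.5 * (logdet - trace_term + np.sum(quad ** 2))
```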

Applying PAC-Bayes theorem

This suggests we can use the PAC-Bayes theorem if we can create appropriate classifiers indexed by real-valued functions.
Consider, for some ε > 0, the classifiers

h^{\varepsilon}_{f}(x, y) = \begin{cases} 1 & \text{if } |y - f(x)| \le \varepsilon; \\ 0 & \text{otherwise.} \end{cases}

We can compute the expected value of h^{\varepsilon}_{f} under the posterior, with m(x) and v(x) the posterior mean and variance:

\mathbb{E}_{f \sim Q}\left[ h^{\varepsilon}_{f}(x, y) \right] = \frac{1}{2}\,\mathrm{erf}\!\left( \frac{y + \varepsilon - m(x)}{\sqrt{2 v(x)}} \right) - \frac{1}{2}\,\mathrm{erf}\!\left( \frac{y - \varepsilon - m(x)}{\sqrt{2 v(x)}} \right).
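
This expectation is immediate to evaluate with scipy (mean = m(x) and var = v(x) are the GP posterior mean and variance at the input x):

```python
import numpy as np
from scipy.special import erf

def expected_indicator(y, mean, var, eps):
    """E_{f~Q}[h^eps_f(x, y)]: posterior probability that |y - f(x)| <= eps."""
    s = np.sqrt(2.0 * var)
    return 0.5 * (erf((y + eps - mean) / s) - erf((y - eps - mean) / s))
```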

GP Result

Furthermore, we can lower bound the expected value of a point (x, y) under the posterior distribution: a Taylor expansion in ε, bounding the second derivative over τ ∈ [−ε, ε], gives

2\varepsilon\, \mathcal{N}(y \mid m(x), v(x)) \ge \mathbb{E}_{f \sim Q}\left[ h^{\varepsilon}_{f}(x, y) \right] - \frac{\varepsilon^2}{v(x)\sqrt{2e\pi}},

enabling an application of the PAC-Bayes theorem to give

\mathbb{E}\!\left[ \mathcal{N}(y \mid m(x), v(x)) + \frac{\varepsilon}{2 v(x) \sqrt{2e\pi}} \right] \ge \frac{1}{2\varepsilon}\, \mathrm{KL}^{-1}\!\left( \hat{E}(\varepsilon),\; \frac{D + \ln((m+1)/\delta)}{m} \right),

where \hat{E}(\varepsilon) is the empirical average of \mathbb{E}_{f \sim Q}[h^{\varepsilon}_{f}(x, y)] and D is the KL divergence between posterior and prior.

GP Experimental Results

The robot arm problem (R): 150 training points and 51 test points.
The Boston housing problem (H): 455 training points and 51 test points.
The forest fire problem (F): 450 training points and 67 test points.

Data  σ       ê       KL−1     etest    KL−1 (varGP)  etest (varGP)
R     0.0494  0.8903  0.4782   0.8419   –             –
H     0.1924  0.8699  0.4645   0.7155   0.8401        0.9416
F     1.0129  0.5694  0.4557   0.5533   –             –

GP Experimental Results

We can also plot the test accuracy and bound as a function of ε:

[Figure: Gaussian noise: plot of E_{(x,y)∼D}[1 − α(x)] against ε for varying noise level η; panels (a) η = 1, (b) η = 3, (c) η = 5.]

GP Experimental Results

With Laplace noise:

[Figure: Laplace noise: plot of E_{(x,y)∼D}[1 − α(x)] against ε for varying η; panels (a) η = 1, (b) η = 3, (c) η = 5.]

GP Experimental Results

Robot arm problem and Boston housing:

[Figure: Confidence levels for the robot arm problem; panels (a) robot arm, (b) Boston housing.]

Stochastic Differential Equation Models

Consider modelling a time-varying process with a (non-linear) stochastic differential equation:

dx = f(x, t)\, dt + \sqrt{\Sigma}\, dW.

Here f(x, t) is a non-linear drift term and dW is a Wiener process.
This is the limit of the discrete-time equation

\Delta x_k \equiv x_{k+1} - x_k = f(x_k)\, \Delta t + \sqrt{\Delta t\, \Sigma}\; \varepsilon_k,

where ε_k is zero-mean, unit-variance Gaussian noise.
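
A minimal Euler–Maruyama simulation of this discrete-time recursion; the Lorenz drift at the end anticipates the experiment later in the talk, with the standard (here merely illustrative) parameter values:

```python
import numpy as np

def euler_maruyama(f, x0, Sigma, dt, n_steps, rng=None):
    """Simulate dx = f(x, t) dt + sqrt(Sigma) dW via
    x_{k+1} = x_k + f(x_k, t_k) dt + sqrt(dt) sqrt(Sigma) eps_k."""
    rng = np.random.default_rng(0) if rng is None else rng
    d = len(x0)
    sqrt_Sigma = np.linalg.cholesky(Sigma)           # matrix square root
    path = np.empty((n_steps + 1, d))
    path[0] = x0
    for k in range(n_steps):
        eps = rng.standard_normal(d)                 # zero-mean, unit-variance noise
        path[k + 1] = (path[k] + f(path[k], k * dt) * dt
                       + np.sqrt(dt) * (sqrt_Sigma @ eps))
    return path

def lorenz(x, t, s=10.0, r=28.0, b=8.0 / 3.0):
    """Lorenz drift, used purely as an illustrative non-linear f."""
    return np.array([s * (x[1] - x[0]),
                     x[0] * (r - x[2]) - x[1],
                     x[0] * x[1] - b * x[2]])

path = euler_maruyama(lorenz, np.array([1.0, 1.0, 25.0]),
                      Sigma=np.eye(3), dt=0.005, n_steps=2000)
```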

Variational approximation

We use the Bayesian approach to data modelling with a noise model given by

p(y_n \mid x(t_n)) = \mathcal{N}(y_n \mid H x(t_n), R).

We consider a variational approximation of the posterior using a time-varying linear SDE:

dx = f_L(x, t)\, dt + \sqrt{\Sigma}\, dW, \qquad \text{where } f_L(x, t) = -A(t)\, x + b(t).

Girsanov change of measure

The measure for the drift f is denoted by P and the one for the drift f_L by Q.
The KL divergence in this infinite-dimensional setting is given by the Radon–Nikodym derivative of Q with respect to P:

\mathrm{KL}[Q \| P] = \int dQ\, \ln\frac{dQ}{dP} = \mathbb{E}_Q\!\left[ \ln\frac{dQ}{dP} \right],

which can be computed (by Girsanov's theorem) as

\frac{dQ}{dP} = \exp\left\{ -\int_{t_0}^{t_f} (f - f_L)^{\top} \Sigma^{-1/2}\, dW_t + \frac{1}{2} \int_{t_0}^{t_f} (f - f_L)^{\top} \Sigma^{-1} (f - f_L)\, dt \right\},

where W is a Wiener process with respect to Q.

KL divergence

Hence, the KL divergence is

\mathrm{KL}[Q \| P] = \frac{1}{2} \int_{t_0}^{t_f} \left\langle \left( f(x(t), t) - f_L(x(t), t) \right)^{\top} \Sigma^{-1} \left( f(x(t), t) - f_L(x(t), t) \right) \right\rangle_{q_t} dt,

where ⟨·⟩_{q_t} denotes the expectation with respect to the marginal density at time t of the measure Q.
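
The integrand can be estimated by Monte Carlo once the Gaussian marginal q_t = N(m(t), S(t)) of the approximating linear SDE is available (see the ODEs on the next slide); a sketch, with the drifts f and f_L passed in as callables:

```python
import numpy as np

def kl_rate(f, fL, m_t, S_t, Sigma_inv, t, rng, n_samples=1000):
    """Monte Carlo estimate of
    0.5 <(f(x,t) - fL(x,t))' Sigma^{-1} (f(x,t) - fL(x,t))>_{q_t}
    with q_t = N(m(t), S(t)); integrate over t (e.g. trapezoidal rule)
    to obtain KL[Q||P]."""
    x = rng.multivariate_normal(m_t, S_t, size=n_samples)
    diff = np.array([f(xi, t) - fL(xi, t) for xi in x])
    quad = np.einsum('ni,ij,nj->n', diff, Sigma_inv, diff)
    return 0.5 * np.mean(quad)
```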

Variational approximation

As the approximating SDE is linear, the marginal distribution q_t is Gaussian,

q_t(x) = \mathcal{N}(x \mid m(t), S(t)),

with the mean m(t) and covariance S(t) described by ordinary differential equations (ODEs):

\frac{dm}{dt} = -A m + b, \qquad \frac{dS}{dt} = -A S - S A^{\top} + \Sigma.
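
A forward-Euler sketch of integrating these moment ODEs (A(t) and b(t) are passed as callables; the step size and horizon are up to the caller):

```python
import numpy as np

def propagate_moments(A, b, Sigma, m0, S0, dt, n_steps):
    """dm/dt = -A(t) m + b(t);  dS/dt = -A(t) S - S A(t)' + Sigma."""
    m, S = m0.copy(), S0.copy()
    ms, Ss = [m.copy()], [S.copy()]
    for k in range(n_steps):
        t = k * dt
        At, bt = A(t), b(t)
        m = m + dt * (-At @ m + bt)
        S = S + dt * (-At @ S - S @ At.T + Sigma)
        ms.append(m.copy())
        Ss.append(S.copy())
    return np.array(ms), np.array(Ss)
```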

Algorithmics

Using Lagrangian methods we can derive an algorithm that finds the variational approximation by minimising the KL divergence between the posterior and the approximating distribution.
But the KL also appears in the PAC-Bayes bound – is it possible to define an appropriate loss over paths ω that captures the properties of interest?

Error estimation

For ω : [0, T] → R^D defining a trajectory ω(t) ∈ R^D, we define the classifier h_ω by

h_{\omega}(y, t) = \begin{cases} 1 & \text{if } \|y - H\omega(t)\| \le \varepsilon; \\ 0 & \text{otherwise,} \end{cases}

where the actual observations are linear functions of the state variable given by the operator H.
The prior and posterior distributions over functions are inherited from the distributions P and Q over paths ω.
Hence P = p_{sde} and Q = q, defined by the linear approximating SDE.

Generalisation analysis

For the PAC-Bayes analysis we must compute KL(Q‖P), \hat{e}_Q and e_Q. We have, as above,

\mathrm{KL}(Q \| P) = \int dq\, \ln\frac{dq}{dp_{sde}}.

If we now consider a fixed sample (y, t) we can estimate

\mathbb{E}_{\omega \sim Q}\left[ h_{\omega}(y, t) \right] = \int I\left[ \|Hx - y\| \le \varepsilon \right] dq_t(x).

For sufficiently small values of ε we can approximate this by

\frac{V_d\, \varepsilon^d}{(2\pi)^{d/2}\, |H S(t) H^{\top}|^{1/2}} \exp\!\left( -\frac{1}{2} \left( y - H m(t) \right)^{\top} \left( H S(t) H^{\top} \right)^{-1} \left( y - H m(t) \right) \right) = V_d\, \varepsilon^d\, \mathcal{N}\!\left( y \mid H m(t), H S(t) H^{\top} \right),

where V_d is the volume of the unit ball in R^d.
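
A sketch of this small-ball approximation using scipy (the closed form for V_d is the standard volume of the unit ball in R^d):

```python
import numpy as np
from scipy.special import gamma
from scipy.stats import multivariate_normal

def small_ball_probability(y, Hm, HSH, eps):
    """Approximate q_t(||Hx - y|| <= eps) by V_d eps^d N(y | Hm(t), HS(t)H'),
    valid for eps small relative to the covariance HS(t)H'."""
    d = len(y)
    V_d = np.pi ** (d / 2) / gamma(d / 2 + 1)    # volume of the unit ball in R^d
    return V_d * eps ** d * multivariate_normal.pdf(y, mean=Hm, cov=HSH)
```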

Error estimates

Note that e_Q is simply

e_Q = \mathbb{E}_{(y,t) \sim \mu}\, \mathbb{E}_{\omega \sim Q}\left[ h_{\omega}(y, t) \right] \propto \int \mathcal{N}\!\left( y \mid H m(t), H S(t) H^{\top} \right) d\mu(y, t),

while \hat{e}_Q is the empirical average of this quantity.
A tension arises in setting ε: if it is large, the approximation is inaccurate.
If \hat{e}_Q and e_Q are both small, the bound implied by KL(\hat{e}_Q \| e_Q) \le C becomes weak.

Refining the distributions

We overcome this weakness by taking K-fold product distributions and defining h_{(\omega_1, \dots, \omega_K)} as

h_{(\omega_1, \dots, \omega_K)}(y, t) = \begin{cases} 1 & \text{if there exists } 1 \le i \le K \text{ such that } \|y - H\omega_i(t)\| \le \varepsilon; \\ 0 & \text{otherwise.} \end{cases}

We now have

\mathbb{E}_{(\omega_1, \dots, \omega_K) \sim Q^K}\left[ h_{(\omega_1, \dots, \omega_K)}(y, t) \right] \approx 1 - \left( 1 - \int I\left[ \|Hx - y\| \le \varepsilon \right] dq_t(x) \right)^{K} \approx K V_d \varepsilon^d\, \mathcal{N}\!\left( y \mid H m(t), H S(t) H^{\top} \right).
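
A quick numeric check of the amplification step, 1 − (1 − p)^K ≈ Kp for small per-draw probability p (values hypothetical):

```python
# For p = 1e-3 and K = 50: 1 - (1 - p)**K = 0.0488..., K*p = 0.05.
p, K = 1e-3, 50
print(1 - (1 - p) ** K, K * p)
```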

Final result

Putting it all together gives the final bound:

\mathbb{E}_{(y,t) \sim \mu}\left[ \mathcal{N}\!\left( y \mid H m(t), H S(t) H^{\top} \right) \right] \ge \frac{1}{V_d \varepsilon^d K}\, \mathrm{KL}^{-1}\!\left( K V_d \varepsilon^d\, \hat{\mathbb{E}}\left[ \mathcal{N}\!\left( y \mid H m(t), H S(t) H^{\top} \right) \right],\; \frac{K \int_0^T E_{sde}(t)\, dt + \ln((m+1)/\delta)}{m} \right),

where

E_{sde}(t) = \frac{1}{2} \left\langle \left( f(x) - f_L(x, t) \right)^{\top} \Sigma^{-1} \left( f(x) - f_L(x, t) \right) \right\rangle_{q_t}.

Small scale experiment

We applied the analysis to the results of performing a variational Bayesian approximation to the Lorenz attractor in three dimensions. The quality of the fit with 49 examples was good.

[Figure: three-dimensional trajectory of the Lorenz attractor together with the fitted approximation.]

Small scale experiment

We chose V_d ε^d to optimise the bound – a fairly small ball, implying that our approximation should be reasonable.
We compared the bound with the left-hand side estimated on a random draw of 99 test points. The corresponding values are:

m    dt     ê_Q     A       e_Q     KL−1(·,·)/V
49   0.005  0.137   3.536   0.128   0.004

Conclusions

Overview of the theory and main result.
Application to bound the performance of an SVM.
Experiments show the new bound can be tighter ...
... and reliable for low-cost model selection.
Extended to maximum entropy classification.
Also considered lower bounding the accuracy of a posterior distribution for Gaussian processes (GPs).
Applied the theory to bound the performance of estimates made using approximate Bayesian inference for dynamical systems:
– prior determined by a non-linear stochastic differential equation (SDE);
– variational approximation results in a posterior given by an approximating linear SDE – hence a Gaussian process posterior.
