PAC-Bayes Analysis: Background and Applications
Transcript of "PAC-Bayes Analysis: Background and Applications"
John Shawe-Taylor, University College London
Chicago/TTI Workshop, June 2009
Including joint work with John Langford, Amiran Ambroladze, Emilio Parrado-Hernández, Cédric Archambeau, Matthew Higgs and Manfred Opper
Aims
Hope to give you:
- The PAC-Bayes framework
- The core result
- How to apply it to Support Vector Machines
- Application to maximum entropy classification
- Application to Gaussian Processes and dynamical systems modelling
1. Background to Approach
2. PAC-Bayes Analysis: Definitions; PAC-Bayes Theorem; Applications
3. Linear Classifiers: General Approach; Learning the prior
4. Maximum entropy classification: Generalisation; Optimisation
5. GPs and SDEs: Gaussian Process regression; Variational approximation; Generalisation
General perspectives
- The goal of different theories is to capture the key elements that enable an understanding and analysis of different phenomena
- There are several theories of machine learning, notably Bayesian and frequentist
- Different assumptions lead to different ranges of applicability and different ranges of results
- The Bayesian approach is able to make more detailed probabilistic predictions
- The frequentist approach makes only the i.i.d. assumption
Historical notes: frequentist approach
- Pioneered in Russia by Vapnik and Chervonenkis
- Introduced in the West by Valiant under the name 'probably approximately correct' (PAC)
- Typical results state that with probability at least 1 − δ (probably), any classifier from the hypothesis class that has low training error will have low generalisation error (approximately correct)
- Such a result has the status of a statistical test: the confidence is denoted by δ, the probability that the sample is misleading/unusual
- An SVM bound using the luckiness framework was given by Shawe-Taylor et al. (1998)
Historical notes: Bayesian approach
- The name derives from Bayes' theorem: we assume a prior distribution over functions or classifiers and then use Bayes' rule to update the prior based on the likelihood of the data for each function
- This gives the posterior distribution: the Bayesian classifies according to the expected classification under the posterior, the best strategy given that the prior is correct
- It can be used for model selection by evaluating the 'evidence' for a model (see for example David MacKay); this is related to the volume of version space consistent with the data
- Gaussian processes for regression are justified within this model
Version space: evidence
[Figure: the version space in weight space, bounded by the hyperplanes f(x_1, w) = 0, f(x_2, w) = 0, f(x_3, w) = 0 and f(x_4, w) = 0, containing consistent weight vectors w and w', with inscribed regions C_1, C_2, C_3.]
Evidence and generalisation
- A link between evidence and generalisation was hypothesised by MacKay
- The first formal link was obtained by Shawe-Taylor & Williamson (1997): PAC Analysis of a Bayes Estimator
- It bounds generalisation in terms of the volume of the sphere that can be inscribed in the version space, and includes a dependence on the dimensionality of the space
- It used the luckiness framework, a data-dependent style of frequentist bound also used to bound the generalisation of SVMs, for which no dependence on the dimensionality is needed, just on the margin
PAC-Bayes Theorem
- The first version was proved by McAllester in 1999
- An improved proof and bound are due to Seeger in 2002, with application to Gaussian processes
- Application to SVMs by Langford and Shawe-Taylor, also in 2002
- An excellent tutorial by Langford appeared in 2005 in JMLR
Definitions for main result: prior and posterior distributions
- The PAC-Bayes theorem involves a class of classifiers C together with a prior distribution P and a posterior distribution Q over C
- The distribution P must be chosen before learning, but the bound holds for all choices of Q; hence Q does not need to be the classical Bayesian posterior
- The bound holds for all (prior) choices of P, hence its validity is not affected by a poor choice of P, though the quality of the resulting bound may be; contrast this with standard Bayesian analysis, which only holds if the prior assumptions are correct
Definitions for main result: error measures
- Being a frequentist (PAC) style result, we assume an unknown distribution D over labelled examples (x, y), with inputs from the input space X
- D is used to generate the labelled training sample i.i.d., i.e. S ∼ D^m
- It is also used to measure the generalisation error c_D of a classifier c:

  c_D = Pr_{(x,y)∼D}(c(x) ≠ y)

- The empirical generalisation error is denoted c_S:

  c_S = (1/m) Σ_{(x,y)∈S} I[c(x) ≠ y],  where I[·] is the indicator function.
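A minimal sketch of the empirical error measure in code; the threshold classifier and sample below are illustrative stand-ins, not anything from the talk:

```python
# Empirical error c_S of a classifier on a sample S, as defined above.
# The classifier c and sample S here are toy illustrations.

def empirical_error(classifier, sample):
    """c_S = (1/m) * sum over (x, y) in S of I[c(x) != y]."""
    m = len(sample)
    return sum(1 for x, y in sample if classifier(x) != y) / m

# A toy threshold classifier and a sample of m = 5 labelled points.
c = lambda x: 1 if x >= 0.5 else -1
S = [(0.1, -1), (0.4, -1), (0.6, 1), (0.9, 1), (0.3, 1)]

print(empirical_error(c, S))  # one of five points misclassified -> 0.2
```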
Definitions for main result: assessing the posterior
- The result is concerned with bounding the performance of a probabilistic classifier that, given a test input x, chooses a classifier c ∼ Q (the posterior) and returns c(x)
- We are interested in the relation between two quantities:

  Q_D = E_{c∼Q}[c_D],  the true error rate of the probabilistic classifier, and

  Q_S = E_{c∼Q}[c_S],  its empirical error rate.
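The stochastic classifier and its empirical error rate Q_S can be sketched as follows; the finite two-element posterior and toy data are assumptions for illustration (for a finite Q the expectation is just a weighted average):

```python
import random

# Q_S = E_{c~Q}[c_S] for a finite posterior Q, represented as a list
# of (classifier, weight) pairs with weights summing to 1.

def empirical_error(classifier, sample):
    return sum(1 for x, y in sample if classifier(x) != y) / len(sample)

def gibbs_empirical_error(posterior, sample):
    """Q_S as an exact weighted average over a finite posterior."""
    return sum(q * empirical_error(c, sample) for c, q in posterior)

def gibbs_predict(posterior, x, rng=random):
    """The stochastic classifier itself: draw c ~ Q, return c(x)."""
    r, acc = rng.random(), 0.0
    for c, q in posterior:
        acc += q
        if r <= acc:
            return c(x)
    return posterior[-1][0](x)

# Toy posterior over two threshold classifiers and a toy sample.
Q = [(lambda x: 1 if x >= 0.4 else -1, 0.5),
     (lambda x: 1 if x >= 0.6 else -1, 0.5)]
S = [(0.1, -1), (0.5, 1), (0.9, 1)]
print(gibbs_empirical_error(Q, S))  # 0.5 * 0 + 0.5 * (1/3) = 1/6
```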
Definitions for main result: generalisation error
Note that this does not bound the error of the posterior average directly, but we have

  Pr_{(x,y)∼D}( sgn(E_{c∼Q}[c(x)]) ≠ y ) ≤ 2 Q_D,

since for any point x misclassified by sgn(E_{c∼Q}[c(x)]), the probability that a random c ∼ Q misclassifies it is at least 0.5.
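The factor of 2 can be checked numerically; the uniform posterior over three threshold classifiers and the four-point distribution below are made up for illustration:

```python
# Verify Pr(majority vote errs) <= 2 * Q_D on a toy example with a
# uniform posterior over three threshold classifiers and a uniform
# distribution D over four labelled points (all averages are exact).

classifiers = [lambda x, t=t: 1 if x >= t else -1 for t in (0.3, 0.5, 0.7)]
D = [(0.2, -1), (0.4, 1), (0.6, 1), (0.8, 1)]

def sgn(v):
    return 1 if v >= 0 else -1

# Q_D = E_{c~Q}[c_D] with uniform Q and uniform D.
Q_D = sum(c(x) != y for c in classifiers for x, y in D) / (len(classifiers) * len(D))

# Error of the posterior-average (majority vote) classifier.
vote_err = sum(sgn(sum(c(x) for c in classifiers)) != y for x, y in D) / len(D)

print(Q_D, vote_err, vote_err <= 2 * Q_D)
```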
PAC-Bayes Theorem
Fix an arbitrary D, an arbitrary prior P, and a confidence δ. Then with probability at least 1 − δ over samples S ∼ D^m, all posteriors Q satisfy

  KL(Q_S ‖ Q_D) ≤ ( KL(Q‖P) + ln((m + 1)/δ) ) / m,

where KL is the KL divergence between distributions,

  KL(Q‖P) = E_{c∼Q}[ ln(Q(c)/P(c)) ],

with Q_S and Q_D considered as distributions on {0, 1}.
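To turn the theorem into an explicit bound on Q_D, one fixes the right-hand side and inverts the binary KL divergence numerically; a sketch (the values of m, δ, KL(Q‖P) and Q_S below are made-up inputs):

```python
import math

def kl_binary(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q, p = min(max(q, eps), 1 - eps), min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(q_s, rhs):
    """Largest p >= q_s with kl_binary(q_s, p) <= rhs, by bisection."""
    lo, hi = q_s, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if kl_binary(q_s, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return lo

# Made-up inputs: Q_S = 0.05, KL(Q||P) = 5, m = 1000, delta = 0.05.
m, delta, kl_qp, q_s = 1000, 0.05, 5.0, 0.05
rhs = (kl_qp + math.log((m + 1) / delta)) / m
print(kl_inverse(q_s, rhs))  # upper bound on Q_D
```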
Finite Classes
If we take a finite class of functions h_1, ..., h_N with prior distribution p_1, ..., p_N and assume that the posterior is concentrated on a single function h_i, the generalisation is bounded by

  KL( êrr(h_i) ‖ err(h_i) ) ≤ ( −ln(p_i) + ln((m + 1)/δ) ) / m,

where êrr denotes the empirical error and err the true error. This is the standard result for finite classes, with the slight refinement that it involves the KL divergence between empirical and true error and the extra ln(m + 1) term on the right-hand side.
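For a concrete feel, the finite-class bound can be evaluated numerically; the class size N, uniform prior, empirical error, m and δ below are made-up inputs, and the KL inversion is again done by bisection:

```python
import math

# Evaluate KL(emp_err || true_err) <= (-ln p_i + ln((m+1)/delta)) / m
# and invert the binary KL to get an upper bound on the true error.

def kl_binary(q, p):
    eps = 1e-12
    q, p = min(max(q, eps), 1 - eps), min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(q_s, rhs):
    lo, hi = q_s, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if kl_binary(q_s, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return lo

# Uniform prior over N = 100 functions, so -ln(p_i) = ln(N).
N, m, delta, emp_err = 100, 2000, 0.05, 0.03
rhs = (math.log(N) + math.log((m + 1) / delta)) / m
print(kl_inverse(emp_err, rhs))  # upper bound on the true error of h_i
```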
Linear classifiers and SVMs
Focus on the application to linear functions (Langford & Shawe-Taylor):
- How the application is made
- Extensions to learning the prior
- Some results on UCI datasets to give an idea of what can be achieved
Linear classifiers
- We will choose the prior and posterior distributions to be Gaussians with unit variance
- The prior P will be centred at the origin
- The centre of the posterior Q(w, µ) will be specified by a unit vector w and a scale factor µ
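For this choice of Q(w, µ), the empirical stochastic error has a closed form: a linear classifier drawn from N(µw, I) misclassifies (x, y) with probability Φ̃(µ y⟨w, x⟩/‖x‖), where Φ̃ is the standard Gaussian tail. A sketch of computing Q_S(w, µ) from that observation (the unit vector w and sample below are illustrative):

```python
import math

def gauss_tail(t):
    """Phi-tilde(t) = P(Z > t) for a standard normal Z."""
    return 0.5 * math.erfc(t / math.sqrt(2))

def stochastic_error(w, mu, sample):
    """Q_S(w, mu) for the posterior N(mu*w, I) over linear classifiers
    c_v(x) = sgn(<v, x>): since y<v, x> ~ N(mu*y<w, x>, ||x||^2), each
    (x, y) contributes the Gaussian tail of its normalised margin."""
    total = 0.0
    for x, y in sample:
        norm_x = math.sqrt(sum(xi * xi for xi in x))
        margin = y * sum(wi * xi for wi, xi in zip(w, x)) / norm_x
        total += gauss_tail(mu * margin)
    return total / len(sample)

# Illustrative unit vector w and toy sample in 2 dimensions.
w = (1.0, 0.0)
S = [((1.0, 0.2), 1), ((-0.8, 0.1), -1), ((0.3, -0.9), 1)]
print(stochastic_error(w, 3.0, S))
```

As µ → 0 the posterior approaches the prior and Q_S tends to 0.5; increasing µ concentrates Q on the halfspace of w and drives Q_S toward the margin-error behaviour of the deterministic classifier.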
PAC-Bayes Bound for SVM (1/2)
[Figure: weight space W, with the prior centred at the origin and the posterior centred along the direction w.]
- Prior P is Gaussian N(0, 1), centred at the origin
- The posterior is in the direction w, at distance µ from the origin
- Posterior Q is Gaussian
PAC-Bayes Bound for SVM (2/2)
The performance of the linear classifiers may be bounded by

  KL( Q_S(w, µ) ‖ Q_D(w, µ) ) ≤ ( KL(Q(w, µ)‖P) + ln((m + 1)/δ) ) / m

- Q_D(w, µ) is the true performance of the stochastic classifier
- The SVM is a deterministic classifier that exactly corresponds to sgn(E_{c∼Q(w,µ)}[c(x)]), as the centre of the Gaussian gives the same classification as the halfspace with more weight
- Hence its error is bounded by 2 Q_D(w, µ), since, as observed above, if x is misclassified then at least half of the classifiers c ∼ Q err
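For these unit-variance Gaussians, KL(Q(w, µ)‖P) = µ²/2 (w is a unit vector), so the right-hand side is explicit. A sketch of evaluating the bound on Q_D(w, µ) and the resulting factor-of-2 bound on the SVM error; µ, m, δ and the empirical stochastic error are made-up inputs, and the KL inversion is by bisection:

```python
import math

def kl_binary(q, p):
    eps = 1e-12
    q, p = min(max(q, eps), 1 - eps), min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(q_s, rhs):
    lo, hi = q_s, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if kl_binary(q_s, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return lo

# Made-up inputs: mu = 5, m = 10000, delta = 0.05, Q_S(w, mu) = 0.02.
mu, m, delta, q_s = 5.0, 10000, 0.05, 0.02
rhs = (mu**2 / 2 + math.log((m + 1) / delta)) / m   # KL(Q||P) = mu^2 / 2
q_d_bound = kl_inverse(q_s, rhs)
print(q_d_bound, 2 * q_d_bound)  # bound on Q_D, and on the SVM error
```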
Background to ApproachPAC-Bayes Analysis
Linear ClassifiersMaximum entropy classification
GPs and SDEs
General ApproachLearning the prior
PAC-Bayes Bound for SVM (2/2)
Linear classifiers performance may be bounded by
KL( QS(w , µ) ‖QD(w , µ)) ≤KL(P‖Q(w , µ)) + ln m+1
δ
m
QS(w , µ) stochastic measure of the training errorQS(w , µ) = Em[F (µγ(x , y))]
γ(x , y) = (ywT φ(x))/(‖φ(x)‖‖w‖)F (t) = 1− 1√
2π
∫ t−∞ e−x2/2dx
John Shawe-Taylor University College London PAC-Bayes Analysis: Background and Applications
Background to ApproachPAC-Bayes Analysis
Linear ClassifiersMaximum entropy classification
GPs and SDEs
General ApproachLearning the prior
PAC-Bayes Bound for SVM (2/2)
Linear classifiers performance may be bounded by
KL( QS(w , µ) ‖QD(w , µ)) ≤KL(P‖Q(w , µ)) + ln m+1
δ
m
QS(w , µ) stochastic measure of the training errorQS(w , µ) = Em[F (µγ(x , y))]
γ(x , y) = (ywT φ(x))/(‖φ(x)‖‖w‖)F (t) = 1− 1√
2π
∫ t−∞ e−x2/2dx
John Shawe-Taylor University College London PAC-Bayes Analysis: Background and Applications
Background to ApproachPAC-Bayes Analysis
Linear ClassifiersMaximum entropy classification
GPs and SDEs
General ApproachLearning the prior
PAC-Bayes Bound for SVM (2/2)
Linear classifiers performance may be bounded by
KL( QS(w , µ) ‖QD(w , µ)) ≤KL(P‖Q(w , µ)) + ln m+1
δ
m
QS(w , µ) stochastic measure of the training errorQS(w , µ) = Em[F (µγ(x , y))]
γ(x , y) = (ywT φ(x))/(‖φ(x)‖‖w‖)F (t) = 1− 1√
2π
∫ t−∞ e−x2/2dx
John Shawe-Taylor University College London PAC-Bayes Analysis: Background and Applications
Background to ApproachPAC-Bayes Analysis
Linear ClassifiersMaximum entropy classification
GPs and SDEs
General ApproachLearning the prior
PAC-Bayes Bound for SVM (2/2)
Linear classifiers performance may be bounded by
KL( QS(w , µ) ‖QD(w , µ)) ≤KL(P‖Q(w , µ)) + ln m+1
δ
m
QS(w , µ) stochastic measure of the training errorQS(w , µ) = Em[F (µγ(x , y))]
γ(x , y) = (ywT φ(x))/(‖φ(x)‖‖w‖)F (t) = 1− 1√
2π
∫ t−∞ e−x2/2dx
John Shawe-Taylor University College London PAC-Bayes Analysis: Background and Applications
Background to ApproachPAC-Bayes Analysis
Linear ClassifiersMaximum entropy classification
GPs and SDEs
General ApproachLearning the prior
PAC-Bayes Bound for SVM (2/2)
Linear classifiers performance may be bounded by
KL( QS(w , µ) ‖QD(w , µ)) ≤KL(P‖Q(w , µ)) + ln m+1
δ
m
QS(w , µ) stochastic measure of the training errorQS(w , µ) = Em[F (µγ(x , y))]
γ(x , y) = (ywT φ(x))/(‖φ(x)‖‖w‖)F (t) = 1− 1√
2π
∫ t−∞ e−x2/2dx
John Shawe-Taylor University College London PAC-Bayes Analysis: Background and Applications
Background to ApproachPAC-Bayes Analysis
Linear ClassifiersMaximum entropy classification
GPs and SDEs
General ApproachLearning the prior
PAC-Bayes Bound for SVM (2/2)
Linear classifiers performance may be bounded by
KL(QS(w , µ)‖QD(w , µ)) ≤KL(P‖Q(w , µ)) + ln m+1
δ
m
Prior P ≡ Gaussian centered on the originPosterior Q ≡ Gaussian along w at a distance µ from theoriginKL(P‖Q) = µ2/2
John Shawe-Taylor University College London PAC-Bayes Analysis: Background and Applications
Background to ApproachPAC-Bayes Analysis
Linear ClassifiersMaximum entropy classification
GPs and SDEs
General ApproachLearning the prior
PAC-Bayes Bound for SVM (2/2)
Linear classifiers performance may be bounded by
KL(QS(w , µ)‖QD(w , µ)) ≤KL(P‖Q(w , µ)) + ln m+1
δ
m
Prior P ≡ Gaussian centered on the originPosterior Q ≡ Gaussian along w at a distance µ from theoriginKL(P‖Q) = µ2/2
John Shawe-Taylor University College London PAC-Bayes Analysis: Background and Applications
Background to ApproachPAC-Bayes Analysis
Linear ClassifiersMaximum entropy classification
GPs and SDEs
General ApproachLearning the prior
PAC-Bayes Bound for SVM (2/2)
Linear classifiers performance may be bounded by
KL(QS(w , µ)‖QD(w , µ)) ≤KL(P‖Q(w , µ)) + ln m+1
δ
m
Prior P ≡ Gaussian centered on the originPosterior Q ≡ Gaussian along w at a distance µ from theoriginKL(P‖Q) = µ2/2
John Shawe-Taylor University College London PAC-Bayes Analysis: Background and Applications
Background to ApproachPAC-Bayes Analysis
Linear ClassifiersMaximum entropy classification
GPs and SDEs
General ApproachLearning the prior
PAC-Bayes Bound for SVM (2/2)
Linear classifiers performance may be bounded by
KL(QS(w , µ)‖QD(w , µ)) ≤KL(P‖Q(w , µ)) + ln m+1
δ
m
Prior P ≡ Gaussian centered on the originPosterior Q ≡ Gaussian along w at a distance µ from theoriginKL(P‖Q) = µ2/2
John Shawe-Taylor University College London PAC-Bayes Analysis: Background and Applications
Background to ApproachPAC-Bayes Analysis
Linear ClassifiersMaximum entropy classification
GPs and SDEs
General ApproachLearning the prior
PAC-Bayes Bound for SVM (2/2)
Linear classifiers performance may be bounded by
KL(QS(w , µ)‖QD(w , µ)) ≤KL(P‖Q(w , µ)) + ln m+1
δ
m
δ is the confidenceThe bound holds with probability 1− δ over the randomi.i.d. selection of the training data.
John Shawe-Taylor University College London PAC-Bayes Analysis: Background and Applications
Background to ApproachPAC-Bayes Analysis
Linear ClassifiersMaximum entropy classification
GPs and SDEs
General ApproachLearning the prior
PAC-Bayes Bound for SVM (2/2)
Linear classifiers performance may be bounded by
KL(QS(w , µ)‖QD(w , µ)) ≤KL(P‖Q(w , µ)) + ln m+1
δ
m
δ is the confidenceThe bound holds with probability 1− δ over the randomi.i.d. selection of the training data.
John Shawe-Taylor University College London PAC-Bayes Analysis: Background and Applications
Background to ApproachPAC-Bayes Analysis
Linear ClassifiersMaximum entropy classification
GPs and SDEs
General ApproachLearning the prior
PAC-Bayes Bound for SVM (2/2)
Linear classifiers performance may be bounded by
KL(QS(w , µ)‖QD(w , µ)) ≤KL(P‖Q(w , µ)) + ln m+1
δ
m
δ is the confidenceThe bound holds with probability 1− δ over the randomi.i.d. selection of the training data.
John Shawe-Taylor University College London PAC-Bayes Analysis: Background and Applications
Form of the SVM bound

Note that the bound holds for all posterior distributions, so we can choose µ to optimise the bound. If we define the inverse of the KL by

KL⁻¹(q, A) = max{ p : KL(q ‖ p) ≤ A }

then with probability at least 1 − δ

Pr( sgn⟨w, φ(x)⟩ ≠ y ) ≤ 2 min_µ KL⁻¹( E_m[F(µ γ(x, y))], (µ²/2 + ln((m + 1)/δ)) / m )
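As a concrete illustration (not from the slides), the inverse KL can be computed by binary search, since KL(q‖p) is increasing in p for p ≥ q. A minimal Python sketch, assembling the SVM bound from the Gaussian tail F and the normalised margins; the function names are ours:

```python
import math

def kl(q, p):
    """Binary KL divergence KL(q||p) between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(q, A):
    """KL^{-1}(q, A) = max{p : KL(q||p) <= A}, by binary search on p in [q, 1]."""
    lo, hi = q, 1.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if kl(q, mid) <= A:
            lo = mid
        else:
            hi = mid
    return lo

def F(t):
    """Gaussian upper-tail probability 1 - Phi(t)."""
    return 0.5 * (1.0 - math.erf(t / math.sqrt(2.0)))

def svm_bound(margins, mu, m, delta):
    """2 KL^{-1}(Q_S, (mu^2/2 + ln((m+1)/delta))/m) for normalised margins gamma(x, y)."""
    q_s = sum(F(mu * g) for g in margins) / len(margins)
    complexity = (mu * mu / 2.0 + math.log((m + 1) / delta)) / m
    return 2.0 * kl_inverse(q_s, complexity)
```

Minimising `svm_bound` over a grid of µ values gives the kind of bound value reported in the tables that follow.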
Gives SVM Optimisation

Primal form:

min_{w, ξ} [ (1/2)‖w‖² + C Σ_{i=1}^m ξ_i ]
s.t. y_i wᵀφ(x_i) ≥ 1 − ξ_i, i = 1, …, m
     ξ_i ≥ 0, i = 1, …, m

Dual form:

max_α [ Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m α_i α_j y_i y_j κ(x_i, x_j) ]
s.t. 0 ≤ α_i ≤ C, i = 1, …, m

where κ(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩ and ⟨w, φ(x)⟩ = Σ_{i=1}^m α_i y_i κ(x_i, x).
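Because this dual has only box constraints (there is no bias term, hence no equality constraint), even plain projected gradient ascent works. A toy sketch in Python with a linear kernel; the data and step size are illustrative choices of ours, not from the slides:

```python
# Projected gradient ascent on the box-constrained SVM dual:
#   max_a  sum_i a_i - 0.5 * sum_ij a_i a_j y_i y_j k(x_i, x_j),  0 <= a_i <= C
X = [[1.0], [2.0], [-1.0], [-2.0]]           # toy 1-D inputs
y = [1, 1, -1, -1]
C = 10.0
m = len(X)

def kernel(u, v):                            # linear kernel <u, v>
    return sum(a * b for a, b in zip(u, v))

K = [[kernel(X[i], X[j]) for j in range(m)] for i in range(m)]
alpha = [0.0] * m
eta = 0.05                                   # step size
for _ in range(2000):
    for i in range(m):
        grad = 1.0 - y[i] * sum(alpha[j] * y[j] * K[i][j] for j in range(m))
        alpha[i] = min(C, max(0.0, alpha[i] + eta * grad))  # project onto the box

def decision(x):                             # <w, phi(x)> = sum_i alpha_i y_i k(x_i, x)
    return sum(alpha[i] * y[i] * kernel(X[i], x) for i in range(m))
```

On this separable toy set the iterates converge to the max-margin solution, with `decision` giving functional margin 1 at the closest points.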
Slack variable conversion

[Figure: plot of the conversion from margin (horizontal axis, −2 to 2) to loss/slack value (vertical axis, 0 to 3).]
Learning the prior (1/3)

The bound depends on the distance between prior and posterior
A better prior (closer to the posterior) would lead to a tighter bound
Learn the prior P with part of the data
Introduce the learnt prior in the bound
Compute the stochastic error with the remaining data
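For the spherical unit-variance Gaussians used here, the KL term is simply half the squared distance between the two centres, so shifting the prior to η w_P along a direction learnt on held-out data shrinks it whenever w_P roughly aligns with the posterior direction. A small illustrative sketch; the split, vectors, and scalings are made up for the example:

```python
def kl_gaussians(post_centre, prior_centre):
    """KL between unit-variance Gaussians N(post_centre, I) and N(prior_centre, I):
    half the squared distance between the centres."""
    return 0.5 * sum((p - q) ** 2 for p, q in zip(post_centre, prior_centre))

# Posterior centre mu * w learnt on the full sample (unit w, mu = 5 here).
w = [0.6, 0.8]
mu = 5.0
posterior = [mu * wi for wi in w]

# Origin-centred prior (the basic bound) versus a prior learnt on a subset,
# pointing in a similar (unit) direction w_p, scaled by eta.
w_p = [0.55, 0.835]
eta = 4.0
kl_origin = kl_gaussians(posterior, [0.0, 0.0])           # = mu^2 / 2
kl_learnt = kl_gaussians(posterior, [eta * wi for wi in w_p])
```

The learnt-prior KL replaces µ²/2 in the bound; note that the stochastic training error must then be computed on the remaining data only.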
Tightness of the new bound

Problem              PAC-Bayes Bound   Prior-PAC-Bayes Bound
Wdbc                 0.346 ± 0.006     0.284 ± 0.021
Waveform             0.197 ± 0.002     0.143 ± 0.005
Ringnorm             0.211 ± 0.001     0.093 ± 0.004
Pima                 0.399 ± 0.007     0.374 ± 0.020
Landsat              0.035 ± 0.001     0.023 ± 0.002
Handwritten-digits   0.159 ± 0.001     0.084 ± 0.003
Spam                 0.243 ± 0.002     0.161 ± 0.006
Average              0.227             0.166
Model Selection with the new bound: results

Problem              PAC-SVM           Prior-PAC-Bayes   Ten Fold XVal
Wdbc                 0.070 ± 0.024     0.070 ± 0.024     0.067 ± 0.024
Waveform             0.090 ± 0.008     0.091 ± 0.008     0.086 ± 0.008
Ringnorm             0.034 ± 0.003     0.024 ± 0.003     0.016 ± 0.003
Pima                 0.241 ± 0.031     0.236 ± 0.031     0.245 ± 0.040
Landsat              0.011 ± 0.002     0.007 ± 0.002     0.005 ± 0.002
Handwritten-digits   0.015 ± 0.002     0.016 ± 0.002     0.007 ± 0.002
Spam                 0.090 ± 0.009     0.088 ± 0.009     0.063 ± 0.008
Average              0.079             0.076             0.070

Test error achieved by the three settings.
Model selection with p-SVM

Problem        PAC-SVM          Prior-PAC-SVM    PriorSVM          η-PriorSVM
Wdbc           0.070 ± 0.024    0.070 ± 0.024    0.068 ± 0.0236    0.073 ± 0.023
Waveform       0.090 ± 0.008    0.091 ± 0.008    0.085 ± 0.0188    0.085 ± 0.007
Ringnorm       0.034 ± 0.003    0.024 ± 0.003    0.014 ± 0.0077    0.015 ± 0.003
Pima           0.241 ± 0.031    0.236 ± 0.031    0.237 ± 0.0323    0.242 ± 0.033
Landsat        0.011 ± 0.002    0.007 ± 0.002    0.006 ± 0.0019    0.006 ± 0.002
Hand-digits    0.015 ± 0.002    0.016 ± 0.002    0.011 ± 0.0028    0.011 ± 0.003
Spam           0.090 ± 0.009    0.088 ± 0.009    0.075 ± 0.0093    0.080 ± 0.009
Average        0.079            0.076            0.071             0.073
Tightness of the bound with p-SVM

Problem        PAC-SVM          Prior-PAC-SVM    PriorSVM          η-PriorSVM
Wdbc           0.346 ± 0.006    0.284 ± 0.021    0.308 ± 0.0252    0.271 ± 0.027
Waveform       0.197 ± 0.002    0.143 ± 0.005    0.156 ± 0.0054    0.136 ± 0.006
Ringnorm       0.211 ± 0.001    0.093 ± 0.004    0.054 ± 0.0038    0.049 ± 0.003
Pima           0.399 ± 0.007    0.374 ± 0.020    0.418 ± 0.0182    0.391 ± 0.021
Landsat        0.035 ± 0.001    0.023 ± 0.002    0.027 ± 0.0032    0.022 ± 0.002
Hand-digits    0.159 ± 0.001    0.084 ± 0.003    0.046 ± 0.0045    0.042 ± 0.004
Spam           0.243 ± 0.002    0.161 ± 0.006    0.171 ± 0.0065    0.145 ± 0.007
Average        0.227            0.166            0.169             0.151
Maximum entropy learning

Consider the function class, for X a subset of the ℓ∞ unit ball,

F = { f_w : x ∈ X ↦ sgn( Σ_{i=1}^N w_i x_i ) : ‖w‖₁ ≤ 1 },

We want a posterior distribution Q(w) such that we can bound

P_{(x,y)∼D}( f_w(x) ≠ y ) ≤ 2 e_Q(w) (= 2 Q_D(w)) = 2 E_{(x,y)∼D, q∼Q(w)}[ I[q(x) ≠ y] ].

Given a training sample S = {(x_1, y_1), …, (x_m, y_m)}, we similarly define

ê_Q(w) (= Q_S(w)) = (1/m) Σ_{i=1}^m E_{q∼Q(w)}[ I[q(x_i) ≠ y_i] ].
Posterior distribution Q(w)

The classifier q involves a random weight vector W ∈ R^N plus a random threshold Θ:

q_{W,Θ}(x) = sgn( ⟨W, x⟩ − Θ ).

The distribution Q(w) of W is discrete, with

W = sgn(w_i) e_i with probability |w_i|, i = 1, …, N,

where e_i is the i-th unit vector. The distribution of Θ is uniform on the interval [−1, 1].
Error expression

Proposition
With the above definitions, we have for w satisfying ‖w‖₁ = 1, and for any (x, y) ∈ X × {−1, +1},

P_{q∼Q(w)}( q(x) ≠ y ) = 0.5 (1 − y⟨w, x⟩).
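The proposition is easy to sanity-check by simulation. A small Monte Carlo sketch in Python; the vectors are arbitrary illustrative choices with ‖w‖₁ = 1 and x in the ℓ∞ unit ball:

```python
import random

random.seed(0)
w = [0.5, -0.3, 0.2]        # ||w||_1 = 1
x = [0.8, 0.5, -0.4]        # x in the l_inf unit ball
y = 1

def sample_error(n):
    """Empirical P_{q ~ Q(w)}(q(x) != y) for the stochastic classifier q_{W,Theta}."""
    errors = 0
    for _ in range(n):
        # Pick coordinate i with probability |w_i|, set W = sgn(w_i) e_i.
        r, i = random.random(), 0
        while r > abs(w[i]):
            r -= abs(w[i])
            i += 1
        wx = (1 if w[i] > 0 else -1) * x[i]    # <W, x>
        theta = random.uniform(-1.0, 1.0)      # threshold Theta ~ U[-1, 1]
        q = 1 if wx - theta >= 0 else -1
        errors += (q != y)
    return errors / n

exact = 0.5 * (1 - y * sum(wi * xi for wi, xi in zip(w, x)))   # 0.5 (1 - y <w, x>)
```

With these numbers the closed form gives 0.415, and the empirical frequency agrees to Monte Carlo accuracy.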
Error expression proof

Proof.

P_{q∼Q(w)}( q(x) ≠ y ) = Σ_{i=1}^N |w_i| P_Θ( sgn( sgn(w_i)⟨e_i, x⟩ − Θ ) ≠ y )
                        = Σ_{i=1}^N |w_i| P_Θ( sgn( sgn(w_i) x_i − Θ ) ≠ y )
                        = 0.5 Σ_{i=1}^N |w_i| (1 − y sgn(w_i) x_i)
                        = 0.5 (1 − y⟨w, x⟩),
Generalisation error

Corollary
P_{(x,y)∼D}( f_w(x) ≠ y ) ≤ 2 e_Q(w).

Proof.
P_{q∼Q(w)}( q(x) ≠ y ) ≥ 0.5 ⇔ f_w(x) ≠ y.
Base result

Theorem
With probability at least 1 − δ over the draw of training sets of size m,

KL( ê_Q(w) ‖ e_Q(w) ) ≤ [ Σ_{i=1}^N |w_i| ln|w_i| + ln(2N) + ln((m + 1)/δ) ] / m

Proof.
Use the prior P uniform on the unit vectors ±e_i. The posterior is as described above, so KL(Q(w) ‖ P) equals ln(2N) minus the entropy of w.
Interpretation

Suggests maximising the entropy as a means of minimising the bound.
Problem: the empirical error ê_Q(w) is too large:

ê_Q(w) = (1/m) Σ_{i=1}^m 0.5 (1 − y_i⟨w, x_i⟩)

A function of the margin, but just a linear function.
Boosting the bound

The trick to boost the power of the bound is to take T independent samples (W_t, Θ_t) from the distribution Q(w) and vote for the classification:

q_{W,Θ}(x) = sgn( Σ_{t=1}^T sgn( ⟨W_t, x⟩ − Θ_t ) ),

Now the empirical error becomes

ê_Q(w) = (0.5^T / m) Σ_{i=1}^m Σ_{t=0}^{⌊T/2⌋} (T choose t) (1 + y_i⟨w, x_i⟩)^t (1 − y_i⟨w, x_i⟩)^{T−t},

giving a sigmoid-like loss as a function of the margin.
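Each sampled classifier is correct with probability (1 + y⟨w, x⟩)/2, so the voted loss per point is a binomial tail. A short Python check of the sigmoid shape (assuming T odd so the vote cannot tie):

```python
from math import comb

def voted_loss(gamma, T):
    """P(majority of T sampled classifiers errs) at margin gamma = y <w, x>:
    the binomial tail 0.5^T * sum_{t <= floor(T/2)} C(T,t) (1+gamma)^t (1-gamma)^(T-t)."""
    return 0.5 ** T * sum(
        comb(T, t) * (1 + gamma) ** t * (1 - gamma) ** (T - t)
        for t in range(T // 2 + 1)
    )
```

For T = 1 this recovers the linear loss ½(1 − γ) above; as T grows it steepens into a sigmoid around γ = 0, with value ½ at zero margin.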
Full result

Theorem
With probability at least 1 − δ over the draw of training sets of size m,

P_{(x,y)∼D}( f_w(x) ≠ y ) ≤ 2 KL⁻¹( ê_{Q_T}(w), [ T Σ_{i=1}^N |w_i| ln|w_i| + T ln(2N) + ln((m + 1)/δ) ] / m ),

Note the penalty factor of T applied to the KL
T behaves like the (inverse) margin in the usual bounds
Algorithmics

The bound motivates the optimisation:

min_{w, ρ, ξ} Σ_{j=1}^N |w_j| ln|w_j| − Cρ + D Σ_{i=1}^m ξ_i
subject to: y_i⟨w, x_i⟩ ≥ ρ − ξ_i, 1 ≤ i ≤ m,
            ‖w‖₁ ≤ 1, ξ_i ≥ 0, 1 ≤ i ≤ m.

This follows the SVM route of approximating the sigmoid-like loss by the (convex) hinge loss
Dual optimisation

max_α L = − Σ_{j=1}^N exp( | Σ_{i=1}^m α_i y_i x_ij | − 1 − λ ) − λ

subject to: Σ_{i=1}^m α_i = C, 0 ≤ α_i ≤ D, 1 ≤ i ≤ m.

Similar to the SVM but with an exponential function
Surprisingly, this also gives dual sparsity
Coordinate-wise descent works very well (cf. the SMO algorithm)
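A speculative sketch of SMO-style pairwise coordinate ascent on this dual. We eliminate λ via its stationarity condition (our own derivation, not from the slides): at the optimum λ = ln Σ_j exp(|s_j| − 1) with s_j = Σ_i α_i y_i x_ij, so maximising L reduces to minimising logsumexp_j(|s_j|), which is convex, while pairwise updates preserve the sum constraint. The data and constants are illustrative:

```python
import math
import random

random.seed(1)
X = [[0.9, -0.2], [0.1, 0.8], [-0.7, 0.3], [-0.2, -0.9]]   # toy data (illustrative)
y = [1, 1, -1, -1]
m, N = len(X), len(X[0])
C_sum, D_box = 1.0, 0.6                                    # sum and box constraints

def objective(alpha):
    # Reduced dual: -logsumexp_j(|s_j|), concave in alpha.
    s = [sum(alpha[i] * y[i] * X[i][j] for i in range(m)) for j in range(N)]
    return -math.log(sum(math.exp(abs(sj)) for sj in s))

alpha = [C_sum / m] * m                                     # feasible start
start = objective(alpha)
for _ in range(300):
    i, j = random.sample(range(m), 2)
    # Move t along alpha_i += t, alpha_j -= t, keeping both inside [0, D_box];
    # this preserves sum_i alpha_i = C exactly.
    lo = max(-alpha[i], alpha[j] - D_box)
    hi = min(D_box - alpha[i], alpha[j])
    best_t, best_v = 0.0, objective(alpha)
    for k in range(21):                                     # crude grid line search
        t = lo + (hi - lo) * k / 20
        alpha[i] += t; alpha[j] -= t
        v = objective(alpha)
        alpha[i] -= t; alpha[j] += t
        if v > best_v:
            best_t, best_v = t, v
    alpha[i] += best_t; alpha[j] -= best_t
```

Only improving steps are accepted, so the objective is non-decreasing and every iterate stays feasible; a practical implementation would of course use an analytic or ternary line search rather than a grid.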
Results: effect of varying T

[Figure: bound value (vertical axis, roughly 0.9 to 1.15) plotted against the value of T (0 to 40) for the Ionosphere data set.]
Results

Bound and test errors:

Data         Bound   Error   SVM error
Ionosphere   0.63    0.28    0.24
Votes        0.78    0.35    0.35
Glass        0.69    0.46    0.47
Haberman     0.64    0.25    0.26
Credit       0.60    0.25    0.28
Gaussian Process Regression

A GP is a distribution over real-valued functions that is multivariate Gaussian when restricted to any finite subset of inputs
Characterised by a kernel that specifies the covariance function when marginalising on any finite subset
If we have a finite set of input/output observations generated with additive Gaussian noise on the outputs, the posterior is also a Gaussian process
The KL divergence between prior and posterior can be computed as (K = RR′ is a Cholesky decomposition of K):

2 KL(Q ‖ P) = log det( I + (1/σ²) K ) − tr( (σ²I + K)⁻¹ K ) + ‖ R′ (K + σ²I)⁻¹ y ‖²
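In the one-dimensional case (a single observation, so K is the scalar k and R = √k) the formula can be checked against the closed-form KL between two scalar Gaussians. A Python sketch with purely illustrative numbers:

```python
import math

k, sigma2, y = 2.0, 0.5, 1.3     # scalar kernel value, noise variance, observation

# Slide formula, specialised to 1x1 matrices (R = sqrt(k)):
two_kl_formula = (
    math.log(1 + k / sigma2)
    - k / (sigma2 + k)
    + (math.sqrt(k) * y / (k + sigma2)) ** 2
)

# Direct 2*KL between posterior N(mu_post, v_post) and prior N(0, k),
# using the standard univariate Gaussian KL expression:
mu_post = k * y / (k + sigma2)               # posterior mean
v_post = k * sigma2 / (k + sigma2)           # posterior variance
two_kl_direct = mu_post ** 2 / k + v_post / k - 1 + math.log(k / v_post)
```

The two expressions agree to machine precision, which is a useful unit test before trusting a matrix implementation.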
Background to ApproachPAC-Bayes Analysis
Linear ClassifiersMaximum entropy classification
GPs and SDEs
Gaussian Process regressionVariational approximationGeneralisation
Applying PAC-Bayes theorem
- This suggests we can use the PAC-Bayes theorem if we can create appropriate classifiers indexed by real-valued functions.
- Consider, for some ε > 0, the classifiers:

\[
h^{\varepsilon}_{f}(x,y) = \begin{cases} 1 & \text{if } |y - f(x)| \le \varepsilon, \\ 0 & \text{otherwise.} \end{cases}
\]

- We can compute the expected value of h^ε_f under the posterior, with posterior mean m(x) and variance v(x):

\[
\mathbb{E}_{f\sim Q}\big[h^{\varepsilon}_{f}(x,y)\big] = \frac{1}{2}\,\mathrm{erf}\!\Big(\frac{y+\varepsilon-m(x)}{\sqrt{2v(x)}}\Big) - \frac{1}{2}\,\mathrm{erf}\!\Big(\frac{y-\varepsilon-m(x)}{\sqrt{2v(x)}}\Big).
\]
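The erf expression is just the posterior probability that f(x) lands within ε of y, so it can be checked against a Monte Carlo estimate. A quick sketch (the specific values of y, ε, m, v are illustrative):

```python
import math
import numpy as np

def expected_h(y, eps, m, v):
    """E_{f~Q}[h^eps_f(x, y)] for a Gaussian posterior f(x) ~ N(m, v):
    the probability that |y - f(x)| <= eps, via the erf formula."""
    s = math.sqrt(2.0 * v)
    return 0.5 * math.erf((y + eps - m) / s) - 0.5 * math.erf((y - eps - m) / s)

rng = np.random.default_rng(0)
y, eps, m, v = 0.3, 0.2, 0.1, 0.25
exact = expected_h(y, eps, m, v)

# Monte Carlo check: sample f(x) from the posterior and count near-misses.
samples = rng.normal(m, math.sqrt(v), size=500_000)
mc = float(np.mean(np.abs(y - samples) <= eps))
```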
GP Result
Furthermore we can lower bound the expected density at a point (x, y) under the posterior distribution by

\[
2\varepsilon\,\mathcal{N}(y\,|\,m(x),v(x)) \;\ge\; \mathbb{E}_{f\sim Q}\big[h^{\varepsilon}_{f}(x,y)\big] - \frac{\varepsilon^2}{v(x)\sqrt{2e\pi}}
\]

(the second term bounds, over τ ∈ [−ε, ε], the variation of the density across the interval), enabling an application of the PAC-Bayes theorem to give

\[
\mathbb{E}\Big[\mathcal{N}(y\,|\,m(x),v(x)) + \frac{\varepsilon}{2v(x)\sqrt{2e\pi}}\Big] \;\ge\; \frac{1}{2\varepsilon}\,\mathrm{KL}^{-1}\Big(\hat{E}(\varepsilon),\; \frac{D + \ln((m+1)/\delta)}{m}\Big),
\]

where Ê(ε) is the empirical average of E_{f∼Q}[h^ε_f(x,y)] and D is the KL divergence between prior and posterior.
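The bound relies on inverting the binary KL divergence. Assuming KL⁻¹ here denotes the lower inverse (the smallest Bernoulli parameter p consistent with empirical value q and slack c), a bisection sketch:

```python
import math

def binary_kl(q, p):
    """kl(q||p) between Bernoulli(q) and Bernoulli(p), with 0 log 0 = 0."""
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(q, p) + term(1.0 - q, 1.0 - p)

def kl_inverse_lower(q, c, tol=1e-12):
    """Lower inverse: approximately min{p in (0, q] : kl(q||p) <= c}.
    Uses the fact that kl(q||p) is decreasing in p on (0, q].  Assumes 0 < q < 1."""
    lo, hi = 1e-15, q
    if binary_kl(q, lo) <= c:   # slack so large the constraint never binds
        return lo
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_kl(q, mid) > c:
            lo = mid
        else:
            hi = mid
    return hi
```

For example, `kl_inverse_lower(0.3, 0.05)` gives the tightest lower bound compatible with an empirical value of 0.3 and slack 0.05.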
GP Experimental Results
- The robot arm problem (R): 150 training points and 51 test points.
- The Boston housing problem (H): 455 training points and 51 test points.
- The forest fire problem (F): 450 training points and 67 test points.

Data   σ        ε        KL⁻¹     e_test   KL⁻¹ (varGP)   e_test (varGP)
R      0.0494   0.8903   0.4782   0.8419
H      0.1924   0.8699   0.4645   0.7155   0.8401         0.9416
F      1.0129   0.5694   0.4557   0.5533
GP Experimental Results
We can also plot the test accuracy and bound as a function of ε:

Figure: Gaussian noise: plot of E_{(x,y)∼D}[1 − α(x)] against ε for varying noise level η; panels (a) η = 1, (b) η = 3, (c) η = 5.
GP Experimental Results
With Laplace noise:
Figure: Laplace noise: plot of E_{(x,y)∼D}[1 − α(x)] against ε for varying noise level η; panels (a) η = 1, (b) η = 3, (c) η = 5.
GP Experimental Results
Robot arm problem and Boston housing:

Figure: Confidence levels for (a) the robot arm problem and (b) Boston housing.
Stochastic Differential Equation Models
- Consider modelling a time-varying process with a (non-linear) stochastic differential equation:

\[
d\mathbf{x} = \mathbf{f}(\mathbf{x},t)\,dt + \sqrt{\Sigma}\,d\mathbf{W}
\]

- f(x, t) is a non-linear drift term and dW is a Wiener process.
- This is the limit of the discrete-time equation:

\[
\Delta\mathbf{x}_k \equiv \mathbf{x}_{k+1} - \mathbf{x}_k = \mathbf{f}(\mathbf{x}_k)\,\Delta t + \sqrt{\Delta t\,\Sigma}\;\boldsymbol{\varepsilon}_k,
\]

where ε_k is zero-mean, unit-variance Gaussian noise.
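The discrete-time update is exactly an Euler–Maruyama scheme and is easy to simulate. A sketch with an illustrative Ornstein–Uhlenbeck drift f(x) = −x (not from the slides), whose stationary variance is Σ/2:

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma, dt, n_steps, n_paths = 1.0, 0.01, 2000, 20_000

x = np.zeros(n_paths)
for _ in range(n_steps):
    eps_k = rng.normal(size=n_paths)                   # zero-mean, unit-variance noise
    x = x + (-x) * dt + np.sqrt(dt * Sigma) * eps_k    # Delta x_k = f(x_k) dt + sqrt(dt Sigma) eps_k

stationary_var = float(x.var())    # should approach Sigma / 2 = 0.5
```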
Variational approximation
- We use the Bayesian approach to data modelling, with a noise model given by

\[
p(\mathbf{y}_n\,|\,\mathbf{x}(t_n)) = \mathcal{N}(\mathbf{y}_n\,|\,H\mathbf{x}(t_n),\,R).
\]

- We consider a variational approximation of the posterior using a time-varying linear SDE:

\[
d\mathbf{x} = \mathbf{f}_L(\mathbf{x},t)\,dt + \sqrt{\Sigma}\,d\mathbf{W}, \qquad \text{where } \mathbf{f}_L(\mathbf{x},t) = -A(t)\,\mathbf{x} + \mathbf{b}(t).
\]
Girsanov change of measure
- The measure for the drift f is denoted by P and the one for the drift f_L by Q.
- The KL divergence in the infinite-dimensional setting is given by the Radon–Nikodym derivative of Q with respect to P:

\[
\mathrm{KL}[Q\|P] = \int dQ\,\ln\frac{dQ}{dP} = \mathbb{E}_Q\,\ln\frac{dQ}{dP},
\]

which can be computed via

\[
\frac{dQ}{dP} = \exp\Big\{ -\int_{t_0}^{t_f} (\mathbf{f}-\mathbf{f}_L)^{\top}\Sigma^{-1/2}\,d\mathbf{W}_t + \frac{1}{2}\int_{t_0}^{t_f} (\mathbf{f}-\mathbf{f}_L)^{\top}\Sigma^{-1}(\mathbf{f}-\mathbf{f}_L)\,dt \Big\},
\]

where W is a Wiener process with respect to Q.
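In discrete time the exponent above becomes a sum over Euler steps, and the change of measure can be checked by importance sampling: simulate paths under Q and reweight by the reciprocal derivative dP/dQ to recover an expectation under P. A scalar sketch with illustrative drifts (not from the slides):

```python
import math
import numpy as np

rng = np.random.default_rng(2)
f  = lambda x: -2.0 * x      # drift under P (illustrative)
fL = lambda x: -1.0 * x      # linear drift under Q (illustrative)
Sigma, dt, T, n_paths = 1.0, 0.0025, 0.5, 100_000

x = np.ones(n_paths)         # all paths start at x(0) = 1
log_w = np.zeros(n_paths)    # accumulates log dP/dQ along each path
for _ in range(int(round(T / dt))):
    dW = math.sqrt(dt) * rng.normal(size=n_paths)    # Wiener increments under Q
    u = (f(x) - fL(x)) / math.sqrt(Sigma)
    log_w += u * dW - 0.5 * u ** 2 * dt              # discretised Girsanov exponent
    x = x + fL(x) * dt + math.sqrt(Sigma) * dW       # evolve under Q

w = np.exp(log_w - log_w.max())                      # stabilised weights
est = float(np.sum(w * x) / np.sum(w))               # self-normalised E_P[x(T)]
# For this P (an OU process with rate 2), E_P[x(T)] = exp(-2T) = exp(-1).
```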
KL divergence
Hence, the KL divergence is

\[
\mathrm{KL}[Q\|P] = \frac{1}{2}\int_{t_0}^{t_f} \Big\langle \big(\mathbf{f}(\mathbf{x}(t),t)-\mathbf{f}_L(\mathbf{x}(t),t)\big)^{\top}\Sigma^{-1}\big(\mathbf{f}(\mathbf{x}(t),t)-\mathbf{f}_L(\mathbf{x}(t),t)\big) \Big\rangle_{q_t}\,dt,
\]

where ⟨·⟩_{q_t} denotes the expectation with respect to the marginal density at time t of the measure Q.
Variational approximation
- As the approximating SDE is linear, the marginal distribution q_t is Gaussian,

\[
q_t(\mathbf{x}) = \mathcal{N}(\mathbf{x}\,|\,\mathbf{m}(t),\,S(t)),
\]

with the mean m(t) and covariance S(t) described by ordinary differential equations (ODEs):

\[
\frac{d\mathbf{m}}{dt} = -A\mathbf{m} + \mathbf{b}, \qquad \frac{dS}{dt} = -AS - SA^{\top} + \Sigma.
\]
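These moment ODEs can be integrated with any standard solver. A forward-Euler sketch for a scalar, time-invariant case (constant A, b, Σ are illustrative choices), compared against the closed-form solutions:

```python
import numpy as np

A, b, Sigma = 2.0, 1.0, 0.5
dt, T = 1e-4, 3.0
m, S = 0.0, 0.0                           # initial mean and variance
for _ in range(int(round(T / dt))):
    # dm/dt = -A m + b,  dS/dt = -A S - S A' + Sigma
    m, S = m + (-A * m + b) * dt, S + (-A * S - S * A + Sigma) * dt

# Closed-form solutions for this scalar case:
m_exact = (b / A) * (1.0 - np.exp(-A * T))                # -> b/A as T grows
S_exact = (Sigma / (2 * A)) * (1.0 - np.exp(-2 * A * T))  # -> Sigma/(2A)
```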
Algorithmics
- Using Lagrangian methods we can derive an algorithm that finds the variational approximation by minimising the KL divergence between the posterior and the approximating distribution.
- But the KL divergence also appears in the PAC-Bayes bound: is it possible to define an appropriate loss over paths ω that captures the properties of interest?
Error estimation
- For ω : [0,T] → R^D defining a trajectory ω(t) ∈ R^D, we define the classifier h_ω by

\[
h_{\omega}(\mathbf{y},t) = \begin{cases} 1 & \text{if } \|\mathbf{y} - H\omega(t)\| \le \varepsilon, \\ 0 & \text{otherwise,} \end{cases}
\]

where the actual observations are linear functions of the state variable given by the operator H.
- The prior and posterior distributions over functions are inherited from the distributions P and Q over paths ω.
- Hence P = p_sde and Q = q, defined by the linear approximating SDE.
Generalisation analysis
- For the PAC-Bayes analysis we must compute KL(Q‖P), ê_Q and e_Q. We have, as above,

\[
\mathrm{KL}(Q\|P) = \int dq\,\ln\frac{dq}{dp_{\mathrm{sde}}}.
\]

- If we now consider a fixed sample (y, t), we can estimate

\[
\mathbb{E}_{\omega\sim Q}\big[h_{\omega}(\mathbf{y},t)\big] = \int I\big[\|H\mathbf{x}-\mathbf{y}\| \le \varepsilon\big]\,dq_t(\mathbf{x}).
\]

- For sufficiently small values of ε we can approximate this by

\[
\approx \frac{V_d\,\varepsilon^d}{(2\pi)^{d/2}|HS(t)H^{\top}|^{1/2}} \exp\Big(-\tfrac{1}{2}(\mathbf{y}-H\mathbf{m}(t))^{\top}(HS(t)H^{\top})^{-1}(\mathbf{y}-H\mathbf{m}(t))\Big) = V_d\,\varepsilon^d\,\mathcal{N}(\mathbf{y}\,|\,H\mathbf{m}(t),\,HS(t)H^{\top}),
\]

where V_d is the volume of the unit ball in R^d.
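The small-ball approximation can be sanity-checked by Monte Carlo in low dimension. A sketch with d = 2 and H = I (illustrative choices, so V_d = π and HSH' = S):

```python
import numpy as np

rng = np.random.default_rng(3)
m = np.array([0.2, -0.1])                     # marginal mean m(t) (illustrative)
S = np.array([[0.5, 0.1], [0.1, 0.3]])        # marginal covariance S(t) (illustrative)
y = np.array([0.0, 0.0])
eps, d = 0.02, 2
V_d = np.pi                                   # volume of the unit ball in R^2

# Gaussian density N(y | m, S) and the small-ball approximation V_d eps^d N.
diff = y - m
density = np.exp(-0.5 * diff @ np.linalg.solve(S, diff)) / (
    (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(S)))
approx = V_d * eps ** d * density

# Monte Carlo estimate of the exact ball probability under q_t.
samples = rng.multivariate_normal(m, S, size=2_000_000)
mc = float(np.mean(np.linalg.norm(samples - y, axis=1) <= eps))
```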
Error estimates
- Note that e_Q is simply

\[
e_Q = \mathbb{E}_{(\mathbf{y},t)\sim\mu}\,\mathbb{E}_{\omega\sim Q}\big[h_{\omega}(\mathbf{y},t)\big] \propto \int \mathcal{N}(\mathbf{y}\,|\,H\mathbf{m}(t),\,HS(t)H^{\top})\,d\mu(\mathbf{y},t),
\]

while ê_Q is the empirical average of this quantity.
- A tension arises in setting ε: if it is large, the approximation is inaccurate.
- If e_Q and ê_Q are both small, the bound implied by KL(ê_Q‖e_Q) ≤ C becomes weak.
Refining the distributions
- We overcome this weakness by taking K-fold product distributions and defining h_{(ω_1,…,ω_K)} as

\[
h_{(\omega_1,\dots,\omega_K)}(\mathbf{y},t) = \begin{cases} 1 & \text{if there exists } 1 \le i \le K \text{ such that } \|\mathbf{y} - H\omega_i(t)\| \le \varepsilon, \\ 0 & \text{otherwise.} \end{cases}
\]

- We now have

\[
\mathbb{E}_{(\omega_1,\dots,\omega_K)\sim Q^K}\big[h_{(\omega_1,\dots,\omega_K)}(\mathbf{y},t)\big] \approx 1 - \Big(1 - \int I\big[\|H\mathbf{x}-\mathbf{y}\| \le \varepsilon\big]\,dq_t(\mathbf{x})\Big)^{K} \approx K V_d\,\varepsilon^d\,\mathcal{N}(\mathbf{y}\,|\,H\mathbf{m}(t),\,HS(t)H^{\top}).
\]
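Both approximations in the last display are easy to verify numerically in one dimension, where H = 1 and V_1 ε = 2ε (the specific numbers below are illustrative):

```python
from math import erf, exp, pi, sqrt

m, S, y, eps, K = 0.0, 1.0, 1.5, 0.01, 10

# Exact single-draw probability p = P(|y - x| <= eps) for x ~ N(m, S).
Phi = lambda t: 0.5 * (1 + erf((t - m) / sqrt(2 * S)))
p = Phi(y + eps) - Phi(y - eps)

# Small-ball approximation p ~ V_1 eps N(y | m, S), with V_1 eps = 2 eps.
p_approx = 2 * eps * exp(-0.5 * (y - m) ** 2 / S) / sqrt(2 * pi * S)

union_exact = 1 - (1 - p) ** K          # exact K-fold product expectation
union_approx = K * p_approx             # the slide's double approximation
```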
Final result
Putting it all together gives the final bound:

\[
\mathbb{E}_{(\mathbf{y},t)\sim\mu}\big[\mathcal{N}(\mathbf{y}\,|\,H\mathbf{m}(t),\,HS(t)H^{\top})\big] \;\ge\; \frac{1}{V_d\,\varepsilon^d K}\,\mathrm{KL}^{-1}\Big( K V_d\,\varepsilon^d\,\hat{\mathbb{E}}\big[\mathcal{N}(\mathbf{y}\,|\,H\mathbf{m}(t),\,HS(t)H^{\top})\big],\; \frac{K\int_0^T E_{\mathrm{sde}}(t)\,dt + \ln((m+1)/\delta)}{m} \Big),
\]

where

\[
E_{\mathrm{sde}}(t) = \frac{1}{2}\Big\langle \big(\mathbf{f}(\mathbf{x})-\mathbf{f}_L(\mathbf{x},t)\big)^{\top}\Sigma^{-1}\big(\mathbf{f}(\mathbf{x})-\mathbf{f}_L(\mathbf{x},t)\big) \Big\rangle_{q_t}.
\]
Small scale experiment
We applied the analysis to the results of performing a variational Bayesian approximation to the Lorenz attractor in three dimensions. The quality of the fit with 49 examples was good.

Figure: fitted trajectory of the Lorenz attractor in three dimensions.
Small scale experiment
- We chose V_d ε^d to optimise the bound: a fairly small ball, implying that our approximation should be reasonable.
- We compared the bound with the left-hand side estimated on a random draw of 99 test points. The corresponding values are:

m    dt      ê_Q     A       e_Q     KL⁻¹(·,·)/V
49   0.005   0.137   3.536   0.128   0.004
Conclusions
- Overview of the theory and main result.
- Application to bound the performance of an SVM.
- Experiments show the new bound can be tighter ...
- ... and reliable for low-cost model selection.
- Extended to maximum entropy classification.
- Also considered lower bounding the accuracy of a posterior distribution for Gaussian processes (GPs).
- Applied the theory to bound the performance of estimates made using approximate Bayesian inference for dynamical systems:
  - Prior determined by a non-linear stochastic differential equation (SDE).
  - Variational approximation results in a posterior given by an approximating linear SDE, hence a Gaussian process posterior.