PAC-Bayes Analysis: Background and Applications


Transcript of PAC-Bayes Analysis: Background and Applications

Source: web.cse.ohio-state.edu/mlss09/mlss09_talks/1.june-MON/JST_pacbayes...


PAC-Bayes Analysis: Background and Applications

John Shawe-Taylor, University College London
Chicago/TTI Workshop, June 2009

Including joint work with John Langford, Amiran Ambroladze, Emilio Parrado-Hernández, Cédric Archambeau, Matthew Higgs, and Manfred Opper


Aims

Hope to give you:
- the PAC-Bayes framework
- the core result
- how to apply it to Support Vector Machines
- application to maximum entropy classification
- application to Gaussian Processes and dynamical systems modelling


1 Background to Approach
2 PAC-Bayes Analysis
  Definitions; PAC-Bayes Theorem; Applications
3 Linear Classifiers
  General Approach; Learning the prior
4 Maximum entropy classification
  Generalisation; Optimisation
5 GPs and SDEs
  Gaussian Process regression; Variational approximation; Generalisation


General perspectives

- The goal of different theories is to capture the key elements that enable an understanding and analysis of different phenomena
- There are several theories of machine learning: notably Bayesian and frequentist
- Different assumptions, and hence different ranges of applicability and of results
- Bayesian: able to make more detailed probabilistic predictions
- Frequentist: makes only the i.i.d. assumption


Historical notes: Frequentist approach

- Pioneered in Russia by Vapnik and Chervonenkis
- Introduced in the West by Valiant under the name 'probably approximately correct'
- Typical results state that with probability at least 1 − δ (probably), any classifier from the hypothesis class which has low training error will have low generalisation error (approximately correct)
- Has the status of a statistical test: the confidence is denoted by δ, the probability that the sample is misleading/unusual
- SVM bound using the luckiness framework by S-T et al. (1998)
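The 'probably approximately correct' form of guarantee can be made concrete with the standard Hoeffding-plus-union-bound argument for a finite hypothesis class. This is an illustrative sketch only (it is not the luckiness or SVM bound discussed in the talk), and the class size and sample numbers are invented:

```python
import math

def pac_bound(train_error, m, delta, class_size=1):
    """With probability at least 1 - delta, every classifier in a finite
    class H satisfies: true error <= training error + sqrt(ln(|H|/delta)/(2m)).
    (Hoeffding's inequality plus a union bound over H.)"""
    return train_error + math.sqrt(math.log(class_size / delta) / (2 * m))

# invented numbers: 5% training error, m = 10,000, 95% confidence, |H| = 1000
print(round(pac_bound(0.05, 10_000, 0.05, class_size=1000), 4))   # 0.0723
```

Note how δ enters only logarithmically: demanding higher confidence (smaller δ) costs little, which is exactly the 'statistical test' character described above.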


Historical notes: Bayesian approach

- The name derives from Bayes' theorem: we assume a prior distribution over functions or classifiers and then use Bayes' rule to update the prior based on the likelihood of the data for each function
- This gives the posterior distribution: the Bayesian will classify according to the expected classification under the posterior, the best strategy given that the prior is correct
- Can be used for model selection by evaluating the 'evidence' for a model (see for example David MacKay); this is related to the volume of version space consistent with the data
- Gaussian processes for regression are justified within this model
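As a toy illustration of this update (not an example from the talk), the following sketch applies Bayes' rule to a finite class of threshold classifiers under an assumed label-noise likelihood; the data, noise rate, and class are all invented:

```python
import numpy as np

rng = np.random.default_rng(0)
thresholds = np.linspace(-2, 2, 41)              # finite class: sgn(x - t)
prior = np.full(len(thresholds), 1.0 / len(thresholds))

# invented data: true threshold 0.5, labels flipped with probability 0.1
X = rng.uniform(-2, 2, size=50)
y = np.where(X >= 0.5, 1, -1)
flip = rng.random(50) < 0.1
y[flip] = -y[flip]

eps = 0.1                                        # assumed noise rate
preds = np.where(X[None, :] >= thresholds[:, None], 1, -1)
correct = preds == y[None, :]
log_lik = np.log(np.where(correct, 1 - eps, eps)).sum(axis=1)

# Bayes' rule: posterior proportional to prior times likelihood
posterior = prior * np.exp(log_lik - log_lik.max())
posterior /= posterior.sum()
print(thresholds[posterior.argmax()])            # mode lies near 0.5
```

With a uniform prior the posterior concentrates on the classifiers most consistent with the data, which is the version-space picture on the next slide.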


Version space: evidence

[Figure: version-space diagram showing weight vectors w and w' and the constraints f(x1, w) = 0, f(x2, w) = 0, f(x3, w) = 0, f(x4, w) = 0, with labelled regions C1, C2, C3.]


Evidence and generalisation

- A link between evidence and generalisation was hypothesised by MacKay
- The first formal link was obtained by S-T & Williamson (1997): PAC Analysis of a Bayes Estimator
- Bound on generalisation in terms of the volume of the sphere that can be inscribed in the version space; included a dependence on the dimensionality of the space
- Used the luckiness framework, a data-dependent style of frequentist bound also used to bound the generalisation of SVMs, for which no dependence on the dimensionality is needed, just on the margin


PAC-Bayes Theorem

- First version proved by McAllester in 1999
- Improved proof and bound due to Seeger in 2002, with application to Gaussian processes
- Application to SVMs by Langford and S-T, also in 2002
- Excellent tutorial by Langford appeared in 2005 in JMLR


Definitions for main result: prior and posterior distributions

- The PAC-Bayes theorem involves a class of classifiers C together with a prior distribution P and a posterior distribution Q over C
- The distribution P must be chosen before learning, but the bound holds for all choices of Q; hence Q does not need to be the classical Bayesian posterior
- The bound holds for all (prior) choices of P; hence its validity is not affected by a poor choice of P, though the quality of the resulting bound may be. Contrast this with a standard Bayesian analysis, which only holds if the prior assumptions are correct


Definitions for main result: error measures

- Being a frequentist (PAC) style result, we assume an unknown distribution D on the input space X
- D is used to generate the labelled training samples i.i.d., i.e. S ∼ D^m
- It is also used to measure the generalisation error c_D of a classifier c:

    c_D = Pr_{(x,y)∼D}(c(x) ≠ y)

- The empirical generalisation error is denoted c_S:

    c_S = (1/m) Σ_{(x,y)∈S} I[c(x) ≠ y],   where I[·] is the indicator function
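These definitions translate directly into code. A minimal sketch of c_S, with an invented stump classifier and a made-up four-point sample:

```python
def empirical_error(c, S):
    """c_S = (1/m) * sum over (x, y) in S of I[c(x) != y]."""
    return sum(1 for x, y in S if c(x) != y) / len(S)

# invented stump classifier and labelled sample
c = lambda x: 1 if x >= 0 else -1
S = [(-1.0, -1), (-0.2, 1), (0.3, 1), (2.0, 1)]
print(empirical_error(c, S))   # one mistake in four points -> 0.25
```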


Definitions for main result: assessing the posterior

- The result is concerned with bounding the performance of a probabilistic classifier that, given a test input x, chooses a classifier c ∼ Q (the posterior) and returns c(x)
- We are interested in the relation between two quantities: the true error rate of the probabilistic classifier,

    Q_D = E_{c∼Q}[c_D],

  and its empirical error rate,

    Q_S = E_{c∼Q}[c_S]
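Q_S can be estimated by Monte Carlo: repeatedly draw c ∼ Q and average the empirical errors of the draws. A sketch with an invented posterior over threshold classifiers; for this particular Q and sample, the analytic value is (0.4 + 0.2)/4 = 0.15:

```python
import random

def gibbs_empirical_error(sample_classifier, S, n_draws=2000, seed=0):
    """Monte Carlo estimate of Q_S = E_{c~Q}[c_S]: draw c ~ Q repeatedly
    and average the empirical error of each draw."""
    random.seed(seed)
    total = 0.0
    for _ in range(n_draws):
        c = sample_classifier()          # one classifier drawn from Q
        total += sum(1 for x, y in S if c(x) != y) / len(S)
    return total / n_draws

# invented posterior Q: threshold t ~ U(-0.5, 0.5), classifier sgn(x - t)
sample_c = lambda: (lambda x, t=random.uniform(-0.5, 0.5): 1 if x >= t else -1)
S = [(-1.0, -1), (0.1, 1), (0.8, 1), (-0.3, -1)]
print(gibbs_empirical_error(sample_c, S))    # close to the analytic 0.15
```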


Definitions for main result: Generalisation error

Note that this does not bound the error of the posterior average, but we have

Pr_{(x,y)∼D}(sgn(E_{c∼Q}[c(x)]) ≠ y) ≤ 2 Q_D,

since for any point x misclassified by sgn(E_{c∼Q}[c(x)]), the probability of a random c ∼ Q misclassifying it is at least 0.5.


PAC-Bayes Theorem

Fix an arbitrary D, an arbitrary prior P, and a confidence δ. Then with probability at least 1 − δ over samples S ∼ D^m, all posteriors Q satisfy

KL(Q_S‖Q_D) ≤ (KL(Q‖P) + ln((m + 1)/δ)) / m

where KL is the KL divergence between distributions,

KL(Q‖P) = E_{c∼Q}[ln(Q(c)/P(c))],

with Q_S and Q_D considered as distributions on {0, +1}.
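Since Q_S and Q_D are treated as distributions on {0, +1}, the left-hand side is a KL divergence between Bernoulli variables. A small sketch of both sides of the theorem (the numbers are illustrative):

```python
import math

def kl_bernoulli(q, p):
    """KL(q || p) between Bernoulli distributions with parameters q, p in (0, 1)."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def pac_bayes_rhs(kl_QP, m, delta):
    """Right-hand side (KL(Q||P) + ln((m+1)/delta)) / m of the theorem."""
    return (kl_QP + math.log((m + 1) / delta)) / m

# Illustrative numbers: KL(Q||P) = 5.0, m = 1000 examples, delta = 0.05.
print(pac_bayes_rhs(5.0, 1000, 0.05))
print(kl_bernoulli(0.1, 0.2))   # KL between empirical and true error rates
```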


Finite Classes

If we take a finite class of functions h_1, …, h_N with prior distribution p_1, …, p_N and assume that the posterior is concentrated on a single function h_i, the generalisation error is bounded by

KL(err_S(h_i) ‖ err_D(h_i)) ≤ (−ln(p_i) + ln((m + 1)/δ)) / m

where err_S and err_D denote the empirical and true error. This is the standard result for finite classes, with the slight refinement that it involves the KL divergence between empirical and true error, and the extra ln(m + 1) term on the right-hand side.
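The bound is implicit in the true error; inverting the Bernoulli KL numerically (e.g. by bisection) turns it into an explicit upper bound. A sketch with made-up numbers for p_i, the empirical error, m and δ:

```python
import math

def kl_bernoulli(q, p):
    eps = 1e-12
    q, p = min(max(q, eps), 1 - eps), min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(q, A, tol=1e-9):
    """max{ p in [q, 1) : KL(q||p) <= A } by bisection (KL increases in p for p >= q)."""
    lo, hi = q, 1.0 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(q, mid) <= A:
            lo = mid
        else:
            hi = mid
    return lo

# Finite class: function h_i with prior weight p_i = 0.01,
# empirical error 0.05 on m = 2000 examples, confidence delta = 0.05.
m, delta, p_i, emp_err = 2000, 0.05, 0.01, 0.05
rhs = (-math.log(p_i) + math.log((m + 1) / delta)) / m
bound = kl_inverse(emp_err, rhs)
print(bound)  # explicit upper bound on the true error of h_i
```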


Linear classifiers and SVMs

Focus on the application to linear functions (Langford & Shawe-Taylor):
How the application is made
Extensions to learning the prior
Some results on UCI datasets to give an idea of what can be achieved


Linear classifiers

We will choose the prior and posterior distributions to be Gaussians with unit variance. The prior P will be centred at the origin. The centre of the posterior Q(w, µ) will be specified by a unit vector w and a scale factor µ.


PAC-Bayes Bound for SVM (1/2)

[Figure: weight space W with the prior P, a Gaussian centred at the origin 0; the posterior Q is a Gaussian centred at distance µ along the direction w.]

Prior P is Gaussian N(0, 1)
Posterior is in the direction w, at distance µ from the origin
Posterior Q is Gaussian


PAC-Bayes Bound for SVM (2/2)

The linear classifier's performance may be bounded by

KL(Q_S(w, µ) ‖ Q_D(w, µ)) ≤ (KL(P‖Q(w, µ)) + ln((m + 1)/δ)) / m

Q_D(w, µ) is the true performance of the stochastic classifier. The SVM is a deterministic classifier that corresponds exactly to sgn(E_{c∼Q(w,µ)}[c(x)]), since the centre of the Gaussian gives the same classification as the halfspace carrying more weight. Hence its error is bounded by 2 Q_D(w, µ): as observed above, if x is misclassified then at least half of the classifiers c ∼ Q err.
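A quick Monte Carlo sanity check of this correspondence (numpy; the direction w and test points are illustrative): the majority vote of classifiers drawn from Q(w, µ) agrees with the halfspace defined by the Gaussian's centre.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n = 2.0, 20000
w = np.array([1.0, 0.0, 0.0])        # unit posterior direction (illustrative)

# Draw classifiers c ~ Q(w, mu) = N(mu * w, I).
c = mu * w + rng.standard_normal((n, 3))

def majority_vote(x):
    """Monte Carlo estimate of sgn(E_{c~Q}[sgn(c . x)])."""
    return np.sign(np.mean(np.sign(c @ x)))

# Test points with a clear margin w . x away from 0.
points = [np.array([0.5, 1.0, -1.0]),
          np.array([-0.4, 0.2, 0.3]),
          np.array([2.0, -1.0, 0.5])]
votes = [majority_vote(x) for x in points]
centres = [np.sign(w @ x) for x in points]
print(votes, centres)                 # the two lists agree
```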


Q_S(w, µ) is a stochastic measure of the training error:

Q_S(w, µ) = E_m[F(µ γ(x, y))]

γ(x, y) = y w^T φ(x) / (‖φ(x)‖ ‖w‖)

F(t) = 1 − (1/√(2π)) ∫_{−∞}^t e^{−x²/2} dx
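Since F(t) = 1 − Φ(t) is the Gaussian tail probability, Q_S(w, µ) can be computed in closed form from the normalised margins. A minimal sketch with a toy sample (the data and weight vector are illustrative, and φ is taken to be the identity):

```python
import math

def F(t):
    """Gaussian tail: F(t) = 1 - Phi(t) = 0.5 * erfc(t / sqrt(2))."""
    return 0.5 * math.erfc(t / math.sqrt(2))

def stochastic_train_error(w, S, mu):
    """Q_S(w, mu) = E_m[F(mu * gamma(x, y))] with gamma the normalised margin."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    total = 0.0
    for x, y in S:
        norm_x = math.sqrt(sum(xi * xi for xi in x))
        gamma = y * sum(wi * xi for wi, xi in zip(w, x)) / (norm_x * norm_w)
        total += F(mu * gamma)
    return total / len(S)

w = [1.0, -0.5]
S = [([1.0, 0.0], 1), ([0.2, 1.0], -1), ([-1.0, 0.5], -1)]
print(stochastic_train_error(w, S, mu=3.0))
```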


Prior P ≡ Gaussian centred on the origin
Posterior Q ≡ Gaussian along w at a distance µ from the origin
KL(P‖Q) = µ²/2
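For two unit-variance isotropic Gaussians the KL divergence reduces to half the squared distance between their means, which gives µ²/2 here. A quick Monte Carlo check (numpy; the dimension and µ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, mu = 4, 1.5
w = np.zeros(d); w[0] = 1.0          # unit vector; posterior centre is mu * w

# KL(P||Q) = E_{c~P}[ log p(c) - log q(c) ] with P = N(0, I), Q = N(mu*w, I).
c = rng.standard_normal((200000, d))             # samples from the prior P
log_ratio = 0.5 * (((c - mu * w) ** 2).sum(1) - (c ** 2).sum(1))
print(log_ratio.mean(), mu ** 2 / 2)             # both close to 1.125
```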


δ is the confidence: the bound holds with probability 1 − δ over the random i.i.d. selection of the training data.


Form of the SVM bound

Note that the bound holds for all posterior distributions, so we can choose µ to optimise the bound. If we define the inverse of the KL by

KL⁻¹(q, A) = max{p : KL(q‖p) ≤ A}

then with probability at least 1 − δ,

Pr_{(x,y)∼D}(sgn(〈w, φ(x)〉) ≠ y) ≤ 2 min_µ KL⁻¹( E_m[F(µ γ(x, y))], (µ²/2 + ln((m + 1)/δ)) / m )
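Putting the pieces together, the bound can be evaluated numerically: compute the stochastic training error at each µ, invert the Bernoulli KL by bisection, and take the minimum over a grid of µ values. A self-contained sketch with made-up normalised margins:

```python
import math

def F(t):
    return 0.5 * math.erfc(t / math.sqrt(2))   # Gaussian tail 1 - Phi(t)

def kl(q, p):
    eps = 1e-12
    q, p = min(max(q, eps), 1 - eps), min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(q, A):
    lo, hi = q, 1.0 - 1e-12                    # KL(q||p) increases for p >= q
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if kl(q, mid) <= A else (lo, mid)
    return lo

def svm_bound(margins, m, delta, mus):
    """2 * min_mu KL^{-1}( E_m[F(mu*gamma)], (mu^2/2 + ln((m+1)/delta)) / m )."""
    best = 1.0
    for mu in mus:
        q_s = sum(F(mu * g) for g in margins) / len(margins)
        rhs = (mu ** 2 / 2 + math.log((m + 1) / delta)) / m
        best = min(best, kl_inverse(q_s, rhs))
    return 2 * best

# Illustrative: m = 1000 normalised margins gamma(x, y), mostly positive.
margins = [0.9, 0.7, 0.8, 0.5, -0.1, 0.6, 0.85, 0.4] * 125
bound = svm_bound(margins, m=1000, delta=0.05, mus=[0.5 * k for k in range(1, 21)])
print(bound)
```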


Gives SVM Optimisation

Primal form:

min_{w, ξ} (1/2)‖w‖² + C ∑_{i=1}^m ξ_i
s.t. y_i w^T φ(x_i) ≥ 1 − ξ_i, i = 1, …, m
     ξ_i ≥ 0, i = 1, …, m

Dual form:

max_α ∑_{i=1}^m α_i − (1/2) ∑_{i,j=1}^m α_i α_j y_i y_j κ(x_i, x_j)
s.t. 0 ≤ α_i ≤ C, i = 1, …, m

where κ(x_i, x_j) = 〈φ(x_i), φ(x_j)〉 and 〈w, φ(x)〉 = ∑_{i=1}^m α_i y_i κ(x_i, x).
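Because this primal has no bias term, the dual has only box constraints 0 ≤ α_i ≤ C, so even plain projected gradient ascent suffices. A toy sketch (numpy, linear kernel; not how one would solve SVMs at scale):

```python
import numpy as np

def svm_dual_pg(X, y, C=1.0, lr=0.01, steps=2000):
    """Maximise sum(a) - 0.5 * a' H a, H = (y y') * K, s.t. 0 <= a <= C."""
    K = X @ X.T                       # linear kernel kappa(x_i, x_j)
    H = np.outer(y, y) * K
    a = np.zeros(len(y))
    for _ in range(steps):
        grad = 1.0 - H @ a            # gradient of the dual objective
        a = np.clip(a + lr * grad, 0.0, C)   # projection onto the box
    return a

# Toy linearly separable data.
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = svm_dual_pg(X, y)
w = (alpha * y) @ X                   # w = sum_i alpha_i y_i x_i
print(np.sign(X @ w))                 # recovers the labels on this toy set
```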


Slack variable conversion

[Figure: slack variable conversion — curves plotted against the margin, horizontal axis from −2 to 2, vertical axis from 0 to 3.]


Learning the prior (1/3)

The bound depends on the distance between prior and posterior
A better prior (closer to the posterior) would lead to a tighter bound
Learn the prior P with part of the data
Introduce the learnt prior into the bound
Compute the stochastic error with the remaining data
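This recipe can be sketched end to end. Here the "prior training" step is a stand-in (the class-mean direction rather than an actual SVM), the posterior direction and µ are illustrative, and the prior's scaling η is not modelled; the point is only the shape of the computation: the KL term becomes half the squared distance between the two Gaussian centres, and the empirical term uses only the held-out half.

```python
import math
import random

random.seed(0)

def F(t):
    return 0.5 * math.erfc(t / math.sqrt(2))   # Gaussian tail 1 - Phi(t)

# Toy 2-d data: positives around (1, 1), negatives around (-1, -1).
data = [([random.gauss(y, 1.0), random.gauss(y, 1.0)], y)
        for y in [1, -1] for _ in range(100)]
random.shuffle(data)
prior_half, bound_half = data[:100], data[100:]

def mean_dir(sample):                 # stand-in for training a classifier
    d = [0.0, 0.0]
    for x, y in sample:
        d[0] += y * x[0]; d[1] += y * x[1]
    n = math.hypot(d[0], d[1])
    return [d[0] / n, d[1] / n]

w_prior = mean_dir(prior_half)        # prior centre learnt on the first half
w = mean_dir(data)                    # posterior direction (unit vector)
mu, delta = 3.0, 0.05
m = len(bound_half)

# KL between N(mu*w, I) and N(w_prior, I) is half the squared centre distance.
kl_qp = sum((mu * wi - pi) ** 2 for wi, pi in zip(w, w_prior)) / 2
q_s = sum(F(mu * y * (w[0] * x[0] + w[1] * x[1]) / math.hypot(x[0], x[1]))
          for x, y in bound_half) / m          # w is unit, so no ||w|| factor
rhs = (kl_qp + math.log((m + 1) / delta)) / m
print(q_s, rhs)                       # plug into KL^{-1} as before for the bound
```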


Tightness of the new bound

Problem              PAC-Bayes Bound   Prior-PAC-Bayes Bound
Wdbc                 0.346 ± 0.006     0.284 ± 0.021
Waveform             0.197 ± 0.002     0.143 ± 0.005
Ringnorm             0.211 ± 0.001     0.093 ± 0.004
Pima                 0.399 ± 0.007     0.374 ± 0.020
Landsat              0.035 ± 0.001     0.023 ± 0.002
Handwritten-digits   0.159 ± 0.001     0.084 ± 0.003
Spam                 0.243 ± 0.002     0.161 ± 0.006
Average              0.227             0.166


Model Selection with the new bound: results

Problem              PAC-SVM           Prior-PAC-Bayes   Ten Fold XVal
Wdbc                 0.070 ± 0.024     0.070 ± 0.024     0.067 ± 0.024
Waveform             0.090 ± 0.008     0.091 ± 0.008     0.086 ± 0.008
Ringnorm             0.034 ± 0.003     0.024 ± 0.003     0.016 ± 0.003
Pima                 0.241 ± 0.031     0.236 ± 0.031     0.245 ± 0.040
Landsat              0.011 ± 0.002     0.007 ± 0.002     0.005 ± 0.002
Handwritten-digits   0.015 ± 0.002     0.016 ± 0.002     0.007 ± 0.002
Spam                 0.090 ± 0.009     0.088 ± 0.009     0.063 ± 0.008
Average              0.079             0.076             0.070

Test error achieved by the three settings.


Model selection with p-SVM

Problem       PAC-SVM          Prior-PAC-SVM    PriorSVM          η-PriorSVM
Wdbc          0.070 ± 0.024    0.070 ± 0.024    0.068 ± 0.0236    0.073 ± 0.023
Waveform      0.090 ± 0.008    0.091 ± 0.008    0.085 ± 0.0188    0.085 ± 0.007
Ringnorm      0.034 ± 0.003    0.024 ± 0.003    0.014 ± 0.0077    0.015 ± 0.003
Pima          0.241 ± 0.031    0.236 ± 0.031    0.237 ± 0.0323    0.242 ± 0.033
Landsat       0.011 ± 0.002    0.007 ± 0.002    0.006 ± 0.0019    0.006 ± 0.002
Hand-digits   0.015 ± 0.002    0.016 ± 0.002    0.011 ± 0.0028    0.011 ± 0.003
Spam          0.090 ± 0.009    0.088 ± 0.009    0.075 ± 0.0093    0.080 ± 0.009
Average       0.079            0.076            0.071             0.073


Tightness of the bound with p-SVM

Problem       PAC-SVM          Prior-PAC-SVM    PriorSVM          η-PriorSVM
Wdbc          0.346 ± 0.006    0.284 ± 0.021    0.308 ± 0.0252    0.271 ± 0.027
Waveform      0.197 ± 0.002    0.143 ± 0.005    0.156 ± 0.0054    0.136 ± 0.006
Ringnorm      0.211 ± 0.001    0.093 ± 0.004    0.054 ± 0.0038    0.049 ± 0.003
Pima          0.399 ± 0.007    0.374 ± 0.020    0.418 ± 0.0182    0.391 ± 0.021
Landsat       0.035 ± 0.001    0.023 ± 0.002    0.027 ± 0.0032    0.022 ± 0.002
Hand-digits   0.159 ± 0.001    0.084 ± 0.003    0.046 ± 0.0045    0.042 ± 0.004
Spam          0.243 ± 0.002    0.161 ± 0.006    0.171 ± 0.0065    0.145 ± 0.007
Average       0.227            0.166            0.169             0.151


Maximum entropy learning

Consider the function class, for X a subset of the ℓ∞ unit ball,

    F = { f_w : x ∈ X ↦ sgn( Σ_{i=1}^N w_i x_i ) : ‖w‖_1 ≤ 1 }.

We want a posterior distribution Q(w) such that we can bound

    P_{(x,y)∼D}( f_w(x) ≠ y ) ≤ 2 e_Q(w) (= 2 Q_D(w)) = 2 E_{(x,y)∼D, q∼Q(w)}[ I[q(x) ≠ y] ].

Given a training sample S = {(x_1, y_1), …, (x_m, y_m)}, we similarly define the empirical error

    ê_Q(w) (= Q_S(w)) = (1/m) Σ_{i=1}^m E_{q∼Q(w)}[ I[q(x_i) ≠ y_i] ].


Posterior distribution Q(w)

The classifier q involves a random weight vector W ∈ R^N plus a random threshold Θ:

    q_{W,Θ}(x) = sgn( ⟨W, x⟩ − Θ ).

The distribution Q(w) of W will be discrete with

    W = sgn(w_i) e_i  with probability |w_i|,  i = 1, …, N,

where e_i is the i-th unit vector. The distribution of Θ is uniform on the interval [−1, 1].
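The stochastic classifier above is easy to sample. The following sketch (my own illustration, not from the slides; the function name is invented, and it assumes ‖w‖_1 = 1 so the |w_i| form a probability vector) draws one q_{W,Θ} and evaluates it:

```python
import numpy as np

def sample_q(w, x, rng):
    """Draw one classifier q_{W,Theta} from Q(w) and evaluate it at x.

    Assumes ||w||_1 = 1, so that the |w_i| form a probability vector.
    """
    p = np.abs(w)
    i = rng.choice(len(w), p=p)       # pick coordinate i with probability |w_i|
    theta = rng.uniform(-1.0, 1.0)    # Theta uniform on [-1, 1]
    # W = sgn(w_i) e_i, so <W, x> = sgn(w_i) x_i
    return 1 if np.sign(w[i]) * x[i] - theta > 0 else -1

rng = np.random.default_rng(0)
w = np.array([0.6, -0.4])
x = np.array([0.3, 0.8])
label = sample_q(w, x, rng)   # +1 or -1
```

Each call draws a fresh (W, Θ) pair, so repeated calls on the same x produce random labels with the error probability computed on the next slides.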


Error expression

Proposition. With the above definitions, we have for w satisfying ‖w‖_1 = 1, that for any (x, y) ∈ X × {−1, +1},

    P_{q∼Q(w)}( q(x) ≠ y ) = 0.5 (1 − y⟨w, x⟩).


Error expression proof

Proof.

    P_{q∼Q(w)}( q(x) ≠ y ) = Σ_{i=1}^N |w_i| P_Θ( sgn( sgn(w_i)⟨e_i, x⟩ − Θ ) ≠ y )
                           = Σ_{i=1}^N |w_i| P_Θ( sgn( sgn(w_i) x_i − Θ ) ≠ y )
                           = 0.5 Σ_{i=1}^N |w_i| (1 − y sgn(w_i) x_i)
                           = 0.5 (1 − y⟨w, x⟩).
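The proposition is easy to check by Monte Carlo. The sketch below (my own; w, x and y are arbitrary illustrative values) samples many classifiers from Q(w) and compares the empirical error rate with the closed form:

```python
import numpy as np

rng = np.random.default_rng(1)
w = np.array([0.6, -0.4])     # ||w||_1 = 1
x = np.array([0.3, 0.8])      # lies in the l_inf unit ball
y = 1

# Sample n classifiers from Q(w): coordinate i w.p. |w_i|, threshold ~ U[-1, 1]
n = 200_000
i = rng.choice(len(w), size=n, p=np.abs(w))
theta = rng.uniform(-1.0, 1.0, size=n)
q = np.where(np.sign(w[i]) * x[i] - theta > 0, 1, -1)

mc = np.mean(q != y)                    # Monte Carlo estimate
exact = 0.5 * (1 - y * np.dot(w, x))    # 0.5 * (1 - y<w, x>) = 0.57 here
```

The two quantities agree up to O(1/√n) sampling noise.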


Generalisation error

Corollary.

    P_{(x,y)∼D}( f_w(x) ≠ y ) ≤ 2 e_Q(w).

Proof. The deterministic classifier errs exactly when the stochastic one errs with probability at least a half:

    P_{q∼Q(w)}( q(x) ≠ y ) ≥ 0.5  ⇔  f_w(x) ≠ y,

so the true error of f_w is at most twice the expected stochastic error.


Base result

Theorem. With probability at least 1 − δ over the draw of training sets of size m,

    KL( ê_Q(w) ‖ e_Q(w) ) ≤ ( Σ_{i=1}^N |w_i| ln|w_i| + ln(2N) + ln((m + 1)/δ) ) / m.

Proof. Use a prior P uniform on the 2N signed unit vectors ±e_i. With the posterior described above, KL(Q(w)‖P) equals ln(2N) minus the entropy of w.
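In code, the complexity term on the right-hand side of the theorem is a one-liner (a sketch of mine; the function name is invented):

```python
import numpy as np

def pac_bayes_rhs(w, m, delta):
    """Complexity term of the base result:
    ( sum_i |w_i| ln|w_i| + ln(2N) + ln((m+1)/delta) ) / m.

    The sum is minus the entropy of the |w_i|, so spreading weight over
    many coordinates (higher entropy) tightens the bound.
    """
    a = np.abs(w)
    a = a[a > 0]                          # 0 * ln 0 = 0 by convention
    neg_entropy = float(np.sum(a * np.log(a)))
    N = len(w)
    return (neg_entropy + np.log(2 * N) + np.log((m + 1) / delta)) / m

# Uniform weights over N = 10 coordinates, m = 1000 samples, delta = 0.05:
rhs = pac_bayes_rhs(np.full(10, 0.1), 1000, 0.05)
```

For uniform w the entropy term cancels ln N, leaving (ln 2 + ln((m + 1)/δ))/m, which illustrates why maximum entropy minimises the bound.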


Interpretation

Suggests maximising the entropy as a means of minimising the bound.
Problem: the empirical error ê_Q(w) is too large:

    ê_Q(w) = (1/m) Σ_{i=1}^m 0.5 (1 − y_i⟨w, x_i⟩)

A function of the margin – but just a linear function.


Boosting the bound

The trick to boost the power of the bound is to take T independent samples (W_t, Θ_t) from the distribution Q(w) and vote for the classification:

    q_{W,Θ}(x) = sgn( Σ_{t=1}^T sgn( ⟨W_t, x⟩ − Θ_t ) ).

Now the empirical error becomes

    ê_Q(w) = (0.5^T / m) Σ_{i=1}^m Σ_{t=0}^{⌊T/2⌋} C(T, t) (1 + y_i⟨w, x_i⟩)^t (1 − y_i⟨w, x_i⟩)^{T−t},

where C(T, t) denotes the binomial coefficient, giving a sigmoid-like loss as a function of the margin.
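The inner binomial sum is just the probability that a majority of T independent draws from Q(w) err, as a function of the margin. A small sketch (my own; names invented, and it assumes T is odd so ties cannot occur):

```python
from math import comb

def voted_loss(margin, T):
    """Probability that the T-sample majority vote errs, for margin = y<w, x>.

    Each draw errs independently with p = 0.5 * (1 - margin); the sum
    0.5**T * sum_{t <= T//2} C(T,t) (1+margin)**t (1-margin)**(T-t)
    is the chance that at most floor(T/2) of the T draws are correct.
    """
    return 0.5**T * sum(comb(T, t) * (1 + margin)**t * (1 - margin)**(T - t)
                        for t in range(T // 2 + 1))

# The loss steepens around margin 0 as T grows (sigmoid-like):
losses = [voted_loss(0.3, T) for T in (1, 11, 41)]
```

At zero margin the voted loss stays at 0.5 for every odd T, while for positive margins it decays with T, which is exactly the sigmoid sharpening described above.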


Full result

Theorem. With probability at least 1 − δ over the draw of training sets of size m,

    P_{(x,y)∼D}( f_w(x) ≠ y ) ≤ 2 KL⁻¹( ê_{Q^T}(w), ( T Σ_{i=1}^N |w_i| ln|w_i| + T ln(2N) + ln((m + 1)/δ) ) / m ).

Note the penalty factor of T applied to the KL term.
It behaves like the (inverse) margin in the usual bounds.
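Here KL⁻¹(q, ε) denotes the largest p with kl(q‖p) ≤ ε, where kl is the divergence between Bernoulli distributions. It has no closed form, but kl is increasing in p above q, so bisection suffices (a sketch of mine; names invented):

```python
import math

def kl_bernoulli(q, p):
    """kl(q || p) between Bernoulli(q) and Bernoulli(p), for 0 < q, p < 1."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inv(q, eps, tol=1e-12):
    """Largest p in [q, 1) with kl(q || p) <= eps, by bisection."""
    lo, hi = q, 1.0 - 1e-15
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(q, mid) <= eps:
            lo = mid
        else:
            hi = mid
    return lo

p = kl_inv(0.1, 0.05)   # an upper bound on the true risk given empirical 0.1
```

With ε = 0 the inversion returns q itself; as the complexity term grows, the returned risk bound moves away from the empirical value.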


Algorithmics

The bound motivates the optimisation:

    min_{w,ρ,ξ}  Σ_{j=1}^N |w_j| ln|w_j| − Cρ + D Σ_{i=1}^m ξ_i

    subject to:  y_i⟨w, x_i⟩ ≥ ρ − ξ_i,  ξ_i ≥ 0,  1 ≤ i ≤ m,
                 ‖w‖_1 ≤ 1.

This follows the SVM route of approximating the sigmoid-like loss by the (convex) hinge loss.
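For a candidate (w, ρ) the slacks can be eliminated, ξ_i = max(0, ρ − y_i⟨w, x_i⟩), giving a direct evaluation of the objective. The sketch below is my own illustration (not the slides' solver; names invented):

```python
import numpy as np

def primal_objective(w, rho, X, y, C, D):
    """Objective sum_j |w_j| ln|w_j| - C*rho + D*sum_i xi_i, with the
    slacks eliminated via xi_i = max(0, rho - y_i <w, x_i>).
    Assumes ||w||_1 <= 1 (checked rather than enforced)."""
    assert np.abs(w).sum() <= 1 + 1e-9
    a = np.abs(w)
    a = a[a > 0]                               # 0 * ln 0 = 0 by convention
    neg_entropy = float(np.sum(a * np.log(a)))
    xi = np.maximum(0.0, rho - y * (X @ w))    # hinge slacks
    return neg_entropy - C * rho + D * xi.sum()

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
obj = primal_objective(np.array([0.5, -0.5]), 0.2, X, y, C=1.0, D=1.0)
```

On this toy data both margins are 0.5 ≥ ρ, so the slacks vanish and the objective is −ln 2 − Cρ.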


Dual optimisation

    max_α  L = − Σ_{j=1}^N exp( | Σ_{i=1}^m α_i y_i x_{ij} | − 1 − λ ) − λ

    subject to:  Σ_{i=1}^m α_i = C,  0 ≤ α_i ≤ D,  1 ≤ i ≤ m.

Similar to the SVM but with an exponential function
Surprisingly also gives dual sparsity
Coordinate-wise descent works very well (cf. the SMO algorithm)


Results: effect of varying T

[Figure: bound value on Ionosphere plotted against T, for T from 0 to 40; the bound value axis runs from roughly 0.9 to 1.15.]


Results

Bound and test errors:

Data        Bound  Error  SVM error
Ionosphere  0.63   0.28   0.24
Votes       0.78   0.35   0.35
Glass       0.69   0.46   0.47
Haberman    0.64   0.25   0.26
Credit      0.60   0.25   0.28


Gaussian Process Regression

A GP is a distribution over real-valued functions that is multivariate Gaussian when restricted to any finite subset of inputs
It is characterised by a kernel that specifies the covariance function when marginalising on any finite subset
If we have a finite set of input/output observations generated with additive Gaussian noise on the outputs, the posterior is also a Gaussian process
The KL divergence between prior and posterior can be computed as (where K = RR′ is the Cholesky decomposition of K):

    2 KL(Q‖P) = log det( I + (1/σ²) K ) − tr( (σ²I + K)⁻¹ K ) + ‖ R′(K + σ²I)⁻¹ y ‖²
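The expression above is straightforward to evaluate with standard linear algebra. A numpy sketch (my own code, not from the slides; it assumes a scalar noise variance σ² and uses the lower-triangular Cholesky factor K = RRᵀ):

```python
import numpy as np

def gp_kl(K, y, sigma2):
    """KL(Q||P) between GP posterior and prior for regression, per
    2KL = logdet(I + K/s2) - tr((s2 I + K)^-1 K) + ||R'(K + s2 I)^-1 y||^2,
    where K = R R', so the last term equals v' K v for v = (K + s2 I)^-1 y."""
    m = len(y)
    R = np.linalg.cholesky(K + 1e-12 * np.eye(m))   # jitter for stability
    A = sigma2 * np.eye(m) + K
    _, logdet = np.linalg.slogdet(np.eye(m) + K / sigma2)
    trace = np.trace(np.linalg.solve(A, K))
    v = np.linalg.solve(A, y)
    quad = float(np.sum((R.T @ v) ** 2))            # = ||R' v||^2 = v' K v
    return 0.5 * (logdet - trace + quad)

# 1-D sanity case: K = [[1]], y = [2], sigma2 = 1
kl = gp_kl(np.array([[1.0]]), np.array([2.0]), 1.0)
```

In the 1-D case the posterior is N(1, 1/2) against a N(0, 1) prior, for which the Gaussian KL formula gives 0.5(ln 2 + 0.5), matching the computation.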


Applying PAC-Bayes theorem

This suggests we can use the PAC-Bayes theorem if we can create appropriate classifiers indexed by real-valued functions
Consider, for some ε > 0, the classifiers:

    h^ε_f(x, y) = 1, if |y − f(x)| ≤ ε;  0, otherwise.

We can compute the expected value of h^ε_f under the posterior, with m(x) and v(x) the posterior mean and variance at x:

    E_{f∼Q}[ h^ε_f(x, y) ] = (1/2) erf( (y + ε − m(x)) / √(2 v(x)) ) − (1/2) erf( (y − ε − m(x)) / √(2 v(x)) )
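Since f(x) is Gaussian under the posterior, this expectation is just the probability mass of N(m(x), v(x)) in the interval [y − ε, y + ε]. A direct transcription (a sketch; the function name is mine):

```python
from math import erf, sqrt

def expected_h(y, mean, var, eps):
    """E_{f~Q}[h^eps_f(x, y)] = P(|y - f(x)| <= eps) for f(x) ~ N(mean, var)."""
    return 0.5 * erf((y + eps - mean) / sqrt(2.0 * var)) \
         - 0.5 * erf((y - eps - mean) / sqrt(2.0 * var))

p = expected_h(y=0.0, mean=0.0, var=1.0, eps=1.0)   # one-sigma mass, ~0.6827
```

As ε grows the expectation tends to 1, and at ε = 0 it vanishes, matching the indicator definition.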


GP Result

Furthermore, we can lower bound the expected value of a point (x, y) under the posterior distribution by

    2ε N( y | m(x), v(x) ) ≥ E_{f∼Q}[ h^ε_f(x, y) ] − ε² / ( v(x) √(2eπ) ),

enabling an application of the PAC-Bayes Theorem to give:

    E[ N( y | m(x), v(x) ) + ε / ( 2 v(x) √(2eπ) ) ] ≥ (1/(2ε)) KL⁻¹( Ê(ε), ( D + ln((m + 1)/δ) ) / m )

where Ê(ε) is the empirical average of E_{f∼Q}[ h^ε_f(x, y) ] and D is the KL between prior and posterior.


GP Experimental Results

The robot arm problem (R): 150 training points and 51 test points.
The Boston housing problem (H): 455 training points and 51 test points.
The forest fire problem (F): 450 training points and 67 test points.

Data  σ       KL⁻¹     e_test   KL⁻¹ (varGP)   e_test
R     0.0494  0.8903   0.4782   0.8419
H     0.1924  0.8699   0.4645   0.7155         0.8401  0.9416
F     1.0129  0.5694   0.4557   0.5533


GP Experimental Results

We can also plot the test accuracy and bound as a function of ε:

[Figure: Gaussian noise: plot of E_{(x,y)∼D}[1 − α(x)] against ε for varying noise level η; panels (a) η = 1, (b) η = 3, (c) η = 5.]


GP Experimental Results

With Laplace noise:

[Figure: Laplace noise: plot of E_{(x,y)∼D}[1 − α(x)] against ε for varying η; panels (a) η = 1, (b) η = 3, (c) η = 5.]


GP Experimental Results

Robot arm problem and Boston housing:

[Figure: confidence levels; panels (a) Robot arm, (b) Boston housing.]


Stochastic Differential Equation Models

Consider modelling a time-varying process with a (non-linear) stochastic differential equation:

    dx = f(x, t) dt + √Σ dW

f(x, t) is a non-linear drift term and dW is a Wiener process
This is the limit of the discrete-time equation:

    ∆x_k ≡ x_{k+1} − x_k = f(x_k) ∆t + √(∆t Σ) ε_k,

where ε_k is zero-mean, unit-variance Gaussian noise.
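The discrete-time equation is the Euler–Maruyama scheme. A minimal simulator (my own sketch, not from the slides; it takes Σ diagonal so that its square root can be applied elementwise):

```python
import numpy as np

def euler_maruyama(f, x0, sqrt_sigma, dt, n_steps, rng):
    """Simulate x_{k+1} = x_k + f(x_k) dt + sqrt(dt) * sqrt_sigma * eps_k,
    with eps_k standard Gaussian; sqrt_sigma is the (diagonal) root of Sigma."""
    x = np.empty((n_steps + 1, len(x0)))
    x[0] = x0
    for k in range(n_steps):
        eps = rng.standard_normal(len(x0))
        x[k + 1] = x[k] + f(x[k]) * dt + np.sqrt(dt) * sqrt_sigma * eps
    return x

rng = np.random.default_rng(2)
# A linear (Ornstein-Uhlenbeck-like) drift f(x) = -x as a toy example:
path = euler_maruyama(lambda x: -x, np.array([1.0]), 0.1,
                      dt=0.01, n_steps=100, rng=rng)
```

With the noise switched off the recursion reduces to the deterministic Euler step x_{k+1} = x_k(1 − ∆t), which gives a simple correctness check.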


Variational approximation

We use the Bayesian approach to data modelling, with a noise model given by

p(y_n | x(t_n)) = N(y_n | H x(t_n), R).

We consider a variational approximation of the posterior using a time-varying linear SDE:

dx = f_L(x, t) dt + √Σ dW,

where f_L(x, t) = −A(t) x + b(t).


Girsanov change of measure

The measure for the drift f is denoted by P and the one for the drift f_L by Q.
The KL divergence in this infinite-dimensional setting is given by the Radon–Nikodym derivative of Q with respect to P:

KL[Q‖P] = ∫ dQ ln(dQ/dP) = E_Q[ln(dQ/dP)],

which can be computed as

dQ/dP = exp{ −∫_{t_0}^{t_f} (f − f_L)ᵀ Σ^{−1/2} dW_t + (1/2) ∫_{t_0}^{t_f} (f − f_L)ᵀ Σ^{−1} (f − f_L) dt },

where W is a Wiener process with respect to Q.


KL divergence

Hence, the KL divergence is

KL[Q‖P] = (1/2) ∫_{t_0}^{t_f} ⟨ (f(x(t), t) − f_L(x(t), t))ᵀ Σ^{−1} (f(x(t), t) − f_L(x(t), t)) ⟩_{q_t} dt,

where ⟨ · ⟩_{q_t} denotes the expectation with respect to the marginal density at time t of the measure Q.
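The time-t integrand of this KL divergence can be estimated by Monte Carlo: sample from the Gaussian marginal q_t and average the weighted squared drift discrepancy. A sketch, where the particular nonlinear and linear drifts are illustrative stand-ins:

```python
import numpy as np

def kl_integrand(f, fL, m_t, S_t, Sigma_inv, t, n_samples, rng):
    """Monte Carlo estimate of
    (1/2) < (f(x,t) - fL(x,t))^T Sigma^{-1} (f(x,t) - fL(x,t)) >_{q_t},
    where q_t = N(m(t), S(t))."""
    xs = rng.multivariate_normal(m_t, S_t, size=n_samples)
    diffs = np.array([f(x, t) - fL(x, t) for x in xs])
    quad = np.einsum('ni,ij,nj->n', diffs, Sigma_inv, diffs)
    return 0.5 * quad.mean()

# Illustrative drifts: f nonlinear, fL(x, t) = -A x + b linear.
f = lambda x, t: 4 * x * (1 - x @ x)
A, b = np.eye(2), np.zeros(2)
fL = lambda x, t: -A @ x + b
rng = np.random.default_rng(1)
e = kl_integrand(f, fL, np.zeros(2), 0.1 * np.eye(2), np.eye(2), 0.0, 5000, rng)
print(e >= 0.0)  # True: a mean of non-negative quadratic forms
```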


Variational approximation

Since the approximating SDE is linear, the marginal distribution q_t is Gaussian,

q_t(x) = N(x | m(t), S(t)),

with the mean m(t) and covariance S(t) described by ordinary differential equations (ODEs):

dm/dt = −A m + b,
dS/dt = −A S − S Aᵀ + Σ.
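These moment ODEs can be integrated with a simple forward-Euler scheme; a minimal sketch for constant A and b (the numerical values are illustrative, not from the talk):

```python
import numpy as np

def propagate_moments(A, b, Sigma, m0, S0, dt, n_steps):
    """Euler-integrate dm/dt = -A m + b and dS/dt = -A S - S A^T + Sigma,
    the moment ODEs of the linear SDE dx = (-A x + b) dt + sqrt(Sigma) dW."""
    m, S = m0.copy(), S0.copy()
    for _ in range(n_steps):
        m = m + dt * (-A @ m + b)
        S = S + dt * (-A @ S - S @ A.T + Sigma)
    return m, S

A = np.array([[1.0, 0.0], [0.0, 2.0]])
b = np.array([1.0, 2.0])
Sigma = 0.1 * np.eye(2)
m, S = propagate_moments(A, b, Sigma, np.zeros(2), np.eye(2), 1e-3, 20000)
# For constant A, b the stationary mean solves A m = b, i.e. m -> (1, 1),
# and the stationary covariance solves A S + S A^T = Sigma.
print(np.round(m, 2))
```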


Algorithmics

Using Lagrangian methods we can derive an algorithm that finds the variational approximation by minimising the KL divergence between the posterior and the approximating distribution.
But the KL divergence also appears in the PAC-Bayes bound – is it possible to define an appropriate loss over paths ω that captures the properties of interest?


Error estimation

For ω : [0, T] → R^D defining a trajectory ω(t) ∈ R^D, we define the classifier h_ω by

h_ω(y, t) = 1 if ‖y − Hω(t)‖ ≤ ε; 0 otherwise,

where the actual observations are linear functions of the state variable, given by the operator H.
The prior and posterior distributions over functions are inherited from the distributions P and Q over paths ω.
Hence P = p_sde and Q = q, defined by the linear approximating SDE.
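The path classifier can be written down directly; a minimal sketch, where the observation operator H and the sampled path value are illustrative placeholders:

```python
import numpy as np

def h_omega(omega_t, H, y, eps):
    """Classifier h_omega(y, t): 1 if observation y lies within eps of the
    projected state H omega(t), else 0."""
    return 1 if np.linalg.norm(y - H @ omega_t) <= eps else 0

H = np.array([[1.0, 0.0]])          # observe the first state coordinate
omega_t = np.array([0.5, -1.0])     # state of one sampled path at time t
print(h_omega(omega_t, H, y=np.array([0.45]), eps=0.1))  # 1: inside the ball
print(h_omega(omega_t, H, y=np.array([0.8]),  eps=0.1))  # 0: outside
```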


Generalisation analysis

For the PAC-Bayes analysis we must compute KL(Q‖P), e_Q and ê_Q. We have, as above,

KL(Q‖P) = ∫ dq ln(dq/dp_sde).

If we now consider a fixed sample (y, t) we can estimate

E_{ω∼Q}[h_ω(y, t)] = ∫ I[‖Hx − y‖ ≤ ε] dq_t(x).

For sufficiently small values of ε we can approximate this by

E_{ω∼Q}[h_ω(y, t)] ≈ V_d ε^d N(y | Hm(t), H S(t) Hᵀ)
= (V_d ε^d / ((2π)^{d/2} |H S(t) Hᵀ|^{1/2})) exp( −(1/2) (y − Hm(t))ᵀ (H S(t) Hᵀ)^{−1} (y − Hm(t)) ),

where V_d is the volume of the unit ball in R^d.
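The small-ball approximation can be checked numerically: the probability that a Gaussian sample lands in an ε-ball around y is close to V_d ε^d times the density at y. A sketch with illustrative stand-ins for Hm(t) and H S(t) Hᵀ:

```python
import numpy as np
from math import pi, gamma

def unit_ball_volume(d):
    """V_d = pi^{d/2} / Gamma(d/2 + 1), the volume of the unit ball in R^d."""
    return pi ** (d / 2) / gamma(d / 2 + 1)

rng = np.random.default_rng(2)
mean, cov = np.zeros(2), 0.2 * np.eye(2)   # stand-ins for Hm(t), H S(t) H^T
y, eps, d = np.array([0.1, -0.1]), 0.05, 2

# Monte Carlo estimate of  ∫ I[||x - y|| <= eps] dq_t(x),  q_t = N(mean, cov).
xs = rng.multivariate_normal(mean, cov, size=200_000)
mc = np.mean(np.linalg.norm(xs - y, axis=1) <= eps)

# Small-ball approximation: V_d eps^d times the Gaussian density at y.
dens = np.exp(-0.5 * (y - mean) @ np.linalg.inv(cov) @ (y - mean)) \
       / ((2 * pi) ** (d / 2) * np.sqrt(np.linalg.det(cov)))
approx = unit_ball_volume(d) * eps ** d * dens
print(mc, approx)  # the two agree to within Monte Carlo noise
```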


Error estimates

Note that e_Q is simply

e_Q = E_{(y,t)∼μ} E_{ω∼Q}[h_ω(y, t)] ∝ ∫ N(y | Hm(t), H S(t) Hᵀ) dμ(y, t),

while ê_Q is the empirical average of this quantity.
A tension arises in setting ε: if it is large, the approximation is inaccurate.
If e_Q and ê_Q are both small, the bound implied by KL(ê_Q‖e_Q) ≤ C becomes weak.


Refining the distributions

We overcome this weakness by taking K-fold product distributions and defining h_{(ω_1,…,ω_K)} as

h_{(ω_1,…,ω_K)}(y, t) = 1 if there exists 1 ≤ i ≤ K such that ‖y − Hω_i(t)‖ ≤ ε; 0 otherwise.

We now have

E_{(ω_1,…,ω_K)∼Q^K}[h_{(ω_1,…,ω_K)}(y, t)] ≈ 1 − (1 − ∫ I[‖Hx − y‖ ≤ ε] dq_t(x))^K
≈ K V_d ε^d N(y | Hm(t), H S(t) Hᵀ).
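The last step is the standard union-style expansion 1 − (1 − p)^K ≈ Kp for a small per-path hit probability p; a quick numerical check:

```python
p = 1e-4   # small per-path hit probability, e.g. V_d eps^d N(y | Hm, H S H^T)
K = 10
exact = 1 - (1 - p) ** K
approx = K * p
rel_err = abs(exact - approx) / exact
print(exact, approx, rel_err)  # relative error is of order (K - 1) p / 2
```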


Final result

Putting it all together gives the final bound:

E_{(y,t)∼μ}[N(y | Hm(t), H S(t) Hᵀ)] ≥ (1/(K V_d ε^d)) KL⁻¹( K V_d ε^d Ê[N(y | Hm(t), H S(t) Hᵀ)], (K ∫_0^T E_sde(t) dt + ln((m + 1)/δ)) / m ),

where Ê denotes the empirical average over the m observations and

E_sde(t) = (1/2) ⟨ (f(x) − f_L(x, t))ᵀ Σ^{−1} (f(x) − f_L(x, t)) ⟩_{q_t}.
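The inverse KL⁻¹ here is built from the binary KL divergence kl(q‖p) = q ln(q/p) + (1 − q) ln((1 − q)/(1 − p)); since the bound lower-bounds an expectation, the relevant inverse is the smallest p consistent with the divergence budget. A bisection sketch of this standard construction (the lower-inverse reading is our assumption):

```python
import math

def binary_kl(q, p):
    """kl(q || p) for Bernoulli parameters q, p in (0, 1)."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse_lower(q, C, tol=1e-12):
    """Smallest p <= q with kl(q || p) <= C, found by bisection
    (kl(q || p) is decreasing in p on (0, q])."""
    lo, hi = 1e-15, q
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_kl(q, mid) > C:
            lo = mid   # divergence still too large: move right
        else:
            hi = mid   # within budget: tighten from above
    return hi

p = kl_inverse_lower(0.2, 0.05)
print(0.0 < p < 0.2, binary_kl(0.2, p) <= 0.05 + 1e-6)  # True True
```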


Small scale experiment

We applied the analysis to the results of performing a variational Bayesian approximation to the Lorenz attractor in three dimensions. The quality of the fit with 49 examples was good.

[Figure: 3D trajectory of the variational fit to the Lorenz attractor.]


Small scale experiment

We chose V_d ε^d to optimise the bound – a fairly small ball, implying that our approximation should be reasonable.
We compared the bound with the left-hand side estimated on a random draw of 99 test points. The corresponding values are:

m    dt     ê_Q    A      e_Q    KL⁻¹(·,·)/V
49   0.005  0.137  3.536  0.128  0.004


Conclusions

Overview of the theory and main result
Application to bound the performance of an SVM
Experiments show the new bound can be tighter ...
... and reliable for low-cost model selection
Extended to maximum entropy classification
Also considered lower bounding the accuracy of a posterior distribution for Gaussian processes (GPs)
Applied the theory to bound the performance of estimates made using approximate Bayesian inference for dynamical systems:

Prior determined by a non-linear stochastic differential equation (SDE)
Variational approximation results in a posterior given by an approximating linear SDE – hence a Gaussian process posterior.
