Some Aspects of Bayesian Approach to Model Selection
Vetrov Dmitry
Dorodnicyn Computing Centre of RAS, Moscow
Our research team
My colleague: Dmitry Kropotov, PhD student of MSU
Students:
Nikita Ptashko
Pavel Tolpegin
Igor Tolstov
Overview
Problem formulation
Ways of solution
Bayesian paradigm
Bayesian regularization of kernel classifiers
Quality vs. Reliability
A general problem: what means should we use for solving a task? Something sophisticated and complex, but accurate, or something simple but reliable?
A trade-off between quality and reliability is needed.
Machine learning interpretation
Regularization
The easiest way to establish a compromise is to regularize the criterion function with some heuristic regularizer:
$J_\lambda(w) = J(w) + \lambda R(w)$
The general problem is HOW to express accuracy and reliability in the same terms. In other words, how do we define the regularization coefficient $\lambda$?
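As a minimal illustration (a hypothetical squared-error fit term and a quadratic regularizer; neither choice comes from the slides), the regularized criterion could be computed as:

```python
import numpy as np

def regularized_criterion(w, X, y, lam):
    """J_lambda(w) = J(w) + lambda * R(w): a squared-error accuracy term
    plus a quadratic (ridge-style) reliability term, as one possible choice."""
    fit = np.sum((X @ w - y) ** 2)   # accuracy term J(w)
    penalty = np.sum(w ** 2)         # reliability term R(w)
    return fit + lam * penalty

# The open question on the slide remains: how should lam itself be chosen?
```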
General ways of compromise I
Structural Risk Minimization (SRM) penalizes the flexibility of classifiers, expressed through the VC-dimension of the given classifier family:
$P_{test} \le P_{train} + \Phi(VC\dim)$
Drawback: the VC-dimension is very difficult to compute and its estimates are too rough. The upper bound on the test error is too loose and often exceeds 1.
General ways of compromise II
Minimal Description Length (MDL) penalizes the algorithmic complexity of a classifier. The classifier is considered as a coding algorithm: we encode both the training data and the algorithm itself, trying to minimize the total description length:
$l_{description} = l_{encoded\,data} + l_{coder}$
Important aspect
All the described schemes penalize the flexibility or complexity of the classifier, but is it what we really need?...
"Complex classifier does not always mean bad classifier."
Ludmila Kuncheva, private communication
Maximal likelihood principle
The well-known maximum likelihood principle states that we should select the classifier with the largest likelihood (i.e. accuracy on the training sample):
$w_{ML} = \arg\max_w P(D_{train} | w)$
$P(D_{test} | D_{train}) \approx P(D_{test} | w_{ML})$
Bayesian view
$P(D_{test} | D_{train}) = \int P(D_{test} | w) P(w | D_{train}) \, dw$
$P(w | D_{train}) = \dfrac{P(D_{train} | w) \, P(w)}{\int P(D_{train} | w) P(w) \, dw} = \dfrac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}$
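To make the decomposition concrete, here is a hypothetical one-dimensional example (not from the slides): the posterior over a single weight and the evidence are obtained by numerical integration over a grid of $w$ values.

```python
import numpy as np

# Hypothetical 1-D model: the training data are noisy observations of a
# single weight w, with prior P(w) = N(0, 1) and unit observation noise.
rng = np.random.default_rng(0)
D_train = rng.normal(loc=1.5, scale=1.0, size=20)

w_grid = np.linspace(-5.0, 5.0, 2001)
dw = w_grid[1] - w_grid[0]

# Likelihood P(D_train | w): independent Gaussian observations around w
log_lik = np.sum(
    -0.5 * (D_train[:, None] - w_grid) ** 2 - 0.5 * np.log(2 * np.pi), axis=0
)
log_prior = -0.5 * w_grid ** 2 - 0.5 * np.log(2 * np.pi)   # P(w) = N(0, 1)

joint = np.exp(log_lik + log_prior)          # Likelihood * Prior on the grid
evidence = joint.sum() * dw                  # Evidence: integral over w
posterior = joint / evidence                 # P(w | D_train)
```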
Model Selection
Suppose we have different classifier families $\Omega(\alpha_1), \ldots, \Omega(\alpha_p)$ and want to know which family is better without performing computationally expensive cross-validation techniques.
This problem is also known as the model selection task.
Bayesian framework I
Find the best model, i.e. the optimal value of the hyperparameter:
$\alpha_{MP} = \arg\max_\alpha P(\alpha | D_{train})$
If all models are equally likely, then
$P(\alpha | D_{train}) \propto P(D_{train} | \alpha) = \int P(D_{train} | w, \alpha) P(w | \alpha) \, dw$
Note that it is exactly the evidence which should be maximized to find the best model.
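Continuing the same hypothetical toy model, the hyperparameter can be taken to be the prior precision $\alpha$, and $\alpha_{MP}$ found by maximizing the evidence over a set of candidate values:

```python
import numpy as np

rng = np.random.default_rng(0)
D_train = rng.normal(loc=1.5, scale=1.0, size=20)
w_grid = np.linspace(-10.0, 10.0, 4001)
dw = w_grid[1] - w_grid[0]

def evidence(alpha):
    """P(D_train | alpha) = integral of P(D_train | w) P(w | alpha) dw,
    with prior P(w | alpha) = N(0, 1 / alpha)."""
    log_lik = np.sum(
        -0.5 * (D_train[:, None] - w_grid) ** 2 - 0.5 * np.log(2 * np.pi), axis=0
    )
    log_prior = 0.5 * np.log(alpha / (2 * np.pi)) - 0.5 * alpha * w_grid ** 2
    return np.sum(np.exp(log_lik + log_prior)) * dw

alphas = np.logspace(-3, 3, 25)                              # candidate models
alpha_mp = alphas[np.argmax([evidence(a) for a in alphas])]  # maximal evidence
```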
Bayesian framework II
Now compute the posterior parameter distribution...
$P(w | D_{train}, \alpha_{MP}) = \dfrac{P(D_{train} | w, \alpha_{MP}) \, P(w | \alpha_{MP})}{P(D_{train} | \alpha_{MP})}$
...and the final likelihood of the test data:
$P(D_{test} | D_{train}) = \int P(D_{test} | w) P(w | D_{train}, \alpha_{MP}) \, dw$
Why do we need model selection?
The answer is simple: many classifiers (e.g. neural networks or support vector machines) require some additional parameters to be set by the user before training starts.
IDEA: these parameters can be viewed as model hyperparameters, and the Bayesian framework can be applied to select their best values.
What is evidence
[Figure: likelihood $P(D_{train} | w)$ plotted against $w$ for two models.]
The red model has larger likelihood, but the green model has better evidence. It is more stable, and we may hope for better generalization.
Support vector machines
The separating surface is defined as a linear combination of kernel functions:
$f(x) = \sum_{i=1}^{N} w_i K(x, x_i) + b$
The weights are determined by solving a QP optimization problem (soft-margin maximization):
$\dfrac{1}{2}\|w\|^2 + C \sum_{k} \xi_k \to \min$
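A sketch of this decision rule (the weights $w_i$ and bias $b$ are assumed to come from an external QP solver, which is not shown; the Gaussian kernel is just one possible choice):

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """One common kernel choice: K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def decision_function(x, centres, w, b, kernel=rbf_kernel):
    """f(x) = sum_i w_i K(x, x_i) + b over the kernel centres x_i."""
    return sum(w_i * kernel(x, x_i) for w_i, x_i in zip(w, centres)) + b

# The predicted class is the sign of f(x):
# label = np.sign(decision_function(x_new, X_train, w, b))
```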
Bottlenecks of SVM
SVM has proved to be one of the best classifiers due to the use of the maximal-margin principle and the kernel trick, BUT...
How do we define the best kernel for a particular task and the regularization coefficient C?
Bad kernels may lead to very poor performance due to overfitting or undertraining.
Relevance Vector Machines
A probabilistic approach to kernel models. Weights are interpreted as random variables with a Gaussian prior distribution:
$w_i \sim N(0, \alpha_i^{-1})$
The maximal evidence principle is used to select the best values of $\alpha_i$. Most of them tend to infinity; hence the corresponding weights take zero values, which makes the classifier quite sparse.
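A compact sketch of an RVM-style evidence-maximization loop. For brevity it is written for the regression case with a fixed Gaussian noise precision beta (the classification case described on these slides additionally needs the Laplace step below); Phi is the kernel design matrix with Phi[n, i] = K(x_n, x_i), and alpha holds the per-weight prior precisions.

```python
import numpy as np

def rvm_fit(Phi, t, beta=25.0, n_iter=100, prune_at=1e6):
    """Evidence maximization for w_i ~ N(0, 1/alpha_i): alphas are re-estimated
    until most of them diverge, and the surviving columns give a sparse model."""
    N, M = Phi.shape
    alpha = np.ones(M)
    for _ in range(n_iter):
        A = np.diag(alpha)
        Sigma = np.linalg.inv(A + beta * Phi.T @ Phi)        # posterior covariance
        mu = beta * Sigma @ Phi.T @ t                        # posterior mean
        gamma = np.clip(1.0 - alpha * np.diag(Sigma), 1e-12, None)
        alpha = gamma / (mu ** 2 + 1e-12)                    # evidence-based update
        alpha = np.minimum(alpha, prune_at)                  # cap diverging alphas
    relevant = alpha < prune_at                              # surviving (relevance) vectors
    return mu, relevant
```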
Sparseness of RVM
[Figure: two panels comparing SVM (C = 10) and RVM.]
Numerical implementation of RVM
We use the Laplace approximation to avoid integration. Then the likelihood can be written as
$P(D_{train} | w) \approx N(w_{max}, H^{-1}),$
where
$H = -\nabla_w \nabla_w \log P(D_{train} | w).$
Then the evidence can be computed analytically, and iterative optimization of $\alpha_i$ becomes possible.
Evidence interpretation
Then the evidence is given by
$P(D_{train} | \alpha) \approx (2\pi)^{N/2} \, P(D_{train} | w_{max}) \, |H|^{-1/2}$
but...
This is exactly STABILITY with respect to weight changes! The larger the Hessian, the smaller the evidence.
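A minimal helper evaluating this approximation in log form (d is the number of weights, written as N on the slide; H is assumed positive definite at the maximum):

```python
import numpy as np

def log_evidence_laplace(log_lik_at_max, H):
    """log P(D_train | alpha) ~= (d/2) log(2 pi) + log P(D_train | w_max)
    - (1/2) log |H|, where H is the Hessian of -log P(D_train | w) at w_max."""
    d = H.shape[0]
    sign, logdet = np.linalg.slogdet(H)
    assert sign > 0, "H must be positive definite at a maximum"
    return 0.5 * d * np.log(2 * np.pi) + log_lik_at_max - 0.5 * logdet

# A sharper likelihood peak (larger |H|) lowers the evidence: the stable,
# flatter model wins even if its maximal likelihood is slightly smaller.
```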
Kernel selection
IDEA: use the same technique for kernel determination, e.g. for finding the best width of a Gaussian kernel:
$K(x, y) = \exp\left( -\dfrac{\|x - y\|^2}{2\sigma^2} \right)$
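A small sketch of how the width $\sigma$ reshapes the Gaussian kernel matrix, which is exactly what evidence-based width selection has to trade off:

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2 * sigma ** 2))

X = np.random.default_rng(1).normal(size=(5, 2))
print(gaussian_kernel_matrix(X, sigma=0.01))   # near identity: perfect fit, poor stability
print(gaussian_kernel_matrix(X, sigma=100.0))  # near all-ones: very stable, poor accuracy
```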
Sudden problem
It turned out that narrow Gaussians are more stable with respect to weight changes.
Solution
We allow the centres of the kernels to be located at arbitrary points (relevant points). The trade-off between narrow (high accuracy on the training set) and wide (stable answers) Gaussians can then finally be found.
The classifier we obtained turned out to be even sparser than RVM!
Sparseness of GRVM
[Figure: two panels comparing RVM and GRVM.]
Some experimental results
(LOO: parameters chosen by leave-one-out cross-validation; ME: by maximal evidence)

             Errors                          Kernels
             RVM LOO   SVM LOO   RVM ME      RVM LOO   SVM LOO   RVM ME
Australian   14.9      11.54     10.58       37        188       19
Bupa         25        26.92     21.15       6         179       7
Hepatitis    36.17     31.91     31.91       34        102       11
Pima         22.08     21.65     21.21       29        309       13
Credit       16.35     15.38     15.87       57        217       36
Future work
Develop quick optimization procedures
Optimize $\sigma$ and $\alpha_i$ simultaneously during evidence maximization
Use different widths for different features to get more sophisticated kernels
Apply this approach to polynomial kernels
Apply this approach to regression tasks
Thank you!
Contact information:
[email protected], [email protected]
http://vetrovd.narod.ru