
Neural Networks
Lecture 5
Regularized Learning Methods
Academic year 2013-2014
Simone Scardapane

Table of Contents
1. Regularization: Ridge Regression in Linear Models; Regularized ERM; Representer's Theorem
2. Regularized Learning: Support Vector Machines; Consistency; Kernel Ridge Regression
3. SRM and Bayes: Structural Risk Minimization; Bayesian Learning
4. Model Selection
5. References

Ridge Regression in Linear Models
Ill-Posed Problems
In mathematics, a problem is well-posed if the solution:
• Exists,
• Is unique,
• Is stable with respect to the data of the problem.
Under this definition, we see that the learning problem, as formulated under the ERM principle, is highly ill-posed. This is a rather general property of inverse problems.
It has been known since the work of Tikhonov and other mathematicians that an ill-posed problem can be solved by imposing some regularizing constraints on the solution, i.e., by penalizing "unwanted" behavior, such as too high complexity or discontinuities.
Note: most of this lecture follows the exposition of [EPP00].

Ridge Regression in Linear Models
Ordinary Least Squares
Consider the simple case of $f(\mathbf{x}) = \mathbf{w}^T\mathbf{x}$. Remember we are given a dataset in the form $\{(\mathbf{x}_i, y_i)\}_{i=1}^N$. We define the matrices $X = [\mathbf{x}_1, \ldots, \mathbf{x}_N]^T$ and $\mathbf{y} = [y_1, \ldots, y_N]^T$.

The ordinary least-squares (OLS) cost functional is:

$$I[\mathbf{w}] = \|X\mathbf{w} - \mathbf{y}\|^2 \qquad (1)$$

Under some additional assumptions, the solution to (1) is given by:

$$\mathbf{w}^* = (X^T X)^{-1} X^T \mathbf{y} \qquad (2)$$

Even if (2) can be computed, the matrix inversion can amount to a highly ill-posed problem.

Ridge Regression in Linear Models
Ridge Regression
A possible solution is to penalize large weights, the so-called ridge regression estimator:

$$I_{\mathrm{reg}}[\mathbf{w}] = \|X\mathbf{w} - \mathbf{y}\|^2 + \lambda\|\mathbf{w}\|^2 \qquad (3)$$

where $\lambda$ is called a regularization factor. The solution to (3) is now given by:

$$\mathbf{w}^* = (X^T X + \lambda I_d)^{-1} X^T \mathbf{y} \qquad (4)$$

where $I_d$ is the $d \times d$ identity matrix ($d$ being the dimensionality of the input). For $\lambda \to 0$ the term $\lambda I_d$ vanishes and we are left with standard OLS, while a sufficiently large $\lambda$ ensures that the matrix to be inverted is well conditioned.
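As a concrete sketch of equation (4) in NumPy (purely illustrative, with made-up names and toy data), the closed-form ridge solution can be computed by solving a linear system rather than forming the inverse explicitly:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression, eq. (4): w = (X^T X + lam*I)^(-1) X^T y.
    X is an (N, d) matrix with one sample per row, y an (N,) target vector."""
    d = X.shape[1]
    # Solving the linear system is numerically preferable to an explicit inverse.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Toy usage: a nearly collinear design, where plain OLS (lam = 0) is badly conditioned.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = np.column_stack([x1, x1 + 1e-6 * rng.normal(size=100)])
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=100)

w_ols = ridge_fit(X, y, lam=0.0)     # ill-conditioned system
w_ridge = ridge_fit(X, y, lam=1e-3)  # well-conditioned system
```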

Regularized ERM
Regularized Learning Methods
Generalizing the previous considerations, consider the following regularized version of ERM:

$$\min_{f \in \mathcal{H}} \sum_{i=1}^N L(y_i, f(\mathbf{x}_i)) + \lambda\,\Phi(\|f\|) \qquad (5)$$

The minimization is made over a Reproducing Kernel Hilbert Space $\mathcal{H}$. The additional term $\Phi(\|f\|)$ should be a monotonically increasing function of the norm of the function (in the following we will always have $\Phi(\|f\|) = \|f\|^2$).
As before, the term $\lambda$ is called a regularization factor. For $\lambda \to +\infty$ we will choose a function with zero norm, while for $\lambda \to 0$ we recover ERM.

Regularized ERM
Norms and Smoothness
An important point is that the actual definition of "smoothness" we are enforcing depends on the norm of the function, which in turn depends on the kernel we are choosing.
For example, in the case of the linear kernel $k(\mathbf{x}, \mathbf{y}) = \mathbf{x}^T\mathbf{y}$ we obtain $\|f\|^2 = \mathbf{w}^T\mathbf{w}$, i.e., standard ridge regression.
Consider instead the Gaussian kernel that we defined in the previous lecture. It can be shown that its norm is given by:
$$\|f\|^2 = \frac{1}{(2\pi)^{d}} \int_{\mathcal{X}} |\tilde{f}(\omega)|^2 \exp\left\{\frac{\sigma^2 \|\omega\|^2}{2}\right\} d\omega$$

where $\tilde{f}$ is the Fourier transform of $f$ and $d$ is the dimensionality of the input. Hence, high-frequency components are penalized more heavily than low-frequency components.

Representer’s Theorem
Statement
Theorem 1 (Representer’s Theorem)
In equation (5), suppose $\Phi(\|f\|)$ is a non-decreasing function. Then, a solution to (5) can always be expressed as:

$$f(\mathbf{x}) = \sum_{i=1}^N \alpha_i k(\mathbf{x}, \mathbf{x}_i)$$

Moreover, if $\Phi(\|f\|)$ is monotonically increasing, all solutions have this form.

The Representer's Theorem is fundamental: a possibly infinite-dimensional search (over the RKHS) amounts to a finite-dimensional search (over the coefficients).
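As a small illustration (not part of the original slides), this is how a function of the form given by the theorem is evaluated once the coefficients $\alpha_i$ are known; the Gaussian kernel parametrization and the names below are our own assumptions:

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def predict(x, alphas, X_train, sigma=1.0):
    """Evaluate f(x) = sum_i alpha_i k(x, x_i), the form given by the theorem."""
    return sum(a * gaussian_kernel(x, xi, sigma) for a, xi in zip(alphas, X_train))
```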

Representer’s Theorem
Considering a Bias Term
It is now time to answer a question: where has the bias term $b$ gone? Note that, practically, the inclusion of a bias amounts to a shift of the hyperplane in the feature space, hence to a different decision boundary.
Theoretically, there is the need for an extension of the Representer's Theorem to conditionally PSD kernels.
It can be shown that using a bias term is equivalent to using a different kernel (and hence a different feature space) where constant features are not penalized. See [PMR+01] for a lengthy discussion on the subject.

Support Vector Machines
C-SVM
The non-linear Support Vector Machine that we derived from a geometrical viewpoint fits into this framework, with the use of the hinge loss function:

$$L(y, f(\mathbf{x})) = (1 - y f(\mathbf{x}))_+, \quad \text{with } (a)_+ = \max\{0, a\}$$

This can be shown by first demonstrating that $\|f\|^2 = \boldsymbol{\alpha}^T K \boldsymbol{\alpha}$, where $\boldsymbol{\alpha} = [\alpha_1, \ldots, \alpha_N]^T$ and $K$ is the Gram matrix. Then, the slack variables are used to make the hinge loss function differentiable.
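To make the connection concrete, here is a minimal sketch (our own, not from the slides) of the regularized hinge-loss objective written through the kernel expansion, where $f$ evaluated on the training points is $K\boldsymbol{\alpha}$ and $\|f\|^2 = \boldsymbol{\alpha}^T K \boldsymbol{\alpha}$:

```python
import numpy as np

def hinge_objective(alpha, K, y, lam):
    """Regularized hinge loss: sum_i (1 - y_i f(x_i))_+ + lam * ||f||^2,
    with f(x_i) = (K @ alpha)[i] and ||f||^2 = alpha^T K alpha."""
    f = K @ alpha                         # predictions on the training points
    hinge = np.maximum(0.0, 1.0 - y * f)  # (1 - y f)_+
    return hinge.sum() + lam * (alpha @ K @ alpha)
```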

Consistency
Consistency of Regularized Learning Methods
The regularization framework can also be used to derive general theorems on the consistency of learning methods. As an example, for continuous kernels, it can be shown that the C-SVM is consistent if and only if:

• The kernel is universal (see next slide),
• The regularization factor $\lambda$ is chosen "large enough".

The condition on $\lambda$ is highly technical but can be simplified in some contexts. For example, for the Gaussian kernel, the C-SVM is consistent if $\lambda$ is chosen such that:

$$\lambda = N^{\beta - 1}$$

for some $0 < \beta < \frac{1}{d}$ (where $d$ is the dimensionality of the input).
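As a worked instance of this rule (with an arbitrary, purely illustrative choice of $\beta$): for $N = 1000$ samples in $d = 2$ dimensions any $\beta \in (0, \tfrac{1}{2})$ is admissible, and taking $\beta = 0.25$ gives $\lambda = 1000^{-0.75} \approx 5.6 \times 10^{-3}$.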

Consistency
Universal Kernels
Consider the space of functions induced by the kernel:
$$\mathcal{F} = \mathrm{span}\{k(\cdot, \mathbf{x}) : \mathbf{x} \in \mathcal{X}\}$$

The kernel is said to be universal if $\mathcal{F}$ is dense in $C[\mathcal{X}]$, i.e., for every $f \in C[\mathcal{X}]$ and every $\varepsilon > 0$ there exists a $g \in \mathcal{F}$ such that:

$$\|f - g\| \leq \varepsilon$$

As an example, the Gaussian kernel is universal, while the polynomial kernel is not.

Consistency
R-SVM
From the regularization framework it is also possible to directly derive a version of SVM for regression, which we will call R-SVM. Consider the $\varepsilon$-insensitive loss function:

$$L(y, f(\mathbf{x})) = (|y - f(\mathbf{x})| - \varepsilon)_+$$

which penalizes errors larger than $\varepsilon$ linearly. By introducing two sets of slack variables we obtain the following differentiable cost function:

$$\begin{aligned}
\underset{\zeta_i^+,\, \zeta_i^-}{\text{minimize}} \quad & \frac{1}{2}\sum_{i=1}^N (\zeta_i^+ + \zeta_i^-) + \lambda\|f\|^2 \\
\text{subject to} \quad & y_i - f(\mathbf{x}_i) \leq \varepsilon + \zeta_i^+ \\
& f(\mathbf{x}_i) - y_i \leq \varepsilon + \zeta_i^- \\
& \zeta_i^+, \zeta_i^- \geq 0
\end{aligned} \qquad (6)$$

Consistency
R-SVM (2)
The dual optimization problem of (6) is given by:

$$\begin{aligned}
\max_{\alpha_i, \beta_i} \quad & -\varepsilon\sum_{i=1}^N (\beta_i + \alpha_i) + \sum_{i=1}^N (\beta_i - \alpha_i)y_i - \frac{1}{2}\sum_{i,j=1}^N (\beta_i - \alpha_i)(\beta_j - \alpha_j)k(\mathbf{x}_i, \mathbf{x}_j) \\
\text{s.t.} \quad & 0 \leq \alpha_i, \beta_i \leq \lambda \\
& \sum_{i=1}^N (\beta_i - \alpha_i) = 0
\end{aligned} \qquad (7)$$

And the final regression function is given by:

$$f(\mathbf{x}) = \sum_{i=1}^N (\beta_i - \alpha_i)k(\mathbf{x}, \mathbf{x}_i)$$
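In practice, this kind of $\varepsilon$-insensitive regression is available off the shelf; for instance, scikit-learn's SVR solves the C-parameterized form of the same problem (a sketch with arbitrary toy data and hyperparameters; note that scikit-learn's C plays the role of an inverse regularization factor, roughly $1/\lambda$):

```python
import numpy as np
from sklearn.svm import SVR

# Toy one-dimensional regression problem (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)

# epsilon is the width of the insensitive tube; C controls the trade-off with ||f||^2.
model = SVR(kernel="rbf", C=1.0, epsilon=0.1)
model.fit(X, y)
y_pred = model.predict(X)
```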

Kernel Ridge Regression
Kernel Ridge Regression
Another important class of learning methods is obtained by considering the squared loss function (kernel ridge regression).
As in linear ridge regression, it can be shown that the solution satisfies the following set of linear equations:

$$(K + \lambda I_N)\boldsymbol{\alpha} = \mathbf{y}$$

Although this is simpler to solve than the SVM optimization problem, sparseness of the solution is lost.
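A minimal NumPy sketch of kernel ridge regression based on these equations, assuming a Gaussian kernel (the kernel choice and the names are our own):

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    """Gram matrix K_ij = exp(-||a_i - b_j||^2 / (2 sigma^2)) for rows of A and B."""
    sq_dists = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def kernel_ridge_fit(X, y, lam, sigma=1.0):
    """Solve (K + lam * I_N) alpha = y for the expansion coefficients."""
    K = gaussian_gram(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def kernel_ridge_predict(X_test, X_train, alpha, sigma=1.0):
    """Evaluate f(x) = sum_i alpha_i k(x, x_i) on new points."""
    return gaussian_gram(X_test, X_train, sigma) @ alpha
```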

Structural Risk Minimization
Structural Risk Minimization
Let us explore the link between regularization and SRM in the case of the hinge loss function (for other loss functions, some technical problems arise [EPP00]).
Consider the sequence of spaces $\mathcal{H}_1 \subset \mathcal{H}_2 \subset \cdots$ such that:

$$\|f\| \leq A_i, \quad \forall f \in \mathcal{H}_i$$

where we have chosen the scalars $A_i$ such that $A_1 \leq A_2 \leq \cdots$. Minimizing the empirical risk on space $\mathcal{H}_k$ amounts to solving:

$$\max_{\lambda \geq 0}\ \min_{f \in \mathcal{H}_k} \sum_{i=1}^N L(y_i, f(\mathbf{x}_i)) + \lambda\left(\|f\|^2 - A_k^2\right) \qquad (8)$$

Structural Risk Minimization
Structural Risk Minimization (2)
Solving (8) for every $A_k$ gives us a sequence of optimal $\lambda_1^*, \lambda_2^*, \ldots$
After minimizing the empirical risk, we choose the function that minimizes a given VC bound, with associated $\lambda_i^* = \lambda^*$. The overall operation is equivalent to directly solving:

$$\min_{f \in \mathcal{H}_k} \sum_{i=1}^N L(y_i, f(\mathbf{x}_i)) + \lambda^*\|f\|^2 \qquad (9)$$

Hence, regularization can be seen as an approximate solution to SRM, where the regularization factor is chosen depending on the knowledge of the VC dimension.

Bayesian Learning
Bayesian View
Another perspective on regularization comes from considering the Bayesian approach to learning. Suppose we are given the following elements:

• A prior probability distribution $P(f)$, $f \in \mathcal{H}$, that represents our a-priori knowledge on the goodness of each function.
• A likelihood probability distribution $P(S|f)$ that gives us the probability of observing a dataset, supposing that the true function is $f$.

According to Bayes' law, once we observe a dataset $S$, the posterior distribution is computed as:

$$P(f|S) = \frac{P(S|f)\,P(f)}{\sum_{f} P(S|f)\,P(f)} \qquad (10)$$

Bayesian Learning
Using the Posterior
The Bayes decision function is obtained by averaging over all possible functions:

$$f(\mathbf{x}) = \int_{\mathcal{H}} f(\mathbf{x})\, dP(f\,|\,S)$$

In practice, simpler estimators can be considered:

• Maximum a-Posteriori (MAP), i.e., maximizing the posterior.
• Maximum Likelihood (ML), i.e., maximizing the likelihood. This is equivalent to not making prior assumptions on the shape of the function (uninformative prior).

Bayesian Learning
Regularization and Bayes
Suppose we penalize our models as:
$$P(f) \propto \exp\{-\|f\|^2\}$$

Additionally, suppose the noise in the system is normally distributed with variance $\sigma$:

$$P(S|f) \propto \exp\left\{-\frac{1}{2\sigma}\sum_{i=1}^N (y_i - f(\mathbf{x}_i))^2\right\}$$

The posterior is proportional to:

$$P(f|S) \propto \exp\left\{-\frac{1}{2\sigma}\sum_{i=1}^N (y_i - f(\mathbf{x}_i))^2 - \|f\|^2\right\}$$

Bayesian Learning
Regularization and Bayes (2)
Taking the MAP estimation is equivalent to minimizing the negative of the exponent of the posterior:

$$\min_{f \in \mathcal{H}} \sum_{i=1}^N (y_i - f(\mathbf{x}_i))^2 + \frac{2\sigma^2}{N}\|f\|^2 \qquad (11)$$

Hence, this amounts to making a specific, data-independent choice of the regularization factor.
Similar considerations can be made for other choices of the loss function and of the regularization function.
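As a toy numerical check of this MAP-regularization correspondence (our own sketch, reusing the linear model of the first slides; note that the exact constant linking the regularization factor to the noise level depends on how the likelihood and the empirical risk are normalized):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
sigma = 0.5                                    # noise variance, as in the likelihood above
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + np.sqrt(sigma) * rng.normal(size=50)

# Negative exponent of the posterior for the linear model f(x) = w^T x.
def neg_log_posterior(w):
    return np.sum((y - X @ w) ** 2) / (2 * sigma) + w @ w

w_map = minimize(neg_log_posterior, x0=np.zeros(3)).x

# The same minimizer is obtained from ridge regression with lambda = 2*sigma,
# since rescaling the objective by 2*sigma does not change its argmin.
w_ridge = np.linalg.solve(X.T @ X + 2 * sigma * np.eye(3), X.T @ y)
print(np.max(np.abs(w_map - w_ridge)))         # ~0: the two estimates coincide
```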

Holdout Method
Before concluding, we look at a practical issue: how can we test the accuracy of a trained model?
The simplest idea is the so-called holdout method (a minimal sketch follows the list):

• Subdivide the original dataset into a training set and a testing set.
• Train the model on the former and test it on the latter.
• Repeat steps 1-2 a number of times and average the results (possibly computing a confidence value).
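A sketch of repeated holdout in NumPy (our own illustration; `model` stands for any object exposing hypothetical `fit` and `predict` methods):

```python
import numpy as np

def holdout_accuracy(model, X, y, test_fraction=0.3, repeats=10, seed=0):
    """Repeated holdout: random train/test splits, returning the mean test accuracy."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(repeats):
        perm = rng.permutation(len(y))
        n_test = int(test_fraction * len(y))
        test, train = perm[:n_test], perm[n_test:]
        model.fit(X[train], y[train])                              # train on the training split
        scores.append(np.mean(model.predict(X[test]) == y[test]))  # test on the held-out split
    return np.mean(scores)
```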
In general, however, k-fold cross-validation is preferable.

Cross-Validation
Here is how the method works:
• Subdivide the original dataset into k equally-sized subsets (folds).
• Repeat for i = 1, . . . , k:
  • Train the model on the union of all folds except fold i.
  • Test the obtained model on fold i.
• Average over the results as before.

Typical values of k are between 3 and 10. A special case is given by k = N, known as leave-one-out cross-validation, which possesses interesting theoretical properties but is computationally expensive.
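The corresponding k-fold procedure, in the same illustrative style as the holdout sketch above:

```python
import numpy as np

def k_fold_accuracy(model, X, y, k=5, seed=0):
    """Plain k-fold cross-validation returning the average test accuracy over the folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model.fit(X[train], y[train])                              # train on all folds except fold i
        scores.append(np.mean(model.predict(X[test]) == y[test]))  # test on fold i
    return np.mean(scores)
```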

Model Selection
Cross-validation can be used for model selection, i.e., choosing the optimal parameters of the model.
Suppose we have a set of M possible configurations. We can perform a k-fold cross-validation on every configuration and choose the optimal one. Note that in this case we can have two nested cross-validations, one for testing and one for validating.
As an example, for the C-SVM with a polynomial kernel, we may test all configurations for $C = 2^{-15}, \ldots, 2^{5}$ and $p = 1, \ldots, 15$ (a sketch is given below).
More powerful methods exist for specific cases (such as C-SVM with a Gaussian kernel [HRTZ05]).
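A grid search of this kind can be written directly with scikit-learn (a sketch only: the grid mirrors the example above, while `X_train` and `y_train` are assumed to be given):

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Grid over the C-SVM with polynomial kernel, mirroring the example above.
param_grid = {
    "C": [2.0 ** k for k in range(-15, 6)],   # C = 2^-15, ..., 2^5
    "degree": list(range(1, 16)),             # polynomial degree p = 1, ..., 15
}
search = GridSearchCV(SVC(kernel="poly"), param_grid, cv=5)  # inner 5-fold cross-validation
# search.fit(X_train, y_train)       # X_train, y_train assumed to be available
# search.best_params_                # configuration with the best cross-validated score
```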

Bibliography I
O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee, Choosing multiple parameters for support vector machines, Machine Learning (2002), 131–159.

T. Evgeniou, M. Pontil, and T. Poggio, Regularization networks and support vector machines, Advances in Computational Mathematics 13 (2000), 1–50.

T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu, The entire regularization path for the support vector machine, Journal of Machine Learning Research 5 (2005), 1391–1415.

C.A. Micchelli, Y. Xu, and H. Zhang, Universal kernels, The Journal of Machine Learning Research 7 (2006), 2651–2667.

T. Poggio, S. Mukherjee, R. Rifkin, A. Rakhlin, and A. Verri, b, Tech. report, 2001.

Bibliography II
I. Steinwart, Support Vector Machines are Universally Consistent,Journal of Complexity 18 (2002), no. 3, 768–791.
I. Steinwart and A. Christmann, Support vector machines, 1st ed., 2008.