
Neural Networks, Lecture 5

Regularized Learning Methods

Academic year 2013-2014

Simone Scardapane

Table of Contents

1 REGULARIZATION: Ridge Regression in Linear Models; Regularized ERM; Representer's Theorem

2 REGULARIZED LEARNING: Support Vector Machines; Consistency; Kernel Ridge Regression

3 SRM AND BAYES: Structural Risk Minimization; Bayesian Learning

4 MODEL SELECTION

5 REFERENCES

REGULARIZATION REGULARIZED LEARNING SRM AND BAYES MODEL SELECTION REFERENCES

Ridge Regression in Linear Models

Ill-Posed Problems

In mathematics, a problem is well-posed if the solution:

• Exists,
• Is unique,
• Is stable with respect to the data of the problem.

Under this definition, we see that the learning problem, as formulated under the ERM principle, is highly ill-posed. This is a rather general property of inverse problems. It has been known since the work of Tikhonov and other mathematicians that an ill-posed problem can be solved by imposing some regularizing constraints on the solution, i.e., by penalizing "unwanted" behavior, such as excessive complexity or discontinuities. Note: most of this lecture follows the exposition of [EPP00].


Ordinary Least Squares

Consider the simple case of f(x) = wᵀx. Remember we are given a dataset of the form {(x_i, y_i)}_{i=1}^{N}. We define the matrices X = [x_1, . . . , x_N]ᵀ and y = [y_1, . . . , y_N]ᵀ.

The ordinary least-squares (OLS) cost function is:

I[w] = ‖Xw − y‖² (1)

Under some additional assumptions, the minimizer of (1) is given by:

w∗ = (XᵀX)⁻¹Xᵀy (2)

Even when (2) can be computed, the matrix inversion can be highly ill-conditioned, so the problem remains ill-posed.


Ridge Regression

A possible solution is to penalize large weights, leading to the so-called ridge regression estimate:

Ireg[w] = ‖Xw − y‖² + λ‖w‖² (3)

where λ is called a regularization factor. The solution to (3) is now given by:

w∗ = (XᵀX + λI_d)⁻¹Xᵀy (4)

where I_d is the d × d identity matrix, d being the dimensionality of the input. For λ → 0 the term λI_d vanishes and we recover standard OLS; for a sufficiently large λ, the matrix to be inverted is well conditioned.
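As a concrete illustration, here is a minimal NumPy sketch of equations (2) and (4) on synthetic, nearly collinear data; the data and the value of λ are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N samples with d nearly collinear features, so that
# X^T X is badly conditioned and plain OLS becomes unstable.
N, d = 50, 5
X = rng.normal(size=(N, 1)) @ np.ones((1, d)) + 1e-4 * rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

# OLS, equation (2): w* = (X^T X)^{-1} X^T y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge, equation (4): w* = (X^T X + lam * I_d)^{-1} X^T y
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print(np.linalg.cond(X.T @ X))                    # very large: ill-conditioned
print(np.linalg.cond(X.T @ X + lam * np.eye(d)))  # moderate: well conditioned
print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))
```

The regularized matrix has a far smaller condition number, and the ridge weights stay small while the OLS weights can blow up along the collinear directions.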


Regularized ERM

Regularized Learning Methods

Generalizing the previous considerations, consider the following regularized version of ERM:

min_{f∈H} ∑_{i=1}^{N} L(y_i, f(x_i)) + λΦ(‖f‖) (5)

The minimization is performed over a Reproducing Kernel Hilbert Space H. The additional term Φ(‖f‖) should be a monotonically increasing function of the norm of the function (in the following we will always take Φ(‖f‖) = ‖f‖²). As before, λ is called a regularization factor. For λ → +∞ we choose a function with zero norm, while for λ → 0 we recover plain ERM.


Norms and Smoothness

An important point is that the actual definition of "smoothness" we are enforcing depends on the norm of the function, which in turn depends on the kernel we choose. For example, in the case of the linear kernel k(x, y) = xᵀy we obtain ‖f‖² = wᵀw, i.e., standard ridge regression.

Consider instead the Gaussian kernel that we defined in the previous lecture. It can be shown that the norm is given by:

‖f‖² = (1 / (2π)^d) ∫_X |f̃(ω)|² exp{σ²‖ω‖² / 2} dω

where f̃ is the Fourier transform of f and d is the dimensionality of the input. Hence, high-frequency components are penalized more heavily than low-frequency components.


Representer’s Theorem

Statement

Theorem 1 (Representer's Theorem)

In equation (5), suppose Φ(‖f‖) is a non-decreasing function. Then, a solution to (5) can always be expressed as:

f(x) = ∑_{i=1}^{N} α_i k(x, x_i)

Moreover, if Φ(‖f‖) is monotonically increasing, all solutions have this form.

The Representer's Theorem is fundamental: a possibly infinite-dimensional search (over the RKHS) reduces to a finite-dimensional search (over the N coefficients α_i).
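A minimal sketch of the finite-dimensional form guaranteed by the theorem, assuming a Gaussian kernel; the coefficients α below are placeholders (random, for illustration only), since computing them properly depends on the chosen loss:

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    # k(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def f(x, X_train, alpha, sigma=1.0):
    # Representer form: f(x) = sum_i alpha_i k(x, x_i)
    return sum(a * gaussian_kernel(x, x_i, sigma) for a, x_i in zip(alpha, X_train))

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10, 3))  # N = 10 training points in d = 3
alpha = rng.normal(size=10)         # placeholder coefficients
print(f(X_train[0], X_train, alpha))
```

Whatever the RKHS looks like, evaluating a solution only ever requires the N kernel evaluations above.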


Considering a Bias Term

It is now time to answer a question: where has the bias term b gone? Note that, in practice, the inclusion of a bias amounts to a shift of the hyperplane in the feature space, and hence to a different decision boundary. Theoretically, an extension of the Representer's Theorem to conditionally PSD kernels is needed. It can be shown that using a bias term is equivalent to using a different kernel (and hence a different feature space) in which constant features are not penalized. See [PMR+01] for a lengthy discussion of the subject.


Support Vector Machines

C-SVM

The non-linear Support Vector Machine that we derived from a geometrical viewpoint fits into this framework through the hinge loss function:

L(y, f(x)) = (1 − y f(x))₊, with (a)₊ = max{0, a}

This can be shown by first demonstrating that ‖f‖² = αᵀKα, where α = [α₁, . . . , α_N]ᵀ and K is the Gram matrix. The slack variables are then used to turn the non-differentiable hinge loss into a smooth constrained problem.
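The hinge loss itself is simple to state in code; a small sketch:

```python
import numpy as np

def hinge_loss(y, f_x):
    # L(y, f(x)) = (1 - y f(x))_+ = max(0, 1 - y f(x))
    return np.maximum(0.0, 1.0 - y * f_x)

print(hinge_loss(+1, 2.0))   # confidently correct: 0.0
print(hinge_loss(+1, 0.5))   # correct but inside the margin: 0.5
print(hinge_loss(-1, 0.5))   # misclassified: 1.5
```

Points classified correctly with margin at least 1 incur zero loss, which is what produces the sparsity of the SVM solution.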


Consistency

Consistency of Regularized Learning Methods

The regularization framework can also be used to derive general theorems on the consistency of learning methods. As an example, for continuous kernels, it can be shown that the C-SVM is consistent if and only if:

• The kernel is universal (see next slide),
• The regularization factor λ is chosen "large enough".

The condition on λ is highly technical but can be simplified in some contexts. For example, for the Gaussian kernel, the C-SVM is consistent if λ is chosen such that:

λ = N^{β−1}

for some 0 < β < 1/d (where d is the dimensionality of the input).


Universal Kernels

Consider the space of functions induced by the kernel:

F = span{k(·, x) : x ∈ X}

The kernel is said to be universal if F is dense in C[X], i.e., if for every f ∈ C[X] and every ε > 0 there exists a g ∈ F such that:

‖f − g‖∞ ≤ ε

As an example, the Gaussian kernel is universal, while the polynomial kernel is not.


R-SVM

From the regularization framework it is also possible to directly derive a version of the SVM for regression, which we will call R-SVM. Consider the ε-insensitive loss function:

L(y, f(x)) = (|y − f(x)| − ε)₊

which penalizes errors larger than ε linearly. By introducing two sets of slack variables we obtain the following differentiable cost function:

minimize_{f, ζ⁺_i, ζ⁻_i}  ∑_{i=1}^{N} (ζ⁺_i + ζ⁻_i) + λ‖f‖²

subject to  y_i − f(x_i) ≤ ε + ζ⁺_i
            f(x_i) − y_i ≤ ε + ζ⁻_i
            ζ⁺_i, ζ⁻_i ≥ 0     (6)
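The ε-insensitive loss can be sketched in the same way (ε = 0.1 is an arbitrary choice for the example):

```python
import numpy as np

def eps_insensitive_loss(y, f_x, eps=0.1):
    # L(y, f(x)) = (|y - f(x)| - eps)_+
    return np.maximum(0.0, np.abs(y - f_x) - eps)

print(eps_insensitive_loss(1.0, 1.05))  # inside the eps-tube: 0.0
print(eps_insensitive_loss(1.0, 1.5))   # 0.5 off, so loss of about 0.4
```

Errors smaller than ε cost nothing; beyond the tube the penalty grows linearly, mirroring the role of the hinge loss in classification.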


R-SVM (2)

The dual optimization problem of (6) is given by:

max_{α_i, β_i}  −ε ∑_{i=1}^{N} (β_i + α_i) + ∑_{i=1}^{N} (β_i − α_i) y_i − (1/2) ∑_{i,j=1}^{N} (β_i − α_i)(β_j − α_j) k(x_i, x_j)

s.t.  0 ≤ α_i, β_i ≤ λ

      ∑_{i=1}^{N} (β_i − α_i) = 0     (7)

And the final regression function is given by:

f(x) = ∑_{i=1}^{N} (β_i − α_i) k(x, x_i)


Kernel Ridge Regression

Another important class of learning methods is obtained by considering the squared loss function (kernel ridge regression). As in linear ridge regression, it can be shown that the solution satisfies the following set of linear equations:

(K + λI_N)α = y

Although this is simpler to solve than the SVM optimization problem, sparsity of the solution is lost.
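A minimal NumPy sketch of kernel ridge regression on synthetic data, assuming a Gaussian kernel; the bandwidth, λ, and target function are arbitrary choices for the example:

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    # K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(40, 1))
y = np.sin(3 * X[:, 0]) + 0.05 * rng.normal(size=40)

# Solve the linear system (K + lambda * I_N) alpha = y
K = gaussian_gram(X, sigma=0.5)
lam = 1e-3
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Predictions at the training points are f = K alpha
f = K @ alpha
print(np.max(np.abs(f - y)))  # small residual on this smooth target
```

Note that, in general, every α_i is nonzero: unlike the SVM, the expansion uses all N training points.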


Structural Risk Minimization

Let us explore the link between regularization and SRM in the case of the hinge loss function (for other loss functions, some technical problems arise [EPP00]). Consider the sequence of spaces H₁ ⊂ H₂ ⊂ . . . such that:

‖f‖ ≤ A_i, ∀f ∈ H_i

where we have chosen the scalars A_i such that A₁ ≤ A₂ ≤ . . .. Minimizing the empirical risk on space H_k amounts to solving:

max_{λ≥0} min_{f∈H_k} ∑_{i=1}^{N} L(y_i, f(x_i)) + λ(‖f‖² − A_k²) (8)


Structural Risk Minimization (2)

Solving (8) for every A_k gives us a sequence of optimal values λ∗₁, λ∗₂, . . ..

After minimizing the empirical risk, we choose the function that minimizes a given VC bound, with associated λ∗_i = λ∗. The overall operation is equivalent to directly solving:

min_{f∈H} ∑_{i=1}^{N} L(y_i, f(x_i)) + λ∗‖f‖² (9)

Hence, regularization can be seen as an approximate solution to SRM, where the regularization factor is chosen depending on knowledge of the VC dimension.


Bayesian Learning

Bayesian View

Another perspective on regularization comes from considering the Bayesian approach to learning. Suppose we are given the following elements:

• A prior probability distribution P(f), f ∈ H, that represents our a priori knowledge on the goodness of each function.

• A likelihood P(S|f) that gives us the probability of observing a dataset S, supposing that the true function is f.

According to Bayes' law, once we observe a dataset S, the posterior distribution is computed as:

P(f|S) = P(S|f)P(f) / ∑_f P(S|f)P(f) (10)


Using the Posterior

The Bayes decision function is obtained by averaging over all possible functions:

f(x) = ∫_H f(x) dP(f|S)

In practice, simpler estimates can be considered:

• Maximum a Posteriori (MAP), i.e., maximizing the posterior.
• Maximum Likelihood (ML), i.e., maximizing the likelihood. This is equivalent to not making prior assumptions on the shape of the function (uninformative prior).


Regularization and Bayes

Suppose we penalize our models as:

P(f) ∝ exp{−‖f‖²}

Additionally, suppose the noise in the system is normally distributed with variance σ²:

P(S|f) ∝ exp{−(1/(2σ²)) ∑_{i=1}^{N} (y_i − f(x_i))²}

The posterior is then proportional to:

P(f|S) ∝ exp{−(1/(2σ²)) ∑_{i=1}^{N} (y_i − f(x_i))² − ‖f‖²}


Regularization and Bayes (2)

Taking the MAP estimate is equivalent to minimizing the negative of the exponent of the posterior:

min_{f∈H} ∑_{i=1}^{N} (y_i − f(x_i))² + 2σ²‖f‖² (11)

Hence, this amounts to making a specific, data-independent choice of the regularization factor, λ∗ = 2σ². Similar considerations can be made for other choices of the loss function and of the regularization term.


Holdout Method

Before concluding, we look at a practical issue: how can we test the accuracy of a trained model? The simplest idea is the so-called holdout method:

• Subdivide the original dataset into a training set and a testing set.
• Train the model on the former and test it on the latter.
• Repeat steps 1-2 a number of times and average the results (possibly computing a confidence interval).

In general, however, k-fold cross-validation is preferable.
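A minimal sketch of a single holdout split, using ordinary least squares as the model purely for illustration; the 70/30 split ratio is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Holdout: random split into 70% training and 30% testing.
perm = rng.permutation(len(X))
n_train = int(0.7 * len(X))
train_idx, test_idx = perm[:n_train], perm[n_train:]

# Train (here: OLS via least squares) on the training set only...
w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)

# ...and estimate the error on the unseen testing set.
test_mse = np.mean((X[test_idx] @ w - y[test_idx]) ** 2)
print(test_mse)
```

Repeating the split with different permutations and averaging `test_mse` gives the multi-run version described above.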


Cross-Validation

Here is how the method works:

• Subdivide the original dataset into k equally-sized subsets (folds).
• Repeat for i = 1, . . . , k:
  • Train the model on the union of all folds except fold i.
  • Test the obtained model on fold i.
• Average the results as before.

Typical values of k are between 3 and 10. A special case is given by k = N, known as leave-one-out cross-validation, which possesses interesting theoretical properties but is computationally expensive.
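The steps above can be sketched around the ridge estimator of equation (4); the λ value and the synthetic data are arbitrary placeholders:

```python
import numpy as np

def k_fold_mse(X, y, k=5, lam=1.0, seed=0):
    """k-fold cross-validation of the ridge estimator of equation (4)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)  # k disjoint folds
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        Xtr, ytr = X[train_idx], y[train_idx]
        # Train on all folds except fold i...
        w = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(X.shape[1]), Xtr.T @ ytr)
        # ...and test on fold i.
        errors.append(np.mean((X[test_idx] @ w - y[test_idx]) ** 2))
    return np.mean(errors)  # average over the k folds

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
y = X @ rng.normal(size=4) + 0.1 * rng.normal(size=60)
print(k_fold_mse(X, y, k=5))
```

Setting k = len(X) in this routine yields the leave-one-out estimate mentioned above.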


Model Selection

Cross-validation can be used for model selection, i.e., choosing the optimal parameters of the model. Suppose we have a set of M possible configurations. We can perform a k-fold cross-validation for every configuration and choose the best one. Note that in this case we can have two nested cross-validations, one for testing and one for validation. As an example, for the C-SVM with a polynomial kernel, we may test all configurations with C = 2^{−15}, . . . , 2^{5} and p = 1, . . . , 15. More powerful methods exist for specific cases (such as the C-SVM with a Gaussian kernel [HRTZ05]).
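A sketch of this grid search, validating each candidate over an exponential grid by a k-fold routine defined inline; ridge regression stands in for the SVM here, and the grid bounds are illustrative:

```python
import numpy as np

def cv_mse(X, y, lam, k=5, seed=0):
    # k-fold cross-validation error of ridge regression for one lambda.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    errs = []
    for i in range(k):
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        w = np.linalg.solve(X[tr].T @ X[tr] + lam * np.eye(X.shape[1]),
                            X[tr].T @ y[tr])
        errs.append(np.mean((X[folds[i]] @ w - y[folds[i]]) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 5))
y = X @ rng.normal(size=5) + 0.2 * rng.normal(size=80)

# Exponential grid lambda = 2^-15, ..., 2^5; keep the configuration
# with the lowest cross-validated error.
grid = [2.0 ** e for e in range(-15, 6)]
best_lam = min(grid, key=lambda lam: cv_mse(X, y, lam))
print(best_lam)
```

For an honest error estimate of the selected configuration, the outer testing loop would use a second, nested cross-validation, as noted above.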


Bibliography I

O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee, Choosingmultiple parameters for support vector machines, Machine learning(2002), 131–159.

T. Evgeniou, M. Pontil, and T. Poggio, Regularization networks andsupport vector machines, Advances in Computational Mathematics13 (2000), 1–50.

T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu, The entireregularization path for the support vector machine, Journal ofMachine Learning Research 5 (2005), 1391–1415.

C.A. Micchelli, Y. Xu, and H. Zhang, Universal kernels, TheJournal of Machine Learning Research 7 (2006), 2651–2667.

T. Poggio, S. Mukherjee, R. Rifkin, A. Rakhlin, and A. Verri, b,Tech. report, 2001.


Bibliography II

I. Steinwart, Support Vector Machines are Universally Consistent,Journal of Complexity 18 (2002), no. 3, 768–791.

I. Steinwart and A. Christmann, Support Vector Machines, 1st ed., Springer, 2008.