Statistical Learning Theory
8/4/2019 Statistical Learning Theory
Statistical Learning Theory
Olivier Bousquet
Department of Empirical Inference
Max Planck Institute of Biological Cybernetics
Machine Learning Summer School, August 2003
Roadmap (1)
Lecture 1: Introduction
Part I: Binary Classification
Lecture 2: Basic bounds
Lecture 3: VC theory
Lecture 4: Capacity measures
Lecture 5: Advanced topics
O. Bousquet Statistical Learning Theory 1
Roadmap (2)
Part II: Real-Valued Classification
Lecture 6: Margin and loss functions
Lecture 7: Regularization
Lecture 8: SVM
O. Bousquet Statistical Learning Theory 2
Lecture 1
The Learning Problem Context
Formalization
Approximation/Estimation trade-off
Algorithms and Bounds
O. Bousquet Statistical Learning Theory Lecture 1 3
Learning and Inference
The inductive inference process:
1. Observe a phenomenon
2. Construct a model of the phenomenon
3. Make predictions
This is more or less the definition of natural sciences !
The goal of Machine Learning is to automate this process
The goal of Learning Theory is to formalize it.
O. Bousquet Statistical Learning Theory Lecture 1 4
Pattern recognition
We consider here the supervised learning framework for pattern recognition:
Data consists of pairs (instance, label)
Label is +1 or −1
Algorithm constructs a function (instance → label)
Goal: make few mistakes on future unseen instances
O. Bousquet Statistical Learning Theory Lecture 1 5
Approximation/Interpolation
It is always possible to build a function that fits exactly the data.
But is it reasonable ?
O. Bousquet Statistical Learning Theory Lecture 1 6
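The point above can be illustrated numerically. The sketch below uses a 1-nearest-neighbour rule as a hypothetical stand-in for any interpolating method (this example is mine, not from the slides): it always reaches zero empirical error on distinct training points, yet nothing on the sample constrains its behaviour elsewhere.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.5, 20)                # instances
Y = np.where(rng.random(20) < 0.5, 1, -1)    # arbitrary +/-1 labels

def interpolator(x):
    # 1-nearest-neighbour rule: copy the label of the closest training point
    return Y[np.argmin(np.abs(X - x))]

train_errors = sum(interpolator(x) != y for x, y in zip(X, Y))
print(train_errors)  # 0: the sample is fitted exactly
```

Zero training error is achieved regardless of how the labels were generated, which is precisely why empirical fit alone says nothing about future mistakes.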
Occam's Razor
Idea: look for regularities in the observed phenomenon. These can be generalized from the observed past to the future
choose the simplest consistent model
How to measure simplicity ?
Physics: number of constants
Description length
Number of parameters
...
O. Bousquet Statistical Learning Theory Lecture 1 7
No Free Lunch
if there is no assumption on how the past is related to the future,
prediction is impossible
if there is no restriction on the possible phenomena, generalization
is impossible
We need to make assumptions
Simplicity is not absolute
Data will never replace knowledge
Generalization = data + knowledge
O. Bousquet Statistical Learning Theory Lecture 1 8
Assumptions
Two types of assumptions
Future observations related to past ones
Stationarity of the phenomenon
Constraints on the phenomenon
Notion of simplicity
O. Bousquet Statistical Learning Theory Lecture 1 9
Probabilistic Model
Relationship between past and future observations
Sampled independently from the same distribution
Independence: each new observation yields maximum information
Identical distribution: the observations give information about the underlying phenomenon (here a probability distribution)
O. Bousquet Statistical Learning Theory Lecture 1 11
Probabilistic Model
We consider an input space X and output space Y. Here: classification case Y = {−1, 1}.
Assumption: The pairs (X, Y) ∈ X × Y are distributed according to P (unknown).
Data: We observe a sequence of n i.i.d. pairs (Xi, Yi) sampled according to P.
Goal: construct a function g : X → Y which predicts Y from X.
O. Bousquet Statistical Learning Theory Lecture 1 12
Probabilistic Model
Criterion to choose our function:
Low probability of error P(g(X) ≠ Y).
Risk:
R(g) = P(g(X) ≠ Y) = E[1_{g(X)≠Y}]
P is unknown, so we cannot directly measure the risk. We can only measure the agreement on the data.
Empirical Risk:
Rn(g) = (1/n) Σ_{i=1}^n 1_{g(Xi)≠Yi}
O. Bousquet Statistical Learning Theory Lecture 1 13
Target function
P can be decomposed as P_X × P(Y|X)
η(x) = E[Y|X = x] = 2 P[Y = 1|X = x] − 1 is the regression function
t(x) = sgn η(x) is the target function
in the deterministic case Y = t(X) (P[Y = 1|X] ∈ {0, 1})
in general, n(x) = min(P[Y = 1|X = x], 1 − P[Y = 1|X = x]) = (1 − |η(x)|)/2 is the noise level
O. Bousquet Statistical Learning Theory Lecture 1 14
Assumptions about P
Need assumptions about P.
Indeed, if t(x) is totally chaotic, there is no possible generalization
from finite data.
Assumptions can be
Preference (e.g. a priori probability distribution on possible functions)
Restriction (set of possible functions)
Treating lack of knowledge:
Bayesian approach: uniform distribution
Learning Theory approach: worst case analysis
O. Bousquet Statistical Learning Theory Lecture 1 15
Overfitting/Underfitting
The data can mislead you.
Underfitting: model too small to fit the data
Overfitting: artificially good agreement with the data
No way to detect them from the data ! Need extra validation data.
O. Bousquet Statistical Learning Theory Lecture 1 17
Empirical Risk Minimization
Choose a model G (set of possible functions)
Minimize the empirical risk in the model
min_{g∈G} Rn(g)
What if the Bayes classifier is not in the model ?
O. Bousquet Statistical Learning Theory Lecture 1 18
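Empirical risk minimization over a finite model can be sketched in a few lines. The threshold-classifier model and the 10% label noise below are illustrative assumptions of mine, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = rng.uniform(-1, 1, n)
Y = np.where(X > 0.2, 1, -1)     # hypothetical target: threshold at 0.2
flip = rng.random(n) < 0.1       # flip 10% of the labels (noise)
Y[flip] *= -1

# finite model G: threshold classifiers g_t(x) = sign(x - t)
thresholds = np.linspace(-1, 1, 41)

def empirical_risk(t):
    return np.mean(np.where(X > t, 1, -1) != Y)

g_hat = thresholds[np.argmin([empirical_risk(t) for t in thresholds])]
print(g_hat, empirical_risk(g_hat))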
Approximation/Estimation
Bayes risk:
R* = inf_g R(g).
Best risk a deterministic function can have (risk of the target function, or Bayes classifier).
Decomposition: with R(g*) = inf_{g∈G} R(g),
R(gn) − R* = [R(g*) − R*] + [R(gn) − R(g*)]
(approximation error + estimation error)
Only the estimation error is random (i.e. depends on the data).
O. Bousquet Statistical Learning Theory Lecture 1 19
Structural Risk Minimization
Choose a collection of models {Gd : d = 1, 2, . . .}
Minimize the empirical risk in each model
Minimize the penalized empirical risk
min_d min_{g∈Gd} Rn(g) + pen(d, n)
pen(d, n) gives preference to models where the estimation error is small
pen(d, n) measures the size or capacity of the model
O. Bousquet Statistical Learning Theory Lecture 1 20
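The penalized selection can be sketched as a toy computation. The penalty pen(d, n) = √(d/n) and the risk values below are made-up illustrations, not the slides' choices.

```python
import math

# toy SRM: models G_d indexed by d = 1, 2, ..., with penalty pen(d, n) = sqrt(d / n)
def srm_select(emp_risks, n):
    scores = [r + math.sqrt(d / n) for d, r in enumerate(emp_risks, start=1)]
    return scores.index(min(scores)) + 1   # the selected model index d

# empirical risk decreases with d while the penalty grows: a middle d wins
print(srm_select([0.30, 0.20, 0.15, 0.14, 0.135], n=100))  # 3
```

The selected index balances the two terms, mirroring the approximation/estimation trade-off of the previous slide.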
Regularization
Choose a large model G (possibly dense)
Choose a regularizer ‖g‖
Minimize the regularized empirical risk
min_{g∈G} Rn(g) + λ‖g‖²
Choose an optimal trade-off (regularization parameter).
Most methods can be thought of as regularization methods.
O. Bousquet Statistical Learning Theory Lecture 1 21
Bounds (1)
A learning algorithm
Takes as input the data (X1, Y1), . . . , (Xn, Yn)
Produces a function gn
Can we estimate the risk of gn ?
random quantity (depends on the data).
need probabilistic bounds
O. Bousquet Statistical Learning Theory Lecture 1 22
Bounds (2)
Error bounds: R(gn) ≤ Rn(gn) + B
⇒ estimation from an empirical quantity
Relative error bounds:
Best in a class: R(gn) ≤ R(g*) + B
Bayes risk: R(gn) ≤ R* + B
Theoretical guarantees
O. Bousquet Statistical Learning Theory Lecture 1 23
Lecture 2
Basic Bounds
Probability tools
Relationship with empirical processes
Law of large numbers
Union bound
Relative error bounds
O. Bousquet Statistical Learning Theory Lecture 2 24
Probability Tools (1)
Basic facts
Union: P[A or B] ≤ P[A] + P[B]
Inclusion: If A ⊂ B, then P[A] ≤ P[B].
Inversion: If P[X ≥ t] ≤ F(t), then with probability at least 1 − δ, X ≤ F^{−1}(δ).
Expectation: If X ≥ 0, E[X] = ∫_0^∞ P[X ≥ t] dt.
O. Bousquet Statistical Learning Theory Lecture 2 25
Probability Tools (2)
Basic inequalities
Jensen: for f convex, f(E[X]) ≤ E[f(X)]
Markov: if X ≥ 0, then for all t > 0, P[X ≥ t] ≤ E[X]/t
Chebyshev: for t > 0, P[|X − E[X]| ≥ t] ≤ Var[X]/t²
Chernoff: for all t ∈ R, P[X ≥ t] ≤ inf_{λ≥0} E[e^{λ(X−t)}]
O. Bousquet Statistical Learning Theory Lecture 2 26
Error bounds
Recall that we want to bound R(gn) = E[1_{gn(X)≠Y}] where gn has been constructed from (X1, Y1), . . . , (Xn, Yn).
Cannot be observed (P is unknown)
Random (depends on the data)
⇒ we want to bound P[R(gn) − Rn(gn) > ε]
O. Bousquet Statistical Learning Theory Lecture 2 27
Loss class
For convenience, let Zi = (Xi, Yi) and Z = (X, Y). Given G, define the loss class
F = {f : (x, y) ↦ 1_{g(x)≠y} : g ∈ G}
Denote P f = E[f(X, Y)] and Pn f = (1/n) Σ_{i=1}^n f(Xi, Yi)
Quantity of interest: P f − Pn f
We will go back and forth between F and G (bijection)
O. Bousquet Statistical Learning Theory Lecture 2 28
Empirical process
Empirical process: {P f − Pn f}_{f∈F}
Process = collection of random variables (here indexed by the functions in F)
Empirical = distribution of each random variable
Many techniques exist to control the supremum
sup_{f∈F} (P f − Pn f)
O. Bousquet Statistical Learning Theory Lecture 2 29
The Law of Large Numbers
R(g) − Rn(g) = E[f(Z)] − (1/n) Σ_{i=1}^n f(Zi)
is the difference between the expectation and the empirical average of the r.v. f(Z)
Law of large numbers:
P[ lim_{n→∞} (1/n) Σ_{i=1}^n f(Zi) − E[f(Z)] = 0 ] = 1.
⇒ can we quantify it ?
O. Bousquet Statistical Learning Theory Lecture 2 30
Hoeffding's Inequality
Quantitative version of the law of large numbers. Assumes bounded random variables.
Theorem 1. Let Z1, . . . , Zn be n i.i.d. random variables with f(Z) ∈ [a, b]. Then for all ε > 0, we have
P[ |(1/n) Σ_{i=1}^n f(Zi) − E[f(Z)]| > ε ] ≤ 2 exp(−2nε² / (b − a)²).
Let's rewrite it to better understand
O. Bousquet Statistical Learning Theory Lecture 2 31
Hoeffding's Inequality
Write
δ = 2 exp(−2nε² / (b − a)²)
Then
P[ |Pn f − P f| > (b − a) √(log(2/δ) / (2n)) ] ≤ δ
or [Inversion] with probability at least 1 − δ,
|Pn f − P f| ≤ (b − a) √(log(2/δ) / (2n))
O. Bousquet Statistical Learning Theory Lecture 2 32
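The inversion above can be checked by simulation. The Bernoulli loss distribution and the constants below are illustrative assumptions of mine:

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta, trials = 200, 0.05, 2000
# for a {0,1}-valued loss, (b - a) = 1
eps = np.sqrt(np.log(2 / delta) / (2 * n))

# fixed f: f(Z) ~ Bernoulli(0.3); compare empirical mean to the true mean
p = 0.3
samples = rng.random((trials, n)) < p
deviations = np.abs(samples.mean(axis=1) - p)
failure_rate = np.mean(deviations > eps)
print(eps, failure_rate)  # failure rate stays below delta, as the bound promises
```

In practice the observed failure rate is well below δ, reflecting that Hoeffding's inequality ignores the variance (a point the lecture returns to later).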
Hoeffding's inequality
Let's apply it to f(Z) = 1_{g(X)≠Y}.
For any g, and any δ > 0, with probability at least 1 − δ,
R(g) ≤ Rn(g) + √(log(2/δ) / (2n)). (1)
Notice that one has to consider a fixed function f: the probability is with respect to the sampling of the data.
If the function depends on the data this does not apply !
O. Bousquet Statistical Learning Theory Lecture 2 33
Limitations
For each fixed function f ∈ F, there is a set S of samples for which
P f − Pn f ≤ √(log(2/δ) / (2n))   (P[S] ≥ 1 − δ)
These sets may be different for different functions
The function chosen by the algorithm depends on the sample
For the observed sample, only some of the functions in F will satisfy this inequality !
O. Bousquet Statistical Learning Theory Lecture 2 34
Limitations
What we need to bound is
P fn − Pn fn
where fn is the function chosen by the algorithm based on the data.
For any fixed sample, there exists a function f such that
P f − Pn f = 1
Take the classifier which agrees with Yi on the data and is wrong everywhere else: its loss f has Pn f = 0 and P f = 1.
This does not contradict Hoeffding but shows it is not enough
O. Bousquet Statistical Learning Theory Lecture 2 35
Limitations
[Figure: risk R[f] and empirical risk Remp[f] plotted over the function class; the minimizer fm of Remp and the minimizer fopt of R differ]
Hoeffding's inequality quantifies deviations for a fixed function
O. Bousquet Statistical Learning Theory Lecture 2 36
Uniform Deviations
Before seeing the data, we do not know which function the algorithm
will choose.
The trick is to consider uniform deviations:
R(fn) − Rn(fn) ≤ sup_{f∈F} (R(f) − Rn(f))
We need a bound which holds simultaneously for all functions in a class
O. Bousquet Statistical Learning Theory Lecture 2 37
Union Bound
Consider two functions f1, f2 and define
Ci = {(x1, y1), . . . , (xn, yn) : P fi − Pn fi > ε}
From Hoeffding's inequality, for each i,
P[Ci] ≤ δ
We want to bound the probability of being bad for i = 1 or i = 2:
P[C1 ∪ C2] ≤ P[C1] + P[C2]
O. Bousquet Statistical Learning Theory Lecture 2 38
Finite Case
More generally,
P[C1 ∪ · · · ∪ CN] ≤ Σ_{i=1}^N P[Ci]
We have
P[∃f ∈ {f1, . . . , fN} : P f − Pn f > ε] ≤ Σ_{i=1}^N P[P fi − Pn fi > ε] ≤ N exp(−2nε²)
O. Bousquet Statistical Learning Theory Lecture 2 39
Finite Case
We obtain, for G = {g1, . . . , gN}: for all δ > 0, with probability at least 1 − δ,
∀g ∈ G, R(g) ≤ Rn(g) + √((log N + log(1/δ)) / (2n))
This is a generalization bound !
Coding interpretation: log N is the number of bits needed to specify a function in F
O. Bousquet Statistical Learning Theory Lecture 2 40
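The finite-class bound can be evaluated directly. The helper below is a sketch (its name and the numeric values are mine, not from the slides):

```python
import math

def finite_class_bound(n, N, delta):
    # with probability >= 1 - delta, for all g in a class of size N:
    #   R(g) <= R_n(g) + sqrt((log N + log(1/delta)) / (2n))
    return math.sqrt((math.log(N) + math.log(1 / delta)) / (2 * n))

print(round(finite_class_bound(n=1000, N=100, delta=0.05), 3))  # 0.062
```

The dependence on the class is only logarithmic, which is what makes the coding interpretation (log N bits) natural.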
Approximation/Estimation
Let g* = arg min_{g∈G} R(g). If gn minimizes the empirical risk in G,
Rn(g*) − Rn(gn) ≥ 0
Thus
R(gn) = R(gn) − R(g*) + R(g*)
≤ [R(gn) − Rn(gn)] + [Rn(gn) − Rn(g*)] + [Rn(g*) − R(g*)] + R(g*)
≤ 2 sup_{g∈G} |R(g) − Rn(g)| + R(g*)
O. Bousquet Statistical Learning Theory Lecture 2 41
Approximation/Estimation
We obtain: with probability at least 1 − δ,
R(gn) ≤ R(g*) + 2 √((log N + log(2/δ)) / (2n))
The first term decreases if N increases
The second term increases
The size of G controls the trade-off
O. Bousquet Statistical Learning Theory Lecture 2 42
Summary (1)
Inference requires assumptions
- Data sampled i.i.d. from P
- Restrict the possible functions to G
- Choose a sequence of models Gm to have more flexibility/control
O. Bousquet Statistical Learning Theory Lecture 2 43
Summary (2)
Bounds are valid w.r.t. repeated sampling
- For a fixed function g, for most of the samples, R(g) − Rn(g) ≈ 1/√n
- For most of the samples, if |G| = N, sup_{g∈G} R(g) − Rn(g) ≈ √(log N / n)
Extra variability because the chosen gn changes with the data
O. Bousquet Statistical Learning Theory Lecture 2 44
Improvements
We obtained
sup_{g∈G} R(g) − Rn(g) ≤ √((log N + log(2/δ)) / (2n))
To be improved:
Hoeffding only uses boundedness, not the variance
Union bound as bad as if independent
Supremum is not what the algorithm chooses.
Next we improve the union bound and extend it to the infinite case
O. Bousquet Statistical Learning Theory Lecture 2 45
Refined union bound (1)
For each f ∈ F,
P[ P f − Pn f > √(log(1/δ(f)) / (2n)) ] ≤ δ(f)
Hence
P[ ∃f ∈ F : P f − Pn f > √(log(1/δ(f)) / (2n)) ] ≤ Σ_{f∈F} δ(f)
Choose δ(f) = δ p(f) with Σ_{f∈F} p(f) = 1
O. Bousquet Statistical Learning Theory Lecture 2 46
Refined union bound (2)
With probability at least 1 − δ,
∀f ∈ F, P f ≤ Pn f + √((log(1/p(f)) + log(1/δ)) / (2n))
Applies to countably infinite F
Can put knowledge about the algorithm into p(f)
But p must be chosen before seeing the data
O. Bousquet Statistical Learning Theory Lecture 2 47
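A sketch of how the weighted union bound trades prior weight for tightness, assuming a hypothetical prior p(f_k) = 2^{−k} over a countable class (the prior and the numbers are mine):

```python
import math

# hypothetical prior over a countable class: p(f_k) = 2**-k, k = 1, 2, ...
def refined_bound(n, k, delta):
    # with probability >= 1 - delta, for all f:
    #   P f <= P_n f + sqrt((log(1/p(f)) + log(1/delta)) / (2n))
    log_inv_p = k * math.log(2)
    return math.sqrt((log_inv_p + math.log(1 / delta)) / (2 * n))

# functions given more prior weight (small k) get a tighter bound
print(refined_bound(1000, 1, 0.05), refined_bound(1000, 10, 0.05))
```

Choosing the prior well (before seeing the data) is exactly the "knowledge improves the bound" point of the next slide.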
Refined union bound (3)
Good p means good bound. The bound can be improved if you know ahead of time which function will be chosen (knowledge improves the bound)
In the infinite case, how to choose p (since it implies an ordering) ?
The trick is to look at F through the data
O. Bousquet Statistical Learning Theory Lecture 2 48
Lecture 3
Infinite Case: Vapnik-Chervonenkis Theory
Growth function
Vapnik-Chervonenkis dimension
Proof of the VC bound
VC entropy
SRM
O. Bousquet Statistical Learning Theory Lecture 3 49
Infinite Case
Measure of the size of an infinite class ?
Consider
F_{z1,...,zn} = {(f(z1), . . . , f(zn)) : f ∈ F}
The size of this set is the number of possible ways in which the data (z1, . . . , zn) can be classified.
Growth function:
S_F(n) = sup_{(z1,...,zn)} |F_{z1,...,zn}|
Note that S_F(n) = S_G(n)
O. Bousquet Statistical Learning Theory Lecture 3 50
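The growth function can be computed exactly for a simple class. The code below enumerates the projections of 1-d threshold functions g_t(x) = sign(x − t) (my illustrative class), for which S(n) = n + 1:

```python
# 1-d threshold functions g_t(x) = sign(x - t); their growth function is n + 1
def projections(points):
    pts = sorted(points)
    # one representative threshold per gap, plus one below and one above
    cuts = [pts[0] - 1] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1]
    return {tuple(1 if x > t else -1 for x in points) for t in cuts}

pts = [0.3, 1.2, -0.5, 2.0]
print(len(projections(pts)))  # 5 = n + 1 distinct classifications of the sample
```

Since n + 1 is far below 2^n, this class cannot shatter two points, which anticipates the VC-dimension discussion below.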
Infinite Case
Result (Vapnik-Chervonenkis). With probability at least 1 − δ,
∀g ∈ G, R(g) ≤ Rn(g) + √(8 (log S_G(2n) + log(4/δ)) / n)
Always better than N in the finite case
How to compute S_G(n) in general ?
⇒ use the VC dimension
O. Bousquet Statistical Learning Theory Lecture 3 51
VC Dimension
Notice that since g ∈ {−1, 1}, S_G(n) ≤ 2^n
If S_G(n) = 2^n, the class of functions can generate any classification on n points (shattering)
Definition 2. The VC-dimension of G is the largest n such that S_G(n) = 2^n
O. Bousquet Statistical Learning Theory Lecture 3 52
VC Dimension
Hyperplanes
In R^d, VC(hyperplanes) = d + 1
O. Bousquet Statistical Learning Theory Lecture 3 53
Number of Parameters
Is VC dimension equal to number of parameters ?
One parameter: {sgn(sin(tx)) : t ∈ R}
Infinite VC dimension !
O. Bousquet Statistical Learning Theory Lecture 3 54
VC Dimension
We want to know S_G(n), but we only know S_G(n) = 2^n for n ≤ h. What happens for n ≥ h ?
[Figure: log S(m) as a function of m — linear growth up to m = h, then slower growth]
O. Bousquet Statistical Learning Theory Lecture 3 55
Vapnik-Chervonenkis-Sauer-Shelah Lemma
Lemma 3. Let G be a class of functions with finite VC-dimension h. Then for all n ∈ N,
S_G(n) ≤ Σ_{i=0}^h C(n, i)
and for all n ≥ h,
S_G(n) ≤ (en/h)^h
⇒ phase transition at n = h
O. Bousquet Statistical Learning Theory Lecture 3 56
VC Bound
Let G be a class with VC dimension h. With probability at least 1 − δ,
∀g ∈ G, R(g) ≤ Rn(g) + √(8 (h log(2en/h) + log(4/δ)) / n)
So the error is of order √(h log n / n)
O. Bousquet Statistical Learning Theory Lecture 3 57
Interpretation
VC dimension: measure of effective dimension
Depends on geometry of the class
Gives a natural definition of simplicity (by quantifying the potential overfitting)
Not related to the number of parameters
Finiteness guarantees learnability under any distribution
O. Bousquet Statistical Learning Theory Lecture 3 58
Symmetrization (lemma)
Key ingredient in VC bounds: Symmetrization
Let Z′1, . . . , Z′n be an independent (ghost) sample and P′n the corresponding empirical measure.
Lemma 4. For any t > 0 such that nt² ≥ 2,
P[ sup_{f∈F} (P − Pn)f ≥ t ] ≤ 2 P[ sup_{f∈F} (P′n − Pn)f ≥ t/2 ]
O. Bousquet Statistical Learning Theory Lecture 3 59
Symmetrization (proof 1)
Let fn be the function achieving the supremum (it depends on Z1, . . . , Zn). Then
1_{(P−Pn)fn>t} 1_{(P−P′n)fn≤t/2} ≤ 1_{(P′n−Pn)fn≥t/2}
Taking expectations with respect to the second (ghost) sample gives
1_{(P−Pn)fn>t} P′[ (P − P′n)fn < t/2 ] ≤ P′[ (P′n − Pn)fn > t/2 ]
O. Bousquet Statistical Learning Theory Lecture 3 60
Symmetrization (proof 2)
By Chebyshev's inequality,
P′[ (P − P′n)fn ≥ t/2 ] ≤ 4 Var[fn] / (n t²) ≤ 1 / (n t²)
Hence
1_{(P−Pn)fn>t} (1 − 1/(n t²)) ≤ P′[ (P′n − Pn)fn > t/2 ]
Take expectation with respect to first sample.
O. Bousquet Statistical Learning Theory Lecture 3 61
Proof of VC bound (1)
Symmetrization allows to replace the expectation by an average on the ghost sample
Function class projected on the double sample: F_{Z1,...,Zn,Z′1,...,Z′n}
Union bound on F_{Z1,...,Zn,Z′1,...,Z′n}
Variant of Hoeffding's inequality:
P[ Pn f − P′n f > t ] ≤ 2 e^{−nt²/2}
O. Bousquet Statistical Learning Theory Lecture 3 62
Proof of VC bound (2)
P[ sup_{f∈F} (P − Pn)f ≥ t ]
≤ 2 P[ sup_{f∈F} (P′n − Pn)f ≥ t/2 ]
= 2 P[ sup_{f∈F_{Z1,...,Zn,Z′1,...,Z′n}} (P′n − Pn)f ≥ t/2 ]
≤ 2 S_F(2n) P[ (P′n − Pn)f ≥ t/2 ]
≤ 4 S_F(2n) e^{−nt²/8}
O. Bousquet Statistical Learning Theory Lecture 3 63
VC Entropy (1)
VC dimension is distribution independent
The same bound holds for any distribution
It is loose for most distributions
A similar proof can give a distribution-dependent result
O. Bousquet Statistical Learning Theory Lecture 3 64
VC Entropy (2)
Denote the size of the projection by N(F, z1, . . . , zn) := #F_{z1,...,zn}
The VC entropy is defined as
H_F(n) = log E[ N(F, Z1, . . . , Zn) ]
VC entropy bound: with probability at least 1 − δ,
∀g ∈ G, R(g) ≤ Rn(g) + √(8 (H_G(2n) + log(2/δ)) / n)
O. Bousquet Statistical Learning Theory Lecture 3 65
VC Entropy (proof)
Introduce σi ∈ {−1, 1} uniform (probability 1/2), the Rademacher variables. Then
2 P[ sup_{f∈F_{Z,Z′}} (P′n − Pn)f ≥ t/2 ]
= 2 E[ P_σ[ sup_{f∈F_{Z,Z′}} (1/n) Σ_{i=1}^n σi (f(Z′i) − f(Zi)) ≥ t/2 ] ]
≤ 2 E[ N(F, Z, Z′) P_σ[ (1/n) Σ_{i=1}^n σi ≥ t/2 ] ]
≤ 2 E[ N(F, Z, Z′) ] e^{−nt²/8}
O. Bousquet Statistical Learning Theory Lecture 3 66
From Bounds to Algorithms
For any distribution, H_G(n)/n → 0 ensures consistency of the empirical risk minimizer (i.e. convergence to the best in the class)
Does it mean we can learn anything ?
No, because of the approximation error of the class
Need to trade off approximation and estimation error (assessed by the bound)
Use the bound to control the trade-off
O. Bousquet Statistical Learning Theory Lecture 3 67
Structural risk minimization
Structural risk minimization (SRM) (Vapnik, 1979): minimize the right-hand side of
R(g) ≤ Rn(g) + B(h, n).
To this end, introduce a structure on G.
Learning machine = a set of functions and an induction principle
O. Bousquet Statistical Learning Theory Lecture 3 68
SRM: The Picture
[Figure: SRM picture — over a structure S_{n−1} ⊂ S_n ⊂ S_{n+1}, the training error decreases with capacity h while the capacity term increases; the bound on the test error R(f*) is minimized at an intermediate element]
O. Bousquet Statistical Learning Theory Lecture 3 69
Lecture 4
Capacity Measures
Covering numbers
Rademacher averages
Relationships
O. Bousquet Statistical Learning Theory Lecture 4 70
Covering numbers
Define a (random) distance d between functions, e.g.
d(f, f′) = (1/n) #{i = 1, . . . , n : f(Zi) ≠ f′(Zi)}
the normalized Hamming distance of the projections on the sample
A set f1, . . . , fN covers F at radius ε if
F ⊂ ∪_{i=1}^N B(fi, ε)
The covering number N(F, ε, n) is the minimum size of a cover of radius ε
Note that N(F, ε, n) = N(G, ε, n).
O. Bousquet Statistical Learning Theory Lecture 4 71
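An empirical cover at radius ε under the normalized Hamming distance can be built greedily. The projected threshold class below is my illustrative choice, and greedy covering is a sketch, not an exact covering-number computation:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.uniform(-1, 1, 50)
thresholds = np.linspace(-1, 1, 201)
# rows: threshold classifiers projected on the sample
proj = np.array([np.where(Z > t, 1, -1) for t in thresholds])

def hamming(a, b):
    # normalized Hamming distance d(f, f') on the sample
    return np.mean(a != b)

def greedy_cover(rows, radius):
    centers = []
    for r in rows:
        if all(hamming(r, c) > radius for c in centers):
            centers.append(r)
    return centers

centers = greedy_cover(proj, 0.1)
print(len(centers))  # far fewer centers than the 201 functions
```

By construction every projected function is within the radius of some center, so the greedy set is a valid (if not minimal) cover.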
Bound with covering numbers
When the covering numbers are finite, one can approximate the class
G by a finite set of functions
Result
P[ ∃g ∈ G : R(g) − Rn(g) ≥ t ] ≤ 8 E[ N(G, t, n) ] e^{−nt²/128}
O. Bousquet Statistical Learning Theory Lecture 4 72
Covering numbers and VC dimension
Notice that for all t, N(G, t, n) ≤ #G_{Z1,...,Zn} = N(G, Z1, . . . , Zn)
Hence log N(G, t, n) ≤ h log(en/h)
Haussler:
N(G, t, n) ≤ C h (4e)^h t^{−h}
⇒ independent of n
O. Bousquet Statistical Learning Theory Lecture 4 73
Refinement
VC entropy corresponds to log covering numbers at minimal scale
The covering number bound is a generalization where the scale is adapted to the error
Is this the right scale ?
It turns out that results can be improved by considering all scales (chaining)
O. Bousquet Statistical Learning Theory Lecture 4 74
Rademacher averages
Rademacher variables: σ1, . . . , σn independent random variables with
P[σi = 1] = P[σi = −1] = 1/2
Notation (randomized empirical measure): Rn f = (1/n) Σ_{i=1}^n σi f(Zi)
Rademacher average: R(F) = E[ sup_{f∈F} Rn f ]
Conditional Rademacher average: Rn(F) = E_σ[ sup_{f∈F} Rn f ]
O. Bousquet Statistical Learning Theory Lecture 4 75
Result
Distribution dependent:
∀f ∈ F, P f ≤ Pn f + 2 R(F) + √(log(1/δ) / (2n))
Data dependent:
∀f ∈ F, P f ≤ Pn f + 2 Rn(F) + √(2 log(2/δ) / n)
O. Bousquet Statistical Learning Theory Lecture 4 76
Concentration
Hoeffding's inequality is a concentration inequality
When n increases, the average is concentrated around the expectation
Generalizations to functions that depend on i.i.d. random variables exist
O. Bousquet Statistical Learning Theory Lecture 4 77
McDiarmid's Inequality
Assume that for all i = 1, . . . , n,
sup_{z1,...,zn, z′i} |F(z1, . . . , zi, . . . , zn) − F(z1, . . . , z′i, . . . , zn)| ≤ c
then for all ε > 0,
P[ |F − E[F]| > ε ] ≤ 2 exp(−2ε² / (nc²))
O. Bousquet Statistical Learning Theory Lecture 4 78
Proof of Rademacher average bounds
Use concentration to relate sup_{f∈F} (P f − Pn f) to its expectation
Use symmetrization to relate the expectation to the Rademacher average
Use concentration again to relate the Rademacher average to the conditional one
O. Bousquet Statistical Learning Theory Lecture 4 79
Application (1)
sup_{f∈F} (A(f) + B(f)) ≤ sup_{f∈F} A(f) + sup_{f∈F} B(f)
Hence
| sup_{f∈F} C(f) − sup_{f∈F} A(f) | ≤ sup_{f∈F} (C(f) − A(f))
this gives (P′n denoting the empirical measure with one sample point changed)
| sup_{f∈F} (P f − Pn f) − sup_{f∈F} (P f − P′n f) | ≤ sup_{f∈F} (P′n f − Pn f)
O. Bousquet Statistical Learning Theory Lecture 4 80
Application (2)
f ∈ {0, 1}, hence
P′n f − Pn f = (1/n)(f(Z′i) − f(Zi)) ≤ 1/n
thus
| sup_{f∈F} (P f − Pn f) − sup_{f∈F} (P f − P′n f) | ≤ 1/n
McDiarmid's inequality can be applied with c = 1/n
O. Bousquet Statistical Learning Theory Lecture 4 81
Symmetrization (1)
Upper bound:
E[ sup_{f∈F} (P f − Pn f) ] ≤ 2 E[ sup_{f∈F} Rn f ]
Lower bound:
E[ sup_{f∈F} |P f − Pn f| ] ≥ (1/2) E[ sup_{f∈F} Rn f ] − 1/(2√n)
O. Bousquet Statistical Learning Theory Lecture 4 82
Symmetrization (2)
E[ sup_{f∈F} (P f − Pn f) ]
= E[ sup_{f∈F} (E′[P′n f] − Pn f) ]
≤ E_{Z,Z′}[ sup_{f∈F} (P′n f − Pn f) ]
= E_{σ,Z,Z′}[ sup_{f∈F} (1/n) Σ_{i=1}^n σi (f(Z′i) − f(Zi)) ]
≤ 2 E[ sup_{f∈F} Rn f ]
O. Bousquet Statistical Learning Theory Lecture 4 83
Loss class and initial class
R(F) = E[ sup_{g∈G} (1/n) Σ_{i=1}^n σi 1_{g(Xi)≠Yi} ]
= E[ sup_{g∈G} (1/n) Σ_{i=1}^n σi (1 − Yi g(Xi))/2 ]
= (1/2) E[ sup_{g∈G} (1/n) Σ_{i=1}^n (−σi Yi) g(Xi) ]
= (1/2) R(G)
(since −σi Yi has the same distribution as σi)
O. Bousquet Statistical Learning Theory Lecture 4 84
Computing Rademacher averages (1)
(1/2) E[ sup_{g∈G} (1/n) Σ_{i=1}^n σi g(Xi) ]
= 1/2 + E[ sup_{g∈G} ( −(1/n) Σ_{i=1}^n (1 − σi g(Xi))/2 ) ]
= 1/2 − E[ inf_{g∈G} (1/n) Σ_{i=1}^n (1 − σi g(Xi))/2 ]
= 1/2 − E[ inf_{g∈G} Rn(g, σ) ]
where Rn(g, σ) is the empirical risk of g with respect to the random labels σi.
O. Bousquet Statistical Learning Theory Lecture 4 85
Computing Rademacher averages (2)
Not harder than computing the empirical risk minimizer:
pick the σi at random and minimize the error with respect to the labels σi
Intuition: measures how much the class can fit random noise
Large class ⇒ R(G) = 1/2
O. Bousquet Statistical Learning Theory Lecture 4 86
Concentration again
Let F = E_σ[ sup_{f∈F} Rn f ]
(expectation with respect to the σi only, with the (Xi, Yi) fixed)
F satisfies McDiarmid's assumptions with c = 1/n
⇒ E[F] = R(F) can be estimated by F = Rn(F)
O. Bousquet Statistical Learning Theory Lecture 4 87
Relationship with VC dimension
For a finite set F = {f1, . . . , fN},
R(F) ≤ 2 √(log N / n)
Consequence for a VC class F with dimension h:
R(F) ≤ 2 √(h log(en/h) / n).
Recovers the VC bound with a concentration proof
O. Bousquet Statistical Learning Theory Lecture 4 88
Chaining
Using covering numbers at all scales, the geometry of the class is
better captured
Dudley: Rn(F) ≤ (C/√n) ∫_0^∞ √(log N(F, t, n)) dt
Consequence: R(F) ≤ C √(h/n)
Removes the unnecessary log n factor !
O. Bousquet Statistical Learning Theory Lecture 4 89
Lecture 5
Advanced Topics
Relative error bounds
Noise conditions
Localized Rademacher averages
PAC-Bayesian bounds
O. Bousquet Statistical Learning Theory Lecture 5 90
Binomial tails
Pn f has distribution B(p, n)/n — a binomial distribution with p = P f:
P[ P f − Pn f ≥ t ] = Σ_{k=0}^{⌊n(p−t)⌋} C(n, k) p^k (1−p)^{n−k}
Can be upper bounded:
Exponential: ((1−p)/(1−p−t))^{n(1−p−t)} (p/(p+t))^{n(p+t)}
Bennett: exp( −(np/(1−p)) ((1 − t/p) log(1 − t/p) + t/p) )
Bernstein: exp( −nt² / (2p(1−p) + 2t/3) )
Hoeffding: exp(−2nt²)
O. Bousquet Statistical Learning Theory Lecture 5 91
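Two of the bounds above can be compared numerically (the parameter values are illustrative): at small variance, Bernstein's variance-dependent exponent beats Hoeffding's worst-case one.

```python
import math

def hoeffding_tail(n, t):
    return math.exp(-2 * n * t * t)

def bernstein_tail(n, t, p):
    # variance term 2p(1-p) replaces Hoeffding's worst-case constant
    return math.exp(-n * t * t / (2 * p * (1 - p) + 2 * t / 3))

n, t, p = 1000, 0.05, 0.1   # small variance: p(1-p) = 0.09
print(hoeffding_tail(n, t), bernstein_tail(n, t, p))
```

For p close to 1/2 the two are comparable; the gap opens exactly where learning cares, i.e. at functions with small risk.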
Tail behavior
For small deviations, Gaussian behavior: exp(−nt²/(2p(1−p)))
⇒ Gaussian with variance p(1−p)
For large deviations, Poisson behavior: exp(−3nt/2)
Tails heavier than Gaussian
Can upper bound with a Gaussian with the largest (maximum) variance: exp(−2nt²)
O. Bousquet Statistical Learning Theory Lecture 5 92
Illustration (1)
Maximum variance (p = 0.5)
[Figure: deviation bound as a function of x — the binomial tail (BinoCDF) compared with the Bernstein, Bennett, exact binomial-tail and best Gaussian bounds]
O. Bousquet Statistical Learning Theory Lecture 5 93
Illustration (2)
Small variance (p = 0.1)
[Figures: deviation bounds at small variance — BinoCDF compared with the sub-Gaussian, Bernstein, Bennett, exact binomial-tail and best Gaussian bounds, with a zoom on the tail region]
O. Bousquet Statistical Learning Theory Lecture 5 94
Taking the variance into account (1)
Each function f ∈ F has a different variance P f(1 − P f) ≤ P f.
For each f ∈ F, by Bernstein's inequality, with probability at least 1 − δ,
P f ≤ Pn f + √(2 P f log(1/δ) / n) + 2 log(1/δ) / (3n)
The Gaussian part dominates (for P f not too small, or n large enough); it depends on P f
O. Bousquet Statistical Learning Theory Lecture 5 95
Taking the variance into account (2)
Central Limit Theorem:
√n (P f − Pn f) / √(P f(1 − P f)) → N(0, 1)
Idea: consider the ratio (P f − Pn f) / √(P f)
O. Bousquet Statistical Learning Theory Lecture 5 96
Normalization
Here (f ∈ {0, 1}), Var[f] ≤ P f² = P f
Large variance ⇒ large risk. After normalization, the fluctuations are more uniform:
sup_{f∈F} (P f − Pn f) / √(P f)
is not necessarily attained at functions with large variance.
Focus of learning: functions with small error P f (hence small variance). The normalized supremum takes this into account.
O. Bousquet Statistical Learning Theory Lecture 5 97
Relative deviations
Vapnik-Chervonenkis 1974
For δ > 0, with probability at least 1 − δ,
∀f ∈ F, (P f − Pn f) / √(P f) ≤ 2 √((log S_F(2n) + log(4/δ)) / n)
and
∀f ∈ F, (Pn f − P f) / √(Pn f) ≤ 2 √((log S_F(2n) + log(4/δ)) / n)
O. Bousquet Statistical Learning Theory Lecture 5 98
Proof sketch
1. Symmetrization
P[ sup_{f∈F} (P f − Pn f)/√(P f) ≥ t ] ≤ 2 P[ sup_{f∈F} (P′n f − Pn f)/√((Pn f + P′n f)/2) ≥ t ]
2. Randomization
= 2 E[ P_σ[ sup_{f∈F} ((1/n) Σ_{i=1}^n σi (f(Z′i) − f(Zi))) / √((Pn f + P′n f)/2) ≥ t ] ]
3. Tail bound
O. Bousquet Statistical Learning Theory Lecture 5 99
Consequences
From the fact
A ≤ B + C√A  ⇒  A ≤ B + C² + C√B
we get
∀f ∈ F, P f ≤ Pn f + 2 √( Pn f (log S_F(2n) + log(4/δ)) / n ) + 4 (log S_F(2n) + log(4/δ)) / n
O. Bousquet Statistical Learning Theory Lecture 5 100
Zero noise
Ideal situation
gn empirical risk minimizer, t ∈ G
R* = 0 (no noise, n(X) = 0 a.s.)
In that case
Rn(gn) = 0
R(gn) = O(h log n / n).
O. Bousquet Statistical Learning Theory Lecture 5 101
Interpolating between rates ?
Rates are not correctly estimated by this inequality
Consequence of the relative error bounds:
P fn ≤ P f* + 2 √( P f* (log S_F(2n) + log(4/δ)) / n ) + 4 (log S_F(2n) + log(4/δ)) / n
The quantity which is small is not P f* but P fn − P f*
But relative error bounds do not apply to differences
O. Bousquet Statistical Learning Theory Lecture 5 102
Definitions
P = P_X × P(Y|X)
regression function: η(x) = E[Y|X = x]
target function: t(x) = sgn η(x)
noise level: n(X) = (1 − |η(X)|)/2
Bayes risk: R* = E[n(X)]
R(g) = E[(1 − η(X))/2] + E[η(X) 1_{g(X)≤0}]
R(g) − R* = E[|η(X)| 1_{g(X)η(X)≤0}]
O. Bousquet Statistical Learning Theory Lecture 5 103
Intermediate noise
Instead of assuming that |η(x)| = 1 (i.e. n(x) = 0, the deterministic case), one can assume that n is well-behaved.
Two kinds of assumptions
n not too close to 1/2
n not often too close to 1/2
O. Bousquet Statistical Learning Theory Lecture 5 104
Massart Condition
For some c > 0, assume
|η(X)| > 1/c almost surely
There is no region where the decision is completely random
Noise bounded away from 1/2
O. Bousquet Statistical Learning Theory Lecture 5 105
Tsybakov Condition
Let α ∈ [0, 1]. Equivalent conditions:

(1) ∃c > 0, ∀g ∈ {−1, 1}^X, P[g(X)η(X) ≤ 0] ≤ c (R(g) − R*)^α

(2) ∃c > 0, ∀A ⊆ X, ∫_A dP(x) ≤ c ( ∫_A |η(x)| dP(x) )^α

(3) ∃B > 0, ∀t ≥ 0, P[|η(X)| ≤ t] ≤ B t^{α/(1−α)}
O. Bousquet Statistical Learning Theory Lecture 5 106
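Condition (3) is easy to check numerically. The sketch below, an illustrative example not taken from the slides, uses η(x) = x with X uniform on [−1, 1], for which P[|η(X)| ≤ t] = t, i.e. α/(1−α) = 1 and B = 1, so the Tsybakov exponent is α = 1/2:

```python
import numpy as np

# Empirical check of Tsybakov's condition (3) for eta(x) = x, X ~ U[-1, 1]:
# P[|eta(X)| <= t] = t = B * t**(alpha/(1-alpha)) with B = 1, alpha = 1/2.
# Illustrative example only (eta, B, alpha are not from the lecture).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=200_000)
eta = X  # regression function

alpha, B = 0.5, 1.0
ts = np.linspace(0.05, 1.0, 20)
probs = np.array([(np.abs(eta) <= t).mean() for t in ts])
bound = B * ts ** (alpha / (1 - alpha))
ok = bool(np.all(probs <= bound + 0.01))  # small slack for sampling noise
print(ok)
```

A noisier η concentrated near 0 would force a smaller α (weaker condition).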
Equivalence
(1) ⇔ (2): Recall R(g) − R* = E[ |η(X)| 1[g(X)η(X)≤0] ]. For each function g, there exists a set A such that 1[A] = 1[gη≤0] (and conversely).

(2) ⇒ (3): Let A = {x : |η(x)| ≤ t}. Then

P[|η| ≤ t] = ∫_A dP(x) ≤ c ( ∫_A |η(x)| dP(x) )^α ≤ c t^α ( ∫_A dP(x) )^α

⇒ P[|η| ≤ t] ≤ c^{1/(1−α)} t^{α/(1−α)}
O. Bousquet Statistical Learning Theory Lecture 5 107
(3) ⇒ (1)

R(g) − R* = E[ |η| 1[gη≤0] ] ≥ t E[ 1[gη≤0] 1[|η|>t] ]
= t P[|η| > t] − t E[ 1[gη>0] 1[|η|>t] ]
≥ t (1 − B t^{α/(1−α)}) − t P[gη > 0]
= t ( P[gη ≤ 0] − B t^{α/(1−α)} )

Take t = ( (1−α) P[gη ≤ 0] / B )^{(1−α)/α} to get

P[gη ≤ 0] ≤ ( B^{1−α} / ( α^α (1−α)^{1−α} ) ) (R(g) − R*)^α
O. Bousquet Statistical Learning Theory Lecture 5 108
Remarks
α is in [0, 1] because

R(g) − R* = E[ |η(X)| 1[gη≤0] ] ≤ E[ 1[gη≤0] ]

α = 0: no condition

α = 1: gives Massart's condition
O. Bousquet Statistical Learning Theory Lecture 5 109
Consequences
Under Massart's condition:

E[ (1[g(X)≠Y] − 1[t(X)≠Y])² ] ≤ c (R(g) − R*)

Under Tsybakov's condition:

E[ (1[g(X)≠Y] − 1[t(X)≠Y])² ] ≤ c (R(g) − R*)^α
O. Bousquet Statistical Learning Theory Lecture 5 110
Relative loss class
F is the loss class associated to G
The relative loss class is defined as

F̃ = { f − f* : f ∈ F }

It satisfies

P f̃² ≤ c (P f̃)^α
O. Bousquet Statistical Learning Theory Lecture 5 111
Finite case
Union bound on F̃ with Bernstein's inequality would give

P fn − P f* ≤ Pn fn − Pn f* + √( 8c (P fn − P f*)^α log(N/δ) / n ) + 4 log(N/δ) / (3n)

Consequence when f* ∈ F (but R* > 0):

P fn − P f* ≤ C ( log(N/δ) / n )^{1/(2−α)}

always better than n^{−1/2} for α > 0
O. Bousquet Statistical Learning Theory Lecture 5 112
Local Rademacher average
Definition
R(F, r) = E[ sup_{f∈F : P f² ≤ r} Rn f ]

Allows to generalize the previous result

Computes the capacity of a small ball in F (functions with small variance)
Under noise conditions, small variance implies small error
O. Bousquet Statistical Learning Theory Lecture 5 113
Sub-root functions
Definition
A function φ : R+ → R+ is sub-root if

φ is non-decreasing

φ is non-negative

φ(r)/√r is non-increasing
O. Bousquet Statistical Learning Theory Lecture 5 114
Sub-root functions
Properties
A sub-root function

is continuous

has a unique (non-zero) fixed point: φ(r*) = r*

[Plot: a sub-root function φ(x) crossing the line y = x at its fixed point]
O. Bousquet Statistical Learning Theory Lecture 5 115
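Because φ(r)/√r is non-increasing, the fixed point can be found by simply iterating r ← φ(r). A minimal sketch, using the toy sub-root function φ(r) = c√r (my choice for illustration, not a bound from the slides), whose fixed point is r* = c²:

```python
import math

# Fixed-point iteration for a sub-root function.
# Sub-root => unique positive fixed point r*, and r <- phi(r) converges
# to it from any r0 > 0. phi(r) = c*sqrt(r) is a toy illustrative choice.
def fixed_point(phi, r0=1.0, tol=1e-12, max_iter=10_000):
    r = r0
    for _ in range(max_iter):
        r_new = phi(r)
        if abs(r_new - r) < tol:
            return r_new
        r = r_new
    return r

c = 2.0
phi = lambda r: c * math.sqrt(r)
r_star = fixed_point(phi, r0=0.1)
print(round(r_star, 6))  # -> 4.0, since c*sqrt(r) = r gives r* = c**2
```

The same iteration applies to the (data-dependent) local Rademacher bounds below, where φ is an empirical sub-root upper bound.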
Star hull
Definition
⋆F = { λf : f ∈ F, λ ∈ [0, 1] }

Properties:

Rn(⋆F, r) is sub-root

Entropy of ⋆F is not much bigger than entropy of F
O. Bousquet Statistical Learning Theory Lecture 5 116
Result
r* fixed point of R(⋆F, r)

Bounded functions:

P f − Pn f ≤ C ( √(r* Var[f]) + (log(1/δ) + log log n) / n )

Consequence for variance related to expectation (Var[f] ≤ c (P f)^α):

P f ≤ C ( Pn f + (r*)^{1/(2−α)} + (log(1/δ) + log log n) / n )
O. Bousquet Statistical Learning Theory Lecture 5 117
Consequences
For VC classes, R(F, r) ≤ C √(r h / n), hence r* ≤ C h / n

Rate of convergence of Pn f to P f is O(1/√n)

But rate of convergence of P fn to P f* is O(1/n^{1/(2−α)})

Only condition is t ∈ G, but this can be removed by SRM/model selection
O. Bousquet Statistical Learning Theory Lecture 5 118
Proof sketch (1)
Talagrand's inequality:

sup_{f∈F} (P f − Pn f) ≤ E[ sup_{f∈F} (P f − Pn f) ] + c √( sup_{f∈F} Var[f] / n ) + c/n

Peeling of the class:

Fk = { f : Var[f] ∈ [x^k, x^{k+1}) }
O. Bousquet Statistical Learning Theory Lecture 5 119
Proof sketch (2)
Application
sup_{f∈Fk} (P f − Pn f) ≤ E[ sup_{f∈Fk} (P f − Pn f) ] + c √( x Var[f] / n ) + c/n

Symmetrization:

∀f ∈ F, P f − Pn f ≤ 2 R(F, x Var[f]) + c √( x Var[f] / n ) + c/n
O. Bousquet Statistical Learning Theory Lecture 5 120
Proof sketch (3)
We need to solve this inequality. Things are simple if R behaves like a square root, hence the sub-root property:

P f − Pn f ≤ 2 √(r* Var[f]) + c √( x Var[f] / n ) + c/n

Variance-expectation: Var[f] ≤ c (P f)^α

Solve in P f
O. Bousquet Statistical Learning Theory Lecture 5 121
Data-dependent version
As in the global case, one can use data-dependent local Rademacher averages:

Rn(F, r) = E_σ[ sup_{f∈F : P f² ≤ r} Rn f ]

Using concentration, one can also get

P f ≤ C ( Pn f + (rn*)^{1/(2−α)} + (log(1/δ) + log log n) / n )

where rn* is the fixed point of a sub-root upper bound of Rn(⋆F, r)
O. Bousquet Statistical Learning Theory Lecture 5 122
Discussion
Improved rates under low noise conditions
Interpolation in the rates

Capacity measure seems local, but depends on all the functions, after appropriate rescaling: each f ∈ F is considered at scale r / P f²
O. Bousquet Statistical Learning Theory Lecture 5 123
Randomized Classifiers
Given G a class of functions
Deterministic: pick a function gn and always use it to predict

Randomized: construct a distribution ρn over G; for each instance to classify, pick g ∼ ρn

Error is averaged over ρn:

R(ρn) = ρn P f,  Rn(ρn) = ρn Pn f
O. Bousquet Statistical Learning Theory Lecture 5 124
Union Bound (1)
Let π be a (fixed) distribution over F.

Recall the refined union bound:

∀f ∈ F, P f − Pn f ≤ √( (log(1/π(f)) + log(1/δ)) / (2n) )

Take expectation with respect to ρn:

ρn P f − ρn Pn f ≤ ρn √( (log(1/π(f)) + log(1/δ)) / (2n) )
O. Bousquet Statistical Learning Theory Lecture 5 125
Union Bound (2)
ρn P f − ρn Pn f ≤ ρn √( (log(1/π(f)) + log(1/δ)) / (2n) )
≤ √( (ρn log(1/π(f)) + log(1/δ)) / (2n) )
≤ √( (K(ρn, π) + H(ρn) + log(1/δ)) / (2n) )

K(ρn, π) = ∫ ρn(f) log( ρn(f)/π(f) ) df : Kullback-Leibler divergence

H(ρn) = − ∫ ρn(f) log ρn(f) df : Entropy
O. Bousquet Statistical Learning Theory Lecture 5 126
PAC-Bayesian Refinement
It is possible to improve the previous bound.
With probability at least 1 − δ,

ρn P f − ρn Pn f ≤ √( (K(ρn, π) + log(4n) + log(1/δ)) / (2n − 1) )

Good if ρn is spread (i.e. large entropy)

Not interesting if ρn = δ_{fn}
O. Bousquet Statistical Learning Theory Lecture 5 127
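The spread-vs-concentrated remark can be made concrete by evaluating the complexity term of the bound for two posteriors over a hypothetical finite class (the class size, n, and δ below are illustrative choices, not values from the lecture):

```python
import numpy as np

# Evaluate the PAC-Bayes complexity term
#   (K(rho_n, pi) + log(4n) + log(1/delta)) / (2n - 1)
# for a uniform prior pi over a hypothetical class of N functions.
# This only computes the right-hand side; it does not verify the bound.

def kl(rho, pi):
    """Kullback-Leibler divergence K(rho, pi) for discrete distributions."""
    rho, pi = np.asarray(rho, float), np.asarray(pi, float)
    mask = rho > 0
    return float(np.sum(rho[mask] * np.log(rho[mask] / pi[mask])))

N, n, delta = 100, 1000, 0.05
pi = np.full(N, 1.0 / N)
rho_spread = np.full(N, 1.0 / N)             # rho_n = pi: K = 0
rho_point = np.zeros(N); rho_point[0] = 1.0  # rho_n = delta_{f_n}: K = log N

term_spread = (kl(rho_spread, pi) + np.log(4 * n) + np.log(1 / delta)) / (2 * n - 1)
term_point = (kl(rho_point, pi) + np.log(4 * n) + np.log(1 / delta)) / (2 * n - 1)
print(term_spread < term_point)  # spread posterior pays no KL cost
```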
Proof (1)
Variational formulation of entropy: for any T,

ρ T(f) ≤ log E_π[ e^{T(f)} ] + K(ρ, π)

Apply it to T(f) = λ (P f − Pn f)²:

λ ρn (P f − Pn f)² ≤ log E_π[ e^{λ(P f − Pn f)²} ] + K(ρn, π)

Markov's inequality: with probability 1 − δ,

λ ρn (P f − Pn f)² ≤ log E E_π[ e^{λ(P f − Pn f)²} ] + K(ρn, π) + log(1/δ)
O. Bousquet Statistical Learning Theory Lecture 5 128
Proof (2)
Fubini:

E E_π[ e^{λ(P f − Pn f)²} ] = E_π E[ e^{λ(P f − Pn f)²} ]

Modified Chernoff bound:

E[ e^{(2n−1)(P f − Pn f)²} ] ≤ 4n

Putting together (λ = 2n − 1):

(2n − 1) ρn (P f − Pn f)² ≤ K(ρn, π) + log 4n + log(1/δ)

Jensen:

(2n − 1) ( ρn (P f − Pn f) )² ≤ (2n − 1) ρn (P f − Pn f)²
O. Bousquet Statistical Learning Theory Lecture 5 129
Lecture 6
Loss Functions
Properties
Consistency
Examples
Losses and noise
O. Bousquet Statistical Learning Theory Lecture 6 130
Motivation (1)
ERM: minimize Σ_{i=1}^n 1[g(Xi)≠Yi] in a set G

Computationally hard

Smoothing:

Replace binary by real-valued functions

Introduce a smooth loss function

Σ_{i=1}^n φ(g(Xi), Yi)
O. Bousquet Statistical Learning Theory Lecture 6 131
Motivation (2)
Hyperplanes in infinite dimension have infinite VC-dimension but finite scale-sensitive dimension (to be defined later)

It is good to have a scale

This scale can be used to give a confidence (i.e. estimate the density)

However, losses do not need to be related to densities

Can get bounds in terms of margin error instead of empirical error (smoother, easier to optimize for model selection)
O. Bousquet Statistical Learning Theory Lecture 6 132
Margin
It is convenient to work with (symmetry of +1 and −1):

φ(g(x), y) = φ(y g(x))

y g(x) is the margin of g at (x, y)

Loss:

L(g) = E[ φ(Y g(X)) ],  Ln(g) = (1/n) Σ_{i=1}^n φ(Yi g(Xi))

Loss class: F = { f : (x, y) ↦ φ(y g(x)) : g ∈ G }
O. Bousquet Statistical Learning Theory Lecture 6 133
Minimizing the loss
Decomposition of L(g):

L(g) = (1/2) E[ E[ (1 + η(X)) φ(g(X)) + (1 − η(X)) φ(−g(X)) | X ] ]

Minimization for each x:

H(η) = inf_{α∈R} ( (1 + η) φ(α)/2 + (1 − η) φ(−α)/2 )

L* := inf_g L(g) = E[ H(η(X)) ]
O. Bousquet Statistical Learning Theory Lecture 6 134
Classification-calibrated
A minimal requirement is that the minimizer in H(η) has the correct sign (that of the target t, i.e. that of η).

Definition: φ is classification-calibrated if, for any η ≠ 0,

inf_{α : αη ≤ 0} ( (1 + η) φ(α) + (1 − η) φ(−α) ) > inf_{α∈R} ( (1 + η) φ(α) + (1 − η) φ(−α) )

This means the infimum is achieved for an α of the correct sign (and not for an α of the wrong sign, except possibly for η = 0).
O. Bousquet Statistical Learning Theory Lecture 6 135
Consequences (1)
Results due to (Jordan, Bartlett and McAuliffe 2003):

φ is classification-calibrated iff for all sequences gi and every probability distribution P,

L(gi) → L*  ⇒  R(gi) → R*

When φ is convex (convenient for optimization), φ is classification-calibrated iff it is differentiable at 0 and φ′(0) < 0
O. Bousquet Statistical Learning Theory Lecture 6 136
Consequences (2)
Let H−(η) = inf_{α : αη ≤ 0} ( (1 + η) φ(α)/2 + (1 − η) φ(−α)/2 )

Let ψ be the largest convex function below H−(η) − H(η)

One has

ψ(R(g) − R*) ≤ L(g) − L*
O. Bousquet Statistical Learning Theory Lecture 6 137
Examples (1)
[Plot of the margin losses as functions of y g(x): 0-1, hinge, squared hinge, square, exponential]
O. Bousquet Statistical Learning Theory Lecture 6 138
Examples (2)
Hinge loss: φ(x) = max(0, 1 − x), ψ(x) = x

Squared hinge loss: φ(x) = max(0, 1 − x)², ψ(x) = x²

Square loss: φ(x) = (1 − x)², ψ(x) = x²

Exponential: φ(x) = exp(−x), ψ(x) = 1 − √(1 − x²)
O. Bousquet Statistical Learning Theory Lecture 6 139
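The four losses above can be written down and checked directly; in particular, each is a convex upper bound of the 0-1 loss, which is the property ERM with φ exploits. A minimal sketch:

```python
import numpy as np

# The four margin losses from the slide, as functions of the margin x = y*g(x).
# Each upper bounds the 0-1 loss 1[x <= 0].
hinge = lambda x: np.maximum(0.0, 1.0 - x)
sq_hinge = lambda x: np.maximum(0.0, 1.0 - x) ** 2
square = lambda x: (1.0 - x) ** 2
exp_loss = lambda x: np.exp(-x)
zero_one = lambda x: (x <= 0).astype(float)

x = np.linspace(-2.0, 2.0, 401)
for phi in (hinge, sq_hinge, square, exp_loss):
    assert np.all(phi(x) >= zero_one(x))  # upper bound on the 0-1 loss
print("all four losses dominate the 0-1 loss")
```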
Low noise conditions
Relationship can be improved under low noise conditions

Under Tsybakov's condition with exponent α and constant c,

c (R(g) − R*)^α ψ( (R(g) − R*)^{1−α} / (2c) ) ≤ L(g) − L*

Hinge loss (no improvement):

R(g) − R* ≤ L(g) − L*

Square loss or squared hinge loss:

R(g) − R* ≤ ( 4c (L(g) − L*) )^{1/(2−α)}
O. Bousquet Statistical Learning Theory Lecture 6 140
Estimation error
Recall that Tsybakov's condition implies P f̃² ≤ c (P f̃)^α for the relative loss class (with 0−1 loss)

What happens for the relative loss class associated to φ?

Two possibilities:

Strictly convex loss (can modify the metric on R)

Piecewise linear loss
O. Bousquet Statistical Learning Theory Lecture 6 141
Strictly convex losses
Noise behavior controlled by the modulus of convexity

Result:

δ( √(P f̃²) / K ) ≤ P f̃ / 2

with K the Lipschitz constant of φ and δ the modulus of convexity of L(g) with respect to ‖f − g‖_{L2(P)}

Not related to the noise exponent
O. Bousquet Statistical Learning Theory Lecture 6 142
Piecewise linear losses
Noise behavior related to the noise exponent

Result for the hinge loss:

P f̃² ≤ C (P f̃)^α

if the initial class G is uniformly bounded
O. Bousquet Statistical Learning Theory Lecture 6 143
Examples
Under Tsybakov's condition:

Hinge loss:

R(gn) − R* ≤ L(gn) − L* + C ( (r*)^{1/(2−α)} + (log(1/δ) + log log n) / n )

Squared hinge loss or square loss (ψ(x) = c x², P f̃² ≤ C P f̃):

R(gn) − R* ≤ C ( L(gn) − L* + C ( r* + (log(1/δ) + log log n) / n ) )^{1/2}
O. Bousquet Statistical Learning Theory Lecture 6 145
Classification vs Regression losses
Consider a classification-calibrated function φ

It is a classification loss if L(t) = L*

Otherwise it is a regression loss
O. Bousquet Statistical Learning Theory Lecture 6 146
Classification vs Regression losses
Square, squared hinge, exponential losses:

Noise enters the relationship between risk and loss

Modulus of convexity enters in the estimation error

Hinge loss:

Direct relationship between risk and loss

Noise enters in the estimation error

Approximation term not affected by noise in the second case

Real value does not bring probability information in the second case
O. Bousquet Statistical Learning Theory Lecture 6 147
Lecture 7
Regularization
Formulation
Capacity measures
Computing Rademacher averages
Applications
O. Bousquet Statistical Learning Theory Lecture 7 148
Equivalent problems
Up to the choice of the regularization parameters, the following problems are equivalent:

min_{f∈F} Ln(f) + λ Ω(f)

min_{f∈F : Ln(f) ≤ e} Ω(f)

min_{f∈F : Ω(f) ≤ R} Ln(f)

The solution sets are the same
O. Bousquet Statistical Learning Theory Lecture 7 149
Comments
Computationally, a variant of SRM / of model selection by penalization

One has to choose a regularizer which makes sense

Need a class that is large enough (for universal consistency) but has small balls
O. Bousquet Statistical Learning Theory Lecture 7 150
Rates
To obtain bounds, consider ERM on balls

Relevant capacity is that of balls

Real-valued functions: need a generalization of VC dimension, entropy, or covering numbers

Involves scale-sensitive capacity (takes into account the value and not only the sign)
O. Bousquet Statistical Learning Theory Lecture 7 151
Scale-sensitive capacity
Generalization of VC entropy and VC dimension to real-valued functions

Definition: a set x1, ..., xn is shattered by F (at scale ε) if there exists a function s such that for all choices of αi ∈ {−1, 1}, there exists f ∈ F with

αi (f(xi) − s(xi)) ≥ ε

The fat-shattering dimension of F at scale ε (denoted vc(F, ε)) is the maximum cardinality of a shattered set
O. Bousquet Statistical Learning Theory Lecture 7 152
Link with covering numbers
Like the VC dimension, the fat-shattering dimension can be used to upper bound covering numbers

Result:

N(F, t, n) ≤ ( C1 / t )^{C2 vc(F, C3 t)}

Note that one can also define data-dependent versions (restriction on the sample)
O. Bousquet Statistical Learning Theory Lecture 7 153
Link with Rademacher averages (1)
Consequence of covering number estimates:

Rn(F) ≤ (C1/√n) ∫_0^∞ √( vc(F, t) log(C2/t) ) dt

Another link via Gaussian averages (replace the Rademacher variables by Gaussian N(0,1) variables):

Gn(F) = E_g[ sup_{f∈F} (1/n) Σ_{i=1}^n gi f(Zi) ]
O. Bousquet Statistical Learning Theory Lecture 7 154
Link with Rademacher averages (2)
Worst case average:

ℓn(F) = sup_{x1,...,xn} E[ sup_{f∈F} Gn f ]

Associated dimension:

t(F, ε) = sup{ n ∈ N : ℓn(F) ≥ ε }

Result (Mendelson & Vershynin 2003):

vc(F, c ε) ≤ t(F, ε) ≤ (K/ε²) vc(F, c ε)
O. Bousquet Statistical Learning Theory Lecture 7 155
Rademacher averages and Lipschitz losses
What matters is the capacity of F (the loss class)

If φ is Lipschitz with constant M, then

Rn(F) ≤ M Rn(G)

Relates to the Rademacher average of the initial class (easier to compute)
O. Bousquet Statistical Learning Theory Lecture 7 156
Dualization
Consider the problem min_{‖g‖ ≤ R} Ln(g)

Rademacher average of the ball:

E[ sup_{‖g‖ ≤ R} Rn g ]

Duality:

E[ sup_{‖f‖ ≤ R} Rn f ] = (R/n) E‖ Σ_{i=1}^n σi Xi ‖*

with ‖·‖* the dual norm and Xi the evaluation at Xi (an element of the dual under appropriate conditions)
O. Bousquet Statistical Learning Theory Lecture 7 157
RKHS

Given a positive definite kernel k

Space of functions: the reproducing kernel Hilbert space associated to k

Regularizer: the RKHS norm ‖·‖k

Properties: Representer theorem

gn = Σ_{i=1}^n αi k(Xi, ·)
O. Bousquet Statistical Learning Theory Lecture 7 158
Shattering dimension of hyperplanes
Set of functions
G = { g(x) = ⟨w, x⟩ : ‖w‖ = 1 }

Assume ‖x‖ ≤ R

Result:

vc(G, ρ) ≤ R² / ρ²
O. Bousquet Statistical Learning Theory Lecture 7 159
Proof Strategy (Gurvits, 1997)
Assume that x1, . . . , xr are -shattered by hyperplanes with w = 1,i.e., for all y1, . . . , yr {1}, there exists a w such that
yi ⟨w, xi⟩ ≥ ρ for all i = 1, ..., r.  (2)

Two steps:

prove that the more points we want to shatter (2), the larger ‖Σ_{i=1}^r yi xi‖ must be

upper bound the size of ‖Σ_{i=1}^r yi xi‖ in terms of R

Combining the two tells us how many points we can at most shatter
O. Bousquet Statistical Learning Theory Lecture 7 160
Part I
Summing (2) yields ⟨w, Σ_{i=1}^r yi xi⟩ ≥ r ρ

By the Cauchy-Schwarz inequality:

⟨w, Σ_{i=1}^r yi xi⟩ ≤ ‖w‖ ‖Σ_{i=1}^r yi xi‖ = ‖Σ_{i=1}^r yi xi‖

Combine both:

r ρ ≤ ‖Σ_{i=1}^r yi xi‖.  (3)
O. Bousquet Statistical Learning Theory Lecture 7 161
Part II
Consider the labels yi ∈ {−1, 1} as (Rademacher) random variables:

E‖Σ_{i=1}^r yi xi‖² = E[ Σ_{i,j=1}^r yi yj ⟨xi, xj⟩ ]
= Σ_{i=1}^r E[⟨xi, xi⟩] + E[ Σ_{i≠j} yi yj ⟨xi, xj⟩ ]
= Σ_{i=1}^r ‖xi‖²
O. Bousquet Statistical Learning Theory Lecture 7 162
Part II, ctd.
Since ‖xi‖ ≤ R, we get E‖Σ_{i=1}^r yi xi‖² ≤ r R².

This holds for the expectation over the random choices of the labels, hence there must be at least one set of labels for which it also holds true. Use this set.

Hence

‖Σ_{i=1}^r yi xi‖² ≤ r R².
O. Bousquet Statistical Learning Theory Lecture 7 163
Part I and II Combined
Part I: (r ρ)² ≤ ‖Σ_{i=1}^r yi xi‖²

Part II: ‖Σ_{i=1}^r yi xi‖² ≤ r R²

Hence

r² ρ² ≤ r R²,

i.e.,

r ≤ R² / ρ²
O. Bousquet Statistical Learning Theory Lecture 7 164
Boosting
Given a class H of functions

Space of functions: the linear span of H

Regularizer: 1-norm of the weights, ‖g‖ = inf{ Σ |αi| : g = Σ αi hi }

Properties: weight concentrated on the (weighted) margin maximizers

gn = Σ_h wh h, with wh > 0 only for h such that Σ_i di Yi h(Xi) = min_{h′∈H} Σ_i di Yi h′(Xi)
O. Bousquet Statistical Learning Theory Lecture 7 165
Rademacher averages for boosting
Function class of interest:

GR = { g ∈ span H : ‖g‖1 ≤ R }

Result:

Rn(GR) = R Rn(H)

Capacity (as measured by global Rademacher averages) not affected by taking linear combinations!
O. Bousquet Statistical Learning Theory Lecture 7 166
Lecture 8
SVM
Computational aspects
Capacity Control
Universality
Special case of RBF kernel
O. Bousquet Statistical Learning Theory Lecture 8 167
Formulation (1)
Soft margin:

min_{w,b,ξ} (1/2)‖w‖² + C Σ_{i=1}^m ξi
s.t. yi(⟨w, xi⟩ + b) ≥ 1 − ξi, ξi ≥ 0

Convex objective function and convex constraints

Unique solution

Efficient procedures to find it

Is it the right criterion?
O. Bousquet Statistical Learning Theory Lecture 8 168
Formulation (2)

Soft margin:

min_{w,b,ξ} (1/2)‖w‖² + C Σ_{i=1}^m ξi
s.t. yi(⟨w, xi⟩ + b) ≥ 1 − ξi, ξi ≥ 0

Optimal value of ξi:

ξi = max(0, 1 − yi(⟨w, xi⟩ + b))

Substitute above to get

min_{w,b} (1/2)‖w‖² + C Σ_{i=1}^m max(0, 1 − yi(⟨w, xi⟩ + b))
O. Bousquet Statistical Learning Theory Lecture 8 169
Regularization
General form of regularization problem:

min_{f∈F} (1/m) Σ_{i=1}^m c(yi f(xi)) + λ‖f‖²

Capacity control by regularization with a convex cost

[Plot: convex cost as a function of the margin]
O. Bousquet Statistical Learning Theory Lecture 8 170
Loss Function
φ(Y f(X)) = max(0, 1 − Y f(X))

Convex, non-increasing, upper bounds 1[Y f(X) ≤ 0]

Classification-calibrated

Classification type (L* = L(t)):

R(g) − R* ≤ L(g) − L*
O. Bousquet Statistical Learning Theory Lecture 8 171
Regularization
Choosing a kernel corresponds to:

Choose a sequence (ak)

Set

‖f‖² := Σ_{k≥0} ak ∫ |f^{(k)}(x)|² dx

Penalization of high order derivatives (high frequencies)

Enforces smoothness of the solution
O. Bousquet Statistical Learning Theory Lecture 8 172
Capacity: VC dimension

The VC dimension of the set of hyperplanes is d + 1 in R^d.

Dimension of the feature space? ∞ for the RBF kernel

w chosen in the span of the data (w = Σ αi yi xi)

The span of the data has dimension m for the RBF kernel (the k(·, xi) are linearly independent)

The VC bound does not give any information: h/m = 1

Need to take the margin into account
O. Bousquet Statistical Learning Theory Lecture 8 173
Capacity: Shattering dimension
Hyperplanes with margin: if ‖x‖ ≤ R,

vc(hyperplanes with margin ρ, 1) ≤ R² / ρ²
O. Bousquet Statistical Learning Theory Lecture 8 174
Margin
The shattering dimension is related to the margin

Maximizing the margin means minimizing the shattering dimension

Small shattering dimension ⇒ good control of the risk

This control is automatic (no need to choose the margin beforehand)

But requires tuning of the regularization parameter
O. Bousquet Statistical Learning Theory Lecture 8 175
Rademacher Averages (2)
E[ sup_{‖w‖≤M} (1/n) Σ_{i=1}^n σi ⟨w, xi⟩ ]
= E[ sup_{‖w‖≤M} ⟨w, (1/n) Σ_{i=1}^n σi xi⟩ ]
≤ E[ sup_{‖w‖≤M} ‖w‖ ‖(1/n) Σ_{i=1}^n σi xi‖ ]
= (M/n) E √( ⟨Σ_{i=1}^n σi xi, Σ_{i=1}^n σi xi⟩ )
O. Bousquet Statistical Learning Theory Lecture 8 177
Rademacher Averages (3)
(M/n) E √( ⟨Σ_{i=1}^n σi xi, Σ_{i=1}^n σi xi⟩ )
≤ (M/n) √( E ⟨Σ_{i=1}^n σi xi, Σ_{i=1}^n σi xi⟩ )
= (M/n) √( E Σ_{i,j} σi σj ⟨xi, xj⟩ )
= (M/n) √( Σ_{i=1}^n k(xi, xi) )
O. Bousquet Statistical Learning Theory Lecture 8 178
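The trace bound above can be checked by Monte Carlo for the linear kernel k(x, x′) = ⟨x, x′⟩: the supremum over ‖w‖ ≤ M of ⟨w, v⟩ is M‖v‖, so the Rademacher average can be estimated directly and compared to (M/n)√(Σ k(xi, xi)). Toy data and M = 1 below are illustrative choices:

```python
import numpy as np

# Monte Carlo check of the slide's chain:
# E sup_{||w||<=M} (1/n) sum sigma_i <w, x_i> = (M/n) E ||sum sigma_i x_i||
#                                            <= (M/n) sqrt(sum_i k(x_i, x_i))
# with the linear kernel k(x, x') = <x, x'>.
rng = np.random.default_rng(0)
n, d, M = 50, 5, 1.0
X = rng.normal(size=(n, d))

trace_bound = (M / n) * np.sqrt(np.sum(np.einsum('ij,ij->i', X, X)))

sigmas = rng.choice([-1.0, 1.0], size=(5000, n))
# sup over ||w|| <= M of <w, v> is M * ||v||, with v = (1/n) sum sigma_i x_i
sup_values = M * np.linalg.norm(sigmas @ X, axis=1) / n
rademacher_mc = sup_values.mean()
print(rademacher_mc <= trace_bound)  # -> True (Jensen's inequality)
```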
Improved rates: noise condition

Under Massart's condition (|η| > η0 > 0), with ‖g‖∞ ≤ M:

E[ (φ(Y g(X)) − φ(Y t(X)))² ] ≤ (M + 1 + 2/η0) (L(g) − L*)

If the noise is nice, the variance is linearly related to the expectation

Estimation error of order r* (of the class G)
O. Bousquet Statistical Learning Theory Lecture 8 179
Improved rates: capacity (1)

rn* related to the decay of the eigenvalues of the Gram matrix:

rn* ≤ (c/n) min_{d∈N} ( d + √( n Σ_{j>d} λj ) )

Note that d = 0 gives the trace bound

rn* always better than the trace bound (equality when the λi are constant)
O. Bousquet Statistical Learning Theory Lecture 8 180
Improved rates: capacity (2)

Example: exponential decay

λi = e^{−γ i}

Global Rademacher average of order 1/√n

rn* of order (log n)/n
O. Bousquet Statistical Learning Theory Lecture 8 181
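The gain over the trace bound is easy to see numerically: for exponentially decaying eigenvalues, optimizing over d in the bound of the previous slide gives roughly log(n)/n instead of 1/√n. A minimal sketch (c = 1, γ = 1, and n below are illustrative choices):

```python
import numpy as np

# Compare the trace bound (the d = 0 term) with
#   r_n* <= (c/n) * min_d ( d + sqrt( n * sum_{j>d} lambda_j ) )
# for exponentially decaying eigenvalues lambda_i = exp(-gamma * i).
n, gamma, c = 10_000, 1.0, 1.0
lam = np.exp(-gamma * np.arange(1, 200))

tails = np.concatenate(([lam.sum()], lam.sum() - np.cumsum(lam)))  # sum_{j>d}
ds = np.arange(len(tails))
bounds = (c / n) * (ds + np.sqrt(np.maximum(n * tails, 0.0)))

trace_bound = bounds[0]    # d = 0 term, order 1/sqrt(n)
best_bound = bounds.min()  # optimized over d, order log(n)/n here
print(best_bound <= trace_bound)  # -> True
```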
Exponent of the margin
Estimation error analysis shows that in GM = {g : ‖g‖ ≤ M},

R(gn) − R(g*) ≤ M ...

Wrong power (M² penalty) is used in the algorithm:

Computationally easier

But does not give a dimension-free status

Using M could improve the cutoff detection
O. Bousquet Statistical Learning Theory Lecture 8 182
Kernel
Why is it good to use kernels?

Gaussian kernel (RBF):

k(x, y) = e^{−‖x−y‖²/(2σ²)}

σ is the width of the kernel

What is the geometry of the feature space?
O. Bousquet Statistical Learning Theory Lecture 8 183
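Two geometric facts used in the following slides can be checked directly: since k(x, x) = 1, every input is mapped onto the unit sphere of the feature space, and ‖Φ(x) − Φ(y)‖² = 2 − 2 k(x, y) < 2, so distances are contracted and images are never antipodal. A minimal sketch (the sample points are arbitrary):

```python
import numpy as np

# Geometry of the RBF feature space: k(x, x) = 1 puts every point on the
# unit sphere, and ||Phi(x) - Phi(y)||^2 = 2 - 2*k(x, y) lies in (0, 2).
def rbf(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

x, y = np.array([0.0, 1.0]), np.array([2.0, -1.0])
assert rbf(x, x) == 1.0                # unit norm: image lies on the sphere
dist_sq = 2 - 2 * rbf(x, y)            # squared feature-space distance
print(0.0 < dist_sq < 2.0)  # -> True: contracted, never antipodal
```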
RBF
O. Bousquet Statistical Learning Theory Lecture 8 185
RBF
Differential Geometry
Flat Riemannian metric:

distance along the sphere is equal to distance in input space

Distances are contracted:

shortcuts by getting outside the sphere
O. Bousquet Statistical Learning Theory Lecture 8 186
RBF
Geometry of the span
[Figure: ellipsoid x²/a² + y²/b² = 1 with semi-axes a, b]

K = (k(xi, xj)) Gram matrix

Eigenvalues λ1, ..., λm

Data points mapped to an ellipsoid with semi-axis lengths √λ1, ..., √λm
O. Bousquet Statistical Learning Theory Lecture 8 187
RBF
Universality
Consider the set of functions

H = span{ k(x, ·) : x ∈ X }

H is dense in C(X)

Any continuous function can be approximated (in the sup norm) by functions in H

⇒ with enough data one can construct any function
O. Bousquet Statistical Learning Theory Lecture 8 188
RBF
Eigenvalues
Exponentially decreasing

Fourier domain: exponential penalization of derivatives

Enforces smoothness with respect to the Lebesgue measure in input space
O. Bousquet Statistical Learning Theory Lecture 8 189
RBF
Induced Distance and Flexibility
σ → 0:

1-nearest-neighbor in input space

Each point in a separate dimension, everything orthogonal

σ → ∞:

Linear classifier in input space

All points very close on the sphere, initial geometry

Tuning σ allows to try all possible intermediate combinations
O. Bousquet Statistical Learning Theory Lecture 8 190
RBF
Ideas
Works well if the Euclidean distance is good

Works well if the decision boundary is smooth

Adapt smoothness via σ

Universal
O. Bousquet Statistical Learning Theory Lecture 8 191
Learning Theory: some informal thoughts
Need assumptions/restrictions to learn

Data cannot replace knowledge

No universal learning (simplicity measure)

SVMs work because of capacity control

Choice of kernel = choice of prior/regularizer

RBF works well if the Euclidean distance is meaningful

Knowledge improves performance (e.g. invariances)
O. Bousquet Statistical Learning Theory Lecture 8 193