Statistical Learning Theory
8/4/2019 Statistical Learning Theory
Statistical Learning Theory
Olivier Bousquet
Department of Empirical Inference
Max Planck Institute of Biological Cybernetics
Machine Learning Summer School, August 2003
Roadmap (1)
Lecture 1: Introduction
Part I: Binary Classification
Lecture 2: Basic bounds
Lecture 3: VC theory
Lecture 4: Capacity measures
Lecture 5: Advanced topics
O. Bousquet Statistical Learning Theory 1
Roadmap (2)
Part II: Real-Valued Classification
Lecture 6: Margin and loss functions
Lecture 7: Regularization
Lecture 8: SVM
O. Bousquet Statistical Learning Theory 2
Lecture 1
The Learning Problem Context
Formalization
Approximation/Estimation trade-off
Algorithms and Bounds
O. Bousquet Statistical Learning Theory Lecture 1 3
Learning and Inference
The inductive inference process:
1. Observe a phenomenon
2. Construct a model of the phenomenon
3. Make predictions
This is more or less the definition of natural sciences !
The goal of Machine Learning is to automate this process
The goal of Learning Theory is to formalize it.
O. Bousquet Statistical Learning Theory Lecture 1 4
Pattern recognition
We consider here the supervised learning framework for pattern recognition:
Data consists of pairs (instance, label)
Label is +1 or −1
Algorithm constructs a function (instance → label)
Goal: make few mistakes on future unseen instances
O. Bousquet Statistical Learning Theory Lecture 1 5
Approximation/Interpolation
It is always possible to build a function that fits exactly the data.
But is it reasonable ?
O. Bousquet Statistical Learning Theory Lecture 1 6
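The point above can be illustrated numerically. The sketch below uses a 1-nearest-neighbour rule as a hypothetical stand-in for any interpolating method (this example is mine, not from the slides): it always reaches zero empirical error on distinct training points, yet nothing on the sample constrains its behaviour elsewhere.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.5, 20)                # instances
Y = np.where(rng.random(20) < 0.5, 1, -1)    # arbitrary +/-1 labels

def interpolator(x):
    # 1-nearest-neighbour rule: copy the label of the closest training point
    return Y[np.argmin(np.abs(X - x))]

train_errors = sum(interpolator(x) != y for x, y in zip(X, Y))
print(train_errors)  # 0: the sample is fitted exactly
```

Zero training error is achieved regardless of how the labels were generated, which is precisely why empirical fit alone says nothing about future mistakes.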
Occam's Razor
Idea: look for regularities in the observed phenomenon. These can be generalized from the observed past to the future
choose the simplest consistent model
How to measure simplicity ?
Physics: number of constants
Description length
Number of parameters
...
O. Bousquet Statistical Learning Theory Lecture 1 7
No Free Lunch
if there is no assumption on how the past is related to the future,
prediction is impossible
if there is no restriction on the possible phenomena, generalization
is impossible
We need to make assumptions
Simplicity is not absolute
Data will never replace knowledge
Generalization = data + knowledge
O. Bousquet Statistical Learning Theory Lecture 1 8
Assumptions
Two types of assumptions
Future observations related to past ones
Stationarity of the phenomenon
Constraints on the phenomenon
Notion of simplicity
O. Bousquet Statistical Learning Theory Lecture 1 9
Probabilistic Model
Relationship between past and future observations
Sampled independently from the same distribution
Independence: each new observation yields maximum information
Identical distribution: the observations give information about the underlying phenomenon (here a probability distribution)
O. Bousquet Statistical Learning Theory Lecture 1 11
Probabilistic Model
We consider an input space X and output space Y. Here: classification case Y = {−1, 1}.
Assumption: The pairs (X, Y) ∈ X × Y are distributed according to P (unknown).
Data: We observe a sequence of n i.i.d. pairs (Xi, Yi) sampled according to P.
Goal: construct a function g : X → Y which predicts Y from X.
O. Bousquet Statistical Learning Theory Lecture 1 12
Probabilistic Model
Criterion to choose our function:
Low probability of error P(g(X) ≠ Y).
Risk:
R(g) = P(g(X) ≠ Y) = E[1_{g(X)≠Y}]
P is unknown, so we cannot directly measure the risk. We can only measure the agreement on the data.
Empirical Risk:
Rn(g) = (1/n) Σ_{i=1}^n 1_{g(Xi)≠Yi}
O. Bousquet Statistical Learning Theory Lecture 1 13
Target function
P can be decomposed as P_X × P(Y|X)
η(x) = E[Y|X = x] = 2 P[Y = 1|X = x] − 1 is the regression function
t(x) = sgn η(x) is the target function
in the deterministic case Y = t(X) (P[Y = 1|X] ∈ {0, 1})
in general, n(x) = min(P[Y = 1|X = x], 1 − P[Y = 1|X = x]) = (1 − |η(x)|)/2 is the noise level
O. Bousquet Statistical Learning Theory Lecture 1 14
Assumptions about P
Need assumptions about P.
Indeed, if t(x) is totally chaotic, there is no possible generalization
from finite data.
Assumptions can be
Preference (e.g. a priori probability distribution on possible functions)
Restriction (set of possible functions)
Treating lack of knowledge:
Bayesian approach: uniform distribution
Learning Theory approach: worst case analysis
O. Bousquet Statistical Learning Theory Lecture 1 15
Overfitting/Underfitting
The data can mislead you.
Underfitting: model too small to fit the data
Overfitting: artificially good agreement with the data
No way to detect them from the data ! Need extra validation data.
O. Bousquet Statistical Learning Theory Lecture 1 17
Empirical Risk Minimization
Choose a model G (set of possible functions)
Minimize the empirical risk in the model
min_{g∈G} Rn(g)
What if the Bayes classifier is not in the model ?
O. Bousquet Statistical Learning Theory Lecture 1 18
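Empirical risk minimization over a finite model can be sketched in a few lines. The threshold-classifier model and the 10% label noise below are illustrative assumptions of mine, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = rng.uniform(-1, 1, n)
Y = np.where(X > 0.2, 1, -1)     # hypothetical target: threshold at 0.2
flip = rng.random(n) < 0.1       # flip 10% of the labels (noise)
Y[flip] *= -1

# finite model G: threshold classifiers g_t(x) = sign(x - t)
thresholds = np.linspace(-1, 1, 41)

def empirical_risk(t):
    return np.mean(np.where(X > t, 1, -1) != Y)

g_hat = thresholds[np.argmin([empirical_risk(t) for t in thresholds])]
print(g_hat, empirical_risk(g_hat))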
Approximation/Estimation
Bayes risk:
R* = inf_g R(g).
Best risk a deterministic function can have (risk of the target function, or Bayes classifier).
Decomposition: with R(g*) = inf_{g∈G} R(g),
R(gn) − R* = [R(g*) − R*] + [R(gn) − R(g*)]
(approximation error + estimation error)
Only the estimation error is random (i.e. depends on the data).
O. Bousquet Statistical Learning Theory Lecture 1 19
Structural Risk Minimization
Choose a collection of models {Gd : d = 1, 2, . . .}
Minimize the empirical risk in each model
Minimize the penalized empirical risk
min_d min_{g∈Gd} Rn(g) + pen(d, n)
pen(d, n) gives preference to models where the estimation error is small
pen(d, n) measures the size or capacity of the model
O. Bousquet Statistical Learning Theory Lecture 1 20
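The penalized selection can be sketched as a toy computation. The penalty pen(d, n) = √(d/n) and the risk values below are made-up illustrations, not the slides' choices.

```python
import math

# toy SRM: models G_d indexed by d = 1, 2, ..., with penalty pen(d, n) = sqrt(d / n)
def srm_select(emp_risks, n):
    scores = [r + math.sqrt(d / n) for d, r in enumerate(emp_risks, start=1)]
    return scores.index(min(scores)) + 1   # the selected model index d

# empirical risk decreases with d while the penalty grows: a middle d wins
print(srm_select([0.30, 0.20, 0.15, 0.14, 0.135], n=100))  # 3
```

The selected index balances the two terms, mirroring the approximation/estimation trade-off of the previous slide.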
Regularization
Choose a large model G (possibly dense)
Choose a regularizer ‖g‖
Minimize the regularized empirical risk
min_{g∈G} Rn(g) + λ‖g‖²
Choose an optimal trade-off (regularization parameter).
Most methods can be thought of as regularization methods.
O. Bousquet Statistical Learning Theory Lecture 1 21
Bounds (1)
A learning algorithm
Takes as input the data (X1, Y1), . . . , (Xn, Yn)
Produces a function gn
Can we estimate the risk of gn ?
random quantity (depends on the data).
need probabilistic bounds
O. Bousquet Statistical Learning Theory Lecture 1 22
Bounds (2)
Error bounds: R(gn) ≤ Rn(gn) + B
⇒ estimation from an empirical quantity
Relative error bounds:
Best in a class: R(gn) ≤ R(g*) + B
Bayes risk: R(gn) ≤ R* + B
Theoretical guarantees
O. Bousquet Statistical Learning Theory Lecture 1 23
Lecture 2
Basic Bounds
Probability tools
Relationship with empirical processes
Law of large numbers
Union bound
Relative error bounds
O. Bousquet Statistical Learning Theory Lecture 2 24
Probability Tools (1)
Basic facts
Union: P[A or B] ≤ P[A] + P[B]
Inclusion: If A ⊂ B, then P[A] ≤ P[B].
Inversion: If P[X ≥ t] ≤ F(t), then with probability at least 1 − δ, X ≤ F^{−1}(δ).
Expectation: If X ≥ 0, E[X] = ∫_0^∞ P[X ≥ t] dt.
O. Bousquet Statistical Learning Theory Lecture 2 25
Probability Tools (2)
Basic inequalities
Jensen: for f convex, f(E[X]) ≤ E[f(X)]
Markov: if X ≥ 0, then for all t > 0, P[X ≥ t] ≤ E[X]/t
Chebyshev: for t > 0, P[|X − E[X]| ≥ t] ≤ Var[X]/t²
Chernoff: for all t ∈ R, P[X ≥ t] ≤ inf_{λ≥0} E[e^{λ(X−t)}]
O. Bousquet Statistical Learning Theory Lecture 2 26
Error bounds
Recall that we want to bound R(gn) = E[1_{gn(X)≠Y}] where gn has been constructed from (X1, Y1), . . . , (Xn, Yn).
Cannot be observed (P is unknown)
Random (depends on the data)
⇒ we want to bound P[R(gn) − Rn(gn) > ε]
O. Bousquet Statistical Learning Theory Lecture 2 27
Loss class
For convenience, let Zi = (Xi, Yi) and Z = (X, Y). Given G, define the loss class
F = {f : (x, y) ↦ 1_{g(x)≠y} : g ∈ G}
Denote P f = E[f(X, Y)] and Pn f = (1/n) Σ_{i=1}^n f(Xi, Yi)
Quantity of interest: P f − Pn f
We will go back and forth between F and G (bijection)
O. Bousquet Statistical Learning Theory Lecture 2 28
Empirical process
Empirical process: {P f − Pn f}_{f∈F}
Process = collection of random variables (here indexed by the functions in F)
Empirical = distribution of each random variable
Many techniques exist to control the supremum
sup_{f∈F} (P f − Pn f)
O. Bousquet Statistical Learning Theory Lecture 2 29
The Law of Large Numbers
R(g) − Rn(g) = E[f(Z)] − (1/n) Σ_{i=1}^n f(Zi)
is the difference between the expectation and the empirical average of the r.v. f(Z)
Law of large numbers:
P[ lim_{n→∞} (1/n) Σ_{i=1}^n f(Zi) − E[f(Z)] = 0 ] = 1.
⇒ can we quantify it ?
O. Bousquet Statistical Learning Theory Lecture 2 30
Hoeffding's Inequality
Quantitative version of the law of large numbers. Assumes bounded random variables.
Theorem 1. Let Z1, . . . , Zn be n i.i.d. random variables with f(Z) ∈ [a, b]. Then for all ε > 0, we have
P[ |(1/n) Σ_{i=1}^n f(Zi) − E[f(Z)]| > ε ] ≤ 2 exp(−2nε² / (b − a)²).
Let's rewrite it to better understand
O. Bousquet Statistical Learning Theory Lecture 2 31
Hoeffding's Inequality
Write
δ = 2 exp(−2nε² / (b − a)²)
Then
P[ |Pn f − P f| > (b − a) √(log(2/δ) / (2n)) ] ≤ δ
or [Inversion] with probability at least 1 − δ,
|Pn f − P f| ≤ (b − a) √(log(2/δ) / (2n))
O. Bousquet Statistical Learning Theory Lecture 2 32
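The inversion above can be checked by simulation. The Bernoulli loss distribution and the constants below are illustrative assumptions of mine:

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta, trials = 200, 0.05, 2000
# for a {0,1}-valued loss, (b - a) = 1
eps = np.sqrt(np.log(2 / delta) / (2 * n))

# fixed f: f(Z) ~ Bernoulli(0.3); compare empirical mean to the true mean
p = 0.3
samples = rng.random((trials, n)) < p
deviations = np.abs(samples.mean(axis=1) - p)
failure_rate = np.mean(deviations > eps)
print(eps, failure_rate)  # failure rate stays below delta, as the bound promises
```

In practice the observed failure rate is well below δ, reflecting that Hoeffding's inequality ignores the variance (a point the lecture returns to later).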
Hoeffding's inequality
Let's apply it to f(Z) = 1_{g(X)≠Y}.
For any g, and any δ > 0, with probability at least 1 − δ,
R(g) ≤ Rn(g) + √(log(2/δ) / (2n)). (1)
Notice that one has to consider a fixed function f: the probability is with respect to the sampling of the data.
If the function depends on the data this does not apply !
O. Bousquet Statistical Learning Theory Lecture 2 33
Limitations
For each fixed function f ∈ F, there is a set S of samples for which
P f − Pn f ≤ √(log(2/δ) / (2n))   (P[S] ≥ 1 − δ)
These sets may be different for different functions
The function chosen by the algorithm depends on the sample
For the observed sample, only some of the functions in F will satisfy this inequality !
O. Bousquet Statistical Learning Theory Lecture 2 34
Limitations
What we need to bound is
P fn − Pn fn
where fn is the function chosen by the algorithm based on the data.
For any fixed sample, there exists a function f such that
P f − Pn f = 1
Take the classifier which agrees with Yi on the data and is wrong everywhere else: its loss f has Pn f = 0 and P f = 1.
This does not contradict Hoeffding but shows it is not enough
O. Bousquet Statistical Learning Theory Lecture 2 35
Limitations
[Figure: risk R[f] and empirical risk Remp[f] plotted over the function class; the minimizer fm of Remp and the minimizer fopt of R differ]
Hoeffding's inequality quantifies deviations for a fixed function
O. Bousquet Statistical Learning Theory Lecture 2 36
Uniform Deviations
Before seeing the data, we do not know which function the algorithm
will choose.
The trick is to consider uniform deviations:
R(fn) − Rn(fn) ≤ sup_{f∈F} (R(f) − Rn(f))
We need a bound which holds simultaneously for all functions in a class
O. Bousquet Statistical Learning Theory Lecture 2 37
Union Bound
Consider two functions f1, f2 and define
Ci = {(x1, y1), . . . , (xn, yn) : P fi − Pn fi > ε}
From Hoeffding's inequality, for each i,
P[Ci] ≤ δ
We want to bound the probability of being bad for i = 1 or i = 2:
P[C1 ∪ C2] ≤ P[C1] + P[C2]
O. Bousquet Statistical Learning Theory Lecture 2 38
Finite Case
More generally,
P[C1 ∪ · · · ∪ CN] ≤ Σ_{i=1}^N P[Ci]
We have
P[∃f ∈ {f1, . . . , fN} : P f − Pn f > ε] ≤ Σ_{i=1}^N P[P fi − Pn fi > ε] ≤ N exp(−2nε²)
O. Bousquet Statistical Learning Theory Lecture 2 39
Finite Case
We obtain, for G = {g1, . . . , gN}: for all δ > 0, with probability at least 1 − δ,
∀g ∈ G, R(g) ≤ Rn(g) + √((log N + log(1/δ)) / (2n))
This is a generalization bound !
Coding interpretation: log N is the number of bits needed to specify a function in F
O. Bousquet Statistical Learning Theory Lecture 2 40
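The finite-class bound can be evaluated directly. The helper below is a sketch (its name and the numeric values are mine, not from the slides):

```python
import math

def finite_class_bound(n, N, delta):
    # with probability >= 1 - delta, for all g in a class of size N:
    #   R(g) <= R_n(g) + sqrt((log N + log(1/delta)) / (2n))
    return math.sqrt((math.log(N) + math.log(1 / delta)) / (2 * n))

print(round(finite_class_bound(n=1000, N=100, delta=0.05), 3))  # 0.062
```

The dependence on the class is only logarithmic, which is what makes the coding interpretation (log N bits) natural.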
Approximation/Estimation
Let g* = arg min_{g∈G} R(g). If gn minimizes the empirical risk in G,
Rn(g*) − Rn(gn) ≥ 0
Thus
R(gn) = R(gn) − R(g*) + R(g*)
≤ [R(gn) − Rn(gn)] + [Rn(gn) − Rn(g*)] + [Rn(g*) − R(g*)] + R(g*)
≤ 2 sup_{g∈G} |R(g) − Rn(g)| + R(g*)
O. Bousquet Statistical Learning Theory Lecture 2 41
Approximation/Estimation
We obtain: with probability at least 1 − δ,
R(gn) ≤ R(g*) + 2 √((log N + log(2/δ)) / (2n))
The first term decreases if N increases
The second term increases
The size of G controls the trade-off
O. Bousquet Statistical Learning Theory Lecture 2 42
Summary (1)
Inference requires assumptions
- Data sampled i.i.d. from P
- Restrict the possible functions to G
- Choose a sequence of models Gm to have more flexibility/control
O. Bousquet Statistical Learning Theory Lecture 2 43
Summary (2)
Bounds are valid w.r.t. repeated sampling
- For a fixed function g, for most of the samples, R(g) − Rn(g) ≈ 1/√n
- For most of the samples, if |G| = N, sup_{g∈G} R(g) − Rn(g) ≈ √(log N / n)
Extra variability because the chosen gn changes with the data
O. Bousquet Statistical Learning Theory Lecture 2 44
Improvements
We obtained
sup_{g∈G} R(g) − Rn(g) ≤ √((log N + log(2/δ)) / (2n))
To be improved:
Hoeffding only uses boundedness, not the variance
Union bound as bad as if independent
Supremum is not what the algorithm chooses.
Next we improve the union bound and extend it to the infinite case
O. Bousquet Statistical Learning Theory Lecture 2 45
Refined union bound (1)
For each f ∈ F,
P[ P f − Pn f > √(log(1/δ(f)) / (2n)) ] ≤ δ(f)
Hence
P[ ∃f ∈ F : P f − Pn f > √(log(1/δ(f)) / (2n)) ] ≤ Σ_{f∈F} δ(f)
Choose δ(f) = δ p(f) with Σ_{f∈F} p(f) = 1
O. Bousquet Statistical Learning Theory Lecture 2 46
Refined union bound (2)
With probability at least 1 − δ,
∀f ∈ F, P f ≤ Pn f + √((log(1/p(f)) + log(1/δ)) / (2n))
Applies to countably infinite F
Can put knowledge about the algorithm into p(f)
But p must be chosen before seeing the data
O. Bousquet Statistical Learning Theory Lecture 2 47
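A sketch of how the weighted union bound trades prior weight for tightness, assuming a hypothetical prior p(f_k) = 2^{−k} over a countable class (the prior and the numbers are mine):

```python
import math

# hypothetical prior over a countable class: p(f_k) = 2**-k, k = 1, 2, ...
def refined_bound(n, k, delta):
    # with probability >= 1 - delta, for all f:
    #   P f <= P_n f + sqrt((log(1/p(f)) + log(1/delta)) / (2n))
    log_inv_p = k * math.log(2)
    return math.sqrt((log_inv_p + math.log(1 / delta)) / (2 * n))

# functions given more prior weight (small k) get a tighter bound
print(refined_bound(1000, 1, 0.05), refined_bound(1000, 10, 0.05))
```

Choosing the prior well (before seeing the data) is exactly the "knowledge improves the bound" point of the next slide.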
Refined union bound (3)
Good p means good bound. The bound can be improved if you know ahead of time which function will be chosen (knowledge improves the bound)
In the infinite case, how to choose p (since it implies an ordering) ?
The trick is to look at F through the data
O. Bousquet Statistical Learning Theory Lecture 2 48
Lecture 3
Infinite Case: Vapnik-Chervonenkis Theory
Growth function
Vapnik-Chervonenkis dimension
Proof of the VC bound
VC entropy
SRM
O. Bousquet Statistical Learning Theory Lecture 3 49
Infinite Case
Measure of the size of an infinite class ?
Consider
F_{z1,...,zn} = {(f(z1), . . . , f(zn)) : f ∈ F}
The size of this set is the number of possible ways in which the data (z1, . . . , zn) can be classified.
Growth function:
S_F(n) = sup_{(z1,...,zn)} |F_{z1,...,zn}|
Note that S_F(n) = S_G(n)
O. Bousquet Statistical Learning Theory Lecture 3 50
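The growth function can be computed exactly for a simple class. The code below enumerates the projections of 1-d threshold functions g_t(x) = sign(x − t) (my illustrative class), for which S(n) = n + 1:

```python
# 1-d threshold functions g_t(x) = sign(x - t); their growth function is n + 1
def projections(points):
    pts = sorted(points)
    # one representative threshold per gap, plus one below and one above
    cuts = [pts[0] - 1] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1]
    return {tuple(1 if x > t else -1 for x in points) for t in cuts}

pts = [0.3, 1.2, -0.5, 2.0]
print(len(projections(pts)))  # 5 = n + 1 distinct classifications of the sample
```

Since n + 1 is far below 2^n, this class cannot shatter two points, which anticipates the VC-dimension discussion below.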
Infinite Case
Result (Vapnik-Chervonenkis). With probability at least 1 − δ,
∀g ∈ G, R(g) ≤ Rn(g) + √(8 (log S_G(2n) + log(4/δ)) / n)
Always better than N in the finite case
How to compute S_G(n) in general ?
⇒ use the VC dimension
O. Bousquet Statistical Learning Theory Lecture 3 51
VC Dimension
Notice that since g ∈ {−1, 1}, S_G(n) ≤ 2^n
If S_G(n) = 2^n, the class of functions can generate any classification on n points (shattering)
Definition 2. The VC-dimension of G is the largest n such that S_G(n) = 2^n
O. Bousquet Statistical Learning Theory Lecture 3 52
VC Dimension
Hyperplanes
In R^d, VC(hyperplanes) = d + 1
O. Bousquet Statistical Learning Theory Lecture 3 53
Number of Parameters
Is VC dimension equal to number of parameters ?
One parameter: {sgn(sin(tx)) : t ∈ R}
Infinite VC dimension !
O. Bousquet Statistical Learning Theory Lecture 3 54
VC Dimension
We want to know S_G(n), but we only know S_G(n) = 2^n for n ≤ h. What happens for n ≥ h ?
[Figure: log S(m) as a function of m — linear growth up to m = h, then slower growth]
O. Bousquet Statistical Learning Theory Lecture 3 55
Vapnik-Chervonenkis-Sauer-Shelah Lemma
Lemma 3. Let G be a class of functions with finite VC-dimension h. Then for all n ∈ N,
S_G(n) ≤ Σ_{i=0}^h C(n, i)
and for all n ≥ h,
S_G(n) ≤ (en/h)^h
⇒ phase transition at n = h
O. Bousquet Statistical Learning Theory Lecture 3 56
VC Bound
Let G be a class with VC dimension h. With probability at least 1 − δ,
∀g ∈ G, R(g) ≤ Rn(g) + √(8 (h log(2en/h) + log(4/δ)) / n)
So the error is of order √(h log n / n)
O. Bousquet Statistical Learning Theory Lecture 3 57
Interpretation
VC dimension: measure of effective dimension
Depends on geometry of the class
Gives a natural definition of simplicity (by quantifying the potential overfitting)
Not related to the number of parameters
Finiteness guarantees learnability under any distribution
O. Bousquet Statistical Learning Theory Lecture 3 58
Symmetrization (lemma)
Key ingredient in VC bounds: Symmetrization
Let Z′1, . . . , Z′n be an independent (ghost) sample and P′n the corresponding empirical measure.
Lemma 4. For any t > 0 such that nt² ≥ 2,
P[ sup_{f∈F} (P − Pn)f ≥ t ] ≤ 2 P[ sup_{f∈F} (P′n − Pn)f ≥ t/2 ]
O. Bousquet Statistical Learning Theory Lecture 3 59
Symmetrization (proof 1)
Let fn be the function achieving the supremum (it depends on Z1, . . . , Zn). Then
1_{(P−Pn)fn>t} 1_{(P−P′n)fn≤t/2} ≤ 1_{(P′n−Pn)fn≥t/2}
Taking expectations with respect to the second (ghost) sample gives
1_{(P−Pn)fn>t} P′[ (P − P′n)fn < t/2 ] ≤ P′[ (P′n − Pn)fn > t/2 ]
O. Bousquet Statistical Learning Theory Lecture 3 60
Symmetrization (proof 2)
By Chebyshev's inequality,
P′[ (P − P′n)fn ≥ t/2 ] ≤ 4 Var[fn] / (n t²) ≤ 1 / (n t²)
Hence
1_{(P−Pn)fn>t} (1 − 1/(n t²)) ≤ P′[ (P′n − Pn)fn > t/2 ]
Take expectation with respect to first sample.
O. Bousquet Statistical Learning Theory Lecture 3 61
Proof of VC bound (1)
Symmetrization allows to replace the expectation by an average on the ghost sample
Function class projected on the double sample: F_{Z1,...,Zn,Z′1,...,Z′n}
Union bound on F_{Z1,...,Zn,Z′1,...,Z′n}
Variant of Hoeffding's inequality:
P[ Pn f − P′n f > t ] ≤ 2 e^{−nt²/2}
O. Bousquet Statistical Learning Theory Lecture 3 62
Proof of VC bound (2)
P[ sup_{f∈F} (P − Pn)f ≥ t ]
≤ 2 P[ sup_{f∈F} (P′n − Pn)f ≥ t/2 ]
= 2 P[ sup_{f∈F_{Z1,...,Zn,Z′1,...,Z′n}} (P′n − Pn)f ≥ t/2 ]
≤ 2 S_F(2n) P[ (P′n − Pn)f ≥ t/2 ]
≤ 4 S_F(2n) e^{−nt²/8}
O. Bousquet Statistical Learning Theory Lecture 3 63
VC Entropy (1)
VC dimension is distribution independent
The same bound holds for any distribution
It is loose for most distributions
A similar proof can give a distribution-dependent result
O. Bousquet Statistical Learning Theory Lecture 3 64
VC Entropy (2)
Denote the size of the projection by N(F, z1, . . . , zn) := #F_{z1,...,zn}
The VC entropy is defined as
H_F(n) = log E[ N(F, Z1, . . . , Zn) ]
VC entropy bound: with probability at least 1 − δ,
∀g ∈ G, R(g) ≤ Rn(g) + √(8 (H_G(2n) + log(2/δ)) / n)
O. Bousquet Statistical Learning Theory Lecture 3 65
VC Entropy (proof)
Introduce σi ∈ {−1, 1} uniform (probability 1/2), the Rademacher variables. Then
2 P[ sup_{f∈F_{Z,Z′}} (P′n − Pn)f ≥ t/2 ]
= 2 E[ P_σ[ sup_{f∈F_{Z,Z′}} (1/n) Σ_{i=1}^n σi (f(Z′i) − f(Zi)) ≥ t/2 ] ]
≤ 2 E[ N(F, Z, Z′) P_σ[ (1/n) Σ_{i=1}^n σi ≥ t/2 ] ]
≤ 2 E[ N(F, Z, Z′) ] e^{−nt²/8}
O. Bousquet Statistical Learning Theory Lecture 3 66
From Bounds to Algorithms
For any distribution, H_G(n)/n → 0 ensures consistency of the empirical risk minimizer (i.e. convergence to the best in the class)
Does it mean we can learn anything ?
No, because of the approximation error of the class
Need to trade off approximation and estimation error (assessed by the bound)
Use the bound to control the trade-off
O. Bousquet Statistical Learning Theory Lecture 3 67
Structural risk minimization
Structural risk minimization (SRM) (Vapnik, 1979): minimize the right-hand side of
R(g) ≤ Rn(g) + B(h, n).
To this end, introduce a structure on G.
Learning machine = a set of functions and an induction principle
O. Bousquet Statistical Learning Theory Lecture 3 68
SRM: The Picture
[Figure: SRM picture — over a structure S_{n−1} ⊂ S_n ⊂ S_{n+1}, the training error decreases with capacity h while the capacity term increases; the bound on the test error R(f*) is minimized at an intermediate element]
O. Bousquet Statistical Learning Theory Lecture 3 69
Lecture 4
Capacity Measures
Covering numbers
Rademacher averages
Relationships
O. Bousquet Statistical Learning Theory Lecture 4 70
Covering numbers
Define a (random) distance d between functions, e.g.
d(f, f′) = (1/n) #{i = 1, . . . , n : f(Zi) ≠ f′(Zi)}
the normalized Hamming distance of the projections on the sample
A set f1, . . . , fN covers F at radius ε if
F ⊂ ∪_{i=1}^N B(fi, ε)
The covering number N(F, ε, n) is the minimum size of a cover of radius ε
Note that N(F, ε, n) = N(G, ε, n).
O. Bousquet Statistical Learning Theory Lecture 4 71
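An empirical cover at radius ε under the normalized Hamming distance can be built greedily. The projected threshold class below is my illustrative choice, and greedy covering is a sketch, not an exact covering-number computation:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.uniform(-1, 1, 50)
thresholds = np.linspace(-1, 1, 201)
# rows: threshold classifiers projected on the sample
proj = np.array([np.where(Z > t, 1, -1) for t in thresholds])

def hamming(a, b):
    # normalized Hamming distance d(f, f') on the sample
    return np.mean(a != b)

def greedy_cover(rows, radius):
    centers = []
    for r in rows:
        if all(hamming(r, c) > radius for c in centers):
            centers.append(r)
    return centers

centers = greedy_cover(proj, 0.1)
print(len(centers))  # far fewer centers than the 201 functions
```

By construction every projected function is within the radius of some center, so the greedy set is a valid (if not minimal) cover.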
Bound with covering numbers
When the covering numbers are finite, one can approximate the class
G by a finite set of functions
Result
P[ ∃g ∈ G : R(g) − Rn(g) ≥ t ] ≤ 8 E[ N(G, t, n) ] e^{−nt²/128}
O. Bousquet Statistical Learning Theory Lecture 4 72
Covering numbers and VC dimension
Notice that for all t, N(G, t, n) ≤ #G_{Z1,...,Zn} = N(G, Z1, . . . , Zn)
Hence log N(G, t, n) ≤ h log(en/h)
Haussler:
N(G, t, n) ≤ C h (4e)^h t^{−h}
⇒ independent of n
O. Bousquet Statistical Learning Theory Lecture 4 73
Refinement
VC entropy corresponds to log covering numbers at minimal scale
The covering number bound is a generalization where the scale is adapted to the error
Is this the right scale ?
It turns out that results can be improved by considering all scales (chaining)
O. Bousquet Statistical Learning Theory Lecture 4 74
Rademacher averages
Rademacher variables: σ1, . . . , σn independent random variables with
P[σi = 1] = P[σi = −1] = 1/2
Notation (randomized empirical measure): Rn f = (1/n) Σ_{i=1}^n σi f(Zi)
Rademacher average: R(F) = E[ sup_{f∈F} Rn f ]
Conditional Rademacher average: Rn(F) = E_σ[ sup_{f∈F} Rn f ]
O. Bousquet Statistical Learning Theory Lecture 4 75
Result
Distribution dependent:
∀f ∈ F, P f ≤ Pn f + 2 R(F) + √(log(1/δ) / (2n))
Data dependent:
∀f ∈ F, P f ≤ Pn f + 2 Rn(F) + √(2 log(2/δ) / n)
O. Bousquet Statistical Learning Theory Lecture 4 76
Concentration
Hoeffding's inequality is a concentration inequality
When n increases, the average is concentrated around the expectation
Generalizations to functions that depend on i.i.d. random variables exist
O. Bousquet Statistical Learning Theory Lecture 4 77
McDiarmid's Inequality
Assume that for all i = 1, . . . , n,
sup_{z1,...,zn, z′i} |F(z1, . . . , zi, . . . , zn) − F(z1, . . . , z′i, . . . , zn)| ≤ c
then for all ε > 0,
P[ |F − E[F]| > ε ] ≤ 2 exp(−2ε² / (nc²))
O. Bousquet Statistical Learning Theory Lecture 4 78
Proof of Rademacher average bounds
Use concentration to relate sup_{f∈F} (P f − Pn f) to its expectation
Use symmetrization to relate the expectation to the Rademacher average
Use concentration again to relate the Rademacher average to the conditional one
O. Bousquet Statistical Learning Theory Lecture 4 79
Application (1)
sup_{f∈F} (A(f) + B(f)) ≤ sup_{f∈F} A(f) + sup_{f∈F} B(f)
Hence
| sup_{f∈F} C(f) − sup_{f∈F} A(f) | ≤ sup_{f∈F} (C(f) − A(f))
this gives (P′n denoting the empirical measure with one sample point changed)
| sup_{f∈F} (P f − Pn f) − sup_{f∈F} (P f − P′n f) | ≤ sup_{f∈F} (P′n f − Pn f)
O. Bousquet Statistical Learning Theory Lecture 4 80
Application (2)
f ∈ {0, 1}, hence
P′n f − Pn f = (1/n)(f(Z′i) − f(Zi)) ≤ 1/n
thus
| sup_{f∈F} (P f − Pn f) − sup_{f∈F} (P f − P′n f) | ≤ 1/n
McDiarmid's inequality can be applied with c = 1/n
O. Bousquet Statistical Learning Theory Lecture 4 81
Symmetrization (1)
Upper bound:
E[ sup_{f∈F} (P f − Pn f) ] ≤ 2 E[ sup_{f∈F} Rn f ]
Lower bound:
E[ sup_{f∈F} |P f − Pn f| ] ≥ (1/2) E[ sup_{f∈F} Rn f ] − 1/(2√n)
O. Bousquet Statistical Learning Theory Lecture 4 82
Symmetrization (2)
E[ sup_{f∈F} (P f − Pn f) ]
= E[ sup_{f∈F} (E′[P′n f] − Pn f) ]
≤ E_{Z,Z′}[ sup_{f∈F} (P′n f − Pn f) ]
= E_{σ,Z,Z′}[ sup_{f∈F} (1/n) Σ_{i=1}^n σi (f(Z′i) − f(Zi)) ]
≤ 2 E[ sup_{f∈F} Rn f ]
O. Bousquet Statistical Learning Theory Lecture 4 83
Loss class and initial class
R(F) = E[ sup_{g∈G} (1/n) Σ_{i=1}^n σi 1_{g(Xi)≠Yi} ]
= E[ sup_{g∈G} (1/n) Σ_{i=1}^n σi (1 − Yi g(Xi))/2 ]
= (1/2) E[ sup_{g∈G} (1/n) Σ_{i=1}^n (−σi Yi) g(Xi) ]
= (1/2) R(G)
(since −σi Yi has the same distribution as σi)
O. Bousquet Statistical Learning Theory Lecture 4 84
Computing Rademacher averages (1)
(1/2) E[ sup_{g∈G} (1/n) Σ_{i=1}^n σi g(Xi) ]
= 1/2 + E[ sup_{g∈G} ( −(1/n) Σ_{i=1}^n (1 − σi g(Xi))/2 ) ]
= 1/2 − E[ inf_{g∈G} (1/n) Σ_{i=1}^n (1 − σi g(Xi))/2 ]
= 1/2 − E[ inf_{g∈G} Rn(g, σ) ]
where Rn(g, σ) is the empirical risk of g with respect to the random labels σi.
O. Bousquet Statistical Learning Theory Lecture 4 85
Computing Rademacher averages (2)
Not harder than computing the empirical risk minimizer:
pick the σi at random and minimize the error with respect to the labels σi
Intuition: measures how much the class can fit random noise
Large class ⇒ R(G) = 1/2
O. Bousquet Statistical Learning Theory Lecture 4 86
Concentration again
Let F = E_σ[ sup_{f∈F} Rn f ]
(expectation with respect to the σi only, with the (Xi, Yi) fixed)
F satisfies McDiarmid's assumptions with c = 1/n
⇒ E[F] = R(F) can be estimated by F = Rn(F)
O. Bousquet Statistical Learning Theory Lecture 4 87
Relationship with VC dimension
For a finite set F = {f1, . . . , fN},
R(F) ≤ 2 √(log N / n)
Consequence for a VC class F with dimension h:
R(F) ≤ 2 √(h log(en/h) / n).
Recovers the VC bound with a concentration proof
O. Bousquet Statistical Learning Theory Lecture 4 88
Chaining
Using covering numbers at all scales, the geometry of the class is
better captured
Dudley: Rn(F) ≤ (C/√n) ∫_0^∞ √(log N(F, t, n)) dt
Consequence: R(F) ≤ C √(h/n)
Removes the unnecessary log n factor !
O. Bousquet Statistical Learning Theory Lecture 4 89
Lecture 5
Advanced Topics
Relative error bounds
Noise conditions
Localized Rademacher averages
PAC-Bayesian bounds
O. Bousquet Statistical Learning Theory Lecture 5 90
Binomial tails
Pn f has distribution B(p, n)/n — a binomial distribution with p = P f:
P[ P f − Pn f ≥ t ] = Σ_{k=0}^{⌊n(p−t)⌋} C(n, k) p^k (1−p)^{n−k}
Can be upper bounded:
Exponential: ((1−p)/(1−p−t))^{n(1−p−t)} (p/(p+t))^{n(p+t)}
Bennett: exp( −(np/(1−p)) ((1 − t/p) log(1 − t/p) + t/p) )
Bernstein: exp( −nt² / (2p(1−p) + 2t/3) )
Hoeffding: exp(−2nt²)
O. Bousquet Statistical Learning Theory Lecture 5 91
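Two of the bounds above can be compared numerically (the parameter values are illustrative): at small variance, Bernstein's variance-dependent exponent beats Hoeffding's worst-case one.

```python
import math

def hoeffding_tail(n, t):
    return math.exp(-2 * n * t * t)

def bernstein_tail(n, t, p):
    # variance term 2p(1-p) replaces Hoeffding's worst-case constant
    return math.exp(-n * t * t / (2 * p * (1 - p) + 2 * t / 3))

n, t, p = 1000, 0.05, 0.1   # small variance: p(1-p) = 0.09
print(hoeffding_tail(n, t), bernstein_tail(n, t, p))
```

For p close to 1/2 the two are comparable; the gap opens exactly where learning cares, i.e. at functions with small risk.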
Tail behavior
For small deviations, Gaussian behavior: exp(−nt²/(2p(1−p)))
⇒ Gaussian with variance p(1−p)
For large deviations, Poisson behavior: exp(−3nt/2)
Tails heavier than Gaussian
Can upper bound with a Gaussian with the largest (maximum) variance: exp(−2nt²)
O. Bousquet Statistical Learning Theory Lecture 5 92
Illustration (1)
Maximum variance (p = 0.5)
[Figure: deviation bound as a function of x — the binomial tail (BinoCDF) compared with the Bernstein, Bennett, exact binomial-tail and best Gaussian bounds]
O. Bousquet Statistical Learning Theory Lecture 5 93
Illustration (2)
Small variance (p = 0.1)
[Figures: deviation bounds at small variance — BinoCDF compared with the sub-Gaussian, Bernstein, Bennett, exact binomial-tail and best Gaussian bounds, with a zoom on the tail region]
O. Bousquet Statistical Learning Theory Lecture 5 94
Taking the variance into account (1)
Each function f ∈ F has a different variance P f(1 − P f) ≤ P f.
For each f ∈ F, by Bernstein's inequality, with probability at least 1 − δ,
P f ≤ Pn f + √(2 P f log(1/δ) / n) + 2 log(1/δ) / (3n)
The Gaussian part dominates (for P f not too small, or n large enough); it depends on P f
O. Bousquet Statistical Learning Theory Lecture 5 95
Taking the variance into account (2)
Central Limit Theorem:
√n (P f − Pn f) / √(P f(1 − P f)) → N(0, 1)
Idea: consider the ratio (P f − Pn f) / √(P f)
O. Bousquet Statistical Learning Theory Lecture 5 96
Normalization
Here (f ∈ {0, 1}), Var[f] ≤ P f² = P f
Large variance ⇒ large risk. After normalization, the fluctuations are more uniform:
sup_{f∈F} (P f − Pn f) / √(P f)
is not necessarily attained at functions with large variance.
Focus of learning: functions with small error P f (hence small variance). The normalized supremum takes this into account.
O. Bousquet Statistical Learning Theory Lecture 5 97
Relative deviations
Vapnik-Chervonenkis 1974
For δ > 0, with probability at least 1 − δ,
∀f ∈ F, (P f − Pn f) / √(P f) ≤ 2 √((log S_F(2n) + log(4/δ)) / n)
and
∀f ∈ F, (Pn f − P f) / √(Pn f) ≤ 2 √((log S_F(2n) + log(4/δ)) / n)
O. Bousquet Statistical Learning Theory Lecture 5 98
Proof sketch
1. Symmetrization
P[ sup_{f∈F} (P f − Pn f)/√(P f) ≥ t ] ≤ 2 P[ sup_{f∈F} (P′n f − Pn f)/√((Pn f + P′n f)/2) ≥ t ]
2. Randomization
= 2 E[ P_σ[ sup_{f∈F} ((1/n) Σ_{i=1}^n σi (f(Z′i) − f(Zi))) / √((Pn f + P′n f)/2) ≥ t ] ]
3. Tail bound
O. Bousquet Statistical Learning Theory Lecture 5 99
Consequences
From the fact
A ≤ B + C√A  ⇒  A ≤ B + C² + C√B
we get
∀f ∈ F, P f ≤ Pn f + 2 √( Pn f (log S_F(2n) + log(4/δ)) / n ) + 4 (log S_F(2n) + log(4/δ)) / n
O. Bousquet Statistical Learning Theory Lecture 5 100
Zero noise
Ideal situation
gn empirical risk minimizer, t ∈ G
R* = 0 (no noise, n(X) = 0 a.s.)
In that case
Rn(gn) = 0
R(gn) = O(h log n / n).
O. Bousquet Statistical Learning Theory Lecture 5 101
Interpolating between rates ?
Rates are not correctly estimated by this inequality
Consequence of the relative error bounds:
P fn ≤ P f* + 2 √( P f* (log S_F(2n) + log(4/δ)) / n ) + 4 (log S_F(2n) + log(4/δ)) / n
The quantity which is small is not P f* but P fn − P f*
But relative error bounds do not apply to differences
O. Bousquet Statistical Learning Theory Lecture 5 102
Definitions
P = P_X × P(Y|X)
regression function: η(x) = E[Y|X = x]
target function: t(x) = sgn η(x)
noise level: n(X) = (1 − |η(X)|)/2
Bayes risk: R* = E[n(X)]
R(g) = E[(1 − η(X))/2] + E[η(X) 1_{g(X)≤0}]
R(g) − R* = E[|η(X)| 1_{g(X)η(X)≤0}]
O. Bousquet Statistical Learning Theory Lecture 5 103
Intermediate noise
Instead of assuming that |η(x)| = 1 (i.e. n(x) = 0, the deterministic case), one can assume that n is well-behaved.
Two kinds of assumptions
n not too close to 1/2
n not often too close to 1/2
O. Bousquet Statistical Learning Theory Lecture 5 104
Massart Condition
For some c > 0, assume
|η(X)| > 1/c almost surely
There is no region where the decision is completely random
Noise bounded away from 1/2
O. Bousquet Statistical Learning Theory Lecture 5 105
Tsybakov Condition
Let α ∈ [0, 1]. Equivalent conditions:

(1) ∃c > 0, ∀g ∈ {−1, 1}^X, P[g(X)η(X) ≤ 0] ≤ c (R(g) − R*)^α

(2) ∃c > 0, ∀A ⊆ X, ∫_A dP(x) ≤ c ( ∫_A |η(x)| dP(x) )^α

(3) ∃B > 0, ∀t ≥ 0, P[|η(X)| ≤ t] ≤ B t^{α/(1−α)}
O. Bousquet Statistical Learning Theory Lecture 5 106
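Condition (3) is easy to check numerically. The sketch below, an illustrative example not taken from the slides, uses η(x) = x with X uniform on [−1, 1], for which P[|η(X)| ≤ t] = t, i.e. α/(1−α) = 1 and B = 1, so the Tsybakov exponent is α = 1/2:

```python
import numpy as np

# Empirical check of Tsybakov's condition (3) for eta(x) = x, X ~ U[-1, 1]:
# P[|eta(X)| <= t] = t = B * t**(alpha/(1-alpha)) with B = 1, alpha = 1/2.
# Illustrative example only (eta, B, alpha are not from the lecture).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=200_000)
eta = X  # regression function

alpha, B = 0.5, 1.0
ts = np.linspace(0.05, 1.0, 20)
probs = np.array([(np.abs(eta) <= t).mean() for t in ts])
bound = B * ts ** (alpha / (1 - alpha))
ok = bool(np.all(probs <= bound + 0.01))  # small slack for sampling noise
print(ok)
```

A noisier η concentrated near 0 would force a smaller α (weaker condition).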
Equivalence
(1) ⇔ (2): Recall R(g) − R* = E[ |η(X)| 1[g(X)η(X)≤0] ]. For each function g, there exists a set A such that 1[A] = 1[gη≤0] (and conversely).

(2) ⇒ (3): Let A = {x : |η(x)| ≤ t}. Then

P[|η| ≤ t] = ∫_A dP(x) ≤ c ( ∫_A |η(x)| dP(x) )^α ≤ c t^α ( ∫_A dP(x) )^α

⇒ P[|η| ≤ t] ≤ c^{1/(1−α)} t^{α/(1−α)}
O. Bousquet Statistical Learning Theory Lecture 5 107
(3) ⇒ (1)

R(g) − R* = E[ |η| 1[gη≤0] ] ≥ t E[ 1[gη≤0] 1[|η|>t] ]
= t P[|η| > t] − t E[ 1[gη>0] 1[|η|>t] ]
≥ t (1 − B t^{α/(1−α)}) − t P[gη > 0]
= t ( P[gη ≤ 0] − B t^{α/(1−α)} )

Take t = ( (1−α) P[gη ≤ 0] / B )^{(1−α)/α} to get

P[gη ≤ 0] ≤ ( B^{1−α} / ( α^α (1−α)^{1−α} ) ) (R(g) − R*)^α
O. Bousquet Statistical Learning Theory Lecture 5 108
Remarks
α is in [0, 1] because

R(g) − R* = E[ |η(X)| 1[gη≤0] ] ≤ E[ 1[gη≤0] ]

α = 0: no condition

α = 1: gives Massart's condition
O. Bousquet Statistical Learning Theory Lecture 5 109
Consequences
Under Massart's condition:

E[ (1[g(X)≠Y] − 1[t(X)≠Y])² ] ≤ c (R(g) − R*)

Under Tsybakov's condition:

E[ (1[g(X)≠Y] − 1[t(X)≠Y])² ] ≤ c (R(g) − R*)^α
O. Bousquet Statistical Learning Theory Lecture 5 110
Relative loss class
F is the loss class associated to G
The relative loss class is defined as

F̃ = { f − f* : f ∈ F }

It satisfies

P f̃² ≤ c (P f̃)^α
O. Bousquet Statistical Learning Theory Lecture 5 111
Finite case
Union bound on F̃ with Bernstein's inequality would give

P fn − P f* ≤ Pn fn − Pn f* + √( 8c (P fn − P f*)^α log(N/δ) / n ) + 4 log(N/δ) / (3n)

Consequence when f* ∈ F (but R* > 0):

P fn − P f* ≤ C ( log(N/δ) / n )^{1/(2−α)}

always better than n^{−1/2} for α > 0
O. Bousquet Statistical Learning Theory Lecture 5 112
Local Rademacher average
Definition
R(F, r) = E[ sup_{f∈F : P f² ≤ r} Rn f ]

Allows to generalize the previous result

Computes the capacity of a small ball in F (functions with small variance)
Under noise conditions, small variance implies small error
O. Bousquet Statistical Learning Theory Lecture 5 113
Sub-root functions
Definition
A function φ : R+ → R+ is sub-root if

φ is non-decreasing

φ is non-negative

φ(r)/√r is non-increasing
O. Bousquet Statistical Learning Theory Lecture 5 114
Sub-root functions
Properties
A sub-root function

is continuous

has a unique (non-zero) fixed point: φ(r*) = r*

[Plot: a sub-root function φ(x) crossing the line y = x at its fixed point]
O. Bousquet Statistical Learning Theory Lecture 5 115
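Because φ(r)/√r is non-increasing, the fixed point can be found by simply iterating r ← φ(r). A minimal sketch, using the toy sub-root function φ(r) = c√r (my choice for illustration, not a bound from the slides), whose fixed point is r* = c²:

```python
import math

# Fixed-point iteration for a sub-root function.
# Sub-root => unique positive fixed point r*, and r <- phi(r) converges
# to it from any r0 > 0. phi(r) = c*sqrt(r) is a toy illustrative choice.
def fixed_point(phi, r0=1.0, tol=1e-12, max_iter=10_000):
    r = r0
    for _ in range(max_iter):
        r_new = phi(r)
        if abs(r_new - r) < tol:
            return r_new
        r = r_new
    return r

c = 2.0
phi = lambda r: c * math.sqrt(r)
r_star = fixed_point(phi, r0=0.1)
print(round(r_star, 6))  # -> 4.0, since c*sqrt(r) = r gives r* = c**2
```

The same iteration applies to the (data-dependent) local Rademacher bounds below, where φ is an empirical sub-root upper bound.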
Star hull
Definition
⋆F = { λf : f ∈ F, λ ∈ [0, 1] }

Properties:

Rn(⋆F, r) is sub-root

Entropy of ⋆F is not much bigger than entropy of F
O. Bousquet Statistical Learning Theory Lecture 5 116
Result
r* fixed point of R(⋆F, r)

Bounded functions:

P f − Pn f ≤ C ( √(r* Var[f]) + (log(1/δ) + log log n) / n )

Consequence for variance related to expectation (Var[f] ≤ c (P f)^α):

P f ≤ C ( Pn f + (r*)^{1/(2−α)} + (log(1/δ) + log log n) / n )
O. Bousquet Statistical Learning Theory Lecture 5 117
Consequences
For VC classes, R(F, r) ≤ C √(r h / n), hence r* ≤ C h / n

Rate of convergence of Pn f to P f is O(1/√n)

But rate of convergence of P fn to P f* is O(1/n^{1/(2−α)})

Only condition is t ∈ G, but this can be removed by SRM/model selection
O. Bousquet Statistical Learning Theory Lecture 5 118
Proof sketch (1)
Talagrand's inequality:

sup_{f∈F} (P f − Pn f) ≤ E[ sup_{f∈F} (P f − Pn f) ] + c √( sup_{f∈F} Var[f] / n ) + c/n

Peeling of the class:

Fk = { f : Var[f] ∈ [x^k, x^{k+1}) }
O. Bousquet Statistical Learning Theory Lecture 5 119
Proof sketch (2)
Application
sup_{f∈Fk} (P f − Pn f) ≤ E[ sup_{f∈Fk} (P f − Pn f) ] + c √( x Var[f] / n ) + c/n

Symmetrization:

∀f ∈ F, P f − Pn f ≤ 2 R(F, x Var[f]) + c √( x Var[f] / n ) + c/n
O. Bousquet Statistical Learning Theory Lecture 5 120
Proof sketch (3)
We need to solve this inequality. Things are simple if R behaves like a square root, hence the sub-root property:

P f − Pn f ≤ 2 √(r* Var[f]) + c √( x Var[f] / n ) + c/n

Variance-expectation: Var[f] ≤ c (P f)^α

Solve in P f
O. Bousquet Statistical Learning Theory Lecture 5 121
Data-dependent version
As in the global case, one can use data-dependent local Rademacher averages:

Rn(F, r) = E_σ[ sup_{f∈F : P f² ≤ r} Rn f ]

Using concentration, one can also get

P f ≤ C ( Pn f + (rn*)^{1/(2−α)} + (log(1/δ) + log log n) / n )

where rn* is the fixed point of a sub-root upper bound of Rn(⋆F, r)
O. Bousquet Statistical Learning Theory Lecture 5 122
Discussion
Improved rates under low noise conditions
Interpolation in the rates

Capacity measure seems local, but depends on all the functions, after appropriate rescaling: each f ∈ F is considered at scale r / P f²
O. Bousquet Statistical Learning Theory Lecture 5 123
Randomized Classifiers
Given G a class of functions
Deterministic: pick a function gn and always use it to predict

Randomized: construct a distribution ρn over G; for each instance to classify, pick g ∼ ρn

Error is averaged over ρn:

R(ρn) = ρn P f,  Rn(ρn) = ρn Pn f
O. Bousquet Statistical Learning Theory Lecture 5 124
Union Bound (1)
Let π be a (fixed) distribution over F.

Recall the refined union bound:

∀f ∈ F, P f − Pn f ≤ √( (log(1/π(f)) + log(1/δ)) / (2n) )

Take expectation with respect to ρn:

ρn P f − ρn Pn f ≤ ρn √( (log(1/π(f)) + log(1/δ)) / (2n) )
O. Bousquet Statistical Learning Theory Lecture 5 125
Union Bound (2)
ρn P f − ρn Pn f ≤ ρn √( (log(1/π(f)) + log(1/δ)) / (2n) )
≤ √( (ρn log(1/π(f)) + log(1/δ)) / (2n) )
≤ √( (K(ρn, π) + H(ρn) + log(1/δ)) / (2n) )

K(ρn, π) = ∫ ρn(f) log( ρn(f)/π(f) ) df : Kullback-Leibler divergence

H(ρn) = − ∫ ρn(f) log ρn(f) df : Entropy
O. Bousquet Statistical Learning Theory Lecture 5 126
PAC-Bayesian Refinement
It is possible to improve the previous bound.
With probability at least 1 − δ,

ρn P f − ρn Pn f ≤ √( (K(ρn, π) + log(4n) + log(1/δ)) / (2n − 1) )

Good if ρn is spread (i.e. large entropy)

Not interesting if ρn = δ_{fn}
O. Bousquet Statistical Learning Theory Lecture 5 127
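The spread-vs-concentrated remark can be made concrete by evaluating the complexity term of the bound for two posteriors over a hypothetical finite class (the class size, n, and δ below are illustrative choices, not values from the lecture):

```python
import numpy as np

# Evaluate the PAC-Bayes complexity term
#   (K(rho_n, pi) + log(4n) + log(1/delta)) / (2n - 1)
# for a uniform prior pi over a hypothetical class of N functions.
# This only computes the right-hand side; it does not verify the bound.

def kl(rho, pi):
    """Kullback-Leibler divergence K(rho, pi) for discrete distributions."""
    rho, pi = np.asarray(rho, float), np.asarray(pi, float)
    mask = rho > 0
    return float(np.sum(rho[mask] * np.log(rho[mask] / pi[mask])))

N, n, delta = 100, 1000, 0.05
pi = np.full(N, 1.0 / N)
rho_spread = np.full(N, 1.0 / N)             # rho_n = pi: K = 0
rho_point = np.zeros(N); rho_point[0] = 1.0  # rho_n = delta_{f_n}: K = log N

term_spread = (kl(rho_spread, pi) + np.log(4 * n) + np.log(1 / delta)) / (2 * n - 1)
term_point = (kl(rho_point, pi) + np.log(4 * n) + np.log(1 / delta)) / (2 * n - 1)
print(term_spread < term_point)  # spread posterior pays no KL cost
```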
Proof (1)
Variational formulation of entropy: for any T,

ρ T(f) ≤ log E_π[ e^{T(f)} ] + K(ρ, π)

Apply it to T(f) = λ (P f − Pn f)²:

λ ρn (P f − Pn f)² ≤ log E_π[ e^{λ(P f − Pn f)²} ] + K(ρn, π)

Markov's inequality: with probability 1 − δ,

λ ρn (P f − Pn f)² ≤ log E E_π[ e^{λ(P f − Pn f)²} ] + K(ρn, π) + log(1/δ)
O. Bousquet Statistical Learning Theory Lecture 5 128
Proof (2)
Fubini:

E E_π[ e^{λ(P f − Pn f)²} ] = E_π E[ e^{λ(P f − Pn f)²} ]

Modified Chernoff bound:

E[ e^{(2n−1)(P f − Pn f)²} ] ≤ 4n

Putting together (λ = 2n − 1):

(2n − 1) ρn (P f − Pn f)² ≤ K(ρn, π) + log 4n + log(1/δ)

Jensen:

(2n − 1) ( ρn (P f − Pn f) )² ≤ (2n − 1) ρn (P f − Pn f)²
O. Bousquet Statistical Learning Theory Lecture 5 129
Lecture 6
Loss Functions
Properties
Consistency
Examples
Losses and noise
O. Bousquet Statistical Learning Theory Lecture 6 130
Motivation (1)
ERM: minimize Σ_{i=1}^n 1[g(Xi)≠Yi] in a set G

Computationally hard

Smoothing:

Replace binary by real-valued functions

Introduce a smooth loss function

Σ_{i=1}^n φ(g(Xi), Yi)
O. Bousquet Statistical Learning Theory Lecture 6 131
Motivation (2)
Hyperplanes in infinite dimension have infinite VC-dimension but finite scale-sensitive dimension (to be defined later)

It is good to have a scale

This scale can be used to give a confidence (i.e. estimate the density)

However, losses do not need to be related to densities

Can get bounds in terms of margin error instead of empirical error (smoother, easier to optimize for model selection)
O. Bousquet Statistical Learning Theory Lecture 6 132
Margin
It is convenient to work with (symmetry of +1 and −1):

φ(g(x), y) = φ(y g(x))

y g(x) is the margin of g at (x, y)

Loss:

L(g) = E[ φ(Y g(X)) ],  Ln(g) = (1/n) Σ_{i=1}^n φ(Yi g(Xi))

Loss class: F = { f : (x, y) ↦ φ(y g(x)) : g ∈ G }
O. Bousquet Statistical Learning Theory Lecture 6 133
Minimizing the loss
Decomposition of L(g):

L(g) = (1/2) E[ E[ (1 + η(X)) φ(g(X)) + (1 − η(X)) φ(−g(X)) | X ] ]

Minimization for each x:

H(η) = inf_{α∈R} ( (1 + η) φ(α)/2 + (1 − η) φ(−α)/2 )

L* := inf_g L(g) = E[ H(η(X)) ]
O. Bousquet Statistical Learning Theory Lecture 6 134
Classification-calibrated
A minimal requirement is that the minimizer in H(η) has the correct sign (that of the target t, i.e. that of η).

Definition: φ is classification-calibrated if, for any η ≠ 0,

inf_{α : αη ≤ 0} ( (1 + η) φ(α) + (1 − η) φ(−α) ) > inf_{α∈R} ( (1 + η) φ(α) + (1 − η) φ(−α) )

This means the infimum is achieved for an α of the correct sign (and not for an α of the wrong sign, except possibly for η = 0).
O. Bousquet Statistical Learning Theory Lecture 6 135
Consequences (1)
Results due to (Jordan, Bartlett and McAuliffe 2003):

φ is classification-calibrated iff for all sequences gi and every probability distribution P,

L(gi) → L*  ⇒  R(gi) → R*

When φ is convex (convenient for optimization), φ is classification-calibrated iff it is differentiable at 0 and φ′(0) < 0
O. Bousquet Statistical Learning Theory Lecture 6 136
Consequences (2)
Let H−(η) = inf_{α : αη ≤ 0} ( (1 + η) φ(α)/2 + (1 − η) φ(−α)/2 )

Let ψ be the largest convex function below H−(η) − H(η)

One has

ψ(R(g) − R*) ≤ L(g) − L*
O. Bousquet Statistical Learning Theory Lecture 6 137
Examples (1)
[Plot of the margin losses as functions of y g(x): 0-1, hinge, squared hinge, square, exponential]
O. Bousquet Statistical Learning Theory Lecture 6 138
Examples (2)
Hinge loss: φ(x) = max(0, 1 − x), ψ(x) = x

Squared hinge loss: φ(x) = max(0, 1 − x)², ψ(x) = x²

Square loss: φ(x) = (1 − x)², ψ(x) = x²

Exponential: φ(x) = exp(−x), ψ(x) = 1 − √(1 − x²)
O. Bousquet Statistical Learning Theory Lecture 6 139
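The four losses above can be written down and checked directly; in particular, each is a convex upper bound of the 0-1 loss, which is the property ERM with φ exploits. A minimal sketch:

```python
import numpy as np

# The four margin losses from the slide, as functions of the margin x = y*g(x).
# Each upper bounds the 0-1 loss 1[x <= 0].
hinge = lambda x: np.maximum(0.0, 1.0 - x)
sq_hinge = lambda x: np.maximum(0.0, 1.0 - x) ** 2
square = lambda x: (1.0 - x) ** 2
exp_loss = lambda x: np.exp(-x)
zero_one = lambda x: (x <= 0).astype(float)

x = np.linspace(-2.0, 2.0, 401)
for phi in (hinge, sq_hinge, square, exp_loss):
    assert np.all(phi(x) >= zero_one(x))  # upper bound on the 0-1 loss
print("all four losses dominate the 0-1 loss")
```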
Low noise conditions
Relationship can be improved under low noise conditions

Under Tsybakov's condition with exponent α and constant c,

c (R(g) − R*)^α ψ( (R(g) − R*)^{1−α} / (2c) ) ≤ L(g) − L*

Hinge loss (no improvement):

R(g) − R* ≤ L(g) − L*

Square loss or squared hinge loss:

R(g) − R* ≤ ( 4c (L(g) − L*) )^{1/(2−α)}
O. Bousquet Statistical Learning Theory Lecture 6 140
Estimation error
Recall that Tsybakov's condition implies P f̃² ≤ c (P f̃)^α for the relative loss class (with 0−1 loss)

What happens for the relative loss class associated to φ?

Two possibilities:

Strictly convex loss (can modify the metric on R)

Piecewise linear loss
O. Bousquet Statistical Learning Theory Lecture 6 141
Strictly convex losses
Noise behavior controlled by the modulus of convexity

Result:

δ( √(P f̃²) / K ) ≤ P f̃ / 2

with K the Lipschitz constant of φ and δ the modulus of convexity of L(g) with respect to ‖f − g‖_{L2(P)}

Not related to the noise exponent
O. Bousquet Statistical Learning Theory Lecture 6 142
Piecewise linear losses
Noise behavior related to the noise exponent

Result for the hinge loss:

P f̃² ≤ C (P f̃)^α

if the initial class G is uniformly bounded
O. Bousquet Statistical Learning Theory Lecture 6 143
Examples
Under Tsybakov's condition:

Hinge loss:

R(gn) − R* ≤ L(gn) − L* + C ( (r*)^{1/(2−α)} + (log(1/δ) + log log n) / n )

Squared hinge loss or square loss (ψ(x) = c x², P f̃² ≤ C P f̃):

R(gn) − R* ≤ C ( L(gn) − L* + C ( r* + (log(1/δ) + log log n) / n ) )^{1/2}
O. Bousquet Statistical Learning Theory Lecture 6 145
Classification vs Regression losses
Consider a classification-calibrated function φ

It is a classification loss if L(t) = L*

Otherwise it is a regression loss
O. Bousquet Statistical Learning Theory Lecture 6 146
Classification vs Regression losses
Square, squared hinge, exponential losses:

Noise enters the relationship between risk and loss

Modulus of convexity enters in the estimation error

Hinge loss:

Direct relationship between risk and loss

Noise enters in the estimation error

Approximation term not affected by noise in the second case

Real value does not bring probability information in the second case
O. Bousquet Statistical Learning Theory Lecture 6 147
Lecture 7
Regularization
Formulation
Capacity measures
Computing Rademacher averages
Applications
O. Bousquet Statistical Learning Theory Lecture 7 148
Equivalent problems
Up to the choice of the regularization parameters, the following problems are equivalent:

min_{f∈F} Ln(f) + λ Ω(f)

min_{f∈F : Ln(f) ≤ e} Ω(f)

min_{f∈F : Ω(f) ≤ R} Ln(f)

The solution sets are the same
O. Bousquet Statistical Learning Theory Lecture 7 149
Comments
Computationally, a variant of SRM / of model selection by penalization

One has to choose a regularizer which makes sense

Need a class that is large enough (for universal consistency) but has small balls
O. Bousquet Statistical Learning Theory Lecture 7 150
Rates
To obtain bounds, consider ERM on balls

Relevant capacity is that of balls

Real-valued functions: need a generalization of VC dimension, entropy, or covering numbers

Involves scale-sensitive capacity (takes into account the value and not only the sign)
O. Bousquet Statistical Learning Theory Lecture 7 151
Scale-sensitive capacity
Generalization of VC entropy and VC dimension to real-valued functions

Definition: a set x1, ..., xn is shattered by F (at scale ε) if there exists a function s such that for all choices of αi ∈ {−1, 1}, there exists f ∈ F with

αi (f(xi) − s(xi)) ≥ ε

The fat-shattering dimension of F at scale ε (denoted vc(F, ε)) is the maximum cardinality of a shattered set
O. Bousquet Statistical Learning Theory Lecture 7 152
Link with covering numbers
Like the VC dimension, the fat-shattering dimension can be used to upper bound covering numbers

Result:

N(F, t, n) ≤ ( C1 / t )^{C2 vc(F, C3 t)}

Note that one can also define data-dependent versions (restriction on the sample)
O. Bousquet Statistical Learning Theory Lecture 7 153
Link with Rademacher averages (1)
Consequence of covering number estimates:

Rn(F) ≤ (C1/√n) ∫_0^∞ √( vc(F, t) log(C2/t) ) dt

Another link via Gaussian averages (replace the Rademacher variables by Gaussian N(0,1) variables):

Gn(F) = E_g[ sup_{f∈F} (1/n) Σ_{i=1}^n gi f(Zi) ]
O. Bousquet Statistical Learning Theory Lecture 7 154
Link with Rademacher averages (2)
Worst case average:

ℓn(F) = sup_{x1,...,xn} E[ sup_{f∈F} Gn f ]

Associated dimension:

t(F, ε) = sup{ n ∈ N : ℓn(F) ≥ ε }

Result (Mendelson & Vershynin 2003):

vc(F, c ε) ≤ t(F, ε) ≤ (K/ε²) vc(F, c ε)
O. Bousquet Statistical Learning Theory Lecture 7 155
Rademacher averages and Lipschitz losses
What matters is the capacity of F (the loss class)

If φ is Lipschitz with constant M, then

Rn(F) ≤ M Rn(G)

Relates to the Rademacher average of the initial class (easier to compute)
O. Bousquet Statistical Learning Theory Lecture 7 156
Dualization
Consider the problem min_{‖g‖ ≤ R} Ln(g)

Rademacher average of the ball:

E[ sup_{‖g‖ ≤ R} Rn g ]

Duality:

E[ sup_{‖f‖ ≤ R} Rn f ] = (R/n) E‖ Σ_{i=1}^n σi Xi ‖*

with ‖·‖* the dual norm and Xi the evaluation at Xi (an element of the dual under appropriate conditions)
O. Bousquet Statistical Learning Theory Lecture 7 157
RKHS

Given a positive definite kernel k

Space of functions: the reproducing kernel Hilbert space associated to k

Regularizer: the RKHS norm ‖·‖k

Properties: Representer theorem

gn = Σ_{i=1}^n αi k(Xi, ·)
O. Bousquet Statistical Learning Theory Lecture 7 158
Shattering dimension of hyperplanes
Set of functions
G = { g(x) = ⟨w, x⟩ : ‖w‖ = 1 }

Assume ‖x‖ ≤ R

Result:

vc(G, ρ) ≤ R² / ρ²
O. Bousquet Statistical Learning Theory Lecture 7 159
Proof Strategy (Gurvits, 1997)
Assume that x1, . . . , xr are -shattered by hyperplanes with w = 1,i.e., for all y1, . . . , yr {1}, there exists a w such that
yi ⟨w, xi⟩ ≥ ρ for all i = 1, ..., r.  (2)

Two steps:

prove that the more points we want to shatter (2), the larger ‖Σ_{i=1}^r yi xi‖ must be

upper bound the size of ‖Σ_{i=1}^r yi xi‖ in terms of R

Combining the two tells us how many points we can at most shatter
O. Bousquet Statistical Learning Theory Lecture 7 160
Part I
Summing (2) yields ⟨w, Σ_{i=1}^r yi xi⟩ ≥ r ρ

By the Cauchy-Schwarz inequality:

⟨w, Σ_{i=1}^r yi xi⟩ ≤ ‖w‖ ‖Σ_{i=1}^r yi xi‖ = ‖Σ_{i=1}^r yi xi‖

Combine both:

r ρ ≤ ‖Σ_{i=1}^r yi xi‖.  (3)
O. Bousquet Statistical Learning Theory Lecture 7 161
Part II
Consider the labels yi ∈ {−1, 1} as (Rademacher) random variables:

E‖Σ_{i=1}^r yi xi‖² = E[ Σ_{i,j=1}^r yi yj ⟨xi, xj⟩ ]
= Σ_{i=1}^r E[⟨xi, xi⟩] + E[ Σ_{i≠j} yi yj ⟨xi, xj⟩ ]
= Σ_{i=1}^r ‖xi‖²
O. Bousquet Statistical Learning Theory Lecture 7 162
Part II, ctd.
Since ‖xi‖ ≤ R, we get E‖Σ_{i=1}^r yi xi‖² ≤ r R².

This holds for the expectation over the random choices of the labels, hence there must be at least one set of labels for which it also holds true. Use this set.

Hence

‖Σ_{i=1}^r yi xi‖² ≤ r R².
O. Bousquet Statistical Learning Theory Lecture 7 163
Part I and II Combined
Part I: (r ρ)² ≤ ‖Σ_{i=1}^r yi xi‖²

Part II: ‖Σ_{i=1}^r yi xi‖² ≤ r R²

Hence

r² ρ² ≤ r R²,

i.e.,

r ≤ R² / ρ²
O. Bousquet Statistical Learning Theory Lecture 7 164
Boosting
Given a class H of functions

Space of functions: the linear span of H

Regularizer: 1-norm of the weights, ‖g‖ = inf{ Σ |αi| : g = Σ αi hi }

Properties: weight concentrated on the (weighted) margin maximizers

gn = Σ_h wh h, with wh > 0 only for h such that Σ_i di Yi h(Xi) = min_{h′∈H} Σ_i di Yi h′(Xi)
O. Bousquet Statistical Learning Theory Lecture 7 165
Rademacher averages for boosting
Function class of interest:

GR = { g ∈ span H : ‖g‖1 ≤ R }

Result:

Rn(GR) = R Rn(H)

Capacity (as measured by global Rademacher averages) not affected by taking linear combinations!
O. Bousquet Statistical Learning Theory Lecture 7 166
Lecture 8
SVM
Computational aspects
Capacity Control
Universality
Special case of RBF kernel
O. Bousquet Statistical Learning Theory Lecture 8 167
Formulation (1)
Soft margin:

min_{w,b,ξ} (1/2)‖w‖² + C Σ_{i=1}^m ξi
s.t. yi(⟨w, xi⟩ + b) ≥ 1 − ξi, ξi ≥ 0

Convex objective function and convex constraints

Unique solution

Efficient procedures to find it

Is it the right criterion?
O. Bousquet Statistical Learning Theory Lecture 8 168
Formulation (2)

Soft margin:

min_{w,b,ξ} (1/2)‖w‖² + C Σ_{i=1}^m ξi
s.t. yi(⟨w, xi⟩ + b) ≥ 1 − ξi, ξi ≥ 0

Optimal value of ξi:

ξi = max(0, 1 − yi(⟨w, xi⟩ + b))

Substitute above to get

min_{w,b} (1/2)‖w‖² + C Σ_{i=1}^m max(0, 1 − yi(⟨w, xi⟩ + b))
O. Bousquet Statistical Learning Theory Lecture 8 169
Regularization
General form of regularization problem:

min_{f∈F} (1/m) Σ_{i=1}^m c(yi f(xi)) + λ‖f‖²

Capacity control by regularization with a convex cost

[Plot: convex cost as a function of the margin]
O. Bousquet Statistical Learning Theory Lecture 8 170
Loss Function
φ(Y f(X)) = max(0, 1 − Y f(X))

Convex, non-increasing, upper bounds 1[Y f(X) ≤ 0]

Classification-calibrated

Classification type (L* = L(t)):

R(g) − R* ≤ L(g) − L*
O. Bousquet Statistical Learning Theory Lecture 8 171
Regularization
Choosing a kernel corresponds to:

Choose a sequence (ak)

Set

‖f‖² := Σ_{k≥0} ak ∫ |f^{(k)}(x)|² dx

Penalization of high order derivatives (high frequencies)

Enforces smoothness of the solution
O. Bousquet Statistical Learning Theory Lecture 8 172
Capacity: VC dimension

The VC dimension of the set of hyperplanes is d + 1 in R^d.

Dimension of the feature space? ∞ for the RBF kernel

w chosen in the span of the data (w = Σ αi yi xi)

The span of the data has dimension m for the RBF kernel (the k(·, xi) are linearly independent)

The VC bound does not give any information: h/m = 1

Need to take the margin into account
O. Bousquet Statistical Learning Theory Lecture 8 173
Capacity: Shattering dimension
Hyperplanes with margin: if ‖x‖ ≤ R,

vc(hyperplanes with margin ρ, 1) ≤ R² / ρ²
O. Bousquet Statistical Learning Theory Lecture 8 174
Margin
The shattering dimension is related to the margin

Maximizing the margin means minimizing the shattering dimension

Small shattering dimension ⇒ good control of the risk

This control is automatic (no need to choose the margin beforehand)

But requires tuning of the regularization parameter
O. Bousquet Statistical Learning Theory Lecture 8 175
Rademacher Averages (2)
E[ sup_{‖w‖≤M} (1/n) Σ_{i=1}^n σi ⟨w, xi⟩ ]
= E[ sup_{‖w‖≤M} ⟨w, (1/n) Σ_{i=1}^n σi xi⟩ ]
≤ E[ sup_{‖w‖≤M} ‖w‖ ‖(1/n) Σ_{i=1}^n σi xi‖ ]
= (M/n) E √( ⟨Σ_{i=1}^n σi xi, Σ_{i=1}^n σi xi⟩ )
O. Bousquet Statistical Learning Theory Lecture 8 177
Rademacher Averages (3)
(M/n) E √( ⟨Σ_{i=1}^n σi xi, Σ_{i=1}^n σi xi⟩ )
≤ (M/n) √( E ⟨Σ_{i=1}^n σi xi, Σ_{i=1}^n σi xi⟩ )
= (M/n) √( E Σ_{i,j} σi σj ⟨xi, xj⟩ )
= (M/n) √( Σ_{i=1}^n k(xi, xi) )
O. Bousquet Statistical Learning Theory Lecture 8 178
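The trace bound above can be checked by Monte Carlo for the linear kernel k(x, x′) = ⟨x, x′⟩: the supremum over ‖w‖ ≤ M of ⟨w, v⟩ is M‖v‖, so the Rademacher average can be estimated directly and compared to (M/n)√(Σ k(xi, xi)). Toy data and M = 1 below are illustrative choices:

```python
import numpy as np

# Monte Carlo check of the slide's chain:
# E sup_{||w||<=M} (1/n) sum sigma_i <w, x_i> = (M/n) E ||sum sigma_i x_i||
#                                            <= (M/n) sqrt(sum_i k(x_i, x_i))
# with the linear kernel k(x, x') = <x, x'>.
rng = np.random.default_rng(0)
n, d, M = 50, 5, 1.0
X = rng.normal(size=(n, d))

trace_bound = (M / n) * np.sqrt(np.sum(np.einsum('ij,ij->i', X, X)))

sigmas = rng.choice([-1.0, 1.0], size=(5000, n))
# sup over ||w|| <= M of <w, v> is M * ||v||, with v = (1/n) sum sigma_i x_i
sup_values = M * np.linalg.norm(sigmas @ X, axis=1) / n
rademacher_mc = sup_values.mean()
print(rademacher_mc <= trace_bound)  # -> True (Jensen's inequality)
```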
Improved rates: noise condition

Under Massart's condition (|η| > η0 > 0), with ‖g‖∞ ≤ M:

E[ (φ(Y g(X)) − φ(Y t(X)))² ] ≤ (M + 1 + 2/η0) (L(g) − L*)

If the noise is nice, the variance is linearly related to the expectation

Estimation error of order r* (of the class G)
O. Bousquet Statistical Learning Theory Lecture 8 179
Improved rates: capacity (1)

rn* related to the decay of the eigenvalues of the Gram matrix:

rn* ≤ (c/n) min_{d∈N} ( d + √( n Σ_{j>d} λj ) )

Note that d = 0 gives the trace bound

rn* always better than the trace bound (equality when the λi are constant)
O. Bousquet Statistical Learning Theory Lecture 8 180
Improved rates: capacity (2)

Example: exponential decay

λi = e^{−γ i}

Global Rademacher average of order 1/√n

rn* of order (log n)/n
O. Bousquet Statistical Learning Theory Lecture 8 181
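The gain over the trace bound is easy to see numerically: for exponentially decaying eigenvalues, optimizing over d in the bound of the previous slide gives roughly log(n)/n instead of 1/√n. A minimal sketch (c = 1, γ = 1, and n below are illustrative choices):

```python
import numpy as np

# Compare the trace bound (the d = 0 term) with
#   r_n* <= (c/n) * min_d ( d + sqrt( n * sum_{j>d} lambda_j ) )
# for exponentially decaying eigenvalues lambda_i = exp(-gamma * i).
n, gamma, c = 10_000, 1.0, 1.0
lam = np.exp(-gamma * np.arange(1, 200))

tails = np.concatenate(([lam.sum()], lam.sum() - np.cumsum(lam)))  # sum_{j>d}
ds = np.arange(len(tails))
bounds = (c / n) * (ds + np.sqrt(np.maximum(n * tails, 0.0)))

trace_bound = bounds[0]    # d = 0 term, order 1/sqrt(n)
best_bound = bounds.min()  # optimized over d, order log(n)/n here
print(best_bound <= trace_bound)  # -> True
```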
Exponent of the margin
Estimation error analysis shows that in GM = {g : ‖g‖ ≤ M},

R(gn) − R(g*) ≤ M ...

Wrong power (M² penalty) is used in the algorithm:

Computationally easier

But does not give a dimension-free status

Using M could improve the cutoff detection
O. Bousquet Statistical Learning Theory Lecture 8 182
Kernel
Why is it good to use kernels?

Gaussian kernel (RBF):

k(x, y) = e^{−‖x−y‖²/(2σ²)}

σ is the width of the kernel

What is the geometry of the feature space?
O. Bousquet Statistical Learning Theory Lecture 8 183
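Two geometric facts used in the following slides can be checked directly: since k(x, x) = 1, every input is mapped onto the unit sphere of the feature space, and ‖Φ(x) − Φ(y)‖² = 2 − 2 k(x, y) < 2, so distances are contracted and images are never antipodal. A minimal sketch (the sample points are arbitrary):

```python
import numpy as np

# Geometry of the RBF feature space: k(x, x) = 1 puts every point on the
# unit sphere, and ||Phi(x) - Phi(y)||^2 = 2 - 2*k(x, y) lies in (0, 2).
def rbf(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

x, y = np.array([0.0, 1.0]), np.array([2.0, -1.0])
assert rbf(x, x) == 1.0                # unit norm: image lies on the sphere
dist_sq = 2 - 2 * rbf(x, y)            # squared feature-space distance
print(0.0 < dist_sq < 2.0)  # -> True: contracted, never antipodal
```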
RBF
O. Bousquet Statistical Learning Theory Lecture 8 185
RBF
Differential Geometry
Flat Riemannian metric:

distance along the sphere is equal to distance in input space

Distances are contracted:

shortcuts by getting outside the sphere
O. Bousquet Statistical Learning Theory Lecture 8 186
RBF
Geometry of the span
[Figure: ellipsoid x²/a² + y²/b² = 1 with semi-axes a, b]

K = (k(xi, xj)) Gram matrix

Eigenvalues λ1, ..., λm

Data points mapped to an ellipsoid with semi-axis lengths √λ1, ..., √λm
O. Bousquet Statistical Learning Theory Lecture 8 187
RBF
Universality
Consider the set of functions

H = span{ k(x, ·) : x ∈ X }

H is dense in C(X)

Any continuous function can be approximated (in the sup norm) by functions in H

⇒ with enough data one can construct any function
O. Bousquet Statistical Learning Theory Lecture 8 188
RBF
Eigenvalues
Exponentially decreasing

Fourier domain: exponential penalization of derivatives

Enforces smoothness with respect to the Lebesgue measure in input space
O. Bousquet Statistical Learning Theory Lecture 8 189
RBF
Induced Distance and Flexibility
σ → 0:

1-nearest-neighbor in input space

Each point in a separate dimension, everything orthogonal

σ → ∞:

Linear classifier in input space

All points very close on the sphere, initial geometry

Tuning σ allows to try all possible intermediate combinations
O. Bousquet Statistical Learning Theory Lecture 8 190
RBF
Ideas
Works well if the Euclidean distance is good

Works well if the decision boundary is smooth

Adapt smoothness via σ

Universal
O. Bousquet Statistical Learning Theory Lecture 8 191
Learning Theory: some informal thoughts
Need assumptions/restrictions to learn

Data cannot replace knowledge

No universal learning (simplicity measure)

SVMs work because of capacity control

Choice of kernel = choice of prior/regularizer

RBF works well if the Euclidean distance is meaningful

Knowledge improves performance (e.g. invariances)
O. Bousquet Statistical Learning Theory Lecture 8 193