
Some theory of machine learning

Seoul National University, Deep Learning, September–December 2019


Setup

Consider data $(x_i, y_i)$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{0, 1\}$. Given a classifier $f : \mathbb{R}^d \to \{0, 1\}$, we would like to minimize the risk $R(f) = P(Y \neq f(X))$.

We obtain data $(x_i, y_i)$, $i = 1, \dots, n$. An estimator of $R(f)$ is $\hat{R}(f) = n^{-1}\sum_{i=1}^n I(Y_i \neq f(X_i))$, which we call the empirical risk.

Learning $f$ (empirical risk minimization): we want to find $f$ that makes $R(f)$ small. Suppose we choose $\hat{f}$ which minimizes the empirical risk over $\mathcal{F}$, i.e., $\hat{R}(\hat{f}) = \min_{f \in \mathcal{F}} \hat{R}(f)$.

Optimal $f$ in $\mathcal{F}$: let $f^*$ be the optimal classifier in $\mathcal{F}$ in the sense that it minimizes the true risk over $\mathcal{F}$, i.e., $R(f^*) = \min_{f \in \mathcal{F}} R(f)$.

$f$ may not be in $\mathcal{F}$: let $f^{**}$ be the true classifier that minimizes the true risk over all classifiers, i.e., $R(f^{**}) = \min_f R(f)$.
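
As a concrete illustration of this setup (an addition, not from the slides), here is a minimal Python sketch of empirical risk minimization over a small finite class of threshold classifiers; the data-generating distribution and the class $\mathcal{F}$ are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: x ~ Uniform(0, 1), y = I(x > 0.3) with 10% label noise (illustrative assumption).
n = 200
x = rng.uniform(0, 1, n)
noise = rng.uniform(size=n) < 0.1
y = ((x > 0.3).astype(int) + noise.astype(int)) % 2

# Finite class F of threshold classifiers f_a(x) = I(x > a).
thresholds = np.linspace(0.0, 1.0, 21)

def empirical_risk(a, x, y):
    """hat R(f_a) = n^{-1} sum_i I(y_i != f_a(x_i))."""
    return np.mean((x > a).astype(int) != y)

# Empirical risk minimization: f_hat minimizes the empirical risk over F.
risks = np.array([empirical_risk(a, x, y) for a in thresholds])
a_hat = thresholds[np.argmin(risks)]
print(f"ERM threshold a_hat = {a_hat:.2f}, empirical risk = {risks.min():.3f}")
```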


Areas of theoretical studies related to Deep Learning

$R(\hat{f}) - R(f^{**}) = \{R(\hat{f}) - R(f^*)\} + \{R(f^*) - R(f^{**})\}$. The term $\{R(f^*) - R(f^{**})\}$ is the approximation error.

$R(\hat{f}) - R(f^*)$ is the estimation error.

Theoretical work on deep learning addresses (i) approximation, (ii) optimization, and (iii) generalization error. Work on approximation helps explain the expressiveness of deep models, but it is not our focus in this course. We will mention some recent findings on (ii) in the optimization section. In this section we focus on the generalization error.


Excess risk

$R(\hat{f}) - R(f^{**}) = \{R(\hat{f}) - R(f^*)\} + \{R(f^*) - R(f^{**})\}$, where $\{R(f^*) - R(f^{**})\}$ is the approximation error.

$R(\hat{f}) - R(f^*)$ is the estimation error; we would like to bound it.

$$
\begin{aligned}
R(\hat{f}) - R(f^*) &= \{R(\hat{f}) - \hat{R}(\hat{f})\} + \{\hat{R}(\hat{f}) - \hat{R}(f^*)\} + \{\hat{R}(f^*) - R(f^*)\} \\
&\le \underbrace{\{R(\hat{f}) - \hat{R}(\hat{f})\}}_{(i)} + \underbrace{\{\hat{R}(f^*) - R(f^*)\}}_{(ii)} \\
&\le 2\sup_{f \in \mathcal{F}} |\hat{R}(f) - R(f)|,
\end{aligned}
$$
where the first inequality holds because $\hat{R}(\hat{f}) - \hat{R}(f^*) \le 0$ ($\hat{f}$ minimizes the empirical risk over $\mathcal{F}$).

To bound (ii), concentration of measure can be invoked because $f^*$ is a fixed function; this does not work for (i), since $\hat{f}$ depends on the data. This is why we need a bound that holds uniformly over $\mathcal{F}$.


Hoeffding’s Inequality

Hoeffding's inequality: if $Z_1, \dots, Z_n$ are independent with $P(a_i \le Z_i \le b_i) = 1$, then for any $\varepsilon > 0$,
$$P(|\bar{Z}_n - \mu| > \varepsilon) \le 2 e^{-2n\varepsilon^2/c},$$
where $c = n^{-1}\sum_{i=1}^n (b_i - a_i)^2$ and $\bar{Z}_n = n^{-1}\sum_{i=1}^n Z_i$.

Applied to $Z_i = I(Y_i \neq f(X_i)) \in [0, 1]$ for a fixed $f$, Hoeffding's inequality implies
$$P(|R(f) - \hat{R}(f)| > \varepsilon) \le 2 e^{-2n\varepsilon^2},$$
and with probability at least $1 - \delta$,
$$|R(f) - \hat{R}(f)| \le \sqrt{\frac{1}{2n}\log\left(\frac{2}{\delta}\right)}.$$
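
A quick numerical sanity check (an added illustration, not part of the slides): simulate the empirical risk of a fixed classifier as a Bernoulli mean and compare the observed tail probability with the Hoeffding bound $2e^{-2n\varepsilon^2}$.

```python
import numpy as np

rng = np.random.default_rng(1)

n, R, eps, reps = 100, 0.3, 0.1, 20_000   # R plays the role of the true risk R(f); values are illustrative
# Each replicate draws n i.i.d. 0-1 losses and computes hat R(f) as their mean.
hat_R = rng.binomial(1, R, size=(reps, n)).mean(axis=1)

empirical_tail = np.mean(np.abs(hat_R - R) > eps)
hoeffding_bound = 2 * np.exp(-2 * n * eps**2)
print(f"empirical P(|hat R - R| > eps) = {empirical_tail:.4f}")
print(f"Hoeffding bound                = {hoeffding_bound:.4f}")
```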


Uniform bound

Let $|\mathcal{F}| = M$ and write $\mathcal{F} = \{f_1, \dots, f_M\}$. For $m = 1, \dots, M$, let $B_m$ be the event $|\hat{R}(f_m) - R(f_m)| > \varepsilon$. By the union bound,
$$P(\cup_{m=1}^M B_m) = P\left(\sup_{f_m \in \mathcal{F}} |\hat{R}(f_m) - R(f_m)| > \varepsilon\right) \le 2M e^{-2n\varepsilon^2}.$$
We obtain that, with probability at least $1 - \delta$,
$$\sup_{f_m \in \mathcal{F}} |\hat{R}(f_m) - R(f_m)| \le \sqrt{\frac{\log(2M/\delta)}{2n}}.$$

The uniform bound relies on finite $|\mathcal{F}|$; this argument does not go through if $|\mathcal{F}|$ is infinite. To handle infinite $|\mathcal{F}|$, we present two approaches: one considers 'the projection of $\mathcal{F}$ on the data', and the other bounds $\sup_{f \in \mathcal{F}} |\hat{R}(f) - R(f)|$ all at once. In the first approach we introduce the shattering number (or growth function) and the Vapnik–Chervonenkis (VC) dimension. In the second approach we introduce McDiarmid's inequality and Rademacher complexity (Koltchinskii and Panchenko, 2002).
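
Before moving on, a sense of scale for the finite-class bound above (an added illustration; the numbers are arbitrary): the bound grows only logarithmically in $M$.

```python
import numpy as np

def finite_class_bound(M, n, delta):
    """Uniform deviation bound sqrt(log(2M/delta) / (2n)) for a class with |F| = M."""
    return np.sqrt(np.log(2 * M / delta) / (2 * n))

for M in [10, 1_000, 1_000_000]:
    print(f"M = {M:>9}: bound = {finite_class_bound(M, n=1000, delta=0.05):.4f}")
```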


Measures of complexity: Shattering number

Let $\chi$ be a set and let $\mathcal{F}$ be a class of binary functions on $\chi$, $f : \chi \to \{0, 1\}$. For points $z = \{z_1, \dots, z_n\} \subset \chi$, define the projection of $\mathcal{F}$ on $z$ by $\mathcal{F}(z_1, \dots, z_n) = \{(f(z_1), \dots, f(z_n)) : f \in \mathcal{F}\}$. Note that $\mathcal{F}(z_1, \dots, z_n)$ is a finite collection of vectors: $|\mathcal{F}|$ can be infinite, but $|\mathcal{F}(z_1, \dots, z_n)| \le 2^n$.

e.g. 1: $f(z) = I(z > a)$ with $z_1 < z_2 < z_3$. Then $\mathcal{F}(z_1, z_2, z_3) = \{(0,0,0), (0,0,1), (0,1,1), (1,1,1)\}$.
e.g. 2: $f(z) = I(a < z < b)$ with $z_1 < z_2 < z_3$. Then $\mathcal{F}(z_1, z_2, z_3) = \{(0,0,0), (1,0,0), (1,1,0), (0,0,1), (0,1,1), (1,1,1), (0,1,0)\}$.
Definition (shattering number, growth function): the maximum number of distinct dichotomies realizable on any $n$ points,
$$m_{\mathcal{F}}(n) = \max_{z_1, \dots, z_n} |\mathcal{F}(z_1, \dots, z_n)|.$$
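
A brute-force computation of the projection for the two examples (an added illustration; the point values and candidate parameters are arbitrary choices):

```python
from itertools import product

def projection(classifiers, points):
    """Distinct dichotomies {(f(z_1), ..., f(z_n)) : f in F} realized on the points."""
    return {tuple(f(z) for z in points) for f in classifiers}

points = [0.2, 0.5, 0.8]   # z1 < z2 < z3

# e.g. 1: thresholds f(z) = I(z > a); on finitely many points a small grid of a-values suffices.
thresholds = [lambda z, a=a: int(z > a) for a in [-1.0, 0.3, 0.6, 1.0]]
print(sorted(projection(thresholds, points)))   # 4 dichotomies = n + 1

# e.g. 2: intervals f(z) = I(a < z < b).
grid = [-1.0, 0.1, 0.3, 0.6, 0.9, 1.5]
intervals = [lambda z, a=a, b=b: int(a < z < b) for a, b in product(grid, grid) if a < b]
print(len(projection(intervals, points)))       # 7 dichotomies = n(n+1)/2 + 1
```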


Measures of complexity: Shattering number

By definition, the shattering number satisfies $m_{\mathcal{F}}(n) \le 2^n$.

For e.g. 1, $\mathcal{F}$ is the set of $f : \mathbb{R} \to \{0,1\}$ with $f(z) = I(z > a)$. When $n = 3$ with $z_1 < z_2 < z_3$, $\mathcal{F}(z_1, z_2, z_3) = \{(0,0,0), (0,0,1), (0,1,1), (1,1,1)\}$. Here $m_{\mathcal{F}}(n) = n + 1$.

For e.g. 2, $\mathcal{F}$ is the set of $f : \mathbb{R} \to \{0,1\}$ with $f(z) = I(a < z < b)$. When $n = 3$ with $z_1 < z_2 < z_3$, $\mathcal{F}(z_1, z_2, z_3) = \{(0,0,0), (1,0,0), (1,1,0), (0,0,1), (0,1,1), (1,1,1), (0,1,0)\}$. Here $m_{\mathcal{F}}(n) = n(n+1)/2 + 1$.

e.g. 3: $\mathcal{F}$ is the set of $f : \mathbb{R}^2 \to \{0,1\}$ with $f(z) = I(z \in C)$ for a convex region $C$. Here $m_{\mathcal{F}}(n) = 2^n$.


Measures of complexity: VC dimension

The VC dimension of $\mathcal{F}$ is the largest value of $n$ for which $m_{\mathcal{F}}(n) = 2^n$, i.e., $d_{VC}(\mathcal{F}) = \sup\{n : m_{\mathcal{F}}(n) = 2^n\}$; equivalently, it is the size of the largest set that can be shattered.

Break point: the smallest number of data points for which we cannot get all possible dichotomies (= VC dimension + 1).

Table: VC dimensions

Function class                  VC dimension
Interval [a, b]                 2
Disc in R^2                     3
Half-spaces in R^d              d + 1
Convex polygons in R^2          ∞
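
A brute-force check (an added illustration) that the interval class shatters any 2 points but no set of 3 points, consistent with the table entry of VC dimension 2:

```python
from itertools import product

def interval_patterns(points):
    """Dichotomies realized by f(z) = I(a < z < b), using endpoints around and between the points."""
    pts = sorted(points)
    mids = [p + 0.5 * (q - p) for p, q in zip(pts, pts[1:])]
    cuts = [pts[0] - 1.0] + mids + [pts[-1] + 1.0]
    return {tuple(int(a < z < b) for z in points) for a, b in product(cuts, cuts) if a <= b}

def shattered(points):
    return len(interval_patterns(points)) == 2 ** len(points)

print(shattered([0.1, 0.7]))        # True: any 2 points can be shattered
print(shattered([0.1, 0.4, 0.7]))   # False: (1, 0, 1) is unreachable, so the VC dimension is 2
```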


Measures of complexity: Sauer’s Theorem

The VC dimension (or break point), regardless of the specific $\mathcal{F}$, gives information about $m_{\mathcal{F}}(n)$. If we know the break point is $k$, we know that no $k$ of the $n$ points can exhibit all possible patterns; e.g., $n = 3$, $k = 2$ below. This observation gives an upper bound for $m_{\mathcal{F}}(n)$.

Group   x1   x2   x3
G1      0    0    0
G2      0    0    1
G2      0    1    0
G2      1    0    0

Since $k = 2$, no two columns can exhibit all possible dichotomies.

Let $B(n, k)$ be the maximum number of patterns on $n$ points with break point $k$. When we consider $x_1$ and $x_2$ only, $G_2$ represents a set with distinct patterns; the size of this set is at most $B(n-1, k)$. For $G_1$, since $x_3$ takes both possible values on the corresponding prefix, the size is at most $B(n-1, k-1)$.


Measures of complexity: Sauer’s Theorem

When the break point is $k$, $|\mathcal{F}(z_1, \dots, z_n)| \le B(n-1, k) + B(n-1, k-1) = B(n, k)$. Using this recursion, the following theorem holds.

Sauer's Theorem: suppose that $\mathcal{F}$ has finite VC dimension $d$. Then
$$m_{\mathcal{F}}(n) \le \sum_{i=0}^{d}\binom{n}{i},$$
and for all $n \ge d$,
$$m_{\mathcal{F}}(n) \le \left(\frac{en}{d}\right)^d.$$

If $m_{\mathcal{F}}(n)$ is ever less than $2^n$ (i.e., the VC dimension is finite), then $m_{\mathcal{F}}(n)$ is polynomial in $n$; there is nothing in between.
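
A small numerical illustration (added here) of how the Sauer bound compares with $2^n$; the values of $n$ and $d$ are arbitrary:

```python
from math import comb, e

def sauer_bound(n, d):
    """Sauer's bound sum_{i=0}^{d} C(n, i) on the growth function m_F(n)."""
    return sum(comb(n, i) for i in range(d + 1))

n, d = 100, 3
print(f"sum_(i<=d) C(n,i) = {sauer_bound(n, d)}")
print(f"(e*n/d)^d         = {(e * n / d) ** d:.1f}")
print(f"2^n               = {2 ** n}")
```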


Uniform bounds using shattering number

(Vapnik and Chervonenkis) Let $\mathcal{F}$ be a class of binary functions. For any $t > \sqrt{2/n}$,
$$P\left(\sup_{f \in \mathcal{F}} |P_n(f) - P(f)| > t\right) \le 4\, m_{\mathcal{F}}(2n)\, e^{-nt^2/8},$$
and hence, with probability at least $1 - \delta$,
$$\sup_{f \in \mathcal{F}} |P_n(f) - P(f)| \le \sqrt{\frac{8}{n}\log\left(\frac{4\, m_{\mathcal{F}}(2n)}{\delta}\right)},$$
where $P_n(f) = \frac{1}{n}\sum_{i=1}^n f(x_i)$ and $P(f) = \int f(x)\, dP(x)$.

The symmetrization technique can be used in proving the VCtheorem.


Symmetrization Lemma

Denote the empirical distribution of $x_1, \dots, x_n$ from $P$ by $P_n$. Let $x_1', \dots, x_n'$ denote a second independent sample from $P$ and let $P_n'$ denote the empirical distribution of the second sample. For all $t > \sqrt{2/n}$,
$$P\left(\sup_{f \in \mathcal{F}} |P_n(f) - P(f)| > t\right) \le 2\, P\left(\sup_{f \in \mathcal{F}} |P_n(f) - P_n'(f)| > t/2\right).$$

The second sample, $x_1', \dots, x_n'$, is called a ghost sample.

If $|P_n(f) - P(f)| > t$ and $|P(f) - P_n'(f)| \le t/2$, then $|P_n'(f) - P_n(f)| > t/2$ (triangle inequality). This observation is used in the proof.

The importance of the result is that it bounds $\sup_{f \in \mathcal{F}} |P_n(f) - P(f)|$ by $\sup_{f \in \mathcal{F}} |P_n(f) - P_n'(f)|$, which depends on $f$ only through its values on the $2n$ sample points. The supremum is therefore effectively over a finite set whose size is controlled by the growth function.


Uniform bounds using VC dimension

Recall $m_{\mathcal{F}}(n) \le \left(\frac{en}{d}\right)^d$. Replacing $m_{\mathcal{F}}(n)$ with $\left(\frac{en}{d}\right)^d$ in the VC inequality, we obtain, for any $t > \sqrt{2/n}$,
$$P\left(\sup_{f \in \mathcal{F}} |P_n(f) - P(f)| > t\right) \le 4\left(\frac{en}{d}\right)^d e^{-nt^2/8},$$
and hence, with probability at least $1 - \delta$,
$$\sup_{f \in \mathcal{F}} |P_n(f) - P(f)| \le \sqrt{\frac{8}{n}\left(\log\left(\frac{4}{\delta}\right) + d\log\left(\frac{ne}{d}\right)\right)},$$
where $P_n(f) = \frac{1}{n}\sum_{i=1}^n f(x_i)$ and $P(f) = \int f(x)\, dP(x)$.
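
Numerically, this bound decays like $\sqrt{d\log n / n}$ (an added illustration; $d$ and $\delta$ below are arbitrary choices):

```python
import numpy as np

def vc_uniform_bound(n, d, delta):
    """sqrt((8/n) * (log(4/delta) + d * log(n*e/d))), the VC-dimension form of the uniform bound."""
    return np.sqrt(8.0 / n * (np.log(4.0 / delta) + d * np.log(n * np.e / d)))

for n in [10**3, 10**4, 10**5, 10**6]:
    print(f"n = {n:>7}: bound = {vc_uniform_bound(n, d=10, delta=0.05):.4f}")
```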


VC dimension bounds for neural networks

One can measure the complexity or capacity of neural network models by how many configurations they can shatter (the VC dimension).

The capacity of the network, if measured by the number of pieces in a piecewise linear approximation, increases exponentially with depth [Montufar, Pascanu et al., 2014].

These results yield upper bounds on the generalization gap (the difference between true and empirical risk) of deep neural networks.

The bounds might be very pessimistic.


General measures of complexity: Rademacher complexity

Motivation: let $y \in \{-1, 1\}$ and $f : \chi \to \{-1, 1\}$. Then
$$\hat{R}(f) = \frac{1}{m}\sum_{i=1}^m I(f(x_i) \neq y_i) = \frac{1}{2} - \frac{1}{2m}\sum_{i=1}^m y_i f(x_i),$$
so minimizing the training error is the same as maximizing $\frac{1}{m}\sum_{i=1}^m y_i f(x_i)$ over $f$.
Now consider random labels in place of $y$: a bigger model class $\mathcal{F}$ can make $\frac{1}{m}\sum_{i=1}^m y_i f(x_i)$ large as well, even when the labels carry no signal. Rademacher complexity measures exactly this ability to fit random signs.

Definition (Rademacher complexity): let $\sigma_1, \dots, \sigma_n$ be independent random variables with $P(\sigma_i = 1) = P(\sigma_i = -1) = 1/2$. The Rademacher complexity of $\mathcal{F}$ is
$$\mathrm{Rad}_n(\mathcal{F}) = E\left[\sup_{f \in \mathcal{F}}\frac{1}{n}\sum_{i=1}^n \sigma_i f(x_i)\right].$$


General measures of complexity: Rademacher complexity

The empirical Rademacher complexity of $\mathcal{F}$ is
$$\mathrm{Rad}_n(\mathcal{F}, x) = E_\sigma\left[\sup_{f \in \mathcal{F}}\frac{1}{n}\sum_{i=1}^n \sigma_i f(x_i)\right].$$

When $|\mathcal{F}| = 1$, there is no supremum to take and
$$\mathrm{Rad}_n(\mathcal{F}) = E\left[\frac{1}{n}\sum_{i=1}^n \sigma_i f(x_i)\right] = \frac{1}{n}\sum_{i=1}^n E[\sigma_i]\, E[f(x_i)] = 0.$$

When $|\mathcal{F}(x_1, \dots, x_n)| = 2^n$, i.e., all sign patterns $(f(x_1), \dots, f(x_n)) \in \{-1, 1\}^n$ are realized,
$$\mathrm{Rad}_n(\mathcal{F}) = E\left[\sup_{f \in \mathcal{F}}\frac{1}{n}\sum_{i=1}^n \sigma_i f(x_i)\right] = 1,$$
since for every $\sigma$ one can pick $f$ with $f(x_i) = \sigma_i$.
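
A Monte Carlo check of these two extreme cases (an added illustration; the class is represented simply by its matrix of values $(f(x_1), \dots, f(x_n))$):

```python
import numpy as np
from itertools import product

def empirical_rademacher(F_values, reps=20_000, seed=0):
    """Monte Carlo estimate of E_sigma[ sup_f (1/n) sum_i sigma_i f(x_i) ].
    F_values has shape (number of functions, n), one row of values per f."""
    rng = np.random.default_rng(seed)
    n = F_values.shape[1]
    sigma = rng.choice([-1, 1], size=(reps, n))
    return np.mean(np.max(sigma @ F_values.T / n, axis=1))

n = 10
single = np.ones((1, n))                                  # |F| = 1
all_signs = np.array(list(product([-1, 1], repeat=n)))    # all 2^n sign patterns realized
print(round(empirical_rademacher(single), 3))             # ~ 0
print(round(empirical_rademacher(all_signs), 3))          # = 1
```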


Rademacher complexity: Example

Example (ridge regression): let $\mathcal{F}$ be the class of linear predictors $x \mapsto \langle w, x\rangle$ with the restriction $\|w\|_2 \le W_2$. Additionally assume that $\|x_i\|_2 \le X_2$.

$$
\begin{aligned}
\mathrm{Rad}_n(\mathcal{F}, x) &= E_\sigma \sup_{w:\|w\|_2 \le W_2}\frac{1}{n}\sum_{i=1}^n \sigma_i \langle x_i, w\rangle
= \frac{1}{n}\, E_\sigma \sup_{w:\|w\|_2 \le W_2}\left\langle \sum_{i=1}^n \sigma_i x_i,\ w\right\rangle \\
&= \frac{W_2}{n}\, E_\sigma\left\|\sum_{i=1}^n \sigma_i x_i\right\|_2
\le \frac{W_2}{n}\sqrt{E_\sigma\left\|\sum_{i=1}^n \sigma_i x_i\right\|_2^2} \\
&\le \frac{W_2}{n}\sqrt{E_\sigma\sum_{i=1}^n \|\sigma_i x_i\|_2^2}
= \frac{W_2}{n}\sqrt{\sum_{i=1}^n \|x_i\|_2^2}
\le \frac{X_2 W_2}{\sqrt{n}}.
\end{aligned}
$$
(The sup over $\|w\|_2 \le W_2$ is attained at $w$ proportional to $\sum_i \sigma_i x_i$ by Cauchy–Schwarz, the first inequality is Jensen's, and $E_\sigma\|\sum_i \sigma_i x_i\|_2^2 = \sum_i \|x_i\|_2^2$ because the cross terms vanish.)
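
A Monte Carlo comparison of the empirical Rademacher complexity of this linear class with the bound $X_2 W_2/\sqrt{n}$ (an added illustration; the data are arbitrary points scaled to norm $X_2$):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, W2, X2, reps = 200, 5, 2.0, 1.0, 5_000
x = rng.normal(size=(n, d))
x *= X2 / np.linalg.norm(x, axis=1, keepdims=True)        # enforce ||x_i||_2 = X2

# For {x -> <w, x> : ||w||_2 <= W2}, the sup over w is attained in closed form:
# Rad_n(F, x) = (W2 / n) * E_sigma || sum_i sigma_i x_i ||_2.
sigma = rng.choice([-1, 1], size=(reps, n))
rad_hat = W2 / n * np.mean(np.linalg.norm(sigma @ x, axis=1))

print(f"Monte Carlo Rad_n(F, x) ≈ {rad_hat:.4f}")
print(f"bound X2*W2/sqrt(n)     = {X2 * W2 / np.sqrt(n):.4f}")
```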


Properties of Rademacher complexity

(Monotonicity) If $\mathcal{F} \subset \mathcal{G}$ then $\mathrm{Rad}_n(\mathcal{F}, x) \le \mathrm{Rad}_n(\mathcal{G}, x)$.
Proof: $\mathrm{Rad}_n(\mathcal{F}) = E\left[\sup_{f \in \mathcal{F}}\frac{1}{n}\sum_{i=1}^n \sigma_i f(x_i)\right] \le E\left[\sup_{f \in \mathcal{G}}\frac{1}{n}\sum_{i=1}^n \sigma_i f(x_i)\right] = \mathrm{Rad}_n(\mathcal{G})$.

(Convex hull) Let $\mathrm{conv}(\mathcal{F})$ be the convex hull of $\mathcal{F}$. Then $\mathrm{Rad}_n(\mathcal{F}, x) = \mathrm{Rad}_n(\mathrm{conv}(\mathcal{F}), x)$.

(Scale and shift) For any function class $\mathcal{F}$ and $c, d \in \mathbb{R}$, define $c\mathcal{F} + d = \{cf + d : f \in \mathcal{F}\}$. Then $\mathrm{Rad}_n(c\mathcal{F} + d, x) = |c|\,\mathrm{Rad}_n(\mathcal{F}, x)$.

(Lipschitz composition) If $\phi$ is a Lipschitz function such that $|\phi(s) - \phi(t)| \le L|s - t|$ for all $s, t \in \mathrm{dom}(\phi)$, we have $\mathrm{Rad}_n(\phi \circ \mathcal{F}) \le L\,\mathrm{Rad}_n(\mathcal{F})$.


Uniform bounds using Rademacher complexity

With probability at least $1 - \delta$,
$$\sup_{f \in \mathcal{F}} |P_n(f) - P(f)| \le 2\,\mathrm{Rad}_n(\mathcal{F}) + \sqrt{\frac{1}{2n}\log\left(\frac{2}{\delta}\right)},$$
and with probability at least $1 - \delta$,
$$\sup_{f \in \mathcal{F}} |P_n(f) - P(f)| \le 2\,\mathrm{Rad}_n(\mathcal{F}, x) + \sqrt{\frac{2}{n}\log\left(\frac{2}{\delta}\right)}.$$

A proof requires (i) McDiarmid's inequality and (ii) symmetrization. (i) Let $g(x) = \sup_{f \in \mathcal{F}} |P_n(f) - P(f)|$, where $x = \{x_1, \dots, x_i, \dots, x_n\}$, and let $x' = \{x_1, \dots, z_i, \dots, x_n\}$ (the $i$-th coordinate replaced). Then check that $g(x) - g(x')$ is bounded (e.g., by $\frac{1}{n}$ in the binary case). By McDiarmid's inequality, $P(|g(x) - E\,g(x)| > \varepsilon) \le 2e^{-2n\varepsilon^2}$, and with probability at least $1 - \delta$, $g(x) \le E(g(x)) + \sqrt{\frac{1}{2n}\log\frac{2}{\delta}}$. Using (ii), one can show $E(g(x)) \le 2\,\mathrm{Rad}_n(\mathcal{F})$.


McDiarmid’s Inequality

McDiarmid's inequality: let $Z_1, \dots, Z_n$ be independent random variables. Write $x = (x_1, \dots, x_i, \dots, x_n)$ and $x' = (x_1, \dots, z_i, \dots, x_n)$ for two points differing only in the $i$-th coordinate, and suppose that $\sup_{x_1, \dots, x_n, z_i} |f(x) - f(x')| \le c_i$ for $i = 1, \dots, n$. Then
$$P\left(|f(Z_1, \dots, Z_n) - E[f(Z_1, \dots, Z_n)]| \ge \varepsilon\right) \le 2\exp\left(-\frac{2\varepsilon^2}{\sum_{i=1}^n c_i^2}\right).$$

If $f(x_1, \dots, x_i, \dots, x_n) = n^{-1}\sum_{i=1}^n x_i$, McDiarmid's inequality reduces to Hoeffding's.


Uniform bounds using Rademacher complexity

McDiarmid's inequality shows that, with probability at least $1 - \delta$, $g(x) - E\,g(x) \le \sqrt{\frac{1}{2n}\log\frac{2}{\delta}}$, where $g(x) = \sup_{f \in \mathcal{F}} |P_n(f) - P(f)|$. The symmetrization lemma shows $E(g(x)) \le 2\,\mathrm{Rad}_n(\mathcal{F})$. To prove this, we use a ghost sample $x_1', \dots, x_n'$ and Rademacher variables $\sigma_1, \dots, \sigma_n$. Note that $n^{-1}\sum_{i=1}^n \{f(x_i) - f(x_i')\}$ has the same distribution as $n^{-1}\sum_{i=1}^n \sigma_i\{f(x_i) - f(x_i')\}$.

$$
\begin{aligned}
E(g(x)) &= E\left\{\sup_{f \in \mathcal{F}} |P(f) - P_n(f)|\right\} = E\left\{\sup_{f \in \mathcal{F}} |E'(P_n'(f) - P_n(f))|\right\} \\
&\le E E'\left[\sup_{f \in \mathcal{F}} |P_n'(f) - P_n(f)|\right] = E E'\left[\sup_{f \in \mathcal{F}}\left|n^{-1}\sum_{i=1}^n \{f(x_i) - f(x_i')\}\right|\right] \\
&= E E'\left[\sup_{f \in \mathcal{F}}\left|n^{-1}\sum_{i=1}^n \sigma_i\{f(x_i) - f(x_i')\}\right|\right] \le 2\,\mathrm{Rad}_n(\mathcal{F})
\end{aligned}
$$


Uniform bounds using empirical Rademacher complexity

McDiarmid's inequality implies that, with probability at least $1 - \delta$,
$$|\mathrm{Rad}_n(\mathcal{F}) - \mathrm{Rad}_n(\mathcal{F}, x)| \le \sqrt{\frac{1}{2n}\log\left(\frac{2}{\delta}\right)}.$$
Hence, with probability at least $1 - \delta$,
$$\sup_{f \in \mathcal{F}} |P_n(f) - P(f)| \le 2\,\mathrm{Rad}_n(\mathcal{F}, x) + \sqrt{\frac{2}{n}\log\left(\frac{2}{\delta}\right)}.$$


Relationship between Rademacher complexity and growth function

Massart's finite lemma: let $A$ be a finite subset of $\mathbb{R}^n$ and let $\sigma_i$ be independent Rademacher random variables. Let $r = \sup_{a \in A}\|a\|_2$. Then
$$E_\sigma\left[\sup_{a \in A}\frac{1}{n}\sum_{i=1}^n \sigma_i a_i\right] \le \frac{r\sqrt{2\log|A|}}{n}.$$

To prove Massart's lemma, we first establish
$$\exp\left(s\, E_\sigma\left[\sup_{a \in A}\frac{1}{n}\sum_{i=1}^n \sigma_i a_i\right]\right) \le |A|\, e^{s^2 r^2/(2n^2)}$$
using Jensen's inequality and Hoeffding's lemma. Taking logs on both sides and dividing by $s$,
$$E_\sigma\left[\sup_{a \in A}\frac{1}{n}\sum_{i=1}^n \sigma_i a_i\right] \le \frac{\log|A|}{s} + \frac{s r^2}{2n^2},$$
and then optimizing over $s$ and substituting back gives the result.


Bounding Rademacher complexity

Let $f \in \mathcal{F}$ and write $\mathbf{f} = (f(x_1), \dots, f(x_n))$. Assume that $|f(x)| \le M$ for any $x$. Then $\|\mathbf{f}\|_2 \le \sqrt{n}\,M$. Applying Massart's lemma to the projection $\mathcal{F}(X_1, \dots, X_n)$, we have
$$\mathrm{Rad}_n(\mathcal{F}) \le M\sqrt{\frac{2\log|\mathcal{F}(X_1, \dots, X_n)|}{n}} \le M\sqrt{\frac{2\log m_{\mathcal{F}}(n)}{n}}.$$

For a binary function class $\mathcal{F}$ with VC dimension $d$, $m_{\mathcal{F}}(n) \le \left(\frac{en}{d}\right)^d$ if $n > d$. Therefore
$$\mathrm{Rad}_n(\mathcal{F}) \le \sqrt{\frac{2\log m_{\mathcal{F}}(n)}{n}} \le \sqrt{\frac{2d\log\frac{en}{d}}{n}} \le \sqrt{\frac{2\, d_{VC}(\mathcal{F})(1 + \log n)}{n}}.$$
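
A small check of this bound for the threshold class of e.g. 1, which has $m_{\mathcal{F}}(n) = n + 1$ and $M = 1$ (an added illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 100, 5_000

# On sorted points, thresholds f_a(z) = I(z > a) realize exactly the n + 1 "suffix" dichotomies.
dichotomies = np.array([np.r_[np.zeros(k), np.ones(n - k)] for k in range(n + 1)])

sigma = rng.choice([-1, 1], size=(reps, n))
rad_hat = np.mean(np.max(sigma @ dichotomies.T / n, axis=1))

bound = np.sqrt(2 * np.log(n + 1) / n)    # M * sqrt(2 log m_F(n) / n) with M = 1
print(f"Monte Carlo Rad_n ≈ {rad_hat:.4f}   growth-function bound ≈ {bound:.4f}")
```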


Error bound for binary cases

Using $\mathrm{Rad}_n(\mathcal{F}) \le \sqrt{\frac{2\log m_{\mathcal{F}}(n)}{n}}$, with probability at least $1 - \delta$,
$$\sup_{f \in \mathcal{F}} |P_n(f) - P(f)| \le 2\sqrt{\frac{2\log m_{\mathcal{F}}(n)}{n}} + \sqrt{\frac{1}{2n}\log\left(\frac{2}{\delta}\right)}.$$

When $\mathcal{F}$ has finite VC dimension $d$, using $\mathrm{Rad}_n(\mathcal{F}) \le C\sqrt{(d\log n)/n}$ for a constant $C > 0$, with probability at least $1 - \delta$,
$$\sup_{f \in \mathcal{F}} |P_n(f) - P(f)| \le 2C\sqrt{(d\log n)/n} + \sqrt{\frac{1}{2n}\log\left(\frac{2}{\delta}\right)}.$$


Covering number

A pseudometric space $(S, d)$ is a set $S$ and a function $d : S \times S \to \mathbb{R}^+$ (called a pseudometric) such that, for any $x, y, z \in S$, we have:
$d(x, y) = d(y, x)$ (symmetry);
$d(x, z) \le d(x, y) + d(y, z)$ (triangle inequality);
$d(x, x) = 0$.

A metric space is obtained if one further assumes that $d(x, y) = 0$ implies $x = y$. Covering numbers are defined on pseudometric spaces.

Definition ($\varepsilon$-cover): the set $C \subseteq S$ is an $\varepsilon$-cover of $(S, d)$ if for every $x \in S$ there exists $y \in C$ such that $d(x, y) \le \varepsilon$.


Covering number of F

Definition ($\varepsilon$-cover of $\mathcal{F}$): if $Q$ is a measure and $p \ge 1$, define $\|f\|_{L_p(Q)} = \left(\int |f(x)|^p\, dQ(x)\right)^{1/p}$. A set $V = \{f_1, f_2, \dots, f_N\}$ is an $\varepsilon$-cover of $\mathcal{F}$ if for every $f \in \mathcal{F}$ there exists an $f_j \in V$ such that $\|f - f_j\|_{L_p(Q)} < \varepsilon$.

Definition (covering number): $N_p(\varepsilon, \mathcal{F}, Q) = \min\{|V| : V \text{ is an } \varepsilon\text{-cover of } \mathcal{F}\}$.

Definition (uniform covering number): $N_p(\varepsilon, \mathcal{F}) = \sup_Q N_p(\varepsilon, \mathcal{F}, Q)$.

Definition (empirical covering number): let $\{X_i\}_{i=1}^n$ be $n$ fixed points and $Q_n$ the corresponding empirical measure. Then $\|f\|_{L_p(Q_n)} = \left(\frac{1}{n}\sum_{i=1}^n |f(X_i)|^p\right)^{1/p}$, and $N_p(\varepsilon, \mathcal{F}, Q_n)$ is called the empirical covering number: the minimal $N$ such that there exist $f_1, \dots, f_N$ with the property that for every $f \in \mathcal{F}$ there is a $j \in \{1, \dots, N\}$ such that $\left\{\frac{1}{n}\sum_{i=1}^n |f(X_i) - f_j(X_i)|^p\right\}^{1/p} < \varepsilon$.


Covering Number: Example

Suppose that $A \subset \mathbb{R}^m$, let $c = \max_{a \in A}\|a\|_2$, and assume that $A$ lies in a $d$-dimensional subspace of $\mathbb{R}^m$. Then $N(\varepsilon, A) \le (2c\sqrt{d}/\varepsilon)^d$. To see this, let $v_1, \dots, v_d$ be an orthonormal basis of the subspace; any $a \in A$ can be written $a = \sum_{i=1}^d \alpha_i v_i$ with $\|\alpha\|_\infty \le \|\alpha\|_2 = \|a\|_2 \le c$. Consider
$$A' = \left\{\sum_{i=1}^d \alpha_i' v_i : \alpha_i' \in \{-c, -c + r, -c + 2r, \dots, c\}\ \text{for all } i\right\}.$$
Given $a \in A$, there exists $a' \in A'$ whose coefficients satisfy $|\alpha_i' - \alpha_i| \le r$ for every $i$, so by orthonormality
$$\|a - a'\|_2^2 = \left\|\sum_{i=1}^d (\alpha_i' - \alpha_i) v_i\right\|_2^2 = \sum_{i=1}^d (\alpha_i' - \alpha_i)^2 \le r^2 d.$$
Setting $r\sqrt{d} = \varepsilon$, $A'$ is an $\varepsilon$-cover of $A$, and
$$N(\varepsilon, A) \le |A'| = \left(\frac{2c}{r}\right)^d = \left(\frac{2c\sqrt{d}}{\varepsilon}\right)^d.$$
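
This construction can be checked numerically (an added illustration; the subspace, the sample points, and $\varepsilon$ are arbitrary choices):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
d, m, c, eps = 2, 5, 1.0, 0.25
V, _ = np.linalg.qr(rng.normal(size=(m, d)))              # orthonormal basis v_1, ..., v_d of the subspace

alphas = rng.uniform(-1.0, 1.0, size=(200, d))
alphas *= np.minimum(1.0, c / np.linalg.norm(alphas, axis=1, keepdims=True))
A = alphas @ V.T                                          # points a = sum_i alpha_i v_i with ||a||_2 <= c

r = eps / np.sqrt(d)                                      # grid spacing from r * sqrt(d) = eps
grid_1d = np.arange(-c, c + r, r)
cover = np.array(list(product(grid_1d, repeat=d))) @ V.T  # the grid cover A'

dists = np.min(np.linalg.norm(A[:, None, :] - cover[None, :, :], axis=2), axis=1)
print(f"|A'| = {len(cover)}, every point within eps of the cover: {bool(dists.max() <= eps)}")
```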


Bounding Rademacher Complexity with Covering Number: Pollard's Lemma

Pollard's lemma: for any sample $S = \{X_i\}_{i=1}^n$ and $\mathcal{F} = \{f : \chi \to \{-1, 1\}\}$, we have
$$\mathrm{Rad}_n(\mathcal{F}, S) \le \inf_{\beta \ge 0}\left\{\beta + \sqrt{\frac{2\log N_1(\beta, \mathcal{F}, Q_n)}{n}}\right\}.$$

A key step in proving Pollard's lemma is the identity
$$\sup_{f \in \mathcal{F}}\frac{1}{n}\sum_{i=1}^n \sigma_i f(X_i) = \sup_{v \in V}\ \sup_{f \in B_\beta(v)}\frac{1}{n}\sum_{i=1}^n \sigma_i\big(f(X_i) - v_i + v_i\big),$$
where $V$ is an $\ell_1$ cover of $\mathcal{F}$ with $|V| = N_1(\beta, \mathcal{F}, Q_n)$ and $B_\beta(v) = \{f \in \mathcal{F} : \frac{1}{n}\sum_{i=1}^n |f(X_i) - v_i| \le \beta\}$; one then applies Massart's finite lemma.


Bounding Rademacher Complexity with Covering Number: Dudley's Chaining

Dudley's chaining: for any i.i.d. sample $S = \{X_i\}_{i=1}^n$ and $\mathcal{F} = \{f : \chi \to \{-1, 1\}\}$, we have
$$\mathrm{Rad}_n(\mathcal{F}, S) \le \inf_{0 \le \alpha \le 1}\left\{4\alpha + 12\int_\alpha^1 \sqrt{\frac{\log N_2(\delta, \mathcal{F}, Q_n)}{n}}\, d\delta\right\}.$$

The main idea is to use balls of different sizes for the covering: let $V_j$ be a minimal $\ell_2$ cover at scale $\varepsilon_j = 2^{-j}$ with $|V_j| = N_2(\varepsilon_j, \mathcal{F}, Q_n)$ for $j \in \{0, 1, \dots, N\}$, so that $|V_j| \le |V_{j+1}|$.


Generalization Bounds for Deep Neural Networks: Setup

We can write a DNN as follows. Let $\sigma_i : \mathbb{R}^{d_i} \to \mathbb{R}^{d_{i+1}}$ be a 1-Lipschitz continuous function with $\sigma_i(0) = 0$, and let
$$f_A(x) = \sigma_L(A_L \sigma_{L-1}(A_{L-1}\cdots\sigma_1(A_1 x)\cdots))$$
with $x \in \mathbb{R}^d$, $A_i \in \mathbb{R}^{d_i \times d_{i-1}}$, $d_0 = d$, $d_{L+1} = k$, and $W = \max_i d_i$. Assume $\|x\| \le B$.

We give two results for two slightly different function classes: one with bounded Frobenius norm, $\|A\|_F = \sqrt{\sum_{i=1}^d \sum_{j=1}^m A_{ij}^2}$, and the other with bounded spectral norm, $\|A\|_2 = \sup_{\|u\|_2 = 1}\|Au\|_2 = \lambda_{\max}$, the largest singular value of $A$.


Generalization Bounds for Deep Neural Networks

(Golowich et al., 2017) For the function class $\mathcal{F}_{A, \|\cdot\|_F, \gamma} = \{f_A : \mathbb{R}^d \to \mathbb{R}^k,\ A = \{A_1, \dots, A_L\},\ \|A_i\|_F \le M_i,\ i = 1, \dots, L,\ \gamma \le \prod_{i=1}^L \|A_i\|_2\}$,
$$\mathrm{Rad}_n(\mathcal{F}, S) \lesssim B\left(\prod_{i=1}^L M_i\right)\min\left\{\frac{\sqrt{\log\left(\gamma^{-1}\prod_{i=1}^L M_i\right)}}{\sqrt{n}},\ \sqrt{\frac{L}{n}}\right\}.$$

(Li et al., 2018) For the function class $\mathcal{F}_{A, \|\cdot\|_2} = \{f_A : \mathbb{R}^d \to \mathbb{R}^k,\ A = \{A_1, \dots, A_L\},\ \|A_i\|_2 \le s_i,\ i = 1, \dots, L,\ \sigma_i(0) = 0\}$,
$$\mathrm{Rad}_n(\mathcal{F}, S) \lesssim \frac{B\left(\prod_{i=1}^L s_i\right)}{\sqrt{n}}\sqrt{L W^2 \log\frac{\sqrt{L}\, n\, \max_{1\le i\le L} s_i}{\min_{1\le i\le L} s_i}}.$$
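
As a rough illustration only (the constants hidden by $\lesssim$ are ignored, and the network is a random toy example, not from the slides), one can compute the depth-dependent Frobenius-norm factor $B(\prod_i \|A_i\|_F)\sqrt{L/n}$ appearing in the first bound:

```python
import numpy as np

rng = np.random.default_rng(4)

n, B = 10_000, 1.0
widths = [20, 50, 50, 10]                                  # d_0, d_1, d_2, d_3 of a toy network
layers = [rng.normal(size=(dout, din)) / np.sqrt(din)      # A_i with a common 1/sqrt(fan-in) scaling
          for din, dout in zip(widths[:-1], widths[1:])]

L = len(layers)
frob_prod = np.prod([np.linalg.norm(A, "fro") for A in layers])
print(f"B * prod_i ||A_i||_F * sqrt(L/n) ≈ {B * frob_prod * np.sqrt(L / n):.3f}")
```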
