
Some theory of machine learning

Seoul National University, Deep Learning, September–December 2019


Setup

Consider data $(x_i, y_i)$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{0, 1\}$. Given a classifier $f : \mathbb{R}^d \to \{0, 1\}$, we would like to minimize the risk $R(f) = P(Y \neq f(X))$.

We obtain data $(x_i, y_i)$, $i = 1, \dots, n$. An estimator of $R(f)$ is $\hat{R}(f) = n^{-1}\sum_{i=1}^n I(Y_i \neq f(X_i))$, which we call the empirical risk.

Learning $f$ (empirical risk minimization): we want to find $f$ that makes $R(f)$ small. Suppose we choose $\hat{f}$ which minimizes the empirical risk over $\mathcal{F}$, i.e., $\hat{R}(\hat{f}) = \min_{f \in \mathcal{F}} \hat{R}(f)$.

Optimal $f$ in $\mathcal{F}$: let $f^*$ be the optimal classifier in $\mathcal{F}$ in the sense that it minimizes the true risk over $\mathcal{F}$, i.e., $R(f^*) = \min_{f \in \mathcal{F}} R(f)$.

$f$ may not be in $\mathcal{F}$: let $f^{**}$ be the true classifier that minimizes the true risk over all classifiers, i.e., $R(f^{**}) = \min_f R(f)$.
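
As a concrete illustration of this setup (an addition, not from the slides), here is a minimal Python sketch of empirical risk minimization over a small finite class of threshold classifiers; the data-generating distribution and the class $\mathcal{F}$ are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: x ~ Uniform(0, 1), y = I(x > 0.3) with 10% label noise (illustrative assumption).
n = 200
x = rng.uniform(0, 1, n)
noise = rng.uniform(size=n) < 0.1
y = ((x > 0.3).astype(int) + noise.astype(int)) % 2

# Finite class F of threshold classifiers f_a(x) = I(x > a).
thresholds = np.linspace(0.0, 1.0, 21)

def empirical_risk(a, x, y):
    """hat R(f_a) = n^{-1} sum_i I(y_i != f_a(x_i))."""
    return np.mean((x > a).astype(int) != y)

# Empirical risk minimization: f_hat minimizes the empirical risk over F.
risks = np.array([empirical_risk(a, x, y) for a in thresholds])
a_hat = thresholds[np.argmin(risks)]
print(f"ERM threshold a_hat = {a_hat:.2f}, empirical risk = {risks.min():.3f}")
```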


Areas of theoretical studies related to Deep Learning

$R(\hat{f}) - R(f^{**}) = \{R(\hat{f}) - R(f^*)\} + \{R(f^*) - R(f^{**})\}$. The term $\{R(f^*) - R(f^{**})\}$ is the approximation error.

$R(\hat{f}) - R(f^*)$ is the estimation error.

Theoretical work on deep learning addresses (i) approximation, (ii) optimization, and (iii) generalization error. Work on approximation helps explain the expressiveness of deep models, but it is not our focus in this course. We will mention some recent findings on (ii) in the optimization section. In this section we focus on the generalization error.


Excess risk

$R(\hat{f}) - R(f^{**}) = \{R(\hat{f}) - R(f^*)\} + \{R(f^*) - R(f^{**})\}$, where $\{R(f^*) - R(f^{**})\}$ is the approximation error.

$R(\hat{f}) - R(f^*)$ is the estimation error; we would like to bound it.

$$
\begin{aligned}
R(\hat{f}) - R(f^*) &= \{R(\hat{f}) - \hat{R}(\hat{f})\} + \{\hat{R}(\hat{f}) - \hat{R}(f^*)\} + \{\hat{R}(f^*) - R(f^*)\} \\
&\le \underbrace{\{R(\hat{f}) - \hat{R}(\hat{f})\}}_{(i)} + \underbrace{\{\hat{R}(f^*) - R(f^*)\}}_{(ii)} \\
&\le 2\sup_{f \in \mathcal{F}} |\hat{R}(f) - R(f)|,
\end{aligned}
$$
where the first inequality holds because $\hat{R}(\hat{f}) - \hat{R}(f^*) \le 0$ ($\hat{f}$ minimizes the empirical risk over $\mathcal{F}$).

To bound (ii), concentration of measure can be invoked because $f^*$ is a fixed function; this does not work for (i), since $\hat{f}$ depends on the data. This is why we need a bound that holds uniformly over $\mathcal{F}$.


Hoeffding’s Inequality

Hoeffding's inequality: if $Z_1, \dots, Z_n$ are independent with $P(a_i \le Z_i \le b_i) = 1$, then for any $\varepsilon > 0$,
$$P(|\bar{Z}_n - \mu| > \varepsilon) \le 2 e^{-2n\varepsilon^2/c},$$
where $c = n^{-1}\sum_{i=1}^n (b_i - a_i)^2$ and $\bar{Z}_n = n^{-1}\sum_{i=1}^n Z_i$.

Applied to $Z_i = I(Y_i \neq f(X_i)) \in [0, 1]$ for a fixed $f$, Hoeffding's inequality implies
$$P(|R(f) - \hat{R}(f)| > \varepsilon) \le 2 e^{-2n\varepsilon^2},$$
and with probability at least $1 - \delta$,
$$|R(f) - \hat{R}(f)| \le \sqrt{\frac{1}{2n}\log\left(\frac{2}{\delta}\right)}.$$
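
A quick numerical sanity check (an added illustration, not part of the slides): simulate the empirical risk of a fixed classifier as a Bernoulli mean and compare the observed tail probability with the Hoeffding bound $2e^{-2n\varepsilon^2}$.

```python
import numpy as np

rng = np.random.default_rng(1)

n, R, eps, reps = 100, 0.3, 0.1, 20_000   # R plays the role of the true risk R(f); values are illustrative
# Each replicate draws n i.i.d. 0-1 losses and computes hat R(f) as their mean.
hat_R = rng.binomial(1, R, size=(reps, n)).mean(axis=1)

empirical_tail = np.mean(np.abs(hat_R - R) > eps)
hoeffding_bound = 2 * np.exp(-2 * n * eps**2)
print(f"empirical P(|hat R - R| > eps) = {empirical_tail:.4f}")
print(f"Hoeffding bound                = {hoeffding_bound:.4f}")
```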


Uniform bound

Let $|\mathcal{F}| = M$ and write $\mathcal{F} = \{f_1, \dots, f_M\}$. For $m = 1, \dots, M$, let $B_m$ be the event $|\hat{R}(f_m) - R(f_m)| > \varepsilon$. By the union bound,
$$P(\cup_{m=1}^M B_m) = P\left(\sup_{f_m \in \mathcal{F}} |\hat{R}(f_m) - R(f_m)| > \varepsilon\right) \le 2M e^{-2n\varepsilon^2}.$$
We obtain that, with probability at least $1 - \delta$,
$$\sup_{f_m \in \mathcal{F}} |\hat{R}(f_m) - R(f_m)| \le \sqrt{\frac{\log(2M/\delta)}{2n}}.$$

The uniform bound relies on finite $|\mathcal{F}|$; this argument does not go through if $|\mathcal{F}|$ is infinite. To handle infinite $|\mathcal{F}|$, we present two approaches: one considers 'the projection of $\mathcal{F}$ on the data', and the other bounds $\sup_{f \in \mathcal{F}} |\hat{R}(f) - R(f)|$ all at once. In the first approach we introduce the shattering number (or growth function) and the Vapnik–Chervonenkis (VC) dimension. In the second approach we introduce McDiarmid's inequality and Rademacher complexity (Koltchinskii and Panchenko, 2002).
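
Before moving on, a sense of scale for the finite-class bound above (an added illustration; the numbers are arbitrary): the bound grows only logarithmically in $M$.

```python
import numpy as np

def finite_class_bound(M, n, delta):
    """Uniform deviation bound sqrt(log(2M/delta) / (2n)) for a class with |F| = M."""
    return np.sqrt(np.log(2 * M / delta) / (2 * n))

for M in [10, 1_000, 1_000_000]:
    print(f"M = {M:>9}: bound = {finite_class_bound(M, n=1000, delta=0.05):.4f}")
```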


Measures of complexity: Shattering number

Let $\chi$ be a set and let $\mathcal{F}$ be a class of binary functions on $\chi$, $f : \chi \to \{0, 1\}$. For points $z = \{z_1, \dots, z_n\} \subset \chi$, define the projection of $\mathcal{F}$ on $z$ by $\mathcal{F}(z_1, \dots, z_n) = \{(f(z_1), \dots, f(z_n)) : f \in \mathcal{F}\}$. Note that $\mathcal{F}(z_1, \dots, z_n)$ is a finite collection of vectors: $|\mathcal{F}|$ can be infinite, but $|\mathcal{F}(z_1, \dots, z_n)| \le 2^n$.

e.g. 1: $f(z) = I(z > a)$ with $z_1 < z_2 < z_3$. Then $\mathcal{F}(z_1, z_2, z_3) = \{(0,0,0), (0,0,1), (0,1,1), (1,1,1)\}$.
e.g. 2: $f(z) = I(a < z < b)$ with $z_1 < z_2 < z_3$. Then $\mathcal{F}(z_1, z_2, z_3) = \{(0,0,0), (1,0,0), (1,1,0), (0,0,1), (0,1,1), (1,1,1), (0,1,0)\}$.
Definition (shattering number, growth function): the maximum number of distinct dichotomies realizable on any $n$ points,
$$m_{\mathcal{F}}(n) = \max_{z_1, \dots, z_n} |\mathcal{F}(z_1, \dots, z_n)|.$$
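
A brute-force computation of the projection for the two examples (an added illustration; the point values and candidate parameters are arbitrary choices):

```python
from itertools import product

def projection(classifiers, points):
    """Distinct dichotomies {(f(z_1), ..., f(z_n)) : f in F} realized on the points."""
    return {tuple(f(z) for z in points) for f in classifiers}

points = [0.2, 0.5, 0.8]   # z1 < z2 < z3

# e.g. 1: thresholds f(z) = I(z > a); on finitely many points a small grid of a-values suffices.
thresholds = [lambda z, a=a: int(z > a) for a in [-1.0, 0.3, 0.6, 1.0]]
print(sorted(projection(thresholds, points)))   # 4 dichotomies = n + 1

# e.g. 2: intervals f(z) = I(a < z < b).
grid = [-1.0, 0.1, 0.3, 0.6, 0.9, 1.5]
intervals = [lambda z, a=a, b=b: int(a < z < b) for a, b in product(grid, grid) if a < b]
print(len(projection(intervals, points)))       # 7 dichotomies = n(n+1)/2 + 1
```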


Measures of complexity: Shattering number

By definition, the shattering number satisfies $m_{\mathcal{F}}(n) \le 2^n$.

For e.g. 1, $\mathcal{F}$ is the set of $f : \mathbb{R} \to \{0,1\}$ with $f(z) = I(z > a)$. When $n = 3$ with $z_1 < z_2 < z_3$, $\mathcal{F}(z_1, z_2, z_3) = \{(0,0,0), (0,0,1), (0,1,1), (1,1,1)\}$. Here $m_{\mathcal{F}}(n) = n + 1$.

For e.g. 2, $\mathcal{F}$ is the set of $f : \mathbb{R} \to \{0,1\}$ with $f(z) = I(a < z < b)$. When $n = 3$ with $z_1 < z_2 < z_3$, $\mathcal{F}(z_1, z_2, z_3) = \{(0,0,0), (1,0,0), (1,1,0), (0,0,1), (0,1,1), (1,1,1), (0,1,0)\}$. Here $m_{\mathcal{F}}(n) = n(n+1)/2 + 1$.

e.g. 3: $\mathcal{F}$ is the set of $f : \mathbb{R}^2 \to \{0,1\}$ with $f(z) = I(z \in C)$ for a convex region $C$. Here $m_{\mathcal{F}}(n) = 2^n$.


Measures of complexity: VC dimension

The VC dimension of $\mathcal{F}$ is the largest value of $n$ for which $m_{\mathcal{F}}(n) = 2^n$, i.e., $d_{VC}(\mathcal{F}) = \sup\{n : m_{\mathcal{F}}(n) = 2^n\}$; equivalently, it is the size of the largest set that can be shattered.

Break point: the smallest number of data points for which we cannot get all possible dichotomies (= VC dimension + 1).

Table: VC dimensions

Function class                  VC dimension
Interval [a, b]                 2
Disc in R^2                     3
Half-spaces in R^d              d + 1
Convex polygons in R^2          ∞
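
A brute-force check (an added illustration) that the interval class shatters any 2 points but no set of 3 points, consistent with the table entry of VC dimension 2:

```python
from itertools import product

def interval_patterns(points):
    """Dichotomies realized by f(z) = I(a < z < b), using endpoints around and between the points."""
    pts = sorted(points)
    mids = [p + 0.5 * (q - p) for p, q in zip(pts, pts[1:])]
    cuts = [pts[0] - 1.0] + mids + [pts[-1] + 1.0]
    return {tuple(int(a < z < b) for z in points) for a, b in product(cuts, cuts) if a <= b}

def shattered(points):
    return len(interval_patterns(points)) == 2 ** len(points)

print(shattered([0.1, 0.7]))        # True: any 2 points can be shattered
print(shattered([0.1, 0.4, 0.7]))   # False: (1, 0, 1) is unreachable, so the VC dimension is 2
```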


Measures of complexity: Sauer’s Theorem

The VC dimension (or break point), regardless of the specific $\mathcal{F}$, gives information about $m_{\mathcal{F}}(n)$. If we know the break point is $k$, we know that no $k$ of the $n$ points can exhibit all possible patterns; e.g., $n = 3$, $k = 2$ below. This observation gives an upper bound for $m_{\mathcal{F}}(n)$.

Group   x1   x2   x3
G1      0    0    0
G2      0    0    1
G2      0    1    0
G2      1    0    0

Since $k = 2$, no two columns can exhibit all possible dichotomies.

Let $B(n, k)$ be the maximum number of patterns on $n$ points with break point $k$. When we consider $x_1$ and $x_2$ only, $G_2$ represents a set with distinct patterns; the size of this set is at most $B(n-1, k)$. For $G_1$, since $x_3$ takes both possible values on the corresponding prefix, the size is at most $B(n-1, k-1)$.


Measures of complexity: Sauer’s Theorem

When the break point is $k$, $|\mathcal{F}(z_1, \dots, z_n)| \le B(n-1, k) + B(n-1, k-1) = B(n, k)$. Using this recursion, the following theorem holds.

Sauer's Theorem: suppose that $\mathcal{F}$ has finite VC dimension $d$. Then
$$m_{\mathcal{F}}(n) \le \sum_{i=0}^{d}\binom{n}{i},$$
and for all $n \ge d$,
$$m_{\mathcal{F}}(n) \le \left(\frac{en}{d}\right)^d.$$

If $m_{\mathcal{F}}(n)$ is ever less than $2^n$ (i.e., the VC dimension is finite), then $m_{\mathcal{F}}(n)$ is polynomial in $n$; there is nothing in between.
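
A small numerical illustration (added here) of how the Sauer bound compares with $2^n$; the values of $n$ and $d$ are arbitrary:

```python
from math import comb, e

def sauer_bound(n, d):
    """Sauer's bound sum_{i=0}^{d} C(n, i) on the growth function m_F(n)."""
    return sum(comb(n, i) for i in range(d + 1))

n, d = 100, 3
print(f"sum_(i<=d) C(n,i) = {sauer_bound(n, d)}")
print(f"(e*n/d)^d         = {(e * n / d) ** d:.1f}")
print(f"2^n               = {2 ** n}")
```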


Uniform bounds using shattering number

(Vapnik and Chervonenkis) Let $\mathcal{F}$ be a class of binary functions. For any $t > \sqrt{2/n}$,
$$P\left(\sup_{f \in \mathcal{F}} |P_n(f) - P(f)| > t\right) \le 4\, m_{\mathcal{F}}(2n)\, e^{-nt^2/8},$$
and hence, with probability at least $1 - \delta$,
$$\sup_{f \in \mathcal{F}} |P_n(f) - P(f)| \le \sqrt{\frac{8}{n}\log\left(\frac{4\, m_{\mathcal{F}}(2n)}{\delta}\right)},$$
where $P_n(f) = \frac{1}{n}\sum_{i=1}^n f(x_i)$ and $P(f) = \int f(x)\, dP(x)$.

The symmetrization technique can be used in proving the VCtheorem.


Symmetrization Lemma

Denote the empirical distribution of $x_1, \dots, x_n$ from $P$ by $P_n$. Let $x_1', \dots, x_n'$ denote a second independent sample from $P$ and let $P_n'$ denote the empirical distribution of the second sample. For all $t > \sqrt{2/n}$,
$$P\left(\sup_{f \in \mathcal{F}} |P_n(f) - P(f)| > t\right) \le 2\, P\left(\sup_{f \in \mathcal{F}} |P_n(f) - P_n'(f)| > t/2\right).$$

The second sample, $x_1', \dots, x_n'$, is called a ghost sample.

If $|P_n(f) - P(f)| > t$ and $|P(f) - P_n'(f)| \le t/2$, then $|P_n'(f) - P_n(f)| > t/2$ (triangle inequality). This observation is used in the proof.

The importance of the result is that it bounds $\sup_{f \in \mathcal{F}} |P_n(f) - P(f)|$ by $\sup_{f \in \mathcal{F}} |P_n(f) - P_n'(f)|$, which depends on $f$ only through its values on the $2n$ sample points. The supremum is therefore effectively over a finite set whose size is controlled by the growth function.


Uniform bounds using VC dimension

Recall $m_{\mathcal{F}}(n) \le \left(\frac{en}{d}\right)^d$. Replacing $m_{\mathcal{F}}(n)$ with $\left(\frac{en}{d}\right)^d$ in the VC inequality, we obtain, for any $t > \sqrt{2/n}$,
$$P\left(\sup_{f \in \mathcal{F}} |P_n(f) - P(f)| > t\right) \le 4\left(\frac{en}{d}\right)^d e^{-nt^2/8},$$
and hence, with probability at least $1 - \delta$,
$$\sup_{f \in \mathcal{F}} |P_n(f) - P(f)| \le \sqrt{\frac{8}{n}\left(\log\left(\frac{4}{\delta}\right) + d\log\left(\frac{ne}{d}\right)\right)},$$
where $P_n(f) = \frac{1}{n}\sum_{i=1}^n f(x_i)$ and $P(f) = \int f(x)\, dP(x)$.
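
Numerically, this bound decays like $\sqrt{d\log n / n}$ (an added illustration; $d$ and $\delta$ below are arbitrary choices):

```python
import numpy as np

def vc_uniform_bound(n, d, delta):
    """sqrt((8/n) * (log(4/delta) + d * log(n*e/d))), the VC-dimension form of the uniform bound."""
    return np.sqrt(8.0 / n * (np.log(4.0 / delta) + d * np.log(n * np.e / d)))

for n in [10**3, 10**4, 10**5, 10**6]:
    print(f"n = {n:>7}: bound = {vc_uniform_bound(n, d=10, delta=0.05):.4f}")
```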


VC dimension bounds for neural networks

One can measure the complexity or capacity of neural network models by how many configurations they can shatter (the VC dimension).

The capacity of the network, if measured by the number of pieces in a piecewise linear approximation, increases exponentially with depth [Montufar, Pascanu et al., 2014].

These results yield upper bounds on the generalization gap (the difference between true and empirical risk) of deep neural networks.

The bounds might be very pessimistic.


General measures of complexity: Rademacher complexity

Motivation: let $y \in \{-1, 1\}$ and $f : \chi \to \{-1, 1\}$. Then
$$\hat{R}(f) = \frac{1}{m}\sum_{i=1}^m I(f(x_i) \neq y_i) = \frac{1}{2} - \frac{1}{2m}\sum_{i=1}^m y_i f(x_i),$$
so minimizing the training error is the same as maximizing $\frac{1}{m}\sum_{i=1}^m y_i f(x_i)$ over $f$.
Now consider random labels in place of $y$: a bigger model class $\mathcal{F}$ can make $\frac{1}{m}\sum_{i=1}^m y_i f(x_i)$ large as well, even when the labels carry no signal. Rademacher complexity measures exactly this ability to fit random signs.

Definition (Rademacher complexity): let $\sigma_1, \dots, \sigma_n$ be independent random variables with $P(\sigma_i = 1) = P(\sigma_i = -1) = 1/2$. The Rademacher complexity of $\mathcal{F}$ is
$$\mathrm{Rad}_n(\mathcal{F}) = E\left[\sup_{f \in \mathcal{F}}\frac{1}{n}\sum_{i=1}^n \sigma_i f(x_i)\right].$$


General measures of complexity: Rademacher complexity

The empirical Rademacher complexity of $\mathcal{F}$ is
$$\mathrm{Rad}_n(\mathcal{F}, x) = E_\sigma\left[\sup_{f \in \mathcal{F}}\frac{1}{n}\sum_{i=1}^n \sigma_i f(x_i)\right].$$

When $|\mathcal{F}| = 1$, there is no supremum to take and
$$\mathrm{Rad}_n(\mathcal{F}) = E\left[\frac{1}{n}\sum_{i=1}^n \sigma_i f(x_i)\right] = \frac{1}{n}\sum_{i=1}^n E[\sigma_i]\, E[f(x_i)] = 0.$$

When $|\mathcal{F}(x_1, \dots, x_n)| = 2^n$, i.e., all sign patterns $(f(x_1), \dots, f(x_n)) \in \{-1, 1\}^n$ are realized,
$$\mathrm{Rad}_n(\mathcal{F}) = E\left[\sup_{f \in \mathcal{F}}\frac{1}{n}\sum_{i=1}^n \sigma_i f(x_i)\right] = 1,$$
since for every $\sigma$ one can pick $f$ with $f(x_i) = \sigma_i$.
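
A Monte Carlo check of these two extreme cases (an added illustration; the class is represented simply by its matrix of values $(f(x_1), \dots, f(x_n))$):

```python
import numpy as np
from itertools import product

def empirical_rademacher(F_values, reps=20_000, seed=0):
    """Monte Carlo estimate of E_sigma[ sup_f (1/n) sum_i sigma_i f(x_i) ].
    F_values has shape (number of functions, n), one row of values per f."""
    rng = np.random.default_rng(seed)
    n = F_values.shape[1]
    sigma = rng.choice([-1, 1], size=(reps, n))
    return np.mean(np.max(sigma @ F_values.T / n, axis=1))

n = 10
single = np.ones((1, n))                                  # |F| = 1
all_signs = np.array(list(product([-1, 1], repeat=n)))    # all 2^n sign patterns realized
print(round(empirical_rademacher(single), 3))             # ~ 0
print(round(empirical_rademacher(all_signs), 3))          # = 1
```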


Rademacher complexity: Example

Example (ridge regression): let $\mathcal{F}$ be the class of linear predictors $x \mapsto \langle w, x\rangle$ with the restriction $\|w\|_2 \le W_2$. Additionally assume that $\|x_i\|_2 \le X_2$.

$$
\begin{aligned}
\mathrm{Rad}_n(\mathcal{F}, x) &= E_\sigma \sup_{w:\|w\|_2 \le W_2}\frac{1}{n}\sum_{i=1}^n \sigma_i \langle x_i, w\rangle
= \frac{1}{n}\, E_\sigma \sup_{w:\|w\|_2 \le W_2}\left\langle \sum_{i=1}^n \sigma_i x_i,\ w\right\rangle \\
&= \frac{W_2}{n}\, E_\sigma\left\|\sum_{i=1}^n \sigma_i x_i\right\|_2
\le \frac{W_2}{n}\sqrt{E_\sigma\left\|\sum_{i=1}^n \sigma_i x_i\right\|_2^2} \\
&\le \frac{W_2}{n}\sqrt{E_\sigma\sum_{i=1}^n \|\sigma_i x_i\|_2^2}
= \frac{W_2}{n}\sqrt{\sum_{i=1}^n \|x_i\|_2^2}
\le \frac{X_2 W_2}{\sqrt{n}}.
\end{aligned}
$$
(The sup over $\|w\|_2 \le W_2$ is attained at $w$ proportional to $\sum_i \sigma_i x_i$ by Cauchy–Schwarz, the first inequality is Jensen's, and $E_\sigma\|\sum_i \sigma_i x_i\|_2^2 = \sum_i \|x_i\|_2^2$ because the cross terms vanish.)
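
A Monte Carlo comparison of the empirical Rademacher complexity of this linear class with the bound $X_2 W_2/\sqrt{n}$ (an added illustration; the data are arbitrary points scaled to norm $X_2$):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, W2, X2, reps = 200, 5, 2.0, 1.0, 5_000
x = rng.normal(size=(n, d))
x *= X2 / np.linalg.norm(x, axis=1, keepdims=True)        # enforce ||x_i||_2 = X2

# For {x -> <w, x> : ||w||_2 <= W2}, the sup over w is attained in closed form:
# Rad_n(F, x) = (W2 / n) * E_sigma || sum_i sigma_i x_i ||_2.
sigma = rng.choice([-1, 1], size=(reps, n))
rad_hat = W2 / n * np.mean(np.linalg.norm(sigma @ x, axis=1))

print(f"Monte Carlo Rad_n(F, x) ≈ {rad_hat:.4f}")
print(f"bound X2*W2/sqrt(n)     = {X2 * W2 / np.sqrt(n):.4f}")
```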


Properties of Rademacher complexity

(Monotonicity) If $\mathcal{F} \subset \mathcal{G}$ then $\mathrm{Rad}_n(\mathcal{F}, x) \le \mathrm{Rad}_n(\mathcal{G}, x)$.
Proof: $\mathrm{Rad}_n(\mathcal{F}) = E\left[\sup_{f \in \mathcal{F}}\frac{1}{n}\sum_{i=1}^n \sigma_i f(x_i)\right] \le E\left[\sup_{f \in \mathcal{G}}\frac{1}{n}\sum_{i=1}^n \sigma_i f(x_i)\right] = \mathrm{Rad}_n(\mathcal{G})$.

(Convex hull) Let $\mathrm{conv}(\mathcal{F})$ be the convex hull of $\mathcal{F}$. Then $\mathrm{Rad}_n(\mathcal{F}, x) = \mathrm{Rad}_n(\mathrm{conv}(\mathcal{F}), x)$.

(Scale and shift) For any function class $\mathcal{F}$ and $c, d \in \mathbb{R}$, define $c\mathcal{F} + d = \{cf + d : f \in \mathcal{F}\}$. Then $\mathrm{Rad}_n(c\mathcal{F} + d, x) = |c|\,\mathrm{Rad}_n(\mathcal{F}, x)$.

(Lipschitz composition) If $\phi$ is a Lipschitz function such that $|\phi(s) - \phi(t)| \le L|s - t|$ for all $s, t \in \mathrm{dom}(\phi)$, we have $\mathrm{Rad}_n(\phi \circ \mathcal{F}) \le L\,\mathrm{Rad}_n(\mathcal{F})$.


Uniform bounds using Rademacher complexity

With probability at least $1 - \delta$,
$$\sup_{f \in \mathcal{F}} |P_n(f) - P(f)| \le 2\,\mathrm{Rad}_n(\mathcal{F}) + \sqrt{\frac{1}{2n}\log\left(\frac{2}{\delta}\right)},$$
and with probability at least $1 - \delta$,
$$\sup_{f \in \mathcal{F}} |P_n(f) - P(f)| \le 2\,\mathrm{Rad}_n(\mathcal{F}, x) + \sqrt{\frac{2}{n}\log\left(\frac{2}{\delta}\right)}.$$

A proof requires (i) McDiarmid's inequality and (ii) symmetrization. (i) Let $g(x) = \sup_{f \in \mathcal{F}} |P_n(f) - P(f)|$, where $x = \{x_1, \dots, x_i, \dots, x_n\}$, and let $x' = \{x_1, \dots, z_i, \dots, x_n\}$ (the $i$-th coordinate replaced). Then check that $g(x) - g(x')$ is bounded (e.g., by $\frac{1}{n}$ in the binary case). By McDiarmid's inequality, $P(|g(x) - E\,g(x)| > \varepsilon) \le 2e^{-2n\varepsilon^2}$, and with probability at least $1 - \delta$, $g(x) \le E(g(x)) + \sqrt{\frac{1}{2n}\log\frac{2}{\delta}}$. Using (ii), one can show $E(g(x)) \le 2\,\mathrm{Rad}_n(\mathcal{F})$.


McDiarmid’s Inequality

McDiarmid's inequality: let $Z_1, \dots, Z_n$ be independent random variables. Write $x = (x_1, \dots, x_i, \dots, x_n)$ and $x' = (x_1, \dots, z_i, \dots, x_n)$ for two points differing only in the $i$-th coordinate, and suppose that $\sup_{x_1, \dots, x_n, z_i} |f(x) - f(x')| \le c_i$ for $i = 1, \dots, n$. Then
$$P\left(|f(Z_1, \dots, Z_n) - E[f(Z_1, \dots, Z_n)]| \ge \varepsilon\right) \le 2\exp\left(-\frac{2\varepsilon^2}{\sum_{i=1}^n c_i^2}\right).$$

If $f(x_1, \dots, x_i, \dots, x_n) = n^{-1}\sum_{i=1}^n x_i$, McDiarmid's inequality reduces to Hoeffding's.


Uniform bounds using Rademacher complexity

McDiarmid's inequality shows that, with probability at least $1 - \delta$, $g(x) - E\,g(x) \le \sqrt{\frac{1}{2n}\log\frac{2}{\delta}}$, where $g(x) = \sup_{f \in \mathcal{F}} |P_n(f) - P(f)|$. The symmetrization lemma shows $E(g(x)) \le 2\,\mathrm{Rad}_n(\mathcal{F})$. To prove this, we use a ghost sample $x_1', \dots, x_n'$ and Rademacher variables $\sigma_1, \dots, \sigma_n$. Note that $n^{-1}\sum_{i=1}^n \{f(x_i) - f(x_i')\}$ has the same distribution as $n^{-1}\sum_{i=1}^n \sigma_i\{f(x_i) - f(x_i')\}$.

$$
\begin{aligned}
E(g(x)) &= E\left\{\sup_{f \in \mathcal{F}} |P(f) - P_n(f)|\right\} = E\left\{\sup_{f \in \mathcal{F}} |E'(P_n'(f) - P_n(f))|\right\} \\
&\le E E'\left[\sup_{f \in \mathcal{F}} |P_n'(f) - P_n(f)|\right] = E E'\left[\sup_{f \in \mathcal{F}}\left|n^{-1}\sum_{i=1}^n \{f(x_i) - f(x_i')\}\right|\right] \\
&= E E'\left[\sup_{f \in \mathcal{F}}\left|n^{-1}\sum_{i=1}^n \sigma_i\{f(x_i) - f(x_i')\}\right|\right] \le 2\,\mathrm{Rad}_n(\mathcal{F})
\end{aligned}
$$


Uniform bounds using empirical Rademacher complexity

McDiarmid's inequality implies that, with probability at least $1 - \delta$,
$$|\mathrm{Rad}_n(\mathcal{F}) - \mathrm{Rad}_n(\mathcal{F}, x)| \le \sqrt{\frac{1}{2n}\log\left(\frac{2}{\delta}\right)}.$$
Hence, with probability at least $1 - \delta$,
$$\sup_{f \in \mathcal{F}} |P_n(f) - P(f)| \le 2\,\mathrm{Rad}_n(\mathcal{F}, x) + \sqrt{\frac{2}{n}\log\left(\frac{2}{\delta}\right)}.$$


Relationship between Rademacher complexity and growth function

Massart's finite lemma: let $A$ be a finite subset of $\mathbb{R}^n$ and let $\sigma_i$ be independent Rademacher random variables. Let $r = \sup_{a \in A}\|a\|_2$. Then
$$E_\sigma\left[\sup_{a \in A}\frac{1}{n}\sum_{i=1}^n \sigma_i a_i\right] \le \frac{r\sqrt{2\log|A|}}{n}.$$

To prove Massart's lemma, we first establish
$$\exp\left(s\, E_\sigma\left[\sup_{a \in A}\frac{1}{n}\sum_{i=1}^n \sigma_i a_i\right]\right) \le |A|\, e^{s^2 r^2/(2n^2)}$$
using Jensen's inequality and Hoeffding's lemma. Taking logs on both sides and dividing by $s$,
$$E_\sigma\left[\sup_{a \in A}\frac{1}{n}\sum_{i=1}^n \sigma_i a_i\right] \le \frac{\log|A|}{s} + \frac{s r^2}{2n^2},$$
and then optimizing over $s$ and substituting back gives the result.


Bounding Rademacher complexity

Let $f \in \mathcal{F}$ and write $\mathbf{f} = (f(x_1), \dots, f(x_n))$. Assume that $|f(x)| \le M$ for any $x$. Then $\|\mathbf{f}\|_2 \le \sqrt{n}\,M$. Applying Massart's lemma to the projection $\mathcal{F}(X_1, \dots, X_n)$, we have
$$\mathrm{Rad}_n(\mathcal{F}) \le M\sqrt{\frac{2\log|\mathcal{F}(X_1, \dots, X_n)|}{n}} \le M\sqrt{\frac{2\log m_{\mathcal{F}}(n)}{n}}.$$

For a binary function class $\mathcal{F}$ with VC dimension $d$, $m_{\mathcal{F}}(n) \le \left(\frac{en}{d}\right)^d$ if $n > d$. Therefore
$$\mathrm{Rad}_n(\mathcal{F}) \le \sqrt{\frac{2\log m_{\mathcal{F}}(n)}{n}} \le \sqrt{\frac{2d\log\frac{en}{d}}{n}} \le \sqrt{\frac{2\, d_{VC}(\mathcal{F})(1 + \log n)}{n}}.$$
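
A small check of this bound for the threshold class of e.g. 1, which has $m_{\mathcal{F}}(n) = n + 1$ and $M = 1$ (an added illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 100, 5_000

# On sorted points, thresholds f_a(z) = I(z > a) realize exactly the n + 1 "suffix" dichotomies.
dichotomies = np.array([np.r_[np.zeros(k), np.ones(n - k)] for k in range(n + 1)])

sigma = rng.choice([-1, 1], size=(reps, n))
rad_hat = np.mean(np.max(sigma @ dichotomies.T / n, axis=1))

bound = np.sqrt(2 * np.log(n + 1) / n)    # M * sqrt(2 log m_F(n) / n) with M = 1
print(f"Monte Carlo Rad_n ≈ {rad_hat:.4f}   growth-function bound ≈ {bound:.4f}")
```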


Error bound for binary cases

Using $\mathrm{Rad}_n(\mathcal{F}) \le \sqrt{\frac{2\log m_{\mathcal{F}}(n)}{n}}$, with probability at least $1 - \delta$,
$$\sup_{f \in \mathcal{F}} |P_n(f) - P(f)| \le 2\sqrt{\frac{2\log m_{\mathcal{F}}(n)}{n}} + \sqrt{\frac{1}{2n}\log\left(\frac{2}{\delta}\right)}.$$

When $\mathcal{F}$ has finite VC dimension $d$, using $\mathrm{Rad}_n(\mathcal{F}) \le C\sqrt{(d\log n)/n}$ for a constant $C > 0$, with probability at least $1 - \delta$,
$$\sup_{f \in \mathcal{F}} |P_n(f) - P(f)| \le 2C\sqrt{(d\log n)/n} + \sqrt{\frac{1}{2n}\log\left(\frac{2}{\delta}\right)}.$$


Covering number

A pseudometric space $(S, d)$ is a set $S$ and a function $d : S \times S \to \mathbb{R}^+$ (called a pseudometric) such that, for any $x, y, z \in S$, we have:
$d(x, y) = d(y, x)$ (symmetry);
$d(x, z) \le d(x, y) + d(y, z)$ (triangle inequality);
$d(x, x) = 0$.

A metric space is obtained if one further assumes that $d(x, y) = 0$ implies $x = y$. Covering numbers are defined on pseudometric spaces.

Definition ($\varepsilon$-cover): the set $C \subseteq S$ is an $\varepsilon$-cover of $(S, d)$ if for every $x \in S$ there exists $y \in C$ such that $d(x, y) \le \varepsilon$.


Covering number of F

Definition ($\varepsilon$-cover of $\mathcal{F}$): if $Q$ is a measure and $p \ge 1$, define $\|f\|_{L_p(Q)} = \left(\int |f(x)|^p\, dQ(x)\right)^{1/p}$. A set $V = \{f_1, f_2, \dots, f_N\}$ is an $\varepsilon$-cover of $\mathcal{F}$ if for every $f \in \mathcal{F}$ there exists an $f_j \in V$ such that $\|f - f_j\|_{L_p(Q)} < \varepsilon$.

Definition (covering number): $N_p(\varepsilon, \mathcal{F}, Q) = \min\{|V| : V \text{ is an } \varepsilon\text{-cover of } \mathcal{F}\}$.

Definition (uniform covering number): $N_p(\varepsilon, \mathcal{F}) = \sup_Q N_p(\varepsilon, \mathcal{F}, Q)$.

Definition (empirical covering number): let $\{X_i\}_{i=1}^n$ be $n$ fixed points and $Q_n$ the corresponding empirical measure. Then $\|f\|_{L_p(Q_n)} = \left(\frac{1}{n}\sum_{i=1}^n |f(X_i)|^p\right)^{1/p}$, and $N_p(\varepsilon, \mathcal{F}, Q_n)$ is called the empirical covering number: the minimal $N$ such that there exist $f_1, \dots, f_N$ with the property that for every $f \in \mathcal{F}$ there is a $j \in \{1, \dots, N\}$ such that $\left\{\frac{1}{n}\sum_{i=1}^n |f(X_i) - f_j(X_i)|^p\right\}^{1/p} < \varepsilon$.


Covering Number: Example

Suppose that $A \subset \mathbb{R}^m$, let $c = \max_{a \in A}\|a\|_2$, and assume that $A$ lies in a $d$-dimensional subspace of $\mathbb{R}^m$. Then $N(\varepsilon, A) \le (2c\sqrt{d}/\varepsilon)^d$. To see this, let $v_1, \dots, v_d$ be an orthonormal basis of the subspace; any $a \in A$ can be written $a = \sum_{i=1}^d \alpha_i v_i$ with $\|\alpha\|_\infty \le \|\alpha\|_2 = \|a\|_2 \le c$. Consider
$$A' = \left\{\sum_{i=1}^d \alpha_i' v_i : \alpha_i' \in \{-c, -c + r, -c + 2r, \dots, c\}\ \text{for all } i\right\}.$$
Given $a \in A$, there exists $a' \in A'$ whose coefficients satisfy $|\alpha_i' - \alpha_i| \le r$ for every $i$, so by orthonormality
$$\|a - a'\|_2^2 = \left\|\sum_{i=1}^d (\alpha_i' - \alpha_i) v_i\right\|_2^2 = \sum_{i=1}^d (\alpha_i' - \alpha_i)^2 \le r^2 d.$$
Setting $r\sqrt{d} = \varepsilon$, $A'$ is an $\varepsilon$-cover of $A$, and
$$N(\varepsilon, A) \le |A'| = \left(\frac{2c}{r}\right)^d = \left(\frac{2c\sqrt{d}}{\varepsilon}\right)^d.$$
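
This construction can be checked numerically (an added illustration; the subspace, the sample points, and $\varepsilon$ are arbitrary choices):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
d, m, c, eps = 2, 5, 1.0, 0.25
V, _ = np.linalg.qr(rng.normal(size=(m, d)))              # orthonormal basis v_1, ..., v_d of the subspace

alphas = rng.uniform(-1.0, 1.0, size=(200, d))
alphas *= np.minimum(1.0, c / np.linalg.norm(alphas, axis=1, keepdims=True))
A = alphas @ V.T                                          # points a = sum_i alpha_i v_i with ||a||_2 <= c

r = eps / np.sqrt(d)                                      # grid spacing from r * sqrt(d) = eps
grid_1d = np.arange(-c, c + r, r)
cover = np.array(list(product(grid_1d, repeat=d))) @ V.T  # the grid cover A'

dists = np.min(np.linalg.norm(A[:, None, :] - cover[None, :, :], axis=2), axis=1)
print(f"|A'| = {len(cover)}, every point within eps of the cover: {bool(dists.max() <= eps)}")
```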


Bounding Rademacher Complexity with Covering Number: Pollard's Lemma

Pollard's lemma: for any sample $S = \{X_i\}_{i=1}^n$ and $\mathcal{F} = \{f : \chi \to \{-1, 1\}\}$, we have
$$\mathrm{Rad}_n(\mathcal{F}, S) \le \inf_{\beta \ge 0}\left\{\beta + \sqrt{\frac{2\log N_1(\beta, \mathcal{F}, Q_n)}{n}}\right\}.$$

A key step in proving Pollard's lemma is the identity
$$\sup_{f \in \mathcal{F}}\frac{1}{n}\sum_{i=1}^n \sigma_i f(X_i) = \sup_{v \in V}\ \sup_{f \in B_\beta(v)}\frac{1}{n}\sum_{i=1}^n \sigma_i\big(f(X_i) - v_i + v_i\big),$$
where $V$ is an $\ell_1$ cover of $\mathcal{F}$ with $|V| = N_1(\beta, \mathcal{F}, Q_n)$ and $B_\beta(v) = \{f \in \mathcal{F} : \frac{1}{n}\sum_{i=1}^n |f(X_i) - v_i| \le \beta\}$; one then applies Massart's finite lemma.


Bounding Rademacher Complexity with Covering Number: Dudley's Chaining

Dudley's chaining: for any i.i.d. sample $S = \{X_i\}_{i=1}^n$ and $\mathcal{F} = \{f : \chi \to \{-1, 1\}\}$, we have
$$\mathrm{Rad}_n(\mathcal{F}, S) \le \inf_{0 \le \alpha \le 1}\left\{4\alpha + 12\int_\alpha^1 \sqrt{\frac{\log N_2(\delta, \mathcal{F}, Q_n)}{n}}\, d\delta\right\}.$$

The main idea is to use balls of different sizes for the covering: let $V_j$ be a minimal $\ell_2$ cover at scale $\varepsilon_j = 2^{-j}$ with $|V_j| = N_2(\varepsilon_j, \mathcal{F}, Q_n)$ for $j \in \{0, 1, \dots, N\}$, so that $|V_j| \le |V_{j+1}|$.


Generalization Bounds for Deep Neural Networks: Setup

We can write a DNN as follows. Let $\sigma_i : \mathbb{R}^{d_i} \to \mathbb{R}^{d_{i+1}}$ be a 1-Lipschitz continuous function with $\sigma_i(0) = 0$, and let
$$f_A(x) = \sigma_L(A_L \sigma_{L-1}(A_{L-1}\cdots\sigma_1(A_1 x)\cdots))$$
with $x \in \mathbb{R}^d$, $A_i \in \mathbb{R}^{d_i \times d_{i-1}}$, $d_0 = d$, $d_{L+1} = k$, and $W = \max_i d_i$. Assume $\|x\| \le B$.

We give two results for two slightly different function classes: one with bounded Frobenius norm, $\|A\|_F = \sqrt{\sum_{i=1}^d \sum_{j=1}^m A_{ij}^2}$, and the other with bounded spectral norm, $\|A\|_2 = \sup_{\|u\|_2 = 1}\|Au\|_2 = \lambda_{\max}$, the largest singular value of $A$.


Generalization Bounds for Deep Neural Networks

(Golowich et al., 2017) For the function class $\mathcal{F}_{A, \|\cdot\|_F, \gamma} = \{f_A : \mathbb{R}^d \to \mathbb{R}^k,\ A = \{A_1, \dots, A_L\},\ \|A_i\|_F \le M_i,\ i = 1, \dots, L,\ \gamma \le \prod_{i=1}^L \|A_i\|_2\}$,
$$\mathrm{Rad}_n(\mathcal{F}, S) \lesssim B\left(\prod_{i=1}^L M_i\right)\min\left\{\frac{\sqrt{\log\left(\gamma^{-1}\prod_{i=1}^L M_i\right)}}{\sqrt{n}},\ \sqrt{\frac{L}{n}}\right\}.$$

(Li et al., 2018) For the function class $\mathcal{F}_{A, \|\cdot\|_2} = \{f_A : \mathbb{R}^d \to \mathbb{R}^k,\ A = \{A_1, \dots, A_L\},\ \|A_i\|_2 \le s_i,\ i = 1, \dots, L,\ \sigma_i(0) = 0\}$,
$$\mathrm{Rad}_n(\mathcal{F}, S) \lesssim \frac{B\left(\prod_{i=1}^L s_i\right)}{\sqrt{n}}\sqrt{L W^2 \log\frac{\sqrt{L}\, n\, \max_{1\le i\le L} s_i}{\min_{1\le i\le L} s_i}}.$$
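
As a rough illustration only (the constants hidden by $\lesssim$ are ignored, and the network is a random toy example, not from the slides), one can compute the depth-dependent Frobenius-norm factor $B(\prod_i \|A_i\|_F)\sqrt{L/n}$ appearing in the first bound:

```python
import numpy as np

rng = np.random.default_rng(4)

n, B = 10_000, 1.0
widths = [20, 50, 50, 10]                                  # d_0, d_1, d_2, d_3 of a toy network
layers = [rng.normal(size=(dout, din)) / np.sqrt(din)      # A_i with a common 1/sqrt(fan-in) scaling
          for din, dout in zip(widths[:-1], widths[1:])]

L = len(layers)
frob_prod = np.prod([np.linalg.norm(A, "fro") for A in layers])
print(f"B * prod_i ||A_i||_F * sqrt(L/n) ≈ {B * frob_prod * np.sqrt(L / n):.3f}")
```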
