Understanding Deep Learning in Theory


Hui Jiang

Department of Electrical Engineering and Computer Science, Lassonde School of Engineering, York University, Canada

Vector Institute Seminar on April 5, 2019


Outline

1 Introduction of Deep Learning Theory
   Deep learning overview
   Deep learning theory review
   Learning problem formulation

2 Optimization Theory for Deep Learning
   Learning neural nets in literal space
   Learning neural nets in canonical space
   From canonical space back to literal space
   Why do large neural nets learn like convex optimization?

3 Learning Theory for Deep Learning
   Why do neural nets not overfit?
   Bandlimited functions
   Perfect learning
   Asymptotic regularization

4 Conclusions


Deep Learning Overview

Deep learning has achieved tremendous successes in practice: speech, vision, text, games, ···

Deep learning is sometimes criticized because it cannot be explained by current machine learning theory.

Open Problems

What is the essence of the success of neural nets?

Why do neural nets overtake other ML models in practice?

Why is it so “easy” to learn neural nets?

Why do neural nets generalize well?

What is the limit of neural nets?


Deep Learning Theory

Theoretical Issues of Neural Nets

1 expressiveness: what is the modeling capacity of neural nets?

   solved by the universal approximation theorem [Cyb89]

2 optimization: why does simple gradient descent consistently work?

   NP-hard to learn even a small neural net [BR92]
   high-dimensional and non-convex to learn large neural nets [AHW95]

3 generalization: why do over-parameterized neural nets generalize?

   VC theory gives loose bounds for simple models [Vap00]
   it totally fails to explain complex models like neural nets


Problem Formulation

Machine Learning as Stochastic Function Fitting

Inputs x ∈ R^K follow a p.d.f. p(x): x ~ p(x)

A target function y = f̄(x): a deterministic map from input x ∈ R^K to output y ∈ R

Finite training samples: D_T = {(x_1, y_1), (x_2, y_2), ···, (x_T, y_T)}, where x_t ~ p(x) and y_t = f̄(x_t) for all 1 ≤ t ≤ T

A model y = f(x|D_T) is learned from a function class C based on D_T to minimize a loss measure l(f̄, f) between f̄ and f w.r.t. p(x)

l(·) is normally convex: mean square error, cross-entropy, hinge, ...


Problem Formulation (cont’d)

Ideally, f(·) should be learned to minimize the expected risk:

Expected Risk

R(f | D_T) = E_{p(x)} [ l( f̄(x), f(x|D_T) ) ]

Practically, f(·) is learned to minimize the empirical risk:

Empirical Risk

R_emp(f | D_T) = (1/T) Σ_{t=1}^{T} l( y_t, f(x_t|D_T) )

Function class C = L¹(U_K), where U_K ≜ [0, 1]^K ⊂ R^K

f ∈ L¹(U_K)  ⟺  ∫···∫_{x∈U_K} |f(x)| dx < ∞
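To make the two risks concrete, here is a small self-contained sketch (my own illustration, not from the talk) using squared loss, a hypothetical target f̄(x) = sin(2πx) on U_1 = [0, 1], and a polynomial fit standing in for the learned model f(x|D_T); the expected risk is estimated by Monte Carlo over fresh draws from p(x).

# Toy illustration (my own, not from the talk) of empirical vs. expected risk
# under squared loss l(y, y') = (y - y')^2, with x ~ Uniform[0, 1] playing p(x).
import numpy as np

rng = np.random.default_rng(0)
f_bar = lambda x: np.sin(2 * np.pi * x)              # a hypothetical target function f_bar(x)

T = 30
x_train = rng.uniform(0.0, 1.0, size=T)              # x_t ~ p(x)
y_train = f_bar(x_train)                             # y_t = f_bar(x_t)

coeffs = np.polyfit(x_train, y_train, deg=5)         # the learned model f(x | D_T): a degree-5 polynomial
f_model = lambda x: np.polyval(coeffs, x)

emp_risk = np.mean((y_train - f_model(x_train)) ** 2)        # R_emp(f | D_T)
x_mc = rng.uniform(0.0, 1.0, size=200_000)                   # fresh draws from p(x)
exp_risk = np.mean((f_bar(x_mc) - f_model(x_mc)) ** 2)       # Monte Carlo estimate of R(f | D_T)

print(f"empirical risk ~ {emp_risk:.2e}   expected risk ~ {exp_risk:.2e}")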


Learning as Functional Minimization

Learning is formulated as model-free functional minimization in L¹(U_K):

f* = arg min_{f ∈ L¹(U_K)} Q(f|D_T) = arg min_{f ∈ L¹(U_K)} Σ_{t=1}^{T} l( y_t, f(x_t) )

Functional minimization is very generic. But how do we parameterize the function space L¹(U_K) above?


Literal Model Space: Neural Networks

Literal model space

Use neural networks to represent the function space L¹(U_K)

Literal space Λ_M: the set of all well-structured neural nets with M weights; each neural net is denoted by its weight vector w ∈ R^M

Universal approximation theorem [Cyb89]: every f(x) ∈ L¹(U_K) can be well approximated by at least one w in Λ_M.

If inputs and all weights are bounded, every w ∈ Λ_M represents a function in L¹(U_K), denoted f_w(x).

Lemma 1

If M is sufficiently large: lim_{M→∞} Λ_M ≡ L¹(U_K).


Learning in Literal Space

Learning neural nets in literal space

w* = arg min_{w ∈ R^M} Q(f_w|D_T) = arg min_{w ∈ R^M} Σ_{t=1}^{T} l( y_t, f_w(x_t) )


Canonical Model Space: Fourier Series

Fourier coefficients: every f(x) ∈ L¹(U_K) is mapped by F to θ = {θ_k | k ∈ Z^K}:

θ_k = ∫···∫_{x∈U_K} f(x) e^{−2πi k·x} dx   (∀k ∈ Z^K)

Fourier series: f(x) := F^{−1}(x|θ)

f(x) = Σ_{k∈Z^K} θ_k e^{2πi k·x}

Riemann–Lebesgue lemma: ∀ε > 0, truncate to the N most significant terms, N_ε, to form a partial sum of finitely many terms

f̂(x) = Σ_{k∈N_ε} θ_k e^{2πi k·x}

where ∫···∫_{x∈U_K} ‖f(x) − f̂(x)‖² dx ≤ ε².
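The truncation step can be made concrete with a 1-D numerical sketch (my own illustration with an arbitrary smooth target on [0, 1], not code from the talk): estimate the Fourier coefficients, keep only the N most significant ones, and watch the L² error of the partial sum shrink as N grows.

# A minimal 1-D sketch (not from the talk): approximate a function on [0, 1]
# by the most significant terms of its Fourier series and check that the
# L2 error of the partial sum shrinks as more terms are kept.
import numpy as np

def fourier_coeffs(f, K_max, n_grid=4096):
    """Numerically estimate theta_k = int_0^1 f(x) e^{-2*pi*i*k*x} dx for |k| <= K_max."""
    x = np.linspace(0.0, 1.0, n_grid, endpoint=False)
    fx = f(x)
    ks = np.arange(-K_max, K_max + 1)
    # Riemann-sum approximation of the Fourier integral
    return ks, np.array([np.mean(fx * np.exp(-2j * np.pi * k * x)) for k in ks])

def partial_sum(x, ks, theta, N):
    """Keep the N coefficients with the largest magnitude (the 'most significant' terms)."""
    keep = np.argsort(-np.abs(theta))[:N]
    return sum(theta[i] * np.exp(2j * np.pi * ks[i] * x) for i in keep).real

f = lambda x: np.exp(-x) * np.sin(6 * np.pi * x)          # an example target on [0, 1]
x = np.linspace(0.0, 1.0, 2000, endpoint=False)
ks, theta = fourier_coeffs(f, K_max=64)
for N in (5, 15, 45):
    err = np.sqrt(np.mean((f(x) - partial_sum(x, ks, theta, N)) ** 2))
    print(f"N = {N:3d}   L2 error ~ {err:.2e}")            # error decreases as N grows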


Illustration of Canonical Space of L¹(U_K)

[Figure slides: illustration of the canonical space of L¹(U_K); images not included in the transcript.]


Canonical Space of L¹(U_K)

Canonical space Θ: each set of Fourier coefficients θ ∈ Θ

Learning in canonical space

θ* = arg min_{θ ∈ Θ} Q(θ|D_T) = arg min_{θ ∈ Θ} Σ_{t=1}^{T} l( y_t, F^{−1}(x_t|θ) )


Learning in Canonical Space

Theorem 2

The objective function Q(f|D_T) is convex in the canonical space. If the dimensionality of the canonical space is not less than the number of training samples in D_T, the global minimum achieves zero loss.

Proof sketch:

Q(f|D_T) is represented in the canonical space:

Q(θ|D_T) = Σ_{t=1}^{T} l( y_t, F^{−1}(x_t|θ) ) = Σ_{t=1}^{T} l( y_t, Σ_{k∈N_ε} θ_k · e^{2πi k·x_t} )

Each term is a convex loss composed with a function that is linear in θ, so Q(θ|D_T) is convex in θ.

N unknown coefficients: θ = {θ_k | k ∈ N_ε}

T linear equations: y_t = f̂(x_t) = Σ_{k∈N_ε} θ_k · e^{2πi k·x_t}   (1 ≤ t ≤ T)
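A small numerical sketch of this proof idea (mine, not the talk's): with N ≥ T basis terms in one dimension, the T interpolation constraints form an underdetermined linear system, so a coefficient vector with zero empirical loss exists and can be found, e.g., with a least-squares solver.

# With N >= T Fourier coefficients, the T constraints y_t = sum_k theta_k e^{2*pi*i*k*x_t}
# form an underdetermined linear system, so zero empirical loss is attainable (my own sketch).
import numpy as np

rng = np.random.default_rng(0)
T, N = 20, 64                                  # T samples, N >= T basis terms (1-D case)
x = rng.uniform(0.0, 1.0, size=T)              # x_t ~ p(x) on U_1 = [0, 1]
y = np.sin(4 * np.pi * x) + 0.5 * np.cos(10 * np.pi * x)   # y_t = f_bar(x_t), an example target

ks = np.arange(-(N // 2), N // 2)              # frequency indices k in N_eps
A = np.exp(2j * np.pi * np.outer(x, ks))       # A[t, n] = e^{2*pi*i*k_n*x_t}, shape (T, N)
theta, *_ = np.linalg.lstsq(A, y.astype(complex), rcond=None)  # minimum-norm exact solution

print("max |y_t - f_hat(x_t)| =", np.max(np.abs(A @ theta - y)))   # ~ 0: zero empirical loss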


From Canonical Space back to Literal Space

Model spaces: w ⇒ f_w(x) —F→ θ ⇒ F^{−1}(x|θ) := f_θ(x)

The objective function in the two spaces:

Q(f_w|D_T) = Q(f_θ|D_T)

The chain rule:

∇_w Q(f_w|D_T) = ∇_θ Q(f_θ|D_T) · ∇_w θ = ∇_θ Q(f_θ|D_T) · ∇_w F(f_w(x))


From Canonical Space back to Literal Space (cont’d)

Disparity Matrix

[ ∂Q/∂w_1, ···, ∂Q/∂w_m, ···, ∂Q/∂w_M ]ᵀ_{M×1}
    = [ F(∂f_w(x)/∂w_1); ···; F(∂f_w(x)/∂w_m); ···; F(∂f_w(x)/∂w_M) ]_{M×N} · [ ∂Q/∂θ_1, ···, ∂Q/∂θ_n, ···, ∂Q/∂θ_N ]ᵀ_{N×1}

The M×N matrix in the middle is the disparity matrix H(w).

Gradients in the literal and canonical spaces are related via a pointwise linear transformation:

[ ∇_w Q ]_{M×1} = [ H(w) ]_{M×N} [ ∇_θ Q ]_{N×1}
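Written element-wise, the relation above is just the chain rule combined with the linearity of the Fourier transform (a restatement in LaTeX of the entries of H(w), nothing beyond the display above):

\frac{\partial Q}{\partial w_m}
  = \sum_{n=1}^{N} \frac{\partial Q}{\partial \theta_n}\,\frac{\partial \theta_n}{\partial w_m},
\qquad
H_{mn}(\mathbf{w})
  = \frac{\partial \theta_n}{\partial w_m}
  = \mathcal{F}\!\left(\frac{\partial f_{\mathbf{w}}(\mathbf{x})}{\partial w_m}\right)\!\bigg|_{\mathbf{k}_n}
  = \int_{U_K} \frac{\partial f_{\mathbf{w}}(\mathbf{x})}{\partial w_m}\, e^{-2\pi i\,\mathbf{k}_n\cdot\mathbf{x}}\, d\mathbf{x}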


From Canonical Space back to Literal Space (cont’d)

Lemma 3

Assume the neural network is sufficiently large (M ≥ N). If w* is a stationary point of Q(f_w) and H(w*) has full rank, then w* is a global minimum.

Lemma 4

If w^(0) is a stationary point of Q(f_w) and H(w^(0)) does not have full rank at w^(0), then w^(0) may be a local minimum, a saddle point, or a global minimum.


Why does learning of large-scale neural networks behave like convex optimization?

Stochastic Gradient Descent (SGD)

randomly initialize w^(0), set k = 0
for epoch = 1 to L do
    for each minibatch in training set D_T do
        w^(k+1) ← w^(k) − h_k ∇_w Q(w^(k))
        k ← k + 1
    end for
end for

Theorem 5

If M ≥ N, and the initial w^(0) and step sizes h_k are chosen so as to ensure that H(w) maintains full rank at every w^(k), then SGD/GD surely converges to a global minimum of zero loss, just like convex optimization.
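A runnable toy counterpart of the pseudocode above (my own sketch with arbitrary choices of target, width, and step size; not the talk's code): minibatch SGD on a heavily over-parameterized one-hidden-layer tanh network is typically driven to near-zero empirical loss, in the spirit of Theorem 5.

# Toy illustration: minibatch SGD on an over-parameterized one-hidden-layer tanh
# network; with far more weights than training samples the empirical loss is
# typically driven close to zero.
import numpy as np

rng = np.random.default_rng(0)
T, K, M_hidden = 50, 2, 2000                    # 50 samples, 2-D input, wide hidden layer
X = rng.uniform(0.0, 1.0, size=(T, K))
y = np.sin(2 * np.pi * X[:, 0]) * np.cos(2 * np.pi * X[:, 1])   # an example bandlimited target

W1 = rng.normal(0, 1.0, size=(K, M_hidden)); b1 = np.zeros(M_hidden)
W2 = rng.normal(0, 1.0 / np.sqrt(M_hidden), size=M_hidden)

lr, batch = 0.002, 10
for epoch in range(3000):
    for idx in np.array_split(rng.permutation(T), T // batch):
        h = np.tanh(X[idx] @ W1 + b1)           # forward pass
        err = h @ W2 - y[idx]                   # residuals for squared loss
        gW2 = h.T @ err / len(idx)              # backprop through the output layer
        gh = np.outer(err, W2) * (1.0 - h ** 2) # backprop through tanh
        gW1 = X[idx].T @ gh / len(idx); gb1 = gh.mean(axis=0)
        W2 -= lr * gW2; W1 -= lr * gW1; b1 -= lr * gb1

train_mse = np.mean((np.tanh(X @ W1 + b1) @ W2 - y) ** 2)
print("training MSE:", train_mse)               # typically very small (near-zero empirical risk)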


Why does learning of large-scale neural networks behave like convex optimization? (cont’d)

When does H(w) degenerate?

dead neurons: zero rows

duplicated neurons: linearly dependent rows

if M ≫ N, H(w) becomes singular only after at least M − N neurons are dead or duplicated.

Corollary 6

If an over-parameterized neural network is randomly initialized, SGD/GD converges to a global minimum of zero loss in probability.


Why do over-parameterized neural networks not overfit?

Current machine learning theory

VC generalization bound for classification [Vap00]:

R(f_w | D_T) ≤ R_emp(f_w | D_T) + √( [ 8 d_M (ln(2T/d_M) + 1) + 8 ln(4/δ) ] / T )

Error bound for NNs [Bar94]: approx + estimation errors

R(f_w | D_T) ≤ O( C_f² / M ) + O( (M·K / T) · log T )

Presumably, over-parameterized models will fail due to overfitting:

When M → ∞, R_emp(f_w | D_T) = 0, but R(f_w | D_T) will diverge.


Learning Problem Formulation (review)

Inputs x ∈ R^K follow a p.d.f. p(x): x ~ p(x)

A target function y = f̄(x): from input x ∈ R^K to output y ∈ R

T training samples: D_T = {(x_1, y_1), (x_2, y_2), ···, (x_T, y_T)}, where x_t ~ p(x) and y_t = f̄(x_t) for all 1 ≤ t ≤ T

A model y = f(x|D_T) is learned from D_T to minimize the loss

Empirical Risk

R_emp(f | D_T) = (1/T) Σ_{t=1}^{T} l( y_t, f(x_t|D_T) )

Expected Risk

R(f | D_T) = E_{p(x)} [ l( f̄(x), f(x|D_T) ) ]


Bandlimitedness

For real-world applications, target functions must be bandlimited.

Non-bandlimited processes must be driven by unlimited power.

Real-world data are always generated from bandlimited processes.

Definition 7 (strictly bandlimited)

F(ω) = ∫···∫_{−∞}^{+∞} f(x) e^{−i x·ω} dx = 0   if ‖ω‖ > B

Definition 8 (approximately bandlimited)

∀ε > 0, ∃B_ε > 0 such that the out-of-band residual energy satisfies

∫···∫_{‖ω‖>B_ε} ‖F(ω)‖² dω < ε²
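A 1-D numerical sketch of Definition 8 (my own, with an arbitrary example signal): estimate the fraction of spectral energy outside a band B, with a discrete Fourier transform standing in for F(ω), and observe that it decays quickly as B grows.

# Estimate the out-of-band energy fraction of a sampled signal (illustration only).
import numpy as np

n, dt = 4096, 1.0 / 4096
t = np.arange(n) * dt
f = np.exp(-t) * np.sin(2 * np.pi * 8 * t) + 0.2 * np.sin(2 * np.pi * 40 * t)  # example signal

F = np.fft.rfft(f)                         # spectrum on [0, Nyquist]
freqs = np.fft.rfftfreq(n, d=dt)           # corresponding frequencies (Hz)
energy = np.abs(F) ** 2

for B in (20.0, 60.0, 200.0):
    out_of_band = energy[freqs > B].sum() / energy.sum()
    print(f"B = {B:5.0f}   out-of-band energy fraction ~ {out_of_band:.2e}")
# The fraction drops rapidly with B: the signal is approximately bandlimited.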


Illustration of Bandlimited Fourier Spectrum

[Figure: a strictly bandlimited spectrum F(ω) vs. an approximately bandlimited spectrum F(ω); images not included in the transcript.]


Illustration of Bandlimited Functions

[Figure: four panels plotting output y against input x —
  No Band-limiting: a non-bandlimited function;
  Weakly Band-limiting: a strictly bandlimited function with a large B;
  Strongly Band-limiting: a strictly bandlimited function with a small B;
  Approximately Band-limiting: an approximately bandlimited function.
Plots not included in the transcript.]


Perfect Learning

Definition 9 (perfect learning)

Learn a model from a finite set of training samples that achieves not only zero empirical risk but also zero expected risk.

Theorem 10 (existence of perfect learning)

If the target function f̄(x) is strictly or approximately bandlimited, there exists a method to learn a model f(x|D_T) from D_T that not only leads to zero empirical risk, R_emp(f|D_T) = 0, but also yields zero expected risk in probability: R(f|D_T) →_P 0 as T → ∞.

Proof sketch: like a stochastic version of the multidimensional sampling theorem [PM62; Me01]
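For intuition only, here is a deterministic 1-D analogue of that sampling-theorem argument (my own sketch with a hypothetical signal and rates, not part of the talk's proof): a function bandlimited to B can be recovered from uniform samples taken above the Nyquist rate 2B via sinc interpolation.

# Whittaker-Shannon reconstruction of a bandlimited signal from uniform samples.
import numpy as np

B = 3.0                                              # band limit (Hz)
f = lambda t: np.sin(2 * np.pi * 1.0 * t) + 0.5 * np.cos(2 * np.pi * 2.5 * t)  # bandlimited example

fs = 8.0                                             # sampling rate > 2B
n = np.arange(-200, 201)                             # (truncated) sample index range
t_n = n / fs                                         # sample locations
samples = f(t_n)

def reconstruct(t):
    # f(t) = sum_n f(n/fs) * sinc(fs*t - n)
    return np.sum(samples * np.sinc(fs * t[:, None] - n[None, :]), axis=1)

t = np.linspace(-5.0, 5.0, 1000)
print("max reconstruction error:", np.max(np.abs(reconstruct(t) - f(t))))  # small; -> 0 as more samples are kept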


Perfect Learning

Corollary 11

If f̄(x) · p(x) is neither strictly nor approximately bandlimited, then no matter how many training samples are used, R(f|D_T) of all realizable learning algorithms has a nonzero lower bound: lim_{T→∞} R(f | D_T) ≥ ε > 0.

Therefore, we conclude:

Perfect Learning

target function f̄ (x) is bandlimited ⇐⇒ perfect learning is feasible


Non-asymptotic Analysis of Perfect Learning

When T is finite, performance is measured by mean expected risk:

R_T = E_{D_T} [ E_{p(x)} [ ‖f(x|D_T) − f̄(x)‖² ] ]

Theorem 12 (error bound of perfect learning)

If x ~ p(x) within a hypercube [−U, U]^K ⊂ R^K and the target function f̄(x) is bandlimited by B, the perfect learner is upper-bounded as:

R*_T < [ (KBU)^{n+1} · H / (n+1)! ]²

where n ≃ O(T^{1/K}) and H = sup_x |f̄(x)|.

This error bound is independent of the model complexity M

When T is small, the difficulty of learning is quantified by KBU
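To see how fast this bound collapses, a small numeric sketch (my own; it takes n = T^(1/K) literally and uses hypothetical values of K, B, U, H, since the constant hidden in n ≃ O(T^(1/K)) is not specified in the slide):

# Evaluate [(K*B*U)^(n+1) * H / (n+1)!]^2 for a few T, under the assumption n = T^(1/K).
import math

def bound(T, K, B, U, H):
    n = int(round(T ** (1.0 / K)))              # assumed form of n ~ O(T^(1/K))
    # work in log space to avoid overflow of the factorial
    log_val = (n + 1) * math.log(K * B * U) + math.log(H) - math.lgamma(n + 2)
    return math.exp(2 * log_val)

K, B, U, H = 2, 4.0, 1.0, 1.0                   # hypothetical problem constants
for T in (10, 100, 1000, 10000):
    print(f"T = {T:6d}   bound on R*_T ~ {bound(T, K, B, U, H):.3e}")
# For small T the bound is vacuous (larger than 1); once n exceeds K*B*U, the (n+1)!
# in the denominator makes it collapse super-exponentially, independently of the model size M.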


Perfect Learning in Practice

Theorem 13 (perfect learning in practice)

If the target function f̄(x) is strictly or approximately bandlimited, assume a strictly or approximately bandlimited model f(x) is learned from a sufficiently large training set D_T. If this model yields zero empirical risk on D_T:

R_emp(f | D_T) = 0,

then it is guaranteed to yield zero expected risk:

R(f | D_T) → 0 as T → ∞.


Proof sketch of Theorem 13

Any bandlimited function may be represented as a series of Fourier basis functions with decaying coefficients:

target function:

f̄(x) = ··· + η_{k_{−1}} e^{2πi k_{−1}·x} + η_{k_0} e^{2πi k_0·x} + η_{k_1} e^{2πi k_1·x} + ···

model to be learned:

f(x) = ··· + θ_{k_{−1}} e^{2πi k_{−1}·x} + θ_{k_0} e^{2πi k_0·x} + θ_{k_1} e^{2πi k_1·x} + ···

training samples: D_T = {(x_t, y_t) | 1 ≤ t ≤ T}

y_t = f̄(x_t)   t = 1, ···, T
y_t = f(x_t)   t = 1, ···, T


Proof sketch of Theorem 13 (cont’d)

T linear equations:

f̄(x_t) − f(x_t) = 0   t = 1, ···, T

target function and the model, each grouped into its T most significant terms:

f̄(x) = ··· + η_{k_{−1}} e^{2πi k_{−1}·x} + η_{k_0} e^{2πi k_0·x} + η_{k_1} e^{2πi k_1·x} + ···   (T most significant terms)

f(x) = ··· + θ_{k_{−1}} e^{2πi k_{−1}·x} + θ_{k_0} e^{2πi k_0·x} + θ_{k_1} e^{2πi k_1·x} + ···   (T most significant terms)

determine the T coefficients up to good precision: θ_k → η_k

as T → ∞, f(x) → f̄(x)   □


Machine Learning Models vs. Bandlimitedness

All machine learning models are approximately bandlimited under some minor conditions:

1 input x is bounded

2 model size is finite

3 all model parameters are bounded

4 model is piecewise continuous

PAC-learnable models are bandlimited

All PAC-learnable models are approximately bandlimited, including linear models, logistic regression, statistical models, neural networks, ...


Why Neural Nets Generalize: Asymptotic Regularization

Corollary 14

Assume a neural network f_w(x) is learned from a sufficiently large training set D_T, generated by a bandlimited process f̄(x). If f_w(x) yields zero empirical risk on D_T:

R_emp(f_w | D_T) = 0,

then it surely yields zero expected risk as T → ∞:

lim_{T→∞} R(f_w | D_T) = 0.

Definition 15 (asymptotic regularization)

Due to the bandlimitedness property, a neural network asymptotically regularizes itself as T → ∞.


Conclusions

Deep Learning Theory: Neural Nets

1 expressiveness: neural nets ⇔ universal approximator in L¹(U_K)

2 optimization: learning large neural nets is unexpectedly “simple”

   behaves like convex optimization, conditional on H(w)
   over-parameterized neural nets are complete in L¹(U_K)

3 generalization: asymptotic regularization

   real-world data are generated by bandlimited processes
   neural nets are approximately bandlimited
   neural nets self-regularize on sufficiently large training sets


Conclusions

Under big data + big models, neural nets solve supervised learning problems in the statistical sense:

lim_{T→∞} R(f̂_w | D_T) = 0.

1 collect sufficient training samples (determined by KBU)
2 fit over-parameterized models (neural nets) onto them

However, adversarial attacks are new and different ...

all neural nets are approximately bandlimited


References

More details and all proofs are found in:

Hui Jiang, “A New Perspective on Machine Learning: How to do Perfect Supervised Learning”, preprint arXiv:1901.02046, 2019.

Hui Jiang, “Why Learning of Large-Scale Neural Networks Behaves Like Convex Optimization”, preprint arXiv:1903.02140, 2019.

Thank You!

(Q&A)


References

References:

[AHW95] P. Auer, M. Herbster, and M. K. Warmuth. “Exponentially many local minima for single neurons”. In: Proc. of Advances in Neural Information Processing Systems 8 (NIPS). 1995.

[Bar94] A. R. Barron. “Approximation and Estimation Bounds for Artificial Neural Networks”. In: Machine Learning 14 (1994), pp. 115–131.

[BR92] A. L. Blum and R. L. Rivest. “Training a 3-node neural network is NP-complete”. In: Neural Networks 5.1 (1992), pp. 117–127.

[Cyb89] G. Cybenko. “Approximation by superpositions of a sigmoidal function”. In: Mathematics of Control, Signals, and Systems 2 (1989), pp. 303–314.


[Me01] F. A. Marvasti et al. Nonuniform Sampling: Theory and Practice. Kluwer Academic / Plenum Publishers, 2001.

[PM62] D. P. Petersen and D. Middleton. “Sampling and Reconstruction of Wave-Number-Limited Functions in N-Dimensional Euclidean Spaces”. In: Information and Control 5 (1962), pp. 279–323.

[Vap00] V. N. Vapnik. The Nature of Statistical Learning Theory. New York: Springer, 2000.
