Post on 04-May-2022
Introduction of Deep Learning Theory · Optimization Theory for Deep Learning · Learning Theory for Deep Learning · Conclusions
Understanding Deep Learning in Theory
Hui Jiang
Department of Electrical Engineering and Computer Science, Lassonde School of Engineering, York University, Canada
Vector Institute Seminar on April 5, 2019
Hui Jiang Deep Learning Theory 1/40
Introduction of Deep Learning TheoryOptimization Theory for Deep Learning
Learning Theory for Deep LearningConclusions
Outline
1 Introduction of Deep Learning Theory
    Deep learning overview
    Deep learning theory review
    Learning problem formulation
2 Optimization Theory for Deep Learning
    Learning neural nets in literal space
    Learning neural nets in canonical space
    From canonical space back to literal space
    Why do large neural nets learn like convex optimization?
3 Learning Theory for Deep Learning
    Why do neural nets not overfit?
    Bandlimited functions
    Perfect learning
    Asymptotic regularization
4 Conclusions
Deep Learning Overview
Deep learning has achieved tremendous successes in practice: speech, vision, text, games, · · ·
Deep learning is somewhat criticized because it cannot be explained by current machine learning theory.
Open Problems
What is the essence of the success of neural nets?
Why do neural nets overtake other ML models in practice?
Why is it so “easy” to learn neural nets?
Why do neural nets generalize well?
What is the limit of neural nets?
Deep Learning Theory
Theoretical Issues of Neural Nets
1 expressiveness: what is the modeling capacity of neural nets?
    solved by the universal approximation theorem [Cyb89]
2 optimization: why do simple gradient descent methods consistently work?
    NP-hard to learn even a small neural net [BR92]
    high-dimensional & non-convex to learn large neural nets [AHW95]
3 generalization: why do over-parameterized neural nets generalize?
    VC theory gives loose bounds for simple models [Vap00]
    totally fails to explain complex models like neural nets
Problem Formulation
Machine Learning as Stochastic Function Fitting
Inputs x ∈ R^K follow a p.d.f. p(x): x ∼ p(x)
A target function y = f̄(x): a deterministic map from input x ∈ R^K to output y ∈ R
Finite training samples: DT = {(x1, y1), (x2, y2), · · · , (xT, yT)}, where xt ∼ p(x) and yt = f̄(xt) for all 1 ≤ t ≤ T
A model y = f(x|DT) is learned from a function class C based on DT to minimize a loss measure l(f̄, f) between f̄ and f w.r.t. p(x)
l(·) is normally convex: mean square error, cross-entropy, hinge, ...
Problem Formulation (cont’d)
Ideally, f(·) should be learned to minimize the expected risk:

Expected Risk
$$R(f \mid \mathcal{D}_T) = \mathbb{E}_{p(\mathbf{x})}\left[\, l\left(\bar{f}(\mathbf{x}),\, f(\mathbf{x}\mid\mathcal{D}_T)\right) \right]$$

Practically, f(·) is learned to minimize the empirical loss:

Empirical Risk
$$R_{\mathrm{emp}}(f \mid \mathcal{D}_T) = \frac{1}{T}\sum_{t=1}^{T} l\left(y_t,\, f(\mathbf{x}_t\mid\mathcal{D}_T)\right)$$

Function class $\mathcal{C} = L^1(\mathbb{U}_K)$, where $\mathbb{U}_K \triangleq [0,1]^K \subset \mathbb{R}^K$:
$$f \in L^1(\mathbb{U}_K) \iff \int\!\cdots\!\int_{\mathbf{x}\in\mathbb{U}_K} |f(\mathbf{x})|\, d\mathbf{x} < \infty$$
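As a toy numerical illustration of the two risks (the target f̄, the biased model, the uniform p(x), and the squared loss below are all assumptions chosen for this sketch, not part of the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

f_bar = lambda x: np.sin(2 * np.pi * x)          # assumed target function
f_model = lambda x: np.sin(2 * np.pi * x) + 0.1  # a model with a constant bias

# Empirical risk on T training samples drawn from p(x) = Uniform[0, 1]
T = 50
x_train = rng.uniform(0, 1, T)
y_train = f_bar(x_train)
R_emp = np.mean((y_train - f_model(x_train)) ** 2)

# Monte Carlo estimate of the expected risk E_{p(x)}[ l(f_bar(x), f(x)) ]
x_big = rng.uniform(0, 1, 100_000)
R_exp = np.mean((f_bar(x_big) - f_model(x_big)) ** 2)

print(R_emp, R_exp)
```

For this deliberately biased model the pointwise error is a constant 0.1, so both risks equal 0.01 exactly; in general the empirical risk only estimates the expected one.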
Learning as Functional Minimization
Learning is formulated as model-free functional minimization in $L^1(\mathbb{U}_K)$:
$$f^* = \arg\min_{f\in L^1(\mathbb{U}_K)} Q(f\mid\mathcal{D}_T) = \arg\min_{f\in L^1(\mathbb{U}_K)} \sum_{t=1}^{T} l\left(y_t,\, f(\mathbf{x}_t)\right)$$
Functional minimization is very generic. But how do we parameterize the function space $L^1(\mathbb{U}_K)$ above?
Literal Model Space: Neural Networks
Literal model space
Use neural networks to represent the function space L1(UK)
Literal space ΛM: the set of all well-structured neural nets with M weights; each neural net is denoted as w ∈ R^M
Universal approximation theorem [Cyb89]: every f(x) ∈ L1(UK) can be well approximated by at least one w in ΛM.
If the inputs and all weights are bounded, every w ∈ ΛM represents a function in L1(UK), denoted fw(x).

Lemma 1
If M is sufficiently large, $\lim_{M\to\infty} \Lambda_M \equiv L^1(\mathbb{U}_K)$.
Learning in Literal Space
Learning neural nets in literal space
$$\mathbf{w}^* = \arg\min_{\mathbf{w}\in\mathbb{R}^M} Q(f_{\mathbf{w}}\mid\mathcal{D}_T) = \arg\min_{\mathbf{w}\in\mathbb{R}^M} \sum_{t=1}^{T} l\left(y_t,\, f_{\mathbf{w}}(\mathbf{x}_t)\right)$$
Canonical Model Space: Fourier Series
Fourier coefficients: $\forall f(\mathbf{x}) \in L^1(\mathbb{U}_K) \stackrel{\mathcal{F}}{\Longrightarrow} \boldsymbol{\theta} = \{\theta_{\mathbf{k}} \mid \mathbf{k}\in\mathbb{Z}^K\}$
$$\theta_{\mathbf{k}} = \int\!\cdots\!\int_{\mathbf{x}\in\mathbb{U}_K} f(\mathbf{x})\, e^{-2\pi i\,\mathbf{k}\cdot\mathbf{x}}\, d\mathbf{x} \qquad (\forall \mathbf{k}\in\mathbb{Z}^K)$$
Fourier series: $f(\mathbf{x}) := \mathcal{F}^{-1}(\mathbf{x}\mid\boldsymbol{\theta})$
$$f(\mathbf{x}) = \sum_{\mathbf{k}\in\mathbb{Z}^K} \theta_{\mathbf{k}}\, e^{2\pi i\,\mathbf{k}\cdot\mathbf{x}}$$
Riemann–Lebesgue lemma: $\forall \varepsilon > 0$, truncate to the $N$ most significant terms, $\mathbb{N}_\varepsilon$, to form a partial sum of finitely many terms
$$\hat{f}(\mathbf{x}) = \sum_{\mathbf{k}\in\mathbb{N}_\varepsilon} \theta_{\mathbf{k}}\, e^{2\pi i\,\mathbf{k}\cdot\mathbf{x}},
\quad\text{where}\quad \int\!\cdots\!\int_{\mathbf{x}\in\mathbb{U}_K} \|f(\mathbf{x}) - \hat{f}(\mathbf{x})\|^2\, d\mathbf{x} \le \varepsilon^2.$$
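A minimal 1-D sketch of this truncation (the toy target and the cutoff |k| ≤ 4 are assumptions for illustration; the talk's setting is K-dimensional):

```python
import numpy as np

# Truncated Fourier series on [0, 1] (1-D sketch of the K-dimensional case).
def fourier_coeff(f, k, grid):
    # theta_k = integral_0^1 f(x) e^{-2 pi i k x} dx, computed as an average
    # over a uniform grid covering one full period (accurate for periodic f)
    return np.mean(f(grid) * np.exp(-2j * np.pi * k * grid))

def partial_sum(theta, ks, x):
    # f_hat(x) = sum_k theta_k e^{2 pi i k x}
    return sum(t * np.exp(2j * np.pi * k * x) for t, k in zip(theta, ks)).real

f = lambda x: np.sin(2 * np.pi * x) + 0.5 * np.cos(6 * np.pi * x)
grid = np.linspace(0.0, 1.0, 4000, endpoint=False)
ks = list(range(-4, 5))                  # keep |k| <= 4: covers both frequencies
theta = [fourier_coeff(f, k, grid) for k in ks]

err = float(np.max(np.abs(partial_sum(theta, ks, grid) - f(grid))))
print(err)
```

Because this toy f contains only frequencies 1 and 3, the finite partial sum recovers it up to numerical precision; a generic f in L1 would leave an ε-sized residual instead.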
Illustration of Canonical Space of L1(UK)
(illustration figures omitted; not recoverable from the extracted text)
Canonical Space of L1(UK )
Canonical space Θ: each set of Fourier coefficients θ ∈ Θ
Learning in canonical space
$$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}\in\Theta} Q(\boldsymbol{\theta}\mid\mathcal{D}_T) = \arg\min_{\boldsymbol{\theta}\in\Theta} \sum_{t=1}^{T} l\left(y_t,\, \mathcal{F}^{-1}(\mathbf{x}_t\mid\boldsymbol{\theta})\right)$$
Learning in Canonical Space
Theorem 2
The objective function Q(f|DT) is convex in the canonical space. If the dimensionality of the canonical space is not less than the number of training samples in DT, the global minimum achieves zero loss.
Proof sketch:
$Q(f\mid\mathcal{D}_T)$ is represented in the canonical space:
$$Q(\boldsymbol{\theta}\mid\mathcal{D}_T) = \sum_{t=1}^{T} l\left(y_t,\, \mathcal{F}^{-1}(\mathbf{x}_t\mid\boldsymbol{\theta})\right)
= \sum_{t=1}^{T} l\Big(y_t,\, \sum_{\mathbf{k}\in\mathbb{N}_\varepsilon} \theta_{\mathbf{k}}\, e^{2\pi i\,\mathbf{k}\cdot\mathbf{x}_t}\Big)$$
$N$ unknown coefficients: $\boldsymbol{\theta} = \{\theta_{\mathbf{k}} \mid \mathbf{k}\in\mathbb{N}_\varepsilon\}$
$T$ linear equations: $y_t = \hat f(\mathbf{x}_t) = \sum_{\mathbf{k}\in\mathbb{N}_\varepsilon} \theta_{\mathbf{k}}\, e^{2\pi i\,\mathbf{k}\cdot\mathbf{x}_t} \quad (1 \le t \le T)$
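The proof's reduction to linear equations can be sketched numerically: with N ≥ T Fourier basis functions, the T equations admit a coefficient vector with zero empirical loss (the toy data and basis range below are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# T training points, N >= T Fourier basis functions on [0, 1] (1-D toy case).
T = 20
ks = np.arange(-12, 13)                  # N = 25 coefficients
x = rng.uniform(0, 1, T)
y = np.sin(2 * np.pi * x) ** 3           # assumed target values

# Design matrix A[t, n] = e^{2 pi i k_n x_t}; the system A theta = y is
# underdetermined, so lstsq returns the minimum-norm exact solution.
A = np.exp(2j * np.pi * x[:, None] * ks[None, :])
theta, *_ = np.linalg.lstsq(A, y.astype(complex), rcond=None)

train_loss = float(np.mean(np.abs(A @ theta - y) ** 2))
print(train_loss)
```

The residual is numerically zero: the global minimum in the canonical space attains zero empirical loss whenever the design matrix has full row rank, matching the convexity claim of Theorem 2.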
From Canonical Space back to Literal Space
Model spaces: $\mathbf{w} \Rightarrow f_{\mathbf{w}}(\mathbf{x}) \stackrel{\mathcal{F}}{\Longrightarrow} \boldsymbol{\theta} \Rightarrow \mathcal{F}^{-1}(\mathbf{x}\mid\boldsymbol{\theta}) := f_{\boldsymbol{\theta}}(\mathbf{x})$

The objective function in the two spaces:
$$Q(f_{\mathbf{w}}\mid\mathcal{D}_T) = Q(f_{\boldsymbol{\theta}}\mid\mathcal{D}_T)$$

The chain rule:
$$\nabla_{\mathbf{w}} Q(f_{\mathbf{w}}\mid\mathcal{D}_T) = \nabla_{\boldsymbol{\theta}} Q(f_{\boldsymbol{\theta}}\mid\mathcal{D}_T)\, \nabla_{\mathbf{w}}\boldsymbol{\theta} = \nabla_{\boldsymbol{\theta}} Q(f_{\boldsymbol{\theta}}\mid\mathcal{D}_T)\, \nabla_{\mathbf{w}} \mathcal{F}\big(f_{\mathbf{w}}(\mathbf{x})\big)$$
From Canonical Space back to Literal Space (cont’d)
Disparity Matrix
$$\begin{bmatrix} \frac{\partial Q}{\partial w_1} \\ \vdots \\ \frac{\partial Q}{\partial w_m} \\ \vdots \\ \frac{\partial Q}{\partial w_M} \end{bmatrix}_{M\times 1}
= \underbrace{\begin{bmatrix} \mathcal{F}\big(\frac{\partial f_{\mathbf{w}}(\mathbf{x})}{\partial w_1}\big) \\ \vdots \\ \mathcal{F}\big(\frac{\partial f_{\mathbf{w}}(\mathbf{x})}{\partial w_m}\big) \\ \vdots \\ \mathcal{F}\big(\frac{\partial f_{\mathbf{w}}(\mathbf{x})}{\partial w_M}\big) \end{bmatrix}_{M\times N}}_{\text{disparity matrix } \mathbf{H}(\mathbf{w})}
\begin{bmatrix} \frac{\partial Q}{\partial \theta_1} \\ \vdots \\ \frac{\partial Q}{\partial \theta_n} \\ \vdots \\ \frac{\partial Q}{\partial \theta_N} \end{bmatrix}_{N\times 1}$$

Gradients in the literal and canonical spaces are related via a pointwise linear transformation:
$$\big[\nabla_{\mathbf{w}} Q\big]_{M\times 1} = \big[\mathbf{H}(\mathbf{w})\big]_{M\times N}\, \big[\nabla_{\boldsymbol{\theta}} Q\big]_{N\times 1}$$
From Canonical Space back to Literal Space (cont’d)
Lemma 3
Assume the neural network is sufficiently large (M ≥ N). If w∗ is a stationary point of Q(fw) and H(w∗) has full rank, then w∗ is a global minimum.
Lemma 4
If w(0) is a stationary point of Q(fw) and H(w(0)) does not have full rank at w(0), then w(0) may be a local minimum, a saddle point, or a global minimum.
Why does learning of large-scale neural networks behave like convex optimization?
Stochastic Gradient Descent (SGD)
randomly initialize w(0), set k = 0
for epoch = 1 to L do
    for each minibatch in training set DT do
        w(k+1) ← w(k) − hk ∇w Q(w(k))
        k ← k + 1
    end for
end for
Theorem 5
If M ≥ N, and the initial w(0) and step sizes hk are chosen so as to ensure H(w) maintains full rank at every w(k), then SGD/GD surely converges to a global minimum of zero loss, as in convex optimization.
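A hedged sanity check of this convex-like behavior, using an assumed over-parameterized linear-in-parameters model (a toy stand-in, not the talk's neural-net construction): plain full-batch gradient descent drives the loss to a global minimum of zero.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy over-parameterized model: M = 50 weights, T = 10 samples, squared loss.
T, M = 10, 50
Phi = rng.normal(size=(T, M))            # random features, full row rank w.h.p.
y = rng.normal(size=T)                   # arbitrary targets

w = np.zeros(M)
lr = 1.0 / np.linalg.norm(Phi, 2) ** 2   # step size below 2/L for squared loss
for _ in range(2_000):                   # plain full-batch gradient descent
    w -= lr * Phi.T @ (Phi @ w - y)      # gradient of 0.5 * ||Phi w - y||^2

loss = float(np.mean((Phi @ w - y) ** 2))
print(loss)
```

With more parameters than samples and a full-row-rank feature matrix, every stationary point reached this way interpolates the data, so the training loss is driven to (numerically) zero.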
Why does learning of large-scale neural networks behave like convex optimization? (cont’d)

When does H(w) degenerate?
dead neurons: zero rows
duplicated neurons: linearly dependent rows
if M ≫ N, H(w) becomes singular only after at least M − N neurons are dead or duplicated.
Corollary 6
If an over-parameterized neural network is randomly initialized,SGD/GD converges to a global minimum of zero loss in probability.
Why do over-parameterized neural networks not overfit?
Current machine learning theory
VC generalization bound for classification [Vap00]:
$$R(f_{\mathbf{w}} \mid \mathcal{D}_T) \le R_{\mathrm{emp}}(f_{\mathbf{w}} \mid \mathcal{D}_T) + \sqrt{\frac{8\, d_M \left(\ln\frac{2T}{d_M} + 1\right) + 8\ln\frac{4}{\delta}}{T}}$$

Error bound for neural nets [Bar94]: approximation + estimation errors
$$R(f_{\mathbf{w}} \mid \mathcal{D}_T) \le O\!\left(\frac{C_f^2}{M}\right) + O\!\left(\frac{M K}{T}\log T\right)$$

Presumably, over-parameterized models will fail due to overfitting: when $M \to \infty$, $R_{\mathrm{emp}}(f_{\mathbf{w}} \mid \mathcal{D}_T) = 0$, but $R(f_{\mathbf{w}} \mid \mathcal{D}_T)$ will diverge.
Learning Problem Formulation (review)
Inputs x ∈ R^K follow a p.d.f. p(x): x ∼ p(x)
A target function y = f̄(x): from input x ∈ R^K to output y ∈ R
T training samples: DT = {(x1, y1), (x2, y2), · · · , (xT, yT)}, where xt ∼ p(x) and yt = f̄(xt) for all 1 ≤ t ≤ T
A model y = f(x|DT) is learned from DT to minimize the loss

Empirical Risk
$$R_{\mathrm{emp}}(f \mid \mathcal{D}_T) = \frac{1}{T}\sum_{t=1}^{T} l\left(y_t,\, f(\mathbf{x}_t\mid\mathcal{D}_T)\right)$$

Expected Risk
$$R(f \mid \mathcal{D}_T) = \mathbb{E}_{p(\mathbf{x})}\left[\, l\left(\bar f(\mathbf{x}),\, f(\mathbf{x}\mid\mathcal{D}_T)\right)\right]$$
Bandlimitedness
For real-world applications, target functions must be bandlimited.
Non-bandlimited processes must be driven by unlimited power.
Real-world data are always generated from bandlimited processes.
Definition 7 (strictly bandlimited)
$$F(\boldsymbol{\omega}) = \int\!\cdots\!\int_{-\infty}^{+\infty} f(\mathbf{x})\, e^{-i\,\mathbf{x}\cdot\boldsymbol{\omega}}\, d\mathbf{x} = 0 \quad \text{if } \|\boldsymbol{\omega}\| > B$$

Definition 8 (approximately bandlimited)
$\forall \varepsilon > 0$, $\exists B_\varepsilon > 0$ such that the out-of-band residual energy satisfies
$$\int\!\cdots\!\int_{\|\boldsymbol{\omega}\| > B_\varepsilon} \|F(\boldsymbol{\omega})\|^2\, d\boldsymbol{\omega} < \varepsilon^2$$
Illustration of Bandlimited Fourier Spectrum
Strictly bandlimited F(ω): (spectrum sketch omitted)
Approximately bandlimited F(ω): (spectrum sketch omitted)
Illustration of Bandlimited Functions
(four example plots of output y against input x omitted:)
No band-limiting: a non-bandlimited function
Weakly band-limiting: a strictly bandlimited function with a large B
Strongly band-limiting: a strictly bandlimited function with a small B
Approximately band-limiting: an approximately bandlimited function
Perfect Learning
Definition 9 (perfect learning)
Learn a model from a finite set of training samples that achieves not only zero empirical risk but also zero expected risk.
Theorem 10 (existence of perfect learning)
If the target function f̄(x) is strictly or approximately bandlimited, there exists a method to learn a model f(x|DT) from DT that not only leads to zero empirical risk, Remp(f|DT) = 0, but also yields zero expected risk in probability: $R(f\mid\mathcal{D}_T) \xrightarrow{P} 0$ as $T \to \infty$.
Proof sketch: like a stochastic version of the multidimensional sampling theorem [PM62; Mar01].
Perfect Learning
Corollary 11
If f̄(x) · p(x) is neither strictly nor approximately bandlimited, then no matter how many training samples are used, R(f|DT) of all realizable learning algorithms has a nonzero lower bound: limT→∞ R(f | DT) ≥ ε > 0.
Therefore, we conclude:
Perfect Learning
target function f̄ (x) is bandlimited ⇐⇒ perfect learning is feasible
Non-asymptotic Analysis of Perfect Learning
When T is finite, performance is measured by the mean expected risk:
$$R_T = \mathbb{E}_{\mathcal{D}_T}\left[\mathbb{E}_{p(\mathbf{x})}\left[\|f(\mathbf{x}\mid\mathcal{D}_T) - \bar f(\mathbf{x})\|^2\right]\right]$$

Theorem 12 (error bound of perfect learning)
If $\mathbf{x} \sim p(\mathbf{x})$ within a hypercube $[-U, U]^K \subset \mathbb{R}^K$ and the target function $\bar f(\mathbf{x})$ is bandlimited by $B$, then the perfect learner is upper-bounded as:
$$R^*_T < \left[\frac{(KBU)^{n+1}\, H}{(n+1)!}\right]^2$$
where $n \simeq O(T^{1/K})$ and $H = \sup_{\mathbf{x}} |\bar f(\mathbf{x})|$.

This error bound is independent of the model complexity M.
When T is small, the difficulty of learning is quantified by KBU.
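To see how fast the factorial term drives this bound down, one can evaluate it for assumed constants (all values below are illustrative choices, not figures from the talk):

```python
from math import factorial

# Theorem 12 bound: R*_T < ((K*B*U)^(n+1) * H / (n+1)!)^2, with n ~ O(T^(1/K)).
def risk_bound(K, B, U, H, n):
    return ((K * B * U) ** (n + 1) * H / factorial(n + 1)) ** 2

K, B, U, H = 1, 2.0, 1.0, 1.0            # assumed problem constants
for n in (5, 10, 20):
    print(n, risk_bound(K, B, U, H, n))  # the factorial dominates: rapid decay
```

Because (n+1)! eventually outgrows (KBU)^(n+1) for any fixed K, B, U, the bound collapses super-exponentially once n (roughly T^(1/K)) passes KBU, which is why KBU quantifies the difficulty at small T.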
Perfect Learning in Practice
Theorem 13 (perfect learning in practice)
If the target function f̄(x) is strictly or approximately bandlimited, assume a strictly or approximately bandlimited model f(x) is learned from a sufficiently large training set DT. If this model yields zero empirical risk on DT:
$$R_{\mathrm{emp}}(f \mid \mathcal{D}_T) = 0,$$
then it is guaranteed to yield zero expected risk:
$$R(f \mid \mathcal{D}_T) \longrightarrow 0 \quad \text{as } T \to \infty.$$
Proof sketch of Theorem 13
Any bandlimited function may be represented as a series of Fourier basis functions with decaying coefficients:

target function:
$$\bar f(\mathbf{x}) = \cdots + \eta_{\mathbf{k}_{-1}} e^{2\pi i\,\mathbf{k}_{-1}\cdot\mathbf{x}} + \eta_{\mathbf{k}_0} e^{2\pi i\,\mathbf{k}_0\cdot\mathbf{x}} + \eta_{\mathbf{k}_1} e^{2\pi i\,\mathbf{k}_1\cdot\mathbf{x}} + \cdots$$

model to be learned:
$$f(\mathbf{x}) = \cdots + \theta_{\mathbf{k}_{-1}} e^{2\pi i\,\mathbf{k}_{-1}\cdot\mathbf{x}} + \theta_{\mathbf{k}_0} e^{2\pi i\,\mathbf{k}_0\cdot\mathbf{x}} + \theta_{\mathbf{k}_1} e^{2\pi i\,\mathbf{k}_1\cdot\mathbf{x}} + \cdots$$

training samples: $\mathcal{D}_T = \{(\mathbf{x}_t, y_t) \mid 1 \le t \le T\}$
$$y_t = \bar f(\mathbf{x}_t), \qquad y_t = f(\mathbf{x}_t), \qquad t = 1,\cdots,T$$
Proof sketch of Theorem 13 (cont’d)
T linear equations:
$$\bar f(\mathbf{x}_t) - f(\mathbf{x}_t) = 0 \qquad t = 1,\cdots,T$$

target function and the model:
$$\bar f(\mathbf{x}) = \cdots + \underbrace{\eta_{\mathbf{k}_{-1}} e^{2\pi i\,\mathbf{k}_{-1}\cdot\mathbf{x}} + \eta_{\mathbf{k}_0} e^{2\pi i\,\mathbf{k}_0\cdot\mathbf{x}} + \eta_{\mathbf{k}_1} e^{2\pi i\,\mathbf{k}_1\cdot\mathbf{x}}}_{T \text{ most significant terms}} + \cdots$$
$$f(\mathbf{x}) = \cdots + \underbrace{\theta_{\mathbf{k}_{-1}} e^{2\pi i\,\mathbf{k}_{-1}\cdot\mathbf{x}} + \theta_{\mathbf{k}_0} e^{2\pi i\,\mathbf{k}_0\cdot\mathbf{x}} + \theta_{\mathbf{k}_1} e^{2\pi i\,\mathbf{k}_1\cdot\mathbf{x}}}_{T \text{ most significant terms}} + \cdots$$

determine the T coefficients up to good precision: $\theta_{\mathbf{k}} \to \eta_{\mathbf{k}}$
as $T \to \infty$, $f(\mathbf{x}) \to \bar f(\mathbf{x})$ □
Machine Learning Models vs. Bandlimitedness
All machine learning models are approximately bandlimited undersome minor conditions:
1 input x is bounded
2 model size is finite
3 all model parameters are bounded
4 model is piecewise continuous
PAC-learnable models are bandlimited
All PAC-learnable models are approximately bandlimited,including linear models, logistic regression, statistical models,neural networks ...
Why Neural Nets Generalize: Asymptotic Regularization
Corollary 14
Assume a neural network fw(x) is learned from a sufficiently large training set DT generated by a bandlimited process f̄(x). If fw(x) yields zero empirical risk on DT:
$$R_{\mathrm{emp}}(f_{\mathbf{w}} \mid \mathcal{D}_T) = 0,$$
then it surely yields zero expected risk as T → ∞:
$$\lim_{T\to\infty} R(f_{\mathbf{w}} \mid \mathcal{D}_T) = 0.$$
Definition 15 (asymptotic regularization)
Due to the bandlimitedness property, a neural network asymptotically regularizes itself as T → ∞.
Conclusions
Deep Learning Theory: Neural Nets
1 expressiveness: neural nets ⇔ universal approximator in L1(UK)
2 optimization: learning large neural nets is unexpectedly “simple”
    behaves like convex optimization, conditional on H(w)
    over-parameterized neural nets are complete in L1(UK)
3 generalization: asymptotic regularization
    real-world data are generated by bandlimited processes
    neural nets are approximately bandlimited
    neural nets self-regularize on sufficiently large training sets
Under big data + big model, neural nets solve supervised learning problems in the statistical sense:
$$\lim_{T\to\infty} R(\hat f_{\mathbf{w}}\mid\mathcal{D}_T) = 0.$$
1 collect sufficient training samples (determined by KBU)
2 fit over-parameterized models (neural nets) onto them
However, adversarial attacks are new and different ...
all neural nets are approximately bandlimited
References
More details and all proofs are found in:
Hui Jiang, “A New Perspective on Machine Learning: How to doPerfect Supervised Learning”, preprint arXiv:1901.02046, 2019.
Hui Jiang, “Why Learning of Large-Scale Neural Networks BehavesLike Convex Optimization”, preprint arXiv:1903.02140, 2019.
Thank You!
(Q&A)
References:
[AHW95] P. Auer, M. Herbster, and M. K. Warmuth. “Exponentially many local minima for single neurons”. In: Proc. of Advances in Neural Information Processing Systems 8 (NIPS). 1995.
[Bar94] A. R. Barron. “Approximation and Estimation Bounds for Artificial Neural Networks”. In: Machine Learning 14 (1994), pp. 115–131.
[BR92] A. L. Blum and R. L. Rivest. “Training a 3-node neural network is NP-complete”. In: Neural Networks 5.1 (1992), pp. 117–127.
[Cyb89] G. Cybenko. “Approximation by superpositions of a sigmoidal function”. In: Mathematics of Control, Signals, and Systems 2 (1989), pp. 303–314.
[Mar01] F. A. Marvasti et al. Nonuniform Sampling: Theory and Practice. Kluwer Academic / Plenum Publishers, 2001.
[PM62] D. P. Petersen and D. Middleton. “Sampling and Reconstruction of Wave-Number-Limited Functions in N-Dimensional Euclidean Spaces”. In: Information and Control 5 (1962), pp. 279–323.
[Vap00] V. N. Vapnik. The Nature of Statistical Learning Theory. New York: Springer, 2000.