Neural Tangent Kernel (NTK) Made Practical
Wei Hu, Princeton University
Today’s Theme
• NTK theory [Jacot et al. ’18]: a class of infinitely wide (or ultra-wide) neural networks trained by gradient descent ⟺ kernel regression with a fixed kernel (the NTK)
• Can we make this theory practical?
  1. to understand optimization and generalization phenomena in neural networks
  2. to evaluate the empirical performance of infinitely wide neural networks
  3. to understand network architectural components (e.g. convolution, pooling) through their NTK
  4. to design new algorithms inspired by the NTK theory
Outline
• Recap of the NTK theory
• Implications on opt and gen
• How do infinitely-wide NNs perform?
• NTK with data augmentation
• New algo to deal with noisy labels
Outline
• Recap of the NTK theory
• Implications on opt and gen
• How do infinitely-wide NNs perform?
• NTK with data augmentation
• New algo to deal with noisy labels
Setting: Supervised Learning
• Given data: $(x_1, y_1), \ldots, (x_n, y_n) \sim \mathcal{D}$
• A neural network: $f(\theta, x)$
• Goal: minimize the population loss $\mathbb{E}_{(x,y)\sim\mathcal{D}}\,\ell(f(\theta, x), y)$
• Algorithm: minimize the training loss $\frac{1}{n}\sum_{i=1}^{n} \ell(f(\theta, x_i), y_i)$ by (S)GD
[Figure: a neural network with input $x$, parameters $\theta$, and output $f(\theta, x)$]
Multi-Layer NN with “NTK Parameterization”

Neural network (a forward-pass sketch in code follows below):
$$f(\theta, x) = W^{(L+1)} \sqrt{\tfrac{c}{d_L}}\, \sigma\!\left( W^{(L)} \sqrt{\tfrac{c}{d_{L-1}}}\, \sigma\!\left( W^{(L-1)} \cdots \sqrt{\tfrac{c}{d_1}}\, \sigma\!\left( \sqrt{\tfrac{c}{d_0}}\, W^{(1)} x \right) \right) \right)$$
• $W^{(h)} \in \mathbb{R}^{d_h \times d_{h-1}}$, single output ($d_{L+1} = 1$)
• Activation function $\sigma(z)$, with $c = \left( \mathbb{E}_{z \sim \mathcal{N}(0,1)}[\sigma(z)^2] \right)^{-1}$
• Initialization: $W^{(h)}_{ij} \sim \mathcal{N}(0, 1)$
• Let hidden widths $d_1, \ldots, d_L \to \infty$
• Empirical risk minimization: $\ell(\theta) = \frac{1}{2}\sum_{i=1}^{n} \left( f(\theta, x_i) - y_i \right)^2$
• Gradient descent (GD): $\theta(t+1) = \theta(t) - \eta \nabla \ell(\theta(t))$
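Below is a minimal NumPy sketch of this NTK-parameterized forward pass (my own illustration, not code from the slides); the ReLU activation, widths, and helper names are assumptions made for the example.

```python
import numpy as np

def ntk_forward(x, weights, sigma=lambda z: np.maximum(z, 0.0), c=2.0):
    """Forward pass with NTK parameterization.

    weights = [W1, ..., W_{L+1}], W_h of shape (d_h, d_{h-1}) with entries
    drawn i.i.d. from N(0, 1). Each layer's input is scaled by sqrt(c / fan_in);
    for ReLU, c = (E[sigma(z)^2])^{-1} = 2.
    """
    h = x
    for W in weights[:-1]:
        h = sigma(W @ (np.sqrt(c / h.shape[0]) * h))
    return weights[-1] @ (np.sqrt(c / h.shape[0]) * h)  # scalar output (d_{L+1} = 1)

# Toy usage: depth L = 2, width m = 512, input dimension d0 = 10
rng = np.random.default_rng(0)
d0, m = 10, 512
weights = [rng.standard_normal((m, d0)),
           rng.standard_normal((m, m)),
           rng.standard_normal((1, m))]
print(ntk_forward(rng.standard_normal(d0), weights))
```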
The Main Idea of NTK: Linearization

First-order Taylor approximation (linearization) of the network around the initialized parameters $\theta_{\mathrm{init}}$:
$$f(\theta, x) \approx f(\theta_{\mathrm{init}}, x) + \left\langle \nabla_\theta f(\theta_{\mathrm{init}}, x),\; \theta - \theta_{\mathrm{init}} \right\rangle$$
• The approximation holds if $\theta$ is close to $\theta_{\mathrm{init}}$ (true for ultra-wide NNs trained by GD)
• The network is linear in the gradient feature map $x \mapsto \nabla_\theta f(\theta_{\mathrm{init}}, x)$

* Ignore $f(\theta_{\mathrm{init}}, x)$ for today: we can make sure it is 0 by setting $f(\theta, x) = g(\theta_1, x) - g(\theta_2, x)$ and initializing $\theta_1 = \theta_2$
Kernel Method

[Figure: an input $x$ is mapped to a feature vector $\phi(x)$, possibly infinite-dimensional]
• Kernel regression/SVM: learn a linear function of $\phi(x)$
• Kernel trick: only need to compute $k(x, x') = \langle \phi(x), \phi(x') \rangle$ for pairs of inputs $x, x'$
• Assuming $f(\theta, x) \approx \langle \nabla_\theta f(\theta_{\mathrm{init}}, x), \theta - \theta_{\mathrm{init}} \rangle$, we care about the kernel (see the sketch below):
$$k_{\mathrm{init}}(x, x') = \left\langle \nabla_\theta f(\theta_{\mathrm{init}}, x),\; \nabla_\theta f(\theta_{\mathrm{init}}, x') \right\rangle$$
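To make the gradient feature map concrete, here is a small sketch (my own, not from the talk) that estimates $k_{\mathrm{init}}(x, x')$ for a two-layer ReLU network; the finite-difference gradient and all names are assumptions chosen for simplicity.

```python
import numpy as np

M, D0, C = 64, 5, 2.0   # width, input dimension, c for ReLU (assumed for the example)

def two_layer(theta, x):
    """f(theta, x) for a two-layer ReLU net with NTK parameterization.
    theta packs W (M x D0) and a (1 x M) into one flat vector."""
    W = theta[:M * D0].reshape(M, D0)
    a = theta[M * D0:].reshape(1, M)
    h = np.maximum(W @ (np.sqrt(C / D0) * x), 0.0)
    return float(a @ (np.sqrt(C / M) * h))

def grad(theta, x, eps=1e-4):
    """Finite-difference estimate of the gradient feature map grad_theta f(theta, x)."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta); e[i] = eps
        g[i] = (two_layer(theta + e, x) - two_layer(theta - e, x)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
theta_init = rng.standard_normal(M * D0 + M)        # W_ij, a_j ~ N(0, 1)
x, x2 = rng.standard_normal(D0), rng.standard_normal(D0)
print(grad(theta_init, x) @ grad(theta_init, x2))   # empirical k_init(x, x2)
```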
The Large Width Limit

$$f(\theta, x) = W^{(L+1)} \sqrt{\tfrac{c}{d_L}}\, \sigma\!\left( W^{(L)} \sqrt{\tfrac{c}{d_{L-1}}}\, \sigma\!\left( W^{(L-1)} \cdots \sqrt{\tfrac{c}{d_1}}\, \sigma\!\left( \sqrt{\tfrac{c}{d_0}}\, W^{(1)} x \right) \right) \right)$$

If $d_1 = d_2 = \cdots = d_L = m \to \infty$, we have:

1. Convergence to a fixed kernel at initialization

Theorem [Arora, Du, H, Li, Salakhutdinov, Wang ’19]: If $\min\{d_1, \ldots, d_L\} \ge \epsilon^{-4} \cdot \mathrm{poly}(L) \cdot \log\frac{1}{\delta}$, then for any inputs $x, x'$, w.p. $1 - \delta$ over the random initialization $\theta_{\mathrm{init}}$:
$$\left| k_{\mathrm{init}}(x, x') - k(x, x') \right| \le \epsilon$$
where $k : \mathbb{R}^{d_0} \times \mathbb{R}^{d_0} \to \mathbb{R}$ is a kernel function that depends on the network architecture.
The Large Width Limit

If $d_1 = d_2 = \cdots = d_L = m \to \infty$, we have:

1. Convergence to a fixed kernel at initialization
2. Weights don’t move much during GD: $\dfrac{\left\| W^{(h)}(t) - W^{(h)}(0) \right\|}{\left\| W^{(h)}(0) \right\|} = O\!\left( \dfrac{1}{\sqrt{m}} \right)$
3. The linearization approximation holds: $f(\theta(t), x) = \left\langle \nabla_\theta f(\theta_{\mathrm{init}}, x),\; \theta(t) - \theta_{\mathrm{init}} \right\rangle \pm O\!\left( \dfrac{1}{\sqrt{m}} \right)$
Equivalence to Kernel Regression

• For an infinite-width network, we have $f(\theta, x) = \langle \nabla_\theta f(\theta_{\mathrm{init}}, x), \theta - \theta_{\mathrm{init}} \rangle$ and $\langle \nabla_\theta f(\theta_{\mathrm{init}}, x), \nabla_\theta f(\theta_{\mathrm{init}}, x') \rangle = k(x, x')$
⟹ GD is doing kernel regression w.r.t. the fixed kernel $k(\cdot, \cdot)$
• Kernel regression solution (a minimal implementation sketch follows below):
$$f_{\mathrm{ntk}}(x) = \big( k(x, x_1), \ldots, k(x, x_n) \big) \cdot K^{-1} \cdot y$$
where $K \in \mathbb{R}^{n \times n}$ with $K_{i,j} = k(x_i, x_j)$, and $y = (y_1, \ldots, y_n) \in \mathbb{R}^n$

Theorem [Arora, Du, H, Li, Salakhutdinov, Wang ’19]: If $d_1 = d_2 = \cdots = d_L = m$ is sufficiently large, then for any input $x$, w.h.p. the NN at the end of GD satisfies:
$$\left| f_{\mathrm{NN}}(x) - f_{\mathrm{ntk}}(x) \right| \le \epsilon$$

Wide NN trained by GD is equivalent to kernel regression w.r.t. the NTK!
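A minimal NumPy sketch of this kernel regression predictor; the RBF kernel here is only a stand-in assumption for illustration (in practice $k$ would be the NTK of the chosen architecture).

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=0.5):
    """Stand-in kernel; any kernel function, e.g. the NTK, could be used instead."""
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-gamma * sq)

def kernel_regression_predict(X_train, y_train, X_test, kernel=rbf_kernel):
    """f(x) = (k(x, x_1), ..., k(x, x_n)) . K^{-1} . y"""
    K = kernel(X_train, X_train)                                   # K_{ij} = k(x_i, x_j)
    alpha = np.linalg.solve(K + 1e-8 * np.eye(len(K)), y_train)    # K^{-1} y (tiny jitter)
    return kernel(X_test, X_train) @ alpha

# Toy usage
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = np.sign(X[:, 0])
print(kernel_regression_predict(X, y, rng.standard_normal((5, 3))))
```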
Outline
• Recap of the NTK theory
• Implications on opt and gen
• How do infinitely-wide NNs perform?
• NTK with data augmentation
• New algo to deal with noisy labels
True vs Random Labels
“Understanding deep learning requires rethinking generalization”, Zhang et al. 2017
True Labels: 2 1 3 1 4
Random Labels: 5 1 7 0 8
Phenomena
① Faster convergence with true labels than random labels
② Good generalization with true labels, poor generalization with random labels
Explaining Training Speed Difference

Theorem [Arora, Du, H, Li, Wang ’19]: The training loss at time $t$ is
$$\sum_{i=1}^{n} \big( f_t(x_i) - y_i \big)^2 \;\approx\; \sum_{i=1}^{n} e^{-2\eta\lambda_i t} \cdot \langle v_i, y \rangle^2$$
where $K = \sum_{i=1}^{n} \lambda_i v_i v_i^\top$ is the eigen-decomposition.

• Components of $y$ on larger eigenvectors of $K$ converge faster than those on smaller eigenvectors
• If $y$ aligns well with the top eigenvectors, then GD converges quickly
• If $y$’s projections on the eigenvectors are close to uniform, then GD converges slowly (see the sketch below)
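A small sketch of the diagnostic this theorem suggests (my own illustration; the RBF matrix is an assumed stand-in for the NTK): project the labels onto the eigenvectors of $K$ and compute the predicted per-direction decay.

```python
import numpy as np

def convergence_profile(K, y, eta=1e-3, t=1000):
    """Per-eigendirection loss contribution e^{-2*eta*lambda_i*t} * <v_i, y>^2."""
    lam, V = np.linalg.eigh(K)          # K = sum_i lambda_i v_i v_i^T
    proj = (V.T @ y) ** 2               # <v_i, y>^2: alignment of labels with eigenvectors
    return lam, proj, np.exp(-2 * eta * lam * t) * proj

# Toy comparison: structured labels vs. random labels on the same kernel matrix
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # stand-in for the NTK
y_true = np.sign(X[:, 0])                       # labels correlated with the inputs
y_rand = rng.choice([-1.0, 1.0], size=200)      # random labels
for name, y in [("structured", y_true), ("random", y_rand)]:
    _, _, remaining = convergence_profile(K, y)
    print(name, "loss remaining after t steps ≈", remaining.sum())
```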
MNIST (2 Classes)

[Figure: convergence rate and eigenvector projections for true vs. random labels]
CIFAR-10 (2 Classes)

[Figure: convergence rate and eigenvector projections for true vs. random labels]
Generalization

• It suffices to analyze generalization for $f_{\mathrm{ntk}}(x) = \big( k(x, x_1), \ldots, k(x, x_n) \big) K^{-1} y$
• For $f(x) = \sum_{i=1}^{n} \alpha_i k(x, x_i)$, its RKHS norm is $\| f \|_{\mathcal{H}} = \sqrt{\alpha^\top K \alpha}$
⟹ $\| f_{\mathrm{ntk}} \|_{\mathcal{H}} = \sqrt{y^\top K^{-1} y}$
• Then, from the classical Rademacher complexity bound [Bartlett and Mendelson ’02], we obtain the population loss bound for $f_{\mathrm{ntk}}$:
$$\mathbb{E}_{(x,y)\sim\mathcal{D}} \left| f_{\mathrm{ntk}}(x) - y \right| \;\le\; \frac{2\sqrt{y^\top K^{-1} y \cdot \mathrm{tr}(K)}}{n} + \sqrt{\frac{\log(1/\delta)}{n}}$$
• $\mathrm{tr}(K) = O(n)$ ⟹ $\mathbb{E}_{(x,y)\sim\mathcal{D}} \left| f_{\mathrm{ntk}}(x) - y \right| \le O\!\left( \sqrt{\dfrac{y^\top K^{-1} y}{n}} \right)$
• Such a bound appeared in [Arora, Du, H, Li, Wang ’19], [Cao and Gu ’19]
Explaining Generalization Difference

• Consider binary classification ($y_i = \pm 1$)
• We have the bound on classification error:
$$\Pr_{(x,y)\sim\mathcal{D}} \big[ \mathrm{sign}(f_{\mathrm{ntk}}(x)) \neq y \big] \;\le\; \sqrt{\frac{2\, y^\top K^{-1} y}{n}}$$
• This bound is a priori (it can be evaluated before training; see the sketch below)
• Can this bound distinguish true and random labels?
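One way to evaluate this a priori quantity in practice (my own sketch; the RBF kernel again stands in for the NTK):

```python
import numpy as np

def apriori_bound(K, y):
    """sqrt(2 * y^T K^{-1} y / n): a priori bound on the classification error."""
    return np.sqrt(2.0 * y @ np.linalg.solve(K, y) / len(y))

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 5))
K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # stand-in kernel
K += 1e-6 * np.eye(300)                                            # numerical jitter
y_true = np.sign(X[:, 0])                       # labels that depend on the inputs
y_rand = rng.choice([-1.0, 1.0], size=300)      # labels independent of the inputs
print("bound, structured labels:", apriori_bound(K, y_true))
print("bound, random labels    :", apriori_bound(K, y_rand))
```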
Explaining Generalization Difference

[Figure: the a priori bound evaluated on MNIST and CIFAR-10]
Outline
• Recap of the NTK theory
• Implications on opt and gen
• How do infinitely-wide NNs perform?
• NTK with data augmentation
• New algo to deal with noisy labels
NTK Formula

At random initialization, what does the value $\left\langle \nabla_\theta f(\theta_{\mathrm{init}}, x),\; \nabla_\theta f(\theta_{\mathrm{init}}, x') \right\rangle$ converge to?

$$f(\theta, x) = W^{(L+1)} \sqrt{\tfrac{c}{d_L}}\, \sigma\!\left( W^{(L)} \sqrt{\tfrac{c}{d_{L-1}}}\, \sigma\!\left( W^{(L-1)} \cdots \sqrt{\tfrac{c}{d_1}}\, \sigma\!\left( \sqrt{\tfrac{c}{d_0}}\, W^{(1)} x \right) \right) \right)$$

[Jacot et al. ’18]:
$$k(x, x') = \sum_{h=1}^{L+1} \Sigma^{(h-1)}(x, x') \cdot \prod_{h'=h}^{L+1} \dot{\Sigma}^{(h')}(x, x')$$
where $\Sigma^{(h)}$ and $\dot{\Sigma}^{(h)}$ are covariance kernels defined recursively layer by layer from the activation $\sigma$ (with the convention $\dot{\Sigma}^{(L+1)} \equiv 1$); a sketch of this recursion for ReLU follows below.
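A minimal sketch of the recursion for a fully connected ReLU network (my own implementation, assuming the standard arc-cosine closed forms for the Gaussian expectations with $c_\sigma = 2$):

```python
import numpy as np

def relu_ntk(x, xp, L):
    """Fully connected ReLU NTK via the layer-wise recursion.

    Sigma^{(0)}(x, x') = x.x'; for ReLU with c_sigma = 2, the bivariate Gaussian
    expectations defining Sigma^{(h)} and Sigma_dot^{(h)} have arc-cosine closed forms.
    """
    sxx, spp, sxp = x @ x, xp @ xp, x @ xp       # Sigma^{(0)} entries
    ntk = sxp                                     # running value of the NTK
    for _ in range(L):
        cos = np.clip(sxp / np.sqrt(sxx * spp), -1.0, 1.0)
        theta = np.arccos(cos)
        sxp_new = np.sqrt(sxx * spp) * (np.sin(theta) + (np.pi - theta) * cos) / np.pi
        sdot = (np.pi - theta) / np.pi
        ntk = ntk * sdot + sxp_new   # Theta^{(h)} = Theta^{(h-1)} * Sigma_dot^{(h)} + Sigma^{(h)}
        sxp = sxp_new                # diagonals Sigma^{(h)}(x, x) are preserved by the c_sigma = 2 scaling
    return ntk

x, xp = np.array([1.0, 0.5, -0.3]), np.array([0.2, -1.0, 0.7])
print(relu_ntk(x, xp, L=3))
```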
Convolutional Neural Networks (CNNs)
Convolutional NTK?
CNTK Formula
CNTK Formula (Cont’d)
CNTK on CIFAR-10 [Arora, Du, H, Li, Salakhutdinov, Wang ’19]

• CNTKs / infinitely wide CNNs achieve reasonably good performance
• but they are 5%-7% worse than the corresponding finite-width CNNs trained with SGD (w/o batch norm, data aug)
NTKs on Small Datasets

• On 90 UCI datasets (<5k samples) [Arora et al. 2020]:

| Classifier | Avg Acc | P95 | PMA |
|---|---|---|---|
| FC NTK | 82% | 72% | 96% |
| FC NN | 81% | 60% | 95% |
| Random Forest | 82% | 68% | 95% |
| RBF Kernel | 81% | 72% | 94% |

• On graph classification tasks [Du et al. 2019]:

| | Method | COLLAB | IMDB-B | IMDB-M | PTC |
|---|---|---|---|---|---|
| GNN | GCN | 79% | 74% | 51% | 64% |
| GNN | GIN | 80% | 75% | 52% | 65% |
| GK | WL | 79% | 74% | 51% | 60% |
| | GNTK | 84% | 77% | 53% | 68% |

NTK is a strong off-the-shelf classifier on small datasets
Outline
• Recap of the NTK theory
• Implications on opt and gen
• How do infinitely-wide NNs perform?
• NTK with data augmentation
• New algo to deal with noisy labels
Data Augmentation

• Idea: create new training images from existing images using pixel translation and flips, assuming that these operations do not change the label
• An important technique to improve generalization, and easy to implement for SGD in deep learning
• Infeasible to incorporate in kernel methods (quadratic running time in the number of samples)
Global Average Pooling (GAP)

• GAP: average the pixel values of each channel in the last conv layer
• We’ve seen that GAP significantly improves accuracy for both CNNs and CNTKs
GAP ⟺ Data Augmentation [Li, Wang, Yu, Du, H, Salakhutdinov, Arora ’19]
• Augmentation operation: full translation + circular padding
• Theorem: For CNTK with circular padding: {GAP, on original dataset} ⟺ {no GAP, on augmented dataset}
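Following up on the augmentation operation above, here is a toy sketch (my own illustration) of the full-translation group with circular padding: every 2-D circular shift of an image is added as a new example with the same label.

```python
import numpy as np

def full_translation_augment(images, labels):
    """Apply every circular (wrap-around) 2-D shift to each image.

    images: array of shape (n, H, W); labels: array of shape (n,).
    Returns an augmented dataset of size n * H * W.
    """
    aug_x, aug_y = [], []
    for img, lab in zip(images, labels):
        H, W = img.shape
        for dy in range(H):
            for dx in range(W):
                aug_x.append(np.roll(img, shift=(dy, dx), axis=(0, 1)))
                aug_y.append(lab)
    return np.stack(aug_x), np.array(aug_y)

# Toy usage: two 4x4 "images"
rng = np.random.default_rng(0)
X, y = rng.standard_normal((2, 4, 4)), np.array([0, 1])
X_aug, y_aug = full_translation_augment(X, y)
print(X_aug.shape, y_aug.shape)   # (32, 4, 4) (32,)
```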
Proof Sketch

• Let $G$ be the group of augmentation operations
• Key properties of CNTKs:
  • $\mathrm{cntk}_{\mathrm{GAP}}(x, x') = \mathrm{const} \cdot \mathbb{E}_{g \sim G}\, \mathrm{cntk}(g(x), x')$
  • $\mathrm{cntk}(x, x') = \mathrm{cntk}(g(x), g(x'))$
• Show that the two kernel regression solutions give the same prediction for all $x$:
$$\mathrm{cntk}_{\mathrm{GAP}}(x, X) \cdot \mathrm{cntk}_{\mathrm{GAP}}(X, X)^{-1} \cdot y \;=\; \mathrm{cntk}(x, X_{\mathrm{aug}}) \cdot \mathrm{cntk}(X_{\mathrm{aug}}, X_{\mathrm{aug}})^{-1} \cdot y_{\mathrm{aug}}$$
Enhanced CNTK

• Full-translation data augmentation seems a little weird
• CNTK + “local average pooling” + pre-processing [Coates et al. ’11] rivals AlexNet on CIFAR-10 (89%)

[Figure: full translation vs. local translation]
Outline
• Recap of the NTK theory
• Implications on opt and gen
• How do infinitely-wide NNs perform?
• NTK with data augmentation
• New algo to deal with noisy labels
Training with Noisy Labels

• Noisy labels lead to degradation of generalization
• Early stopping is useful empirically
• But when to stop? (requires access to a clean validation set)

[Figure: test accuracy curve on CIFAR-10 trained with 40% label noise]
Setting

• In general, there is a transition matrix $P$ such that $P_{i,j} = \Pr[\text{class } j \to \text{class } i]$
• Example: $x$ = an image whose true class is 7; the observed label is $\tilde{y} = 7$ w.p. $1 - p$, and $\tilde{y} \in \{0, 1, \ldots, 9\} \setminus \{7\}$ w.p. $p$ (a sketch of generating such labels follows below)
• Goal: Given a noisily labeled dataset $S = \{(x_i, \tilde{y}_i)\}_{i=1}^{n}$, train a neural network $f(\theta, \cdot)$ to get small loss on the clean data distribution $\mathcal{D}$:
$$L_{\mathcal{D}}(\theta) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\, \ell(f(\theta, x), y)$$
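A small sketch (my own, assuming symmetric label noise as in the example above) of generating a noisily labeled dataset from a transition matrix:

```python
import numpy as np

def symmetric_noise_matrix(num_classes, p):
    """P[i, j] = Pr[class j -> class i]: keep the true label w.p. 1 - p,
    otherwise flip to a uniformly random other class."""
    P = np.full((num_classes, num_classes), p / (num_classes - 1))
    np.fill_diagonal(P, 1.0 - p)
    return P

def corrupt_labels(y, P, rng):
    """Sample a noisy label for each clean label y[i] from column y[i] of P."""
    return np.array([rng.choice(P.shape[0], p=P[:, yi]) for yi in y])

rng = np.random.default_rng(0)
y_clean = rng.integers(0, 10, size=1000)
y_noisy = corrupt_labels(y_clean, symmetric_noise_matrix(10, p=0.4), rng)
print("fraction of flipped labels:", np.mean(y_noisy != y_clean))   # about 0.4
```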
NTK Viewpoint

• GD on the loss $\ell(\theta) = \frac{1}{2}\sum_{i=1}^{n} \left( f(\theta, x_i) - y_i \right)^2$ ⟹ kernel regression:
$$f_{\mathrm{ntk}}(x) = \big( k(x, x_1), \ldots, k(x, x_n) \big) \cdot K^{-1} \cdot y$$
• Kernel ridge regression (“soft” early stopping):
$$f_{\mathrm{ridge}}(x) = \big( k(x, x_1), \ldots, k(x, x_n) \big) \cdot (K + \lambda I)^{-1} \cdot y$$
Which training objective for the NN gives this?
NTK Viewpoint

• GD on the loss $\ell(\theta) = \frac{1}{2}\sum_{i=1}^{n} \left( f(\theta, x_i) - y_i \right)^2$ ⟹ kernel regression: $f_{\mathrm{ntk}}(x) = \big( k(x, x_1), \ldots, k(x, x_n) \big) \cdot K^{-1} \cdot y$
• Kernel ridge regression (“soft” early stopping): $f_{\mathrm{ridge}}(x) = \big( k(x, x_1), \ldots, k(x, x_n) \big) \cdot (K + \lambda I)^{-1} \cdot y$
• Two regularized training objectives:
$$\ell(\theta) = \frac{1}{2}\sum_{i=1}^{n} \big( f(\theta, x_i) - y_i \big)^2 + \frac{\lambda}{2} \left\| \theta - \theta_{\mathrm{init}} \right\|_2^2$$
or
$$\ell(\theta, b) = \frac{1}{2}\sum_{i=1}^{n} \big( f(\theta, x_i) - y_i + \lambda b_i \big)^2$$

Theorem [H, Li, Yu ’20]: For infinitely wide NN, both methods lead to kernel ridge regression (a toy sketch of the second objective follows below)
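To illustrate the second (auxiliary-variable) objective, here is a toy sketch of GD jointly over $\theta$ and the per-example variables $b$ on a plain linear model (my own illustration; the data, noise level, and hyperparameters are assumptions, and a real run would use a neural network):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 20
X = rng.standard_normal((n, d))
y_clean = np.sign(X @ rng.standard_normal(d))
flip = rng.random(n) < 0.3                    # 30% of labels flipped
y_noisy = np.where(flip, -y_clean, y_clean)

lam, eta, steps = 2.0, 0.01, 5000
theta, b = np.zeros(d), np.zeros(n)           # start from the "initialization"
for _ in range(steps):
    r = X @ theta + lam * b - y_noisy         # residuals f(theta, x_i) - y_i + lam * b_i
    theta -= eta * (X.T @ r) / n              # GD step on the auxiliary-variable loss
    b -= eta * lam * r / n
print("accuracy of the model alone on clean labels:", np.mean(np.sign(X @ theta) == y_clean))
```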
Generalization Bound

• $y \in \{\pm 1\}^n$: true labels
• $\tilde{y} \in \{\pm 1\}^n$: observed labels, each label being flipped w.p. $p$ ($p < 1/2$)
• Goal: analyze generalization of $f_{\mathrm{ridge}}(x) = \big( k(x, x_1), \ldots, k(x, x_n) \big) \cdot (K + \lambda I)^{-1} \cdot \tilde{y}$ on the clean distribution $\mathcal{D}$
• Set $\lambda$ to an appropriately chosen power of $n$
Theorem:
$$\Pr_{(x,y)\sim\mathcal{D}} \big[ \mathrm{sign}(f_{\mathrm{ridge}}(x)) \neq y \big] \;\le\; \frac{1}{1 - 2p}\, O\!\left( \frac{\lambda \sqrt{y^\top K^{-1} y}}{n} + \frac{1}{\lambda} \right)$$
Experiments

Test accuracy of ResNet-34 using different methods at various noise levels:

| Noise level | 0 | 0.2 | 0.4 | 0.6 |
|---|---|---|---|---|
| CCE normal (early stop) | 94.05 | 89.73 | 86.35 | 79.13 |
| MSE normal (early stop) | 93.88 | 89.96 | 85.92 | 78.68 |
| CCE + AUX | 94.22 | 92.07 | 87.81 | 82.60 |
| MSE + AUX | 94.25 | 92.31 | 88.92 | 83.90 |
| [Zhang and Sabuncu 2018] | - | 89.83 | 87.62 | 82.70 |

[Figure: test accuracy curve for 40% label noise]

• Training with AUX achieves very good accuracy, better than normal training with early stopping and the recent method of Zhang and Sabuncu using the same architecture
• Training with AUX doesn’t over-fit (no need to stop early)
Concluding Remarks
• Understanding neural networks by understanding kernels, and vice versa
• NTK/infinite-width NN as a powerful classifier
• Designing principled and practical algorithms inspired by the NTK theory
• Beyond NTK?
References

• Arora, Du, Hu, Li, Salakhutdinov, Wang. On Exact Computation with an Infinitely Wide Neural Net. NeurIPS 2019.
• Arora, Du, Hu, Li, Wang. Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks. ICML 2019.
• Li, Wang, Yu, Du, Hu, Salakhutdinov, Arora. Enhanced Convolutional Neural Tangent Kernels. arXiv 2019.
• Hu, Li, Yu. Understanding Generalization of Neural Networks Trained with Noisy Labels. ICLR 2020.
• Jacot, Gabriel, Hongler. Neural Tangent Kernel: Convergence and Generalization in Neural Networks. NeurIPS 2018.
Thanks!