
Page 1

AdaNet: Adaptive Structural Learning of Artificial Neural Networks

Corinna Cortes (1), Xavier Gonzalvo (1), Vitaly Kuznetsov (1), Mehryar Mohri (2,1), Scott Yang (2)

(1) Google Research
(2) Courant Institute

ICML 2017 / Presenter: Anant Kharkar

Page 2

Outline

1 Introduction
    Motivation & Contributions
2 Theoretical Background
    Regularizer Derivation
    Generalization Bounds
3 AdaNet
    Basics
    Algorithm
    Results
    Variations


Page 4

Motivation & Contributions

Motivation

Lack of theory behind model architecture design

Architecture design usually requires domain knowledge

Contributions

New regularizer that learns parameters & architecture simultaneously

Additive model

Context

Strongly convex optimization problems

Generic network structure (not just MLP)



Page 6

Regularizer Derivation

Function families:

H_1 = {x ↦ u · Ψ(x) : u ∈ R^{n_0}, ‖u‖_p ≤ Λ_{1,0}}

H_k = {x ↦ ∑_{s=1}^{k−1} u_s · (ϕ_s ◦ h_s)(x) : u_s ∈ R^{n_s}, ‖u_s‖_p ≤ Λ_{k,s}, h_s ∈ H_s}

Definitions:

x: input

u: connections to previous layers

Ψ: feature vector

ϕ: activation function
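As a concrete illustration of these nested families, the sketch below evaluates one unit of H_k as a weighted combination of activations of units from the lower layers, enforcing the p-norm bound ‖u_s‖_p ≤ Λ_{k,s} by rescaling. This is a hypothetical sketch (all function and parameter names are illustrative, not from the paper's code), and the rescaling is a simple way to satisfy the constraint rather than a true projection.

```python
import numpy as np

def bound_p_norm(u, p, radius):
    """Rescale u so that ||u||_p <= radius (simple rescaling; enough to
    enforce the constraint, though not a Euclidean projection for p != 2)."""
    norm = np.linalg.norm(u, ord=p)
    return u if norm <= radius else u * (radius / norm)

def unit_in_Hk(x, layers, weights, p=1, radius=1.0, act=np.tanh):
    """Evaluate one unit of H_k as sum_s u_s . (phi_s o h_s)(x).

    layers:  list of functions; layers[s](x) returns the vector of outputs
             of the lower-layer units h_s at input x (s = 1 .. k-1).
    weights: list of connection vectors u_s, one per lower layer.
    """
    total = 0.0
    for h_s, u_s in zip(layers, weights):
        u_s = bound_p_norm(np.asarray(u_s, dtype=float), p, radius)
        total += u_s @ act(h_s(x))   # u_s . (phi_s o h_s)(x)
    return total
```

For example, a unit in H_2 over a single linear H_1 unit combines that unit's tanh activation with a weight clipped to the unit l1 ball.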



Page 8

Generalization Bounds

Rademacher Complexity:

R̂_S(G) = (1/m) E_σ[ sup_{h∈G} ∑_{i=1}^{m} σ_i h(x_i) ]

Definitions:

x: input

h: function expressed by single unit

σ_i: i.i.d. Rademacher random variables, uniform on {−1, +1}
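The empirical Rademacher complexity above can be approximated by Monte Carlo for a finite hypothesis class: draw sign vectors σ, take the supremum of the correlation over the class, and average. A minimal sketch, assuming the hypotheses' predictions on the sample are precomputed (names are illustrative):

```python
import numpy as np

def empirical_rademacher(preds, n_draws=2000, seed=0):
    """Monte Carlo estimate of R_S(G) = (1/m) E_sigma[ sup_h sum_i sigma_i h(x_i) ].

    preds: (num_hypotheses, m) array with preds[j, i] = h_j(x_i) on sample S.
    """
    rng = np.random.default_rng(seed)
    _, m = preds.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)   # Rademacher draw
        total += np.max(preds @ sigma)            # sup over the finite class
    return total / (n_draws * m)
```

For the two constant hypotheses h ≡ +1 and h ≡ −1 on m = 4 points, the supremum is |∑ σ_i|, whose expectation is 1.5, so the complexity is 1.5/4 = 0.375.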


Page 9

Generalization Bounds

Theorem 1:

R(f) ≤ R̂_{S,ρ}(f) + (4/ρ) ∑_{k=1}^{l} ‖w_k‖_1 R_m(H̃_k) + (2/ρ) √(log l / m) + C(ρ, l, m, δ)

C(ρ, l, m, δ) = √( ⌈(4/ρ²) log(ρ²m / log l)⌉ · (log l / m) + log(2/δ) / (2m) )

∑_{k=1}^{l} ‖w_k‖_1 R_m(H̃_k) is a weighted sum of the Rademacher complexities of the layer families.


Page 10

Generalization Bounds

Corollary 1:

R(f) ≤ R̂_{S,ρ}(f) + (2/ρ) ∑_{k=1}^{l} ‖w_k‖_1 [ r̄_∞ Λ_k N_k^{1/q} √(2 log(2n_0) / m) ] + (2/ρ) √(log l / m) + C(ρ, l, m, δ)

C(ρ, l, m, δ) = √( ⌈(4/ρ²) log(ρ²m / log l)⌉ · (log l / m) + log(2/δ) / (2m) )



Page 12

Basics

Adaptively grows network structure

Objective function:

F(w) = (1/m) ∑_{i=1}^{m} Φ(1 − y_i ∑_{j=1}^{N} w_j h_j(x_i)) + ∑_{j=1}^{N} Γ_j |w_j|

Definitions:

Φ: nondecreasing convex surrogate function (e.g., e^x)

Γ_j = λ r_j + β

r_j = R_m(H_{k_j}), the Rademacher complexity of the family containing h_j

The objective is convex by construction.
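A minimal sketch of evaluating this objective, with the exponential surrogate Φ(u) = e^u and the complexity-scaled penalties Γ_j = λ r_j + β; the function and argument names here are illustrative, not the paper's code:

```python
import numpy as np

def adanet_objective(w, H, y, r, lam=0.1, beta=0.01):
    """Evaluate F(w) = (1/m) sum_i Phi(1 - y_i sum_j w_j h_j(x_i))
                       + sum_j Gamma_j |w_j|,
    with Phi(u) = exp(u) and Gamma_j = lam * r_j + beta.

    H: (m, N) matrix with H[i, j] = h_j(x_i)
    y: (m,) labels in {-1, +1}
    r: (N,) complexity estimates r_j, one per unit's family
    """
    margins = y * (H @ w)                      # y_i * f(x_i)
    loss = np.mean(np.exp(1.0 - margins))      # convex surrogate loss
    gamma = lam * np.asarray(r) + beta         # Gamma_j = lam * r_j + beta
    penalty = np.sum(gamma * np.abs(w))        # complexity-weighted l1 term
    return loss + penalty
```

Because Φ is convex and the penalty is a weighted l1 norm, F is convex in w, which is what makes the greedy weight updates on the next slides tractable.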



Page 14

Algorithm
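A rough sketch of the greedy growth loop: at each round a few candidate subnetworks are tried, each is fit with the rest of the model fixed, and the candidate that most decreases the objective F is kept; the loop stops when no candidate improves F. All names below are illustrative placeholders (the candidate generator, trainer, and objective are supplied by the caller), not the paper's implementation.

```python
def adanet_grow(candidates_fn, fit_fn, objective_fn, T=10):
    """Greedy AdaNet-style growth loop (illustrative sketch).

    candidates_fn(model) -> candidate subnetworks to try this round
                            (e.g., one at the current depth, one deeper)
    fit_fn(model, cand)  -> weight for cand that minimizes the objective
                            with the rest of the model held fixed
    objective_fn(model)  -> value of the regularized objective F
    """
    model = []                                  # list of (subnetwork, weight)
    best = objective_fn(model)
    for _ in range(T):
        trials = []
        for cand in candidates_fn(model):
            w = fit_fn(model, cand)
            trials.append((objective_fn(model + [(cand, w)]), cand, w))
        f_new, cand, w = min(trials, key=lambda t: t[0])
        if f_new >= best:                       # no improvement: stop early
            break
        model.append((cand, w))
        best = f_new
    return model
```

With a toy objective (2 − ∑ w)² and a single candidate of weight 1 per round, the loop adds two subnetworks and then stops, since a third would increase F.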



Page 16

Results



Page 18

Variations

AdaNet.R: adds the regularization term R(w, h) = Γ_h ‖w‖_1 to the objective

AdaNet.P: each new subnetwork is connected only to the previously added subnetwork

AdaNet.D: applies dropout to the connections between subnetworks

AdaNet.SD: uses the standard deviation of the final-layer outputs in place of the Rademacher complexity
