
Mini-Course 6: Hyperparameter Optimization – Harmonica

Yang Yuan

Computer Science Department, Cornell University


Last time

- Bayesian Optimization [Snoek et al., 2012, Swersky et al., 2013, Snoek et al., 2014, Gardner et al., 2014, Wang et al., 2013]
- Gradient descent [Maclaurin et al., 2015, Fu et al., 2016, Luketina et al., 2015]
- Random Search [Bergstra and Bengio, 2012, Recht, 2016]
- Multi-armed bandit based algorithms: Hyperband, SuccessiveHalving [Li et al., 2016, Jamieson and Talwalkar, 2016]
- Grid Search

A natural question...

With so many great algorithms for tuning hyperparameters...

Why do we still hire PhD students to do it manually?

Implicit Assumptions

- If f is random noise, no algorithm is better than random search
- Every algorithm needs some assumptions to work
  - BO: f can be approximated by the prior distribution
  - Hyperband/SH: estimates of f get more accurate as we invest more resources
  - GD: the hyperparameter space is smooth, and all the local minima are pretty good

Curse of dimensionality!

- None of these assumptions works in the general high-dimensional setting
- Sample complexity is exponential in the number of variables n:
  - Hyperband, SH
  - random search
  - grid search
- Bayesian Optimization is even worse
  - Exponential sample complexity as well
  - The prior distribution may not be suitable for large n
- How to decrease the dimension?
  - Manually select ∼10 important variables among all possible variables.
  - Only tune the selected variables.
  - Not purely Auto-ML.
- How can we do better?

Our assumption

- f is a high-dimensional function on Boolean variables
  - Discretize the continuous variables
  - Binarize the categorical variables
- f can be approximated by a small decision tree

Kaggle 101: Survival Rate Prediction for Titanic

- Predict whether the passenger will survive based on the following personal data:
  - Ticket class (Pclass)
  - Sex
  - Age
  - Number of siblings on board
  - Number of parents on board
  - Number of children on board
  - Ticket number
  - Ticket fare
  - Cabin letter (Cabin)
  - Embarked port (Embarked)
- There are many more potentially related features:
  - Hometown
  - Occupation
  - Native language
  - Race
  - Can swim or not?
  - ...

What is a decision tree?

- This simple decision tree on 4 variables gives you ≈ 75% prediction rate at Kaggle (a toy sketch of such a tree is given below)
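As a purely illustrative sketch of what such a small tree looks like (my own example with made-up rows, not the actual Kaggle data or the tree from the slide), a depth-limited scikit-learn tree on four features can be printed as follows:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical stand-in for the Titanic data, only to show the shape of a small
# decision tree on 4 variables; the rows and labels below are invented.
# Columns: Pclass, Sex (1 = female), Age, Fare
X = np.array([
    [1, 1, 29, 80], [3, 0, 22, 7], [2, 1, 35, 26], [3, 0, 54, 8],
    [1, 0, 40, 52], [3, 1, 4, 16], [2, 0, 27, 13], [1, 1, 58, 110],
])
y = np.array([1, 0, 1, 0, 0, 1, 0, 1])  # survived or not

# A depth-limited tree: a handful of leaves, each an AND of simple conditions.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["Pclass", "Sex", "Age", "Fare"]))
```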

When does a small decision tree approximate f?

- If f "roughly" depends on a few variables
  - We don't need exact prediction, only a good "estimation"
  - Some variables are more important than the others
  - True for many applications
- Counterexample?
  - If f is a parity function: f = x1 ⊕ x2 ⊕ · · · ⊕ xn
  - Cannot get a good estimation with any 4 variables.
  - Need an exponentially large decision tree on these variables

How can we learn a small decision tree?

Step 1: Convert the decision tree into a sparse, low-degree polynomial in the Fourier basis (well known)
Step 2: Learn the polynomial

Preliminaries

- f : {−1, 1}^n → [−1, 1]
- D is the uniform distribution on {−1, 1}^n
- Two functions are close if E_{x∼D}[(f(x) − g(x))²] ≤ ε
- Fourier basis: χ_S(x) = Π_{i∈S} x_i for S ⊆ [n].
  - 2^n of them
  - a complete orthonormal basis for Boolean functions
  - each basis function can be identified with its set S
- Representation of f under the χ_S(x):

  f(x) = Σ_{S⊆[n]} f_S χ_S(x)

- Coefficient f_S ≜ 〈f, χ_S〉 = E_{x∼D}[f(x) χ_S(x)].

Preliminaries

- L1 norm: L1(f) = Σ_S |f_S|.
- Sparsity: L0(f) = |{S : f_S ≠ 0}|.
- Parseval's identity: E_{x∼D}[f(x)²] = Σ_S f_S².

Examples

max2(+1, +1) = +1    max2(−1, +1) = +1
max2(+1, −1) = +1    max2(−1, −1) = −1

max2(x1, x2) = 1/2 + (1/2)x1 + (1/2)x2 − (1/2)x1x2

- max2 has L1 = 2, L0 = 4 (verified numerically below).

Similarly,

Maj3(x1, x2, x3) = (1/2)x1 + (1/2)x2 + (1/2)x3 − (1/2)x1x2x3

- Maj3 has L1 = 2, L0 = 4.
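These expansions are easy to check numerically; the brute-force sketch below (my own illustration) enumerates {−1, 1}², computes every Fourier coefficient of max2, and verifies the L1, L0 and Parseval claims.

```python
import itertools
import numpy as np

points = list(itertools.product([-1, 1], repeat=2))

def chi(S, x):
    # Fourier basis function chi_S(x) = prod_{i in S} x_i (empty product = 1).
    out = 1.0
    for i in S:
        out *= x[i]
    return out

# f_S = E_x[max2(x) * chi_S(x)] under the uniform distribution on {-1, 1}^2.
coeffs = {S: np.mean([max(*x) * chi(S, x) for x in points])
          for r in range(3) for S in itertools.combinations(range(2), r)}

print(coeffs)                                # {(): 0.5, (0,): 0.5, (1,): 0.5, (0, 1): -0.5}
print(sum(abs(c) for c in coeffs.values()))  # L1 = 2.0
print(sum(c != 0 for c in coeffs.values()))  # L0 = 4
print(sum(c ** 2 for c in coeffs.values()),
      np.mean([max(*x) ** 2 for x in points]))  # Parseval: both equal E[f^2] = 1.0
```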

Examples

f(w1, w2, w3) = 2w1 + 8w1w2 is a 2-sparse, degree-2 polynomial.

- 2-sparse means it has 2 terms
- degree 2 means its terms have degree at most 2.

 w1   w2   w3 |   y
  1   −1    1 |  −6
 −1   −1    1 |   6
  1    1   −1 |  10

- However, f is not a linear combination of w1, w2, w3.

Examples

Expand the matrix!

 w1   w2   w3   w1w2   w1w3   w2w3 |   y
  1   −1    1    −1      1     −1  |  −6
 −1   −1    1     1     −1     −1  |   6
  1    1   −1     1     −1     −1  |  10

- the expanded matrix (the original columns together with the product columns) is the Fourier basis evaluated at the samples
- Now f is a linear combination of the basis (checked numerically below)
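The expansion is mechanical: one column per monomial of degree at most 2. The sketch below (my own illustration of this step, not code from the lecture) builds the expanded matrix for the three rows above and checks that the sparse coefficient vector (2 on w1, 8 on w1w2) reproduces y exactly.

```python
import itertools
import numpy as np

W = np.array([[ 1, -1,  1],
              [-1, -1,  1],
              [ 1,  1, -1]])
y = np.array([-6, 6, 10])

# One column per monomial of degree 1 or 2 over (w1, w2, w3).
subsets = [S for r in (1, 2) for S in itertools.combinations(range(3), r)]
A = np.column_stack([np.prod(W[:, list(S)], axis=1) for S in subsets])
print(subsets)      # [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]
print(A)            # the expanded (Fourier basis) matrix shown on the slide

# f = 2*w1 + 8*w1*w2 is the sparse coefficient vector (2, 0, 0, 8, 0, 0).
print(A @ np.array([2, 0, 0, 8, 0, 0]))   # [-6  6 10], i.e. exactly y
```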

Convert decision tree into sparse low degree polynomial

Theorem
For any decision tree T with s leaf nodes, there exists a degree-log(s/ε), sparsity-(s²/ε) function h that 2ε-approximates T.

Convert decision tree into sparse low degree polynomial

Assume decision tree T has s leaf nodes.

Step 1: Truncate T at depth log(s/ε)
- There are 2^{log(s/ε)} = s/ε nodes at this level
- Each leaf deeper than log(s/ε) is reached with probability at most ε/s, and there are at most s leaves, so the truncated tree differs from T on at most (ε/s) · s = ε fraction of inputs (union bound).
- So below, assume T has depth at most log(s/ε)

Convert decision tree into sparse low degree polynomial

Step 2: T with s leaves can be represented by f with L1(f) ≤ s and degree log(s/ε)
- A tree with s leaf nodes can be represented as a union of s "AND" terms (one per root-to-leaf path).
- Every "AND" term has L1 ≤ 1 (see the worked expansion below), so L1(f) ≤ s.
- Every "AND" term has at most log(s/ε) variables, so the degree is at most log(s/ε)
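To see why each "AND" term has L1 ≤ 1, here is one worked expansion (my own illustration, not from the slides). The indicator of the depth-2 path "x1 = +1 and x2 = −1" is

  1[x1 = +1, x2 = −1] = (1 + x1)/2 · (1 − x2)/2 = (1/4)(1 + x1 − x2 − x1x2),

whose four Fourier coefficients have absolute values summing to 1. In general, a path of depth d expands into 2^d coefficients of magnitude 2^{−d}, so each leaf term contributes L1 ≤ 1, and summing over the s leaves gives L1(f) ≤ s.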

Convert decision tree into sparse low degree polynomial

Step 3: For f with L1(f) ≤ s and degree log(s/ε), there is h with L0(h) ≤ s²/ε and degree log(s/ε) such that

  E[(f − h)²] ≤ ε

- Let h include all terms in Λ ≜ {S : |f_S| ≥ ε/L1(f)}
- h has at most L1(f) / (ε/L1(f)) = L1(f)²/ε terms
- By Parseval's identity, the missing terms contribute at most (sum of squares)

  Σ_{S∉Λ} f_S² ≤ max_{S∉Λ} |f_S| · Σ_{S∉Λ} |f_S| ≤ (ε/L1(f)) · L1(f) = ε

How do we learn the polynomial?

Theorem
For any decision tree T with s leaf nodes, there exists a degree-log(s/ε), sparsity-(s²/ε) function h that 2ε-approximates T.

- How do we learn the sparse low-degree function h?
- Well studied in Boolean analysis; two classical algorithms:
  - KM algorithm [Kushilevitz and Mansour, 1991]
  - LMN algorithm [Linial et al., 1993]

KM algorithm

- Recursively prune the less promising sets of basis functions, and explore the promising sets
- f_α denotes the sum of all Fourier terms whose basis functions start with prefix α (here n = 3; α fixes which of the first |α| variables appear):
  - f_0 = f_{x2x3} · x2x3 + f_{x2} · x2 + f_{x3} · x3 + f_∅
  - f_1 = f_{x1x2x3} · x1x2x3 + f_{x1x2} · x1x2 + f_{x1x3} · x1x3 + f_{x1} · x1
  - f_01 = f_{x2x3} · x2x3 + f_{x2} · x2
  - f_11 = f_{x1x2x3} · x1x2x3 + f_{x1x2} · x1x2
- All of these functions are well defined, and they also satisfy Parseval's identity
  - E[f_11²] = f_{x1x2x3}² + f_{x1x2}² (checked numerically in the sketch below)
- At the last level, f_α is equal to a single coefficient
  - E[f_110²] = f_{x1x2}²
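The quantity KM thresholds on, E[f_α²], is exactly the squared Fourier weight sitting in the bucket α, which is easy to sanity-check by brute force. The sketch below (my own toy illustration on 3 variables, not the lecture's implementation) verifies E[f_11²] = f_{x1x2x3}² + f_{x1x2}² for an arbitrary sparse polynomial.

```python
import itertools
import numpy as np

points = list(itertools.product([-1, 1], repeat=3))

def f(x):
    # Arbitrary toy polynomial on 3 Boolean variables (coefficients made up).
    x1, x2, x3 = x
    return 0.5 * x1 + 0.25 * x1 * x2 + 0.75 * x1 * x2 * x3

def chi(S, x):
    out = 1.0
    for i in S:
        out *= x[i]
    return out

# Exact Fourier coefficients by enumeration: f_S = E_x[f(x) chi_S(x)].
coeffs = {S: np.mean([f(x) * chi(S, x) for x in points])
          for r in range(4) for S in itertools.combinations(range(3), r)}

# Bucket alpha = "11": basis sets containing both x1 and x2 (indices 0 and 1).
bucket = [S for S in coeffs if 0 in S and 1 in S]            # x1x2 and x1x2x3
f_11 = lambda x: sum(coeffs[S] * chi(S, x) for S in bucket)

print(np.mean([f_11(x) ** 2 for x in points]))               # E[f_11^2] = 0.625
print(sum(coeffs[S] ** 2 for S in bucket))                   # same value by Parseval
```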

KM algorithm

- θ ≜ ε
- Running time per iteration: O(1/ε^6)
- O(n · L1(f)/ε) iterations.
- Two problems:
  - Pretty slow: depends on ε^{-7} and on n
  - Sequential algorithm, cannot query in parallel

LMN algorithm

- Take m uniform random samples of f
- For every S with degree ≤ log(s/ε), estimate f_S using the m samples (a toy sketch follows this slide):

  f_S ≜ 〈f, χ_S〉 = E_{x∼D}[f(x) χ_S(x)] ≈ (1/m) Σ_{i=1}^{m} f(x_i) χ_S(x_i)

- By concentration, the estimate is accurate
- Doing this for all S, we obtain the function
- O((s²/ε²) · log n) sample complexity, parallelizable
- Two problems:
  - does not work well in practice
  - does not have guarantees in the noisy setting
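A minimal sketch of the LMN estimator (my own illustration; the sizes and the target function below are made up): draw m uniform samples and estimate every low-degree coefficient by its empirical average.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, degree, m = 10, 2, 2000            # illustrative sizes only

def f(x):
    # Unknown sparse target, observed only through noisy evaluations.
    return 2 * x[0] + 8 * x[0] * x[1] + 0.1 * rng.standard_normal()

X = rng.choice([-1, 1], size=(m, n))
y = np.array([f(x) for x in X])

# LMN: f_S ~ (1/m) * sum_i f(x_i) chi_S(x_i) for every nonempty S with |S| <= degree.
estimates = {}
for r in range(1, degree + 1):
    for S in itertools.combinations(range(n), r):
        estimates[S] = float(np.mean(y * np.prod(X[:, list(S)], axis=1)))

# The two true coefficients stand out; all other estimates concentrate near 0.
print(sorted(estimates.items(), key=lambda kv: -abs(kv[1]))[:3])
```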

Our algorithm: Harmonica

The LMN algorithm [Linial et al., 1993] requires
- Running time: O(n^{log(s/ε)})
- Sample complexity: O((s²/ε²) · log n)
- Not improved for more than two decades!

Harmonica
- Running time: O(n^{log(s/ε)})
- Sample complexity: O((s²/ε) · log n)
  - a 1/ε improvement
- Works in the noisy setting
- Parallelizable
- First "practical" algorithm under the uniform sampling assumption!
  - Previously criticized as a useless setting

How do we learn the sparse low degree polynomial?

This problem contains a few key words:

- Noisy measurements
- Sparsity recovery
- Sample efficient

Compressed sensing!

What is compressed sensing?

- Query: measurement matrix A ∈ R^{m×N}
  - m << N
- Observe: y = Ax + e
  - A is what we pick to measure.
  - e is the noise.
  - x is the unknown vector.
- In general, there are infinitely many solutions, so we can't find x
- If x is sparse, it can be recovered with compressed sensing [Donoho, 2006, Candes et al., 2006]

What is compressed sensing?

- Lasso algorithm:

    min_{x*} { λ‖Ax* − y‖2² + ‖x*‖1 }

- Effect: linear regression with ℓ1 regularization.
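As a concrete illustration of this recovery step, here is a minimal sketch of my own (not from the slides). The dimensions, noise level, regularization weight, and the Gaussian measurement matrix are arbitrary illustrative choices, and scikit-learn's Lasso minimizes an equivalently scaled version of the objective above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
m, N, s = 100, 1000, 5            # m measurements, N unknowns, s-sparse signal

# Unknown s-sparse vector x.
x = np.zeros(N)
support = rng.choice(N, size=s, replace=False)
x[support] = rng.normal(0, 3, size=s)

# Measurement matrix A (Gaussian here, purely for illustration) and noisy observations.
A = rng.standard_normal((m, N))
e = rng.normal(0, 0.1, size=m)
y = A @ x + e

# l1-regularized regression: recover a sparse x* from y = A x + e.
lasso = Lasso(alpha=0.05, fit_intercept=False, max_iter=10000)
lasso.fit(A, y)
x_hat = lasso.coef_

print("true support:     ", sorted(support.tolist()))
print("recovered support:", np.flatnonzero(np.abs(x_hat) > 0.1).tolist())
print("l2 recovery error:", float(np.linalg.norm(x - x_hat)))
```

Even though m ≪ N, the sparsity of x makes the recovery well posed, which is exactly the property Harmonica exploits.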

A general compressed sensing theorem

- Random orthonormal family: ψ1, ..., ψN are mappings from X = (x1, ..., xd) to R with

    E_{X∼D}[ψ_i(X) · ψ_j(X)] = 1 if i = j, and 0 otherwise.

- The Fourier basis {χ_S} is a random orthonormal family!
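As a quick empirical check (my own sketch, assuming the uniform distribution over {−1, 1}^d), the parity functions χ_S(x) = ∏_{i∈S} x_i are orthonormal in expectation: the empirical Gram matrix of the low-degree monomial features is close to the identity.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
d, degree, samples = 8, 2, 200000

# Uniform samples from {-1, 1}^d.
X = rng.choice([-1, 1], size=(samples, d))

# All parities chi_S(x) = prod_{i in S} x_i with |S| <= degree (including S = {}).
subsets = [S for k in range(degree + 1) for S in combinations(range(d), k)]
feats = np.column_stack([np.prod(X[:, list(S)], axis=1) if S else np.ones(samples)
                         for S in subsets])

# Empirical Gram matrix: should be close to the identity for an orthonormal family.
gram = feats.T @ feats / samples
print("max deviation from identity:", float(np.abs(gram - np.eye(len(subsets))).max()))
```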

A general compressed sensing theorem

Theorem ([Rauhut, 2010])
Given a measurement matrix A ∈ R^{m×N} whose columns come from a random orthonormal family, and a vector y = Ax + e, where x is s-sparse and e is the error term, Lasso finds x* such that

    ‖x − x*‖2 ≤ c‖e‖2 / √m

with probability 1 − δ, as long as m ≥ O(s log N). Here c is a constant.

- In other words, if the error term is bounded, x can be recovered.
- If we can show that ‖e‖2 / √m ≤ √ε / c,
- then ‖x − x*‖2² ≤ ε,
- and by Parseval's identity, f is recovered with ε error!

Main Theorem

Theorem (Main theorem)
Consider a decision tree T with s leaf nodes and n variables. Under uniform sampling, Lasso learns T in time n^{O(log(s/ε))} with sample complexity O(s² log n / ε), with high probability.

Proof for the main theorem

- Convert T into a degree-log(s/ε), sparsity-s polynomial f in the Fourier basis.
- Write T = h + g, where (with degree cutoff d = log(s/ε))

    g = Σ_{S : |S| > d} f_S χ_S + Σ_{S : |S| ≤ d, |f_S| < O(ε)} f_S χ_S,

  so g has small value.
- Assume the samples {(z1, y1), ..., (zm, ym)} are picked independently.
- Then the g(z_i) are independent as well.

Proof for the main theorem

Theorem (Multidimensional Chebyshev inequality)
Let e be an m-dimensional random vector with expected value 0 and covariance matrix V. If V is positive definite, then for any real number δ > 0:

    Pr(‖e‖2 > √‖V‖2 · δ) ≤ m / δ².

- It suffices to show that ‖(g(z1), ..., g(zm))‖2 / √m ≤ √ε / c.
- E[g(z_i)] = 0, since g contains no constant term.
- Var[g(z_i)] = ε/2, so √‖V‖2 ≤ √(ε/2).
- Setting δ = √(2m), we get Pr(‖e‖2 > √(εm)) ≤ 1/2.
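A small numerical check of this last step (my own simulation; the values of ε and m and the Gaussian choice for the coordinates are arbitrary, since the bound only uses the mean and variance):

```python
import numpy as np

rng = np.random.default_rng(0)
eps, m, trials = 0.1, 200, 10000

# e has m independent coordinates with mean 0 and variance eps/2.
e = rng.normal(0.0, np.sqrt(eps / 2), size=(trials, m))
norms = np.linalg.norm(e, axis=1)

# Chebyshev with delta = sqrt(2m) says this probability is at most 1/2.
prob = float(np.mean(norms > np.sqrt(eps * m)))
print(f"empirical Pr(||e||_2 > sqrt(eps*m)) = {prob:.4f}  (bound: 0.5)")
```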

Go over the whole proof

- Objective: learn a decision tree T of size s.
- T ≈ a degree-log(s/ε) polynomial h with ‖h‖1 ≤ s.
- h ≈ a degree-log(s/ε), s²/ε-sparse polynomial f.
- f captures all the important variables!
  - the "top layers" of the decision tree
  - no overfitting!
- Lasso learns f by compressed sensing.
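Here is a toy end-to-end sketch of that pipeline (mine, not the authors' code): the target is a small decision tree over a handful of the n Boolean variables, and Lasso over degree-≤2 monomial features recovers the monomials involving exactly those variables from uniform random samples. The tree, dimensions, and regularization weight are illustrative choices.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, degree, m = 30, 2, 300

def tree(x):
    # A small decision-tree-like target: only variables 0, 3, 7 matter.
    if x[0] == 1:
        return 1.0 if x[3] == 1 else -1.0
    return 0.5 if x[7] == 1 else -0.5

# Uniform random samples of the tree (the "uniform sampling" assumption).
X = rng.choice([-1, 1], size=(m, n))
y = np.array([tree(x) for x in X])

# Expand each sample into all monomials (parities) of degree <= 2.
subsets = [S for k in range(1, degree + 1) for S in combinations(range(n), k)]
feats = np.column_stack([np.prod(X[:, list(S)], axis=1) for S in subsets])

lasso = Lasso(alpha=0.02, fit_intercept=True, max_iter=10000).fit(feats, y)

top = np.argsort(-np.abs(lasso.coef_))[:5]
for t in top:
    print(subsets[t], round(float(lasso.coef_[t]), 3))
# The heavy monomials are (3,), (0, 3), (7,), (0, 7): exactly the variables the tree uses.
```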

Heuristic: iterative selections

- A small decision tree is not accurate enough to give good results.
- We can only identify ∼5 important monomials.

Solution: multi-stage Lasso
- First, get ∼5 important monomials.
- Fix them to maximize the sparse linear function.
- Rerun Lasso on the remaining variables!

Multi-stage Lasso: how does it work?

We need to stop here: selecting more monomials won't approximate the function better.

Multi-stage Lasso: how does it work?

Fixing 5 monomials, we then sample more configurations and rerun Lasso, selecting 5 more monomials.

Multi-stage Lasso: why does it work?

We assume this subtree can be approximated by a sparse function. Different subtrees can be approximated by different functions. Much more expressive than one-stage Lasso!

Our algorithm: Harmonica

Step 1 Query (say) 100 random samples of f.
Step 2 Expand the samples to include low-degree features.
Step 3 Run Lasso; return (say) 5 important monomials and the corresponding important variables.
Step 4 Update f by fixing these important variables. Go to Step 1.
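A rough sketch of one such stage in Python (my own illustration, not the authors' implementation): run_trial is a hypothetical stub standing in for "train with configuration x and return the validation error", and the degree, sample count, and number of selected monomials mirror the "say" values above. Since the stub returns an error, the fixing step minimizes the sparse surrogate rather than maximizing it.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_vars, degree, n_samples, n_select = 20, 3, 100, 5

def run_trial(x):
    # Hypothetical stand-in for "train with configuration x, return validation error".
    return 1.0 - 0.3 * x[0] * x[2] - 0.2 * x[5] + 0.1 * rng.normal()

def expand(X, subsets):
    # Step 2: low-degree monomial features chi_S(x) = prod_{i in S} x_i.
    return np.column_stack([np.prod(X[:, list(S)], axis=1) for S in subsets])

# Step 1: query random configurations in {-1, 1}^n_vars.
X = rng.choice([-1, 1], size=(n_samples, n_vars))
y = np.array([run_trial(x) for x in X])

# Step 3: Lasso on the expanded features; keep the heaviest monomials.
subsets = [S for k in range(1, degree + 1) for S in combinations(range(n_vars), k)]
lasso = Lasso(alpha=0.05, fit_intercept=True, max_iter=10000).fit(expand(X, subsets), y)
top = np.argsort(-np.abs(lasso.coef_))[:n_select]
important = sorted({i for t in top for i in subsets[t]})
print("selected monomials:", [subsets[t] for t in top])
print("important variables:", important)

# Step 4: fix the important variables to the assignment that minimizes the selected
# part of the sparse surrogate (brute force over the few important variables).
sel = [(lasso.coef_[t], subsets[t]) for t in top]
best_assign, best_val = None, np.inf
for bits in np.ndindex(*([2] * len(important))):
    assign = dict(zip(important, (2 * np.array(bits) - 1).tolist()))
    val = sum(c * np.prod([assign[i] for i in S]) for c, S in sel)
    if val < best_val:
        best_assign, best_val = assign, val
print("fixed assignment:", best_assign)
# Next stage: define f' by fixing these coordinates, sample again, and rerun Lasso.
```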

Harmonica: an example

Assume x ∈ {−1, 1}^100, y ∈ R.

1. Query 100 random samples (x1, f(x1)), ..., (x100, f(x100)).
2. Call Lasso on the expanded feature vectors, which returns 5 important variables:
   x1 = 1, x4 = −1, x3 = 1, x10 = −1, x77 = −1.
3. Update f as f′ = f_{(1,4,3,10,77),(1,−1,1,−1,−1)}: for every x, fix its 1st, 4th, 3rd, 10th, and 77th coordinates to (1, −1, 1, −1, −1), then send x to f.

Harmonica: an example

4. Query 100 more random samples (x101, f′(x101)), ..., (x200, f′(x200)).
5. Call Lasso on the expanded feature vectors, which returns 6 more important variables:
   x2 = −1, x57 = 1, x82 = 1, x13 = −1, x67 = 1, x82 = −1.
6. Update f′ as f′′ = f′_{(2,57,82,13,67,82),(−1,1,1,−1,1,−1)}.
7. Query 100 more random samples (x201, f′′(x201)), ..., (x300, f′′(x300)).
8. ...
9. Get f′′′ and run Hyperband / random search / Spearmint on f′′′.
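The updates in steps 3 and 6 are just restrictions of f. A minimal helper sketch (mine, with a hypothetical f and 0-indexed coordinates) showing the call pattern:

```python
import numpy as np

def restrict(f, indices, values):
    """Return f' = f_{indices, values}: fix the given coordinates, pass the rest through."""
    def f_restricted(x):
        x = np.array(x, dtype=float).copy()
        x[list(indices)] = values          # overwrite the fixed coordinates
        return f(x)
    return f_restricted

# Hypothetical objective on {-1, 1}^100, used only to show the call pattern.
f = lambda x: float(np.sum(x[:3]))

# Step 3 of the example: fix coordinates 1, 4, 3, 10, 77 (0-indexed: 0, 3, 2, 9, 76).
f_prime = restrict(f, [0, 3, 2, 9, 76], [1, -1, 1, -1, -1])
print(f_prime(np.random.default_rng(0).choice([-1, 1], size=100)))
```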

Why does Harmonica work?

- Multi-stage sparse function approximation
  - very expressive
- Accurate sampling inside subtrees
  - never wastes samples in less promising subtrees
- Lasso can provably learn a decision tree
  - identifies the important monomials
  - via compressed sensing techniques

Experimental setting

- CIFAR-10 with a residual network [He et al., 2016]
- 60 different hyperparameters: 39 real, 21 dummy
- 10 machines run in parallel
- Two-stage Lasso with degree-3 features for feature selection:
  - small network: 8 layers, 30 total epochs per trial
  - the small network is fast!
- Base algorithm is Hyperband/random search for fine-tuning on the large network:
  - 56 layers, 160 total epochs per trial
  - features selected on the small network work well

60 Boolean variables for this task

- Weight initialization
- Optimization method
- Learning rate
- Learning rate drop
- Momentum
- Residual link weight
- Activation layer position
- Convolution bias
- Activation layer type
- Dropout
- Dropout rate
- Batch norm
- Batch norm tuning
- Resnet shortcut type
- Weight decay
- Batch size
- ...
- and 21 dummy variables
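For intuition only, here is one possible way such Boolean variables could decode into actual hyperparameter settings (my own illustrative encoding; the slide does not specify the grids or the exact mapping used in the experiments):

```python
# One possible way to decode a {-1, 1} configuration vector into hyperparameters.
# The specific options and grids below are illustrative, not the ones used in the paper.
def decode(x):
    lr_grid = [0.001, 0.01, 0.1, 1.0]
    return {
        "optimizer":     "sgd" if x[0] == 1 else "adam",
        "use_momentum":  x[1] == 1,
        "batch_norm":    x[2] == 1,
        "dropout":       x[3] == 1,
        # two bits pick one of four learning rates
        "learning_rate": lr_grid[(x[4] == 1) * 2 + (x[5] == 1)],
        "weight_decay":  1e-4 if x[6] == 1 else 0.0,
        # x[7:] would cover the remaining real and dummy variables
    }

print(decode([1, -1, 1, -1, 1, -1, 1] + [-1] * 53))
```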

[Figure: Final test error (%) (left, lower is better) and total running time in GPU days (right, shorter is better) for Harmonica 1, Harmonica 2, Harmonica + Random Search, Random Search, Hyperband, and Spearmint, compared against the best human-tuned error rate. Total running times (GPU days): Harmonica 1: 10.1, Harmonica 2: 3.6, Harmonica + Random Search: 8.3, Random Search: 20.0, Hyperband: 17.3, Spearmint: 8.5.]

Optimization Time

[Figure: Total optimization time in seconds (log scale) versus number of queries, comparing Spearmint (n = 60) with Harmonica for n = 30, 60, 100, and 200.]

Selected features: matches our experience

Stage  Feature Name                                        Weight
1-1    Batch norm                                            8.05
1-2    Activation                                            3.47
1-3    Initial learning rate * Initial learning rate         3.12
1-4    Activation * Batch norm                              -2.55
1-5    Initial learning rate                                -2.34
2-1    Optimization method                                  -4.22
2-2    Optimization method * Use momentum                   -3.02
2-3    Resblock first activation                             2.80
2-4    Use momentum                                          2.19
2-5    Resblock 1st activation * Resblock 3rd activation     1.68
3-1    Weight decay parameter                               -0.49
3-2    Weight decay                                         -0.26
3-3    Initial learning rate * Weight decay                  0.23
3-4    Batch norm tuning                                     0.21
3-5    Weight decay * Weight decay parameter                 0.20

Average test error drop

After fixing the features selected in each stage, the average test error drops.
- We are in a better subtree.

[Figure: Average test error (%) of random configurations: Uniform Random: 60.16, After Stage 1: 33.3, After Stage 2: 24.33, After Stage 3: 21.3.]

Harmonica: benefits

- Scalable in n
- Fast optimization time (running Lasso)
- Parallelizable
- Feature extraction

Conclusion

- Curse of dimensionality
- Multi-stage Lasso on low-degree monomials
  - the multi-stage sparse function is expressive
  - captures correlations between variables
  - queries samples in promising subtrees
- With lots of important variables fixed, we can call some base algorithm for fine-tuning
- Compressed sensing gives a provable guarantee on recovery
  - the first improvement in sample complexity for decision tree learning in more than two decades
  - the first "practical" decision tree learning algorithm under uniform sampling
- This is a pretty new area and a pretty important problem.


The Last slide..

Thank you for coming to my mini-course!

References

Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. J. Mach. Learn. Res., 13:281–305.

Candes, E. J., Romberg, J., and Tao, T. (2006). Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theor., 52(2):489–509.

Donoho, D. L. (2006). Compressed sensing. IEEE Trans. Inf. Theor., 52(4):1289–1306.

Fu, J., Luo, H., Feng, J., Low, K. H., and Chua, T. (2016). DrMAD: Distilling reverse-mode automatic differentiation for optimizing hyperparameters of deep neural networks. CoRR, abs/1601.00917.

Gardner, J. R., Kusner, M. J., Xu, Z. E., Weinberger, K. Q., and Cunningham, J. P. (2014). Bayesian optimization with inequality constraints. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 937–945.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In CVPR, pages 770–778.

Jamieson, K. G. and Talwalkar, A. (2016). Non-stochastic best arm identification and hyperparameter optimization. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016, Cadiz, Spain, May 9-11, 2016, pages 240–248.

Kushilevitz, E. and Mansour, Y. (1991). Learning decision trees using the Fourier spectrum. In Proceedings of the Twenty-third Annual ACM Symposium on Theory of Computing, STOC '91, pages 455–464.

Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. (2016). Hyperband: A novel bandit-based approach to hyperparameter optimization. ArXiv e-prints.

Linial, N., Mansour, Y., and Nisan, N. (1993). Constant depth circuits, Fourier transform, and learnability. J. ACM, 40(3):607–620.

Luketina, J., Berglund, M., Greff, K., and Raiko, T. (2015). Scalable gradient-based tuning of continuous regularization hyperparameters. CoRR, abs/1511.06727.

Maclaurin, D., Duvenaud, D., and Adams, R. P. (2015). Gradient-based hyperparameter optimization through reversible learning. In Proceedings of the 32nd International Conference on Machine Learning, ICML'15, pages 2113–2122. JMLR.org.

Rauhut, H. (2010). Compressive sensing and structured random matrices. Theoretical Foundations and Numerical Methods for Sparse Recovery, 9:1–92.

Recht, B. (2016). Embracing the random. http://www.argmin.net/2016/06/23/hyperband/.

Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, pages 2960–2968.

Snoek, J., Swersky, K., Zemel, R. S., and Adams, R. P. (2014). Input warping for Bayesian optimization of non-stationary functions. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 1674–1682.

Swersky, K., Snoek, J., and Adams, R. P. (2013). Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems 26, pages 2004–2012.

Wang, Z., Zoghi, M., Hutter, F., Matheson, D., and de Freitas, N. (2013). Bayesian optimization in high dimensions via random embeddings. In IJCAI 2013, Proceedings of the 23rd International Joint Conference on Artificial Intelligence, Beijing, China, August 3-9, 2013, pages 1778–1784.