
Mini-Course 6: Hyperparameter Optimization – Harmonica

Yang Yuan

Computer Science Department, Cornell University

Last time

- Bayesian Optimization [Snoek et al., 2012, Swersky et al., 2013, Snoek et al., 2014, Gardner et al., 2014, Wang et al., 2013]

- Gradient descent [Maclaurin et al., 2015, Fu et al., 2016, Luketina et al., 2015]

- Random Search [Bergstra and Bengio, 2012, Recht, 2016]

- Multi-armed bandit based algorithms: Hyperband, SuccessiveHalving [Li et al., 2016, Jamieson and Talwalkar, 2016]

- Grid Search

A natural question..

With so many great algorithms for tuning hyperparameters...

Why do we still hire PhD students to do it manually?

Implicit Assumptions

- If f is random noise, no algorithm is better than random search

- Every algorithm needs some assumptions to work
  - BO: f can be approximated by the prior distribution
  - Hyperband/SH: estimates of f get more accurate as we invest more resources
  - GD: the hyperparameter space is smooth, and all the local minima are pretty good

Curse of dimensionality!

- If f is random noise, no algorithm is better than random search

- Every algorithm needs some assumptions to work
  - BO: f can be approximated by the prior distribution
  - Hyperband/SH: estimates of f get more accurate as we invest more resources
  - GD: the hyperparameter space is smooth, and all the local minima are roughly equally good

- None of these work in the general high-dimensional setting

Curse of dimensionality!

- Sample complexity is exponential in the number of variables n
  - Hyperband, SH
  - random search
  - grid search

- Bayesian Optimization is even worse
  - Exponential sample complexity as well
  - The prior distribution may not suit large n

- How to decrease the dimension?
  - Manually select ∼ 10 important variables among all possible variables.
  - Only tune the selected variables.
  - Not purely Auto-ML.

- How can we do better?

Our assumption

- f is a high-dimensional function on Boolean variables
  - Discretize the continuous variables
  - Binarize the categorical variables

- f can be approximated by a small decision tree
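To make the Boolean-variable assumption concrete, here is a minimal sketch (Python; the hyperparameter names, grids, and bit widths are hypothetical choices, not taken from the slides) of how continuous and categorical hyperparameters can be discretized and binarized into ±1 variables.

```python
import numpy as np

def bits_to_index(bits):
    """Map a +/-1 bit vector to an integer (+1 -> 1, -1 -> 0)."""
    return int("".join("1" if b > 0 else "0" for b in bits), 2)

def decode_config(x):
    """Decode x in {-1,+1}^6 into (learning rate, optimizer, batch norm on/off)."""
    lr_grid = np.logspace(-4, -1, 8)                    # continuous range -> 8 values -> 3 bits
    optimizers = ["sgd", "adam", "rmsprop", "adagrad"]  # categorical, 4 options -> 2 bits
    lr = lr_grid[bits_to_index(x[0:3])]
    optimizer = optimizers[bits_to_index(x[3:5])]
    use_batchnorm = x[5] > 0                            # already binary -> 1 bit
    return lr, optimizer, use_batchnorm

# A uniformly random Boolean configuration and its decoded hyperparameters.
x = np.random.default_rng(0).choice([-1, 1], size=6)
print(x, "->", decode_config(x))
```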

Kaggle 101: Survival Rate Prediction For Titanic

- Predict whether the passenger will survive based on the following personal data:
  - Ticket class (Pclass)
  - Sex
  - Age
  - Number of siblings on board
  - Number of parents on board
  - Number of children on board
  - Ticket number
  - Ticket fare
  - Cabin letter (Cabin)
  - Embarked port (Embarked)

- The set of potentially relevant features is much larger:
  - Hometown
  - Occupation
  - Native language
  - Race
  - Can swim or not?
  - · · ·

What is a decision tree?

- This simple decision tree on 4 variables gives you ≈ 75% prediction accuracy on Kaggle
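To see what such a small tree looks like in code, here is a sketch using scikit-learn on synthetic stand-in data (the actual Kaggle data and the tree from the slide are not reproduced here; the survival rule below is made up purely for illustration).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-in for four Titanic features: Pclass, Sex, Age, Fare.
X = np.column_stack([
    rng.integers(1, 4, n),    # Pclass in {1, 2, 3}
    rng.integers(0, 2, n),    # Sex (0 = male, 1 = female)
    rng.uniform(1, 80, n),    # Age
    rng.uniform(5, 300, n),   # Fare
])
# A made-up survival rule, roughly "women and first-class passengers survive more often".
p = 0.1 + 0.5 * X[:, 1] + 0.2 * (X[:, 0] == 1)
y = rng.uniform(size=n) < p

# A depth-2 tree: the kind of small decision tree Harmonica assumes f looks like.
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["Pclass", "Sex", "Age", "Fare"]))
```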

When does a small decision tree approximate f?

- If f "roughly" depends on a few variables
  - We don't need exact prediction, only a good "estimation"
  - Some variables are more important than the others
  - True for many applications

- Counterexample?
  - If f is a parity function: f = x1 ⊕ x2 ⊕ · · · ⊕ xn
  - Cannot get a good estimation from any 4 variables
  - Need an exponentially large decision tree on these variables

How can we learn a small decision tree?

Step 1: Convert the decision tree into a sparse, low-degree polynomial in the Fourier basis (well known)

Step 2: Learn the polynomial

Preliminaries

- f : {−1, 1}^n → [−1, 1]

- D is the uniform distribution on {−1, 1}^n

- Two functions f and g are close if E_{x∼D}[(f(x) − g(x))^2] ≤ ε

- Fourier basis: χ_S(x) = ∏_{i∈S} x_i for S ⊆ [n]
  - There are 2^n of them
  - They form a complete orthonormal basis for Boolean functions
  - Each basis function can be identified with its index set S

- Representation of f in the basis {χ_S}:

  f(x) = Σ_{S⊆[n]} f_S χ_S(x)

- Coefficients: f_S ≜ ⟨f, χ_S⟩ = E_{x∼D}[f(x) χ_S(x)]

Preliminaries

- L1 norm: L1(f) = Σ_S |f_S|

- Sparsity: L0(f) = |{S : f_S ≠ 0}|

- Parseval's identity: E_{x∼D}[f(x)^2] = Σ_S f_S^2

Examples

max2(+1, +1) = +1    max2(−1, +1) = +1
max2(+1, −1) = +1    max2(−1, −1) = −1

max2(x1, x2) = 1/2 + (1/2) x1 + (1/2) x2 − (1/2) x1 x2

- max2 has L1 = 2, L0 = 4.

Similarly,

Maj3(x1, x2, x3) = (1/2) x1 + (1/2) x2 + (1/2) x3 − (1/2) x1 x2 x3

- Maj3 has L1 = 2, L0 = 4.
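These expansions can be checked by brute force. The sketch below (Python, added for illustration) enumerates the cube {−1, 1}^n, computes every coefficient f_S = E[f(x) χ_S(x)], and confirms the L1, L0, and Parseval values quoted above.

```python
import itertools
import numpy as np

def fourier_coefficients(f, n):
    """Brute-force f_S = E_{x uniform in {-1,1}^n}[f(x) * chi_S(x)] for every S."""
    cube = list(itertools.product([-1, 1], repeat=n))
    coeffs = {}
    for r in range(n + 1):
        for S in itertools.combinations(range(n), r):
            coeffs[S] = np.mean([f(x) * np.prod([x[i] for i in S]) for x in cube])
    return coeffs

max2 = lambda x: max(x[0], x[1])
maj3 = lambda x: 1 if sum(x) > 0 else -1

for name, f, n in [("max2", max2, 2), ("Maj3", maj3, 3)]:
    c = fourier_coefficients(f, n)
    nonzero = {S: round(v, 3) for S, v in c.items() if abs(v) > 1e-9}
    parseval = sum(v ** 2 for v in c.values())
    mean_sq = np.mean([f(x) ** 2 for x in itertools.product([-1, 1], repeat=n)])
    print(name, nonzero,
          "L1 =", sum(abs(v) for v in nonzero.values()),
          "L0 =", len(nonzero),
          "Parseval holds:", bool(np.isclose(parseval, mean_sq)))
```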

Examples

f(w1, w2, w3) = 2 w1 + 8 w1 w2 is a 2-sparse, degree-2 polynomial.

- 2-sparse means it has 2 terms

- degree 2 means its terms have degree at most 2

  w1   w2   w3  |   y
   1   -1    1  |  -6
  -1   -1    1  |   6
   1    1   -1  |  10

- However, f is not a linear combination of w1, w2, w3.

Examples

Expand the matrix!

  w1   w2   w3   w1w2   w1w3   w2w3  |   y
   1   -1    1     -1      1     -1  |  -6
  -1   -1    1      1     -1     -1  |   6
   1    1   -1      1     -1     -1  |  10

- The expanded matrix (the original columns plus the degree-2 product columns) is the Fourier basis

- Now f is a linear combination of the basis
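The sketch below (Python with scikit-learn, added for illustration) performs exactly this expansion on randomly drawn ±1 rows and then fits a sparse linear model; with enough rows, the coefficients of w1 and w1w2 come back close to 2 and 8. The sample count and Lasso penalty are arbitrary choices.

```python
import itertools
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

def expand_degree2(W):
    """Append every pairwise product w_i * w_j (i < j) to the +/-1 design matrix W."""
    pairs = [W[:, [i]] * W[:, [j]] for i, j in itertools.combinations(range(W.shape[1]), 2)]
    return np.hstack([W] + pairs)

# Random +/-1 configurations and their labels y = f(w) = 2*w1 + 8*w1*w2.
W = rng.choice([-1, 1], size=(20, 3))
y = 2 * W[:, 0] + 8 * W[:, 0] * W[:, 1]

X = expand_degree2(W)                      # columns: w1, w2, w3, w1w2, w1w3, w2w3
coef = Lasso(alpha=0.1).fit(X, y).coef_
print(np.round(coef, 2))                   # approximately [2, 0, 0, 8, 0, 0]
```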

Convert decision tree into sparse low degree polynomial

Theorem
For any decision tree T with s leaf nodes, there exists a degree-log(s/ε), sparsity-(s^2/ε) function h that 2ε-approximates T.

Convert decision tree into sparse low degree polynomial

Assume decision tree T has s leaf nodes.

Step 1: Truncate T at depth log(s/ε)

- There are 2^{log(s/ε)} = s/ε nodes at this level

- Truncation changes T on at most (ε/s) · s = ε fraction of inputs, by a union bound

- So from now on, assume T has depth at most log(s/ε)

Convert decision tree into sparse low degree polynomial

Step 2: T with s leaves can be represented by f with L1(f) ≤ s and degree log(s/ε)

- A tree with s leaf nodes can be represented as a sum of s "AND" terms (one per root-to-leaf path)

- Every "AND" term has L1 ≤ 1, so L1(f) ≤ s

- Every "AND" term involves at most log(s/ε) variables, so the degree is at most log(s/ε)
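The key fact behind Step 2 is that the indicator of a single root-to-leaf path is a product of factors (1 ± x_i)/2, and its Fourier expansion has L1 norm exactly 1. The short sketch below (Python, for illustration) expands such a path indicator symbolically and checks this.

```python
def path_indicator_coeffs(literals):
    """Fourier coefficients of the AND of literals [(i, b), ...], i.e. prod (1 + b*x_i)/2.
    Returns {frozenset of variable indices: coefficient}."""
    coeffs = {frozenset(): 1.0}
    for i, b in literals:
        expanded = {}
        for S, c in coeffs.items():
            expanded[S] = expanded.get(S, 0.0) + c / 2                   # the "1/2" part
            expanded[S | {i}] = expanded.get(S | {i}, 0.0) + b * c / 2   # the "b*x_i/2" part
        coeffs = expanded
    return coeffs

# The path "x1 = +1 AND x3 = -1 AND x4 = +1" (one depth-3 branch of a tree).
coeffs = path_indicator_coeffs([(1, +1), (3, -1), (4, +1)])
print(coeffs)
print("degree =", max(len(S) for S in coeffs), "L1 =", sum(abs(c) for c in coeffs.values()))
```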

Convert decision tree into sparse low degree polynomial

Step 3: For f with L1(f) ≤ s and degree log(s/ε), there is h with L0(h) ≤ s^2/ε and degree log(s/ε) such that E[(f − h)^2] ≤ ε

- Let h include exactly the terms in Λ ≜ {S : |f_S| ≥ ε / L1(f)}

- h has at most L1(f) / (ε / L1(f)) = L1(f)^2 / ε terms

- By Parseval's identity, the dropped terms contribute at most (sum of squares)

  Σ_{S∉Λ} f_S^2 ≤ max_{S∉Λ} |f_S| · Σ_{S∉Λ} |f_S| ≤ (ε / L1(f)) · L1(f) = ε
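Step 3 is easy to mimic in code: keep only the coefficients above the ε/L1(f) threshold and check that the squared mass that was dropped is at most ε. The sketch below uses a made-up coefficient table purely for illustration.

```python
def truncate(coeffs, eps):
    """Keep coefficients with |f_S| >= eps / L1(f); return (kept, dropped squared mass)."""
    L1 = sum(abs(c) for c in coeffs.values())
    kept = {S: c for S, c in coeffs.items() if abs(c) >= eps / L1}
    dropped_mass = sum(c ** 2 for S, c in coeffs.items() if S not in kept)
    return kept, dropped_mass

# A toy coefficient table: a few large terms plus many tiny ones (values are made up).
coeffs = {("x1",): 0.9, ("x1", "x2"): -0.6, ("x3",): 0.45}
coeffs.update({("noise", i): 0.001 for i in range(200)})

kept, dropped = truncate(coeffs, eps=0.05)
print(len(kept), "terms kept; dropped squared mass =", dropped, "(guaranteed <= eps)")
```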

How do we learn the polynomial?

Theorem
For any decision tree T with s leaf nodes, there exists a degree-log(s/ε), sparsity-(s^2/ε) function h that 2ε-approximates T.

- How do we learn the sparse low degree function h?

- Well studied in Boolean analysis; two classical algorithms:
  - KM algorithm [Kushilevitz and Mansour, 1991]
  - LMN algorithm [Linial et al., 1993]

KM algorithm

- Recursively prune less promising sets of basis functions, and explore the promising sets

- f_α denotes the sum of all Fourier terms whose index set starts with prefix α:
  - f_0 = f_{x2x3} · x2x3 + f_{x2} · x2 + f_{x3} · x3 + f_∅
  - f_1 = f_{x1x2x3} · x1x2x3 + f_{x1x2} · x1x2 + f_{x1x3} · x1x3 + f_{x1} · x1
  - f_01 = f_{x2x3} · x2x3 + f_{x2} · x2
  - f_11 = f_{x1x2x3} · x1x2x3 + f_{x1x2} · x1x2

- All of these functions are well defined, and they also satisfy Parseval's identity
  - E[f_11^2] = f_{x1x2x3}^2 + f_{x1x2}^2

- At the last level, f_α consists of a single coefficient
  - E[f_110^2] = f_{x1x2}^2

KM algorithm

- Set the threshold θ ≜ ε

- Running time per iteration: O(1/ε^6)

- O(n · L1(f) / ε) iterations

- Two problems:
  - Pretty slow: depends on ε^{-7} and on n
  - Sequential algorithm: cannot query in parallel

LMN algorithm

- Take m uniform random samples of f

- For every S with |S| ≤ log(s/ε), estimate f_S from the m samples:

  f_S ≜ ⟨f, χ_S⟩ = E_{x∼D}[f(x) χ_S(x)] ≈ (1/m) Σ_{i=1}^m f(x_i) χ_S(x_i)

- By concentration, the estimates are accurate

- Doing this for all S gives the function

- O((s^2/ε^2) · log n) sample complexity, parallelizable

- Two problems:
  - Does not work well in practice
  - Does not have guarantees in the noisy setting
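Here is a minimal sketch of the LMN estimator (Python/NumPy, added for illustration, not a reference implementation): draw m uniform samples and estimate every coefficient up to a chosen degree by averaging f(x) χ_S(x).

```python
import itertools
import numpy as np

def lmn_estimate(f, n, degree, m, seed=0):
    """Estimate all Fourier coefficients of f up to `degree` from m uniform samples."""
    rng = np.random.default_rng(seed)
    X = rng.choice([-1, 1], size=(m, n))
    y = np.array([f(x) for x in X])
    estimates = {}
    for d in range(degree + 1):
        for S in itertools.combinations(range(n), d):
            chi = X[:, list(S)].prod(axis=1) if S else np.ones(m)
            estimates[S] = float(np.mean(y * chi))
    return estimates

# Example: f(x) = x0*x2 - 0.5*x4 on {-1,1}^6; the two true coefficients should dominate.
f = lambda x: x[0] * x[2] - 0.5 * x[4]
est = lmn_estimate(f, n=6, degree=2, m=2000)
print(sorted(est.items(), key=lambda kv: -abs(kv[1]))[:3])
```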

Our algorithm: Harmonica

LMN algorithm [Linial et al., 1993] requires

- Running time: O(n^{log(s/ε)})

- Sample complexity: O((s^2/ε^2) · log n)

- Not improved for more than two decades!

Harmonica

- Running time: O(n^{log(s/ε)})

- Sample complexity: O((s^2/ε) · log n)
  - A 1/ε improvement

- Works in the noisy setting

- Parallelizable

- First "practical" algorithm under the uniform sampling assumption!
  - Previously criticized as a useless setting

How do we learn the sparse low degree polynomial?

This problem contains a few key words:

- Noisy measurements

- Sparsity recovery

- Sample efficient

Compressed sensing!

What is compressed sensing?

- Query: a measurement matrix A ∈ R^{m×N}, with m << N

- Observe: y = Ax + e
  - A is what we pick to measure
  - e is the noise
  - x is the unknown vector

- In general there are infinitely many solutions, so we cannot find x

- But if x is sparse, it can be recovered with compressed sensing [Donoho, 2006, Candes et al., 2006]

What is compressed sensing?

- Lasso algorithm:

  min_{x*} { λ ‖A x* − y‖_2^2 + ‖x*‖_1 }

- Effect: linear regression with ℓ1 regularization.
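The sketch below (Python with scikit-learn; dimensions, sparsity, noise level, and the penalty are arbitrary illustrative choices) shows the phenomenon formalized by the theorem on the next slide: a 5-sparse vector in R^200 is recovered from 60 noisy ±1 measurements by Lasso.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, m = 200, 60                                    # ambient dimension, number of measurements

x_true = np.zeros(N)                              # a 5-sparse unknown vector
x_true[rng.choice(N, 5, replace=False)] = [3.0, -2.0, 1.5, 4.0, -2.5]

A = rng.choice([-1.0, 1.0], size=(m, N))          # random +/-1 measurement matrix
y = A @ x_true + rng.normal(0, 0.1, m)            # noisy measurements

x_hat = Lasso(alpha=0.2, fit_intercept=False).fit(A, y).coef_
print("true support     :", sorted(np.flatnonzero(x_true)))
print("recovered support:", sorted(np.flatnonzero(np.abs(x_hat) > 0.5)))
print("l2 error         :", round(float(np.linalg.norm(x_hat - x_true)), 3))
```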

A general compressed sensing theorem

- Random orthonormal family
  - ψ_1, · · · , ψ_N are mappings from X = (x_1, · · · , x_d) to R, with

    E_{X∼D}[ψ_i(X) · ψ_j(X)] = 1 if i = j, and 0 otherwise.

- The Fourier basis {χ_S} is a random orthonormal family!

A general compressed sensing theorem

Theorem ([Rauhut, 2010])
Given a measurement matrix A ∈ R^{m×N} whose columns come from a random orthonormal family, and a vector y = Ax + e, where x is s-sparse and e is the error term, Lasso finds x* such that

  ‖x − x*‖_2 ≤ c · ‖e‖_2 / √m

with probability 1 − δ, as long as m ≥ O(s log N). Here c is a constant.

- In other words, if the error term is bounded, x can be recovered

- If we could show that ‖e‖_2 / √m ≤ √ε / c, then ‖x − x*‖_2^2 ≤ ε

- By Parseval's identity, f is recovered with ε error!

Main Theorem

Theorem (Main theorem)
Consider a decision tree T with s leaf nodes and n variables. Under uniform sampling, Lasso learns T in time n^{O(log(s/ε))} and with sample complexity O(s^2 log n / ε), with high probability.

Proof for the main theorem

- Convert T into a degree-log(s/ε), sparsity-s polynomial f in the Fourier basis

- Write T = h + g, where

  g = Σ_{S : |S| > d} f_S χ_S + Σ_{S : |S| ≤ d, |f_S| < O(ε)} f_S χ_S

  and g has small magnitude.

- Assume the samples {(z_1, y_1), · · · , (z_m, y_m)} are picked independently

- Then the g(z_i) are independent as well

Proof for the main theorem

Theorem (Multidimensional Chebyshev inequality)
Let e be an m-dimensional random vector with expected value 0 and covariance matrix V. If V is a positive definite matrix, then for any real number δ > 0:

  Pr(‖e‖_2 > √‖V‖_2 · δ) ≤ m / δ^2

- It suffices to show ‖(g(z_1), · · · , g(z_m))‖_2 / √m ≤ √ε / c

- E[g(z_i)] = 0, since g contains no constant term

- Var[g(z_i)] = ε/2, so √‖V‖_2 ≤ √(ε/2)

- Setting δ = √(2m), we get Pr(‖e‖_2 > √(εm)) ≤ 1/2

Go over the whole proof

- Objective: learn a decision tree T of size s

- T ≈ a degree-log(s/ε) polynomial h with ‖h‖_1 ≤ s

- h ≈ a degree-log(s/ε), (s^2/ε)-sparse polynomial f

- f captures all the important variables!
  - The "top layers" of the decision tree
  - No overfitting!

- Lasso learns f by compressed sensing

Heuristic: iterative selections

- A small decision tree is not accurate enough to give a good result

- We can only identify ∼ 5 important monomials

Solution: Multi-stage Lasso

- First, get ∼ 5 important monomials

- Fix them so as to maximize the sparse linear function

- Rerun Lasso on the remaining variables!

Multi-stage Lasso: how does it work?

Need to stop here. Selecting more monomials won't approximate the function better.

Multi-stage Lasso: how does it work?

Fixing 5 monomials, then sample more configurations and rerun Lasso. Select 5 more monomials.

Multi-stage Lasso: why does it work?

We assume this subtree can be approximated by a sparse function. Different subtrees can be approximated by different functions. Much more expressive than one-stage Lasso!

Our algorithm: Harmonica

Step 1: Query (say) 100 random samples of f

Step 2: Expand the samples to include low degree features

Step 3: Run Lasso; it returns (say) 5 important monomials and the corresponding important variables

Step 4: Update f by fixing these important variables. Go to Step 1.
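Putting Steps 1-4 together, here is a compact sketch of the Harmonica loop (Python with scikit-learn). It illustrates the procedure on this slide rather than reproducing the authors' released code; the stage count, sample budget, polynomial degree, number of selected monomials, and Lasso penalty are all hypothetical choices, and the toy objective is treated as a loss to be minimized.

```python
import itertools
import numpy as np
from sklearn.linear_model import Lasso

def expand(X, degree):
    """Monomial features chi_S(x) = prod_{i in S} x_i for all 1 <= |S| <= degree."""
    subsets = [S for d in range(1, degree + 1)
               for S in itertools.combinations(range(X.shape[1]), d)]
    return np.column_stack([X[:, list(S)].prod(axis=1) for S in subsets]), subsets

def harmonica(f, n, stages=3, samples=100, degree=3, top_k=5, alpha=0.1, seed=0):
    """Repeatedly: sample inside the current subtree, run Lasso on expanded features,
    and fix the variables of the most important monomials."""
    rng = np.random.default_rng(seed)
    fixed = {}                                           # variable index -> fixed value in {-1,+1}
    for _ in range(stages):
        X = rng.choice([-1, 1], size=(samples, n))       # Step 1: random configurations ...
        for i, v in fixed.items():
            X[:, i] = v                                  # ... restricted to the chosen subtree
        y = np.array([f(x) for x in X])
        feats, subsets = expand(X, degree)               # Step 2: low-degree feature expansion
        coef = Lasso(alpha=alpha).fit(feats, y).coef_    # Step 3: sparse recovery
        top = np.argsort(-np.abs(coef))[:top_k]          # ... most important monomials
        new_vars = sorted({i for j in top for i in subsets[j]} - set(fixed))
        # Step 4: fix the new variables to the assignment optimizing the learned polynomial.
        best_vals, best_obj = None, np.inf
        for vals in itertools.product([-1, 1], repeat=len(new_vars)):
            a = {**fixed, **dict(zip(new_vars, vals))}
            obj = sum(coef[j] * np.prod([a[i] for i in subsets[j]]) for j in top)
            if obj < best_obj:
                best_vals, best_obj = vals, obj
        fixed.update(dict(zip(new_vars, best_vals)))
    return fixed

# Toy objective (a "validation loss") with a few truly important Boolean hyperparameters.
loss = lambda x: 2 * x[0] + 3 * x[1] * x[2] - x[5] + 0.1 * np.random.randn()
print(harmonica(loss, n=30))
```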

Harmonica: an example

Assume x ∈ {−1, 1}^100, y ∈ R.

1. Query 100 random samples (x_1, f(x_1)), · · · , (x_100, f(x_100)).

2. Call Lasso on the expanded feature vectors, which returns 5 important variables:
   - x1 = 1, x4 = −1, x3 = 1, x10 = −1, x77 = −1.

3. Update f as f' = f_{(1,4,3,10,77),(1,−1,1,−1,−1)}.
   - For every x, fix its 1st, 4th, 3rd, 10th, and 77th coordinates to (1, −1, 1, −1, −1), then send x to f.

Harmonica: an example

4. Query 100 more random samples (x_101, f'(x_101)), · · · , (x_200, f'(x_200)).

5. Call Lasso on the expanded feature vectors, which returns 6 more important variables:
   - x2 = −1, x57 = 1, x82 = 1, x13 = −1, x67 = 1, x82 = −1.

6. Update f' as f'' = f'_{(2,57,82,13,67,82),(−1,1,1,−1,1,−1)}.

7. Query 100 more random samples (x_201, f''(x_201)), · · · , (x_300, f''(x_300)).

8. · · ·

9. Get f''' and run Hyperband / random search / Spearmint on f'''.

Why does Harmonica work?

- Multi-stage sparse function approximation
  - Very expressive

- Accurate sampling inside subtrees
  - Never wastes samples in less promising subtrees

- Lasso can provably learn a decision tree
  - Identifies the important monomials
  - Compressed sensing techniques

Experimental setting

- CIFAR-10 with a residual network [He et al., 2016]

- 60 different hyperparameters: 39 real, 21 dummy

- 10 machines run in parallel

- Two-stage Lasso, degree 3, for feature selection:
  - Small network: 8 layers, 30 total epochs per trial
  - The small network is fast!

- Base algorithm is Hyperband / random search for fine tuning on the large network:
  - 56 layers, 160 total epochs per trial
  - Features from the small network work well

60 Boolean variables for this task

I Weight initialization

I Optimization Method

I Learning rate

I Learning rate drop

I Momentum

I Residual link weight

I Activation layer position

I Convolution bias

I Activation layer type

I Dropout

I Dropout rate

I Batch norm

I Batch norm tuning

I Resnet shortcut type

I Weight decay

I Batch size

I · · ·
I and 21 dummy variables
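As an illustration of how such choices can be turned into Boolean inputs, here is a toy encoding; the option names, the number of bits per choice, and the sampler are my assumptions, not the exact scheme used in these experiments.

# Toy encoding of hyperparameter choices as +/-1 variables (illustrative
# names and bit assignments only; not the exact encoding from the talk).
import numpy as np

BINARY_CHOICES = [
    "batch_norm_on",        # each yes/no option is one +/-1 variable
    "dropout_on",
    "optimizer_is_sgd",     # e.g. SGD vs. Adam
    "lr_bit_hi",            # two bits give four learning-rate buckets
    "lr_bit_lo",
]

def sample_config(rng, n_dummy=21):
    """Draw one configuration uniformly: real +/-1 variables followed by
    dummy variables that the training script simply ignores."""
    real = rng.choice([-1, 1], size=len(BINARY_CHOICES))
    dummy = rng.choice([-1, 1], size=n_dummy)
    return np.concatenate([real, dummy])

rng = np.random.default_rng(0)
x = sample_config(rng)  # a {-1, +1} vector, ready for the monomial features above

The dummy variables presumably serve as a sanity check: a good selection procedure should never pick them.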

[Figure: comparison of hyperparameter-tuning algorithms on Cifar10. Left panel: final test error (%) for Harmonica 1, Harmonica 2, Harmonica+Random Search, Random Search, Hyperband, and Spearmint, with the best human-tuned rate marked for reference (lower is better). Right panel: total running time in GPU days: Harmonica 1 = 10.1, Harmonica 2 = 3.6, Harmonica+Random Search = 8.3, Random Search = 20.0, Hyperband = 17.3, Spearmint = 8.5 (shorter is better).]

Optimization Time

[Figure: total optimization time in seconds (log scale, roughly 10^-1 to 10^5) versus number of queries (0 to 500), comparing Spearmint at n = 60 with Harmonica at n = 30, 60, 100, and 200.]

Selected features: matches our experience

Stage  Feature name                                        Weight
1-1    Batch norm                                            8.05
1-2    Activation                                            3.47
1-3    Initial learning rate * Initial learning rate         3.12
1-4    Activation * Batch norm                              -2.55
1-5    Initial learning rate                                -2.34
2-1    Optimization method                                  -4.22
2-2    Optimization method * Use momentum                   -3.02
2-3    Resblock first activation                             2.80
2-4    Use momentum                                          2.19
2-5    Resblock 1st activation * Resblock 3rd activation     1.68
3-1    Weight decay parameter                               -0.49
3-2    Weight decay                                         -0.26
3-3    Initial learning rate * Weight decay                  0.23
3-4    Batch norm tuning                                     0.21
3-5    Weight decay * Weight decay parameter                 0.20

Average test error drop

After fixing features in each stage, the average test error drops.

I We are in a better subtree

[Figure: average test error (%) of uniformly sampled configurations at each stage: Uniform Random 60.16, After Stage 1 33.3, After Stage 2 24.33, After Stage 3 21.3.]
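A minimal sketch of what "being in a better subtree" means operationally: pin the variables that appear in the selected monomials to the values minimizing the learned sparse surrogate (found by brute force, since only a handful of variables are involved), and resample only the remaining coordinates. The indices and values below are purely hypothetical.

# Sketch (my own) of sampling inside the restricted subtree after a stage:
# important variables are pinned, the remaining coordinates stay uniform.
import numpy as np

def restricted_sampler(rng, n, fixed):
    """fixed: dict mapping variable index -> +/-1. Returns a length-n
    {-1, +1} vector with the important variables pinned."""
    x = rng.choice([-1, 1], size=n)
    for i, value in fixed.items():
        x[i] = value
    return x

# Hypothetical example: suppose a stage decided "batch norm on" (variable 11)
# and a particular activation choice (variable 4).
rng = np.random.default_rng(0)
x = restricted_sampler(rng, n=60, fixed={11: +1, 4: -1})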

Harmonica: benefits

I Scalable in n

I Fast optimization time (running Lasso)

I Parallelizable

I Feature extraction


Conclusion

I Curse of dimensionality
I Multi-stage Lasso on low-degree monomials.
I Multi-stage sparse function is expressive
I Captures correlations between variables
I Query samples in promising subtrees.
I With lots of important variables fixed, can call some base algorithm for fine-tuning.
I Compressed sensing gives provable guarantee on recovery (see the sketch below).
I The first improvement on sample complexity for decision tree learning in more than two decades.
I The first “practical” decision tree learning algorithm with uniform sampling.
I This is a pretty new area and a pretty important problem.
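For the recovery claim, a standard statement from the compressed sensing literature, in the spirit of [Candes et al., 2006, Donoho, 2006, Rauhut, 2010] (the constants and log factors below are only indicative and are not taken from this talk): if the objective is approximately sparse in the low-degree parity basis,

\[
f(x) \;\approx\; \sum_{|S| \le d} \alpha_S \, \chi_S(x), \qquad \chi_S(x) = \prod_{i \in S} x_i,
\]

with only $s$ of the coefficients $\alpha_S$ significantly nonzero, then

\[
m \;\gtrsim\; C \, s \cdot \mathrm{polylog}(N), \qquad N = \sum_{k \le d} \binom{n}{k},
\]

uniformly random evaluations of $f$ on $\{-1,+1\}^n$ suffice for $\ell_1$-minimization (Lasso / basis pursuit) to recover the $\alpha_S$ with high probability. Since $N \le (n+1)^d$, the number of trials scales with the sparsity $s$ and only polylogarithmically with the number of hyperparameters $n$.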


The Last slide..

Thank you for coming to my mini-course!

References

Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. J. Mach. Learn. Res., 13:281–305.

Candes, E. J., Romberg, J., and Tao, T. (2006). Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theor., 52(2):489–509.

Donoho, D. L. (2006). Compressed sensing. IEEE Trans. Inf. Theor., 52(4):1289–1306.

Fu, J., Luo, H., Feng, J., Low, K. H., and Chua, T. (2016). DrMAD: Distilling reverse-mode automatic differentiation for optimizing hyperparameters of deep neural networks. CoRR, abs/1601.00917.

Gardner, J. R., Kusner, M. J., Xu, Z. E., Weinberger, K. Q., and Cunningham, J. P. (2014). Bayesian optimization with inequality constraints. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pages 937–945.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In CVPR, pages 770–778.

Jamieson, K. G. and Talwalkar, A. (2016). Non-stochastic best arm identification and hyperparameter optimization. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS 2016), pages 240–248.

Kushilevitz, E. and Mansour, Y. (1991). Learning decision trees using the Fourier spectrum. In Proceedings of the Twenty-Third Annual ACM Symposium on Theory of Computing (STOC 1991), pages 455–464.

Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. (2016). Hyperband: A novel bandit-based approach to hyperparameter optimization. ArXiv e-prints.

Linial, N., Mansour, Y., and Nisan, N. (1993). Constant depth circuits, Fourier transform, and learnability. J. ACM, 40(3):607–620.

Luketina, J., Berglund, M., Greff, K., and Raiko, T. (2015). Scalable gradient-based tuning of continuous regularization hyperparameters. CoRR, abs/1511.06727.

Maclaurin, D., Duvenaud, D., and Adams, R. P. (2015). Gradient-based hyperparameter optimization through reversible learning. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pages 2113–2122.

Rauhut, H. (2010). Compressive sensing and structured random matrices. Theoretical Foundations and Numerical Methods for Sparse Recovery, 9:1–92.

Recht, B. (2016). Embracing the random. http://www.argmin.net/2016/06/23/hyperband/.

Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25 (NIPS 2012), pages 2960–2968.

Snoek, J., Swersky, K., Zemel, R. S., and Adams, R. P. (2014). Input warping for Bayesian optimization of non-stationary functions. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pages 1674–1682.

Swersky, K., Snoek, J., and Adams, R. P. (2013). Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems 26 (NIPS 2013), pages 2004–2012.

Wang, Z., Zoghi, M., Hutter, F., Matheson, D., and de Freitas, N. (2013). Bayesian optimization in high dimensions via random embeddings. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI 2013), pages 1778–1784.