
Mini-Course 6: Hyperparameter Optimization – Harmonica

Yang Yuan

Computer Science Department, Cornell University


Last time

- Bayesian Optimization [Snoek et al., 2012, Swersky et al., 2013, Snoek et al., 2014, Gardner et al., 2014, Wang et al., 2013]
- Gradient descent [Maclaurin et al., 2015, Fu et al., 2016, Luketina et al., 2015]
- Random Search [Bergstra and Bengio, 2012, Recht, 2016]
- Multi-armed bandit based algorithms: Hyperband, SuccessiveHalving [Li et al., 2016, Jamieson and Talwalkar, 2016]
- Grid Search

A natural question...

With so many great algorithms for tuning hyperparameters...

Why do we still hire PhD students to do it manually?

Implicit Assumptions

- If f is random noise, no algorithm is better than random search
- Every algorithm needs some assumptions to work
  - BO: f can be approximated by the prior distribution
  - Hyperband/SH: estimates of f get more accurate as we invest more resources
  - GD: the hyperparameter space is smooth, and all the local minima are pretty good

Curse of dimensionality!

- None of these assumptions works in the general high-dimensional setting
- Sample complexity is exponential in the number of variables n:
  - Hyperband, SH
  - random search
  - grid search
- Bayesian Optimization is even worse
  - Exponential sample complexity as well
  - The prior distribution may not be suitable for large n
- How to decrease the dimension?
  - Manually select ∼10 important variables among all possible variables.
  - Only tune the selected variables.
  - Not purely Auto-ML.
- How can we do better?

Our assumption

- f is a high-dimensional function on Boolean variables
  - Discretize the continuous variables
  - Binarize the categorical variables
- f can be approximated by a small decision tree

Kaggle 101: Survival Rate Prediction for Titanic

- Predict whether the passenger will survive based on the following personal data:
  - Ticket class (Pclass)
  - Sex
  - Age
  - Number of siblings on board
  - Number of parents on board
  - Number of children on board
  - Ticket number
  - Ticket fare
  - Cabin letter (Cabin)
  - Embarked port (Embarked)
- There are many more potentially related features:
  - Hometown
  - Occupation
  - Native language
  - Race
  - Can swim or not?
  - ...

What is a decision tree?

- This simple decision tree on 4 variables gives you ≈ 75% prediction rate at Kaggle (a toy sketch of such a tree is given below)
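As a purely illustrative sketch of what such a small tree looks like (my own example with made-up rows, not the actual Kaggle data or the tree from the slide), a depth-limited scikit-learn tree on four features can be printed as follows:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical stand-in for the Titanic data, only to show the shape of a small
# decision tree on 4 variables; the rows and labels below are invented.
# Columns: Pclass, Sex (1 = female), Age, Fare
X = np.array([
    [1, 1, 29, 80], [3, 0, 22, 7], [2, 1, 35, 26], [3, 0, 54, 8],
    [1, 0, 40, 52], [3, 1, 4, 16], [2, 0, 27, 13], [1, 1, 58, 110],
])
y = np.array([1, 0, 1, 0, 0, 1, 0, 1])  # survived or not

# A depth-limited tree: a handful of leaves, each an AND of simple conditions.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["Pclass", "Sex", "Age", "Fare"]))
```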

When does a small decision tree approximate f?

- If f "roughly" depends on a few variables
  - We don't need exact prediction, only a good "estimation"
  - Some variables are more important than the others
  - True for many applications
- Counterexample?
  - If f is a parity function: f = x1 ⊕ x2 ⊕ · · · ⊕ xn
  - Cannot get a good estimation with any 4 variables.
  - Need an exponentially large decision tree on these variables

How can we learn a small decision tree?

Step 1: Convert the decision tree into a sparse, low-degree polynomial in the Fourier basis (well known)
Step 2: Learn the polynomial

Preliminaries

- f : {−1, 1}^n → [−1, 1]
- D is the uniform distribution on {−1, 1}^n
- Two functions are close if E_{x∼D}[(f(x) − g(x))²] ≤ ε
- Fourier basis: χ_S(x) = Π_{i∈S} x_i for S ⊆ [n].
  - 2^n of them
  - a complete orthonormal basis for Boolean functions
  - each basis function can be identified with its set S
- Representation of f under the χ_S(x):

  f(x) = Σ_{S⊆[n]} f_S χ_S(x)

- Coefficient f_S ≜ 〈f, χ_S〉 = E_{x∼D}[f(x) χ_S(x)].

Preliminaries

- L1 norm: L1(f) = Σ_S |f_S|.
- Sparsity: L0(f) = |{S : f_S ≠ 0}|.
- Parseval's identity: E_{x∼D}[f(x)²] = Σ_S f_S².

Examples

max2(+1, +1) = +1    max2(−1, +1) = +1
max2(+1, −1) = +1    max2(−1, −1) = −1

max2(x1, x2) = 1/2 + (1/2)x1 + (1/2)x2 − (1/2)x1x2

- max2 has L1 = 2, L0 = 4 (verified numerically below).

Similarly,

Maj3(x1, x2, x3) = (1/2)x1 + (1/2)x2 + (1/2)x3 − (1/2)x1x2x3

- Maj3 has L1 = 2, L0 = 4.
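These expansions are easy to check numerically; the brute-force sketch below (my own illustration) enumerates {−1, 1}², computes every Fourier coefficient of max2, and verifies the L1, L0 and Parseval claims.

```python
import itertools
import numpy as np

points = list(itertools.product([-1, 1], repeat=2))

def chi(S, x):
    # Fourier basis function chi_S(x) = prod_{i in S} x_i (empty product = 1).
    out = 1.0
    for i in S:
        out *= x[i]
    return out

# f_S = E_x[max2(x) * chi_S(x)] under the uniform distribution on {-1, 1}^2.
coeffs = {S: np.mean([max(*x) * chi(S, x) for x in points])
          for r in range(3) for S in itertools.combinations(range(2), r)}

print(coeffs)                                # {(): 0.5, (0,): 0.5, (1,): 0.5, (0, 1): -0.5}
print(sum(abs(c) for c in coeffs.values()))  # L1 = 2.0
print(sum(c != 0 for c in coeffs.values()))  # L0 = 4
print(sum(c ** 2 for c in coeffs.values()),
      np.mean([max(*x) ** 2 for x in points]))  # Parseval: both equal E[f^2] = 1.0
```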

Examples

f(w1, w2, w3) = 2w1 + 8w1w2 is a 2-sparse, degree-2 polynomial.

- 2-sparse means it has 2 terms
- degree 2 means its terms have degree at most 2.

 w1   w2   w3 |   y
  1   −1    1 |  −6
 −1   −1    1 |   6
  1    1   −1 |  10

- However, f is not a linear combination of w1, w2, w3.

Examples

Expand the matrix!

 w1   w2   w3   w1w2   w1w3   w2w3 |   y
  1   −1    1    −1      1     −1  |  −6
 −1   −1    1     1     −1     −1  |   6
  1    1   −1     1     −1     −1  |  10

- the expanded matrix (the original columns together with the product columns) is the Fourier basis evaluated at the samples
- Now f is a linear combination of the basis (checked numerically below)
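The expansion is mechanical: one column per monomial of degree at most 2. The sketch below (my own illustration of this step, not code from the lecture) builds the expanded matrix for the three rows above and checks that the sparse coefficient vector (2 on w1, 8 on w1w2) reproduces y exactly.

```python
import itertools
import numpy as np

W = np.array([[ 1, -1,  1],
              [-1, -1,  1],
              [ 1,  1, -1]])
y = np.array([-6, 6, 10])

# One column per monomial of degree 1 or 2 over (w1, w2, w3).
subsets = [S for r in (1, 2) for S in itertools.combinations(range(3), r)]
A = np.column_stack([np.prod(W[:, list(S)], axis=1) for S in subsets])
print(subsets)      # [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]
print(A)            # the expanded (Fourier basis) matrix shown on the slide

# f = 2*w1 + 8*w1*w2 is the sparse coefficient vector (2, 0, 0, 8, 0, 0).
print(A @ np.array([2, 0, 0, 8, 0, 0]))   # [-6  6 10], i.e. exactly y
```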

Convert decision tree into sparse low degree polynomial

Theorem
For any decision tree T with s leaf nodes, there exists a degree-log(s/ε), sparsity-(s²/ε) function h that 2ε-approximates T.

Convert decision tree into sparse low degree polynomial

Assume decision tree T has s leaf nodes.

Step 1: Truncate T at depth log(s/ε)
- There are 2^{log(s/ε)} = s/ε nodes at this level
- Each leaf deeper than log(s/ε) is reached with probability at most ε/s, and there are at most s leaves, so the truncated tree differs from T on at most (ε/s) · s = ε fraction of inputs (union bound).
- So below, assume T has depth at most log(s/ε)

Convert decision tree into sparse low degree polynomial

Step 2: T with s leaves can be represented by f with L1(f) ≤ s and degree log(s/ε)
- A tree with s leaf nodes can be represented as a union of s "AND" terms (one per root-to-leaf path).
- Every "AND" term has L1 ≤ 1 (see the worked expansion below), so L1(f) ≤ s.
- Every "AND" term has at most log(s/ε) variables, so the degree is at most log(s/ε)
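To see why each "AND" term has L1 ≤ 1, here is one worked expansion (my own illustration, not from the slides). The indicator of the depth-2 path "x1 = +1 and x2 = −1" is

  1[x1 = +1, x2 = −1] = (1 + x1)/2 · (1 − x2)/2 = (1/4)(1 + x1 − x2 − x1x2),

whose four Fourier coefficients have absolute values summing to 1. In general, a path of depth d expands into 2^d coefficients of magnitude 2^{−d}, so each leaf term contributes L1 ≤ 1, and summing over the s leaves gives L1(f) ≤ s.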

Convert decision tree into sparse low degree polynomial

Step 3: For f with L1(f) ≤ s and degree log(s/ε), there is h with L0(h) ≤ s²/ε and degree log(s/ε) such that

  E[(f − h)²] ≤ ε

- Let h include all terms in Λ ≜ {S : |f_S| ≥ ε/L1(f)}
- h has at most L1(f) / (ε/L1(f)) = L1(f)²/ε terms
- By Parseval's identity, the missing terms contribute at most (sum of squares)

  Σ_{S∉Λ} f_S² ≤ max_{S∉Λ} |f_S| · Σ_{S∉Λ} |f_S| ≤ (ε/L1(f)) · L1(f) = ε

How do we learn the polynomial?

Theorem
For any decision tree T with s leaf nodes, there exists a degree-log(s/ε), sparsity-(s²/ε) function h that 2ε-approximates T.

- How do we learn the sparse low-degree function h?
- Well studied in Boolean analysis; two classical algorithms:
  - KM algorithm [Kushilevitz and Mansour, 1991]
  - LMN algorithm [Linial et al., 1993]

KM algorithm

- Recursively prune the less promising sets of basis functions, and explore the promising sets
- f_α denotes the sum of all Fourier terms whose basis functions start with prefix α (here n = 3; α fixes which of the first |α| variables appear):
  - f_0 = f_{x2x3} · x2x3 + f_{x2} · x2 + f_{x3} · x3 + f_∅
  - f_1 = f_{x1x2x3} · x1x2x3 + f_{x1x2} · x1x2 + f_{x1x3} · x1x3 + f_{x1} · x1
  - f_01 = f_{x2x3} · x2x3 + f_{x2} · x2
  - f_11 = f_{x1x2x3} · x1x2x3 + f_{x1x2} · x1x2
- All of these functions are well defined, and they also satisfy Parseval's identity
  - E[f_11²] = f_{x1x2x3}² + f_{x1x2}² (checked numerically in the sketch below)
- At the last level, f_α is equal to a single coefficient
  - E[f_110²] = f_{x1x2}²
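The quantity KM thresholds on, E[f_α²], is exactly the squared Fourier weight sitting in the bucket α, which is easy to sanity-check by brute force. The sketch below (my own toy illustration on 3 variables, not the lecture's implementation) verifies E[f_11²] = f_{x1x2x3}² + f_{x1x2}² for an arbitrary sparse polynomial.

```python
import itertools
import numpy as np

points = list(itertools.product([-1, 1], repeat=3))

def f(x):
    # Arbitrary toy polynomial on 3 Boolean variables (coefficients made up).
    x1, x2, x3 = x
    return 0.5 * x1 + 0.25 * x1 * x2 + 0.75 * x1 * x2 * x3

def chi(S, x):
    out = 1.0
    for i in S:
        out *= x[i]
    return out

# Exact Fourier coefficients by enumeration: f_S = E_x[f(x) chi_S(x)].
coeffs = {S: np.mean([f(x) * chi(S, x) for x in points])
          for r in range(4) for S in itertools.combinations(range(3), r)}

# Bucket alpha = "11": basis sets containing both x1 and x2 (indices 0 and 1).
bucket = [S for S in coeffs if 0 in S and 1 in S]            # x1x2 and x1x2x3
f_11 = lambda x: sum(coeffs[S] * chi(S, x) for S in bucket)

print(np.mean([f_11(x) ** 2 for x in points]))               # E[f_11^2] = 0.625
print(sum(coeffs[S] ** 2 for S in bucket))                   # same value by Parseval
```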

KM algorithm

- θ ≜ ε
- Running time per iteration: O(1/ε^6)
- O(n · L1(f)/ε) iterations.
- Two problems:
  - Pretty slow: depends on ε^{-7} and on n
  - Sequential algorithm, cannot query in parallel

LMN algorithm

- Take m uniform random samples of f
- For every S with degree ≤ log(s/ε), estimate f_S using the m samples (a toy sketch follows this slide):

  f_S ≜ 〈f, χ_S〉 = E_{x∼D}[f(x) χ_S(x)] ≈ (1/m) Σ_{i=1}^{m} f(x_i) χ_S(x_i)

- By concentration, the estimate is accurate
- Doing this for all S, we obtain the function
- O((s²/ε²) · log n) sample complexity, parallelizable
- Two problems:
  - does not work well in practice
  - does not have guarantees in the noisy setting
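A minimal sketch of the LMN estimator (my own illustration; the sizes and the target function below are made up): draw m uniform samples and estimate every low-degree coefficient by its empirical average.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, degree, m = 10, 2, 2000            # illustrative sizes only

def f(x):
    # Unknown sparse target, observed only through noisy evaluations.
    return 2 * x[0] + 8 * x[0] * x[1] + 0.1 * rng.standard_normal()

X = rng.choice([-1, 1], size=(m, n))
y = np.array([f(x) for x in X])

# LMN: f_S ~ (1/m) * sum_i f(x_i) chi_S(x_i) for every nonempty S with |S| <= degree.
estimates = {}
for r in range(1, degree + 1):
    for S in itertools.combinations(range(n), r):
        estimates[S] = float(np.mean(y * np.prod(X[:, list(S)], axis=1)))

# The two true coefficients stand out; all other estimates concentrate near 0.
print(sorted(estimates.items(), key=lambda kv: -abs(kv[1]))[:3])
```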

Our algorithm: Harmonica

The LMN algorithm [Linial et al., 1993] requires
- Running time: O(n^{log(s/ε)})
- Sample complexity: O((s²/ε²) · log n)
- Not improved for more than two decades!

Harmonica
- Running time: O(n^{log(s/ε)})
- Sample complexity: O((s²/ε) · log n)
  - a 1/ε improvement
- Works in the noisy setting
- Parallelizable
- First "practical" algorithm under the uniform sampling assumption!
  - Previously criticized as a useless setting

How do we learn the sparse low degree polynomial?

This problem contains a few key words:

- Noisy measurements
- Sparsity recovery
- Sample efficient

Compressed sensing!

What is compressed sensing?

- Query: measurement matrix A ∈ R^{m×N}
  - m << N
- Observe: y = Ax + e
  - A is what we pick to measure.
  - e is the noise.
  - x is the unknown vector.
- In general, there are infinitely many solutions, so we can't find x
- If x is sparse, it can be recovered with compressed sensing [Donoho, 2006, Candes et al., 2006]

What is compressed sensing?

- Lasso algorithm:

    min_{x*} { λ‖Ax* − y‖2² + ‖x*‖1 }

- Effect: linear regression with ℓ1 regularization.
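As a concrete illustration of this recovery step, here is a minimal sketch of my own (not from the slides). The dimensions, noise level, regularization weight, and the Gaussian measurement matrix are arbitrary illustrative choices, and scikit-learn's Lasso minimizes an equivalently scaled version of the objective above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
m, N, s = 100, 1000, 5            # m measurements, N unknowns, s-sparse signal

# Unknown s-sparse vector x.
x = np.zeros(N)
support = rng.choice(N, size=s, replace=False)
x[support] = rng.normal(0, 3, size=s)

# Measurement matrix A (Gaussian here, purely for illustration) and noisy observations.
A = rng.standard_normal((m, N))
e = rng.normal(0, 0.1, size=m)
y = A @ x + e

# l1-regularized regression: recover a sparse x* from y = A x + e.
lasso = Lasso(alpha=0.05, fit_intercept=False, max_iter=10000)
lasso.fit(A, y)
x_hat = lasso.coef_

print("true support:     ", sorted(support.tolist()))
print("recovered support:", np.flatnonzero(np.abs(x_hat) > 0.1).tolist())
print("l2 recovery error:", float(np.linalg.norm(x - x_hat)))
```

Even though m ≪ N, the sparsity of x makes the recovery well posed, which is exactly the property Harmonica exploits.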

A general compressed sensing theorem

- Random orthonormal family: ψ1, ..., ψN are mappings from X = (x1, ..., xd) to R with

    E_{X∼D}[ψ_i(X) · ψ_j(X)] = 1 if i = j, and 0 otherwise.

- The Fourier basis {χ_S} is a random orthonormal family!
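As a quick empirical check (my own sketch, assuming the uniform distribution over {−1, 1}^d), the parity functions χ_S(x) = ∏_{i∈S} x_i are orthonormal in expectation: the empirical Gram matrix of the low-degree monomial features is close to the identity.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
d, degree, samples = 8, 2, 200000

# Uniform samples from {-1, 1}^d.
X = rng.choice([-1, 1], size=(samples, d))

# All parities chi_S(x) = prod_{i in S} x_i with |S| <= degree (including S = {}).
subsets = [S for k in range(degree + 1) for S in combinations(range(d), k)]
feats = np.column_stack([np.prod(X[:, list(S)], axis=1) if S else np.ones(samples)
                         for S in subsets])

# Empirical Gram matrix: should be close to the identity for an orthonormal family.
gram = feats.T @ feats / samples
print("max deviation from identity:", float(np.abs(gram - np.eye(len(subsets))).max()))
```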

A general compressed sensing theorem

Theorem ([Rauhut, 2010])
Given a measurement matrix A ∈ R^{m×N} whose columns come from a random orthonormal family, and a vector y = Ax + e, where x is s-sparse and e is the error term, Lasso finds x* such that

    ‖x − x*‖2 ≤ c‖e‖2 / √m

with probability 1 − δ, as long as m ≥ O(s log N). Here c is a constant.

- In other words, if the error term is bounded, x can be recovered.
- If we can show that ‖e‖2 / √m ≤ √ε / c,
- then ‖x − x*‖2² ≤ ε,
- and by Parseval's identity, f is recovered with ε error!

Main Theorem

Theorem (Main theorem)
Consider a decision tree T with s leaf nodes and n variables. Under uniform sampling, Lasso learns T in time n^{O(log(s/ε))} with sample complexity O(s² log n / ε), with high probability.

Proof for the main theorem

- Convert T into a degree-log(s/ε), sparsity-s polynomial f in the Fourier basis.
- Write T = h + g, where (with degree cutoff d = log(s/ε))

    g = Σ_{S : |S| > d} f_S χ_S + Σ_{S : |S| ≤ d, |f_S| < O(ε)} f_S χ_S,

  so g has small value.
- Assume the samples {(z1, y1), ..., (zm, ym)} are picked independently.
- Then the g(z_i) are independent as well.

Proof for the main theorem

Theorem (Multidimensional Chebyshev inequality)
Let e be an m-dimensional random vector with expected value 0 and covariance matrix V. If V is positive definite, then for any real number δ > 0:

    Pr(‖e‖2 > √‖V‖2 · δ) ≤ m / δ².

- It suffices to show that ‖(g(z1), ..., g(zm))‖2 / √m ≤ √ε / c.
- E[g(z_i)] = 0, since g contains no constant term.
- Var[g(z_i)] = ε/2, so √‖V‖2 ≤ √(ε/2).
- Setting δ = √(2m), we get Pr(‖e‖2 > √(εm)) ≤ 1/2.
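A small numerical check of this last step (my own simulation; the values of ε and m and the Gaussian choice for the coordinates are arbitrary, since the bound only uses the mean and variance):

```python
import numpy as np

rng = np.random.default_rng(0)
eps, m, trials = 0.1, 200, 10000

# e has m independent coordinates with mean 0 and variance eps/2.
e = rng.normal(0.0, np.sqrt(eps / 2), size=(trials, m))
norms = np.linalg.norm(e, axis=1)

# Chebyshev with delta = sqrt(2m) says this probability is at most 1/2.
prob = float(np.mean(norms > np.sqrt(eps * m)))
print(f"empirical Pr(||e||_2 > sqrt(eps*m)) = {prob:.4f}  (bound: 0.5)")
```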

Go over the whole proof

- Objective: learn a decision tree T of size s.
- T ≈ a degree-log(s/ε) polynomial h with ‖h‖1 ≤ s.
- h ≈ a degree-log(s/ε), s²/ε-sparse polynomial f.
- f captures all the important variables!
  - the "top layers" of the decision tree
  - no overfitting!
- Lasso learns f by compressed sensing.
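Here is a toy end-to-end sketch of that pipeline (mine, not the authors' code): the target is a small decision tree over a handful of the n Boolean variables, and Lasso over degree-≤2 monomial features recovers the monomials involving exactly those variables from uniform random samples. The tree, dimensions, and regularization weight are illustrative choices.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, degree, m = 30, 2, 300

def tree(x):
    # A small decision-tree-like target: only variables 0, 3, 7 matter.
    if x[0] == 1:
        return 1.0 if x[3] == 1 else -1.0
    return 0.5 if x[7] == 1 else -0.5

# Uniform random samples of the tree (the "uniform sampling" assumption).
X = rng.choice([-1, 1], size=(m, n))
y = np.array([tree(x) for x in X])

# Expand each sample into all monomials (parities) of degree <= 2.
subsets = [S for k in range(1, degree + 1) for S in combinations(range(n), k)]
feats = np.column_stack([np.prod(X[:, list(S)], axis=1) for S in subsets])

lasso = Lasso(alpha=0.02, fit_intercept=True, max_iter=10000).fit(feats, y)

top = np.argsort(-np.abs(lasso.coef_))[:5]
for t in top:
    print(subsets[t], round(float(lasso.coef_[t]), 3))
# The heavy monomials are (3,), (0, 3), (7,), (0, 7): exactly the variables the tree uses.
```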

Heuristic: iterative selections

- A small decision tree is not accurate enough to give good results.
- We can only identify ∼5 important monomials.

Solution: multi-stage Lasso
- First, get ∼5 important monomials.
- Fix them to maximize the sparse linear function.
- Rerun Lasso on the remaining variables!

Multi-stage Lasso: how does it work?

We need to stop here: selecting more monomials won't approximate the function better.

Multi-stage Lasso: how does it work?

Fixing 5 monomials, we then sample more configurations and rerun Lasso, selecting 5 more monomials.

Multi-stage Lasso: why does it work?

We assume this subtree can be approximated by a sparse function. Different subtrees can be approximated by different functions. Much more expressive than one-stage Lasso!

Our algorithm: Harmonica

Step 1 Query (say) 100 random samples of f.
Step 2 Expand the samples to include low-degree features.
Step 3 Run Lasso; return (say) 5 important monomials and the corresponding important variables.
Step 4 Update f by fixing these important variables. Go to Step 1.
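A rough sketch of one such stage in Python (my own illustration, not the authors' implementation): run_trial is a hypothetical stub standing in for "train with configuration x and return the validation error", and the degree, sample count, and number of selected monomials mirror the "say" values above. Since the stub returns an error, the fixing step minimizes the sparse surrogate rather than maximizing it.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_vars, degree, n_samples, n_select = 20, 3, 100, 5

def run_trial(x):
    # Hypothetical stand-in for "train with configuration x, return validation error".
    return 1.0 - 0.3 * x[0] * x[2] - 0.2 * x[5] + 0.1 * rng.normal()

def expand(X, subsets):
    # Step 2: low-degree monomial features chi_S(x) = prod_{i in S} x_i.
    return np.column_stack([np.prod(X[:, list(S)], axis=1) for S in subsets])

# Step 1: query random configurations in {-1, 1}^n_vars.
X = rng.choice([-1, 1], size=(n_samples, n_vars))
y = np.array([run_trial(x) for x in X])

# Step 3: Lasso on the expanded features; keep the heaviest monomials.
subsets = [S for k in range(1, degree + 1) for S in combinations(range(n_vars), k)]
lasso = Lasso(alpha=0.05, fit_intercept=True, max_iter=10000).fit(expand(X, subsets), y)
top = np.argsort(-np.abs(lasso.coef_))[:n_select]
important = sorted({i for t in top for i in subsets[t]})
print("selected monomials:", [subsets[t] for t in top])
print("important variables:", important)

# Step 4: fix the important variables to the assignment that minimizes the selected
# part of the sparse surrogate (brute force over the few important variables).
sel = [(lasso.coef_[t], subsets[t]) for t in top]
best_assign, best_val = None, np.inf
for bits in np.ndindex(*([2] * len(important))):
    assign = dict(zip(important, (2 * np.array(bits) - 1).tolist()))
    val = sum(c * np.prod([assign[i] for i in S]) for c, S in sel)
    if val < best_val:
        best_assign, best_val = assign, val
print("fixed assignment:", best_assign)
# Next stage: define f' by fixing these coordinates, sample again, and rerun Lasso.
```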

Harmonica: an example

Assume x ∈ {−1, 1}^100, y ∈ R.

1. Query 100 random samples (x1, f(x1)), ..., (x100, f(x100)).
2. Call Lasso on the expanded feature vectors, which returns 5 important variables:
   x1 = 1, x4 = −1, x3 = 1, x10 = −1, x77 = −1.
3. Update f as f′ = f_{(1,4,3,10,77),(1,−1,1,−1,−1)}: for every x, fix its 1st, 4th, 3rd, 10th, and 77th coordinates to (1, −1, 1, −1, −1), then send x to f.

Harmonica: an example

4. Query 100 more random samples (x101, f′(x101)), ..., (x200, f′(x200)).
5. Call Lasso on the expanded feature vectors, which returns 6 more important variables:
   x2 = −1, x57 = 1, x82 = 1, x13 = −1, x67 = 1, x82 = −1.
6. Update f′ as f′′ = f′_{(2,57,82,13,67,82),(−1,1,1,−1,1,−1)}.
7. Query 100 more random samples (x201, f′′(x201)), ..., (x300, f′′(x300)).
8. ...
9. Get f′′′ and run Hyperband / random search / Spearmint on f′′′.
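The updates in steps 3 and 6 are just restrictions of f. A minimal helper sketch (mine, with a hypothetical f and 0-indexed coordinates) showing the call pattern:

```python
import numpy as np

def restrict(f, indices, values):
    """Return f' = f_{indices, values}: fix the given coordinates, pass the rest through."""
    def f_restricted(x):
        x = np.array(x, dtype=float).copy()
        x[list(indices)] = values          # overwrite the fixed coordinates
        return f(x)
    return f_restricted

# Hypothetical objective on {-1, 1}^100, used only to show the call pattern.
f = lambda x: float(np.sum(x[:3]))

# Step 3 of the example: fix coordinates 1, 4, 3, 10, 77 (0-indexed: 0, 3, 2, 9, 76).
f_prime = restrict(f, [0, 3, 2, 9, 76], [1, -1, 1, -1, -1])
print(f_prime(np.random.default_rng(0).choice([-1, 1], size=100)))
```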

Why does Harmonica work?

- Multi-stage sparse function approximation
  - very expressive
- Accurate sampling inside subtrees
  - never wastes samples in less promising subtrees
- Lasso can provably learn a decision tree
  - identifies the important monomials
  - via compressed sensing techniques

Experimental setting

- CIFAR-10 with a residual network [He et al., 2016]
- 60 different hyperparameters: 39 real, 21 dummy
- 10 machines run in parallel
- Two-stage Lasso with degree-3 features for feature selection:
  - small network: 8 layers, 30 total epochs per trial
  - the small network is fast!
- Base algorithm is Hyperband/random search for fine-tuning on the large network:
  - 56 layers, 160 total epochs per trial
  - features selected on the small network work well

60 Boolean variables for this task

- Weight initialization
- Optimization method
- Learning rate
- Learning rate drop
- Momentum
- Residual link weight
- Activation layer position
- Convolution bias
- Activation layer type
- Dropout
- Dropout rate
- Batch norm
- Batch norm tuning
- Resnet shortcut type
- Weight decay
- Batch size
- ...
- and 21 dummy variables
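For intuition only, here is one possible way such Boolean variables could decode into actual hyperparameter settings (my own illustrative encoding; the slide does not specify the grids or the exact mapping used in the experiments):

```python
# One possible way to decode a {-1, 1} configuration vector into hyperparameters.
# The specific options and grids below are illustrative, not the ones used in the paper.
def decode(x):
    lr_grid = [0.001, 0.01, 0.1, 1.0]
    return {
        "optimizer":     "sgd" if x[0] == 1 else "adam",
        "use_momentum":  x[1] == 1,
        "batch_norm":    x[2] == 1,
        "dropout":       x[3] == 1,
        # two bits pick one of four learning rates
        "learning_rate": lr_grid[(x[4] == 1) * 2 + (x[5] == 1)],
        "weight_decay":  1e-4 if x[6] == 1 else 0.0,
        # x[7:] would cover the remaining real and dummy variables
    }

print(decode([1, -1, 1, -1, 1, -1, 1] + [-1] * 53))
```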

[Figure: Final test error (%) (left, lower is better) and total running time in GPU days (right, shorter is better) for Harmonica 1, Harmonica 2, Harmonica + Random Search, Random Search, Hyperband, and Spearmint, compared against the best human-tuned error rate. Total running times (GPU days): Harmonica 1: 10.1, Harmonica 2: 3.6, Harmonica + Random Search: 8.3, Random Search: 20.0, Hyperband: 17.3, Spearmint: 8.5.]

Optimization Time

[Figure: Total optimization time in seconds (log scale) versus number of queries, comparing Spearmint (n = 60) with Harmonica for n = 30, 60, 100, and 200.]

Selected features: matches our experience

Stage  Feature Name                                        Weight
1-1    Batch norm                                            8.05
1-2    Activation                                            3.47
1-3    Initial learning rate * Initial learning rate         3.12
1-4    Activation * Batch norm                              -2.55
1-5    Initial learning rate                                -2.34
2-1    Optimization method                                  -4.22
2-2    Optimization method * Use momentum                   -3.02
2-3    Resblock first activation                             2.80
2-4    Use momentum                                          2.19
2-5    Resblock 1st activation * Resblock 3rd activation     1.68
3-1    Weight decay parameter                               -0.49
3-2    Weight decay                                         -0.26
3-3    Initial learning rate * Weight decay                  0.23
3-4    Batch norm tuning                                     0.21
3-5    Weight decay * Weight decay parameter                 0.20

Average test error drop

After fixing the features selected in each stage, the average test error drops.
- We are in a better subtree.

[Figure: Average test error (%) of random configurations: Uniform Random: 60.16, After Stage 1: 33.3, After Stage 2: 24.33, After Stage 3: 21.3.]

Harmonica: benefits

- Scalable in n
- Fast optimization time (running Lasso)
- Parallelizable
- Feature extraction

Conclusion

- Curse of dimensionality
- Multi-stage Lasso on low-degree monomials
  - the multi-stage sparse function is expressive
  - captures correlations between variables
  - queries samples in promising subtrees
- With lots of important variables fixed, we can call some base algorithm for fine-tuning
- Compressed sensing gives a provable guarantee on recovery
  - the first improvement in sample complexity for decision tree learning in more than two decades
  - the first "practical" decision tree learning algorithm under uniform sampling
- This is a pretty new area and a pretty important problem.


The Last slide..

Thank you for coming to my mini-course!

References

Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. J. Mach. Learn. Res., 13:281–305.

Candes, E. J., Romberg, J., and Tao, T. (2006). Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theor., 52(2):489–509.

Donoho, D. L. (2006). Compressed sensing. IEEE Trans. Inf. Theor., 52(4):1289–1306.

Fu, J., Luo, H., Feng, J., Low, K. H., and Chua, T. (2016). DrMAD: Distilling reverse-mode automatic differentiation for optimizing hyperparameters of deep neural networks. CoRR, abs/1601.00917.

Gardner, J. R., Kusner, M. J., Xu, Z. E., Weinberger, K. Q., and Cunningham, J. P. (2014). Bayesian optimization with inequality constraints. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 937–945.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In CVPR, pages 770–778.

Jamieson, K. G. and Talwalkar, A. (2016). Non-stochastic best arm identification and hyperparameter optimization. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016, Cadiz, Spain, May 9-11, 2016, pages 240–248.

Kushilevitz, E. and Mansour, Y. (1991). Learning decision trees using the Fourier spectrum. In Proceedings of the Twenty-third Annual ACM Symposium on Theory of Computing, STOC '91, pages 455–464.

Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. (2016). Hyperband: A novel bandit-based approach to hyperparameter optimization. ArXiv e-prints.

Linial, N., Mansour, Y., and Nisan, N. (1993). Constant depth circuits, Fourier transform, and learnability. J. ACM, 40(3):607–620.

Luketina, J., Berglund, M., Greff, K., and Raiko, T. (2015). Scalable gradient-based tuning of continuous regularization hyperparameters. CoRR, abs/1511.06727.

Maclaurin, D., Duvenaud, D., and Adams, R. P. (2015). Gradient-based hyperparameter optimization through reversible learning. In Proceedings of the 32nd International Conference on Machine Learning, ICML'15, pages 2113–2122. JMLR.org.

Rauhut, H. (2010). Compressive sensing and structured random matrices. Theoretical Foundations and Numerical Methods for Sparse Recovery, 9:1–92.

Recht, B. (2016). Embracing the random. http://www.argmin.net/2016/06/23/hyperband/.

Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, pages 2960–2968.

Snoek, J., Swersky, K., Zemel, R. S., and Adams, R. P. (2014). Input warping for Bayesian optimization of non-stationary functions. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 1674–1682.

Swersky, K., Snoek, J., and Adams, R. P. (2013). Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems 26, pages 2004–2012.

Wang, Z., Zoghi, M., Hutter, F., Matheson, D., and de Freitas, N. (2013). Bayesian optimization in high dimensions via random embeddings. In IJCAI 2013, Proceedings of the 23rd International Joint Conference on Artificial Intelligence, Beijing, China, August 3-9, 2013, pages 1778–1784.