
Mini-Course 6: Hyperparameter Optimization – Harmonica

Yang Yuan

Computer Science Department, Cornell University

Last time

- Bayesian Optimization [Snoek et al., 2012, Swersky et al., 2013, Snoek et al., 2014, Gardner et al., 2014, Wang et al., 2013]

- Gradient descent [Maclaurin et al., 2015, Fu et al., 2016, Luketina et al., 2015]

- Random Search [Bergstra and Bengio, 2012, Recht, 2016]

- Multi-armed bandit based algorithms: Hyperband, SuccessiveHalving [Li et al., 2016, Jamieson and Talwalkar, 2016]

- Grid Search

A natural question..

With so many great algorithms for tuning hyperparameters...

Why do we still hire PhD students to do it manually?

Implicit Assumptions

- If f is random noise, no algorithm is better than random search

- Every algorithm needs some assumptions to work
  - BO: f can be approximated by the prior distribution
  - Hyperband/SH: estimates of f get more accurate as we invest more resources
  - GD: the hyperparameter space is smooth, and all the local minima are pretty good

Curse of dimensionality!

- If f is random noise, no algorithm is better than random search

- Every algorithm needs some assumptions to work
  - BO: f can be approximated by the prior distribution
  - Hyperband/SH: estimates of f get more accurate as we invest more resources
  - GD: the hyperparameter space is smooth, and all the local minima are roughly equally good

- None of these work in the general high-dimensional setting

Curse of dimensionality!

- Sample complexity is exponential in the number of variables n
  - Hyperband, SH
  - random search
  - grid search

- Bayesian Optimization is even worse
  - Exponential sample complexity as well
  - The prior distribution may not suit large n

- How to decrease the dimension?
  - Manually select ∼ 10 important variables among all possible variables.
  - Only tune the selected variables.
  - Not purely Auto-ML.

- How can we do better?

Our assumption

- f is a high-dimensional function on Boolean variables
  - Discretize the continuous variables
  - Binarize the categorical variables

- f can be approximated by a small decision tree
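To make the Boolean-variable assumption concrete, here is a minimal sketch (Python; the hyperparameter names, grids, and bit widths are hypothetical choices, not taken from the slides) of how continuous and categorical hyperparameters can be discretized and binarized into ±1 variables.

```python
import numpy as np

def bits_to_index(bits):
    """Map a +/-1 bit vector to an integer (+1 -> 1, -1 -> 0)."""
    return int("".join("1" if b > 0 else "0" for b in bits), 2)

def decode_config(x):
    """Decode x in {-1,+1}^6 into (learning rate, optimizer, batch norm on/off)."""
    lr_grid = np.logspace(-4, -1, 8)                    # continuous range -> 8 values -> 3 bits
    optimizers = ["sgd", "adam", "rmsprop", "adagrad"]  # categorical, 4 options -> 2 bits
    lr = lr_grid[bits_to_index(x[0:3])]
    optimizer = optimizers[bits_to_index(x[3:5])]
    use_batchnorm = x[5] > 0                            # already binary -> 1 bit
    return lr, optimizer, use_batchnorm

# A uniformly random Boolean configuration and its decoded hyperparameters.
x = np.random.default_rng(0).choice([-1, 1], size=6)
print(x, "->", decode_config(x))
```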

Kaggle 101: Survival Rate Prediction For Titanic

- Predict whether the passenger will survive based on the following personal data:
  - Ticket class (Pclass)
  - Sex
  - Age
  - Number of siblings on board
  - Number of parents on board
  - Number of children on board
  - Ticket number
  - Ticket fare
  - Cabin letter (Cabin)
  - Embarked port (Embarked)

- The set of potentially relevant features is much larger:
  - Hometown
  - Occupation
  - Native language
  - Race
  - Can swim or not?
  - · · ·

What is a decision tree?

- This simple decision tree on 4 variables gives you ≈ 75% prediction accuracy on Kaggle
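To see what such a small tree looks like in code, here is a sketch using scikit-learn on synthetic stand-in data (the actual Kaggle data and the tree from the slide are not reproduced here; the survival rule below is made up purely for illustration).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-in for four Titanic features: Pclass, Sex, Age, Fare.
X = np.column_stack([
    rng.integers(1, 4, n),    # Pclass in {1, 2, 3}
    rng.integers(0, 2, n),    # Sex (0 = male, 1 = female)
    rng.uniform(1, 80, n),    # Age
    rng.uniform(5, 300, n),   # Fare
])
# A made-up survival rule, roughly "women and first-class passengers survive more often".
p = 0.1 + 0.5 * X[:, 1] + 0.2 * (X[:, 0] == 1)
y = rng.uniform(size=n) < p

# A depth-2 tree: the kind of small decision tree Harmonica assumes f looks like.
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["Pclass", "Sex", "Age", "Fare"]))
```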

When does a small decision tree approximate f?

- If f "roughly" depends on a few variables
  - We don't need exact prediction, only a good "estimation"
  - Some variables are more important than the others
  - True for many applications

- Counterexample?
  - If f is a parity function: f = x1 ⊕ x2 ⊕ · · · ⊕ xn
  - Cannot get a good estimation from any 4 variables
  - Need an exponentially large decision tree on these variables

How can we learn a small decision tree?

Step 1: Convert the decision tree into a sparse, low-degree polynomial in the Fourier basis (well known)

Step 2: Learn the polynomial

Preliminaries

- f : {−1, 1}^n → [−1, 1]

- D is the uniform distribution on {−1, 1}^n

- Two functions f and g are close if E_{x∼D}[(f(x) − g(x))^2] ≤ ε

- Fourier basis: χ_S(x) = ∏_{i∈S} x_i for S ⊆ [n]
  - There are 2^n of them
  - They form a complete orthonormal basis for Boolean functions
  - Each basis function can be identified with its index set S

- Representation of f in the basis {χ_S}:

  f(x) = Σ_{S⊆[n]} f_S χ_S(x)

- Coefficients: f_S ≜ ⟨f, χ_S⟩ = E_{x∼D}[f(x) χ_S(x)]

Preliminaries

- L1 norm: L1(f) = Σ_S |f_S|

- Sparsity: L0(f) = |{S : f_S ≠ 0}|

- Parseval's identity: E_{x∼D}[f(x)^2] = Σ_S f_S^2

Examples

max2(+1, +1) = +1    max2(−1, +1) = +1
max2(+1, −1) = +1    max2(−1, −1) = −1

max2(x1, x2) = 1/2 + (1/2) x1 + (1/2) x2 − (1/2) x1 x2

- max2 has L1 = 2, L0 = 4.

Similarly,

Maj3(x1, x2, x3) = (1/2) x1 + (1/2) x2 + (1/2) x3 − (1/2) x1 x2 x3

- Maj3 has L1 = 2, L0 = 4.
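These expansions can be checked by brute force. The sketch below (Python, added for illustration) enumerates the cube {−1, 1}^n, computes every coefficient f_S = E[f(x) χ_S(x)], and confirms the L1, L0, and Parseval values quoted above.

```python
import itertools
import numpy as np

def fourier_coefficients(f, n):
    """Brute-force f_S = E_{x uniform in {-1,1}^n}[f(x) * chi_S(x)] for every S."""
    cube = list(itertools.product([-1, 1], repeat=n))
    coeffs = {}
    for r in range(n + 1):
        for S in itertools.combinations(range(n), r):
            coeffs[S] = np.mean([f(x) * np.prod([x[i] for i in S]) for x in cube])
    return coeffs

max2 = lambda x: max(x[0], x[1])
maj3 = lambda x: 1 if sum(x) > 0 else -1

for name, f, n in [("max2", max2, 2), ("Maj3", maj3, 3)]:
    c = fourier_coefficients(f, n)
    nonzero = {S: round(v, 3) for S, v in c.items() if abs(v) > 1e-9}
    parseval = sum(v ** 2 for v in c.values())
    mean_sq = np.mean([f(x) ** 2 for x in itertools.product([-1, 1], repeat=n)])
    print(name, nonzero,
          "L1 =", sum(abs(v) for v in nonzero.values()),
          "L0 =", len(nonzero),
          "Parseval holds:", bool(np.isclose(parseval, mean_sq)))
```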

Examples

f(w1, w2, w3) = 2 w1 + 8 w1 w2 is a 2-sparse, degree-2 polynomial.

- 2-sparse means it has 2 terms

- degree 2 means its terms have degree at most 2

  w1   w2   w3  |   y
   1   -1    1  |  -6
  -1   -1    1  |   6
   1    1   -1  |  10

- However, f is not a linear combination of w1, w2, w3.

Examples

Expand the matrix!

  w1   w2   w3   w1w2   w1w3   w2w3  |   y
   1   -1    1     -1      1     -1  |  -6
  -1   -1    1      1     -1     -1  |   6
   1    1   -1      1     -1     -1  |  10

- The expanded matrix (the original columns plus the degree-2 product columns) is the Fourier basis

- Now f is a linear combination of the basis
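The sketch below (Python with scikit-learn, added for illustration) performs exactly this expansion on randomly drawn ±1 rows and then fits a sparse linear model; with enough rows, the coefficients of w1 and w1w2 come back close to 2 and 8. The sample count and Lasso penalty are arbitrary choices.

```python
import itertools
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

def expand_degree2(W):
    """Append every pairwise product w_i * w_j (i < j) to the +/-1 design matrix W."""
    pairs = [W[:, [i]] * W[:, [j]] for i, j in itertools.combinations(range(W.shape[1]), 2)]
    return np.hstack([W] + pairs)

# Random +/-1 configurations and their labels y = f(w) = 2*w1 + 8*w1*w2.
W = rng.choice([-1, 1], size=(20, 3))
y = 2 * W[:, 0] + 8 * W[:, 0] * W[:, 1]

X = expand_degree2(W)                      # columns: w1, w2, w3, w1w2, w1w3, w2w3
coef = Lasso(alpha=0.1).fit(X, y).coef_
print(np.round(coef, 2))                   # approximately [2, 0, 0, 8, 0, 0]
```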

Convert decision tree into sparse low degree polynomial

Theorem
For any decision tree T with s leaf nodes, there exists a degree-log(s/ε), sparsity-(s^2/ε) function h that 2ε-approximates T.

Convert decision tree into sparse low degree polynomial

Assume decision tree T has s leaf nodes.

Step 1: Truncate T at depth log(s/ε)

- There are 2^{log(s/ε)} = s/ε nodes at this level

- Truncation changes T on at most (ε/s) · s = ε fraction of inputs, by a union bound

- So from now on, assume T has depth at most log(s/ε)

Convert decision tree into sparse low degree polynomial

Step 2: T with s leaves can be represented by f with L1(f) ≤ s and degree log(s/ε)

- A tree with s leaf nodes can be represented as a sum of s "AND" terms (one per root-to-leaf path)

- Every "AND" term has L1 ≤ 1, so L1(f) ≤ s

- Every "AND" term involves at most log(s/ε) variables, so the degree is at most log(s/ε)
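The key fact behind Step 2 is that the indicator of a single root-to-leaf path is a product of factors (1 ± x_i)/2, and its Fourier expansion has L1 norm exactly 1. The short sketch below (Python, for illustration) expands such a path indicator symbolically and checks this.

```python
def path_indicator_coeffs(literals):
    """Fourier coefficients of the AND of literals [(i, b), ...], i.e. prod (1 + b*x_i)/2.
    Returns {frozenset of variable indices: coefficient}."""
    coeffs = {frozenset(): 1.0}
    for i, b in literals:
        expanded = {}
        for S, c in coeffs.items():
            expanded[S] = expanded.get(S, 0.0) + c / 2                   # the "1/2" part
            expanded[S | {i}] = expanded.get(S | {i}, 0.0) + b * c / 2   # the "b*x_i/2" part
        coeffs = expanded
    return coeffs

# The path "x1 = +1 AND x3 = -1 AND x4 = +1" (one depth-3 branch of a tree).
coeffs = path_indicator_coeffs([(1, +1), (3, -1), (4, +1)])
print(coeffs)
print("degree =", max(len(S) for S in coeffs), "L1 =", sum(abs(c) for c in coeffs.values()))
```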

Convert decision tree into sparse low degree polynomial

Step 3: For f with L1(f) ≤ s and degree log(s/ε), there is h with L0(h) ≤ s^2/ε and degree log(s/ε) such that E[(f − h)^2] ≤ ε

- Let h include exactly the terms in Λ ≜ {S : |f_S| ≥ ε / L1(f)}

- h has at most L1(f) / (ε / L1(f)) = L1(f)^2 / ε terms

- By Parseval's identity, the dropped terms contribute at most (sum of squares)

  Σ_{S∉Λ} f_S^2 ≤ max_{S∉Λ} |f_S| · Σ_{S∉Λ} |f_S| ≤ (ε / L1(f)) · L1(f) = ε
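Step 3 is easy to mimic in code: keep only the coefficients above the ε/L1(f) threshold and check that the squared mass that was dropped is at most ε. The sketch below uses a made-up coefficient table purely for illustration.

```python
def truncate(coeffs, eps):
    """Keep coefficients with |f_S| >= eps / L1(f); return (kept, dropped squared mass)."""
    L1 = sum(abs(c) for c in coeffs.values())
    kept = {S: c for S, c in coeffs.items() if abs(c) >= eps / L1}
    dropped_mass = sum(c ** 2 for S, c in coeffs.items() if S not in kept)
    return kept, dropped_mass

# A toy coefficient table: a few large terms plus many tiny ones (values are made up).
coeffs = {("x1",): 0.9, ("x1", "x2"): -0.6, ("x3",): 0.45}
coeffs.update({("noise", i): 0.001 for i in range(200)})

kept, dropped = truncate(coeffs, eps=0.05)
print(len(kept), "terms kept; dropped squared mass =", dropped, "(guaranteed <= eps)")
```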

How do we learn the polynomial?

Theorem
For any decision tree T with s leaf nodes, there exists a degree-log(s/ε), sparsity-(s^2/ε) function h that 2ε-approximates T.

- How do we learn the sparse low degree function h?

- Well studied in Boolean analysis; two classical algorithms:
  - KM algorithm [Kushilevitz and Mansour, 1991]
  - LMN algorithm [Linial et al., 1993]

KM algorithm

- Recursively prune less promising sets of basis functions, and explore the promising sets

- f_α denotes the sum of all Fourier terms whose index set starts with prefix α:
  - f_0 = f_{x2x3} · x2x3 + f_{x2} · x2 + f_{x3} · x3 + f_∅
  - f_1 = f_{x1x2x3} · x1x2x3 + f_{x1x2} · x1x2 + f_{x1x3} · x1x3 + f_{x1} · x1
  - f_01 = f_{x2x3} · x2x3 + f_{x2} · x2
  - f_11 = f_{x1x2x3} · x1x2x3 + f_{x1x2} · x1x2

- All of these functions are well defined, and they also satisfy Parseval's identity
  - E[f_11^2] = f_{x1x2x3}^2 + f_{x1x2}^2

- At the last level, f_α consists of a single coefficient
  - E[f_110^2] = f_{x1x2}^2

KM algorithm

- Set the threshold θ ≜ ε

- Running time per iteration: O(1/ε^6)

- O(n · L1(f) / ε) iterations

- Two problems:
  - Pretty slow: depends on ε^{-7} and on n
  - Sequential algorithm: cannot query in parallel

LMN algorithm

- Take m uniform random samples of f

- For every S with |S| ≤ log(s/ε), estimate f_S from the m samples:

  f_S ≜ ⟨f, χ_S⟩ = E_{x∼D}[f(x) χ_S(x)] ≈ (1/m) Σ_{i=1}^m f(x_i) χ_S(x_i)

- By concentration, the estimates are accurate

- Doing this for all S gives the function

- O((s^2/ε^2) · log n) sample complexity, parallelizable

- Two problems:
  - Does not work well in practice
  - Does not have guarantees in the noisy setting
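Here is a minimal sketch of the LMN estimator (Python/NumPy, added for illustration, not a reference implementation): draw m uniform samples and estimate every coefficient up to a chosen degree by averaging f(x) χ_S(x).

```python
import itertools
import numpy as np

def lmn_estimate(f, n, degree, m, seed=0):
    """Estimate all Fourier coefficients of f up to `degree` from m uniform samples."""
    rng = np.random.default_rng(seed)
    X = rng.choice([-1, 1], size=(m, n))
    y = np.array([f(x) for x in X])
    estimates = {}
    for d in range(degree + 1):
        for S in itertools.combinations(range(n), d):
            chi = X[:, list(S)].prod(axis=1) if S else np.ones(m)
            estimates[S] = float(np.mean(y * chi))
    return estimates

# Example: f(x) = x0*x2 - 0.5*x4 on {-1,1}^6; the two true coefficients should dominate.
f = lambda x: x[0] * x[2] - 0.5 * x[4]
est = lmn_estimate(f, n=6, degree=2, m=2000)
print(sorted(est.items(), key=lambda kv: -abs(kv[1]))[:3])
```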

Our algorithm: Harmonica

LMN algorithm [Linial et al., 1993] requires

- Running time: O(n^{log(s/ε)})

- Sample complexity: O((s^2/ε^2) · log n)

- Not improved for more than two decades!

Harmonica

- Running time: O(n^{log(s/ε)})

- Sample complexity: O((s^2/ε) · log n)
  - A 1/ε improvement

- Works in the noisy setting

- Parallelizable

- First "practical" algorithm under the uniform sampling assumption!
  - Previously criticized as a useless setting

How do we learn the sparse low degree polynomial?

This problem contains a few key words:

- Noisy measurements

- Sparsity recovery

- Sample efficient

Compressed sensing!

What is compressed sensing?

- Query: a measurement matrix A ∈ R^{m×N}, with m << N

- Observe: y = Ax + e
  - A is what we pick to measure
  - e is the noise
  - x is the unknown vector

- In general there are infinitely many solutions, so we cannot find x

- But if x is sparse, it can be recovered with compressed sensing [Donoho, 2006, Candes et al., 2006]

What is compressed sensing?

- Lasso algorithm:

  min_{x*} { λ ‖A x* − y‖_2^2 + ‖x*‖_1 }

- Effect: linear regression with ℓ1 regularization.
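The sketch below (Python with scikit-learn; dimensions, sparsity, noise level, and the penalty are arbitrary illustrative choices) shows the phenomenon formalized by the theorem on the next slide: a 5-sparse vector in R^200 is recovered from 60 noisy ±1 measurements by Lasso.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, m = 200, 60                                    # ambient dimension, number of measurements

x_true = np.zeros(N)                              # a 5-sparse unknown vector
x_true[rng.choice(N, 5, replace=False)] = [3.0, -2.0, 1.5, 4.0, -2.5]

A = rng.choice([-1.0, 1.0], size=(m, N))          # random +/-1 measurement matrix
y = A @ x_true + rng.normal(0, 0.1, m)            # noisy measurements

x_hat = Lasso(alpha=0.2, fit_intercept=False).fit(A, y).coef_
print("true support     :", sorted(np.flatnonzero(x_true)))
print("recovered support:", sorted(np.flatnonzero(np.abs(x_hat) > 0.5)))
print("l2 error         :", round(float(np.linalg.norm(x_hat - x_true)), 3))
```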

A general compressed sensing theorem

- Random orthonormal family
  - ψ_1, · · · , ψ_N are mappings from X = (x_1, · · · , x_d) to R, with

    E_{X∼D}[ψ_i(X) · ψ_j(X)] = 1 if i = j, and 0 otherwise.

- The Fourier basis {χ_S} is a random orthonormal family!

A general compressed sensing theorem

Theorem ([Rauhut, 2010])
Given a measurement matrix A ∈ R^{m×N} whose columns come from a random orthonormal family, and a vector y = Ax + e, where x is s-sparse and e is the error term, Lasso finds x* such that

  ‖x − x*‖_2 ≤ c · ‖e‖_2 / √m

with probability 1 − δ, as long as m ≥ O(s log N). Here c is a constant.

- In other words, if the error term is bounded, x can be recovered

- If we could show that ‖e‖_2 / √m ≤ √ε / c, then ‖x − x*‖_2^2 ≤ ε

- By Parseval's identity, f is recovered with ε error!

Main Theorem

Theorem (Main theorem)
Consider a decision tree T with s leaf nodes and n variables. Under uniform sampling, Lasso learns T in time n^{O(log(s/ε))} and with sample complexity O(s^2 log n / ε), with high probability.

Proof for the main theorem

- Convert T into a degree-log(s/ε), sparsity-s polynomial f in the Fourier basis

- Write T = h + g, where

  g = Σ_{S : |S| > d} f_S χ_S + Σ_{S : |S| ≤ d, |f_S| < O(ε)} f_S χ_S

  and g has small magnitude.

- Assume the samples {(z_1, y_1), · · · , (z_m, y_m)} are picked independently

- Then the g(z_i) are independent as well

Proof for the main theorem

Theorem (Multidimensional Chebyshev inequality)
Let e be an m-dimensional random vector with expected value 0 and covariance matrix V. If V is a positive definite matrix, then for any real number δ > 0:

  Pr(‖e‖_2 > √‖V‖_2 · δ) ≤ m / δ^2

- It suffices to show ‖(g(z_1), · · · , g(z_m))‖_2 / √m ≤ √ε / c

- E[g(z_i)] = 0, since g contains no constant term

- Var[g(z_i)] = ε/2, so √‖V‖_2 ≤ √(ε/2)

- Setting δ = √(2m), we get Pr(‖e‖_2 > √(εm)) ≤ 1/2

Go over the whole proof

- Objective: learn a decision tree T of size s

- T ≈ a degree-log(s/ε) polynomial h with ‖h‖_1 ≤ s

- h ≈ a degree-log(s/ε), (s^2/ε)-sparse polynomial f

- f captures all the important variables!
  - The "top layers" of the decision tree
  - No overfitting!

- Lasso learns f by compressed sensing

Heuristic: iterative selections

- A small decision tree is not accurate enough to give a good result

- We can only identify ∼ 5 important monomials

Solution: Multi-stage Lasso

- First, get ∼ 5 important monomials

- Fix them so as to maximize the sparse linear function

- Rerun Lasso on the remaining variables!

Multi-stage Lasso: how does it work?

Need to stop here. Selecting more monomials won't approximate the function better.

Multi-stage Lasso: how does it work?

Fixing 5 monomials, then sample more configurations and rerun Lasso. Select 5 more monomials.

Multi-stage Lasso: why does it work?

We assume this subtree can be approximated by a sparse function. Different subtrees can be approximated by different functions. Much more expressive than one-stage Lasso!

Our algorithm: Harmonica

Step 1: Query (say) 100 random samples of f

Step 2: Expand the samples to include low degree features

Step 3: Run Lasso; it returns (say) 5 important monomials and the corresponding important variables

Step 4: Update f by fixing these important variables. Go to Step 1.
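Putting Steps 1-4 together, here is a compact sketch of the Harmonica loop (Python with scikit-learn). It illustrates the procedure on this slide rather than reproducing the authors' released code; the stage count, sample budget, polynomial degree, number of selected monomials, and Lasso penalty are all hypothetical choices, and the toy objective is treated as a loss to be minimized.

```python
import itertools
import numpy as np
from sklearn.linear_model import Lasso

def expand(X, degree):
    """Monomial features chi_S(x) = prod_{i in S} x_i for all 1 <= |S| <= degree."""
    subsets = [S for d in range(1, degree + 1)
               for S in itertools.combinations(range(X.shape[1]), d)]
    return np.column_stack([X[:, list(S)].prod(axis=1) for S in subsets]), subsets

def harmonica(f, n, stages=3, samples=100, degree=3, top_k=5, alpha=0.1, seed=0):
    """Repeatedly: sample inside the current subtree, run Lasso on expanded features,
    and fix the variables of the most important monomials."""
    rng = np.random.default_rng(seed)
    fixed = {}                                           # variable index -> fixed value in {-1,+1}
    for _ in range(stages):
        X = rng.choice([-1, 1], size=(samples, n))       # Step 1: random configurations ...
        for i, v in fixed.items():
            X[:, i] = v                                  # ... restricted to the chosen subtree
        y = np.array([f(x) for x in X])
        feats, subsets = expand(X, degree)               # Step 2: low-degree feature expansion
        coef = Lasso(alpha=alpha).fit(feats, y).coef_    # Step 3: sparse recovery
        top = np.argsort(-np.abs(coef))[:top_k]          # ... most important monomials
        new_vars = sorted({i for j in top for i in subsets[j]} - set(fixed))
        # Step 4: fix the new variables to the assignment optimizing the learned polynomial.
        best_vals, best_obj = None, np.inf
        for vals in itertools.product([-1, 1], repeat=len(new_vars)):
            a = {**fixed, **dict(zip(new_vars, vals))}
            obj = sum(coef[j] * np.prod([a[i] for i in subsets[j]]) for j in top)
            if obj < best_obj:
                best_vals, best_obj = vals, obj
        fixed.update(dict(zip(new_vars, best_vals)))
    return fixed

# Toy objective (a "validation loss") with a few truly important Boolean hyperparameters.
loss = lambda x: 2 * x[0] + 3 * x[1] * x[2] - x[5] + 0.1 * np.random.randn()
print(harmonica(loss, n=30))
```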

Harmonica: an example

Assume x ∈ {−1, 1}^100, y ∈ R.

1. Query 100 random samples (x_1, f(x_1)), · · · , (x_100, f(x_100)).

2. Call Lasso on the expanded feature vectors, which returns 5 important variables:
   - x1 = 1, x4 = −1, x3 = 1, x10 = −1, x77 = −1.

3. Update f as f' = f_{(1,4,3,10,77),(1,−1,1,−1,−1)}.
   - For every x, fix its 1st, 4th, 3rd, 10th, and 77th coordinates to (1, −1, 1, −1, −1), then send x to f.

Harmonica: an example

4. Query 100 more random samples (x_101, f'(x_101)), · · · , (x_200, f'(x_200)).

5. Call Lasso on the expanded feature vectors, which returns 6 more important variables:
   - x2 = −1, x57 = 1, x82 = 1, x13 = −1, x67 = 1, x82 = −1.

6. Update f' as f'' = f'_{(2,57,82,13,67,82),(−1,1,1,−1,1,−1)}.

7. Query 100 more random samples (x_201, f''(x_201)), · · · , (x_300, f''(x_300)).

8. · · ·

9. Get f''' and run Hyperband / random search / Spearmint on f'''.

Why does Harmonica work?

- Multi-stage sparse function approximation
  - Very expressive

- Accurate sampling inside subtrees
  - Never wastes samples in less promising subtrees

- Lasso can provably learn a decision tree
  - Identifies the important monomials
  - Compressed sensing techniques

Experimental setting

- CIFAR-10 with a residual network [He et al., 2016]

- 60 different hyperparameters: 39 real, 21 dummy

- 10 machines run in parallel

- Two-stage Lasso, degree 3, for feature selection:
  - Small network: 8 layers, 30 total epochs per trial
  - The small network is fast!

- Base algorithm is Hyperband / random search for fine tuning on the large network:
  - 56 layers, 160 total epochs per trial
  - Features from the small network work well

60 Boolean variables for this task

I Weight initialization

I Optimization Method

I Learning rate

I Learning rate drop

I Momentum

I Residual link weight

I Activation layer position

I Convolution bias

I Activation layer type

I Dropout

I Dropout rate

I Batch norm

I Batch norm tuning

I Resnet shortcut type

I Weight decay

I Batch size

I · · ·
I and 21 dummy variables
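As an illustration of how such choices can be turned into Boolean inputs, here is a toy encoding; the option names, the number of bits per choice, and the sampler are my assumptions, not the exact scheme used in these experiments.

# Toy encoding of hyperparameter choices as +/-1 variables (illustrative
# names and bit assignments only; not the exact encoding from the talk).
import numpy as np

BINARY_CHOICES = [
    "batch_norm_on",        # each yes/no option is one +/-1 variable
    "dropout_on",
    "optimizer_is_sgd",     # e.g. SGD vs. Adam
    "lr_bit_hi",            # two bits give four learning-rate buckets
    "lr_bit_lo",
]

def sample_config(rng, n_dummy=21):
    """Draw one configuration uniformly: real +/-1 variables followed by
    dummy variables that the training script simply ignores."""
    real = rng.choice([-1, 1], size=len(BINARY_CHOICES))
    dummy = rng.choice([-1, 1], size=n_dummy)
    return np.concatenate([real, dummy])

rng = np.random.default_rng(0)
x = sample_config(rng)  # a {-1, +1} vector, ready for the monomial features above

The dummy variables presumably serve as a sanity check: a good selection procedure should never pick them.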

[Figure: comparison of hyperparameter-tuning algorithms on Cifar10. Left panel: final test error (%) for Harmonica 1, Harmonica 2, Harmonica+Random Search, Random Search, Hyperband, and Spearmint, with the best human-tuned rate marked for reference (lower is better). Right panel: total running time in GPU days: Harmonica 1 = 10.1, Harmonica 2 = 3.6, Harmonica+Random Search = 8.3, Random Search = 20.0, Hyperband = 17.3, Spearmint = 8.5 (shorter is better).]

Optimization Time

[Figure: total optimization time in seconds (log scale, roughly 10^-1 to 10^5) versus number of queries (0 to 500), comparing Spearmint at n = 60 with Harmonica at n = 30, 60, 100, and 200.]

Selected features: matches our experience

Stage  Feature name                                        Weight
1-1    Batch norm                                            8.05
1-2    Activation                                            3.47
1-3    Initial learning rate * Initial learning rate         3.12
1-4    Activation * Batch norm                              -2.55
1-5    Initial learning rate                                -2.34
2-1    Optimization method                                  -4.22
2-2    Optimization method * Use momentum                   -3.02
2-3    Resblock first activation                             2.80
2-4    Use momentum                                          2.19
2-5    Resblock 1st activation * Resblock 3rd activation     1.68
3-1    Weight decay parameter                               -0.49
3-2    Weight decay                                         -0.26
3-3    Initial learning rate * Weight decay                  0.23
3-4    Batch norm tuning                                     0.21
3-5    Weight decay * Weight decay parameter                 0.20

Average test error drop

After fixing features in each stage, the average test error drops.

I We are in a better subtree

[Figure: average test error (%) of uniformly sampled configurations at each stage: Uniform Random 60.16, After Stage 1 33.3, After Stage 2 24.33, After Stage 3 21.3.]
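A minimal sketch of what "being in a better subtree" means operationally: pin the variables that appear in the selected monomials to the values minimizing the learned sparse surrogate (found by brute force, since only a handful of variables are involved), and resample only the remaining coordinates. The indices and values below are purely hypothetical.

# Sketch (my own) of sampling inside the restricted subtree after a stage:
# important variables are pinned, the remaining coordinates stay uniform.
import numpy as np

def restricted_sampler(rng, n, fixed):
    """fixed: dict mapping variable index -> +/-1. Returns a length-n
    {-1, +1} vector with the important variables pinned."""
    x = rng.choice([-1, 1], size=n)
    for i, value in fixed.items():
        x[i] = value
    return x

# Hypothetical example: suppose a stage decided "batch norm on" (variable 11)
# and a particular activation choice (variable 4).
rng = np.random.default_rng(0)
x = restricted_sampler(rng, n=60, fixed={11: +1, 4: -1})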

Harmonica: benefits

I Scalable in n

I Fast optimization time (running Lasso)

I Parallelizable

I Feature extraction


Conclusion

I Curse of dimensionality
I Multi-stage Lasso on low-degree monomials.
I Multi-stage sparse function is expressive
I Captures correlations between variables
I Query samples in promising subtrees.
I With lots of important variables fixed, can call some base algorithm for fine-tuning.
I Compressed sensing gives provable guarantee on recovery (see the sketch below).
I The first improvement on sample complexity for decision tree learning in more than two decades.
I The first “practical” decision tree learning algorithm with uniform sampling.
I This is a pretty new area and a pretty important problem.
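For the recovery claim, a standard statement from the compressed sensing literature, in the spirit of [Candes et al., 2006, Donoho, 2006, Rauhut, 2010] (the constants and log factors below are only indicative and are not taken from this talk): if the objective is approximately sparse in the low-degree parity basis,

\[
f(x) \;\approx\; \sum_{|S| \le d} \alpha_S \, \chi_S(x), \qquad \chi_S(x) = \prod_{i \in S} x_i,
\]

with only $s$ of the coefficients $\alpha_S$ significantly nonzero, then

\[
m \;\gtrsim\; C \, s \cdot \mathrm{polylog}(N), \qquad N = \sum_{k \le d} \binom{n}{k},
\]

uniformly random evaluations of $f$ on $\{-1,+1\}^n$ suffice for $\ell_1$-minimization (Lasso / basis pursuit) to recover the $\alpha_S$ with high probability. Since $N \le (n+1)^d$, the number of trials scales with the sparsity $s$ and only polylogarithmically with the number of hyperparameters $n$.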


The Last slide..

Thank you for coming to my mini-course!

References

Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. J. Mach. Learn. Res., 13:281–305.

Candes, E. J., Romberg, J., and Tao, T. (2006). Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theor., 52(2):489–509.

Donoho, D. L. (2006). Compressed sensing. IEEE Trans. Inf. Theor., 52(4):1289–1306.

Fu, J., Luo, H., Feng, J., Low, K. H., and Chua, T. (2016). DrMAD: Distilling reverse-mode automatic differentiation for optimizing hyperparameters of deep neural networks. CoRR, abs/1601.00917.

Gardner, J. R., Kusner, M. J., Xu, Z. E., Weinberger, K. Q., and Cunningham, J. P. (2014). Bayesian optimization with inequality constraints. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pages 937–945.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In CVPR, pages 770–778.

Jamieson, K. G. and Talwalkar, A. (2016). Non-stochastic best arm identification and hyperparameter optimization. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS 2016), pages 240–248.

Kushilevitz, E. and Mansour, Y. (1991). Learning decision trees using the Fourier spectrum. In Proceedings of the Twenty-Third Annual ACM Symposium on Theory of Computing (STOC 1991), pages 455–464.

Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. (2016). Hyperband: A novel bandit-based approach to hyperparameter optimization. ArXiv e-prints.

Linial, N., Mansour, Y., and Nisan, N. (1993). Constant depth circuits, Fourier transform, and learnability. J. ACM, 40(3):607–620.

Luketina, J., Berglund, M., Greff, K., and Raiko, T. (2015). Scalable gradient-based tuning of continuous regularization hyperparameters. CoRR, abs/1511.06727.

Maclaurin, D., Duvenaud, D., and Adams, R. P. (2015). Gradient-based hyperparameter optimization through reversible learning. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pages 2113–2122.

Rauhut, H. (2010). Compressive sensing and structured random matrices. Theoretical Foundations and Numerical Methods for Sparse Recovery, 9:1–92.

Recht, B. (2016). Embracing the random. http://www.argmin.net/2016/06/23/hyperband/.

Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25 (NIPS 2012), pages 2960–2968.

Snoek, J., Swersky, K., Zemel, R. S., and Adams, R. P. (2014). Input warping for Bayesian optimization of non-stationary functions. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pages 1674–1682.

Swersky, K., Snoek, J., and Adams, R. P. (2013). Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems 26 (NIPS 2013), pages 2004–2012.

Wang, Z., Zoghi, M., Hutter, F., Matheson, D., and de Freitas, N. (2013). Bayesian optimization in high dimensions via random embeddings. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI 2013), pages 1778–1784.