Recap for Final

Sreyas Mohan

CDS at NYU

8 May 2019


Learning Theory Framework

1. Input space X, output space Y, and action space A
2. Prediction function f : X → A
3. Loss function ℓ : A × Y → ℝ
4. Risk R(f) = E[ℓ(f(x), y)]
5. Bayes prediction function f* = argmin_f R(f)
6. Bayes risk R(f*)
7. Empirical risk R̂_n(f), the sample approximation of R(f): R̂_n(f) = (1/n) ∑_{i=1}^n ℓ(f(x_i), y_i)
8. Empirical risk minimizer f̂ = argmin_f R̂_n(f)
9. Constrained empirical risk minimization: restrict the minimization of R̂_n(f) to f ∈ F (a small code sketch follows below)
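As a concrete illustration of items 7-9, here is a minimal sketch (my own, not from the slides; the toy data, the square loss, and the tiny finite hypothesis class F are all assumptions made for illustration) that computes the empirical risk of a few candidate prediction functions and picks the empirical risk minimizer over that class.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed): x ~ U[-1, 1], y | x ~ N(x, 0.1^2)
n = 200
x = rng.uniform(-1, 1, size=n)
y = x + rng.normal(0, 0.1, size=n)

# Square loss l(a, y) = (a - y)^2
def square_loss(pred, target):
    return (pred - target) ** 2

# A tiny constrained hypothesis class F of three candidate prediction functions
hypotheses = {
    "f(x) = 0":  lambda x: np.zeros_like(x),
    "f(x) = x":  lambda x: x,
    "f(x) = 2x": lambda x: 2 * x,
}

# Empirical risk R_hat_n(f) = (1/n) * sum_i l(f(x_i), y_i)
empirical_risks = {name: square_loss(f(x), y).mean() for name, f in hypotheses.items()}
print(empirical_risks)

# (Constrained) empirical risk minimizer over F
print("ERM over F:", min(empirical_risks, key=empirical_risks.get))
```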


Learning Theory Framework

f* = argmin_f E[ℓ(f(X), Y)]

f_F = argmin_{f ∈ F} E[ℓ(f(X), Y)]

f̂_n = argmin_{f ∈ F} (1/n) ∑_{i=1}^n ℓ(f(x_i), y_i)

Approximation error (of F) = R(f_F) − R(f*)

Estimation error (of f̂_n in F) = R(f̂_n) − R(f_F)
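For reference (my addition, the standard identity rather than something copied from the deck), these two quantities sum to the excess risk of f̂_n:

```latex
% Excess risk decomposition (standard identity; my addition, not verbatim from the slides)
R(\hat{f}_n) - R(f^*)
  = \underbrace{R(\hat{f}_n) - R(f_{\mathcal{F}})}_{\text{estimation error}}
  + \underbrace{R(f_{\mathcal{F}}) - R(f^*)}_{\text{approximation error}}
```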


Approximation Error of Regression Stumps

Suppose x ∼ U[−1, 1] and y | x ∼ N(x, σ²). We are minimizing square loss. Consider a hypothesis class H made of regression stumps. What is the f ∈ H that minimizes the approximation error of H?


Solution

Note that f*(x) = x and R(f*) = σ².
Also, R(f) = E[(f(x) − y)²] = E[(f(x) − E[y|x])²] + E[(y − E[y|x])²].
So we have to minimize E[(f(x) − E[y|x])²] = E[(f(x) − x)²].
The function we are looking for is

    f(x) = −0.5 for −1 ≤ x ≤ 0,  and  f(x) = +0.5 for 0 < x ≤ 1.
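A quick numerical check of this answer (my own sketch, not part of the slides): sweep over stump split points s, use the optimal constant value on each side, and confirm that E[(f(x) − x)²] is minimized by the split at 0 with values ∓0.5.

```python
import numpy as np

# For a stump predicting c1 on [-1, s] and c2 on (s, 1] with x ~ U[-1, 1],
# the best constants are the conditional means of E[y|x] = x on each side.
def stump_excess_risk(s, n_grid=20001):
    x = np.linspace(-1, 1, n_grid)            # fine grid standing in for U[-1, 1]
    left = x <= s
    c1, c2 = x[left].mean(), x[~left].mean()  # optimal constant on each side
    f = np.where(left, c1, c2)
    return ((f - x) ** 2).mean(), (c1, c2)    # approximates E[(f(x) - x)^2]

splits = np.linspace(-0.9, 0.9, 181)
risks = [stump_excess_risk(s)[0] for s in splits]
best = splits[int(np.argmin(risks))]
print("best split ≈", round(best, 3))                  # ≈ 0
print("optimal values ≈", stump_excess_risk(best)[1])  # ≈ (-0.5, +0.5)
```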


Lasso and Ridge Regression

Linear regression closed-form expression: w = (XᵀX)⁻¹Xᵀy

Ridge regression solution: w = (XᵀX + λI)⁻¹Xᵀy

Lasso regression solution methods: coordinate descent.

Behavior with correlated features for lasso and ridge.
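A small sketch of the last two points (mine, not from the slides; the duplicated-feature data is an assumption chosen to make the point): the ridge closed form spreads weight across perfectly correlated copies of a feature, while lasso, fit by coordinate descent in scikit-learn, typically zeroes one of them out.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 100
z = rng.normal(size=n)
X = np.column_stack([z, z])                    # two identical (perfectly correlated) features
y = 3 * z + rng.normal(scale=0.1, size=n)

# Ridge closed form: w = (X^T X + lambda * I)^{-1} X^T y
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
print("ridge:", w_ridge)                       # weight split roughly equally between the copies

# Lasso, fit by coordinate descent in scikit-learn
w_lasso = Lasso(alpha=0.1).fit(X, y).coef_
print("lasso:", w_lasso)                       # typically puts (almost) all weight on one copy
```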


Optimization Methods

(True/False) Gradient descent or SGD is only defined for convex loss functions.

(True/False) The negative subgradient direction is a descent direction.

(True/False) The negative stochastic gradient direction is a descent direction.


Optimization Methods

False

False, but a small enough step in that direction still takes you closer to the minimizer (for a convex function)

False
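To see the second point concretely, here is a small numerical illustration (my own example, not from the slides) with f(w) = |w1| + 3|w2| at w = (1, 0): one valid negative subgradient direction increases f for any positive step, yet a small step along it still moves w closer to the minimizer at the origin.

```python
import numpy as np

f = lambda w: abs(w[0]) + 3 * abs(w[1])

w = np.array([1.0, 0.0])   # current point; the minimizer of f is (0, 0)
g = np.array([1.0, 3.0])   # a valid subgradient of f at w (second coordinate may be anything in [-3, 3])

for t in [0.05, 0.10, 0.15]:
    w_new = w - t * g      # step along the negative subgradient
    print(f"t={t}: f goes {f(w):.3f} -> {f(w_new):.3f}; "
          f"distance to minimizer goes {np.linalg.norm(w):.3f} -> {np.linalg.norm(w_new):.3f}")
```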


SVM, RBF Kernel and Neural Network

Suppose we have fit a kernelized SVM to some training data (x_1, y_1), ..., (x_n, y_n) ∈ ℝ × {−1, 1}, and we end up with a score function of the form

    f(x) = ∑_{i=1}^n α_i k(x, x_i),

where k(x, x′) = ϕ(x − x′) for some function ϕ : ℝ → ℝ. Write f : ℝ → ℝ as a multilayer perceptron by specifying how to set m and a_i, w_i, b_i for i = 1, ..., m, and the activation function σ : ℝ → ℝ, in the following expression:

    f(x) = ∑_{i=1}^m a_i σ(w_i x + b_i)


Solution

Simply take m = n. Then a_i = α_i, w_i = 1, and b_i = −x_i. Finally, take σ = ϕ.
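A quick sanity check of this construction (my own sketch; the training points, the α_i, and the choice ϕ(t) = exp(−t²) are illustrative assumptions): evaluate the kernel score function directly and via the one-hidden-layer network with w_i = 1, b_i = −x_i, σ = ϕ, and confirm they agree.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
x_train = rng.uniform(-2, 2, size=n)   # assumed training inputs
alpha = rng.normal(size=n)             # assumed SVM coefficients

phi = lambda t: np.exp(-t ** 2)        # assumed phi, so k(x, x') = phi(x - x') is RBF-like

def f_kernel(x):
    return sum(alpha[i] * phi(x - x_train[i]) for i in range(n))

# Equivalent one-hidden-layer MLP: m = n, a_i = alpha_i, w_i = 1, b_i = -x_i, sigma = phi
a, w, b = alpha, np.ones(n), -x_train
def f_mlp(x):
    return np.sum(a * phi(w * x + b))

for x in [-1.3, 0.0, 0.7]:
    print(x, f_kernel(x), f_mlp(x))    # the two evaluations agree
```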


Question continued

Suppose you knew which of the data points were support vectors. How would you use this information to reduce the size of the network?


Solution

Remove all the hidden nodes with a_i = α_i = 0, i.e. those coming from non-support vectors.


Maximum Likelihood

Suppose we have samples x_1 ≤ x_2 ≤ · · · ≤ x_n drawn from U[a, b]. Find the maximum likelihood estimates of a and b.


ML Solution

The likelihood is

    L(a, b) = ∏_{i=1}^n (1/(b − a)) 1_{[a,b]}(x_i).

Let x_(1), ..., x_(n) be the order statistics.

The likelihood is greater than zero if and only if a ≤ x_(1) and b ≥ x_(n).

On that region, the likelihood (b − a)^(−n) is a monotonically decreasing function of (b − a).

The smallest feasible value of (b − a) is attained when a = x_(1) and b = x_(n).

Therefore, â = x_(1) and b̂ = x_(n) give us the MLE.
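A tiny numerical check (mine, not from the slides; the true a = 2, b = 5 are assumptions): the likelihood-maximizing estimates are just the sample minimum and maximum.

```python
import numpy as np

rng = np.random.default_rng(0)
a_true, b_true = 2.0, 5.0                      # assumed true parameters
x = rng.uniform(a_true, b_true, size=50)

# MLE for U[a, b]: the sample extremes
a_hat, b_hat = x.min(), x.max()
print("a_hat =", a_hat, " b_hat =", b_hat)

# Likelihood L(a, b) = (b - a)^(-n) on the feasible set {a <= min(x), b >= max(x)}, else 0
def likelihood(a, b):
    if a > x.min() or b < x.max():
        return 0.0
    return (b - a) ** (-len(x))

# Widening the interval beyond [a_hat, b_hat] only decreases the likelihood
print(likelihood(a_hat, b_hat) > likelihood(a_hat - 0.1, b_hat + 0.1))   # True
```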


Bayesian

Bayesian Bernoulli model. Suppose we have a coin with unknown probability of heads θ ∈ (0, 1). We flip the coin n times and get a sequence of coin flips with n_h heads and n_t tails.

Recall the following: a Beta(α, β) distribution, for shape parameters α, β > 0, is a distribution supported on the interval (0, 1) with PDF given by

    f(x; α, β) ∝ x^(α−1) (1 − x)^(β−1).

The mean of a Beta(α, β) distribution is α/(α + β). The mode is (α − 1)/(α + β − 2), assuming α, β ≥ 1 and α + β > 2. If α = β = 1, then every value in (0, 1) is a mode.


Question Continued

Which ONE of the following prior distributions on θ corresponds to a strong belief that the coin is approximately fair (i.e. has an equal probability of heads and tails)?

1 Beta(50, 50)

2 Beta(0.1, 0.1)

3 Beta(1, 100)


Question Continued

1. Give an expression for the likelihood function L_D(θ) for this sequence of flips.

2. Suppose we have a Beta(α, β) prior on θ, for some α, β > 0. Derive the posterior distribution on θ and, if it is a Beta distribution, give its parameters.

3. If your posterior distribution on θ is Beta(3, 6), what is your MAP estimate of θ?


L_D(θ) = θ^(n_h) (1 − θ)^(n_t)

p(θ | D) ∝ p(θ) L_D(θ)
         ∝ θ^(α−1) (1 − θ)^(β−1) · θ^(n_h) (1 − θ)^(n_t)
         ∝ θ^(n_h + α − 1) (1 − θ)^(n_t + β − 1)

Thus the posterior distribution is Beta(α + n_h, β + n_t).

Based on the information box above, the mode of a Beta(α, β) distribution is (α − 1)/(α + β − 2) for α, β > 1. So, for the Beta(3, 6) posterior, the MAP estimate is (3 − 1)/(3 + 6 − 2) = 2/7.
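A minimal sketch of this update in code (my own; the flip counts n_h = 2, n_t = 5 and the Beta(1, 1) prior are assumptions chosen so the posterior comes out as Beta(3, 6), matching part 3).

```python
# Beta-Bernoulli conjugate update and MAP estimate (illustrative sketch)
alpha0, beta0 = 1, 1     # assumed prior Beta(1, 1)
n_h, n_t = 2, 5          # assumed flip counts, chosen so the posterior is Beta(3, 6)

# Posterior is Beta(alpha0 + n_h, beta0 + n_t)
alpha_post, beta_post = alpha0 + n_h, beta0 + n_t
print("posterior: Beta(%d, %d)" % (alpha_post, beta_post))    # Beta(3, 6)

# MAP estimate = posterior mode = (alpha - 1) / (alpha + beta - 2), valid here since alpha, beta > 1
theta_map = (alpha_post - 1) / (alpha_post + beta_post - 2)
print("MAP estimate:", theta_map)                             # 2/7 ≈ 0.2857
```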


Bootstrap

What is the probability of not picking a given data point while creating a bootstrap sample?

Suppose the dataset is fairly large. In an expected sense, what fraction of our bootstrap sample will be unique?


Bootstrap Solutions

1. (1 − 1/n)^n

2. As n → ∞, (1 − 1/n)^n → 1/e. So about 1 − 1/e ≈ 0.632 of the bootstrap sample is unique.
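A quick simulation (my addition) checking both answers for a moderately large n.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 200

missed, unique_frac = [], []
for _ in range(trials):
    sample = rng.integers(0, n, size=n)              # bootstrap sample: n draws with replacement
    unique_frac.append(np.unique(sample).size / n)   # fraction of the sample that is unique
    missed.append(0 not in sample)                   # did data point 0 get left out?

print("P(point not picked):", np.mean(missed), "  theory:", (1 - 1 / n) ** n)
print("fraction unique:    ", np.mean(unique_frac), "  theory:", 1 - np.exp(-1))
```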


Random Forest and Boosting

Indicate whether each of the following statements (about random forests and gradient boosting) is true or false.

1. True or False: If your gradient boosting model is overfitting, taking additional steps is likely to help.

2. True or False: In gradient boosting, if you reduce your step size, you should expect to need fewer rounds of boosting (i.e. fewer steps) to achieve the same training set loss.

3. True or False: Fitting a random forest model is extremely easy to parallelize.

4. True or False: Fitting a gradient boosting model is extremely easy to parallelize, for any base regression algorithm.

5. True or False: Suppose we apply gradient boosting with absolute loss to a regression problem. If we use linear ridge regression as our base regression algorithm, the final prediction function from gradient boosting will always be an affine function of the input.


Solution

False, False, True, False, True
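To make statement 5 concrete, here is a small sketch (my own construction, not the course's reference implementation) of gradient boosting with absolute loss and a ridge base learner: every boosting round adds an affine function of x, so the final predictor is itself affine, which the last line verifies by checking that its second differences on an equispaced grid vanish.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1, 1, size=n)
y = np.sin(3 * x) + rng.normal(scale=0.1, size=n)    # assumed toy regression data

lam, nu, rounds = 1.0, 0.1, 50                       # ridge penalty, shrinkage, boosting rounds
X = np.column_stack([x, np.ones(n)])                 # features for an affine (slope + intercept) ridge fit

def fit_ridge(X, target):
    # coefficients of an affine base learner h(x) = w[0] * x + w[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ target)

# Gradient boosting with absolute loss: the pseudo-residuals are sign(y - F(x))
coefs = []
F = np.full(n, np.median(y))                         # constant initial model
for _ in range(rounds):
    w = fit_ridge(X, np.sign(y - F))
    coefs.append(nu * w)
    F += nu * (X @ w)

def predict(x_new):
    out = np.full_like(x_new, np.median(y), dtype=float)
    for w in coefs:
        out += w[0] * x_new + w[1]                   # each round adds an affine function
    return out

grid = np.linspace(-1, 1, 5)
print(np.allclose(np.diff(predict(grid), n=2), 0))   # True: the final predictor is affine
```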


Hypothesis space of GBM and RF

Let H_B represent a base hypothesis class of (small) regression trees. Let H_R = {g | g = ∑_{i=1}^T (1/T) f_i, f_i ∈ H_B} represent the hypothesis space of prediction functions of a random forest with T trees, where each tree is picked from H_B. Let H_G = {g | g = ∑_{i=1}^T ν_i f_i, f_i ∈ H_B, ν_i ∈ ℝ} represent the hypothesis space of prediction functions of gradient boosting with T trees.

True or False:

1. If f_i ∈ H_R then αf_i ∈ H_R for all α ∈ ℝ
2. If f_i ∈ H_G then αf_i ∈ H_G for all α ∈ ℝ
3. If f_i ∈ H_G then f_i ∈ H_R
4. If f_i ∈ H_R then f_i ∈ H_G


Solutions

True, True, True, True


Neural Networks

1. True or False: Consider a hypothesis space H of prediction functions f : ℝ^d → ℝ given by a multilayer perceptron (MLP) with 3 hidden layers, each consisting of m nodes, for which the activation function is σ(x) = cx for some fixed c ∈ ℝ. Then this hypothesis space is strictly larger than the set of all affine functions mapping ℝ^d to ℝ.

2. True or False: Let g : [0, 1]^d → ℝ be any continuous function on the compact set [0, 1]^d. Then for any ε > 0, there exist m ∈ {1, 2, 3, ...}, a = (a_1, ..., a_m) ∈ ℝ^m, b = (b_1, ..., b_m) ∈ ℝ^m, and a matrix W ∈ ℝ^(m×d) with rows w_1ᵀ, ..., w_mᵀ, for which the function f : [0, 1]^d → ℝ given by

    f(x) = ∑_{i=1}^m a_i max(0, w_iᵀ x + b_i)

satisfies |f(x) − g(x)| < ε for all x ∈ [0, 1]^d.


Solutions

False, True
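For statement 1, here is a short numerical illustration (mine; the layer sizes and random weights are arbitrary assumptions) of why the answer is False: with activation σ(x) = cx, a 3-hidden-layer MLP collapses into a single affine map, so every function in H is affine and the class is not strictly larger than the affine functions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, c = 4, 8, 0.5                        # input dim, hidden width, activation slope (all assumed)

# Random weights/biases for 3 hidden layers plus a linear output layer
Ws = [rng.normal(size=(m, d)), rng.normal(size=(m, m)), rng.normal(size=(m, m))]
bs = [rng.normal(size=m) for _ in range(3)]
w_out, b_out = rng.normal(size=m), rng.normal()

def mlp(x):
    h = x
    for W, b in zip(Ws, bs):
        h = c * (W @ h + b)                # sigma(z) = c * z
    return w_out @ h + b_out

# Collapse the whole network into a single affine map f(x) = A x + t
A = c ** 3 * w_out @ Ws[2] @ Ws[1] @ Ws[0]
t = mlp(np.zeros(d))                       # the composed bias term
for _ in range(3):
    x = rng.normal(size=d)
    print(np.isclose(mlp(x), A @ x + t))   # True: the MLP is exactly affine
```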


Neural Networks

We have a neural network with one input neuron in the first layer, a hidden layer with n neurons, and an output layer with one neuron. The hidden layer has ReLU activation functions. You are allowed to use biases. Explain whether the following functions can be represented in this architecture and, if yes, give a value for n and a set of weights for which this is true.

1. f(x) = |x| (see the sketch below)
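The solution slide for this question is not in the transcript, but one standard construction (my note, not taken from the slides) uses n = 2 hidden units, since |x| = ReLU(x) + ReLU(−x). A minimal check:

```python
import numpy as np

relu = lambda z: np.maximum(0, z)

# Hidden layer: n = 2 units with weights w = (1, -1) and biases b = (0, 0);
# output layer: weights a = (1, 1) and bias 0, so f(x) = relu(x) + relu(-x) = |x|
w, b, a = np.array([1.0, -1.0]), np.array([0.0, 0.0]), np.array([1.0, 1.0])
f = lambda x: a @ relu(w * x + b)

for x in [-2.5, -0.3, 0.0, 1.7]:
    print(x, f(x), abs(x))   # the two values agree
```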


Backprop

Suppose f(x; θ) and g(x; θ) are differentiable with respect to θ. For any example (x, y), the log-likelihood function is J = L(θ; x, y). Give an expression for the scalar value ∂J/∂θ_i in terms of the "local" partial derivatives at each node. That is, you may assume you know the partial derivative of the output of any node with respect to each of its scalar inputs. For example, you may write your expression in terms of ∂J/∂α, ∂J/∂β, ∂α/∂s_α, ∂s_α/∂θ_i^(1), etc. Note that, as discussed in lecture, θ^(1) and θ^(2) represent identical copies of θ, and similarly for x^(1) and x^(2).

[Computation graph: inputs θ ∈ ℝ^m and x ∈ ℝ^d (with copies θ^(1), x^(1) and θ^(2), x^(2)) feed s_α = f(x^(1), θ^(1)) and s_β = g(x^(2), θ^(2)); then α = ψ(s_α), β = ψ(s_β), and together with y ∈ (0, 1) these produce J = L(α, β; y).]


Backprop solution

∂J/∂θ_i = (∂J/∂α)(∂α/∂s_α)(∂s_α/∂θ_i^(1)) + (∂J/∂β)(∂β/∂s_β)(∂s_β/∂θ_i^(2))
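A small numerical check of this two-path chain rule (my own sketch; the specific choices of f, g, ψ = sigmoid, and a toy log-likelihood L are assumptions, not the lecture's): accumulate the local partials along both paths and compare with a finite-difference estimate of ∂J/∂θ_i.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
theta, x, y = rng.normal(size=d), rng.normal(size=d), 0.3

sigmoid = lambda z: 1 / (1 + np.exp(-z))

# Assumed concrete nodes (not the lecture's): s_alpha = theta . x, s_beta = theta . x^2,
# psi = sigmoid, and J = y * log(alpha) + (1 - y) * log(1 - beta)
def forward(theta):
    s_a, s_b = theta @ x, theta @ x ** 2
    alpha, beta = sigmoid(s_a), sigmoid(s_b)
    return alpha, beta, y * np.log(alpha) + (1 - y) * np.log(1 - beta)

alpha, beta, J = forward(theta)

# Local partials combined along the two paths theta^(1) -> alpha and theta^(2) -> beta
dJ_da, dJ_db = y / alpha, -(1 - y) / (1 - beta)
da_dsa, db_dsb = alpha * (1 - alpha), beta * (1 - beta)
dsa_dth, dsb_dth = x, x ** 2
grad = dJ_da * da_dsa * dsa_dth + dJ_db * db_dsb * dsb_dth

# Finite-difference check of dJ/dtheta_i
eps = 1e-6
fd = np.array([(forward(theta + eps * np.eye(d)[i])[2] - J) / eps for i in range(d)])
print(np.allclose(grad, fd, atol=1e-4))   # True
```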


KMeans


KMeans Solution


Mixture Models

Suppose we have a latent variable z ∈ {1, 2, 3} and an observed variable x ∈ (0, ∞) generated as follows:

    z ∼ Categorical(π_1, π_2, π_3)
    x | z ∼ Gamma(2, β_z),

where (β_1, β_2, β_3) ∈ (0, ∞)³, and Gamma(2, β) is supported on (0, ∞) and has density p(x) = β² x e^(−βx). Suppose we know that β_1 = 1, β_2 = 2, β_3 = 4. Give an explicit expression for p(z = 1 | x = 1) in terms of the unknown parameters π_1, π_2, π_3.


Solutions

p(z = 1 | x = 1) ∝ p(x = 1 | z = 1) p(z = 1) = π_1 e^(−1)

p(z = 2 | x = 1) ∝ p(x = 1 | z = 2) p(z = 2) = π_2 · 4e^(−2)

p(z = 3 | x = 1) ∝ p(x = 1 | z = 3) p(z = 3) = π_3 · 16e^(−4)

Therefore,

    p(z = 1 | x = 1) = π_1 e^(−1) / (π_1 e^(−1) + π_2 · 4e^(−2) + π_3 · 16e^(−4))
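A direct transcription of this computation into code (mine; the mixing weights π = (0.5, 0.3, 0.2) are placeholder values, since the question leaves them unknown):

```python
import numpy as np

betas = np.array([1.0, 2.0, 4.0])
pis = np.array([0.5, 0.3, 0.2])            # placeholder values for the unknown mixing weights

def gamma2_pdf(x, beta):
    # Gamma(2, beta) density: beta^2 * x * exp(-beta * x)
    return beta ** 2 * x * np.exp(-beta * x)

x = 1.0
unnormalized = pis * gamma2_pdf(x, betas)          # pi_k * p(x = 1 | z = k)
posterior = unnormalized / unnormalized.sum()      # p(z = k | x = 1)
print(posterior)                                   # posterior[0] matches the formula above
```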
