Machine Learning CS 189 - Brandon McKinzie


Transcript of: Machine Learning CS 189 - Brandon McKinzie (source: mckinziebrandon.me/assets/pdf/LectureNotesCS189.pdf)

Contents

1 Machine Learning CS 189
  1.1 Classification
  1.2 Gradient Descent
  1.3 Stochastic Gradient Descent
  1.4 Risk Minimization & Optimization Abstractions
  1.5 Decision Theory
  1.6 Multivariate Gaussians and Random Vectors
  1.7 Maximum Likelihood
  1.8 LDA & QDA
  1.9 Regression
  1.10 Bias-Variance Tradeoff
  1.11 Regularization
  1.12 Neural Networks
  1.13 Neural Networks II
  1.14 Convolutional Neural Networks
  1.15 Kernel Methods
    1.15.1 Kernels - CS229
    1.15.2 Kernels - Discussion 10
  1.16 Nearest Neighbors
  1.17 Decision Trees
    1.17.1 ISLR - Tree-Based Methods
  1.18 Decision Trees II
  1.19 Unsupervised Learning
    1.19.1 SVD - Independent Notes
  1.20 Principal Component Analysis
    1.20.1 PCA - Andrew Ng's CS 229 Notes
  1.21 Clustering

2 Shewchuk Notes
  2.1 Perceptron Learning
    2.1.1 Perceptron Algorithm
    2.1.2 Hyperplanes with Perceptron
    2.1.3 Algorithm: Gradient Descent
    2.1.4 Algorithm: Stochastic GD
    2.1.5 Maximum Margin Classifiers
  2.2 Soft-Margin SVMs
  2.3 Decision Theory
  2.4 Gaussian Discriminant Analysis
    2.4.1 Quadratic Discriminant Analysis (QDA)
  2.5 Newton's Method
  2.6 Justifications & Bias-Variance (12)

3 Elements of Statistical Learning
  3.1 Linear Regression
    3.1.1 Models and Least-Squares
    3.1.2 Subset Selection (3.3)
    3.1.3 Shrinkage Methods (3.4)
  3.2 Linear Methods for Classification (Ch. 4)
  3.3 Naive Bayes
  3.4 Trees and Boosting
    3.4.1 Boosting Methods (10.1)
  3.5 Basis Expansions and Regularization (Ch 5)
  3.6 Model Assessment and Selection (Ch. 7)

4 Concept Summaries
  4.1 Probability Review
  4.2 Commonly Used Matrices and Properties
  4.3 Midterm Studying
  4.4 Support Vector Machines (CS229)
  4.5 Spring 2016 Midterm
  4.6 Multivariate Gaussians
    4.6.1 Misc. Facts
  4.7 Final Studying
    4.7.1 PCA
    4.7.2 Kernel PCA
    4.7.3 Kernel PCA
    4.7.4 Bias-Variance Tradeoff
    4.7.5 Classifiers
    4.7.6 Risk, Empirical Risk Minimization, and Related
    4.7.7 Trees


1 Machine Learning CS 189


Machine Learning August 30

Classification

Goal: Want to prove that w is normal to decision boundary.

• Starting axiom: Any vector x along the decision boundary satisfies, by definition,

w · x + β = 0   (1)

• Let x and x′ be two such vectors that lie on the decision boundary. Then the vector x′ − x points from x to x′ and is parallel to the decision boundary. If w really is normal to the decision boundary line, then

w · (x′ − x) = w · x′ − w · x = (w · x′ + β) − (w · x + β) = 0 + 0 = 0   (2)

• Euclidean distance of x to decision boundary:

τ = −(w · x + β) / ||w|| = −f(x) / ||w||   (3)

• The margin can be found as the minimum of τ over all training data:

M = min_{i∈1…n} |f(x_i)| / ||w||   (4)
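As a small illustration of Eqs. (3)-(4), here is a minimal NumPy sketch; the weights, offset, and data points below are made-up placeholders, not values from the lecture.

```python
import numpy as np

# Hypothetical hyperplane w.x + beta = 0 and a few labeled points.
w = np.array([2.0, -1.0])
beta = 0.5
X = np.array([[1.0, 1.0], [3.0, -2.0], [-0.5, 0.25]])

f = X @ w + beta                           # f(x_i) = w.x_i + beta
distances = np.abs(f) / np.linalg.norm(w)  # |f(x_i)| / ||w||   (Eq. 3, up to sign)
margin = distances.min()                   # M = min_i |f(x_i)| / ||w||   (Eq. 4)
print(distances, margin)
```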


Machine Learning September 1

Gradient Descent

• Optimization: maximize goals/minimize cost, subject to constraints. However, we can model a lot while ignoring constraints.

• Main optimization algorithm is stochastic gradient descent.

• The SVM [1] is just another cost function. Want to minimize [2]

C ∑_{i=1}^{n} (1 − y_i(w · x_i + β))_+ + ||w||²   (5)

with respect to the decision variables (w, β); Note that C is a hyperparameter.

[1] So-called because you could represent the decision boundary as a set of vectors pointing to the hyperplane.

[2] Notation: (z)_+ = max(z, 0).
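A minimal sketch of the cost in Eq. (5), assuming toy data, labels in {+1, −1}, and an arbitrary hyperparameter C; (z)_+ = max(z, 0) as in footnote [2].

```python
import numpy as np

def svm_cost(w, beta, X, y, C=1.0):
    """C * sum of hinge losses + squared L2 norm of w (Eq. 5)."""
    margins = 1.0 - y * (X @ w + beta)   # 1 - y_i (w.x_i + beta)
    hinge = np.maximum(margins, 0.0)     # (z)_+ = max(z, 0)
    return C * hinge.sum() + w @ w

# Toy example with made-up numbers.
X = np.array([[1.0, 2.0], [-1.0, -1.5], [2.0, 0.5]])
y = np.array([1, -1, 1])
print(svm_cost(np.array([0.5, -0.2]), 0.1, X, y))
```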


Machine Learning September 6

Stochastic Gradient Descent

• Review: minimize the cost function f(w) over w. Take the gradient; set it to zero to solve for w.

• If can’t solve analytically, then Gradient Descent:

w_{k+1} = w_k − α_k ∇f(w_k)   (6)

• For convex f , can always find solution. Guaranteed global minimum.

• Cost functions of the form: minimize ∑_i loss(w, (x_i, y_i)) + penalty(w).

• SVM example:

min (C/n) ∑_i (1 − y_i w^T x_i)_+ + ||w||²

Add the squared norm because it gives better margins and better classifications, and also because the algorithms converge faster. C is the regularization parameter: "Do I fit the data, or make w simple?" It doesn't change the optimal set, just changes the "cost" (wat).

• Want an algorithm whose per-step cost is constant in the number of data points n [3].

• Unbiased estimate of the gradient:

– Want the expected value of g to be the gradient of the cost function.
– Sample i uniformly at random from 1, …, n.
– Then set g to the gradient of the loss at the i-th data point.

• SGD:

– initialize w0, k = 0.

– (Repeat) sample i uniformly at random; do the weight update using the loss for the i-th term. Repeat until converged.

– Follow the stochastic estimate of the gradient (rather than the true gradient) until convergence, i.e. follow a noisy version. As long as the variance is bounded, the direction will be more or less correct. For a large number of data points n, it will be pretty good.

[3] Regular GD is linear in n.


• Numerical example:

– f(w) = (1/2n) ∑_i (w − y_i)². Assumes x is always 1.

– Solve ∇f(w) = (1/n) ∑_i (w − y_i) = 0.

– Optimal w = (1/n) ∑_i y_i, the empirical mean.

– Init w_1 = 0. Set α_k = 1/k, where k indexes the k-th update.

– w_2 = w_1 − α_1 ∇loss(w_1) = y_1, where the loss here is the single-point term of f.

– w_3 = w_2 − α_2(w_2 − y_2) = y_1 − (1/2)(y_1 − y_2) = (y_1 + y_2)/2.

– w_4 = w_3 − α_3(w_3 − y_3) = (y_1 + y_2 + y_3)/3, and so on.

– Lesson: the order in which we passed through the data didn't matter. One pass over all data points leads to the optimal w. Why advocate randomness then? He uses a sum-of-trig-functions example to illustrate how SGD can struggle if done in order, but converge much quicker when randomly sampled.

• Illustrates the "region of confusion" (coined by Bertsekas), drawn for different convex functions along w. A rapid decrease in error in early iterations means we are far outside this region. A constant α means you'll jiggle around in later iterations; that is why you use a diminishing α: it helps in the region of confusion.

• Most important rules of SGD: (buzzwords)

– shuffle! Can speed up by as much as 20x.

– diminishing stepsize (learning-rate decay for α): after n steps, set α = β · α.
– momentum: w_{k+1} = w_k − α ∇l_i(w_k) + β_k(w_k − w_{k−1}). The momentum is the final term; a typical value is 0.9.

• Notation: e(z) = (z < 1). Evaluates to 1 or 0.
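A rough sketch of the SGD recipe above (shuffling each pass, a diminishing step size, and a momentum term), applied to the mean-estimation example f(w) = (1/2n) ∑_i (w − y_i)²; all the constants here are arbitrary choices, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=1.0, size=100)   # data; the optimal w is the mean

w, w_prev = 0.0, 0.0
alpha, beta_momentum = 0.1, 0.9
for epoch in range(20):
    rng.shuffle(y)                              # rule 1: shuffle each pass
    for yi in y:
        grad_i = w - yi                         # gradient of (1/2)(w - y_i)^2
        w_new = w - alpha * grad_i + beta_momentum * (w - w_prev)
        w_prev, w = w, w_new
    alpha *= 0.9                                # rule 2: diminishing stepsize

print(w, y.mean())                              # should end up close together
```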


Machine Learning September 8

Risk Minimization & Optimization Abstractions

• Where do these optimization problems come from?

– General framework: minimizing an average loss + λ penalty.

– Loss: measures data fidelity.

– Penalty: Controls model complexity.

– Features/representation: How to write (xi, yi).

– Algorithms: ∇ cost(w) = 0.

– Risk: Integral of loss over probability(x, y).

– Empirical risk: Sample average. Converges to the true risk with more points; the variance decreases.

• Begins discussion of splitting up data.

– Let some large portion be the Training set and the small remaining data points be the Validation set.

R_T = (1/n_T) ∑_{i∈train} loss(w, (x_i, y_i))   (7)

R_V = (1/n_V) ∑_{i∈val} loss(w, (x_i, y_i))   (8)

– By the law of large numbers, we can say that R_V will go like 1/n_V. Looking lots of times at the validation set becomes a problem when n_V ≈ 10^4.

• Classification example:

– Hinge loss: (1 − y w^T x)_+. Means you're solving an SVM.

– Least-Squares: (1 − y w^T x)². Bayes classifiers.

– In practice, hinge and LS perform basically the same.

– Logistic loss useful for MLE.


• Most important theorem in machine learning: Relates risk with empirical risk:

R[w] = (R[w] − R_T[w]) + R_T[w],   i.e. generalization error + training error   (9)
     ≈ (R_V[w] − R_T[w]) + R_T[w]   (10)

• One-vs.-all classification MNIST example: have a classifier for each digit that treats the data as (their digit) vs. (everything else). Choose the classifier with the highest margin when classifying a digit.

• Maximum Likelihood:

– Have p(x, y;w). Pick the w that makes data set have highest probability.

– Assumes data points come independently.

– Can get the same result by minimizing the negative average log-likelihood.

• More important than most things (like the loss function) is choosing the features. Conic sections, because why not.

• N-grams. Bag of words.

– x_i = number of occurrences of term i. Count the number of times each word appears in some document.

– The two-gram is the lifted version: x_ij = number of occurrences of terms i, j in the same context, i.e. count how often two words appear together (e.g. in the same sentence, or next to each other). Like a quadratic model.

• Histograms

– xij = 1 if xi ∈ bin(j) else 0.

– e.g. histograms of image gradients in the notes.

• "If you have too many features, then you have to have a penalty." - D.J. Khaled. I.e. if d > n, you must use pen(w). You never need to have d > n, because of the kernel trick.
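To make the train/validation risks of Eqs. (7)-(8) concrete, here is a small sketch that splits a synthetic dataset and computes the empirical risks R_T and R_V under the hinge loss; the data, fitting method, and split sizes are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = np.sign(X @ w_true)

n_train = 160                               # large training portion ...
Xt, yt = X[:n_train], y[:n_train]           # ... small validation remainder
Xv, yv = X[n_train:], y[n_train:]

def empirical_risk(w, X, y):
    """(1/n) * sum_i loss, here with the hinge loss (Eqs. 7-8)."""
    return np.maximum(0.0, 1.0 - y * (X @ w)).mean()

w_hat = np.linalg.lstsq(Xt, yt, rcond=None)[0]   # a crude least-squares fit
print("R_T:", empirical_risk(w_hat, Xt, yt))
print("R_V:", empirical_risk(w_hat, Xv, yv))
```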


Machine Learning September 13

Decision Theory

• Decision Theory:

– Given feature distributions conditioned on class values (e.g. ±1).
– Goes over Bayes rule like a pleb.

– Classification depends on loss function.

• The loss function can be asymmetric: e.g. we would rather misclassify a normal patient as cancer than misclassify a cancer patient as normal. So to minimize loss, we may prefer to be wrong more often on one kind of misclassification than another.

• Can minimize on probability of error.

min Pr(error) = min ∫ Pr(error|x) Pr(x) dx   (11)

The area of overlap between the class conditionals (think of the homework problem with Gaussians) is Pr(error). With K classes, similarly, classify as the class with the maximum conditional; the error is 1 minus the maximum posterior probability.

• Modified rule:

min_j ∑_k L_kj P(c_k|x)   (12)

where L_kj is the loss when the true class is k but we classify as j. "By integrating over x, we can compute the loss."

• The loss function in regression is pretty clear (e.g. least-squares loss). Classification is less so. Can use the "doubt option": classifier humility [4]. In some range of inputs near the decision boundary, just say "idk".

• Good classifiers have loss closer to Bayes risk, given certain choice of features.

• Three ways of building classifiers:

[4] Wanna see my posterior.


– Generative: Model the class-conditional distribution P(x|c_k). Model the priors P(c_k). Use Bayes, duh. Want to understand the distributions under which the data were generated. Physicists can get away with this; they have "models".

– Discriminative: Model P(c_k|x) directly.
– Find decision boundaries.

• Posterior for Gaussian class-conditional densities :

– P (x|c1) and P (x|c2) ∼ N (µi, σ2).

– Univariate gaussian example...

– Posterior probabilities P (ci|x) turn out to be logistic σ functions.
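A minimal sketch of the modified rule in Eq. (12): given posteriors P(c_k|x) and an asymmetric loss matrix L_kj (both made up here, loosely following the normal/cancer example above), pick the class j minimizing the expected loss ∑_k L_kj P(c_k|x).

```python
import numpy as np

# Made-up posteriors for classes (normal, cancer) at some input x.
posterior = np.array([0.7, 0.3])          # P(c_k | x)

# L[k, j]: loss when the true class is k but we predict j.
# Misclassifying cancer as normal is penalized much more heavily.
L = np.array([[0.0, 1.0],                 # true normal: predict normal / cancer
              [10.0, 0.0]])               # true cancer: predict normal / cancer

expected_loss = posterior @ L             # sum_k L_kj P(c_k | x), for each j
print(expected_loss, "-> predict class", expected_loss.argmin())
```

With these numbers the asymmetric loss flips the decision away from the maximum-posterior class, which is exactly the point of the modified rule.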


Machine Learning September 15

Multivariate Gaussians and Random Vectors

⇒ Recht’s Decision notation: P (x|H0).

⇒ Want to discuss case of x being non-scalar. Random Vectors.

– Def: a vector x with probability density p(x) : R^n → R.
– An example density is the multivariate Gaussian.

– Usually want to know Pr(x ∈ A) = ∫_A p(x) dx_1 … dx_n, the probability that x lives in the set A.

– Properties of random vectors: mean and covariance.

⇒ Let f : Rn → Rm. Expected value of f:

E[f(x)] = ∫ ··· ∫ f(x) p(x) dx_1 … dx_n   (13)

⇒ Covariance Matrix. A matrix, with Λ = Λ^T:

Λ = E[(x − µ)(x − µ)^T] = E[xx^T] − µ_x µ_x^T   (14)
var(x_1) = E[(x_1 − µ_{x_1})²] = Λ_{11}   (15)

⇒ Let v ∈ R^n. Then var(v^T x) = v^T Λ v ≥ 0 ⇒ Λ_x is positive semidefinite.

– Suppose A is some square matrix; λ is an eigenvalue of A with corresponding eigenvector x if Ax = λx. Larger eigenvalues tell you how much variance there is in a given direction.

– Eigenvectors are, like, eigendirections, man.

– Spectral Theorem: If A = A^T, then there exist v_1, …, v_n ∈ R^n such that v_i^T v_j = 0 for i ≠ j, Av_i = λ_i v_i, and λ_i ∈ R.

– Because the matrix of eigenvectors has linearly independent columns, it is invertible.

– A p.d. ⇒ B^T A B is p.s.d. ∀B.

– If A is p.s.d., then f(x) = x^T A x is convex.

15

Page 16: 1 Machine Learning CS 189 5 - Brandon McKinziemckinziebrandon.me/assets/pdf/LectureNotesCS189.pdf · Machine Learning September 1 Gradient Descent Table of Contents Local Written

⇒ Multivariate Gaussian

p(x) = (1 / det(2πΛ_x)^{1/2}) exp[ −(1/2)(x − µ_x)^T Λ_x^{−1} (x − µ_x) ]   (16)

If µ_x ∈ R^n and Λ_x is p.d., then p(x) is a density.

⇒ What happens if the covariance is diagonal? Then the variables are independent.

– Λx diagonal,

– x Gaussian

– ⇒ x1, . . . , xn independent.
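A quick numerical check of the covariance facts above (Λ = E[(x − µ)(x − µ)^T], symmetry, nonnegative eigenvalues, var(v^T x) = v^T Λ v), using samples from an arbitrary multivariate Gaussian of my own choosing.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])
A = rng.normal(size=(3, 3))
Lam_true = A @ A.T + np.eye(3)            # a p.d. covariance matrix

X = rng.multivariate_normal(mu, Lam_true, size=50_000)
Lam_hat = np.cov(X, rowvar=False)         # estimate of E[(x - mu)(x - mu)^T]

print(np.allclose(Lam_hat, Lam_hat.T))    # Lambda = Lambda^T
print(np.round(np.linalg.eigvalsh(Lam_hat), 3))   # all nonnegative: p.s.d.
v = np.array([0.3, -1.0, 2.0])
print(v @ Lam_hat @ v, np.var(X @ v))     # var(v^T x) ~ v^T Lambda v
```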


Machine Learning September 20

Maximum Likelihood

• Estimation: hypothesis testing on the continuum.

• Maximum Likelihood: Pick the model so that p(data|model), the likelihood function, is maximized.

• Treat the model as a random variable. Then maximize p(model|data): "maximum a posteriori". Assume uniform priors over all models. The flaw is assuming the model is a random variable.

• ML examples:

– Biased Coin. Flipping a coin.

P(X = x) = (n choose x) p^x (1 − p)^{n−x} = P(x|p)   (17)

See n = 10 with 8 heads; choose the estimate p̂ = 4/5. The binomial likelihood is not concave; when you take the log, it becomes concave.

• Gaussians. x_1, …, x_n ∼ N(µ, σ²)   (18)

independent samples.

P(x_1, …, x_n | µ, σ²) = ∏_i P(x_i | µ, σ²)   (19)

where each term in the product is a Gaussian PDF. Next, take the log.

log P(x_1, …, x_n | µ, σ²) = ∑_i [ −(x_i − µ)² / (2σ²) − log σ − (1/2) log 2π ]   (20)

Setting the derivatives to zero, the best estimates for the mean and variance are:

µ̂ = (1/n) ∑_i x_i   (21)

σ̂² = (1/n) ∑_i (x_i − µ̂)²   (22)

• Multivariate Gaussian:

P(x | µ, Λ) = (1 / det(2πΛ)^{1/2}) exp( −(1/2)(x − µ)^T Λ^{−1} (x − µ) )   (23)


Machine Learning September 22

LDA & QDA

• How do we build classifiers?

– ERM. Minimize (1/n) ∑_i loss + λ·pen   (24)

Equivalent to discriminative

– Generative models. Fit a model to the data, then use that model to classify. For each class C, fit p(x|y = C). Estimate p(y = C). To minimize Pr(err), pick the y that maximizes P(y|x) via Bayes rule.

– Discriminative. Fit p(y|x). Function fitting, fitting each data point. For the cost function, we want to maximize ∏_i P(y_i|x_i) ≡ max (1/n) ∑_i log p(y_i|x_i). Equivalent to ERM.

• Generative example: What is a good model of x given y? Fit blobs of data given their labels.

– Let p(x|y = C) = N(µ_C, Λ_C)   (25), where µ_C = (1/n_C) ∑_{i∈I_C} x_i and Λ_C = (1/n_C) ∑_{i∈I_C} (x_i − µ_C)(x_i − µ_C)^T. We only have one index i because it is a d×d sum of matrices.

– Decision rule:

argmax_c [ −(1/2)(x − µ_c)^T Λ_c^{−1}(x − µ_c) − (1/2) log det Λ_c + log π_c ]   (26)

where the last two terms are constant in x, and the first term is a quadratic in x. Q_c(x) denotes the whole expression. The decision boundary is the set {x : Q_{c=−1}(x) = Q_{c=1}(x)}.

– Need to make sure n ≫ d² in order to avoid overfitting.

– This is called Quadratic Discriminant Analysis.

• Linear Discriminant Analysis. Assume Λ_c is the same for every class: they all have the same covariance matrix.

18

Page 19: 1 Machine Learning CS 189 5 - Brandon McKinziemckinziebrandon.me/assets/pdf/LectureNotesCS189.pdf · Machine Learning September 1 Gradient Descent Table of Contents Local Written

– How to find Λ?

Λ̂ = ∑_C (n_C / n) · (1/n_C) ∑_{i∈C} (x_i − µ_C)(x_i − µ_C)^T   (27)

  = (1/n) ∑_{i=1}^n (x_i − µ_{y_i})(x_i − µ_{y_i})^T   (28)

– Extremely similar to QDA, we have (sort of; don't rely on this)

argmax_c [ −(x − µ_c)^T Λ̂^{−1}(x − µ_c) − (1/2) log det Λ̂ + log π_c ]   (29)

– end up with linear boundaries.

• Did something called method of centroids

• Many people prefer LDA because optimization is hard.
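A bare-bones sketch of fitting the generative pieces used above: per-class means µ_c, per-class covariances Λ_c (QDA), priors π_c, and the pooled covariance of Eq. (28) (LDA). The two-blob data below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(100, 2)),
               rng.normal([3, 3], 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

means, covs, priors = {}, {}, {}
pooled = np.zeros((2, 2))
for c in np.unique(y):
    Xc = X[y == c]
    means[c] = Xc.mean(axis=0)             # mu_c
    diff = Xc - means[c]
    covs[c] = diff.T @ diff / len(Xc)      # Lambda_c (class-specific, QDA)
    priors[c] = len(Xc) / len(X)           # pi_c
    pooled += diff.T @ diff                # accumulate for the pooled estimate
pooled /= len(X)                           # (1/n) sum_i (x_i - mu_{y_i})(...)^T, Eq. (28)

print(means, priors)
print("pooled covariance:\n", pooled)
```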


Machine Learning September 27

Regression

• Model p(y|x) ∼ N(w^T x, σ²), where y = w^T x + ε. Epsilon is noise causing the data points to fluctuate about the hyperplane. Assume the noise is Gaussian with zero mean and some variance; the variance of y is the variance of the noise ε.

• The unknown we want to estimate: w. Estimate it using maximum likelihood. Q: he says this maximizes p(y|x); figure out how this is the same as maximizing p(x|y)...

• P(data|θ) = p(y_1, …, y_n | x_1, …, x_n, θ) = ∏_i p(y_i|x_i, θ).

• Use matrices so you can express ∑_i (y_i − w^T x_i)² = ||y − Aw||²   (30), where A is typically denoted as the design matrix X.

• Take gradient of loss like usual...

∇_w L = −A^T y + A^T A w   (31)

where differentiating again yields the Hessian H = A^T A.

• Implicitly we want y ≈ Aw here. Re-interpret A as a bunch of columns now (rather than a bunch of rows).

A = [a_1 ··· a_d]   (32)

and so

||y − Aw||² = ||y − (w_1 a_1 + ··· + w_d a_d)||²   (33)

• column space of A refers to this type of linear combination of columns ai.

• References Figure 3.2 in ESL. The error is the vertical component of y in the figure. This "error vector" is perpendicular to the subspace spanned by the x's.

• y − Aw is perpendicular to each and every column of A: A^T(y − Aw) = 0.
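A small sketch of the stationarity condition A^T(y − Aw) = 0 from Eq. (31): solve the normal equations A^T A w = A^T y on a toy design matrix and confirm the residual is orthogonal to every column of A. The data here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(50, 4))                 # design matrix (rows = data points)
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = A @ w_true + 0.1 * rng.normal(size=50)   # y = Aw + noise

w_hat = np.linalg.solve(A.T @ A, A.T @ y)    # normal equations
residual = y - A @ w_hat
print(w_hat)
print(A.T @ residual)                        # ~ 0: residual orthogonal to col(A)
```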


Machine Learning October 4

Bias-Variance Tradeoff

• How well the model fits the data is captured by the bias. Bias summarizes the fact that the model is wrong, but we want to know how wrong.

• Variance is robustness to changes in data.

• Good model has both low bias and low variance.

• Example:

– Sample one point x ∼ N (µ, σ2Id).

– What is most likely estimate for µ? Just x (only have one point).

– Since the expected value of x is (by definition) µ, we have that

E[µ̂ − µ] = 0   (34)

– How about squared error

E[||µ̂ − µ||²] = E[(µ̂ − µ)^T(µ̂ − µ)]   (35)
             = E[Tr((x − µ)(x − µ)^T)]   (36)
             = Tr(Λ)   (37)

which uses the cyclic property of the trace: if the dot product is a scalar, then it is equal to the trace of the outer product [5].

– What is the trace of the covariance matrix Λ? Here (only) it is dσ2.

– What if I'm bored and I define µ̂ = αx, where 0 < α < 1? Then

E[µ̂] = αµ   (38)
E[µ̂ − µ] = (α − 1)µ   (39)

which isn't zero (woaAHhhh!)

– Variance won’t go down.

E[||µ̂ − µ||²] = E[ ||µ̂ − E[µ̂] + E[µ̂] − µ||² ]   (40)

[5] Oh, it is just the fact that Tr(AB) = Tr(BA). Moving on...


• Me making sense of Newton’s Method (as defined in this lecture):

– Slow for high-dimensional problems; better than gradient descent though.

– Gradient descent models the function with a first-order Taylor approximation. Newton's method uses second order.

f(x) ≈ f(x_k) + ∇f(x_k)^T (x − x_k) + (1/2)(x − x_k)^T ∇²f(x_k)(x − x_k)   (41)

where ∇²f is the Hessian.

– Actual derivation by Wikipedia :

Let f(α) = 0   (42)
f(α) = f(x_n) + f′(x_n)(α − x_n) + R_1   (43)
where R_1 = (1/2) f″(ξ_n)(α − x_n)²   (44)
f(x_n)/f′(x_n) + (α − x_n) = − (f″(ξ_n) / (2 f′(x_n))) (α − x_n)²   (45)

and all the xn represent the nth approximation of some root of f(x).

• Oh I get it now:

– Gradient descent: Find the optimal w iteratively by assuming a first-order Taylor expansion of ∇J(w*):

∇J(w*) ≈ ∇J(w_k)   (47)

where w_k is the current best guess for the minimum of J. If this gradient is zero, we are done. If it is not, then we continue to iterate closer and closer via the update

w_{k+1} = w_k − η ∇J(w_k)   (49)

until our first-order approximation (appears) valid.

– Newton’s method goes a step further and expands to second order:

∇J(w*) ≈ ∇J(w_k) + ∇²J(w_k)(w* − w_k)   (50)
        = ∇J(w_k) + H(w* − w_k)   (51)


where [6], implicit in all these optimization algorithms, is the hope that w_{k+1} ≈ w*, and so we can set this derivative to 0 to "solve" for w_{k+1} = w* as

(w_{k+1} − w_k) = −H^{−1} ∇J(w_k)   (52)
w_{k+1} = w_k − H^{−1} ∇J(w_k)   (53)

where equation (53) is Newton's Update. It is computationally better to compute e, where

H e = −∇J(w_k)  →  e = −H^{−1} ∇J(w_k)   (54)

[6] Remember that we are dealing with matrices now, so keep the order of H before (w − w_k) even if you don't like it.
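A minimal sketch of the Newton update (53) on a made-up smooth objective, solving He = −∇J instead of forming H^{−1} explicitly, as suggested around Eq. (54).

```python
import numpy as np

def grad(w):     # gradient of J(w) = (w0 - 1)^4 + (w1 + 2)^2 (a made-up objective)
    return np.array([4 * (w[0] - 1) ** 3, 2 * (w[1] + 2)])

def hess(w):     # Hessian of the same objective
    return np.array([[12 * (w[0] - 1) ** 2, 0.0],
                     [0.0, 2.0]])

w = np.array([5.0, 5.0])
for _ in range(20):
    e = np.linalg.solve(hess(w), -grad(w))   # He = -grad(J)   (Eq. 54)
    w = w + e                                # w_{k+1} = w_k + e   (Eq. 53)
print(w)                                     # approaches (1, -2)
```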


Machine Learning October 6

Regularization

• bias = E [f(x)− y]

• Risk = E [loss(f(x), y)]

• Variance = E [(f(x)− E [f(x)])2]

• Regularization: Minimize empirical loss + penalty term.


Machine Learning October 20

Neural Networks

Basics/Terminology. Outputs can be computed for, say, a basic neuron with output 2 as S_2 = ∑_i w_{2i} x_i. We can also feed this through activation functions g such as the logistic or ReLU. Why shouldn't we connect linear layers to linear layers? Because that is equivalent to one linear layer; if we want to stack (multilayer), we need some nonlinearity. We want to find good weights so that the output can perform classification/regression.

Learning and Training. Goal: Find w such that O_i is as close as possible to y_i (the labeled/desired output). Approach:

→ Define the loss function L(w).
→ Compute ∇_w L.
→ Update w_new ← w_old − η ∇_w L.

and so training is all about computing the gradient, which amounts to computing partial derivatives like ∂L/∂w_{jk}. Approach for training a 2-layer neural network:

→ Compute ∇_w L for all weights from input to hidden, and hidden to output.
→ Use SGD. The loss function is no longer convex, so we can only find local minima.
→ Naive gradient computation is quadratic in the number of weights. Backpropagation is a trick to compute it in linear time.

Computing gradients [for a two-layer net]. The value of the i-th output neuron can be computed as

O_i = g( ∑_j W_{ij} g( ∑_k W_{jk} x_k ) )   (55)

where let’s focus on the weight W12. Simple idea:

• Consider some situation where we have a value for the output O_i as well as another value O′_i which is the same as O_i except one of the weights is slightly changed:

O_i = g(…, w_{jk}, …, x)   (56)
O′_i = g(…, w_{jk} + Δw_{jk}, …, x)   (57)


• Then we can compute numerical approx to derivative for one of the weights:

(L(O′_i) − L(O_i)) / Δw_{jk}   (58)

a process typically called the forward pass [7]. This is O(n) if there are n weights in the network. But since this is just the derivative for one of the weights, the total cost over all weights is O(n²).

• This is why we need backprop: to lower complexity from O(n2) to O(n).

Backpropagation. Big picture: A lot of these computations [gradients] seem to be shared. We want to find some way of avoiding computing quantities more than once.

→ Idea: Want to compute some quantity δ_i at the output layer, for each of the i output neurons. Then find the δ's for the layer below, and repeat until we reach back to the input layer [hence the name backprop]. The key idea is the chain rule.

→ Notation [8]:

x_j^{(l)} = g( ∑_i w_{ij}^{(l)} x_i^{(l−1)} ) ≡ g( S_j^{(l)} )

where now w_{ij} is from i to j. We will also denote e for 'error'.

→ Define the partial derivative of the error with respect to the linear-combination input to neuron j as

δ_j^{(l)} ≜ ∂e / ∂S_j^{(l)}   (59)

which carry the information we want about the partial derivatives along the way.
→ Consider the simple case of x_i^{(l−1)} → w_{ij}^{(l)} → x_j^{(l)}

and we want to calculate

∂e/∂w_{ij}^{(l)} = (∂e/∂S_j^{(l)}) · (∂S_j^{(l)}/∂w_{ij}^{(l)})   (60)
                = δ_j^{(l)} · (∂S_j^{(l)}/∂w_{ij}^{(l)})   (61)

[7] Not sure why he says this. See pg. 396 of ESL. Forward pass: the current weights are fixed and the predicted values f_k(x_i) are computed.

[8] Note: he really screws this up.


→ "Inductive step" for calculating δ with the chain rule [for a regression problem using the squared-error loss e = (1/2)(g(S_i^{(l)}) − y)², corresponding to a given example]:

δ_i^{(l)} = ∂e / ∂S_i^{(l)}   (62)
         = (1/2)[ 2( g(S_i^{(l)}) − y ) g′(S_i^{(l)}) ]   (63)

Don't confuse the above expression with the sigmoid derivative; it is not assuming anything about the functional form of g.
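A tiny sketch of the finite-difference idea in Eq. (58): perturb one weight, rerun the forward pass of Eq. (55), and divide. The 2-layer network, loss, and sizes below are toy choices; one extra forward pass per weight is what makes the naive approach O(n²) overall.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def forward(W1, W2, x):
    """O = g(W2 g(W1 x)), as in Eq. (55)."""
    return sigmoid(W2 @ sigmoid(W1 @ x))

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))
x, y = np.array([0.5, -1.0]), np.array([1.0])

def loss(W1, W2):
    return 0.5 * np.sum((forward(W1, W2, x) - y) ** 2)

eps = 1e-6
W1_pert = W1.copy()
W1_pert[0, 0] += eps                                   # perturb a single weight
print((loss(W1_pert, W2) - loss(W1, W2)) / eps)        # Eq. (58)
```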


Machine Learning October 25

Neural Networks II

Backprop Review. The cross-entropy loss derivation uses the following; note that we are defining each y_i = 0 or y_i = 1:

O_i^{y_i} (1 − O_i)^{1−y_i}  →  y_i ln O_i + (1 − y_i) ln(1 − O_i)   (64)

where the expression on the RHS is the log (likelihood) of the LHS. We want to take partial derivatives of the loss function with respect to the weights. The δ terms represent layer-specific derivatives of the error with respect to the S values (the summed inputs); see the previous lecture note for more details. Note that, in order to get the values of the error in the first place, we need to first perform the forward pass.

Clarifying the notation. In the last lecture, we barely scratched the surface of actually calculating δ_i^{(l−1)}, the partial of the error with respect to S_i^{(l−1)}. Recall that the subscript on S_i^{(l−1)} means the weighted sum into the i-th neuron at layer l−1. Specifically,

S_j^{(l−1)} = ∑_i w_{ij}^{(l−1)} x_i^{(l−2)}  →  x_j^{(l−1)}   (65)

Calculating the δ terms. Setup: Only consider the following portion of the network: a summation value S_i^{(l−1)} is fed into a single neuron x_i^{(l−1)} at layer l−1. From this neuron, g(S_i^{(l−1)}) is fed to the neurons at the layer above (l) by connection weights w. We calculate the partial derivative of the error corresponding to these particular weights with respect to the summation fed to x_i^{(l−1)} as

δ_i^{(l−1)} = ∂err(w) / ∂S_i^{(l−1)}   (66)
           = ∑_j (∂err(w)/∂S_j^{(l)}) (∂S_j^{(l)}/∂x_i^{(l−1)}) (∂x_i^{(l−1)}/∂S_i^{(l−1)})   (67)
           = ∑_j δ_j^{(l)} w_{ij}^{(l)} g′(S_i^{(l−1)})   (68)


where we've already calculated all the δ values in the layers above (i.e. we are somewhere along the backward pass).
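A compact sketch of the recursion in Eq. (68), δ_i^{(l−1)} = ∑_j δ_j^{(l)} w_{ij}^{(l)} g′(S_i^{(l−1)}), assuming a sigmoid activation; the layer sizes and numbers are made up.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def sigmoid_prime(s):
    return sigmoid(s) * (1.0 - sigmoid(s))

# delta_next[j] = delta_j^{(l)}, W[i, j] = w_{ij}^{(l)}, S_prev[i] = S_i^{(l-1)}
delta_next = np.array([0.2, -0.1])        # deltas already computed at layer l
W = np.array([[0.5, -0.3],
              [1.0, 0.8],
              [-0.7, 0.2]])               # weights from layer l-1 (3 units) to l (2 units)
S_prev = np.array([0.1, -0.4, 0.9])       # summed inputs at layer l-1

delta_prev = (W @ delta_next) * sigmoid_prime(S_prev)   # Eq. (68)
print(delta_prev)
```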

[Inspiration for] Convolutional Neural Networks. [1 hr into lec]. Reviews the biology of the brain/neuron/eye. Rods and cones are the eye's pixels; think of them as a 2D sheet of inputs. Such sheets can be thought of as 1D layers. A bipolar cell gets direct input (center input) from two photoreceptors, and gets indirect input (surround input) from a horizontal cell. "Disc where you're getting indirect input from the horizontal cell." Weights between neurons can be positive (excitatory) or negative (inhibitory). Assume the center input is excitatory and the surround input is inhibitory.

• A very small spot of light means the neuron fires; as you increase the size of the spot, inhibition from the surround cells kicks in, and its output is diminished. Uses the example of ON/OFF cells in the retinal ganglia. Neurons can only individually communicate positive values, but multiple neurons can "encode" negative values.

• Receptive Fields. The receptive field of a receptor is simply the area of the visual field from which light strikes that receptor. For any other cell in the visual system, the receptive field is determined by which receptors connect to the cell in question.

• Relation to Convolution. Consider convolving an image with a filter.

Each output unit gets the weighted sum of image pixels. The [-1, 0, 1] is a "weighting mask."


Machine Learning October 27

Convolutional Neural Networks

[started at 11:09]

Review. Neurons in the input layer are arranged in a rectangular array, as are the neurons in the next layer. There is local connectivity of neurons between layers (not fully connected).

Also recall that the receptive field of retinal ganglion cells can be modeled as a Difference of Gaussians:

G_σ(x, y) = (1 / (2πσ²)) e^{−r² / (2σ²)}   (69)

Convolution. The 3x3 convolution kernel [9] in the example below is like the w_{ij}. It does weighted combinations of input squares to map to the output squares. You keep the same w_{ij} kernel, which is related to shift invariance. Relation to the brain: think of the difference of Gaussians in the retinal ganglion as the mask (the w_{ij}).

• This makes the training problem easier, since it greatly reduces the number of parameters (the w_{ij}).

• Now, we increase complexity by increasing depth (more layers) rather than increasing the number of different parameters like fully-connected layers do.

• Long discussion of biological relevance. Skip to 50:00.

[9] Interchangeable terminology: conv kernel/filter/mask.


Convolutional Neural Networks. Consider input → middle → output rectangular layers. Neurons A and B in the middle layer "like" specific orientations of the inputs. Have a neuron C that does max(A, B). [General case: have some group of neurons in the middle layer that use the same mask (by definition) and some neuron in the next layer that takes a max of these neurons in its local neighborhood.]

• This has a slight shift-invariance, accomplished by the max operation. [56:00] This is called max pooling.

• Max pooling allows the output layer to have fewer neurons than previous layers. This is related to subsampling.

• Note that when we stack layers, make sure they introduce nonlinearities.

Introduce Yann Lecun (1989). Here we go over his architecture.

• 32x32 grid of pixels (digits).

• Convolution with 5x5 filters (masks) → 28x28 grid. Six such masks. Each mask gives 5² = 25 parameters + 1 for the bias.

• Stopped at [1 hour].
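A small sketch of the "same w_{ij} mask slid over the image" idea: a valid 2D convolution (really a cross-correlation, as most CNN libraries compute) of a toy image with a 3x3 weighting mask. The image and filter are placeholders.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 "image"
edge_filter = np.array([[-1.0, 0.0, 1.0]] * 3)     # a [-1, 0, 1]-style weighting mask
print(conv2d_valid(image, edge_filter))            # 4x4 output (like 32x32 -> 28x28 with 5x5)
```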


Machine Learning November 1

Kernel Methods

Motivation. Previously, we focused on mapping inputs X to Y via some function f. Neural nets learn features at the same time as learning the linear map. Now we are going to enter the portion of the course on

→ Kernels
→ Nearest Neighbors
→ Decision Trees

which have a different feel than standard linear maps. Today, we focus on kernels.

Review of the Optimization Problem. Our usual problem goes as follows: we have a data matrix X that is n by d. We want to solve the following (ridge regression) optimization problem:

min_w ||Xw − y||² + λ||w||²   (70)

whose solution can be written in one of the two equivalent forms:

w = (X^T X + λI_d)^{−1} X^T y   (71)
  = X^T (X X^T + λI_n)^{−1} y   (72)

where the key idea to notice is that (71) involves matrices like X^T X ∈ R^{d×d}, while (72) involves matrices like XX^T ∈ R^{n×n}.

If n ≫ d, then eq. (71) is preferred. If d ≫ n, then eq. (72) is preferred.


Quick Derivations

Comparing the optimization for w* with the form w* = X^T α*. This shows we can arrive at the two equivalent solutions (71), (72).

w* = (X^T X + λI_d)^{−1} X^T y  ↔  min_w (1/2)||Xw − y||² + (λ/2)||w||²   (73)

w* = X^T α* = X^T (XX^T + λI_n)^{−1} y  ↔  min_α (1/2)||XX^T α − y||² + (λ/2)||X^T α||²   (74)

Computing α* consists of computing XX^T (O(n²d)) and inverting the resulting n × n matrix (O(n³)). Computing w* consists of computing X^T X (O(nd²)) and inverting the resulting d × d matrix (O(d³)). So you might want to find α* when n ≪ d.

More detailed look at w∗ = XTα∗:

w* = ∑_{i=1}^n x_i [ (XX^T + µI_n)^{−1} y ]_i   (75)
   = ∑_{i=1}^n x_i ∑_{j=1}^d (u_j Λ^{−1} u_j^T) y_j   (76)
   ≜ ∑_{i=1}^n α_i x_i   (77)

where the u_j are eigenvectors of XX^T + µI_n and each y_j is a scalar in this problem.

General optimization problem. Beginning with the newest form for w* above, but now for generality we allow

w = ∑_{i=1}^n α_i x_i + v   where v^T x_i = 0 ∀i

and we end up finding that minimizing over α yields v = 0 (we did this in homework). The corresponding generalized optimization problem is

min_w ∑_{i=1}^n loss(w^T x_i, y_i) + λ||w||²   (78)

min_w ∑_{i=1}^n loss( ∑_{j=1}^n α_j x_j^T x_i, y_i ) + λ ∑_{j=1}^n ∑_{k=1}^n α_j α_k x_j^T x_k   (79)

which we see is now over just the n data points instead of the d dimensions.


• This changed our problem from d dimensions to n.
• Notice how this new form only depends on dot products of features and not the features [alone] themselves.

Big Idea: Kernels

New interpretation: we want to find coefficients α such that dissimilar inner products x_j^T x_i become small while similar inner products become large. Solution: the n by n Kernel matrix [a] [24:00], K_{ij} ≡ x_i^T x_j. With this new definition, our optimization problem can be rewritten as

min_α ∑_{i=1}^n loss( (Kα)_i, y_i ) + λ α^T K α   (80)

[a] Also known as the Gram matrix.

Kernel Trick. This new process of rewriting the optimization in the above form is an example of the Kernel trick, where we realize that instead of working with the features, we can replace all dot products with the kernel matrix. Also note that "everything has been reduced to dot products" since our new hypothesis function is just

f(x) = w^T x = ∑_{i=1}^n α_i x_i^T x   (81)

Kernel functions. Now, imagine we have a function k(x, z) that evaluates inner products. [We've been doing] Explicit lift: map x to a higher-dimensional space and then perform the kernel trick to map it back down to a smaller space. Now, we want a way to eliminate the middle step. Consider a kernel function on scalar points x, z that appears to lift them into a new d = 3 space:

k(x, z) = (1 + xz)²,   x, z ∈ R   (82)
        = (1, √2·x, x²)^T (1, √2·z, z²)   (83)

and we see that this is actually equivalent to the dot product in the lifted space. KEY POINT: we've actually skipped the process of lifting entirely!


Rather than explicitly (1) lifting the points into the higher-dimensional space, then (2) using the kernel trick of computing the dot product to bring them back into scalar space, we can just compute k(x, z) directly in one swoop.

Examples. Let x ∈ R^d; the number of quadratic monomials is O(d²). The quadratic kernel and linear kernel, respectively:

k(x, z) = (1 + x^T z)²   (84)
k(x, z) = x^T z   (85)

Radial Basis Functions. For the Gaussian kernel, our hypothesis function takes the form of an RBF:

f(x) = ∑_{i=1}^n α_i k(x_i, x)   (86)
     = ∑_{i=1}^n α_i exp[ −γ ||x_i − x||² ]   (87)

where the points closest to x will correspond with the largest exponentials and thus have their α_i contributing more to the overall sum. Remember how quickly exponentials fall off with distance: all we trust is the regions nearby.

• Large γ: approaches delta functions. The kernel matrix basically looks like the identity.
• Note that k(x, x) = 1 for ALL γ.
• Small γ: then exp(γ x^T z) ≈ 1 + γ x^T z, which approaches linear regression.
• How to pick γ? Cross-validation.
• If k_1, k_2 are kernels, then α_1 k_1 + α_2 k_2 is also a kernel, where α_1, α_2 > 0.

Note the feature space of the kernel has an infinite number of dimensions:

k(x, z) = exp(−γ||x − z||²)   (88)
        = exp(2γ x^T z) exp(−γ||x||²) exp(−γ||z||²)   (89)
        = ( ∑_{k=0}^∞ (2γ x^T z)^k / k! ) exp(−γ||x||²) exp(−γ||z||²)   (90)
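A short sketch of Eqs. (86)-(87): build the Gaussian (RBF) kernel on toy 1-D data and evaluate the RBF hypothesis f(x) = ∑_i α_i exp(−γ||x_i − x||²). The coefficients α and the value of γ below are arbitrary, not fitted.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

X = np.array([[0.0], [1.0], [2.5]])     # training points
alpha = np.array([1.0, -0.5, 2.0])      # arbitrary coefficients (not fitted here)
gamma = 0.5

x_query = np.array([[1.2]])
K_q = rbf_kernel(x_query, X, gamma)     # k(x, x_i) for each training point
print(K_q @ alpha)                      # f(x) = sum_i alpha_i k(x_i, x)   (Eq. 86)

# Large gamma: K on the training set approaches the identity (near-delta functions).
print(np.round(rbf_kernel(X, X, gamma=100.0), 3))
```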


Kernels - CS229

Recap. For convenience, I've rewritten our general optimization problem below, fully expanded out.

min_w ∑_{i=1}^n loss( ∑_{j=1}^n α_j x_j^T x_i, y_i ) + λ ∑_{j=1}^n ∑_{k=1}^n α_j α_k x_j^T x_k   (91)

It is crucial to notice that our algorithm can now be written entirely in terms of the inner products of the training data. Since this is the general case, we've made no assumptions about where the x_i came from; therefore, we are free to substitute φ(x_i) in for all the x_i. This is permissible since φ(x) is nothing more than a feature map to a different-dimensional space, e.g.

φ(x) = [x  x²  x³]^T   (92)

Given some feature mapping φ, define the corresponding Kernel as

K(x, z) = φ(x)^T φ(z)   (93)

which will now replace any instance of the inner product ⟨x, z⟩ in our algorithm.

Often, K(x, z) may be very inexpensive to calculate, even though φ(x) itself may be very expensive to calculate (perhaps because it is an extremely high-dimensional vector).

Example 1. Let our kernel be

K(x, z) = (x^T z)²   (94)
        = ∑_{i,j=1}^n (x_i x_j)(z_i z_j)   (95)

which allows us to see that K(x, z) = φ(x)^T φ(z), where φ(x) is a vector consisting of all pairwise products x_i x_j, 1 ≤ i ≤ n, 1 ≤ j ≤ n. Important: notice that finding K(x, z) takes O(n) time, as opposed to φ(x) taking O(n²) time.


The Kernel matrix. Suppose K is a valid kernel corresponding to some mapping φ. Consider some finite set of m points [10] x^{(1)}, …, x^{(m)} and define a square m by m matrix K such that

K_{ij} ≜ K(x^{(i)}, x^{(j)}) = φ(x^{(i)})^T φ(x^{(j)}) = φ(x^{(j)})^T φ(x^{(i)}) = K_{ji}   (96)

This matrix is called the Kernel matrix, symmetric by definition [11].

Theorem (Mercer): Let K : R^n × R^n → R be given. Then for K to be a valid kernel, it is necessary and sufficient that for any x^{(1)}, …, x^{(m)} (m < ∞), the corresponding kernel matrix is symmetric and positive semi-definite.

[10] Not necessarily the training set.
[11] It is also p.s.d. For any vector z, we have

z^T K z = ∑_i ∑_j z_i K_{ij} z_j   (97)
       = ∑_i ∑_j ∑_k z_i φ_k(x^{(i)}) φ_k(x^{(j)}) z_j   (98)
       = ∑_k ( ∑_i z_i φ_k(x^{(i)}) )²   (99)
       ≥ 0   (100)


Kernels - Discussion 10

Transforming the Risk Function. Begin with the risk function J we want to minimize over w:

J(w) = (φ(X)w − y)² + λ|w|²   (101)

In the familiar case where φ(X) = X, we found that we could represent the optimal w* as ∑_i α_i x_i = X^T α. More generally, we can let w* = φ(X)^T α. Substituting this into the risk function yields

J(α) = (φ(X)φ(X)^T α − y)² + λ|φ(X)^T α|²   (102)
     = (Kα − y)² + λ α^T K α   (103)

which is now a minimization over α, where [12] K = φ(X)φ(X)^T. The closed-form solution, along with how to use it for classifying some test set X_test, is:

α* = (K + λI_n)^{−1} y   (104)
y = φ(X_test) w = φ(X_test) φ(X_train)^T α* = K α*   (105)

Kernel Trick. For many feature maps φ, we can find an equivalent expression that avoids having to actually compute the feature map φ. For example, we can write the quadratic feature map φ(x) = [x_1²  x_2²  √2 x_1 x_2  √2 x_1  √2 x_2  1]^T in the form

K = φ(X)φ(X)^T = (1 + XX^T)²   (106)

[12] Note that this is consistent with K_{ij} = k(x^{(i)}, x^{(j)}) = φ(x^{(i)})^T φ(x^{(j)}).
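A sketch of the closed form (104)-(105) using the quadratic kernel of Eq. (106); the training data, test points, and λ below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(40, 2))
y_train = np.sin(3 * X_train[:, 0]) + X_train[:, 1] ** 2
X_test = rng.uniform(-1, 1, size=(5, 2))

def quad_kernel(A, B):
    return (1.0 + A @ B.T) ** 2          # K = (1 + X X^T)^2, as in Eq. (106)

lam = 0.1
K = quad_kernel(X_train, X_train)
alpha = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)   # Eq. (104)
y_pred = quad_kernel(X_test, X_train) @ alpha                      # Eq. (105)
print(y_pred)
```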


Machine Learning November 3

Nearest Neighbors

[started at 2:17]

Key idea: Just store all training examples 〈xi, f(xi)〉.

• (Single) NN: Given a query x_q, locate the nearest training example x_n, then estimate f(x_q) ← f(x_n).
• KNN: Given x_q, vote among its k nearest neighbors (if discrete-valued target function).

Nearest-Neighbor Rule. For non-parametric models, the number of parameters grows with the number of training examples, as opposed to parametric models, where it is independent of training set size.

For KNN, the effective number of parameters is N/k,

which is still considered non-parametric because it grows with N. Sidenote: the SVM is parametric, while the kernel SVM is non-parametric.

Classification with NN. The NN classifier allows you to be much more creative about the decision boundary (as opposed to, say, a linear classifier). If N = 2 the decision boundary is a line. For higher N, the decision boundary is described by a Voronoi diagram. The union of all the Voronoi cells associated with class A gives all locations that would be classified as A.


Analysis/Performance. Let ε*(x) denote the error of the optimal prediction, and let ε_NN(x) denote the error of the (single) nearest-neighbor classifier. Then

lim_{n→∞} ε_NN ≤ 2ε*   (107)
lim_{n→∞, k→∞, k/n→0} ε_kNN = ε*   (108)

which means it can be rather hard to outperform kNN when we have extremely large datasets.

• Advantages: No training needed. Learns complex functions easily (no assumptions on the shape of the target function). Doesn't lose information, i.e. it has a nice way of adjusting to the data; it is not limited/restricted by the number of parameters.

→ Categories are overrated, because they reduce our ability to adjust to new data. kNN can, in a sense, create new categories on the fly when needed.

• Disadvantages: Slow at query time. Lots of storage. Easily fooled by irrelevant attributes (related to the curse of dimensionality).

Dimensionality. Remedies: (1) get more data, increase N; (2) decrease the size of the space d ("pushing" points together). Bag-of-words models, which have the effect of marginalizing histograms, can be useful in greatly reducing dimensionality.

Learning better features. ML in general can basically be thought of as (1) learn a feature, and then (2) apply a kNN or linear classifier. For example, convolutional neural networks learn the important features of an image by embedding the pixels in a lower-dimensional space where they can essentially classify via kNN.
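A minimal k-nearest-neighbors classifier matching the "store everything, vote among the k nearest" description above; the Euclidean distance, toy data, and choice of k are my own assumptions for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Vote among the k training points nearest to x_query (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9], [0.9, 1.2]])
y_train = np.array([0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.0]), k=3))   # -> 1
```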


Machine Learning November 8

Decision Trees

Terminology. Inputs x need not be real-valued. Two types of discrete variables of interest are

→ Categorical/Nominal: have 2 or more categories but there is no intrinsic order. Examples: hair color, zip code, gender, types of housing.

→ Ordinal: discrete but there is a natural order. Example: education (HS - college - grad school). One can often convert these to real numbers (e.g. years of education).

Entropy and Information.

• Consider some random variable like the result of a biased coin toss. Let P(H) = p. If p = 1, there is no information gain when we flip the coin and observe heads. We are interested in maximizing the information gain.

• Now consider the general case of an n-sided coin. When does knowing about its outcome give us the most information? A: when p = 1/n. Introduce the notion of Information/Surprise:

I(p) ≜ −log_b(p)   (109)

where p is the probability of the event happening, and b is the branching factor (we almost always have b = 2). Knowing that a rare event has occurred is more informative.

• Entropy: the expected value of the information:

Entropy = −∑_{i=1}^n p_i log_b(p_i)   (110)

which is maximized when all n values are equally likely. One way of maximizing entropy is through the use of Lagrange multipliers.


Training a Decision Tree. Assume we have data x_1, …, x_n and desired output y, where each x_i is d-dimensional. Ideally, leaves should be pure cases, meaning all outcomes at a leaf [from the training set] belong to a single class. This approach can be used for both classification and regression.

• Training time: Construct the tree by picking the "questions" at each node of the tree. This is done to decrease entropy with each level of the tree. A single leaf node is then associated with a set of training examples. [13]

• At test time, we are given a new example that we drop in at the root, and it ends up at one of the leaves. What y should we then predict? Predict the majority vote from the set of values given at the end node. Below, the end nodes only give one value, so we would just use that value as our prediction.

[Figure: example decision tree. The root asks "f_1(x) > thresh?"; the YES branch leads to the question "f_2(x) > thresh?", whose YES and NO branches end in the leaves y = 1 and y = 0; the NO branch continues to further questions (...).]

If we make the decision tree deep enough, we can guarantee zero training error (bad; overfitting). We can reduce overfitting by pruning or averaging trees together [50:00]. Suppose at test time you land at a leaf containing {y = 0, y = 1}, meaning "the posterior probability that y = 1 is 50 percent." Decision trees give a way of computing posterior probabilities from the empirical distribution at a leaf.

Random Forests. Train multiple trees for solving the same problem. Each tree will give a value for the posterior probabilities. Averaging the posterior probabilities over many trees is the basic idea of a random forest (ensembling decision trees) [14].

• Boosting: rather than just random averaging, assign weights to each of the trees in a certain way. Disadvantage: if the labels have noise, then we may end up with more weight on wrong examples. This is opposed to the case of random averaging, which is very robust to noise.

[13] Similar to KNN, where the distance function is the answer to these questions.
[14] Factoid: ensembling trees gives a better performance gain than ensembling neural nets.


Creating the Trees. At the leaves, we want low entropy. Entropies are computed from empirical frequencies, not model probabilities. Example: 256 countries that are all equally likely; then the highest possible entropy is 8, which is at the root of the tree. Creation procedure:

1. Begin at the root, with entropy H. Say our questions are T/F, meaning binary branching [15].

[Figure: the root node (entropy H, question Q, size n) splits into two children: the "+" branch with size n_1 and entropy H_+, and the "−" branch with size n_2 and entropy H_−.]

2. Calculate the next layer's entropy. Our goal is to reduce the entropy as we go down the tree. Take the average over all nodes in the layer. Here, the entropy of the second layer will be

(n_1 H_+ + n_2 H_−) / (n_1 + n_2)

where our diagram above shows the 1st (root) and second layers.

3. Calculate the information gain. The information gain for asking this question is the difference in entropy:

I_{1→2} := H − (n_1 H_+ + n_2 H_−) / (n_1 + n_2)   (111)

15 We denote yes as +, and no as −. The H values are the entropy at the given node.
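To make the entropy and information-gain computations above concrete, here is a minimal sketch in Python/NumPy. The helper names (entropy, information_gain) and the toy split are made up for illustration; only equations (110) and (111) come from the notes.

```python
import numpy as np

def entropy(labels, b=2):
    """Empirical entropy of a label array, using log base b (equation 110)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p) / np.log(b))

def information_gain(parent, left, right):
    """I_{1->2} = H(parent) minus the size-weighted average entropy of the children (equation 111)."""
    n1, n2 = len(left), len(right)
    child_entropy = (n1 * entropy(left) + n2 * entropy(right)) / (n1 + n2)
    return entropy(parent) - child_entropy

# Toy example: a perfectly separating split has maximal information gain.
parent = np.array([0, 0, 0, 1, 1, 1])
print(entropy(parent))                                   # 1.0 bit
print(information_gain(parent, parent[:3], parent[3:]))  # 1.0 bit
```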


ISLR - Tree-Based Methods

For classification trees, we predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs. Sometimes we also want to know the class proportions among the training observations that fall into a given region. To grow the tree, use recursive binary splitting. The two most common criteria used for making the binary splits are the Gini index and the cross-entropy, defined respectively as

G = ∑_{k=1}^{K} p_{mk}(1 − p_{mk})   (112)

D = −∑_{k=1}^{K} p_{mk} log p_{mk}   (113)

where the subscripts correspond to the mth region and kth class. These are used to evaluate the quality of a particular split when building a classification tree.


Machine Learning November 10

Decision Trees II
Written by Brandon McKinzie

Recap.

• Advantages of decision trees: can use a mix of feature types, and the output is very explainable.
• Basic Tree Construction: While building the tree, at each level, choose the question such that the entropy over the children is minimized.16 This is a greedy approach, meaning that we pick the [single] question that maximizes IG for our next step only.
• Generating different trees: With the previous procedure, we would always get the same tree. One possible remedy: choose a random subset of features at each level to optimize question-asking [randomization].
→ Example: Randomization. Suppose our randomly chosen feature [indices] are 1, 4, 6. Then we repeat the following for each index i ∈ {1, 4, 6}: choose a threshold such that the question x_i > thresh maximizes the information gain for this feature. After we have our 3 best questions, one for each of these features, we then pick the question out of these that maximizes the IG.
→ Example: Nominal Feature. Say the feature is hair color, which can be black, brown, or red. Possible questions, where h denotes some sample's hair color:

h ∈ {black, brown}?   h ∈ {red, black}?   h ∈ {brown}?

Digit Recognition. Suppose we have three trees, denoted A, B, C, that each give us their prediction, say P_A(i), 0 ≤ i ≤ 9, for tree A, for some test image of a digit. Our prediction for some sample point is just the average

P(i) := (1/3) ∑_{t ∈ {A,B,C}} P_t(i)

16 Another way of saying this is "we want to maximize the purity of the children."


Boosting::Motivation. Assign weights to each of our trees based on how much we "trust" their classification abilities.17 How do we decide which classifiers are "experts"? Based on their test performance. We also want classifiers to be experts on specific examples as well.

Boosting::Terminology.

→ Weak Learner: a classifier that is correct more than 50% of the time.
→ Error rate: denoted ε_t.
→ Weighting factor: for tree t, denoted

α_t := (1/2) ln((1 − ε_t) / ε_t)

which is computed during the training process of classifier t (online).

AdaBoost Algorithm. An approach to machine learning based on the idea of creating a highly accurate prediction rule by combining many relatively weak and inaccurate rules [Schapire]. Procedure: Assume we've been given a labeled training set (x_i, y_i), 1 ≤ i ≤ m. First, initialize D_1(i) = 1/m ∀i. Then, for each of T total classifiers t = 1, . . . , T, do:

• Train classifier t using distribution D_t.
• This yields its corresponding hypothesis function h_t. The aim is to select the h_t with the lowest weighted error ε_t := Pr_{i∼D_t}(h_t(x_i) ≠ y_i).
• Compute α_t from the weighting-factor formula above.
• Update D_{t+1}(i) for each sample i as a lookup table defined as

D_{t+1}(i) = (D_t(i) / Z_t) × { e^{−α_t} if h_t(x_i) = y_i;  e^{α_t} if h_t(x_i) ≠ y_i }   (114)

where Z_t is just a normalization factor.

and, upon completion, output the final hypothesis as the combination rule over classifiers

h(x) := ∑_{t=1}^{T} α_t h_t(x)   (115)

17 Analogous to asking a bunch of people questions but weighing the opinions of experts higher.
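A minimal Python/NumPy sketch of the procedure above, assuming binary labels y_i ∈ {−1, +1} and decision stumps as the weak learners; the stump-fitting helper is a simplification for illustration and is not from the notes.

```python
import numpy as np

def fit_stump(X, y, D):
    """Pick the single-feature threshold stump with lowest weighted error under D."""
    best = (np.inf, None)
    for j in range(X.shape[1]):
        for thresh in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = sign * np.where(X[:, j] > thresh, 1, -1)
                err = D[pred != y].sum()
                if err < best[0]:
                    best = (err, (j, thresh, sign))
    return best

def adaboost(X, y, T=10):
    m = len(y)
    D = np.full(m, 1.0 / m)                      # D_1(i) = 1/m
    stumps, alphas = [], []
    for t in range(T):
        eps, (j, thresh, sign) = fit_stump(X, y, D)
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
        pred = sign * np.where(X[:, j] > thresh, 1, -1)
        D = D * np.exp(-alpha * y * pred)        # equation (114): down-weight correct, up-weight wrong
        D /= D.sum()                             # Z_t normalization
        stumps.append((j, thresh, sign)); alphas.append(alpha)
    def h(Xnew):                                 # equation (115): weighted combination (take sign to classify)
        scores = sum(a * s * np.where(Xnew[:, j] > th, 1, -1)
                     for a, (j, th, s) in zip(alphas, stumps))
        return np.sign(scores)
    return h
```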


Misc. Below is how Jitendra described this in lecture, which is more informal than the above. [52:00]

• What we instruct the classifiers to do. Tell the 2nd classifier to work on the hard problems that the first classifier screwed up on. Every new classifier in the sequence works harder and harder, meaning they're looking at samples the previous classifier got wrong, i.e. making experts.

• Have a probability/weighting distribution over samples. Initially, it's equal weight over all training data. But in round 2, the distribution gives more weight to examples that the previous classifier got wrong. This probability distribution over examples is denoted D_t(i), where examples are indexed by i, and

∑_i D_t(i) = 1

• A big drawback of boosting comes from label noise. Simple averaging, however, does not have the same issue. Boosting could repeatedly up-weight the badly labeled examples, making things worse and worse.


Machine Learning November 15

Unsupervised Learning
Written by Brandon McKinzie

Basic Idea. We have some set of data, and we want to find a mapping into . . . "no idea." Not all data is useful, and we explore two techniques for dealing with this: dimensionality reduction and clustering.

• Dimensionality Reduction. Can we compress the dimension of our dataset? Reducing d reduces runtime and storage, and helps generalization and interpretability. Note that we can choose to reduce n, d, or both.
• Clustering. Reducing n reduces runtime and helps with "understanding archetypes",18 segmentation, and outlier removal. Segmentation: figuring out what features are relevant/not relevant in terms of the examples.

Most unsupervised learning methods appeal to SVD, or at least to matrix factorization.

X_{d,n} = A_{d,r} B_{r,n}   (116)

where we might imagine A as tall and thin, and B as short and wide. This is relevant, e.g., in the case where we want to see if our data matrix X can be written as a product using only a subset of r of the n examples.

18 a.k.a. understanding what types of examples are representative.


Singular Value Decomposition. So, how do we factorize matrices? By using Singular Value Decomposition (SVD): Every matrix X ∈ R^{d×n}, where n ≥ d, admits a factorization

X_{d,n} = U_{d,d} S_{d,d} V_{n,d}^T   where   (117)
U^T U = I_d,   V^T V = I_d,   (118)
S = diag(σ_i),   σ_1 ≥ σ_2 ≥ . . . ≥ σ_d ≥ 0   (119)

and the σ_i are the singular values, and the columns of U, V are the singular vectors.

• Interpretation. An equivalent way of writing this, as well as multiplication by a vector z:

X = ∑_{i=1}^{d} σ_i u_i v_i^T   and   Xz = ∑_{i=1}^{d} σ_i u_i (v_i^T z)   (120)

which means we (1) find the amount that z lies in the direction of v_i, (2) scale it by σ_i, and (3) multiply the output by u_i.

• Singular values vs. Eigenvalues. [35:00] Notice that v_i is not an eigenvector of X. However, v_i is an eigenvector of X^T X with eigenvalue σ_i^2.

X v_i = σ_i u_i   (121)
X^T X v_i = σ_i X^T u_i = σ_i^2 v_i   (122)
X X^T u_i = σ_i^2 u_i   (123)


Computation and Examples. In general, we use the following relation.

X X^T = U S V^T V S U^T = U S^2 U^T   (124)
X^T X = V S^2 V^T   (125)

→ Symmetric PSD matrices. Consider some symmetric, psd matrix A, which means its eigenvalues are non-negative. This means we can write the eigenvalue decomposition A = W Λ W^T, with W W^T = I and Λ diagonal with entries λ_i. In this case, the eigenvalue decomposition is the same as the SVD, which is always the case for symmetric psd matrices.

→ Symmetric only. (σ_i = |λ_i|) Now consider just some symmetric B, but not necessarily psd. B = W Λ W^T is diagonalizable, with W^T W = I. The eigenvalues of a symmetric matrix are real, but might not be positive. [50:00]. How to get the decomposition?

– Suppose the first k entries of Λ are non-negative. Consider the diagonal Γ with

Γ_{ii} = { 1 if i ≤ k;  −1 otherwise }   (126)

which means ΛΓ is psd and WΓ is orthogonal. Our decomposition is thus

B = W (ΛΓ) (WΓ)^T   (127)

→ General Square Matrix. For an arbitrary square matrix, there is no general relationship between its singular values and eigenvalues. As an example, consider

C = [ 1  10^{12} ;  0  1 ]

Its eigenvalues are clearly both 1. Its singular values are approximately 10^{12} and 10^{−12}. Also, note that

max_z ||Cz|| = σ_1   where ||z|| = 1   (128)

→ Rank-related discussion starts at [1:00:00]. Lecture ends 8 minutes later.


SVD - Independent Notes

Background Material.

• Eigendecomposition. Let A be a square N × N matrix with N linearly independent eigenvectors q_i. Then A can be factorized as A = Q Λ Q^{−1}, where Q is the square N × N matrix whose ith column is the eigenvector q_i of A.

SVD Definition. Suppose M is an m × n matrix with either real or complex entries. Then there exists a factorization, called an SVD of M, of the form

M = UΣV T (129)

where

• U is an m × m unitary matrix.19
• Σ is a diagonal m × n matrix with non-negative real numbers on the diagonal. The diagonal entries σ_i are the singular values of M. The convention is to list the singular values in descending order.
• V^T is an n × n unitary matrix.

Relation to eigenvalue decomp. The SVD is general, while the eigenvalue decomposition is a special case that exists only for certain square matrices.

19 Unitary: a square matrix whose conjugate transpose equals its inverse.


Proving the outer product summation of X to myself. X = ∑_{i=1}^{d} σ_i u_i v_i^T:

U = [u_1  u_2  · · ·  u_d]   (130)

V = [v_1  v_2  · · ·  v_d]  ↔  V^T = [v_1^T; v_2^T; . . . ; v_d^T]   (131)

U S V^T = [u_1  u_2  · · ·  u_d] diag(σ_1, σ_2, . . . , σ_d) [v_1^T; v_2^T; . . . ; v_d^T]   (132)

= [σ_1 u_1  σ_2 u_2  · · ·  σ_d u_d] [v_1^T; v_2^T; . . . ; v_d^T]   (133)

= ∑_{i=1}^{d} σ_i u_i v_i^T   (134)


Machine Learning November 17

Principal Component Analysis
Written by Brandon McKinzie

SVD Review.

• Econ-Sized SVD: If all σ_{r+i} = 0 in the S matrix, we can just use the nonzero diagonals (the first r): X = U_r S_r V_r^T, where20 we are still assuming that X is d × n. This allows us to store a d × n matrix with only d·r + r + r·n entries (from each of U_r, S_r, V_r).
• Dimensionality Reduction. Can reduce the problem to r dimensions without losing any classification power with

X̃ := U_r^T X = S_r V_r^T   (135)

If our classifier is of the w^T x kind, then notice we can let w = U_r α, which means we now classify with w^T x = α^T U_r^T x.

PCA Algorithm. Our input is the matrix X of shape d × n.

1. Centering. Subtract out the means21 such that our new data matrix is

X_c = [x_1 − μ_x   x_2 − μ_x   . . .   x_n − μ_x]   (136)

2. SVD. Compute SVD(X_c) = U S V^T.

3. Return [X̃ = S_r V_r^T, U_r, μ_x].

20In case it’s not obvious, dimensions of Ur and Vr are d× r and n× r, respectively.21We will assume this has been done for the rest of lecture.


View 1: Maximizing Variance. We want the variance of the low-dimensional X̃ to be as close as possible to that of the actual X. For example, if we choose r = 1, then we want to find the direction of maximum variance. Note: AFAIK, u here is not related to U from the SVD. Or not?

max_{||u||=1} Var(u^T x_i)   (137)

Subtract out the first principal component22 from x_i:

x̃_i = x_i − (u_1^T x_i) u_1   (138)

so X̃ is centered and orthogonal to u_1. Furthermore, σ_1(X̃) = σ_2(X).

Comparing variance. The total variance of the first r principal components, and the total variance in X, are, respectively,

∑_{i=1}^{r} σ_i(X)^2   and   ∑_{i=1}^{d} σ_i(X)^2   (139)

View 2: Maximum Likelihood. Assume our data can be expressed as

X_i = α_i z + ω_i   (140)

• z ∈ R^d is unknown, not random.
• α_i ∈ R is unknown, not random.
• ω_i ∼ N(0, σ^2 I_d): added white noise.

Our goal is to find z. Do this via

max_{α,z} log p(X; α, z)   (141)
= −(1/(2σ^2)) ∑_{i=1}^{n} ||x_i − α_i z||^2 + [unimportant]   (142)
⟺ min_{α,z} ||X − z α^T||_F^2   (143)

where the optimal solution is

z α^T = σ_1 u_1 v_1^T   (144)

in which z and α individually aren't unique, but their product is.

22Can prove it’s “gone” by evaluating uT1 xi (it is zero)


min_{A ∈ R^{n×k}, Z ∈ R^{d×k}} ||X − Z A^T||_F^2   (145)

View 3: Minimizing Projection Error [49:00]. Thinking more like regression.

dist(x_i, L(w)) = || x_i − (w^T x_i / ||w||^2) w ||   (146)

Latent Factor Analysis. [1:03:00]. Uses PCA to recognize relationships between terms and concepts by applying dimensionality reduction to a term-document matrix. Each row is a document and each column is a word, so X_ij represents the number of occurrences of word j in document i. We see that this term-document matrix X is effectively a bag-of-words model, representing an unstructured piece of text.


PCA - Andrew Ng’s CS 229 Notes

Data Pre-Processing. Before running PCA, ensure the following is done:

1. Let μ = (1/n) ∑_{i=1}^{n} x^{(i)}.

2. Replace each x^{(i)} with x^{(i)} − μ.

3. Let σ_j^2 = (1/n) ∑_{i=1}^{n} (x_j^{(i)})^2 be the variance of the jth feature.

4. Replace each x_j^{(i)} with x_j^{(i)} / σ_j.

Intuition: Major Axis of Variation. [One approach is to] choose the unit vector u so that when the data is projected onto the direction corresponding to u, the variance of the projected data is maximized. Note that the length of the projection of a data point x onto u is given by23 x^T u. Also note that the new projected vector24 is (x^T u) u.

Maximizing Variance. We can formalize the previous paragraph with the following optimization problem.

u ← argmax_{||u||=1} [ (1/n) ∑_{i=1}^{n} (x^{(i)T} u)^2 ]   (147)
= argmax_{||u||=1} u^T ( (1/n) ∑_{i=1}^{n} x^{(i)} x^{(i)T} ) u   (148)

It should be clear that this will return the maximal (principal) eigenvector of Σ, which is the quantity in the big parentheses.25

23 x^T u = ||x|| ||u|| cos θ = x_∥, where x_∥ is the component of x along u.
24 Remember that our data is now mean-zero.
25 If this is not obvious, you need to review.


Recap. We have found that if we wish to find a 1-dimensional subspace with which to approximate the data, we should choose u to be the principal eigenvector of Σ. More generally, if we wish to project our data into a k-dimensional subspace (k < n), we should choose u_1, . . . , u_k to be the top k eigenvectors of Σ. The u_i's now form a new, orthogonal basis for the data. Because Σ is symmetric, the u_i's will (or always can be chosen to be) orthogonal to each other. Finally, to represent x^{(i)} in the new k-dimensional subspace, we project it onto the k principal components we found.

x̃^{(i)} ← [u_1^T x^{(i)};  u_2^T x^{(i)};  . . . ;  u_k^T x^{(i)}]   where x̃^{(i)} ∈ R^k   (149)

Properties of PCA.

• PCA can be derived by picking the basis that minimizes the approximation error arising from projecting the data onto the k-dimensional subspace spanned by them.

• More to come . . .


Machine Learning November 22

Clustering
Written by Brandon McKinzie

[started at 5:42; goal: end at 7:00]

Introduction.

• Why cluster? (1) Find archetypes, (2) segmentation, (3) faster lookups.
• Approaches. (1) k-means [quantization], (2) agglomeration [hierarchy], (3) spectral [segmentation].

K-Means. Partition the data points into k clusters. Let μ_j denote the mean of the points in cluster j, where a point x_i is in C_j if

||x_i − μ_j||^2 ≤ ||x_i − μ_{j'}||^2   for all j'   (150)

μ_j = (1/|C_j|) ∑_{i ∈ C_j} x_i   (151)

and our optimization problem is to minimize the differences between points and the k means:

min_{μ_1,...,μ_k} ∑_{i=1}^{n} min_j ||x_i − μ_j||^2   (152)

The Algorithm. Initialize μ_1, . . . , μ_k by a strategy of choice. Repeat the following until convergence (i.e. assignments stop changing).

1. Set i ∈ C_j if ||x_i − μ_j||^2 ≤ ||x_i − μ_{j'}||^2 for all 1 ≤ j' ≤ k.

2. Set μ_j = (1/|C_j|) ∑_{i ∈ C_j} x_i for j = 1, . . . , k.
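The two alternating steps above translate directly into NumPy; this is a minimal sketch (random initialization shown here for brevity; the k-means++ initialization is described next), with illustrative variable names.

```python
import numpy as np

def kmeans(X, k, n_iters=100, rng=None):
    """X has shape (n, d). Alternates the assignment and mean-update steps."""
    rng = rng or np.random.default_rng(0)
    mu = X[rng.choice(len(X), size=k, replace=False)]     # initialize mu_1..mu_k
    for _ in range(n_iters):
        # Step 1: assign each point to its closest mean.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (n, k)
        assign = dists.argmin(axis=1)
        # Step 2: recompute each mean over its cluster.
        new_mu = np.array([X[assign == j].mean(axis=0) if np.any(assign == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):                        # assignments (and means) stopped changing
            break
        mu = new_mu
    return mu, assign
```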


Initializing μ_i: k-means++. Procedure:

1. Choose μ_1 at random from the data positions x_1, . . . , x_n.

2. For c = 1, . . . , k − 1, do

(a) Set d_i = min_j ||x_i − μ_j||^2 for all i = 1, . . . , n.

(b) Set μ_{c+1} = x_i with probability p_i = d_i / ∑_i d_i.
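A sketch of the k-means++ initialization, assuming NumPy; it can be dropped in as the initializer for the k-means sketch above.

```python
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """Return k initial means chosen by the k-means++ rule (X has shape (n, d))."""
    rng = rng or np.random.default_rng(0)
    mu = [X[rng.integers(len(X))]]                       # step 1: first mean uniformly at random
    for _ in range(k - 1):
        d = ((X[:, None, :] - np.array(mu)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        p = d / d.sum()                                  # step 2(b): probability proportional to d_i
        mu.append(X[rng.choice(len(X), p=p)])
    return np.array(mu)
```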

Hierarchical Clustering. An alternative approach which doesn't require us to specify the number of clusters K beforehand. It results in a tree-based representation of the observations, called a dendrogram. First, we define the four most commonly used types of linkage (dissimilarity between two groups of observations).

• Complete linkage. The largest of all pairwise dissimilarities between observations in cluster A and in B (maximal intercluster dissimilarity).
• Single linkage. . . . smallest . . . (minimal . . . ).
• Average linkage. . . . average . . . (mean . . . ).
• Centroid linkage. Dissimilarity between the centroids of clusters A and B.
• Summary and equations for each:

[COMPLETE]  d(A,B) = max{dist(x, z) : x ∈ A, z ∈ B}   (153)
[SINGLE]    d(A,B) = min{dist(x, z) : x ∈ A, z ∈ B}   (154)
[AVERAGE]   d(A,B) = (1/(|A| |B|)) ∑_{x ∈ A, z ∈ B} dist(x, z)   (155)
[CENTROID]  d(A,B) = dist(μ_A, μ_B)   (156)

Now, we can give an example via a greedy algorithm.

1. Initialize with n clusters, C_i = {x_i} for i = 1, . . . , n.

2. Repeat the following until there is one giant cluster:

(a) For all pairs of clusters (A, B), compute d(A,B).

(b) For the minimum pair, link C_new = A ∪ B.
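A naive sketch of the greedy agglomeration loop, assuming NumPy and single linkage (equation 154); a real implementation would stop at a desired number of clusters and record the merges to build the dendrogram.

```python
import numpy as np

def single_linkage(A, B):
    """d(A, B) = smallest pairwise Euclidean distance between the two groups (eq. 154)."""
    return min(np.linalg.norm(x - z) for x in A for z in B)

def agglomerate(X, num_clusters=1):
    clusters = [[x] for x in X]                    # start with n singleton clusters
    while len(clusters) > num_clusters:
        # Find the pair of clusters with minimal linkage and merge them.
        pairs = [(a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))]
        a, b = min(pairs, key=lambda ab: single_linkage(clusters[ab[0]], clusters[ab[1]]))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return clusters
```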


Spectral Clustering.26 View the data as a graph, where the nodes are x_1, . . . , x_n and the edge weight w_ij is the similarity sim(x_i, x_j). Some common similarity measures are

sim(x_i, x_j) = x_i^T x_j / (||x_i|| ||x_j||)   (157)
sim(x_i, x_j) = k(x_i, x_j)   (158)
sim(x_i, x_j) = { 1 if ||x_i − x_j|| < D_0;  0 otherwise }   (159)

Now clustering is the process of graph partitioning. We start with a fully connected graph, and want to remove as few edges as possible. [56:00]

• Partition. We define a partition V_1, V_2 of our node set V as two sets where

V_1 ∪ V_2 = V   and   V_1 ∩ V_2 = ∅   (160)

• Cut set. The amount of weight we've cut. It is just the sum of the connections that were previously between V_1 and V_2 and that we've now cut out after partitioning.

cut(V_1, V_2) = ∑_{i ∈ V_1, j ∈ V_2} w_ij   (161)

• Optimization problem. Define a balanced cut

min cut(V_1, V_2)   subject to   |V_1| = |V_2| = n/2   (162)

which is (very) NP-hard: 2^N different splits; a combinatorial problem.

Graph Laplacian. [1:05] Define a cut indicator (vector) v ∈ R^n where

v_i = { 1 if i ∈ V_1;  −1 if i ∈ V_2 }   (163)

26 Page 544 of ESL.


cut(V_1, V_2) = (1/4) ∑_{i=1}^{n} ∑_{j=1}^{n} w_ij (1 − v_i v_j)   (164)

= (1/4) v^T L v   (165)

L_ij = { −w_ij if i ≠ j;  ∑_l w_il if i = j }   (166)

min_v (1/4) v^T L v   (167)
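A small NumPy check of the identity cut(V_1, V_2) = (1/4) v^T L v, assuming a random symmetric weight matrix; the spectral relaxation (not covered beyond this point in the notes) would replace the ±1 vector v with an eigenvector of L.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
W = rng.random((n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)   # symmetric weights, no self-loops

L = np.diag(W.sum(axis=1)) - W            # L_ij = -w_ij off-diagonal, sum_l w_il on the diagonal (eq. 166)

v = np.array([1, 1, 1, -1, -1, -1])       # cut indicator: first three nodes in V1, rest in V2 (eq. 163)
cut = W[v == 1][:, v == -1].sum()         # total weight crossing the partition (eq. 161)

print(np.isclose(cut, 0.25 * v @ L @ v))  # equation (165): True
```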


Shewchuck Notes

Contents

2.1 Perceptron Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

2.1.1 Perceptron Algorithm . . . . . . . . . . . . . . . . . . . . . . . 63

2.1.2 Hyperplanes with Perceptron . . . . . . . . . . . . . . . . . . 64

2.1.3 Algorithm: Gradient Descent . . . . . . . . . . . . . . . . . . 64

2.1.4 Algorithm: Stochastic GD . . . . . . . . . . . . . . . . . . . . 65

2.1.5 Maximum Margin Classifiers . . . . . . . . . . . . . . . . . . 65

2.2 Soft-Margin SVMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

2.3 Decision Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

2.4 Gaussian Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . 70

2.4.1 Quadratic Discriminant Analysis (QDA) . . . . . . . . . . . 70

2.5 Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

2.6 Justifications & Bias-Variance (12) . . . . . . . . . . . . . . . . . . . . . 71


Shewchuck Notes

Perceptron Learning
Written by Brandon McKinzie

Perceptron Algorithm

• Consider n sample points X1, . . . , Xn.

• For each sample point, let

y_i = { 1 if X_i ∈ class C;  −1 if X_i ∉ class C }

• Goal: Find weights w that satisfy the constraint

y_i X_i · w ≥ 0   (168)

• In order to minimize the number of constraint violations, we need a way to quantify how "good" we are doing. Do this with the loss function

L(z, y_i) = { 0 if y_i z ≥ 0;  −y_i z otherwise }   (169)

Notice that this can only be ≥ 0 by definition. The larger L is, the worse you are as a human being.

• The Risk/Objective/Cost function is the sum total of your losses.

R(w) = ∑_{i=1}^{n} L(X_i · w, y_i) = ∑_{i ∈ V} (−y_i X_i · w)   (170)

where (∀i ∈ V)(y_i X_i · w < 0).

• Goal: Find w that minimizes R(w).


Hyperplanes with Perceptron

• Notice the difference between the two (purple) Goals stated in the previous subsection. We went from constraining w to certain hyperplanes in x-space (y_i X_i · w ≥ 0) to constraining w to certain points in w-space (min_w R(w)).

• Figure 1 illustrates how the data points constrain the possible values for w in our optimization problem. For each sample point x, the constraints can be stated as

– x in the "positive" class ⇒ x and w must be on the same side of the hyperplane that x transforms into.27
– x in the "negative" class ⇒ x and w must be on the opposite side of x's hyperplane.

Figure 1: Illustration of how three sample points in x-space (left) can constrain the possible values for w in w-space (right).

Algorithm: Gradient Descent

• GD on our risk function R is an example of an optimization algorithm. We want to minimize our risk, so we take successive steps in the opposite direction of ∇R(w).28

∇R(w) = ∇ ∑_{i ∈ V} (−y_i X_i · w)   (171)
= −∑_{i ∈ V} y_i X_i   (172)

• Algorithm:

– w ← arbitrary nonzero vector (e.g. any y_i X_i)
– while R(w) > 0:
  ∗ V ← all i for which y_i X_i · w < 0
  ∗ w ← w + ε ∑_{i ∈ V} y_i X_i

where ε is the learning rate/step size. Each step takes O(nd) time.

27 x transforms into a hyperplane in w-space, defined as all w that satisfy x · w = 0.
28 Recall that the gradient points in the direction of steepest ascent.
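A minimal NumPy sketch of the batch gradient-descent loop above, assuming linearly separable data (otherwise the loop never terminates); X, y, and the step size are whatever the caller supplies.

```python
import numpy as np

def perceptron_gd(X, y, eps=0.1, max_steps=10_000):
    """X has shape (n, d); y in {-1, +1}. Minimizes R(w) = sum over violators of -y_i X_i . w."""
    w = y[0] * X[0]                                  # arbitrary nonzero start, e.g. y_1 X_1
    for _ in range(max_steps):
        V = (y * (X @ w)) < 0                        # indices violating y_i X_i . w >= 0
        if not V.any():                              # R(w) = 0: all constraints satisfied
            break
        w = w + eps * (y[V, None] * X[V]).sum(axis=0)   # w <- w + eps * sum_{i in V} y_i X_i
    return w
```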

Algorithm: Stochastic GD

• The procedure is simply GD on one data point per step, i.e. no summation symbol. This is called the perceptron algorithm.

• Algorithm:

– while some y_i X_i · w < 0:
  ∗ w ← w + ε y_i X_i
– return w.

• Perceptron Convergence: If the data is linearly separable, a perfect linear classifier will be found in at most O(R^2/γ^2) iterations, where

– R = max_i |X_i| is the radius of the data
– γ is the maximum margin.

Maximum Margin Classifiers

• Margin: (of a linear classifier) the distance from the decision boundary to the nearest sample point.

• Goal: Make the margin as large as possible.

– Recall that the margin is defined as |τ_min|, the magnitude of the smallest Euclidean distance from a sample point to the decision boundary, where for some x_i,

τ_i = |f(x_i)| / ||w||

and our goal is to maximize the value of the smallest τ in the dataset.

– Enforce the (seemingly arbitrary?) constraints that |f(x_i)| ≥ 1, or equivalently

y_i(w · x_i + α) ≥ 1   (173)

which can also be stated as requiring all τ_i ≥ 1/||w||.


– Optimize: Find w and α that minimize ||w||^2, subject to y_i(w · x_i + α) ≥ 1 for all i ∈ [1, n]. "Called a quadratic program in d + 1 dimensions and n constraints. It has one unique solution."

• The solution is a maximum margin classifier, a.k.a. a hard-margin SVM.


Shewchuck Chapter 4

Soft-Margin SVMs
Written by Brandon McKinzie

• Hard-margin SVMs fail if not linearly separable.

• Idea: Allow some points to violate the margin, with slack variables ξ:

y_i(X_i · w + α) ≥ 1 − ξ_i   (174)

where ξ_i ≥ 0. Note that each sample point is assigned a value of ξ_i, which is nonzero only if x_i violates the margin.

• To prevent abuse of slack, add a loss term to our objective function.29

– Find w, α, and the ξ_i that minimize our objective function,

||w||^2 + C ∑_{i=1}^{n} ξ_i   (175)

subject to

y_i(X_i · w + α) ≥ 1 − ξ_i   for all i ∈ [1, n]   (176)
ξ_i ≥ 0   for all i ∈ [1, n]   (177)

a quadratic program in d + n + 1 dimensions and 2n constraints. The relative size of C, the regularization hyperparameter, determines whether you are more concerned with getting a large margin (small C) or keeping the slack variables as small as possible (large C).

29 Before this, it looks like our objective function was just ||w||^2, since that is what we wanted to minimize (subject to constraints).


Shewchuck Chapter 6

Decision Theory
Written by Brandon McKinzie

• For when "a sample point in feature space doesn't have just one class". The solution is to classify with probabilities.

• Important terminology:

– Loss Function L(z, y): Specifies the badness of classifying as z when the true class is y. Can be asymmetrical. We are typically used to the 0-1 loss function, which is symmetric: 1 if incorrect, 0 if correct.
– Decision rule (classifier) r : R^d → ±1. Maps a feature vector x to a class (1 if in class, −1 if not in class, for the binary case).

– Risk: Expected loss over all values of x, y:

R(r) = E[L(r(X), Y)]   (178)
= ∑_y P(Y = y) ∑_x P(X = x | Y = y) L(r(x), y)   (179)

In ESL Chapter 2.4, this is denoted as the prediction error.

– Bayes decision rule/classifier r*: Defined as the decision rule r = r* that minimizes R(r). If we assume L(z, y) = 0 when z = y, then

r*(x) = { 1 if L(−1, 1) P(1|x) > L(1, −1) P(−1|x);  −1 otherwise }   (180)

which has optimal risk, also called the Bayes risk R(r*).

• Three ways to build classifiers:

– Generative models (LDA): Assume sample points come from class-conditional probability distributions P(x|C), different for each class. Guess the form of these distributions. For each class C, fit the (guessed) distribution to the points labeled as class C. Also need to estimate (basically make up?) P(C). Use Bayes' rule and classify on max_C P(Y = C | X = x). Advantage: can diagnose outliers (small P(x)); can know the probability that a prediction is wrong. Real definition: a full probabilistic model of all variables.


– Discriminative models. Model P(Y|X) directly. (I guess this means don't bother with modelling all the other stuff like P(X|Y), just go for it bruh.) Advantage: can know the probability of a prediction being wrong. Real definition: a model only for the target variables.
– Decision boundary finding: e.g. SVMs. Model r(x) directly. Advantage: easier; always works if linearly separable; don't have to guess explicit distributions.


Shewchuck Chapter 7

Gaussian Discriminant Analysis
Written by Brandon McKinzie

• Fundamental assumption: Each class C comes from a normal distribution.

• For a given x, we want to maximize P(X = x | Y = C) π_C, where π_C is the prior probability of class C. It is easier to maximize ln(z), since it increases monotonically for z > 0. The following gives the "quadratic in x" function Q_C(x):

Q_C(x) = ln( (√(2π))^d P(x) π_C )   (181)
= −|x − μ_C|^2 / (2σ_C^2) − d ln σ_C + ln π_C   (182)

where P(x), a normal distribution, is what we use to estimate the class conditional P(x|C).

• The Bayes decision rule r∗ returns the class C that maximizes QC(x) above.

Quadratic Discriminant Analysis (QDA)

• Suppose only 2 classes, C and D. Then

r*(x) = { C if Q_C(x) − Q_D(x) > 0;  D otherwise }   (183)

which is quadratic in x. The Bayes Decision Boundary (BDB) is the solution set of Q_C(x) − Q_D(x) = 0.

• In 1D, BDB may have 1 or 2 points (solution to quadratic equation)

• In 2D, BDB is a quadric (e.g. for d=2, conic section).

• In 2-class problems, this naturally leads to the logistic/sigmoid function for determining P(Y|X).


Newton’s Method

• Iterative optimization for some smooth function J(w).

• We can Taylor expand the gradient about v:

∇J(w) = ∇J(v) + ∇^2 J(v)(w − v) + O(|w − v|^2)   (184)

where ∇^2 J(v) is the Hessian matrix of J at v, which I'll denote H.

• Find the critical point w where ∇J(w) = 0:

w = v − H^{−1} ∇J(v)   (185)

• Shewchuck defines Newton’s method algorithm as:

1. Initialize w.

2. until convergence do:

e := solve linear system

(He = −∇J(w)

).

w := w + e.

where starting w must be “close enough” to desired solution.
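A small NumPy sketch of the update above; the toy objective and its hand-coded gradient and Hessian are illustrative, not from the notes.

```python
import numpy as np

def newtons_method(grad, hess, w0, tol=1e-10, max_iters=50):
    """Repeatedly solve H e = -grad(w) and step w <- w + e."""
    w = w0.astype(float)
    for _ in range(max_iters):
        g = grad(w)
        if np.linalg.norm(g) < tol:           # convergence: gradient ~ 0
            break
        e = np.linalg.solve(hess(w), -g)      # e := solution of H e = -grad J(w)
        w = w + e
    return w

# Toy objective J(w) = sum(exp(w)) + 0.5 ||w||^2 (smooth and strictly convex).
grad = lambda w: np.exp(w) + w
hess = lambda w: np.diag(np.exp(w)) + np.eye(len(w))
w_star = newtons_method(grad, hess, w0=np.ones(3))
print(np.linalg.norm(grad(w_star)))           # ~0: a critical point of J
```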

Justifications & Bias-Variance (12)

• Overview: Describes models, how they lead to optimization problems, and how they contribute to underfitting/overfitting.

• Typical model of reality:

y_i = f(X_i) + ε_i   (186)

where ε_i ∼ D' has mean zero.

• Goal of regression: find h that estimates f .


Elements of Statistical Learning

Contents

3.1 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

3.1.1 Models and Least-Squares . . . . . . . . . . . . . . . . . . . . . . 73

3.1.2 Subset Selection (3.3) . . . . . . . . . . . . . . . . . . . . . . . . . 76

3.1.3 Shrinkage Methods (3.4) . . . . . . . . . . . . . . . . . . . . . . . 77

3.2 Linear Methods for Classification (Ch. 4) . . . . . . . . . . . . . . . . . . 78

3.3 Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

3.4 Trees and Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

3.4.1 Boosting Methods (10.1) . . . . . . . . . . . . . . . . . . . . . . . 80

3.5 Basis Expansions and Regularization (Ch 5) . . . . . . . . . . . . . . . . 81

3.6 Model Assessment and Selection (Ch. 7) . . . . . . . . . . . . . . . . . . 83


ESL

Linear Regression
Written by Brandon McKinzie

• Assumption: The regression function E[Y|X] is linear30 in the inputs X_1, . . . , X_p.

• Performs well for. . .

– Small numbers of training cases.

– Low signal/noise.

– Sparse data.

Models and Least-Squares

• The model:

f(X) = β_0 + ∑_{j=1}^{p} X_j β_j   (3.1)

• The most popular estimation method is least squares.

RSS(β) = ∑_{i=1}^{n} (y_i − f(x_i))^2   (3.2)
= (y − Xβ)^T (y − Xβ)   (3.3)

which is reasonable if the training observations (x_i, y_i) represent independent random draws from their population.31

• The first two derivatives with respect to the parameter vector β:

∂RSS/∂β = −2 X^T (y − Xβ)
∂^2 RSS/∂β ∂β^T = 2 X^T X   (3.4)

• Assuming that X has full column rank, so that X^T X is positive definite,32 set the first derivative to 0 to obtain the unique solution:

β̂ = (X^T X)^{−1} X^T y   (3.6)

30 Or reasonably approximated as linear.
31 And/or if the y_i's are conditionally independent given the x_i's.
32 A matrix is positive definite if it is symmetric and all its eigenvalues are positive. What would we do here if X were not full column rank? Answer: the columns of X may not be linearly independent if, e.g., two inputs were perfectly correlated (x_2 = 3 x_1). The fitted ŷ will still be the projection onto C(X), but there will be more than one way (not unique) to express that projection. This occurs most often when one or more (qualitative) inputs are coded in a redundant fashion.
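A quick NumPy illustration of the closed-form solution (3.6), assuming synthetic data; in practice one would use np.linalg.lstsq or a QR factorization rather than forming (X^T X)^{−1} explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])   # include an intercept column
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # equation (3.6): (X^T X)^{-1} X^T y
print(beta_hat)                                 # close to beta_true
```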

• Geometry of Least Squares (I get it now!)

– The (p + 1) column vectors of X span a subspace of R^N.33
– Minimizing RSS(β) = ||y − Xβ||^2 is choosing β such that the residual vector y − ŷ is orthogonal to this subspace.34 Stated another way, (the optimal) ŷ is the orthogonal projection of y onto the column space of X.

– Since ŷ = Xβ̂, we can define the projection matrix (a.k.a. hat matrix), denoted H, where

ŷ = Xβ̂ = X (X^T X)^{−1} X^T y = H y   (3.7)

Why Var(β̂) = (X^T X)^{−1} σ^2.

• Note: Variance-covariance matrix ≡ Covariance matrix.

• We can express the correlation matrix in terms of the covariance matrix:

corr(X) = (diag(Σ))^{−1/2} Σ (diag(Σ))^{−1/2}   (187)

or, equivalently, the correlation matrix can be seen as the covariance matrix of the standardized random variables X_i / σ(X_i).

• Recall from decision theory that, when we want to find a function f(X) for predicting some Y ∈ R, we can do this by minimizing the risk (a.k.a. the prediction error EPE(f)). This is accomplished first by defining a loss function. Here we will use the squared error loss L(Y, f(X)) = (Y − f(X))^2. We can express EPE(f) as an integral over all values that Y and X may take on (i.e. the joint distribution). Therefore, we can factor the joint distribution and define f(x) by minimizing EPE pointwise (meaning at each value of X = x). This whole description is written mathematically below.

33 This is the column space C(X) of X. It is the space of Xv ∀v ∈ R^N, since the product Xv is just a linear combination of the columns of X with coefficients v_i.
34 Interpret: Xβ will always lie somewhere in this subspace, but we want β such that, when we subtract each component (WOAH JUST CLICKED) from the prediction, they cancel exactly, i.e. y_i − (Xβ)_i = 0 for all dimensions i in C(X). The resultant vector y − ŷ will only contain components outside this subspace, hence it is orthogonal to it by definition.

EPE(f) = E[Y − f(X)]^2   (2.9)
= ∫ [y − f(x)]^2 f_{XY}(x, y) dx dy   (2.10)
= E_X[ E_{Y|X}[ (Y − f(X))^2 | X ] ]   (2.11)

and therefore, the best predictor of Y is a function f : R^p → R that satisfies, for each x value separately,

f(x) = argmin_c E_{Y|X}[ (Y − c)^2 | X = x ]   (2.12)
= E[Y | X = x]   (2.13)

which essentially defines what is meant by E[Y | X = x], also referred to as the conditional mean.35

Bias-Variance Tradeoff

The test MSE, for a given value x_0, can always be decomposed into the sum of three fundamental quantities:

E[ y_0 − f̂(x_0) ]^2 = Var( f̂(x_0) ) + [Bias( f̂(x_0) )]^2 + Var(ε)   (188)

which is interpreted as the expected test MSE: the average test MSE that we would obtain if we repeatedly estimated f using a large number of training sets, and tested each at x_0. The overall test MSE can be computed by averaging (this average) over all possible values of x_0 in the TEST set.

• What bias means here: On the other hand, bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model. For example, linear regression assumes that there is a linear relationship between Y and X_1, . . . , X_p. It is unlikely that any real-life problem truly has such a simple linear relationship, and so performing linear regression will undoubtedly result in some bias in the estimate of f. In Figure 2.11, the true f is substantially non-linear, so no matter how many training observations we are given, it will not be possible to produce an accurate estimate using linear regression. In other words, linear regression results in high bias in this example. However, in Figure 2.10 the true f is very close to linear, and so given enough data, it should be possible for linear regression to produce an accurate estimate. Generally, more flexible methods result in less bias.

35 At the same time, don't forget that the squared-error loss assumption was built into this derivation.


• Returning now to the case where we know (a.k.a. assume) that the true relationship between X and Y is linear,

Y = X^T β + ε   (2.26)

and so in this particular case the least squares estimates are unbiased.

• This is the proof (it relies on the fact that Var(β) = 0, since β is the true (NON-RANDOM) vector we are estimating):36

Var[β̂] = Var[ (X^T X)^{−1} X^T y ]   (189)
= Var[ (X^T X)^{−1} X^T (Xβ + ε) ]   (190)
= Var[ (X^T X)^{−1} X^T X β + (X^T X)^{−1} X^T ε ]   (191)
= Var[ β + (X^T X)^{−1} X^T ε ]   (192)
= Var[ (X^T X)^{−1} X^T ε ]   (193)
= E[ ( (X^T X)^{−1} X^T ε ) ( (X^T X)^{−1} X^T ε )^T ]   (194)
= ( (X^T X)^{−1} X^T ) E[ε ε^T] ( X (X^T X)^{−1} )   (195)
= ( (X^T X)^{−1} X^T ) σ^2 ( X (X^T X)^{−1} )   (196)
= σ^2 ( (X^T X)^{−1} X^T ) ( X (X^T X)^{−1} )   (197)
= σ^2 (X^T X)^{−1}   (198)

where we have assumed that the X are FIXED (not random),37 and so the variance of (some product of X's) × ε is like taking the variance with a constant out front. We've also used that X^T X (and thus its inverse) is symmetric.
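A quick Monte Carlo check of the result Var(β̂) = σ^2 (X^T X)^{−1}, assuming NumPy, a fixed design matrix X, and Gaussian noise; everything here is a synthetic illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 200, 2, 0.5
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])    # fixed design (kept the same in every trial)
beta = np.array([1.0, -2.0, 0.5])

betas = []
for _ in range(20_000):                                      # redraw only the noise eps each trial
    y = X @ beta + sigma * rng.normal(size=n)
    betas.append(np.linalg.solve(X.T @ X, X.T @ y))
empirical_cov = np.cov(np.array(betas), rowvar=False)

theoretical_cov = sigma**2 * np.linalg.inv(X.T @ X)          # equation (198)
print(np.max(np.abs(empirical_cov - theoretical_cov)))       # small
```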

Subset Selection (3.3)

• Two reasons why we might not be satisfied with 3.6:

1. Prediction accuracy. Least-squares estimates often have low bias but high variance. Accuracy may improve if we shrink the coefficients, sacrificing some bias to reduce variance.

36 Also, ∀a ∈ R : Var(a + X) = Var(X).
37 Another way of stating this is that we took the variance given (or conditioned on) each X.


2. Interpretation. Sacrifices some of the small details.

• It appears that subset selection refers to retaining a subset of the predictors β_i and discarding the rest.
• Doing this can often exhibit high variance, even if it gives lower prediction error.

Shrinkage Methods (3.4)

• Shrinkage methods are considered continuous (as opposed to subset selection) and don't suffer as much from high variability.


ESL July 30, 2017

Linear Methods for Classification (Ch. 4)
Written by Brandon McKinzie

Logistic Regression (4.4). Motivation: we want to model Pr[G = k | X = x], for each of our K classes, via linear functions in x.

log( Pr[G = k | X = x] / Pr[G = K | X = x] ) = β_{k0} + β_k^T x   ∀k ∈ [1, K − 1]   (199)

Pr[G = k | X = x] = exp(β_{k0} + β_k^T x) / ( 1 + ∑_{ℓ=1}^{K−1} exp(β_{ℓ0} + β_ℓ^T x) )   ∀k ∈ [1, K − 1]   (200)

≜ p_k(x; θ)   (201)

where θ ≜ {β_{10}, β_1^T, . . . , β_{(K−1)0}, β_{K−1}^T}.


ESL

Naive Bayes
Written by Brandon McKinzie

Appropriate when the dimension p of the feature space is large. It assumes that, given a class G = j, the features X_k are independent:

f_j(X) ≡ f_j((X_1, X_2, . . . , X_p)^T) = ∏_{k=1}^{p} f_{jk}(X_k)   (202)

which can simplify estimation [of the class-conditional probability densities f_j(X)] dramatically: the individual class-conditional marginal densities f_{jk} can each be estimated separately using 1D kernel density estimates.


ESL

Trees and Boosting
Written by Brandon McKinzie

Boosting Methods (10.1)

Terminology:

• Weak Classifier: one whose error rate is only slightly better than random guess-ing.

The AdaBoost algorithm.

1. Initialize the observation weights w_i := 1/N, i = 1, 2, . . . , N.

2. For m = 1 to M:38

(a) Fit classifier G_m(x) to the training data using weights w_i.

(b) Compute

err_m = ( ∑_{i=1}^{N} w_i I(y_i ≠ G_m(x_i)) ) / ( ∑_{i=1}^{N} w_i )   (203)

(c) Compute α_m = log((1 − err_m)/err_m).

(d) Update w_i ← w_i · exp[α_m · I(y_i ≠ G_m(x_i))], i = 1, . . . , N.

3. Output G(x) = ∑_{m=1}^{M} α_m G_m(x).

38 Where M is the number of weak classifiers (trees) that we want to train.


ESL

Basis Expansions and Regularization (Ch 5)
Written by Brandon McKinzie

Introduction. The core idea is to augment/replace the vector of inputs X with additional variables, which are transformations of X, and then use linear models in this new space of derived input features. Specifically, we model a linear basis expansion in X:

f(X) = ∑_{m=1}^{M} β_m h_m(X)   (204)

where h_m(X) : R^p → R is the mth transformation of X, m = 1, . . . , M.

Piecewise Polynomials and Splines. Assume that X is one-dimensional. We will explore increasingly complex cases that build off each other.

• Piecewise-constant. Take, for example, the case where we have 3 basis functions:

h_1(X) = I(X < ξ_1),  h_2(X) = I(ξ_1 ≤ X < ξ_2),  h_3(X) = I(ξ_2 ≤ X)   (205)

Solving for β_i in each derivative of RSS(β) with respect to β_i, for f(X) = ∑_{i=1}^{3} β_i h_i(X), yields β_i = Ȳ_i, the mean of Y in the ith (of 3) regions. Note that this is the degree-0 case: piecewise constant.

• Piecewise-linear. Instead of just learning the best-fit constant β_i for the ith region, we can now also learn the best slope for the line in that region. This means we have 3 additional parameters to fit, and f(X) becomes:

f(X) = β_1 I_1 + β_2 I_2 + β_3 I_3 + β_4 I_1·X + β_5 I_2·X + β_6 I_3·X   (206)
= (β_1 + β_4 X) I_1 + (β_2 + β_5 X) I_2 + (β_3 + β_6 X) I_3   (207)

where I’ve denoted the ith identity function from 205 as Ii.• Continuous piecewise-linear. We would typically prefer the lines to meet (havethe same value) at the region boundaries of X = ξ1 and X = ξ2. In other words,require that:

f(ξ−1 ) = β1 + β4ξ1 = β2 + β5ξ1 = f(ξ+1 ) (208)

f(ξ−2 ) = β2 + β5ξ2 = β3 + β6ξ2 = f(ξ2+) (209)


which also reduces our free parameters from 6 to 4.39 We can express these constraints more directly by using a basis that incorporates them:40

h_1(X) = 1,  h_2(X) = X,  h_3(X) = (X − ξ_1)_+,  h_4(X) = (X − ξ_2)_+   (210)

In general, an order-M spline with knots ξ_j, j = 1, . . . , K is a piecewise polynomial of order M, and has continuous derivatives up to order M − 2.

B-splines. Let ξ_0 and ξ_{K+1} denote two boundary knots, which typically define the domain over which we wish to evaluate our spline. Define the augmented knot sequence τ such that

τ_1 ≤ τ_2 ≤ · · · ≤ τ_M ≤ ξ_0   (211)
τ_{M+j} = ξ_j,  j = 1, · · · , K   (212)
ξ_{K+1} ≤ τ_{K+M+1} ≤ τ_{K+M+2} ≤ · · · ≤ τ_{K+2M}   (213)

(It is customary to make all τ_i, i ≤ M equal to ξ_0, and all τ_j, j ≥ K + M + 1 equal to ξ_{K+1}.)

The ith B-spline basis function of order m (m ≤ M) for the knot sequence τ is denoted by B_{i,m}(x), and is defined recursively as follows:

B_{i,1}(x) = { 1 if τ_i ≤ x ≤ τ_{i+1};  0 otherwise }   for i = 1, . . . , K + 2M − 1   (214)

B_{i,m}(x) = ((x − τ_i)/(τ_{i+m−1} − τ_i)) B_{i,m−1}(x) + ((τ_{i+m} − x)/(τ_{i+m} − τ_{i+1})) B_{i+1,m−1}(x)   for i = 1, . . . , K + 2M − m   (215)

39 Because, for example, now β_1 = β_2 + (β_5 − β_4) ξ_1.
40 Note: I was not able to derive this form from the previous equations and constraints. Moving on because it's not important right now.
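A direct, unoptimized transcription of the recursion (214)–(215) into Python, assuming a simple knot sequence; terms with a zero denominator are treated as zero, a convention the notes don't spell out.

```python
import numpy as np

def bspline_basis(i, m, x, tau):
    """B_{i,m}(x) for knot sequence tau (1-indexed i as in the notes -> tau[i-1] here)."""
    if m == 1:                                        # equation (214)
        return 1.0 if tau[i - 1] <= x <= tau[i] else 0.0
    # Equation (215); a term is taken to be 0 when its denominator is 0.
    left_den = tau[i + m - 2] - tau[i - 1]
    right_den = tau[i + m - 1] - tau[i]
    left = (x - tau[i - 1]) / left_den * bspline_basis(i, m - 1, x, tau) if left_den else 0.0
    right = (tau[i + m - 1] - x) / right_den * bspline_basis(i + 1, m - 1, x, tau) if right_den else 0.0
    return left + right

# Example: cubic (order M = 4) B-splines on [0, 1] with interior knots at 0.25, 0.5, 0.75 (K = 3).
M, knots = 4, [0.25, 0.5, 0.75]
tau = [0.0] * M + knots + [1.0] * M                   # augmented knot sequence (211)-(213)
x = 0.4
vals = [bspline_basis(i, M, x, tau) for i in range(1, len(knots) + M + 1)]   # K + M basis functions
print(sum(vals))                                      # B-splines form a partition of unity: ~1.0
```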


ESL

Model Assessment and Selection (Ch. 7)
Written by Brandon McKinzie

Bias, Variance, and Model Complexity. We first define the three main quantities:

Generalization (Test) Error:  Err_T = E[ L(Y, f̂(X)) | T ]   (216)
Expected Prediction Error:  Err = E[ L(Y, f̂(X)) ] = E[Err_T]   (217)
Training Error:  err = (1/N) ∑_{i=1}^{N} L(y_i, f̂(x_i))   (218)

Bias-Variance Decomposition. Let Y = f(X) + ε, with E[ε] = 0 and Var[ε] = σ_ε^2. We derive the expected prediction error of a fit f̂(X) at an input point X = x_0, using squared-error loss. (Note that ε and f̂ are independent, and below f is shorthand for f(x_0).)

Err(x_0) = E_{ε,f̂}[ (Y − f̂(x_0))^2 | X = x_0 ]   (219)
= E_{ε,f̂}[ (f − f̂(x_0) + ε)^2 ]   (220)
= E_ε[ε^2] + E_f̂[ (f − f̂(x_0))^2 ]   (221)
= σ_ε^2 + E_f̂[ f̂(x_0)^2 ] + f^2 − 2 f E_f̂[ f̂(x_0) ]   (222)
= σ_ε^2 + E_f̂[ f̂(x_0)^2 ] − E_f̂[ f̂(x_0) ]^2 + E_f̂[ f̂(x_0) ]^2 + f^2 − 2 f E_f̂[ f̂(x_0) ]   (223)
= σ_ε^2 + Var[ f̂(x_0) ] + ( E_f̂[ f̂(x_0) ]^2 − 2 f E_f̂[ f̂(x_0) ] + f^2 )   (224)
= σ_ε^2 + Var[ f̂(x_0) ] + ( E_f̂[ f̂(x_0) ] − f )^2   (225)
= σ_ε^2 + Var[ f̂(x_0) ] + Bias^2( f̂(x_0) )   (226)

Remember, Err(x_0) is the MSE of the prediction at X = x_0, taken over all possible models f̂ (and the irreducible error from ε). The bias is the difference between the average estimate, E_f̂[f̂(x_0)], and the true mean of Y, f(x_0). Finally, the variance measures how much the distribution over all possible f̂ deviates from their average, E_f̂[f̂(x_0)].

Bootstrap Methods. A general tool for assessing statistical accuracy. It seeks to estimate the conditional error Err_T, but typically estimates well only the expected prediction error Err. Denote our training set T by Z = (z_1, . . . , z_N), where z_i = (x_i, y_i). The basic procedure is as follows:

1. Randomly draw datasets with replacement from Z, with each sampled dataset, Z^{*b}, having the same size, N, as Z. This is done B times, producing B bootstrap datasets.

2. Refit the model to each of the bootstrap datasets, and examine the behavior of the fits over the B replications.
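A short NumPy sketch of the resampling step; the statistic being bootstrapped here (the sample mean) is just a placeholder for "refit the model."

```python
import numpy as np

def bootstrap(Z, statistic, B=1000, rng=None):
    """Draw B datasets of size N with replacement from Z and apply `statistic` to each."""
    rng = rng or np.random.default_rng(0)
    N = len(Z)
    return np.array([statistic(Z[rng.integers(0, N, size=N)]) for _ in range(B)])

# Usage: bootstrap estimate of the standard error of the sample mean.
Z = np.random.default_rng(1).normal(loc=5.0, scale=2.0, size=100)
reps = bootstrap(Z, statistic=np.mean)
print(reps.std())        # close to 2 / sqrt(100) = 0.2
```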


Concept Summaries

Contents

4.1 Probability Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

4.2 Commonly Used Matrices and Properties . . . . . . . . . . . . . . . . . . 90

4.3 Midterm Studying . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

4.4 Support Vector Machines (CS229) . . . . . . . . . . . . . . . . . . . . . . 92

4.5 Spring 2016 Midterm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

4.6 Multivariate Gaussians . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

4.6.1 Misc. Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

4.7 Final Studying . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

4.7.1 PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

4.7.2 Kernel PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

4.7.3 Kernel PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

4.7.4 Bias-Variance Tradeoff . . . . . . . . . . . . . . . . . . . . . . . . 99

4.7.5 Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

4.7.6 Risk, Empirical Risk Minimization, and Related . . . . . . . . . . 100

4.7.7 Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101


Concept Summaries

Probability Review
Written by Brandon McKinzie

Notation:

– Sample Space Ω: The set of all outcomes of a random experiment. For a six-sided die, Ω = {1, . . . , 6}.
– Event Space F: A set whose elements are subsets of Ω. It appears that F is required to be complete in a certain sense, i.e. it should contain all possible events (combinations of possible individual outcomes).
– Probability measure: A function P : F → R. Intuitively, it tells you what fraction of the total space of possibilities an event A ∈ F occupies, where if A is the full space, P(A) = P(Ω) = 1. Also required: P(A) ≥ 0 ∀A ∈ F.

Random Variables:41

– Consider an experiment: flip 10 coins. An example element of Ω would be of the form

ω_0 = (H, H, T, H, T, H, H, T, H, T) ∈ Ω   (227)

which is typically a quantity too specific for us to really care about. Instead, we prefer real-valued functions of outcomes, known as random variables.
– An R.V. X is defined as a function X : Ω → R. It is denoted as X(ω), or simply X if the ω dependence is obvious.
– Using our definition of the probability measure, we define the probability that X = k as the probability measure over the space containing all outcomes ω where X(ω) = k.42

P(X = k) := P({ω : X(ω) = k})   (228)

– Cumulative Distribution Function: F_X : R → [0, 1], defined as43

F_X(x) ≜ P(X ≤ x)   (229)

41 TIL I had no idea what a random variable really was.
42 Oh my god yes, this is what I came here for.
43 The symbol ≜ means equal by definition (hnnggg). In the continuous case, F_X(x) = ∫_{−∞}^{x} p_X(u) du.


– Probability Mass Function: When X is a discrete RV, it is simpler to represent the probability measure by directly specifying the probability of each possible value X can assume. It is a function p_X : Ω → R such that

p_X(x) ≜ P(X = x)   (230)

– Probability Density Function: The derivative of the CDF:

f_X(x) ≜ dF_X(x)/dx   (231)
P(x ≤ X ≤ x + δx) ≈ f_X(x) δx   (232)

Expectation Values

– Discrete X (PMF p_X(x)): can either take expectations of X (the mean) or of some function g(X) : R → R, which is also a random variable.

E[g(X)] ≜ ∑_{x ∈ Val(X)} g(x) p_X(x)   (233)
E[X] ≜ ∑_{x ∈ Val(X)} x p_X(x)   (234)

– Continuous X (PDF f_X(x)): then

E[g(X)] ≜ ∫_{−∞}^{∞} g(x) f_X(x) dx   (235)

– Properties:

E[a] = a   ∀a ∈ R   (236)
E[a f(X)] = a E[f(X)]   (237)
E[f(X) + g(X)] = E[f(X)] + E[g(X)]   (238)
E[bool(X == k)] = P(X = k)   (239)


Variance: A measure of how concentrated the distribution of an RV is around its mean.

Var[X] ≜ E[(X − E[X])^2]   (240)
= E[X^2] − E[X]^2   (241)

with properties:

Var[a] = 0   ∀a ∈ R   (242)
Var[a f(X)] = a^2 Var[f(X)]   (243)

Covariance

– Recognize that the covariance of two random variables X and Y can be described as a function g : R^2 → R. Below we define the expectation value for some multivariable function,44 and then we can define the covariance as a particular example.

E[g(X, Y)] ≜ ∑_{x ∈ Val(X)} ∑_{y ∈ Val(Y)} g(x, y) p_{XY}(x, y)   (244)
Cov[X, Y] = E[(X − E[X])(Y − E[Y])]   (245)

– Properties:

Cov[X, Y] = E[XY] − E[X] E[Y]   (246)
Var[X + Y] = Var[X] + Var[Y] + 2 Cov[X, Y]   (247)

44 Only the discrete case is shown; it should be obvious how it would look for the continuous case.


Random Vectors

– Suppose we have n random variables X_i = X_i(ω), all over the same general sample space Ω. It is convenient to put them into a random vector X, defined as X : Ω → R^n with X = [X_1 X_2 . . . X_n]^T.
– Let g be some function g : R^n → R^m. We can define expectations with the notation laid out below.

g(X) = [g_1(X); g_2(X); . . . ; g_m(X)]   and   E[g(X)] = [E[g_1(X)]; E[g_2(X)]; . . . ; E[g_m(X)]]   (248)
E[g_i(X)] = ∫_{R^n} g_i(x_1, . . . , x_n) f_{X_1,...,X_n}(x_1, . . . , x_n) dx_1 . . . dx_n   (249)

– For a given X : Ω → R^n, its covariance matrix Σ is the n × n matrix with Σ_ij = Cov[X_i, X_j]. Also,

Σ = E[XX^T] − E[X] E[X]^T   (250)
= E[(X − E[X])(X − E[X])^T]   (251)

and it satisfies: (1) Σ ⪰ 0 (positive semi-definite), (2) Σ is symmetric.


Concept Summaries

Commonly Used Matrices and Properties
Written by Brandon McKinzie

Matrix Properties:

– (XX^T + μI_n): positive definite with real eigenvalues λ_i > 0 when μ > 0.
– A, B p.s.d.: the product AB is NOT guaranteed to be p.s.d.
– C with C_ij = A_ij B_ij (elementwise product of the p.s.d. A, B from above): C is p.s.d.

Positive Semi-definite.

• Eigenvalues non-negative.


Concept Summaries

Midterm Studying
Written by Brandon McKinzie

The eigenvalues of a matrix are the zeros of its characteristic polynomial, defined as f(λ) = det(A − λI). A vector v ≠ 0 is an eigenvector iff v ∈ Null(A − λI).

Regardless of the offset of the plane, the normal vector to ax + by + cz = d is w = (a, b, c). For any point A not on the plane, the closest point B to A, where B is on the plane, is determined by the value of α that solves

(A − α(a, b, c)) · (a, b, c) = d   (252)

since, given that α satisfies the equation, B = A − α(a, b, c) is a point on the plane, constructed by following the direction of w "backwards" from A.

A direct comparison between MLE and MAP:45

θ̂_MLE = argmax_θ ∑_i log p_X(x_i | θ)        where   p(θ | x) ∝ p_X(x | θ) p(θ)

θ̂_MAP = argmax_θ ∑_i log( p_X(x_i | θ) p(θ) ) = argmax_θ [ ∑_i log p_X(x_i | θ) + log p(θ) ]   (253)

45MAP also known as Bayesian Density Estimation


Concept Summaries

Support Vector Machines (CS229)
Written by Brandon McKinzie

• Note that (logistic regression)

g(θ^T x) ≥ 0.5 ⟺ θ^T x ≥ 0   (254)

• Switch to the perceptron algorithm, where

h_{w,b}(x) = g(w^T x + b) = { 1 if w^T x + b ≥ 0;  −1 otherwise }   (255)

• Given a training example (x^{(i)}, y^{(i)}), define the functional margin of (w, b) with respect to the training example as

γ̂^{(i)} = y^{(i)} (w^T x^{(i)} + b)   (256)

where γ̂^{(i)} > 0 means the prediction is correct.46 We can also define the functional margin with respect to S = {(x^{(i)}, y^{(i)}) : i = 1, . . . , m} to be the smallest of the individual functional margins:

γ̂ = min_{i=1,...,m} γ̂^{(i)}   (257)

• Now we move to geometric margins. First, consider Figure 2.^47

46 Possible insight relating to regularization: notice how the perceptron classification g(x) depends only on the sign of its argument, not on the magnitude. However, replacing (w, b) with (2w, 2b) doubles the functional margin, γ̂^(i) → 2γ̂^(i), so it seems "we can make the functional margin arbitrarily large without really changing anything meaningful." This suggests, perhaps, imposing a normalization condition such as ||w||_2 = 1.

47 To settle this once and for all: the plane containing the point P_0 and with normal vector n = (a, b, c) consists of all points P, with corresponding position vector r, such that the vector drawn from P_0 to P is perpendicular to n; i.e., the plane contains all points r such that n · (r − r_0) = 0.


Figure 2: Decision boundary

• If we consider A as the i-th data point, what is the value of γ^(i)? The point B that is closest to A on the plane is given by A − τ · w/||w||, where τ is the distance |AB| that we want to solve for. Since B is on the plane, we can solve for τ via

0 = w^T ( x^(i) − τ w/||w|| ) + b   (258)

τ = (w^T x^(i) + b) / ||w||   (259)

which leads to the general definition of the geometric margin, denoted without the hat as γ^(i):

γ^(i) = y^(i) [ (w/||w||)^T x^(i) + b/||w|| ]   (260)

Clearly, if ||w|| = 1, this is the same as the functional margin. The geometric margin over the whole training set is defined similarly to the functional-margin case, as the minimum over the individual margins.
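A minimal numpy sketch of equations (256) and (260) on made-up data; note that rescaling (w, b) changes the functional margins but leaves the geometric margins unchanged:

import numpy as np

def functional_margins(w, b, X, y):
    # gamma_hat^(i) = y^(i) (w^T x^(i) + b)                 (equation 256)
    return y * (X @ w + b)

def geometric_margins(w, b, X, y):
    # gamma^(i) = y^(i) ((w/||w||)^T x^(i) + b/||w||)        (equation 260)
    return functional_margins(w, b, X, y) / np.linalg.norm(w)

# Hypothetical separating hyperplane and two labeled points
w, b = np.array([2.0, 0.0]), -2.0
X = np.array([[3.0, 0.0], [0.0, 1.0]])
y = np.array([1, -1])
print(functional_margins(w, b, X, y))   # [4. 2.]  -> doubles if (w, b) -> (2w, 2b)
print(geometric_margins(w, b, X, y))    # [2. 1.]  -> margin over the set = min = 1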

• Optimizing (maximizing) the margin.

– Pose the following optimization problem:

max_{γ̂, w, b}  γ̂ / ||w||   s.t.   (261)

y^(i)(w^T x^(i) + b) ≥ γ̂,   i = 1, ..., m   (262)

– Primarily because this objective is non-convex and hard to optimize directly, we transform the problem as follows: (1) impose on the functional margin^48 the constraint^49 that γ̂ = 1, which we can always satisfy with some scaling of w and b; (2) instead of maximizing 1/||w||, minimize ||w||^2:

min_{w,b}  (1/2)||w||^2   s.t.   (263)

y^(i)(w^T x^(i) + b) ≥ 1,   i = 1, ..., m   (264)

which gives us the optimal margin classifier.

48 Recall that the functional margin alone does NOT tell you a distance.
49 Also remember that γ̂ is over the WHOLE training set and evaluates to the smallest γ̂^(i).

– Lots of subtleties: for the SVM (at least in this class) we want to maximize the margin, which we define to be 2/||w||. Note that this is not a fixed scalar value; it changes as ||w|| changes! The support vectors are any points x^(i) such that y^(i)(w^T x^(i) + b) = 1.
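A hedged sketch of the optimal margin classifier using scikit-learn (an assumption; these notes don't prescribe a particular solver). A hard margin is approximated with a very large C, and the toy data set is made up:

import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])                  # linearly separable toy labels

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # huge C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]

print(2.0 / np.linalg.norm(w))                # the margin 2/||w||
print(clf.support_vectors_)                   # points with y_i (w^T x_i + b) = 1
print(y * (X @ w + b))                        # >= 1 everywhere, = 1 at the support vectors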

PROOF THAT W IS ORTHOGONAL TO THE HYPERPLANE

• Claim: If w is a vector that classifies according to the perceptron algorithm (equation 255), then w is orthogonal to the separating hyperplane.

• Proof: We proceed, using only the definition of a plane, by finding the plane that w is orthogonal to, and showing that this plane must be the separating hyperplane.

• Plug w into the point-normal form of the equation of a plane, i.e. the plane containing all points r = (x, y, z) such that w is orthogonal to the plane:^50

w_x x + w_y y + w_z z + d = 0   (265)

w^T r + d = 0   (266)

where, denoting r_0 = (x_0, y_0, z_0) as the vector pointing to some arbitrary point P_0 in the plane,

d = −(w_x x_0 + w_y y_0 + w_z z_0)   (268)
  = −w^T r_0   (269)

which means that

0 = w^T r + d   (270)
  = w^T r − w^T r_0   (271)
  = w^T (r − r_0)   (272)

Since the separating hyperplane w^T x + b = 0 is exactly a plane of the form w^T r + d = 0 (with d = b), equation (272) shows that w is perpendicular to every vector r − r_0 lying in it. QED

50 Note that this says w is orthogonal to the PLANE, not to any particular vector pointing to some point in the plane.


Concept Summaries

Spring 2016 Midterm

• Hard-margin SVM and the perceptron will not return a classifier if the data are not linearly separable.

• Soft-margin SVM uses y_i(X_i · w + α) ≥ 1 − ξ_i, so ξ_i ≠ 0 for both (1) misclassified samples and (2) all samples inside the margin.

• A large value of C in the soft-margin SVM objective ||w||^2 + C Σ_i ξ_i is prone to overfitting the training data. Interpretation: a large C pushes most ξ_i toward zero, and therefore the decision boundary bends to hug the training data (it becomes sinuous).

• The Bayes classifier classifies to the most probable class, using the conditional (discrete) distribution P(G | X):

G(x) = argmax_{g ∈ G} Pr(g | X = x)   (273)

• Σ^{1/2} = UΛ^{1/2}U^T, where Σ = UΛU^T is the eigendecomposition.
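A quick numpy check of this identity, using the eigendecomposition of a made-up positive definite matrix:

import numpy as np

A = np.random.default_rng(0).normal(size=(3, 3))
Sigma = A @ A.T + np.eye(3)             # a symmetric positive definite matrix

lam, U = np.linalg.eigh(Sigma)          # Sigma = U diag(lam) U^T
Sigma_half = U @ np.diag(np.sqrt(lam)) @ U.T

assert np.allclose(Sigma_half @ Sigma_half, Sigma)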


Concept Summaries

Multivariate Gaussians

• The covariance matrix Σ ∈ S^n_{++}, the space of all symmetric positive definite n × n matrices.

• Due to this, and since the inverse of any positive definite matrix is also positive definite, we can say that, for all x ≠ µ:

−(1/2)(x − µ)^T Σ^{−1} (x − µ) < 0   (274)

• Theorem: For any random vector X with mean µ and covariance matrix Σ:^51

Σ = E[(X − µ)(X − µ)^T] = E[XX^T] − µµ^T   (275)

• Theorem: The covariance matrix Σ of any random vector X is symmetric positive semidefinite.

– In the particular case of Gaussians, which require the existence of Σ^{−1}, we also have that Σ is full rank. Since any full-rank symmetric positive semidefinite matrix is necessarily symmetric positive definite, it follows that Σ must be symmetric positive definite.

• The diagonal covariance matrix case. An n-dimensional Gaussian with mean µ ∈ R^n and diagonal Σ = diag(σ²_1, σ²_2, ..., σ²_n) is the same as n independent Gaussian random variables with mean µ_i and variance σ²_i, respectively (i.e. P(x) = P(x_1) · P(x_2) ··· P(x_n), where each P(x_i) is a univariate Gaussian PDF).
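A small numerical check of this factorization using scipy (assumed available), on made-up values of µ, σ², and x:

import numpy as np
from scipy.stats import multivariate_normal, norm

mu = np.array([1.0, -2.0, 0.5])
sigma2 = np.array([0.5, 2.0, 1.0])
x = np.array([0.3, -1.0, 1.2])

joint = multivariate_normal(mean=mu, cov=np.diag(sigma2)).pdf(x)
product = np.prod(norm(loc=mu, scale=np.sqrt(sigma2)).pdf(x))

assert np.isclose(joint, product)   # diagonal Sigma  <=>  independent univariate Gaussians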

• ISOCONTOURS. General intuitions listed below:^52

– For a random vector X ∈ R² with µ ∈ R², the isocontours are ellipses centered on (µ_1, µ_2).

– If Σ is diagonal, the principal axes of these ellipses lie along the x and y axes. In the more general case, they lie along the eigenvectors of Σ.

51 Also in the Probability Review.
52 Disclaimer: the following were based on an example with n = 2 and diagonal Σ; I've done my best to generalize the arguments made there.


• Theorem: Let X ∼ N(µ, Σ) for some µ ∈ R^n and Σ ∈ S^n_{++}. Then there exists a matrix B ∈ R^{n×n} such that if we define Z = B^{−1}(X − µ), then Z ∼ N(0, I).
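One concrete choice of B (an assumption, since the theorem only asserts existence) is the Cholesky factor of Σ, so that BB^T = Σ. A minimal numpy sketch with made-up µ and Σ:

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 3.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

X = rng.multivariate_normal(mu, Sigma, size=100_000)   # samples of X ~ N(mu, Sigma)

B = np.linalg.cholesky(Sigma)                          # B B^T = Sigma
Z = (np.linalg.inv(B) @ (X - mu).T).T                  # Z = B^{-1}(X - mu)

print(Z.mean(axis=0))    # ~ [0, 0]
print(np.cov(Z.T))       # ~ identity  ->  Z ~ N(0, I)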

Misc. Facts

• The sum of absolute residuals is less sensitive to outliers than the residual sum of squares. [Todo: study the flaws of least-squares regression.]

• In LDA, the discriminant functions δ_k(x) are an equivalent description of the decision rule, classifying as G(x) = argmax_k δ_k(x), where (for LDA)

δ_k(x) = x^T Σ^{−1} µ_k − (1/2) µ_k^T Σ^{−1} µ_k + log π_k   (276)
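A minimal numpy sketch of equation (276) with hypothetical class means, priors, and shared covariance:

import numpy as np

def lda_discriminant(x, mu_k, Sigma_inv, pi_k):
    # delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log(pi_k)   (276)
    return x @ Sigma_inv @ mu_k - 0.5 * mu_k @ Sigma_inv @ mu_k + np.log(pi_k)

Sigma_inv = np.linalg.inv(np.array([[1.0, 0.2], [0.2, 2.0]]))   # shared covariance (made up)
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]              # class means
pis = [0.7, 0.3]                                                # class priors

x = np.array([1.5, 0.5])
scores = [lda_discriminant(x, mu, Sigma_inv, pi) for mu, pi in zip(mus, pis)]
print(int(np.argmax(scores)))   # G(x) = argmax_k delta_k(x)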

• A large value of C in the soft-margin SVM objective ||w||^2 + C Σ_i ξ_i is likely to overfit the training data. This is because it drives the ξ_i very low or to zero, which means the classifier constructed a (likely nonlinear) decision boundary such that most points are either on or outside the margin. The key here is that changing the ξ_i associated with points doesn't mean you're ignoring them; it means you are manipulating the decision boundary to more closely resemble your training distribution.

• Remember that the sum in the following denominator is over y (not x):

P(Y = y_i | X = x_i) = f_i(x_i) π_i / ( Σ_{y_j ∈ Y} f_j(x_i) π_j )   (277)

For binary classification, the decision boundary is at x = x* where P(Y = 1 | x*) = P(Y = 0 | x*) = 1/2. For logistic regression, this occurs when the argument h(x*) of the exponential in the denominator satisfies exp(h(x*)) = exp(0) = 1. So, to find the values of x along the decision boundary in this particular case, you'd solve h(x) = 0.

• [DIS 3.2] First, never forget that

1 = ∫ f_{X|Y=Y_i}(x) dx   (278)

where the integral is over the support of X given Y = Y_i. Therefore, if you're told that the x_n are sampled i.i.d. and uniformly at random from 2 equiprobable classes, a disk of radius 1 (Y = +1) and a ring from radius 1 to 2 (Y = −1), then you should be able to see why (hint: the equation just written) f_{X|Y=+1} = 1/π for ||X|| ≤ 1 and f_{X|Y=−1} = 1/(3π) for 1 ≤ ||X|| ≤ 2. The fact that the classes are equiprobable means f_Y(Y = +1) = f_Y(Y = −1) = 1/2, which means you can write the density of X, f_X.


Concept Summaries December 3

Final Studying

PCA

What the heck is PCA? An attempt to remove the many ambiguities.

• Goal. Dimensionality reduction: we want to find the best rank-r approximation to the data.

• Algorithm (see the numpy sketch at the end of this PCA summary).

1. Center the data: X_c = [x_1 − µ_x, x_2 − µ_x, ..., x_n − µ_x]. Why? Because soon we will want to project along certain axes of the data, which won't make sense unless those axes pass through the origin.

2. SVD. Compute SVD(X_c) = USV^T. Why? Because (1) the columns of U contain the principal axes that we want to project X onto and (2) this just ends up meaning we return a portion of SV^T (more in the next step).

3. Return [X̂ = S_r V_r^T, U_r, µ_x]. Why? Because (1) S_r V_r^T is the data expressed in the top r principal directions (together with U_r it gives the best rank-r approximation U_r S_r V_r^T), (2) the columns of U_r give us the principal axes, and (3) µ_x tells us how to un-center our data should we want to do that.

• Why maximize variance? We want our rank-r approximation X̂ to resemble the actual X as closely as possible. Intuitively, since dimensionality reduction (obviously) discards dimensions from the original X, we should only remove the dimensions that provide the least amount of information about how the samples (rows of X) are distributed. In other words, to best represent X in a smaller number r of dimensions, we should keep the dimensions along which the data is most spread out, as opposed to all close to the mean.
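A minimal numpy sketch of the three steps above, assuming the columns of X are the data points (as in the centering step); the 5-dimensional random data is made up:

import numpy as np

def pca(X, r):
    """PCA following the steps above; columns of X are the data points (d x n)."""
    mu = X.mean(axis=1, keepdims=True)                   # 1. center the data
    Xc = X - mu
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)    # 2. SVD(Xc) = U S V^T
    scores = np.diag(S[:r]) @ Vt[:r]                     # 3. S_r V_r^T : r-dimensional representation
    return scores, U[:, :r], mu                          #    principal axes U_r and mean for un-centering

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200))                            # hypothetical data: 5 dims, 200 samples
scores, Ur, mu = pca(X, r=2)
X_approx = Ur @ scores + mu                              # best rank-2 approximation, back in R^5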


Kernel PCA

• Motivation for kernels: if we want to map sample points to a very high-dimensional feature space, the kernel trick can save us from having to compute those features explicitly, thereby saving a lot of time.

• Covariance matrix. Assume zero mean, Σ_{i=1}^n x_i = 0. The sample covariance matrix is

Σ = Σ_{i=1}^n x_i x_i^T = X^T X   (279)

where X is our n × d data matrix.


• Any vector w ∈ R^d can be written as w = w_n + X^T α for some w_n in the null space of X and some α ∈ R^n. (Why: vectors of the form X^T α span the row space of X, and R^d decomposes as the row space of X plus its orthogonal complement, which is exactly Null(X).)

Bias-Variance Tradeoff

Closed-Form Solutions and Properties. Don’t forget to put these on your cheatsheet:

θ_OLS = (X^T X)^{−1} X^T y   (280)

θ_ridge = (X^T X + λI_d)^{−1} X^T y   (281)

Also, for θ_OLS, don't forget that

Var(θ_OLS) = E[θ_OLS θ_OLS^T]   (282)

i.e. E[(θ_OLS − E[θ_OLS])(θ_OLS − E[θ_OLS])^T] = E[θ_OLS θ_OLS^T], which is just a consequence of the fact that (for a constant µ)

Var(X + µ) = Var(X) + Var(µ) + 2 Cov(X, µ)   (283)
           = Var(X)   (284)
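A minimal numpy sketch of the closed forms (280) and (281) on made-up data; np.linalg.solve is used instead of forming the inverse explicitly, which is numerically preferable but gives the same estimator:

import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=n)

theta_ols = np.linalg.solve(X.T @ X, X.T @ y)                       # (280)
lam = 1.0
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)   # (281)

print(theta_ols)     # ~ theta_true
print(theta_ridge)   # shrunk slightly toward zero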


Classifiers

Bayes Classifier, Bayes Risk, and Related.

→ The test error rate, Ave(I(y_0 ≠ ŷ_0)), is minimized on average by the Bayes classifier, which assigns test point x_0 to the class j for which Pr(Y = j | X = x_0) is largest.

→ Assume we have access to posterior distributions Pr(Y = j | X = x).

KNN.

→ Essentially a Bayes classifier that estimates the posterior Pr(Y = j | X = x) as

Pr(Y = j | X = x_0) = (1/K) Σ_{i ∈ N_0} I(y_i = j)   (285)

where N_0 is the set of the K training points closest to x_0.
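A minimal numpy sketch of the KNN posterior estimate (285), with made-up training data and labels:

import numpy as np

def knn_posterior(x0, X, y, K, num_classes):
    # Pr(Y = j | X = x0) ~ (1/K) * sum_{i in N0} I(y_i = j)     (285)
    dists = np.linalg.norm(X - x0, axis=1)
    N0 = np.argsort(dists)[:K]                        # indices of the K nearest training points
    return np.bincount(y[N0], minlength=num_classes) / K

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)               # hypothetical labels
print(knn_posterior(np.array([0.5, 0.5]), X, y, K=5, num_classes=2))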

Risk, Empirical Risk Minimization, and Related

Function Properties. A twice-differentiable function f(x) is convex iff^53 f''(x) ≥ 0 for all x in its domain.

Expectations. [SOLVED] My Piazza Post

• Q: Why is E_S[loss(w^T x_i, y_i)] = R[w]?

• A: Because E_S is NOT the empirical expectation. It is the expectation over S ∼ D^n, where D^n := D × ··· × D (n times), which means we can state the following:

E_{S∼D^n}[loss(w^T x_i, y_i)] = E_{(x,y)∼D}[loss(w^T x, y)]

53 If f''(x) > 0, then we say f is strictly convex.


Bayes Classifier, Bayes Risk, and Related.


→ Bayes error rate. Below, the expectation is over all values that X can take on:

1 − E[ max_j Pr(Y = j | X) ]


Trees

Intuition/Conceptual.

• Splits. One useful way of visualizing splits for a given feature is to sort the [labeled] data along that feature. A split then just sends the points on one side of the split value to one branch and the points on the other side to the other branch.
