Advanced Introduction to Machine Learning — Spring Quarter, Week 5 —
https://canvas.uw.edu/courses/1372141
Prof. Jeff Bilmes
University of Washington, Seattle
Departments of: Electrical & Computer Engineering, Computer Science & Engineering
http://melodi.ee.washington.edu/~bilmes
April 27th/29th, 2020
Prof. Jeff Bilmes, EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020
Logistics Review
Announcements
HW2 is posted. Due May 1st, 6:00pm via our assignment dropbox (https://canvas.uw.edu/courses/1372141/assignments).
Virtual office hours this week, Thursday night at 10:00pm via zoom (same link as class).
Class Road Map
W1 (3/30, 4/1): What is ML, Probability, Coins, Gaussians and linear regression, Associative Memories, Supervised Learning
W2 (4/6, 4/8): More supervised, logistic regression, complexity and bias/variance tradeoff
W3 (4/13, 4/15): Bias/Variance, Regularization, Ridge, CrossVal, Multiclass
W4 (4/20, 4/22): Multiclass classification, ERM, Gen/Disc, Naive Bayes
W5 (4/27, 4/29): Lasso, Regularizers, Curse of Dimensionality
W6 (5/4, 5/6): Curse of Dimensionality, Dimensionality Reduction, k-NN
W7 (5/11, 5/13): k-NN, LSH, DTs, Bootstrap/Bagging, Boosting & Random Forests, GBDTs
W8 (5/18, 5/20): Graphs; Graphical Models (Factorization, Inference, MRFs, BNs)
W9 (5/27, 6/1): Learning Paradigms; Clustering; EM Algorithm
W10 (6/3, 6/8): Spectral Clustering, Graph SSL, Deep models, (SVMs, RL); The Future
Last lecture is 6/8 since 5/25 is a holiday (or we could just have lecture on 5/25).
Class (and Machine Learning) overview
1. Introduction
• What is ML
• What is AI
• Why are we so interested in these topics right now?
2. ML Paradigms/Concepts
• Overfitting/Underfitting, model complexity, bias/variance
• size of data, big data, sample complexity
• ERM, loss + regularization, loss functions, regularizers
• supervised, unsupervised, and semi-supervised learning
• reinforcement learning (RL), multi-agent, planning/control
• transfer and multi-task learning
• federated and distributed learning
• active learning, machine teaching
• self-supervised, zero/one-shot, open-set learning
3. Dealing with Features
• dimensionality reduction, PCA, LDA, MDS, t-SNE, UMAP
• locality sensitive hashing (LSH)
• feature selection
• feature engineering
• matrix factorization & feature engineering
• representation learning
4. Evaluation
• accuracy/error, precision/recall, ROC, likelihood/posterior, cost/utility, margin
• train/eval/test data splits
• n-fold cross validation
• method of the bootstrap
6. Inference Methods
• probabilistic inference
• MLE, MAP
• belief propagation
• forward/backpropagation
• Monte Carlo methods
7. Models & Representation
• linear least squares, linear regression, logistic regression, sparsity, ridge, lasso
• generative vs. discriminative models
• Naive Bayes
• k-nearest neighbors
• clustering, k-means, k-medoids, EM & GMMs, single linkage
• decision trees and random forests
• support vector machines, kernel methods, max margin
• perceptron, neural networks, DNNs
• Gaussian processes
• Bayesian nonparametric methods
• ensemble methods
• the bootstrap, bagging, and boosting
• graphical models
• time-series, HMMs, DBNs, RNNs, LSTMs, Attention, Transformers
• structured prediction
• grammars (as in NLP)
12. Other Techniques
• compressed sensing
• submodularity, diversity/homogeneity modeling
8. Philosophy, Humanity, Spirituality
• artificial intelligence (AI)
• artificial general intelligence (AGI)
• artificial intelligence vs. science fiction
9. Applications
• computational biology
• social networks
• computer vision
• speech recognition
• natural language processing
• information retrieval
• collaborative filtering/matrix factorization
10. Programming
• python
• libraries (e.g., NumPy, SciPy, matplotlib, scikit-learn (sklearn), pytorch, CNTK, Theano, tensorflow, keras, H2O, etc.)
• HPC: C/C++, CUDA, vector processing
11. Background
• linear algebra
• multivariate calculus
• probability theory and statistics
• information theory
• mathematical (e.g., convex) optimization
[Figure: a 6-node undirected graphical model over x1, …, x6, redrawn after each variable-elimination step.]
With the joint factoring (up to normalization) as a product of pairwise potentials, p(x1, …, x6) ∝ ψ(x1, x2) ψ(x1, x3) ψ(x3, x4) ψ(x3, x5) ψ(x2, x6), and writing φv for the intermediate factor (message) created by eliminating xv:

Elimination order x6, x3, x4, x5:
p(x1, x2) = Σ_{x3} Σ_{x4} ··· Σ_{x6} p(x1, x2, …, x6)
= Σ_{x3} Σ_{x4} Σ_{x5} ψ(x1, x2) ψ(x1, x3) ψ(x3, x4) ψ(x3, x5) Σ_{x6} ψ(x2, x6), with φ6(x2) ≜ Σ_{x6} ψ(x2, x6)
= Σ_{x4} Σ_{x5} ψ(x1, x2) φ6(x2) Σ_{x3} ψ(x1, x3) ψ(x3, x4) ψ(x3, x5), with φ3(x1, x4, x5) ≜ Σ_{x3} ψ(x1, x3) ψ(x3, x4) ψ(x3, x5)
= Σ_{x5} ψ(x1, x2) φ6(x2) Σ_{x4} φ3(x1, x4, x5), with φ4(x1, x5) ≜ Σ_{x4} φ3(x1, x4, x5)
= ψ(x1, x2) φ6(x2) Σ_{x5} φ4(x1, x5), with φ5(x1) ≜ Σ_{x5} φ4(x1, x5)
= ψ(x1, x2) φ6(x2) φ5(x1)

Elimination order x6, x5, x4, x3:
p(x1, x2) = Σ_{x3} Σ_{x4} ··· Σ_{x6} p(x1, x2, …, x6)
= Σ_{x3} Σ_{x4} Σ_{x5} ψ(x1, x2) ψ(x1, x3) ψ(x3, x4) ψ(x3, x5) Σ_{x6} ψ(x2, x6), with φ6(x2) as above
= Σ_{x3} Σ_{x4} ψ(x1, x2) φ6(x2) ψ(x1, x3) ψ(x3, x4) Σ_{x5} ψ(x3, x5), with φ5(x3) ≜ Σ_{x5} ψ(x3, x5)
= Σ_{x3} ψ(x1, x2) φ6(x2) ψ(x1, x3) φ5(x3) Σ_{x4} ψ(x3, x4), with φ4(x3) ≜ Σ_{x4} ψ(x3, x4)
= ψ(x1, x2) φ6(x2) Σ_{x3} ψ(x1, x3) φ5(x3) φ4(x3), with φ3(x1) ≜ Σ_{x3} ψ(x1, x3) φ5(x3) φ4(x3)
= ψ(x1, x2) φ6(x2) φ3(x1)
[Figure: for each variable to eliminate (x6, x5, x4, x3), the graphical transformation, the corresponding marginalization operation, and the reconstituted graph, with per-step complexities O(r^2), O(r^4), O(r^3), O(r^2) for one elimination order and O(r^2) at every step for the other.]
[Figure: a deep neural network with an input layer, hidden layers 1–7, and an output unit.]
5. Optimization Methods
• Unconstrained continuous optimization: (stochastic) gradient descent (SGD), adaptive learning rates, conjugate gradient, 2nd-order Newton
• Constrained continuous optimization: Frank-Wolfe (conditional gradient descent), projected gradient, linear, quadratic, and convex programming
• Discrete optimization: greedy, beam search, branch-and-bound, submodular optimization
Multiclass and Voronoi tessellations
ℓ classes, j ∈ {1, 2, …, ℓ}, parameters θ^(j) ∈ R^m, and ℓ linear discriminant functions h_{θ^(j)}(x) = ⟨θ^(j), x⟩.
Partition the input space x ∈ R^m based on who wins:
R^m = R^m_1 ∪ R^m_2 ∪ ··· ∪ R^m_ℓ   (5.19)
where R^m_i ⊆ R^m and R^m_i ∩ R^m_j = ∅ whenever i ≠ j.
Winning block, for all i ∈ [ℓ]:
R^m_i ≜ {x ∈ R^m : h_{θ^(i)}(x) > h_{θ^(j)}(x), ∀j ≠ i}   (5.20)
Alternatively,
R^m_i ≜ {x ∈ R^m : ⟨θ^(i) − θ^(j), x⟩ > 0, ∀j ≠ i}   (5.21)
when ⟨·,·⟩ is the dot product. Alternatively, under the softmax regression model,
R^m_i ≜ {x ∈ R^m : Pr(C_i|x) > Pr(C_j|x), ∀j ≠ i}   (5.22)
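A minimal numeric sketch of the winner-take-all rule in (5.20); the parameter values θ^(j) below are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical parameters theta^(j) for ell = 3 classes in R^2 (m = 2).
Theta = np.array([[ 1.0,  0.0],   # theta^(1)
                  [-1.0,  0.0],   # theta^(2)
                  [ 0.0,  1.0]])  # theta^(3)

def winner(x):
    """Index i of the winning block R^m_i: argmax_j <theta^(j), x>."""
    return int(np.argmax(Theta @ x))

print(winner(np.array([ 2.0, 0.1])))  # class 0 wins: largest <theta^(j), x>
print(winner(np.array([-3.0, 0.5])))  # class 1 wins
print(winner(np.array([ 0.2, 5.0])))  # class 2 wins
```

Ties go to the lowest index here; the regions in (5.20) use strict inequality, so exact boundary points belong to no block.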
0/1-loss and Bayes error
Let L be the 0/1 loss, i.e., L(y, y′) = 1{y ≠ y′}.
Probability of error for a given x:
Pr(h_θ(x) ≠ Y) = ∫ p(y|x) 1{h_θ(x) ≠ y} dy = 1 − p(h_θ(x)|x)   (5.7)
To minimize the probability of error, h_θ(x) should ideally choose a class having maximum posterior probability for the current x.
The smallest probability of error is known as the Bayes error for x:
BayesError(x) = min_y (1 − p(y|x)) = 1 − max_y p(y|x)   (5.8)
The Bayes classifier (or predictor) predicts using the highest-probability class, assuming we have access to p(y|x). I.e.,
BayesPredictor(x) ≜ argmax_y p(y|x)   (5.9)
The Bayes predictor has "Bayes error" as its error; it is the irreducible error rate.
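A sketch of (5.8)–(5.9) on a toy discrete posterior; the table p(y|x) below is made up for illustration:

```python
# Hypothetical posterior p(y|x) for x in {0, 1, 2} and y in {0, 1}.
post = {0: {0: 0.9, 1: 0.1},
        1: {0: 0.3, 1: 0.7},
        2: {0: 0.5, 1: 0.5}}

def bayes_predictor(x):
    """argmax_y p(y|x), per (5.9)."""
    return max(post[x], key=post[x].get)

def bayes_error(x):
    """1 - max_y p(y|x), per (5.8)."""
    return 1.0 - max(post[x].values())

for x in post:
    print(x, bayes_predictor(x), bayes_error(x))
```

Note x = 2 has Bayes error 1/2: for that x, no predictor beats a coin flip.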
Bayes Error, overall di�culty of a problem
For Bayes error, we often take the expected value w.r.t. x. This gives an overall indication of how difficult a classification problem is, since we can never do better than the Bayes error.
Bayes error, in various equivalent forms (Exercise: show the last equality):
Bayes Error = min_h ∫ p(x, y) 1{h(x) ≠ y} dy dx   (5.13)
= min_h Pr(h(X) ≠ Y)   (5.14)
= E_{p(x)}[min(p(y|x), 1 − p(y|x))]   (5.15)
= 1/2 − (1/2) E_{p(x)}[|2 p(y|x) − 1|]   (5.16)
(the last two forms are for the binary case). For binary classification, if the Bayes error is 1/2, prediction can never be better than flipping a coin (i.e., x tells us nothing about y).
Bayes error is a property of the distribution p(x, y); it can be useful to decide between, say, p(x, y) vs. p(x′, y) for two different feature sets x and x′.
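A quick numeric check of the binary-case identity between (5.15) and (5.16), using randomly drawn stand-ins for p(y = 1|x):

```python
import random

# min(p, 1-p) = 1/2 - 1/2 * |2p - 1| holds pointwise, so the two
# expectations agree sample-by-sample; ps is a made-up stand-in
# for the values p(y=1|x) as x varies.
random.seed(0)
ps = [random.random() for _ in range(100_000)]
lhs = sum(min(p, 1 - p) for p in ps) / len(ps)            # (5.15)
rhs = 0.5 - 0.5 * sum(abs(2 * p - 1) for p in ps) / len(ps)  # (5.16)
print(abs(lhs - rhs))  # identical up to floating-point rounding
```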
ERM and complexity
Empirical risk minimization on data D:
ĥ ∈ argmin_{h∈H} R̂isk(h) = (1/n) Σ_{i=1}^n 1{h(x^(i)) ≠ y^(i)}   (5.13)
ĥ is a random variable, computed based on the random data set D. How close is the risk of ĥ to the true risk, the one for the best hypothesis h*?
h* ∈ argmin_{h∈H} Risk(h) = ∫ p(x, y) 1{h(x) ≠ y} dx dy   (5.14)
We can sometimes bound the probability (e.g., PAC-style bounds):
Pr(Risk(ĥ) − Risk(h*) > ε) < δ(n, VC, ε)   (5.15)
where δ is a function of the sample size n and the complexity (ability, flexibility, expressiveness) of the family of models H over which we're learning. The bound gets worse with increasing complexity of H or decreasing n. There is an extensive mathematical theory behind this (e.g., Vapnik–Chervonenkis (VC) dimension, or Rademacher complexity).
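A small simulation sketch of the ERM picture above: H is a toy class of threshold classifiers and p(x, y) is a made-up distribution (both are assumptions for illustration). The empirical minimizer ĥ can only do as well as h* in true risk:

```python
import random
random.seed(1)

# H = {h_t : t in {0.0, 0.1, ..., 1.0}} with h_t(x) = 1{x > t}.
thresholds = [i / 10 for i in range(11)]
def h(t, x): return 1 if x > t else 0

def sample(n):
    """Draw from a hypothetical p(x, y): x ~ Unif[0,1], p(y=1|x) flips at 0.5."""
    data = []
    for _ in range(n):
        x = random.random()
        y = 1 if random.random() < (0.9 if x > 0.5 else 0.1) else 0
        data.append((x, y))
    return data

train = sample(200)
def emp_risk(t): return sum(h(t, x) != y for x, y in train) / len(train)
t_hat = min(thresholds, key=emp_risk)      # ERM solution h-hat

big = sample(100_000)                      # large sample as a proxy for p(x, y)
def risk(t): return sum(h(t, x) != y for x, y in big) / len(big)
t_star = min(thresholds, key=risk)         # best-in-class h*

print(t_hat, risk(t_hat) - risk(t_star))   # nonnegative excess risk
```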
Generative vs. Discriminative models
Generative models: given data drawn from p(x, y), model the entire distribution, often factored as p(x|y)p(y), where y is the class label (or prediction) random variable and x is the feature vector random variable.
Discriminative models: when the goal is classification/regression, it can sometimes be simpler to just model p(y|x), since that is sufficient to achieve the Bayes error (i.e., argmax_y p(y|x) for classification) or E[Y|x] for regression. (E.g., support vector machines, max-margin learning, conditional likelihood, etc.)
Sometimes they are the same. A Naive Bayes class-conditional Gaussian generative model p(x|y) leads to logistic regression.
Accuracy Objectives
In decreasing order of complexity of the task being learnt, while still solving a classification decision problem (from strongest to weakest goals):
Generative modeling: model p(x, y) and compute argmax_y p(x, y) for a given x. Maximum likelihood principle.
Discriminative modeling: model only p(y|x) and compute argmax_y p(y|x). Maximum conditional likelihood principle.
Rank modeling: a rank function model f(y|x) so that f(σ1|x) ≥ f(σ2|x) ≥ ··· ≥ f(σℓ|x) whenever p(σ1|x) ≥ p(σ2|x) ≥ ··· ≥ p(σℓ|x).
Best decision modeling: decision functions with h_y(x) ≥ h_{y′}(x) whenever p(y|x) ≥ p(y′|x) for all y′ ≠ y. Discriminant functions.
A weaker model often lives within a stronger model (e.g., logistic or softmax uses linear discriminant functions), as we saw, thereby giving it improved interpretation (e.g., probabilities rather than just unnormalized scores).
Naıve Bayes classifier
Recall, x ∈ R^m, so m feature values in x = (x1, x2, …, xm) are used to predict y in the joint model p(x, y).
Generative conditional distribution: p(x, y) = p(x|y)p(y). Conditioned on y, p(x|y) can be an arbitrary distribution.
Naive Bayes classifiers make a simplifying (factorization) assumption:
p(x|y) = ∏_{j=1}^m p(xj|y)   (5.21)
or that Xi ⊥⊥ Xj | Y, read "Xi and Xj are conditionally independent given Y", for all i ≠ j. This does not mean that Xi ⊥⊥ Xj.
Estimating many 1-dimensional distributions p(xj|y) is much easier than estimating the full joint p(x|y), due to the curse of dimensionality.
Naive Bayes classifiers make a greatly simplifying assumption, but they often work quite well.
Posterior From Binary Naive Bayes Gaussian
In the binary case y ∈ {0, 1}, the Naive Bayes Gaussian classifier posterior probability is:
p(y = 1|x) = p(x|y = 1) p(y = 1) / [p(x|y = 1) p(y = 1) + p(x|y = 0) p(y = 0)]   (5.23)
= 1 / (1 + [p(x|y = 0) p(y = 0)] / [p(x|y = 1) p(y = 1)])   (5.24)
= 1 / (1 + exp(−log [p(x|y = 1) p(y = 1)] / [p(x|y = 0) p(y = 0)]))   (5.25)
= g(Σ_{i=1}^m log [p(xi|y = 1) / p(xi|y = 0)] + log [p(y = 1) / p(y = 0)])   (5.26)
where g(z) = 1/(1 + e^{−z}) is the logistic function, and
log p(y = 1)/p(y = 0) is the log odds ratio.
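A numeric check of (5.23)–(5.26), with hypothetical 1-D Gaussian class-conditionals (m = 2, shared variance, made-up means and priors): the directly computed posterior matches the logistic of the summed log-likelihood ratios plus the log odds ratio:

```python
import math

def gauss(x, mu, var):
    """Univariate Gaussian density N(x; mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

mus = {1: [1.0, 2.0], 0: [-1.0, 0.0]}   # hypothetical class means
var = 1.5                                # shared class variance
prior = {1: 0.4, 0: 0.6}                 # hypothetical priors
x = [0.3, 1.1]                           # a test point

# Direct posterior via (5.23), with the Naive Bayes product for p(x|y).
lik = {y: prior[y] * math.prod(gauss(xi, mu, var) for xi, mu in zip(x, mus[y]))
       for y in (0, 1)}
direct = lik[1] / (lik[0] + lik[1])

# Logistic form (5.26): g(sum of log-likelihood ratios + log odds ratio).
z = sum(math.log(gauss(xi, m1, var) / gauss(xi, m0, var))
        for xi, m1, m0 in zip(x, mus[1], mus[0]))
z += math.log(prior[1] / prior[0])
logistic = 1.0 / (1.0 + math.exp(-z))

print(abs(direct - logistic))  # agree up to rounding
```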
Binary (` = 2) Naıve Bayes Gaussian
Since p(xi|y) is univariate Gaussian, log [p(xi|y=1)/p(xi|y=0)] + log [p(y=1)/p(y=0)] takes one of the two following forms:
1. Σ_i b_i x_i + c if σ_{1,i} = σ_{2,i} ∀i (equal class variances). Hence, we have a linear model p(y = 1|x) = g(θᵀx + const), like we saw in logistic regression!!
2. Σ_i (a_i x_i² + b_i x_i) + c otherwise, so we have a quadratic model followed by the logistic function: p(y = 1|x) = g(Σ_{i=1}^m a_i x_i² + b_i x_i + c_i). Unequal class variances introduce quadratic features. Alternatively, we could have just started with features like x_i², increasing m, without any need for unequal class variances. There are still no cross-interaction terms of the form x_i x_j, due to the Naive Bayes assumption! But we can introduce those x_i x_j cross features as well, further increasing m: feature augmentation.
3. Thus, non-Naive-Bayes with fewer features is identical to doing Naive Bayes with more features. This helps to explain why Naive Bayes often works so well, since one can remove unfounded assumptions simply by adding features.
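A check of form 2 above with hypothetical per-feature parameters: for unequal class variances, the log-likelihood ratio of two univariate Gaussians is exactly a quadratic a·x² + b·x + c, i.e., linear in the augmented features (x², x):

```python
import math

mu1, s1 = 1.0, 2.0    # class 1: mean, variance (made-up values)
mu0, s0 = -0.5, 0.5   # class 0: mean, variance (unequal to s1)

def log_ratio(x):
    """log N(x; mu1, s1) - log N(x; mu0, s0)."""
    return (-(x - mu1) ** 2 / (2 * s1) + (x - mu0) ** 2 / (2 * s0)
            + 0.5 * math.log(s0 / s1))

# Closed-form quadratic coefficients from expanding the squares:
a = 1 / (2 * s0) - 1 / (2 * s1)
b = mu1 / s1 - mu0 / s0
c = mu0 ** 2 / (2 * s0) - mu1 ** 2 / (2 * s1) + 0.5 * math.log(s0 / s1)

match = all(abs(log_ratio(x) - (a * x * x + b * x + c)) < 1e-12
            for x in (-2.0, 0.0, 0.7, 3.0))
print(match)  # True
```

With s0 = s1, the coefficient a vanishes and the model is linear in x, which is exactly form 1.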
Binary Naive Bayes Gaussian
Bottom line: binary Naive Bayes Gaussian (equivalently, two-class Gaussian generative model classifiers) is equivalent to logistic regression when we form the posterior distribution p(y|x) for classification. We can eliminate the Naive Bayes property in the original feature space by adding more features (feature augmentation).
In general, feature engineering (producing and choosing the right features) is an important part of any machine learning system.
Representation learning: the art and science of using machine learning to learn the features themselves, often using a deep neural network. Self-supervised learning can facilitate this (see final lecture).
Disentanglement: the art and science of taking a feature representation and producing new deduced features where separate phenomena in the signal have separate representation in the deduced features. A modern form of ICA (independent component analysis).
Norms and p-norms
For any real-valued p ≥ 1 and vector x ∈ R^m, we have a norm:
‖x‖_p ≜ (Σ_{j=1}^m |x_j|^p)^{1/p}   (5.24)
Properties of any "norm" ‖x‖ for x ∈ R^m are:
1. ‖x‖ ≥ 0 for any x ∈ R^m.
2. ‖x‖ = 0 iff x = 0.
3. Absolutely homogeneous: ‖αx‖ = |α| ‖x‖ for any α ∈ R.
4. Triangle inequality: for any x, y ∈ R^m, we have ‖x + y‖ ≤ ‖x‖ + ‖y‖.
Euclidean norm: the 2-norm (where p = 2) is the standard Euclidean distance (from zero) that we use in everyday life.
The p-norm is also defined, and is a valid norm, as p → ∞, in which case
‖x‖_∞ = max_{j=1}^m |x_j|   (5.25)
There are other norms besides p-norms.
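A small sketch of (5.24) and (5.25): as p grows, ‖x‖_p approaches max_j |x_j|:

```python
def pnorm(x, p):
    """The p-norm (5.24) of a vector, for real p >= 1."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

x = [3.0, -4.0, 1.0]
print(pnorm(x, 1))             # 8.0
print(pnorm(x, 2))             # sqrt(26), about 5.10
print(pnorm(x, 100))           # already very close to max |x_j| = 4
print(max(abs(v) for v in x))  # the infinity-norm (5.25)
```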
(Handwritten note: the ℓ0 "counting norm" ‖x‖₀ counts the number of non-zero elements in the vector x ∈ R^m.)
The Lasso and Constrained Optimization
Lasso adds a penalty to the learnt weights:
θ̂ ∈ argmin_θ (1/2) Σ_{i=1}^n (y^(i) − θᵀx^(i))² + λ‖θ‖₁   (5.25)
where ‖·‖₁ is known as the 1-norm, defined as ‖θ‖₁ = Σ_{j=1}^m |θ_j|.
Equivalent formulation:
minimize_θ Σ_{i=1}^n (y^(i) − θᵀx^(i))²   (5.26)
subject to ‖θ‖₁ ≤ t   (5.27)
There is a one-to-one correspondence between λ and t yielding identical solutions (see Boyd and Vandenberghe's book, Convex Optimization).
No effect if t > Σ_{i=1}^m |θ̂_i|, where θ̂ is the least squares estimate.
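A sketch of the last remark on made-up data: if the least-squares solution already satisfies the 1-norm constraint, the constraint in (5.26)–(5.27) is inactive and lasso coincides with least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
theta_true = np.array([1.0, 0.0, -2.0])       # hypothetical ground truth
y = X @ theta_true + 0.1 * rng.normal(size=50)

# Unconstrained least-squares minimizer of (5.26).
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

t = np.abs(theta_ls).sum() + 1.0              # a ball big enough to contain it
inactive = np.abs(theta_ls).sum() <= t
print(inactive)  # True: the constraint does nothing, lasso == least squares
```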
Sparsity
Consider the linear model h_θ(x) = ⟨θ, x⟩, where x ∈ R^m.
m might be very big, since we don't know what features to use.
Useful: allow the learning system to throw away needless features; "throw away" means the resulting learnt model has θ_j = 0. This helps to:
• reduce computation, memory, and resources (e.g., sensor requirements);
• improve prediction (reduce noise variables): trade off a bit of bias increase in order to reduce variance (to improve overall accuracy);
• ease interpretation (what from x is critical to predict y);
• remove dirty variables (data hygiene);
• reduce redundant cancellation, i.e., if typically θ_i x_i ≈ −θ_j x_j, there is no effect on prediction even if both θ_i and θ_j are not zero.
The effective new model has m′ ≪ m features.
Feature selection in ML
Feature selection: a branch of feature engineering.
Many applications start with massive feature sets, more than is necessary.
Examples: spam detection, natural language processing (NLP) and n-gram features; cancer detection in computational biology, k-mer motifs, gene expression via DNA microarrays, mRNA abundance, biopsies.
Important in particular when m > n, but potentially useful for any n, m when features are redundant/superfluous, irrelevant, or expensive.
Overall goal: we want the least costly but still informative feature set.
Sparsity (via shrinkage methods) is one way to do this, but other discrete optimization methods work too (e.g., the greedy procedure).
Many are implemented in python's sklearn: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection
Greedy Forward Selection Heuristic
Let x ∈ R^m. Let the features be indexed by the integer set U = {1, 2, …, m}, and for A ⊆ U let x_A = (x_{a1}, x_{a2}, …, x_{ak}) be the features indexed by subset A, where |A| = k. Similarly, θ_A is so defined. Let h_{θ_A}(x_A) be the best learnt model using only features A, giving parameters θ_A and accuracy(A) ≈ 1 − Pr(h_{θ_A}(X_A) ≠ Y).
To find the best set (max_{A⊆U:|A|=k} accuracy(A)), we would need to consider (m choose k) = O((me/k)^k) subsets by Stirling's formula, exponential in m.
Greedy forward selection to select k features, achieving sparsity, is O(mk):
Algorithm 1: Greedy Forward Selection
1: A ← ∅
2: for i = 1, …, k do
3:   choose v′ ∈ argmax_{v∈U\A} accuracy(A ∪ {v})
4:   A ← A ∪ {v′}
Better than (m choose k), but we still need to retrain the model each time.
One solution: model accuracy (e.g., submodularity, later). Another solution: Lasso regression (next).
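A self-contained sketch of Algorithm 1. As an assumption (to avoid training a real classifier), a candidate set A is scored by least-squares fit quality rather than classification accuracy; the data are synthetic, with only features 1 and 4 informative:

```python
import numpy as np

def score(X, y, A):
    """Higher is better: negative residual sum of squares of an LS fit on A."""
    theta, *_ = np.linalg.lstsq(X[:, A], y, rcond=None)
    return -((y - X[:, A] @ theta) ** 2).sum()

def greedy_forward(X, y, k):
    """Algorithm 1: repeatedly add the single best feature, O(m*k) fits."""
    U, A = set(range(X.shape[1])), []
    for _ in range(k):
        v_best = max(U - set(A), key=lambda v: score(X, y, A + [v]))
        A.append(v_best)
    return A

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 1] - 2.0 * X[:, 4] + 0.1 * rng.normal(size=200)
print(sorted(greedy_forward(X, y, k=2)))  # recovers the informative features
```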
Forward vs. Backwards Selection
Rather than start from the empty set and add one feature at a time, we can:
Start with all features and greedily remove the feature that hurts us the least: backwards selection.
This is also only a heuristic (it might not work well).
In backwards selection, variance is an issue when using many features.
Bidirectional greedy: pick a random order; consider both adding (starting at the empty set) and removing (starting at all features), based on the order, and randomly choose based on a distribution proportional to both:
1. the accuracy gain accuracy(A ∪ {v}) − accuracy(A), and
2. the accuracy loss accuracy(S) − accuracy(S \ {v}).
Lasso Regularizers Curse of Dimensionality Dimensionality Reduction
Back To Lasso
Ridge regression adds a sum-of-squares penalty to the learnt weights:
θ̂ ∈ argmin_θ J_ridge(θ) = Σ_{i=1}^n (y^(i) − θᵀx^(i))² + λ‖θ‖₂²   (5.1)
where ‖·‖₂ is known as the 2-norm, defined as ‖θ‖₂ = (Σ_{j=1}^m θ_j²)^{1/2}.
Lasso linear regression:
θ̂ ∈ argmin_θ J_lasso(θ) = Σ_{i=1}^n (y^(i) − θᵀx^(i))² + λ‖θ‖₁   (5.2)
where ‖·‖₁ is known as the 1-norm, defined as ‖θ‖₁ = Σ_{j=1}^m |θ_j|.
p-norm unit ball contours (in 2D)
[Figure: the 2D unit spheres {x ∈ R^m : ‖x‖₂ = 1}, {x ∈ R^m : ‖x‖₁ = 1}, and {x ∈ R^m : ‖x‖_∞ = 1} drawn in the (x1, x2) plane.]
We have (consistent with the picture) that if 0 < r < p then ‖x‖_r ≥ ‖x‖_p, thus:
{x ∈ R^m : ‖x‖₁ ≤ 1} ⊂ {x ∈ R^m : ‖x‖₂ ≤ 1} ⊂ {x ∈ R^m : ‖x‖_∞ ≤ 1}
Note, ‖x‖_∞ = max_i |x_i| is known as the infinity-norm (or max-norm).
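A randomized numeric check of the nesting above, via the pointwise inequality ‖x‖₁ ≥ ‖x‖₂ ≥ ‖x‖_∞:

```python
import random

def norms(x):
    """Return (1-norm, 2-norm, infinity-norm) of a vector."""
    n1 = sum(abs(v) for v in x)
    n2 = sum(v * v for v in x) ** 0.5
    ninf = max(abs(v) for v in x)
    return n1, n2, ninf

random.seed(0)
ok = True
for _ in range(1000):
    x = [random.uniform(-10, 10) for _ in range(4)]
    n1, n2, ninf = norms(x)
    ok = ok and (n1 >= n2 >= ninf)
print(ok)  # True: any point inside the 1-ball is inside the 2-ball, etc.
```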
More p-norm unit ball contours (in 2D)
Right: various p-norm balls for p ≥ 1, and analogous balls for p < 1. For p = 0, we get the L0 ball, which counts the number of non-zero entries and is the most sparsity-encouraging. The L1 ball is the convex envelope of the L0 ball, so it is arguably the best combination of easy & appropriate.
[Figure: the balls {x : ‖x‖_p = 1} in the (x1, x2) plane for p = 30, 5, 2, 3/2, 1, 1/2, and 1/4.]
Linear Regression, Ridge, Lasso
[Figure: quadratic contours {θ : Σ_{i=1}^n (y^(i) − θᵀx^(i))² = c} around the least-squares solution θ̂, shown with the constraint sets {θ ∈ R^m : ‖θ‖₂ ≤ t} (ridge) and {θ ∈ R^m : ‖θ‖₁ ≤ t} (lasso) in the (θ1, θ2) plane.]
θ̂ is the linear least squares solution, with quadratic contours of equal error. For a given λ, the solution trades off regularizer cost with error cost. From Hastie et al.
Other norms as regularizers
Using p-norms with p < 1 encourages sparsity even further (but is harder to optimize).
[Figure: the contours {θ : Σ_{i=1}^n (y^(i) − θᵀx^(i))² = c} together with the balls {θ : ‖θ‖_p = 1} for p = 30, 5, 2, 3/2, 1, 1/2, and 1/4, in the (θ1, θ2) plane.]
Lasso: How to solve
There is a closed-form solution for ridge, but no closed form for lasso.
We used gradient descent (or SGD) for ridge. The lasso objective is not differentiable, since |θ| is not differentiable at θ = 0:
∂/∂θ |θ| = { 1 if θ > 0; −1 if θ < 0; ?? if θ = 0 }   (5.3)
How to solve? There are many possible ways.
Simple and effective: coordinate descent, where we find a subgradient of the objective J_lasso(θ) w.r.t. one element of θ at a time and solve it.
Use subgradient descent (see https://see.stanford.edu/materials/lsocoee364b/01-subgradients_notes.pdf for full details).
This results in a simple solution: soft thresholding.
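The "??" entry in (5.3) can be made concrete: at θ = 0, every g ∈ [−1, 1] is a subgradient of |θ|, since |z| ≥ 0 + g·z for all z. A quick numeric check:

```python
# Check the subgradient inequality |z| >= |0| + g*z for f(theta) = |theta|
# at theta = 0, over a grid of candidate slopes g and test points z.
gs = [-1.0, -0.5, 0.0, 0.5, 1.0]   # every g in [-1, 1] should work
zs = [-2.0, -0.3, 0.0, 0.3, 2.0]
is_subgradient = all(abs(z) >= g * z for g in gs for z in zs)
print(is_subgradient)  # True
```

Slopes with |g| > 1 fail the inequality (take z = sign(g)·1), so the subdifferential at 0 is exactly [−1, 1].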
Coordinate Descent
Isolate one coordinate k and fix the remainder j ≠ k:
Σ_{i=1}^n (y^(i) − θᵀx^(i))² + λ‖θ‖₁
= Σ_{i=1}^n (y^(i) − Σ_{j=1}^m θ_j x_j^(i))² + λ‖θ‖₁   (5.4)
= Σ_{i=1}^n (y^(i) − θ_k x_k^(i) − Σ_{j≠k} θ_j x_j^(i))² + λ Σ_{j≠k} |θ_j| + λ|θ_k|   (5.5)
= Σ_{i=1}^n (r_k^(i) − θ_k x_k^(i))² + λ Σ_{j≠k} |θ_j| + λ|θ_k|   (5.6)
= J_{λ,lasso}(θ_k; θ_1, θ_2, …, θ_{k−1}, θ_{k+1}, …, θ_m)   (5.7)
where r_k^(i) ≜ y^(i) − Σ_{j≠k} θ_j x_j^(i) is the partial residual. We then optimize only θ_k holding the remainder fixed, and then choose a different k′ for the next round.
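The algebra in (5.4)–(5.6) can be sanity-checked numerically; the data, θ, λ, and k below are arbitrary made-up values:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
y = rng.normal(size=6)
theta = rng.normal(size=4)
lam, k = 0.3, 2

# Full objective (5.4).
full = ((y - X @ theta) ** 2).sum() + lam * np.abs(theta).sum()

# Isolated form (5.6): partial residual r_k^(i) = y^(i) - sum_{j!=k} theta_j x_j^(i).
r_k = y - X @ theta + X[:, k] * theta[k]
iso = (((r_k - X[:, k] * theta[k]) ** 2).sum()
       + lam * np.abs(np.delete(theta, k)).sum() + lam * abs(theta[k]))

gap = abs(full - iso)
print(gap)  # zero up to floating-point rounding
```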
Coordinate Descent
General algorithm; one round of coordinate descent, to be repeated:
Algorithm 2: Coordinate Descent for Lasso Solution
1: θ ← 0
2: for k = 1, …, m do
3:   r_k^(i) ← y^(i) − Σ_{j≠k} x_j^(i) θ_j
4:   θ_k ← argmin_{θ_k} Σ_{i=1}^n (r_k^(i) − x_k^(i) θ_k)² + λ|θ_k|
We repeat the above several times until convergence.
This doesn't by itself help with the lack of differentiability. How do we perform the minimization in step 4?
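A minimal coordinate-descent lasso sketch putting the pieces together. The 1-D minimization in each step is solved by soft thresholding, which follows from setting the subgradient of Σ_i (r − xθ)² + λ|θ| to zero; the data here are synthetic, with true coefficients (2, 0, 0, −3, 0):

```python
import numpy as np

def soft(rho, thr):
    """Soft thresholding: shrink rho toward 0 by thr, clipping to exactly 0."""
    return np.sign(rho) * max(abs(rho) - thr, 0.0)

def lasso_cd(X, y, lam, sweeps=200):
    """Algorithm 2, repeated: cyclic coordinate descent for the lasso."""
    n, m = X.shape
    theta = np.zeros(m)
    for _ in range(sweeps):
        for k in range(m):
            r_k = y - X @ theta + X[:, k] * theta[k]   # partial residual
            rho = X[:, k] @ r_k
            # Subgradient solution of the 1-D problem in step 4:
            theta[k] = soft(rho, lam / 2.0) / (X[:, k] @ X[:, k])
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, 0.0, 0.0, -3.0, 0.0]) + 0.01 * rng.normal(size=100)
theta = lasso_cd(X, y, lam=5.0)
print(theta)  # the noise coordinates are driven to (exactly) zero
```

Note the contrast with ridge: the soft-threshold clip produces exact zeros, which is the sparsity the L1 ball pictures promised.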
Convexity, Subgradients, and Subdifferential
Convexity: A general function f : R^m → R is convex if for all x, y ∈ R^m and λ ∈ [0, 1]:
f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)   (5.8)
Convex functions have no local minima other than the global minima.
A vector g ∈ R^m is a subgradient of f at x if for all z we have f(z) ≥ f(x) + gᵀ(z − x). This is valid both when f is differentiable (e.g., at x1) and when f is not differentiable (e.g., at x2).
[Figure: a convex f(z) with a unique tangent f(x1) + g1ᵀ(z − x1) at a differentiable point x1, and multiple supporting lines f(x2) + g2ᵀ(z − x2) and f(x2) + g3ᵀ(z − x2) at a non-differentiable point x2.]
The set of all subgradients of f at x is known as the subdifferential at x:
∂f(x) = {g ∈ R^m : ∀z, f(z) ≥ f(x) + gᵀ(z − x)}   (5.9)
If f is differentiable at x then ∂f(x) = {∇f(x)}.
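A quick randomized check of definition (5.8) for the convex function f = ‖·‖₁ (the one the lasso penalizes with):

```python
import random

def f(v):
    """f(v) = ||v||_1, a convex function."""
    return sum(abs(t) for t in v)

random.seed(0)
ok = True
for _ in range(1000):
    x = [random.uniform(-5, 5) for _ in range(3)]
    y = [random.uniform(-5, 5) for _ in range(3)]
    lam = random.random()
    mid = [lam * a + (1 - lam) * b for a, b in zip(x, y)]
    # Definition (5.8), with a tiny tolerance for floating-point rounding.
    ok = ok and (f(mid) <= lam * f(x) + (1 - lam) * f(y) + 1e-12)
print(ok)  # True
```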
Convexity, Subgradients, and Lasso

Fact: a convex function is minimized at $x^*$ iff the vector $0$ is a subgradient at $x^*$. In other words,
\[
0 \in \partial f(x^*) \iff f(x^*) \le f(x)\ \forall x \tag{5.10}
\]
That is, at a minimum, zero is a member of the subdifferential.

When $f$ is everywhere differentiable, this is the same as finding an $x^*$ that satisfies $\nabla_x f(x^*) = 0$.

Fact: the lasso coordinate objective $\sum_{i=1}^n (r_k^{(i)} - x_k^{(i)}\theta_k)^2 + \lambda|\theta_k|$ is convex in the parameter $\theta_k$.

Hence, we find a value of $\theta_k$ so that zero is a subgradient (i.e., is a member of the subdifferential).

With the L1 objective, things are thus almost as easy as with L2.
Lasso's Subdifferential

Define $a_k = 2\sum_{i=1}^n (x_k^{(i)})^2$ and $c_k = 2\sum_{i=1}^n r_k^{(i)} x_k^{(i)}$. Note $a_k > 0$.

Find $\theta_k$ making the following contain zero:
\[
\partial f = \partial\!\left( \sum_{i=1}^n (r_k^{(i)} - x_k^{(i)}\theta_k)^2 + \lambda|\theta_k| \right)
= \begin{cases}
\{a_k\theta_k - c_k - \lambda\} & \text{if } \theta_k < 0 \\
[-c_k - \lambda,\ -c_k + \lambda] & \text{if } \theta_k = 0 \\
\{a_k\theta_k - c_k + \lambda\} & \text{if } \theta_k > 0
\end{cases}
\]

[Figure: $\partial f$ plotted against $\theta_k$ for the three cases $c_k < -\lambda$, $|c_k| \le \lambda$, and $c_k > \lambda$; each branch is a line of slope $a_k$, with a jump of height $2\lambda$ at $\theta_k = 0$.]

The value of $\theta_k$ such that $0 \in \partial f$ depends on $c_k$; the three cases $c_k < -\lambda$, $|c_k| \le \lambda$, and $c_k > \lambda$ are shown on the right.

If $c_k < -\lambda$ or $c_k > \lambda$, $f$ is differentiable at the point where $0 \in \partial f = \{\nabla f\}$, giving an analytic solution for the minimum (where the pink/orange line crosses zero, the x-axis).

If $|c_k| \le \lambda$, then $0 \in [-c_k - \lambda,\ -c_k + \lambda]$, hence $\theta_k = 0$ is the minimum.
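The three branches follow because the smooth part of the coordinate objective has derivative $a_k\theta_k - c_k$, to which $\lambda\,\partial|\theta_k|$ is added. A finite-difference check of that derivative, with synthetic stand-ins for $r_k$ and $x_k$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
xk = rng.normal(size=n)    # illustrative feature column x_k^{(i)}
rk = rng.normal(size=n)    # illustrative partial residuals r_k^{(i)}

a_k = 2 * np.sum(xk ** 2)  # note a_k > 0
c_k = 2 * np.sum(rk * xk)

def smooth(theta):         # the differentiable part of the coordinate objective
    return np.sum((rk - xk * theta) ** 2)

# Away from theta_k = 0, the derivative of the smooth part is a_k*theta - c_k;
# adding lam * d|theta| then gives the first and third branches above.
h = 1e-6
for theta in (-1.3, 0.4, 2.0):
    fd = (smooth(theta + h) - smooth(theta - h)) / (2 * h)
    assert abs(fd - (a_k * theta - c_k)) < 1e-3
```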
Soft Thresholding

This gives us the following solution (note again $a_k > 0$):
\[
\theta_k = \begin{cases}
(c_k + \lambda)/a_k & \text{if } c_k < -\lambda \text{ (making } \theta_k < 0\text{)} \\
0 & \text{if } |c_k| \le \lambda \text{ (making } \theta_k = 0\text{)} \\
(c_k - \lambda)/a_k & \text{if } c_k > \lambda \text{ (making } \theta_k > 0\text{)}
\end{cases} \tag{5.11}
\]
\[
= \operatorname{sign}(c_k)\max(0, |c_k| - \lambda)/a_k \tag{5.12}
\]

Note that without any lasso term, the solution for $\theta_k$ would be $\theta_k = c_k/a_k = \operatorname{sign}(c_k)|c_k|/a_k = \operatorname{sign}(c_k)\max(0, |c_k|)/a_k$.

The above has a nice intuitive interpretation: we solve as normal (using gradients, $c_k/a_k$, and $\lambda$) as long as $c_k$ is not too small.

If $c_k$ does get too small (the middle case), then we just truncate $\theta_k$ to zero; when we truncate depends on $\lambda$ and $c_k$.
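Combining Algorithm 2 with the soft-thresholding update (5.11)-(5.12) yields a complete lasso solver. A minimal sketch on synthetic data (the function names, data sizes, and λ are illustrative):

```python
import numpy as np

def soft_threshold(c, lam):
    """sign(c) * max(0, |c| - lam), the numerator of eq. (5.12)."""
    return np.sign(c) * np.maximum(0.0, np.abs(c) - lam)

def lasso_coordinate_descent(X, y, lam, n_rounds=100):
    """Algorithm 2 with the closed-form coordinate update (5.11)-(5.12)."""
    n, m = X.shape
    theta = np.zeros(m)
    for _ in range(n_rounds):
        for k in range(m):
            r_k = y - X @ theta + X[:, k] * theta[k]   # partial residual
            a_k = 2 * np.sum(X[:, k] ** 2)
            c_k = 2 * np.sum(r_k * X[:, k])
            theta[k] = soft_threshold(c_k, lam) / a_k
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_theta = np.zeros(10)
true_theta[[0, 3]] = [3.0, -2.0]
y = X @ true_theta + 0.1 * rng.normal(size=100)

# With a moderately large lam, irrelevant coordinates are driven exactly to 0.
theta = lasso_coordinate_descent(X, y, lam=20.0)
```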
Simple solution: X has orthonormal columns

Let $\hat\theta$ be the linear least-squares solution. When the design matrix $X$ has orthonormal columns (a very special case), the regularized solutions are simple modifications of $\hat\theta$.

Ridge solution becomes $\hat\theta_j/(1 + \lambda)$: shrinkage.

Lasso solution becomes $\operatorname{sign}(\hat\theta_j)\max(0, |\hat\theta_j| - \lambda)$: soft thresholding.

Best subset of size $k$: $\hat\theta_{\sigma_j}\mathbf{1}\{j \le k\}$, where $\sigma = (\sigma_1, \sigma_2, \ldots, \sigma_m)$ is a permutation of $\{1, 2, \ldots, m\}$ such that $|\hat\theta_{\sigma_1}| \ge |\hat\theta_{\sigma_2}| \ge \cdots \ge |\hat\theta_{\sigma_m}|$: hard thresholding.
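These closed forms can be checked numerically. One caveat of convention: under the unscaled squared error of (5.13), the coordinate update (5.11)-(5.12) gives an effective lasso threshold of $\lambda/2$ on $\hat\theta_j$; a loss carrying a factor of $\tfrac12$ gives threshold exactly $\lambda$. A sketch, building an orthonormal design via QR (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X, _ = np.linalg.qr(rng.normal(size=(50, 5)))   # X now has orthonormal columns
y = rng.normal(size=50)
theta_ls = X.T @ y            # least-squares solution, since X^T X = I
lam = 0.4

# Ridge: (X^T X + lam*I)^{-1} X^T y collapses to theta_ls / (1 + lam).
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
assert np.allclose(theta_ridge, theta_ls / (1 + lam))

# Lasso via the coordinate update (5.11)-(5.12): with orthonormal columns the
# coordinates decouple (a_k = 2, c_k = 2 * theta_ls[k]) and one pass suffices.
a_k = 2.0
c_k = 2.0 * theta_ls
theta_lasso = np.sign(c_k) * np.maximum(0.0, np.abs(c_k) - lam) / a_k
# Under the unscaled squared error of (5.13), the threshold lands at lam/2.
assert np.allclose(theta_lasso,
                   np.sign(theta_ls) * np.maximum(0.0, np.abs(theta_ls) - lam / 2))
```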
Shrinkage relative to linear regression

[Figure: three panels, Lasso, Ridge, and Best Subset, each plotting the regularized coefficient against the least-squares coefficient through the origin $(0,0)$: ridge shrinks every coefficient proportionally, lasso soft-thresholds at $\pm\lambda$, and best subset hard-thresholds at $|\hat\theta_{\sigma_k}|$.]
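The three panels correspond to three elementwise maps of the least-squares coefficients; a small sketch, with arbitrary λ and sample values:

```python
import numpy as np

lam = 1.0
theta_hat = np.array([-2.5, -0.6, 0.0, 0.4, 1.8, 3.0])   # least-squares coefficients

ridge = theta_hat / (1 + lam)                             # proportional shrinkage
lasso = np.sign(theta_hat) * np.maximum(0.0, np.abs(theta_hat) - lam)  # soft threshold
keep = np.abs(theta_hat) >= np.sort(np.abs(theta_hat))[-3]  # best subset, k = 3
subset = np.where(keep, theta_hat, 0.0)                   # hard threshold
```

Ridge never produces exact zeros, lasso zeroes out small coefficients and shifts the rest toward zero, and best subset keeps the $k$ largest-magnitude coefficients untouched.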
Other regularizers

\[
\hat\theta \in \operatorname{argmin}_\theta \sum_{i=1}^n (y^{(i)} - \theta^\top x^{(i)})^2 + \lambda R(\theta) \tag{5.13}
\]

We've seen ridge $R(\theta) = \|\theta\|_2^2$ and lasso $R(\theta) = \|\theta\|_1$.

p-"norm" regularizers: $R(\theta) = \|\theta\|_p^p$ for $0 < p < 1$ (not a norm).

Mixed norms, elastic net: $R(\theta) = \alpha\|\theta\|_1 + (1-\alpha)\|\theta\|_2$ for $0 \le \alpha \le 1$.

Grouped lasso:
\[
\hat\theta \in \operatorname{argmin}_\theta \sum_{i=1}^n (y^{(i)} - \theta^\top x^{(i)})^2 + \lambda \sum_{j=1}^{g} \|\theta_{A_j}\|_2 \tag{5.14}
\]
where $A_1, A_2, \ldots, A_g$ is a set of groups of feature indices, and the regularizer is a sum of (not-squared) 2-norms over the groups.

Total variation: $R(\theta) = \sum_{j=1}^{m-1} |\theta_j - \theta_{j+1}|$.

Structured regularizers: contours can be other convex polyhedra (besides just the octahedron in 3D).
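Each of these regularizers is a one-liner to evaluate; a sketch with an arbitrary θ and arbitrary index groups:

```python
import numpy as np

theta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])

ridge = np.sum(theta ** 2)                   # R = ||theta||_2^2
lasso = np.sum(np.abs(theta))                # R = ||theta||_1
p = 0.5
p_norm = np.sum(np.abs(theta) ** p)          # R = ||theta||_p^p, 0 < p < 1
alpha = 0.3
elastic = alpha * lasso + (1 - alpha) * np.sqrt(ridge)   # alpha-mix of 1- and 2-norms
groups = [np.array([0, 1]), np.array([2, 3, 4])]         # feature-index groups A_j
grouped = sum(np.linalg.norm(theta[A]) for A in groups)  # sum of (not-squared) 2-norms
tv = np.sum(np.abs(np.diff(theta)))          # total variation
```

The grouped-lasso term encourages entire groups to be zeroed together (the 2-norm is not squared, so it is nonsmooth at the group-level zero), while total variation encourages neighboring coefficients to be equal.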
As the dimensions grow

A $d$-th order linear model, with $x \in \mathbb{R}^m$:
\[
f_\theta(x) = \theta_0 + \sum_{i=1}^m \theta_i x_i + \sum_{i=1}^m \sum_{j=1}^m \theta_{i,j} x_i x_j + \sum_{i=1}^m \sum_{j=1}^m \sum_{k=1}^m \theta_{i,j,k} x_i x_j x_k + \cdots + \sum_{i_1=1}^m \sum_{i_2=1}^m \cdots \sum_{i_d=1}^m \theta_{i_1,i_2,\ldots,i_d} \prod_{\ell=1}^d x_{i_\ell} \tag{5.15}
\]
The number of parameters to learn grows as $O(m^d)$, exponentially in the order $d$.
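The $O(m^d)$ parameter count can be verified by enumerating the index tuples of (5.15); `n_params` below is an illustrative helper:

```python
from itertools import product

def n_params(m, d):
    """Parameter count of a d-th order model on R^m: one theta per index
    tuple (i_1, ..., i_t) at every order t <= d, plus the constant theta_0."""
    return 1 + sum(len(list(product(range(m), repeat=t))) for t in range(1, d + 1))

# m^t tuples at order t, so the count is 1 + m + m^2 + ... + m^d = O(m^d).
assert n_params(3, 2) == 1 + 3 + 9
assert n_params(10, 3) == 1 + 10 + 100 + 1000
```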
Geometric Tessellation: Inner and Surface Volume

Tessellation regions grow exponentially with dimension.

[Figure: grids of cells with $r = 5$ in $m = 1$ dimension, $r = 4$ in $m = 2$, and $r = 3$ in $m = 3$.]

With $r$ distinct values (edge length $r$), we have $r^m$ combinations in $m$ dimensions.

Surface volume is $2m r^{m-1}$ while inner volume is $r^m$. The ratio of surface volume to inner volume is $2m r^{m-1}/r^m = 2m/r$.

A lot more inner substance, relative to surface, as $r$ grows: $2m/r \to 0$ as $r \to \infty$.

A lot more surface, relative to inner volume, as $m$ grows: $2m/r \to \infty$ as $m \to \infty$.
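The ratio arithmetic in a few lines (the chosen values of $r$ and $m$ are arbitrary):

```python
def surface_to_volume(r, m):
    """Ratio of surface volume 2*m*r**(m-1) to inner volume r**m (= 2*m/r)."""
    return 2 * m * r ** (m - 1) / r ** m

assert surface_to_volume(100, 3) == 0.06          # 2m/r with r = 100, m = 3
assert surface_to_volume(10, 500) > surface_to_volume(10, 5)   # grows with m
```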
Data in high dimensions

(From Hastie et al., 2009.)

Suppose the data $D = \{x^{(i)}\}_{i=1}^n$ is distributed uniformly at random within the $m$-dimensional unit box, so $x^{(i)} \in [0, 1]^m$ (thus, $0 \le x_j^{(i)} \le 1$).

If we take an arbitrary data point and a fraction $0 < r < 1$ of the closest data points, this carves out an $m$-dimensional volume with expected edge length $r^{1/m}$, so $e_m(r) = r^{1/m}$.

One percent of the data in ten dimensions has expected edge length $e_{10}(0.01) = 0.63$. Ten percent of the data: $e_{10}(0.1) = 0.8$. 0.01 percent of the data in 100 dimensions: $e_{100}(0.0001) = 0.91$.

Hence, since the maximum edge length is 1, a small fraction $r$ of the data takes up most of the edge length range.

In fact, $\lim_{m\to\infty} e_m(\epsilon) = 1$ for any $\epsilon > 0$.
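The quoted edge lengths follow directly from $e_m(r) = r^{1/m}$ and can be reproduced in a few lines:

```python
def expected_edge_length(r, m):
    """e_m(r) = r**(1/m): expected edge of the sub-cube capturing a
    fraction r of uniformly distributed data in [0, 1]^m."""
    return r ** (1.0 / m)

assert abs(expected_edge_length(0.01, 10) - 0.63) < 0.005     # 1% in 10-D
assert abs(expected_edge_length(0.1, 10) - 0.80) < 0.01       # 10% in 10-D
assert abs(expected_edge_length(0.0001, 100) - 0.91) < 0.005  # 0.01% in 100-D
# e_m(eps) -> 1 as m -> infinity, for any eps > 0:
assert expected_edge_length(1e-6, 10_000) > 0.99
```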
All peel and no core . . .

A typical McIntosh apple (meaning the fruit, not the phone) is 2.75 inches in diameter (radius $r = 3.5$ cm) with a thin 0.3 mm ($\epsilon = 0.03$ cm) peel. Sphere volume is $\frac{4}{3}\pi r^3$, so the apple is about 180 cm$^3$, while the peel volume is $\frac{4}{3}\pi r^3 - \frac{4}{3}\pi (r - \epsilon)^3 \approx 4.5$ cm$^3$, a relative volume of about $4.5/180 = 2.5\%$.

The volume of a sphere of radius $r$ in $m$ dimensions is $V_m(r) = V_m r^m$ for an appropriate $m$-dependent constant $V_m$ (derived on HW3).

The fraction of a radius-1 apple that is peel (with peel thickness $\epsilon$) is

    \frac{V_m(1) - V_m(1 - \epsilon)}{V_m(1)} = 1 - (1 - \epsilon)^m    (5.16)

As the dimensionality $m$ gets larger, the apple becomes all peel and no core.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F39/65
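Equation (5.16) is easy to evaluate numerically. A short sketch, using the apple's relative peel thickness $\epsilon \approx 0.03/3.5$ as an assumed illustration:

```python
def peel_fraction(eps: float, m: int) -> float:
    """Fraction of a radius-1 m-dimensional ball lying within eps of
    the surface: 1 - (1 - eps)**m, from equation (5.16)."""
    return 1.0 - (1.0 - eps) ** m

eps = 0.03 / 3.5  # the apple's relative peel thickness, about 0.0086
for m in (3, 100, 1000):
    print(m, round(peel_fraction(eps, m), 3))
# m = 3 recovers the real apple's ~2.5% peel; by m = 1000 the same
# relative thickness makes the "apple" essentially all peel.
```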
Uniformly at random unit vectors on unit sphere

Let $S_{m-1}(r)$ indicate the surface "area" of the $m$-dimensional sphere of radius $r$.

Define the uniform distribution $U_{S_{m-1}(1)}$ on $S_{m-1}(1)$ and independently draw two random vectors $x, y \sim U_{S_{m-1}(1)}$, hence $\|x\|_2 = \|y\|_2 = 1$.

Two vectors are orthogonal if $\langle x, y \rangle = 0$, and are nearly so if $|\langle x, y \rangle| < \epsilon$ for small $\epsilon$.

It can be shown that if $x, y$ are independent random vectors uniformly distributed on the $m$-dimensional sphere, then:

    \Pr(|\langle x, y \rangle| < \epsilon) > 1 - e^{-m\epsilon^2/2}    (5.17)

which means that uniformly-at-random vectors in high-dimensional space are almost always nearly orthogonal!

This is one of the reasons why high-dimensional random projections preserve information: if two random high-dimensional vectors are almost orthogonal, then projections onto them will also be almost orthogonal.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F40/65
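The bound (5.17) can be checked by Monte Carlo. A sketch using the standard trick that normalizing an i.i.d. Gaussian vector gives a uniform point on the sphere; the parameter choices here are illustrative:

```python
import math
import random

def random_unit_vector(m, rng):
    """Uniform on the (m-1)-sphere: normalize an i.i.d. Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(m)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

rng = random.Random(0)
m, eps, trials = 1000, 0.1, 200
hits = 0
for _ in range(trials):
    x = random_unit_vector(m, rng)
    y = random_unit_vector(m, rng)
    if abs(sum(a * b for a, b in zip(x, y))) < eps:
        hits += 1

bound = 1.0 - math.exp(-m * eps * eps / 2.0)  # right-hand side of (5.17)
print(hits / trials, ">=", round(bound, 4))   # empirical rate vs. bound
```

With $m = 1000$ and $\epsilon = 0.1$ the bound is already $1 - e^{-5} \approx 0.993$, and nearly every sampled pair is within $\epsilon$ of orthogonal.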
Inner/Outer Spherical Bounds on Cubic Boxes

Consider a cube in $m$ dimensions, $[0,1]^m$, having unit volume. The largest sphere contained by this cube has diameter 1 (the inner circle in the figure). The smallest sphere containing this cube has diameter $\sqrt{m}$ (the outer circle).

The fraction of inner spherical volume (sphere of radius $1/2$, i.e., $V_m(r)$ with $r = 1/2$) to unit cube volume (which is 1 for any $m$) is

    \frac{\text{Inner spherical vol.}}{\text{unit cube vol.}} = \frac{V_m(r)}{1} = \frac{\pi^{m/2}}{\Gamma(m/2 + 1)} r^m    (5.18)

We have $\lim_{m \to \infty} V_m(1/2)/1 = 0$. Hence, the number of inner-sphere volumes that fit inside the cube goes to infinity: more and more diameter-1 spheres' worth of volume can be packed into a unit cube as the dimensionality gets higher.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F41/65
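The ratio in (5.18) can be tabulated directly from the closed-form ball volume; a small sketch:

```python
import math

def ball_volume(m: int, r: float) -> float:
    """Volume of an m-dimensional ball of radius r:
    pi**(m/2) / Gamma(m/2 + 1) * r**m, as in equation (5.18)."""
    return math.pi ** (m / 2.0) / math.gamma(m / 2.0 + 1.0) * r ** m

# Inner sphere (radius 1/2) volume as a fraction of the unit cube.
for m in (2, 3, 10, 20, 100):
    print(m, ball_volume(m, 0.5))
# m = 2 gives pi/4 ~ 0.785; by m = 20 the fraction is ~2.5e-8, and it
# keeps shrinking toward 0 as m grows.
```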
Using 2D to represent High-D as if 2D was High-D

The relationship between the unit-radius ($r = 1$) sphere and the unit-volume (side length 1) cube as the dimension grows, drawn as if it were also true in 2D.

Note that for the cube, as $m$ gets higher, the distance from center to face stays at $1/2$, but the distance from center to vertex grows as $\sqrt{m}/2$.

For $m = 4$ the vertex distance is $\sqrt{m}/2 = 1$.

[Figure: the sphere and the cube in 2, 4, and $m$ dimensions; half-side $1/2$ and vertex distances $\sqrt{2}/2$, $1$, and $\sqrt{m}/2$; labels "Unit radius sphere", "Nearly all the volume", "Vertex of hypercube".]

Illustration of the relationship between the sphere and the cube in 2, 4, and $m$ dimensions (from Blum, Hopcroft, & Kannan, 2016).

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F42/65
The Blessing of Dimensionality

More dimensions can help to distinguish categories and make pattern recognition easier (and sometimes even possible). Recall the Voronoi tessellation.

Support vector machines (SVMs) can find and exploit data patterns extant only in extremely high (or even infinite) dimensional space!

For still more blessings, see "High-Dimensional Data Analysis: The Curses and Blessings of Dimensionality", David L. Donoho.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F43/65
Feature Selection vs. Dimensionality Reduction

We've already seen that feature selection is one strategy to reduce feature dimensionality.

In feature selection, each feature is either selected or not: all or nothing.

Other dimensionality reduction strategies take the input $x \in \mathbb{R}^m$ and encode each $x$ into $e(x) \in \mathbb{R}^{m'}$, a lower-dimensional space with $m' < m$.

Core advantage: if $U = \{1, 2, \ldots, m\}$, there may be no subset $A \subseteq U$ such that $x_A$ works well, yet there might be a combination $e(x)$ that does. In the figure on the right, the simple linear combination $a_1 x_1 + a_2 x_2$ works quite well.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F44/65
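A tiny constructed example of that core advantage (the data points are invented for illustration): neither raw feature separates the two classes by any threshold, but a simple linear combination does.

```python
# Hypothetical 2-D data: the label is the sign of x1 - x2, arranged so
# that no threshold on x1 alone, or on x2 alone, separates the classes.
points = [((0, 1), -1), ((1, 0), +1), ((2, 3), -1),
          ((3, 2), +1), ((5, 6), -1), ((6, 5), +1)]

def threshold_separable(pos_vals, neg_vals):
    """True iff some scalar threshold puts all positives on one side."""
    return max(neg_vals) < min(pos_vals) or max(pos_vals) < min(neg_vals)

for j, name in ((0, "x1"), (1, "x2")):
    pos = [x[j] for x, y in points if y > 0]
    neg = [x[j] for x, y in points if y < 0]
    print(name, "alone separable:", threshold_separable(pos, neg))  # False

# The encoding e(x) = a1*x1 + a2*x2 with (a1, a2) = (1, -1) works:
pos = [x[0] - x[1] for x, y in points if y > 0]
neg = [x[0] - x[1] for x, y in points if y < 0]
print("x1 - x2 separable:", threshold_separable(pos, neg))  # True
```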
Principal Component Analysis (PCA)

PCA is a linear encoding $e(x) = x^\top W$ for an $m \times m$ matrix $W$.

Let $X$ be the $n \times m$ design matrix; then $X' = XW$ is the encoded $n \times m$ design matrix.

Any linear-transformation encoder can be so written. PCA has orthonormal columns of $W$ (so $\langle W(:,i), W(:,j) \rangle = 1\{i = j\}$), ordered decreasing by ability to explain variance in the original space.

Thus, taking the first $m' \le m$ columns of $W$, giving the $m \times m'$ matrix $W_{m'}$, yields $XW_{m'}$, a dimensionality reduction.

When $m' < m$, the projection is onto the subspace that maximizes the variance over all possible subspaces of dimension $m'$. When $m' = m$, we order the columns decreasing by variance explained (so taking any length-$m'$ prefix of columns gives the answer).

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F45/65
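The encoding mechanics can be sketched in a few lines. Here $W$ is a hand-picked orthonormal matrix (a 2-D rotation) standing in for the eigenvector matrix PCA would actually compute, purely to show the $X' = XW$ and column-truncation steps:

```python
import math

# Illustrative only: an assumed orthonormal W (a 2-D rotation by 45 deg).
theta = math.pi / 4
W = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# Columns of W are orthonormal: <W(:,i), W(:,j)> = 1{i == j}.
for i in range(2):
    for j in range(2):
        dot = sum(W[r][i] * W[r][j] for r in range(2))
        assert abs(dot - (1.0 if i == j else 0.0)) < 1e-12

X = [[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0]]  # n = 3, m = 2 design matrix
X_enc = matmul(X, W)                         # X' = XW, still n x m
X_red = [[row[0]] for row in X_enc]          # keep first m' = 1 column
print(X_red)  # these points lie along W(:,1), so one coordinate suffices
```

Because every point here lies on the line $x_1 = x_2$, the second encoded coordinate is identically zero, so truncating to $m' = 1$ loses nothing.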
Maximum Variance: 1D projection

Consider $m' = 1$ to start; the vector $w^{(1)} \in \mathbb{R}^m$ is the projection direction (i.e., $\langle x, w^{(1)} \rangle$ projects to 1D).

Assume unit vectors: $\|w^{(1)}\|_2 = \sqrt{\langle w^{(1)}, w^{(1)} \rangle} = 1$.

The mean of the data is $\bar{x} = \frac{1}{n}\sum_{i=1}^n x^{(i)}$, so the mean of the projected data is $\langle w^{(1)}, \bar{x} \rangle$.

Variance of the projected data:

    \frac{1}{n}\sum_{i=1}^n \left( \langle w^{(1)}, x^{(i)} \rangle - \langle w^{(1)}, \bar{x} \rangle \right)^2 = w^{(1)\top} S w^{(1)}    (5.19)

where

    S = \frac{1}{n}\sum_{i=1}^n (x^{(i)} - \bar{x})(x^{(i)} - \bar{x})^\top    (5.20)

($S$ is a sum of outer products, hence symmetric and positive semi-definite.)

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F46/65
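The identity (5.19) can be verified numerically on made-up data: the variance of the scalar projections $\langle w, x^{(i)} \rangle$ equals $w^\top S w$ for the scatter matrix $S$ of (5.20).

```python
# Invented 2-D data, purely to check (5.19)-(5.20) numerically.
X = [[2.0, 0.0], [0.0, 1.0], [-2.0, 0.0], [0.0, -1.0]]
n, m = len(X), len(X[0])

mean = [sum(x[j] for x in X) / n for j in range(m)]

# S = (1/n) sum_i (x^(i) - mean)(x^(i) - mean)^T, equation (5.20)
S = [[sum((x[a] - mean[a]) * (x[b] - mean[b]) for x in X) / n
      for b in range(m)] for a in range(m)]

w = [0.6, 0.8]  # an arbitrary unit vector
proj = [sum(w[j] * x[j] for j in range(m)) for x in X]
proj_mean = sum(proj) / n
var_proj = sum((p - proj_mean) ** 2 for p in proj) / n

wSw = sum(w[a] * S[a][b] * w[b] for a in range(m) for b in range(m))
print(var_proj, "==", wSw)  # both sides of (5.19): 1.04 == 1.04
```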
Maximum Variance: 1D projection

Goal: maximize the projected variance $w^{(1)\top} S w^{(1)} = \langle w^{(1)}, S w^{(1)} \rangle$ subject to $\|w^{(1)}\|_2 = 1$.

Constrained optimization via Lagrange multipliers yields the stationary point $S w^{(1)} = \lambda_1 w^{(1)}$, so $w^{(1)}$ is an eigenvector of $S$ with eigenvalue $\lambda_1$.

Pre-multiplying by $w^{(1)\top}$ and using $\|w^{(1)}\|_2 = 1$ yields

    \langle w^{(1)}, S w^{(1)} \rangle = \lambda_1    (5.21)

The largest eigenvalue of $S$ is the greatest variance, and the projection direction achieving it is the corresponding eigenvector $w^{(1)}$.

This is the first principal component.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F47/65
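The slides obtain $w^{(1)}$ from the eigenproblem; one standard way to compute the leading eigenvector in practice is power iteration (a sketch, not necessarily how any particular library does it). The scatter matrix here is an assumed toy example with known eigenvalues 2 and 0.5:

```python
import math
import random

S = [[2.0, 0.0], [0.0, 0.5]]  # toy scatter matrix, eigenvalues 2 and 0.5

def normalize(w):
    nrm = math.sqrt(sum(c * c for c in w))
    return [c / nrm for c in w]

# Power iteration: repeating w <- S w / ||S w|| converges to the
# eigenvector with the largest eigenvalue, i.e., the first PC direction.
w = normalize([1.0, 1.0])
for _ in range(100):
    w = normalize([sum(S[a][b] * w[b] for b in range(2)) for a in range(2)])

lam1 = sum(w[a] * S[a][b] * w[b] for a in range(2) for b in range(2))
print(w, lam1)  # direction ~[1, 0] and projected variance ~2.0

# Sanity check of (5.21): no random unit vector beats the eigenvector.
rng = random.Random(0)
for _ in range(1000):
    v = normalize([rng.gauss(0, 1), rng.gauss(0, 1)])
    vSv = sum(v[a] * S[a][b] * v[b] for a in range(2) for b in range(2))
    assert vSv <= lam1 + 1e-9
```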
Subsequent components

To produce the $k$th component given the first $k - 1$ components, first form

    X_k = X - \sum_{i=1}^{k-1} X w^{(i)} w^{(i)\top}    (5.22)

and then proceed by maximizing the projected variance $\langle w^{(k)}, S_k w^{(k)} \rangle$ subject to $\|w^{(k)}\|_2 = 1$, where $S_k$ is computed from $X_k$.

Overall, the collection $\{w^{(i)}, \lambda_i\}$ are the $m$ eigenvector/eigenvalue pairs of the matrix $S$.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F48/65
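The deflation step (5.22) can be sketched end to end on toy data (invented for illustration), reusing power iteration for each leading direction; the second component found on the deflated data comes out orthogonal to the first:

```python
import math

def normalize(w):
    n = math.sqrt(sum(c * c for c in w))
    return [c / n for c in w]

def scatter(X):
    n, m = len(X), len(X[0])
    mu = [sum(r[j] for r in X) / n for j in range(m)]
    return [[sum((r[a] - mu[a]) * (r[b] - mu[b]) for r in X) / n
             for b in range(m)] for a in range(m)]

def top_direction(S, iters=200):
    # power iteration for the leading eigenvector of S
    w = normalize([1.0] * len(S))
    for _ in range(iters):
        w = normalize([sum(S[a][b] * w[b] for b in range(len(S)))
                       for a in range(len(S))])
    return w

X = [[3.0, 1.0], [-3.0, -1.0], [1.0, -1.0], [-1.0, 1.0]]  # toy, zero mean
w1 = top_direction(scatter(X))

# Deflation (5.22): remove each point's component along w1, then the
# top direction of the deflated data X_2 is the second component.
X2 = [[r[a] - sum(r[b] * w1[b] for b in range(2)) * w1[a] for a in range(2)]
      for r in X]
w2 = top_direction(scatter(X2))

print("inner product:", sum(a * b for a, b in zip(w1, w2)))  # ~0
```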
PCA: Minimizing reconstruction error, $m' < m$

Let $u_i$, $i \in [m]$, be a set of orthonormal vectors in $\mathbb{R}^m$.

Basis decomposition: any sample $x^{(i)}$ can be written as $x^{(i)} = \sum_{j=1}^m \alpha_{i,j} u_j = \sum_{j=1}^m \langle x^{(i)}, u_j \rangle u_j$.

Suppose we decide to use only $m' < m$ dimensions, so $\tilde{x}^{(i)} = \sum_{j=1}^{m'} \langle x^{(i)}, u_j \rangle u_j$.

Reconstruction error:

    J_{m'} = \frac{1}{n}\sum_{i=1}^n \|x^{(i)} - \tilde{x}^{(i)}\|_2^2    (5.23)

To minimize this, it can be shown that we should choose $u_i$ to be the eigenvector of $S$ corresponding to the $i$th largest eigenvalue.

Let $W$ be the matrix of column eigenvectors of $S$ sorted decreasing by eigenvalue, and $W_{m'} = W(:, 1{:}m')$ the $m \times m'$ matrix of the first $m'$ columns. Then $XW_{m'}$ projects down to the first $m'$ principal components; this is also widely known as the Karhunen-Loève transform (KLT).

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F49/65
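On toy data the claim is easy to verify: keeping only the top eigenvector, the 1-D reconstruction error $J_1$ of (5.23) equals the variance of the discarded direction (the smaller eigenvalue). The data below is invented so that $S$ and its eigenvectors are known by inspection:

```python
# Zero-mean toy data with scatter S = diag(2, 0.5), so the top
# eigenvector is u1 = (1, 0) and the discarded eigenvalue is 0.5.
X = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
n = len(X)
u1 = (1.0, 0.0)

J1 = 0.0
for x in X:
    c = x[0] * u1[0] + x[1] * u1[1]     # coefficient <x, u1>
    recon = (c * u1[0], c * u1[1])      # x~ reconstructed from 1 component
    J1 += (x[0] - recon[0]) ** 2 + (x[1] - recon[1]) ** 2
J1 /= n
print(J1)  # -> 0.5, exactly the eigenvalue of the discarded direction
```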
Lasso Regularizers Curse of Dimensionality Dimensionality Reduction
Example: 2D data and its two principal directions
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29nd, 2020 F50/65 (pg.129/170)
Lasso Regularizers Curse of Dimensionality Dimensionality Reduction
PCA and discriminability
PCA projects down to the directions of greatest variance. Are these also the best directions for classification?
For classification, we would want the samples in different classes to be as separate as possible, i.e., to make it as easy as possible to discern the difference between (i.e., discriminate) the different classes.
Any projection that blends samples from different classes on top of each other would be undesirable.
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29nd, 2020 F51/65 (pg.130/170)
Lasso Regularizers Curse of Dimensionality Dimensionality Reduction
Example: PCA on 2-class data
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29nd, 2020 F52/65 (pg.133/170)
Lasso Regularizers Curse of Dimensionality Dimensionality Reduction
Example: Data normalization
From Bishop 2006
Left: original data. Center: mean subtraction and variance normalization in each dimension. Right: decorrelated data (i.e., diagonal covariance).
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29nd, 2020 F53/65 (pg.134/170)
Lasso Regularizers Curse of Dimensionality Dimensionality Reduction
PCA vs. best axis for class separability
From Bishop 2006
Projecting onto the direction of greatest variability leads to poor class separation, while another 1D projection exists that retains good separability.
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29nd, 2020 F54/65 (pg.135/170)
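A small NumPy experiment (ours, not from the slides) makes the point concrete: two classes share a covariance elongated along the x-axis while their means differ along the y-axis, so the top principal component nearly ignores the discriminative direction.

```python
import numpy as np

# Shared covariance elongated along x; class means differ along y.
# The top principal component (greatest variance) is nearly the
# x-axis, which mixes the classes, even though projecting onto the
# y-axis (the mean-difference direction) separates them well.
rng = np.random.default_rng(1)
C = np.array([[10.0, 0.0], [0.0, 0.1]])            # shared covariance
mu0, mu1 = np.array([0.0, -1.0]), np.array([0.0, 1.0])
X = np.vstack([rng.multivariate_normal(mu0, C, 500),
               rng.multivariate_normal(mu1, C, 500)])

Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / len(Xc)
vals, vecs = np.linalg.eigh(S)
top_pc = vecs[:, np.argmax(vals)]                  # direction of max variance

# |cosine| between the top PC and the discriminative (y) direction
overlap = abs(top_pc @ np.array([0.0, 1.0]))
```

A small `overlap` confirms that the direction of greatest variance is nearly orthogonal to the direction that separates the classes.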
Lasso Regularizers Curse of Dimensionality Dimensionality Reduction
PCA and lack of class separability
From Bishop 2006
Left: 2 dimensions of the original data. Right: PCA projection onto the 2 dimensions of greatest variability, which is bad for two reasons: (i) it spreads individual classes (e.g., red); and (ii) it blends distinct classes (blue and green); both make classification no easier.
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29nd, 2020 F55/65 (pg.136/170)
Lasso Regularizers Curse of Dimensionality Dimensionality Reduction
PCA vs. LDA
From Bishop 2006
PCA vs. Linear Discriminant Analysis (LDA).
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29nd, 2020 F56/65 (pg.137/170)
Lasso Regularizers Curse of Dimensionality Dimensionality Reduction
Axis parallel projection vs. LDA
From Hastie et al., 2009
Axis parallel (left) vs. LDA (right).
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29nd, 2020 F57/65 (pg.138/170)
Lasso Regularizers Curse of Dimensionality Dimensionality Reduction
Linear Discriminant Analysis (LDA), 2 classes
Consider class-conditional Gaussian data, so $p(x|y) = \mathcal{N}(x|\mu_y, C_y)$ for mean vectors $\{\mu_y\}_y$ and covariance matrices $\{C_y\}_y$, $x \in \mathbb{R}^m$:
$p(x|y) = \frac{1}{|2\pi C_y|^{1/2}} \exp\big(-\tfrac12 (x - \mu_y)^\top C_y^{-1} (x - \mu_y)\big)$  (5.24)
In the two-class case $y \in \{0, 1\}$ with equal covariances $C_0 = C_1 = C$ (the homoscedasticity property) and equal priors $p(y=0) = p(y=1)$, consider the log posterior odds ratio:
$\log \frac{p(y=1|x)}{p(y=0|x)} = -\tfrac12 (x - \mu_1)^\top C_1^{-1} (x - \mu_1) + \tfrac12 (x - \mu_0)^\top C_0^{-1} (x - \mu_0)$  (5.25)
$\quad = (C^{-1}\mu_1 - C^{-1}\mu_0)^\top x + \tfrac12 \mu_0^\top C^{-1} \mu_0 - \tfrac12 \mu_1^\top C^{-1} \mu_1$  (5.26)
$\quad = \theta^\top x + c$  (5.27)
$\theta$ is a projection (an $m \times 1$ matrix, a linear transformation) down to the one dimension that is sufficient for prediction without loss.
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29nd, 2020 F58/65 (pg.139/170)
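Equations (5.25)-(5.27) can be checked numerically; a minimal sketch (function names are ours), assuming a shared covariance $C$ and equal priors:

```python
import numpy as np

def lda_two_class(mu0, mu1, C):
    """Return (theta, c) so that the log posterior odds
    log p(y=1|x)/p(y=0|x) equals theta @ x + c, for shared
    covariance C and equal priors."""
    Cinv = np.linalg.inv(C)
    theta = Cinv @ (mu1 - mu0)
    c = 0.5 * (mu0 @ Cinv @ mu0 - mu1 @ Cinv @ mu1)
    return theta, c

def log_odds_direct(x, mu0, mu1, C):
    """Log odds computed straight from the two Gaussian exponents
    (the normalizers cancel when C0 = C1 = C)."""
    Cinv = np.linalg.inv(C)
    q = lambda mu: -0.5 * (x - mu) @ Cinv @ (x - mu)
    return q(mu1) - q(mu0)

mu0 = np.array([0.0, 0.0])
mu1 = np.array([2.0, 1.0])
C = np.array([[2.0, 0.3], [0.3, 1.0]])
theta, c = lda_two_class(mu0, mu1, C)
x = np.array([1.0, -0.5])
```

The linear form $\theta^\top x + c$ agrees with the direct Gaussian computation at any test point.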
Lasso Regularizers Curse of Dimensionality Dimensionality Reduction
LDA
Thus, the decision boundary between the two classes is linear.
This holds for the decision boundary between any two same-covariance classes of an $\ell$-class soft-max regression classifier, starting from Gaussians for $p(x|y)$.
Given discriminant functions of the form
$h_y(x) = -\tfrac12 \log|C_y| - \tfrac12 (x - \mu_y)^\top C_y^{-1} (x - \mu_y) + \log p(y)$  (5.28)
all decision boundaries $\{x : h_i(x) = h_j(x)\}$ between two classes are linear (consider a Voronoi tessellation) if $C_y = C$ for all $y$, and are otherwise quadratic.
With 2 classes, we need only 1 dimension; with 3 classes, only 2 dimensions.
In general, with $\ell$ classes we need only an $(\ell-1)$-dimensional subspace onto which to project while retaining the ability to discriminate (moving orthogonal to that subspace leaves the posterior constant).
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29nd, 2020 F59/65 (pg.142/170)
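The discriminant functions of Eq. (5.28) are easy to evaluate directly; a sketch (ours), classifying by the largest score:

```python
import numpy as np

def gaussian_discriminants(x, mus, Cs, priors):
    """Evaluate h_y(x) = -1/2 log|C_y| - 1/2 (x-mu_y)^T C_y^{-1} (x-mu_y)
    + log p(y) for each class y and return the vector of scores."""
    scores = []
    for mu, C, p in zip(mus, Cs, priors):
        Cinv = np.linalg.inv(C)
        _, logdet = np.linalg.slogdet(C)          # stable log-determinant
        scores.append(-0.5 * logdet
                      - 0.5 * (x - mu) @ Cinv @ (x - mu)
                      + np.log(p))
    return np.array(scores)

# With a shared covariance, boundaries {x : h_i(x) = h_j(x)} are linear;
# a point equidistant from both means lies exactly on the boundary.
mus = [np.zeros(2), np.array([3.0, 0.0])]
C = np.eye(2)
x_mid = np.array([1.5, 7.0])   # equidistant from both means
scores = gaussian_discriminants(x_mid, mus, [C, C], [0.5, 0.5])
```

With class-specific covariances, the same function yields quadratic boundaries.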
Lasso Regularizers Curse of Dimensionality Dimensionality Reduction
Linear Discriminant Analysis (LDA), $\ell$ classes
Within-class scatter matrix:
$S_W = \sum_{c=1}^{\ell} S_c$, where $S_c = \sum_{i \in \text{class } c} (x^{(i)} - \mu_c)(x^{(i)} - \mu_c)^\top$  (5.29)
and $\mu_c$ is the class-$c$ mean, $\mu_c = \frac{1}{n_c} \sum_{i \in \text{class } c} x^{(i)}$.
Between-class scatter matrix, with $n_c$ the number of samples of class $c$:
$S_B = \sum_{c=1}^{\ell} n_c (\mu_c - \mu)(\mu_c - \mu)^\top$  (5.30)
First projection: $a_1 \in \operatorname{argmax}_{a \in \mathbb{R}^m} \frac{a^\top S_B a}{a^\top S_W a}$. Second, orthogonal to the first: $a_2 \in \operatorname{argmax}_{a \in \mathbb{R}^m : \langle a, a_1 \rangle = 0} \frac{a^\top S_B a}{a^\top S_W a}$, and so on.
In general, the directions of greatest variance of the matrix $S_W^{-1} S_B$ maximize the between-class spread relative to the within-class spread; i.e., we can do PCA on $S_W^{-1} S_B$ to get the LDA reduction.
LDA dimensionality reduction: find an $\ell' \times m$ linear transform of $x$, projecting onto the subspace corresponding to the $\ell'$ greatest eigenvalues of $S_W^{-1} S_B$.
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29nd, 2020 F60/65 (pg.147/170)
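The scatter matrices and the eigendecomposition of $S_W^{-1} S_B$ above can be sketched directly (this sketch is ours; the function name `lda_directions` is hypothetical):

```python
import numpy as np

def lda_directions(X, y, n_dims):
    """LDA reduction: leading eigenvectors of S_W^{-1} S_B.

    X: (n, m) data, y: integer class labels. Returns the (m, n_dims)
    matrix of projection directions."""
    mu = X.mean(axis=0)                               # grand mean
    m = X.shape[1]
    S_W = np.zeros((m, m))
    S_B = np.zeros((m, m))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)            # within-class scatter
        d = (mu_c - mu)[:, None]
        S_B += len(Xc) * (d @ d.T)                    # between-class scatter
    vals, vecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(vals.real)[::-1]               # decreasing eigenvalue
    return vecs[:, order[:n_dims]].real

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0, 0], 0.5, size=(50, 3)),
               rng.normal([3, 0, 0], 0.5, size=(50, 3)),
               rng.normal([0, 3, 0], 0.5, size=(50, 3))])
y = np.repeat([0, 1, 2], 50)
A = lda_directions(X, y, n_dims=2)   # at most ell - 1 = 2 useful directions
```

Projecting with `X @ A` keeps the two directions in which the three class means spread out.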
Lasso Regularizers Curse of Dimensionality Dimensionality Reduction
LDA projection, dimensionality upper limits of the transform
$\ell' \le \ell - 1$ is the highest-dimensional subspace we can project onto, since $S_B$ has rank at most $\ell - 1$.
Between-class scatter matrix, with $n_c$ the number of samples of class $c$:
$S_B = \sum_{c=1}^{\ell} n_c (\mu_c - \mu)(\mu_c - \mu)^\top$  (5.31)
Each vector lies in $\mathbb{R}^m$ and we have $\ell$ rank-1 updates, so the rank is immediately at most $\ell$ (ignoring matrix conditioning and potential numerical issues). However, we can write
$S_B = \big[(\mu_1 - \mu)\ (\mu_2 - \mu)\ \cdots\ (\mu_\ell - \mu)\big] \operatorname{diag}(n_1, n_2, \ldots, n_\ell) \big[(\mu_1 - \mu)\ (\mu_2 - \mu)\ \cdots\ (\mu_\ell - \mu)\big]^\top$  (5.32)
Since $\operatorname{rank}(AB) \le \min(\operatorname{rank}(A), \operatorname{rank}(B))$, and since $\sum_{i=1}^{\ell} n_i (\mu_i - \mu) = 0$, which means $\operatorname{rank}([(\mu_1 - \mu)\ \cdots\ (\mu_\ell - \mu)]) \le \ell - 1$, we conclude $\operatorname{rank}(S_B) \le \ell - 1$.
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29nd, 2020 F61/65 (pg.154/170)
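The rank bound can be checked empirically (a sketch of ours, with hypothetical class means; the key step is using the *weighted* grand mean so that $\sum_c n_c(\mu_c - \mu) = 0$ holds exactly):

```python
import numpy as np

# Empirical check that rank(S_B) <= ell - 1: the weighted deviations
# n_c (mu_c - mu) sum to zero, so the ell rank-1 terms are dependent.
rng = np.random.default_rng(3)
ell, m = 4, 10
counts = np.array([30, 20, 25, 25])                  # n_c per class
mus = rng.normal(size=(ell, m))                      # hypothetical class means
mu = (counts[:, None] * mus).sum(axis=0) / counts.sum()  # grand mean

S_B = np.zeros((m, m))
for n_c, mu_c in zip(counts, mus):
    d = (mu_c - mu)[:, None]
    S_B += n_c * (d @ d.T)                           # rank-1 update per class
rank = np.linalg.matrix_rank(S_B)
```

With generic means, the rank is exactly $\ell - 1$, never $\ell$.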
Lasso Regularizers Curse of Dimensionality Dimensionality Reduction
Repeat slide
The next slide is a repeat from earlier.
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29nd, 2020 F62/65 (pg.161/170)
Lasso Regularizers Curse of Dimensionality Dimensionality Reduction
Uniformly at random unit vectors on unit sphere
Let $S_{m-1}(r)$ denote the surface "area" of the $m$-dimensional sphere of radius $r$.
Define the uniform distribution $U_{S_{m-1}(1)}$ on $S_{m-1}(1)$ and independently draw two random vectors $x, y \sim U_{S_{m-1}(1)}$, so that $\|x\|_2 = \|y\|_2 = 1$.
Two vectors are orthogonal if $\langle x, y \rangle = 0$ and are nearly so if $|\langle x, y \rangle| < \epsilon$ for small $\epsilon$.
It can be shown that if $x, y$ are independent random vectors uniformly distributed on the $m$-dimensional sphere, then
$\Pr(|\langle x, y \rangle| < \epsilon) > 1 - e^{-m\epsilon^2/2}$  (5.17)
which means that uniformly random vectors in high-dimensional space are almost always nearly orthogonal!
This is one of the reasons why high-dimensional random projections preserve information: if two random high-dimensional vectors are almost orthogonal, then projections onto them will also be almost orthogonal.
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29nd, 2020 F63/65 (pg.162/170)
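The bound (5.17) is easy to probe empirically (a sketch of ours; normalizing a standard Gaussian draw gives a uniform point on the sphere):

```python
import numpy as np

# Empirical check of Pr(|<x, y>| < eps) > 1 - exp(-m eps^2 / 2):
# draw many pairs of uniform unit vectors in R^m and count how often
# their inner product is within eps of zero.
rng = np.random.default_rng(4)
m, trials, eps = 1000, 2000, 0.1
x = rng.normal(size=(trials, m))
y = rng.normal(size=(trials, m))
x /= np.linalg.norm(x, axis=1, keepdims=True)   # uniform on S^{m-1}(1)
y /= np.linalg.norm(y, axis=1, keepdims=True)

inner = np.abs(np.sum(x * y, axis=1))           # |<x, y>| per pair
frac_nearly_orthogonal = np.mean(inner < eps)
bound = 1 - np.exp(-m * eps**2 / 2)             # the slide's lower bound
```

At $m = 1000$ and $\epsilon = 0.1$ the bound is already above 0.99, and the empirical fraction exceeds it.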
Lasso Regularizers Curse of Dimensionality Dimensionality Reduction
Non-linear Dimensionality Reduction: Autoencoders
Construct two mappings: an encoder $e_{\theta_e}: \mathbb{R}^m \to \mathbb{R}^{m'}$ and a decoder $d_{\theta_d}: \mathbb{R}^{m'} \to \mathbb{R}^m$.
Given $x \in \mathbb{R}^m$, $d_{\theta_d}(e_{\theta_e}(x))$ is $x$ reconstructed, and this has error $E_{p(x)}[\|x - d_{\theta_d}(e_{\theta_e}(x))\|]$.
Given a data set $D_n = \{x^{(i)}\}_i$, the optimization, with $\theta = (\theta_e, \theta_d)$, becomes:
$\min_\theta E_{p(x)}[\|x - d_{\theta_d}(e_{\theta_e}(x))\|] \approx \min_\theta \frac{1}{n} \sum_{i=1}^{n} \|x^{(i)} - d_{\theta_d}(e_{\theta_e}(x^{(i)}))\|$  (5.33)
Then $e_{\theta_e}(x)$ is our learnt representation of $x$. If $m' < m$ or $m' \ll m$, this is dimensionality reduction, similar to compression.
As $m'$ gets smaller, there is more pressure to ensure the dimensions are independent (to increase capacity).
This approach has become very successful, with many different flavors (denoising autoencoders, variational autoencoders) using (deep) neural networks as encoders/decoders; this is a form of representation learning.
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29nd, 2020 F64/65 (pg.163/170)
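A toy version of Eq. (5.33) can be sketched with *linear* maps trained by gradient descent on the squared reconstruction error (our sketch; real autoencoders use deep nonlinear encoders/decoders, and all sizes and learning rates here are arbitrary):

```python
import numpy as np

# Minimal linear autoencoder: encoder e(x) = E x, decoder d(z) = D z,
# trained by gradient descent on mean squared reconstruction error.
rng = np.random.default_rng(5)
n, m, m_prime = 500, 8, 2
Z_true = rng.normal(size=(n, m_prime))
X = Z_true @ rng.normal(size=(m_prime, m))   # intrinsically 2-D data in R^8

E = 0.1 * rng.normal(size=(m_prime, m))      # encoder weights
D = 0.1 * rng.normal(size=(m, m_prime))      # decoder weights
mse0 = np.mean(X ** 2)                       # error of all-zero reconstruction
lr = 1e-3
for _ in range(2000):
    Z = X @ E.T                              # encode: (n, m')
    X_hat = Z @ D.T                          # decode: (n, m)
    R = X_hat - X                            # residuals
    grad_D = (R.T @ Z) / n                   # d/dD of mean 1/2 ||R||^2
    grad_E = (D.T @ R.T @ X) / n             # d/dE of mean 1/2 ||R||^2
    D -= lr * grad_D
    E -= lr * grad_E
mse_final = np.mean((X - (X @ E.T) @ D.T) ** 2)
```

With linear maps and squared error, the optimum recovers the same subspace as PCA; nonlinear encoders/decoders generalize this.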
Lasso Regularizers Curse of Dimensionality Dimensionality Reduction
Non-linear Dimensionality Reduction: Autoencoders
Construct two mappings, an encoder e✓e : Rm
! Rm0and a decoder
d✓d : Rm0! Rm.
Given x 2 Rm, d✓d(e✓e(x)) is x reconstructed, and this has errorEp(x)[kx� d✓d(e✓e(x))k].
Data set Dn =�x(i) i, optimization, with ✓ = (✓e, ✓d) becomes:
min✓
Ep(x)[kx� d✓d(e✓e(x))k] ⇡ min✓
1
n
nX
i=1
[kx(i) � d✓d(e✓e(x(i)))k] (5.33)
Then e✓e(x) is our learnt representation of x. If m0< m or m0
⌧ m
this is dimensionality reduction, similar to compression.
As m0 gets smaller, more pressure to ensure the dimensions areindependent (increase capacity).
Has become very successful, many di↵erent flavors (denoisingautoencoder, variational autoencoder) using (deep) neural networks asencoders/decoders. representation learning.
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29nd, 2020 F64/65 (pg.164/170)
Lasso Regularizers Curse of Dimensionality Dimensionality Reduction
Non-linear Dimensionality Reduction: Autoencoders
Construct two mappings, an encoder e✓e : Rm
! Rm0and a decoder
d✓d : Rm0! Rm.
Given x 2 Rm, d✓d(e✓e(x)) is x reconstructed, and this has errorEp(x)[kx� d✓d(e✓e(x))k].
Data set Dn =�x(i) i, optimization, with ✓ = (✓e, ✓d) becomes:
min✓
Ep(x)[kx� d✓d(e✓e(x))k] ⇡ min✓
1
n
nX
i=1
[kx(i) � d✓d(e✓e(x(i)))k] (5.33)
Then e✓e(x) is our learnt representation of x. If m0< m or m0
⌧ m
this is dimensionality reduction, similar to compression.
As m0 gets smaller, more pressure to ensure the dimensions areindependent (increase capacity).
Has become very successful, many di↵erent flavors (denoisingautoencoder, variational autoencoder) using (deep) neural networks asencoders/decoders. representation learning.
Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29nd, 2020 F64/65 (pg.165/170)
Lasso Regularizers Curse of Dimensionality Dimensionality Reduction
DNN-based Auto-Encoder

Mapping from $x$ to $\hat{x} = d_{\theta_d}(e_{\theta_e}(x))$ where $m = 9$ and $m' = 3$.
In some sense, a form of non-linear PCA.
[Figure: a feed-forward autoencoder network with Input Layer, Hidden Layers 1-2, Code, Hidden Layers 4-5, and Output; the left half is the encoder $e_{\theta_e} : \mathbb{R}^m \to \mathbb{R}^{m'}$ and the right half the decoder $d_{\theta_d} : \mathbb{R}^{m'} \to \mathbb{R}^m$.]
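The pictured architecture can be sketched as a forward pass through stacked layers. This is an illustrative sketch only: the hidden widths (6), the tanh nonlinearity, and the weight scale are assumptions not specified on the slide.

```python
import numpy as np

# Hypothetical forward pass through a 9 -> 6 -> 3 -> 6 -> 9 autoencoder.
# The first two layers form the encoder e(x); the last two the decoder d(z).

rng = np.random.default_rng(1)
sizes = [9, 6, 3, 6, 9]            # m = 9, code dim m' = 3 (hidden widths assumed)
Ws = [rng.normal(scale=0.3, size=(b, a)) for a, b in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(b) for b in sizes[1:]]

def forward(x):
    """Full pass d(e(x)): tanh on every layer except the linear output."""
    h = x
    for i, (W, b) in enumerate(zip(Ws, bs)):
        h = W @ h + b
        if i < len(Ws) - 1:        # nonlinearity on all but the output layer
            h = np.tanh(h)
    return h

x = rng.normal(size=9)
code = np.tanh(Ws[1] @ np.tanh(Ws[0] @ x + bs[0]) + bs[1])  # e(x) in R^3
x_hat = forward(x)                                          # d(e(x)) in R^9
print(code.shape, x_hat.shape)     # (3,) (9,)
```

The bottleneck layer (`Code`) forces all information used for reconstruction through $\mathbb{R}^3$, which is the sense in which this generalizes PCA to non-linear maps.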
Prof. Jeff Bilmes, EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 - F65/65