Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from...
![Page 1: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/1.jpg)
Linear Regression
![Page 2: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/2.jpg)
Regression
Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)
Example: Height, Gender, Weight → Shoe Size
• Audio features → Song year
• Processes, memory → Power consumption
• Historical financials → Future stock price
• Many more
![Page 3: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/3.jpg)
Linear Least Squares Regression
Example: Predicting shoe size from height, gender, and weight
For each observation we have a feature vector, $\mathbf{x}$, and label, $y$:
$$\mathbf{x}^\top = \begin{bmatrix} x_1 & x_2 & x_3 \end{bmatrix}$$
We assume a linear mapping between features and label:
$$y \approx w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3$$
![Page 4: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/4.jpg)
Linear Least Squares Regression
Example: Predicting shoe size from height, gender, and weight
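We can augment the feature vector to incorporate offset:
$$\mathbf{x}^\top = \begin{bmatrix} 1 & x_1 & x_2 & x_3 \end{bmatrix}$$
We can then rewrite this linear mapping as scalar product:
$$y \approx \hat{y} = \sum_{i=0}^{3} w_i x_i = \mathbf{w}^\top \mathbf{x}$$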
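A minimal sketch of the augmented scalar product, assuming plain Python lists for $\mathbf{w}$ and $\mathbf{x}$. The function name `predict` and the numeric values are hypothetical and not taken from the slides.

```python
def predict(w, x):
    """Return w^T x for weights w = [w0, w1, ..., wd] and raw features x = [x1, ..., xd]."""
    x_aug = [1.0] + list(x)  # prepend the constant 1 so that w0 acts as the offset
    return sum(wi * xi for wi, xi in zip(w, x_aug))

# Made-up weights and one made-up observation (height, gender, weight).
w = [1.0, 0.05, 0.5, 0.01]
x = [170.0, 1.0, 65.0]
y_hat = predict(w, x)  # w0 + w1*x1 + w2*x2 + w3*x3
```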
![Page 5: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/5.jpg)
Why a Linear Mapping?
Simple
Often works well in practice
Can introduce complexity via feature extraction
![Page 6: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/6.jpg)
1D Example
Goal: find the line of best fit
• x coordinate: features
• y coordinate: labels
$$y \approx \hat{y} = w_0 + w_1 x$$
where $w_0$ is the intercept / offset and $w_1$ is the slope
![Page 7: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/7.jpg)
Evaluating Predictions
Can measure ‘closeness’ between label and prediction
• Shoe size: better to be off by one size than 5 sizes
• Song year prediction: better to be off by a year than by 20 years
What is an appropriate evaluation metric or ‘loss’ function?
• Absolute loss: $|y - \hat{y}|$
• Squared loss: $(y - \hat{y})^2$ ← Has nice mathematical properties
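The two losses can be sketched as small functions. The names are illustrative only. The example values show that squared loss penalizes a 5-size error 25 times more than a 1-size error, while absolute loss penalizes it 5 times more.

```python
def absolute_loss(y, y_hat):
    """|y - y_hat|"""
    return abs(y - y_hat)

def squared_loss(y, y_hat):
    """(y - y_hat)^2"""
    return (y - y_hat) ** 2

# Made-up shoe-size label of 9 with predictions off by 1 and by 5.
small_abs, large_abs = absolute_loss(9, 10), absolute_loss(9, 14)
small_sq, large_sq = squared_loss(9, 10), squared_loss(9, 14)
```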
![Page 8: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/8.jpg)
How Can We Learn Model (w)?
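Assume we have n training points, where $\mathbf{x}^{(i)}$ denotes the ith point
Recall two earlier points:
• Linear assumption: $\hat{y} = \mathbf{w}^\top \mathbf{x}$
• We use squared loss: $(y - \hat{y})^2$
Idea: Find $\mathbf{w}$ that minimizes squared loss over training points:
$$\min_{\mathbf{w}} \sum_{i=1}^{n} \left(\mathbf{w}^\top \mathbf{x}^{(i)} - y^{(i)}\right)^2$$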
![Page 9: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/9.jpg)
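Given n training points with d features, we define:
• $\mathbf{X} \in \mathbb{R}^{n \times d}$: matrix storing points
• $\mathbf{y} \in \mathbb{R}^{n}$: real-valued labels
• $\hat{\mathbf{y}} \in \mathbb{R}^{n}$: predicted labels, where $\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}$
• $\mathbf{w} \in \mathbb{R}^{d}$: regression parameters / model to learn
Least Squares Regression: Learn mapping ($\mathbf{w}$) from features to labels that minimizes residual sum of squares:
$$\min_{\mathbf{w}} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2$$
Equivalent, by definition of Euclidean norm, to
$$\min_{\mathbf{w}} \sum_{i=1}^{n} \left(\mathbf{w}^\top \mathbf{x}^{(i)} - y^{(i)}\right)^2$$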
![Page 10: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/10.jpg)
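Least Squares Regression: Learn mapping ($\mathbf{w}$) from features to labels that minimizes residual sum of squares:
$$\min_{\mathbf{w}} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2$$
Closed form solution (if inverse exists):
$$\mathbf{w} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$
Find solution by setting derivative to zero
1D:
$$f(w) = \|w\mathbf{x} - \mathbf{y}\|_2^2 = \sum_{i=1}^{n} \left(w x^{(i)} - y^{(i)}\right)^2$$
$$\frac{df}{dw}(w) = 2 \underbrace{\sum_{i=1}^{n} x^{(i)} \left(w x^{(i)} - y^{(i)}\right)}_{w\,\mathbf{x}^\top \mathbf{x} \,-\, \mathbf{x}^\top \mathbf{y}} = 0 \iff w\,\mathbf{x}^\top \mathbf{x} - \mathbf{x}^\top \mathbf{y} = 0 \iff w = (\mathbf{x}^\top \mathbf{x})^{-1} \mathbf{x}^\top \mathbf{y}$$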
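The 1D result can be checked numerically. This is a minimal sketch of the 1D case only (one feature, no offset), assuming plain Python lists. The name `fit_1d` and the data are made up for illustration.

```python
def fit_1d(xs, ys):
    """Least squares weight for y ≈ w * x, i.e. w = (x^T x)^{-1} x^T y."""
    xtx = sum(x * x for x in xs)
    xty = sum(x * y for x, y in zip(xs, ys))
    if xtx == 0:
        raise ValueError("x^T x is zero, so the inverse does not exist")
    return xty / xtx

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # made-up points lying exactly on y = 2x
w = fit_1d(xs, ys)

# The derivative 2 * sum_i x_i * (w * x_i - y_i) should vanish at the solution.
derivative = 2 * sum(x * (w * x - y) for x, y in zip(xs, ys))
```

In the general d-dimensional case a linear solver would normally be used rather than an explicit inverse.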
![Page 11: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/11.jpg)
Overfitting and Generalization
We want good predictions on new data, i.e., ‘generalization’
Least squares regression minimizes training error, and could overfit
• Simpler models are more likely to generalize (Occam’s razor)
Can we change the problem to penalize for model complexity?
• Intuitively, models with smaller weights are simpler
![Page 12: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/12.jpg)
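Ridge Regression: Learn mapping ($\mathbf{w}$) that minimizes residual sum of squares along with a regularization term:
$$\min_{\mathbf{w}} \underbrace{\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2}_{\text{Training Error}} + \lambda \underbrace{\|\mathbf{w}\|_2^2}_{\text{Model Complexity}}$$
The free parameter $\lambda$ trades off between training error and model complexity
Closed-form solution:
$$\mathbf{w} = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}_d)^{-1} \mathbf{X}^\top \mathbf{y}$$
Given n training points with d features, we define:
• $\mathbf{X} \in \mathbb{R}^{n \times d}$: matrix storing points
• $\mathbf{y} \in \mathbb{R}^{n}$: real-valued labels
• $\hat{\mathbf{y}} \in \mathbb{R}^{n}$: predicted labels, where $\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}$
• $\mathbf{w} \in \mathbb{R}^{d}$: regression parameters / model to learn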
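For intuition, the ridge closed form can be sketched in the d = 1 case, where $\mathbf{X}^\top\mathbf{X} + \lambda \mathbf{I}_d$ reduces to the scalar $\mathbf{x}^\top\mathbf{x} + \lambda$. The function name and data are hypothetical.

```python
def ridge_1d(xs, ys, lam):
    """Ridge weight for y ≈ w * x with d = 1: w = (x^T x + lambda)^{-1} x^T y."""
    xtx = sum(x * x for x in xs)
    xty = sum(x * y for x, y in zip(xs, ys))
    return xty / (xtx + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]                    # made-up points on y = 2x
w_unregularized = ridge_1d(xs, ys, 0.0)  # same as least squares
w_regularized = ridge_1d(xs, ys, 14.0)   # larger lambda shrinks the weight toward 0
```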
![Page 13: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/13.jpg)
Millionsong Regression Pipeline
![Page 14: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/14.jpg)
Supervised Learning Pipeline
[Figure: pipeline diagram built up step by step: Obtain Raw Data (full dataset) → Split Data (training set, test set) → Feature Extraction → Supervised Learning (model) → Evaluation (accuracy) → Predict (new entity → prediction)]
![Page 21: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/21.jpg)
Goal: Predict song’s release year from audio features
Raw Data: Millionsong Dataset from UCI ML Repository
• Western, commercial tracks from 1980-2014
• 12 timbre averages (features) and release year (label)
![Page 22: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/22.jpg)
Split Data: Train on training set, evaluate with test set
• Test set simulates unobserved data
• Test error tells us whether we’ve generalized well
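One possible way to split the data is a shuffled split, sketched below. The 80/20 ratio, the fixed seed, and the function name are assumptions for illustration. The slides do not specify how the split is made.

```python
import random

def train_test_split(points, test_fraction=0.2, seed=0):
    """Shuffle a copy of the points and hold out a fraction as the test set."""
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Made-up (feature, label) pairs.
data = [(float(i), 2.0 * i) for i in range(10)]
train, test = train_test_split(data)
```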
![Page 23: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/23.jpg)
Feature Extraction: Quadratic features
• Compute pairwise feature interactions
• Captures covariance of initial timbre features
• Leads to a non-linear model relative to raw features
![Page 24: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/24.jpg)
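Given 2 dimensional data, quadratic features are:
$$\mathbf{x} = \begin{bmatrix} x_1 & x_2 \end{bmatrix}^\top \implies \Phi(\mathbf{x}) = \begin{bmatrix} x_1^2 & x_1 x_2 & x_2 x_1 & x_2^2 \end{bmatrix}^\top$$
$$\mathbf{z} = \begin{bmatrix} z_1 & z_2 \end{bmatrix}^\top \implies \Phi(\mathbf{z}) = \begin{bmatrix} z_1^2 & z_1 z_2 & z_2 z_1 & z_2^2 \end{bmatrix}^\top$$
More succinctly:
$$\Phi'(\mathbf{x}) = \begin{bmatrix} x_1^2 & \sqrt{2}\,x_1 x_2 & x_2^2 \end{bmatrix}^\top \qquad \Phi'(\mathbf{z}) = \begin{bmatrix} z_1^2 & \sqrt{2}\,z_1 z_2 & z_2^2 \end{bmatrix}^\top$$
Equivalent inner products:
$$\Phi(\mathbf{x})^\top \Phi(\mathbf{z}) = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2 = \Phi'(\mathbf{x})^\top \Phi'(\mathbf{z})$$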
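The equivalence of the two feature maps can be checked on a small example. This is a sketch for 2 dimensional inputs only. The names `phi`, `phi_compact`, and `dot`, and the input values, are illustrative.

```python
import math

def phi(x):
    """All pairwise products: [x1^2, x1*x2, x2*x1, x2^2]."""
    x1, x2 = x
    return [x1 * x1, x1 * x2, x2 * x1, x2 * x2]

def phi_compact(x):
    """Succinct version: [x1^2, sqrt(2)*x1*x2, x2^2]."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x, z = (1.0, 2.0), (3.0, 4.0)                   # made-up inputs
full = dot(phi(x), phi(z))                      # 4-dimensional features
compact = dot(phi_compact(x), phi_compact(z))   # 3-dimensional features
```

Both inner products agree, so the compact map gives the same geometry with one fewer feature.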
![Page 25: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/25.jpg)
Supervised Learning: Least Squares Regression
• Learn a mapping from entities to continuous labels given a training set
• Audio features → Song year
![Page 26: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/26.jpg)
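Ridge Regression: Learn mapping ($\mathbf{w}$) that minimizes residual sum of squares along with a regularization term:
$$\min_{\mathbf{w}} \underbrace{\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2}_{\text{Training Error}} + \lambda \underbrace{\|\mathbf{w}\|_2^2}_{\text{Model Complexity}}$$
Closed-form solution:
$$\mathbf{w} = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}_d)^{-1} \mathbf{X}^\top \mathbf{y}$$
Given n training points with d features, we define:
• $\mathbf{X} \in \mathbb{R}^{n \times d}$: matrix storing points
• $\mathbf{y} \in \mathbb{R}^{n}$: real-valued labels
• $\hat{\mathbf{y}} \in \mathbb{R}^{n}$: predicted labels, where $\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}$
• $\mathbf{w} \in \mathbb{R}^{d}$: regression parameters / model to learn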
![Page 27: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/27.jpg)
Ridge Regression: Learn mapping ($\mathbf{w}$) that minimizes residual sum of squares along with a regularization term:
$$\min_{\mathbf{w}} \underbrace{\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2}_{\text{Training Error}} + \lambda \underbrace{\|\mathbf{w}\|_2^2}_{\text{Model Complexity}}$$
The free parameter $\lambda$ trades off between training error and model complexity
How do we choose a good value for this free parameter?
• Most methods have free parameters / ‘hyperparameters’ to tune
First thought: Search over multiple values, evaluate each on test set
• But, goal of test set is to simulate unobserved data
• We may overfit if we use it to choose hyperparameters
Second thought: Create another hold out dataset for this search
![Page 28: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/28.jpg)
Evaluation (Part 1): Hyperparameter tuning
• Training: train various models
• Validation: evaluate various models (e.g., Grid Search)
• Test: evaluate final model’s accuracy
[Figure: supervised learning pipeline, with the full dataset now split into training set, validation set, and test set]
![Page 29: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/29.jpg)
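Grid Search: Exhaustively search through hyperparameter space
• Define and discretize search space (linear or log scale)
• Evaluate points via validation error
[Figure: grid of candidate points with axes Hyperparameter 1 and Hyperparameter 2]
[Figure: log-scale axis for the regularization parameter ($\lambda$) with ticks at $10^{-8}$, $10^{-6}$, $10^{-4}$, $10^{-2}$, $1$]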
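A grid search over $\lambda$ on a log scale can be sketched as follows. For brevity this assumes the one-feature ridge model $w = \mathbf{x}^\top\mathbf{y} / (\mathbf{x}^\top\mathbf{x} + \lambda)$ and mean squared error on the validation set. All names and data values are made up.

```python
def ridge_1d(xs, ys, lam):
    """Ridge weight for y ≈ w * x with a single feature."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(w, xs, ys):
    """Mean squared error of the predictions w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grid_search(train, val, lambdas):
    """Train one model per candidate lambda, return the one with lowest validation error."""
    def validation_error(lam):
        w = ridge_1d(train[0], train[1], lam)
        return mse(w, val[0], val[1])
    return min(lambdas, key=validation_error)

train = ([1.0, 2.0, 3.0], [2.1, 3.9, 6.2])   # noisy points near y = 2x
val = ([4.0, 5.0], [8.0, 10.0])              # held-out points, not used for training
lambdas = [10.0 ** k for k in (-8, -6, -4, -2, 0)]
best_lam = grid_search(train, val, lambdas)
```

The test set is deliberately not touched here. It is reserved for evaluating the final model.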
![Page 30: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/30.jpg)
Evaluating Predictions
How can we compare labels and predictions for n validation points?
Least squares optimization involves squared loss, $(y - \hat{y})^2$, so it seems reasonable to use mean squared error (MSE):
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}^{(i)} - y^{(i)}\right)^2$$
But MSE’s unit of measurement is square of quantity being measured, e.g., “squared years” for song prediction
More natural to use root-mean-square error (RMSE), i.e., $\sqrt{\text{MSE}}$
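RMSE follows directly from the MSE formula. A minimal sketch, with made-up release years as labels and predictions:

```python
import math

def rmse(labels, predictions):
    """Root-mean-square error: sqrt of the mean of squared differences."""
    n = len(labels)
    mse = sum((p - y) ** 2 for y, p in zip(labels, predictions)) / n
    return math.sqrt(mse)

labels = [2000.0, 1990.0]
predictions = [2001.0, 1997.0]      # off by 1 year and by 7 years
error = rmse(labels, predictions)   # in years, not squared years
```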
![Page 31: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/31.jpg)
Evaluation (Part 2): Evaluate final model
• Training set: train various models
• Validation set: evaluate various models
• Test set: evaluate final model’s accuracy
![Page 32: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/32.jpg)
Predict: Final model can then be used to make predictions on future observations, e.g., new songs
![Page 33: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/33.jpg)
Distributed ML: Computation and Storage
![Page 34: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/34.jpg)
Classic ML techniques are not always suitable for modern datasets
Challenge: Scalability
[Figure: Machine Learning, Data, Distributed Computing]
[Figure: “Data Grows Faster than Moore’s Law” [IDC report, Kathy Yelick, LBNL]. Chart over 2010 to 2015 with series Moore’s Law, Overall Data, Particle Accel., DNA Sequencers]
![Page 35: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/35.jpg)
Least Squares Regression: Learn mapping ($\mathbf{w}$) from features to labels that minimizes residual sum of squares:
$$\min_{\mathbf{w}} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2$$
Closed form solution (if inverse exists):
$$\mathbf{w} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$
How do we solve this computationally?
• Computational profile similar for Ridge Regression
![Page 36: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/36.jpg)
Computing Closed Form Solution
$$\mathbf{w} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$
Computation: $O(nd^2 + d^3)$ operations
Consider number of arithmetic operations ( +, −, ×, / )
Computational bottlenecks:
• Matrix multiply of $\mathbf{X}^\top \mathbf{X}$: $O(nd^2)$ operations
• Matrix inverse: $O(d^3)$ operations
Other methods (Cholesky, QR, SVD) have same complexity
![Page 37: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/37.jpg)
Storage Requirements
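$$\mathbf{w} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$
Computation: $O(nd^2 + d^3)$ operations
Storage: $O(nd + d^2)$ floats
Consider storing values as floats (8 bytes)
Storage bottlenecks:
• $\mathbf{X}^\top \mathbf{X}$ and its inverse: $O(d^2)$ floats
• $\mathbf{X}$: $O(nd)$ floats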
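To get a feel for these terms, the sketch below plugs in concrete sizes. It ignores constant factors, so the numbers are order-of-magnitude estimates only. The function name and the choice n = 1,000,000, d = 100 are hypothetical.

```python
def closed_form_costs(n, d, bytes_per_float=8):
    """Return (operation count, storage in bytes) from the big-O terms, constants ignored."""
    operations = n * d ** 2 + d ** 3   # X^T X multiply plus d x d inverse
    floats = n * d + d ** 2            # X plus X^T X
    return operations, floats * bytes_per_float

operations, storage_bytes = closed_form_costs(n=1_000_000, d=100)
```

With these sizes the $nd^2$ and $nd$ terms dominate, which is the "big n and small d" regime.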
![Page 38: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/38.jpg)
Big n and Small d
$$\mathbf{w} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$
Computation: $O(nd^2 + d^3)$ operations
Storage: $O(nd + d^2)$ floats
Assume $O(d^3)$ computation and $O(d^2)$ storage feasible on single machine
Storing $\mathbf{X}$ and computing $\mathbf{X}^\top \mathbf{X}$ are the bottlenecks
Can distribute storage and computation!
• Store data points (rows of $\mathbf{X}$) across machines
• Compute $\mathbf{X}^\top \mathbf{X}$ as a sum of outer products
![Page 39: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/39.jpg)
Matrix Multiplication via Inner Products
$$\begin{bmatrix} 9 & 3 & 5 \\ 4 & 1 & 2 \end{bmatrix} \begin{bmatrix} 1 & 2 \\ 3 & -5 \\ 2 & 3 \end{bmatrix} = \begin{bmatrix} 28 & 18 \\ 11 & 9 \end{bmatrix}$$
$$9 \times 1 + 3 \times 3 + 5 \times 2 = 28$$
Each entry of the output matrix is the inner product of a row of the first input matrix with a column of the second
![Page 42: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/42.jpg)
Matrix Multiplication via Outer Products
$$\begin{bmatrix} 9 & 3 & 5 \\ 4 & 1 & 2 \end{bmatrix} \begin{bmatrix} 1 & 2 \\ 3 & -5 \\ 2 & 3 \end{bmatrix} = \begin{bmatrix} 28 & 18 \\ 11 & 9 \end{bmatrix}$$
$$\begin{bmatrix} 9 & 18 \\ 4 & 8 \end{bmatrix} + \begin{bmatrix} 9 & -15 \\ 3 & -5 \end{bmatrix} + \begin{bmatrix} 10 & 15 \\ 4 & 6 \end{bmatrix} = \begin{bmatrix} 28 & 18 \\ 11 & 9 \end{bmatrix}$$
Output matrix is the sum of outer products between corresponding columns of the first input matrix and rows of the second
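The example can be reproduced both ways. This is a sketch using nested Python lists. The helper names are illustrative.

```python
A = [[9, 3, 5], [4, 1, 2]]
B = [[1, 2], [3, -5], [2, 3]]

def outer(u, v):
    """Outer product of vectors u and v as a len(u) x len(v) matrix."""
    return [[ui * vj for vj in v] for ui in u]

def add(M, N):
    """Entry-wise sum of two matrices of the same shape."""
    return [[m + n for m, n in zip(rm, rn)] for rm, rn in zip(M, N)]

def matmul_inner(A, B):
    """Each output entry is the inner product of a row of A and a column of B."""
    cols_B = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols_B] for row in A]

def matmul_outer(A, B):
    """Sum of outer products of the k-th column of A with the k-th row of B."""
    result = [[0] * len(B[0]) for _ in A]
    for col_A, row_B in zip(zip(*A), B):
        result = add(result, outer(col_A, row_B))
    return result

product_inner = matmul_inner(A, B)
product_outer = matmul_outer(A, B)
```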
![Page 46: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/46.jpg)
Example: n = 6; 3 workers
$$\mathbf{X}^\top \mathbf{X} = \sum_{i=1}^{n} \mathbf{x}^{(i)} {\mathbf{x}^{(i)}}^\top$$
where $\mathbf{X}^\top$ is $d \times n$ and $\mathbf{X}$ is $n \times d$, with rows $\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(n)}$
• workers: each stores part of the data, e.g. $\{\mathbf{x}^{(1)}, \mathbf{x}^{(5)}\}$, $\{\mathbf{x}^{(3)}, \mathbf{x}^{(4)}\}$, $\{\mathbf{x}^{(2)}, \mathbf{x}^{(6)}\}$: $O(nd)$ Distributed Storage
• map: compute the outer product $\mathbf{x}^{(i)} {\mathbf{x}^{(i)}}^\top$ for each stored point: $O(nd^2)$ Distributed Computation, $O(d^2)$ Local Storage
• reduce: sum the outer products and invert, $\left(\sum_i \mathbf{x}^{(i)} {\mathbf{x}^{(i)}}^\top\right)^{-1}$: $O(d^3)$ Local Computation, $O(d^2)$ Local Storage
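The map and reduce steps can be simulated on a single machine. This sketch uses plain Python lists as stand-in "workers" rather than a real distributed framework, and the data values are made up. The final $d \times d$ inverse is omitted.

```python
from functools import reduce

def outer(x):
    """Outer product x x^T of a point with itself (a d x d matrix)."""
    return [[xi * xj for xj in x] for xi in x]

def add(M, N):
    """Entry-wise sum of two d x d matrices."""
    return [[m + n for m, n in zip(rm, rn)] for rm, rn in zip(M, N)]

# n = 6 points with d = 2 features, spread over 3 simulated workers.
partitions = [
    [[1.0, 2.0], [0.0, 1.0]],
    [[3.0, 1.0], [1.0, 1.0]],
    [[2.0, 0.0], [1.0, 3.0]],
]

def gram_matrix(partitions):
    mapped = (outer(x) for part in partitions for x in part)  # map: one d x d matrix per point
    return reduce(add, mapped)                                # reduce: sum them up

XtX = gram_matrix(partitions)
```

Each mapped value is only $d \times d$, so nothing of size n ever has to sit on one machine.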
![Page 48: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/48.jpg)
Distributed ML: Computation and Storage, Part II
![Page 49: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/49.jpg)
Big n and Small d
$$\mathbf{w} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$
Computation: $O(nd^2 + d^3)$ operations
Storage: $O(nd + d^2)$ floats
Assume $O(d^3)$ computation and $O(d^2)$ storage feasible on single machine
Can distribute storage and computation!
• Store data points (rows of $\mathbf{X}$) across machines
• Compute $\mathbf{X}^\top \mathbf{X}$ as a sum of outer products
![Page 51: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/51.jpg)
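Big n and Big d
$$\mathbf{w} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$
Computation: $O(nd^2 + d^3)$ operations
Storage: $O(nd + d^2)$ floats
As before, storing $\mathbf{X}$ and computing $\mathbf{X}^\top \mathbf{X}$ are bottlenecks
Now, storing and operating on $\mathbf{X}^\top \mathbf{X}$ is also a bottleneck
• Can’t easily distribute!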
![Page 52: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/52.jpg)
Example: n = 6; 3 workers
[Figure: map/reduce computation of $(\mathbf{X}^\top \mathbf{X})^{-1}$ as a sum of outer products over 3 workers, annotated with $O(nd)$ Distributed Storage, $O(nd^2)$ Distributed Computation, $O(d^2)$ Local Storage, and $O(d^3)$ Local Computation]
![Page 54: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/54.jpg)
Big n and Big d
$$\mathbf{w} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$
Computation: $O(nd^2 + d^3)$ operations
Storage: $O(nd + d^2)$ floats
As before, storing $\mathbf{X}$ and computing $\mathbf{X}^\top \mathbf{X}$ are bottlenecks
Now, storing and operating on $\mathbf{X}^\top \mathbf{X}$ is also a bottleneck
• Can’t easily distribute!
1st Rule of thumb: Computation and storage should be linear (in n, d)
![Page 55: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/55.jpg)
Big n and Big d
We need methods that are linear in time and space
One idea: Exploit sparsity
• Explicit sparsity can provide orders of magnitude storage and computational gains
Sparse data is prevalent
• Text processing: bag-of-words, n-grams
• Collaborative filtering: ratings matrix
• Graphs: adjacency matrix
• Categorical features: one-hot-encoding
• Genomics: SNPs, variant calling
dense: 1. 0. 0. 0. 0. 0. 3.
sparse:
• size: 7
• indices: 0 6
• values: 1. 3.
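The dense and sparse forms of the example vector can be related as below. The dictionary layout mirrors the size / indices / values fields on the slide. The function names and the weight vector are hypothetical.

```python
def to_sparse(dense):
    """Keep only the non-zero entries, recording where they were."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return {"size": len(dense), "indices": indices, "values": values}

def sparse_dot(sparse, w):
    """Inner product with a dense vector w, touching only the non-zero entries."""
    return sum(v * w[i] for i, v in zip(sparse["indices"], sparse["values"]))

dense = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0]
sparse = to_sparse(dense)
prediction = sparse_dot(sparse, [2.0] * 7)  # made-up weights, all equal to 2
```

Storage and the inner product both scale with the number of non-zeros rather than with the full size.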
![Page 56: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/56.jpg)
Big n and Big d
We need methods that are linear in time and space
One idea: Exploit sparsity
• Explicit sparsity can provide orders of magnitude storage and computational gains
• Latent sparsity assumption can be used to reduce dimension, e.g., PCA, low-rank approximation (unsupervised learning)
[Figure: ‘Low-rank’ approximation: an $n \times d$ matrix ≈ the product of an $n \times r$ matrix and an $r \times d$ matrix]
![Page 57: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/57.jpg)
Big n and Big d
We need methods that are linear in time and space
Another idea: Use different algorithms
• Gradient descent is an iterative algorithm that requires $O(nd)$ computation and $O(d)$ local storage per iteration
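A minimal sketch of gradient descent for least squares, using the gradient $2\sum_i \mathbf{x}^{(i)}(\mathbf{w}^\top \mathbf{x}^{(i)} - y^{(i)})$. The step size, iteration count, and data are assumptions chosen so that this small example converges. They are not values from the slides.

```python
def gradient(w, X, y):
    """Gradient of sum_i (w^T x_i - y_i)^2: one pass over the data, O(nd) work, O(d) memory."""
    grad = [0.0] * len(w)
    for x_i, y_i in zip(X, y):
        residual = sum(wj * xj for wj, xj in zip(w, x_i)) - y_i
        for j, xj in enumerate(x_i):
            grad[j] += 2.0 * residual * xj
    return grad

def gradient_descent(X, y, step=0.05, iters=1000):
    w = [0.0] * len(X[0])  # start from the zero vector
    for _ in range(iters):
        g = gradient(w, X, y)
        w = [wj - step * gj for wj, gj in zip(w, g)]
    return w

X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]  # first column is the constant 1 (offset)
y = [3.0, 5.0, 7.0]                        # made-up labels, exactly y = 1 + 2x
w = gradient_descent(X, y)
```

Only the d-dimensional vectors `w` and `grad` are kept between passes, which is where the $O(d)$ local storage comes from.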
![Page 58: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/58.jpg)
Closed Form Solution for Big n and Big d
Example: n = 6; 3 workers
[Figure: map/reduce computation of $(\mathbf{X}^\top \mathbf{X})^{-1}$ as a sum of outer products over 3 workers, annotated with $O(nd)$ Distributed Storage, $O(nd^2)$ Distributed Computation, $O(d^2)$ Local Storage, and $O(d^3)$ Local Computation]
![Page 59: Linear Regression - edX · Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and label,](https://reader034.fdocuments.in/reader034/viewer/2022050507/5f98a32e4296de7d664ed509/html5/thumbnails/59.jpg)
Gradient Descent for Big n and Big d
Example: n = 6; 3 workers
[Figure: the closed-form map/reduce diagram, built up over several slides, with the map and reduce steps replaced by question marks (“map: ? ? ?”) and new cost annotations $O(nd)$ and $O(d)$ appearing next to the closed-form costs $O(nd^2)$ Distributed Computation, $O(d^2)$ Local Storage, and $O(d^3)$ Local Computation]