Linear Regression & Gradient Descent

Tufts COMP 135: Introduction to Machine Learning
https://www.cs.tufts.edu/comp/135/2019s/

Many slides attributable to: Erik Sudderth (UCI), Finale Doshi-Velez (Harvard), James, Witten, Hastie, Tibshirani (ISL/ESL books)

Prof. Mike Hughes

LR & GD Unit Objectives

• Exact solutions of least squares
  • 1D case without bias
  • 1D case with bias
  • General case

• Gradient descent for least squares

Mike Hughes - Tufts COMP 135 - Spring 2019

What will we learn?

[Diagram: three ML paradigms (Supervised Learning, Unsupervised Learning, Reinforcement Learning). Supervised learning starts from data-label pairs {x_n, y_n}, n = 1…N, defines a task and a performance measure, and proceeds through Training, Prediction, and Evaluation.]


Task: Regression

[Diagram: within supervised learning, the regression task maps input x to output y.]

y is a numeric variable, e.g. sales in $$

Visualizing errors


Regression: Evaluation Metrics

• mean squared error:

  (1/N) Σ_{n=1}^N (y_n − ŷ_n)²

• mean absolute error:

  (1/N) Σ_{n=1}^N |y_n − ŷ_n|
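As a concrete check of these formulas, here is a minimal sketch in NumPy (the function names are mine, not from the slides):

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # (1/N) * sum_{n=1}^{N} (y_n - yhat_n)^2
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean((y_true - y_pred) ** 2))

def mean_absolute_error(y_true, y_pred):
    # (1/N) * sum_{n=1}^{N} |y_n - yhat_n|
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.abs(y_true - y_pred)))
```

Squared error penalizes large residuals more heavily than absolute error, which is why the two metrics can rank the same models differently.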

Linear Regression

Parameters:

  w = [w_1, w_2, … w_f … w_F]  (weight vector)
  b  (bias scalar)

Prediction:

  ŷ(x_i) ≜ Σ_{f=1}^F w_f x_if + b

Training: find weights and bias that minimize error
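The prediction rule is a single dot product plus the bias; a sketch under the notation above (the function name is mine):

```python
import numpy as np

def predict(x_i, w, b):
    # yhat(x_i) = sum_{f=1}^{F} w_f * x_if + b
    return float(np.dot(w, x_i) + b)
```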

Sales vs. Ad Budgets

Linear Regression: Training

Optimization problem: “Least Squares”

  min_{w,b} Σ_{n=1}^N ( y_n − ŷ(x_n, w, b) )²


Linear Regression: Training

Optimization problem: “Least Squares”

  min_{w,b} Σ_{n=1}^N ( y_n − ŷ(x_n, w, b) )²

Exact formulas for the optimal values of w, b exist!

With only one feature (F=1):

  w = [ Σ_{n=1}^N (x_n − x̄)(y_n − ȳ) ] / [ Σ_{n=1}^N (x_n − x̄)² ]
  b = ȳ − w x̄

  where x̄ = mean(x_1, … x_N) and ȳ = mean(y_1, … y_N)

Where does this come from?
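The F=1 formulas translate directly to code; a sketch on noiseless data, where the fit is exact (function name and data are mine):

```python
import numpy as np

def fit_least_squares_1d(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xbar, ybar = x.mean(), y.mean()
    # w = sum_n (x_n - xbar)(y_n - ybar) / sum_n (x_n - xbar)^2
    w = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    # b = ybar - w * xbar
    b = ybar - w * xbar
    return w, b

# On data generated by y = 2x + 1, the formulas recover w = 2, b = 1.
w, b = fit_least_squares_1d([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```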


Linear Regression: Training

Optimization problem: “Least Squares”

  min_{w,b} Σ_{n=1}^N ( y_n − ŷ(x_n, w, b) )²

An exact formula for the optimal values of w, b exists!

With many features (F ≥ 1):

  [w_1 … w_F b]^T = (X^T X)^{−1} X^T y

  X = ⎡ x_11 … x_1F 1 ⎤
      ⎢ x_21 … x_2F 1 ⎥
      ⎢       …       ⎥
      ⎣ x_N1 … x_NF 1 ⎦

Where does this come from?
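The general formula can be implemented by appending a ones column to X; a sketch (in practice np.linalg.lstsq is preferred over forming (X^T X)^{−1} explicitly, for numerical stability):

```python
import numpy as np

def fit_least_squares(X_raw, y):
    # Append a column of ones so the last entry of theta plays the role of b.
    X = np.hstack([X_raw, np.ones((X_raw.shape[0], 1))])
    # Solves the least-squares problem whose solution is (X^T X)^{-1} X^T y
    # when X has full column rank.
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta[:-1], theta[-1]   # weights w, bias b
```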

Derivation Notes

http://www.cs.tufts.edu/comp/135/2019s/notes/day03_linear_regression.pdf


When does the Least Squares estimator exist?

• Fewer examples than features (N < F): infinitely many solutions!

• Same number of examples and features (N = F): optimum exists if X is full rank

• More examples than features (N > F): optimum exists if X is full rank

More compact notation

  θ = [b w_1 w_2 … w_F]
  x_n = [1 x_n1 x_n2 … x_nF]

  ŷ(x_n, θ) = θ^T x_n

  J(θ) ≜ Σ_{n=1}^N ( y_n − ŷ(x_n, θ) )²
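In this notation the cost is one matrix-vector product away; a sketch (stacking the 1-prepended x_n rows into a matrix is my convention here):

```python
import numpy as np

def cost_J(theta, X_tilde, y):
    # J(theta) = sum_n (y_n - theta^T x_n)^2, with each row of X_tilde
    # already prepended with a 1 so theta[0] acts as the bias b.
    residuals = y - X_tilde @ theta
    return float(np.sum(residuals ** 2))
```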

Idea: Optimize via small steps


Derivatives point uphill


To minimize, go downhill

Step in the opposite direction of the derivative

Steepest descent algorithm

  input: initial θ ∈ R^{F+1}
  input: step size α ∈ R_+

  while not converged:
      θ ← θ − α (d/dθ) J(θ)
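Specialized to the least-squares cost J(θ), the loop looks like this; a sketch with a fixed iteration budget (the gradient 2 X^T (Xθ − y) comes from differentiating the sum of squared errors):

```python
import numpy as np

def steepest_descent(X_tilde, y, alpha=0.01, n_iters=5000):
    theta = np.zeros(X_tilde.shape[1])               # initial theta
    for _ in range(n_iters):
        # d/dtheta of sum_n (y_n - theta^T x_n)^2 is 2 X^T (X theta - y)
        grad = 2.0 * X_tilde.T @ (X_tilde @ theta - y)
        theta = theta - alpha * grad                 # step opposite the derivative
    return theta
```

With a fixed small α the iterates converge on well-scaled data; too large an α makes the updates diverge.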



How to set step size?

• Simple and usually effective: pick a small constant, e.g. α = 0.01

• Improvement: decay over iterations, e.g. α_t = C / t  or  α_t = (C + t)^{−0.9}

• Improvement: line search for the best value at each step
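The three choices above as small helpers; a sketch (the names and default constants are mine):

```python
def alpha_constant(t, alpha=0.01):
    return alpha                  # fixed small step, alpha = 0.01

def alpha_inverse_decay(t, C=1.0):
    return C / t                  # alpha_t = C / t, for t = 1, 2, ...

def alpha_power_decay(t, C=1.0):
    return (C + t) ** -0.9        # alpha_t = (C + t)^{-0.9}
```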

How to assess convergence?

• Ideal: stop when the derivative equals zero

• Practical heuristics: stop when …
  • the change in loss becomes small:  |J(θ_t) − J(θ_{t−1})| < ε
  • the step size is indistinguishable from zero:  α |(d/dθ) J(θ)| < ε
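Both heuristics are cheap to check inside the descent loop; a sketch (the function name and tolerance are mine):

```python
import numpy as np

def has_converged(J_curr, J_prev, grad, alpha, eps=1e-8):
    # Heuristic 1: change in loss is small: |J(theta_t) - J(theta_{t-1})| < eps
    loss_change_small = abs(J_curr - J_prev) < eps
    # Heuristic 2: the actual step alpha * |dJ/dtheta| is near zero
    step_small = alpha * float(np.max(np.abs(grad))) < eps
    return loss_change_small or step_small
```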

Visualizing the cost function

“Level set” contours: all points with the same function value

In 2D parameter space

gradient = vector of partial derivatives

Gradient Descent DEMO
https://github.com/tufts-ml-courses/comp135-19s-assignments/blob/master/labs/GradientDescentDemo.ipynb


Fitting a line isn’t always ideal


Can fit linear functions to nonlinear features

A nonlinear function of x:

  ŷ(x_i) = θ_0 + θ_1 x_i + θ_2 x_i² + θ_3 x_i³

can be written as a linear function of φ(x_i) = [x_i  x_i²  x_i³]:

  ŷ(φ(x_i)) = θ_0 + θ_1 φ(x_i)_1 + θ_2 φ(x_i)_2 + θ_3 φ(x_i)_3

“Linear regression” means linear in the parameters (weights, biases).
Features can be arbitrary transforms of raw data.
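To see this in practice, here is a sketch fitting a cubic of this form with ordinary least squares on φ(x_i) = [x_i, x_i², x_i³] (the coefficient values are made up for the example):

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 50)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + 0.5 * x ** 3    # a nonlinear function of x

# Design matrix for phi(x_i) = [x_i, x_i^2, x_i^3], plus a ones column for theta_0
Phi = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
# With noiseless data the linear solver recovers [1.0, 2.0, -3.0, 0.5]
```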

What feature transform to use?

• Anything that works for your data!

• sin / cos for periodic data

• polynomials for high-order dependencies:  φ(x_i) = [x_i  x_i²  x_i³]

• interactions between feature dimensions:  φ(x_i) = [x_i1 x_i2  x_i3 x_i4]

• Many other choices possible
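As one example from the list, a sketch of sin/cos features for periodic data (the generating coefficients are made up):

```python
import numpy as np

x = np.linspace(0.0, 10.0, 200)
y = 2.0 * np.sin(x) + 0.5 * np.cos(x)      # periodic data

# phi(x_i) = [sin(x_i), cos(x_i)]: a model linear in these features is periodic in x
Phi = np.column_stack([np.sin(x), np.cos(x)])
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
```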