Page 1

Mustafa Jarrar: Lecture Notes on Linear Regression, Birzeit University, 2018

Mustafa Jarrar, Birzeit University

Machine Learning: Linear Regression

Version 1

Page 2

More Online Courses at: http://www.jarrar.info
Course Page: http://www.jarrar.info/courses/AI/

Watch this lecture and download the slides

Acknowledgement: This lecture is based on (but not limited to) Andrew Ng’s course about Machine Learning https://www.youtube.com/channel/UCMoXOGX9mgrYNEwpcIQUcag

Page 3

In this lecture:

→ Part 1: Motivation (Regression Problems)
• Part 2: Linear Regression
• Part 3: The Cost Function
• Part 4: The Gradient Descent Algorithm
• Part 5: The Normal Equation
• Part 6: Linear Algebra overview
• Part 7: Using Octave
• Part 8: Using R


Page 4

Motivation

Given the following housing prices, how much does a house of 1250 ft² cost?

[Scatter plot: Price (in 1000s of dollars) against Size (feet²); a fitted straight line passes near the point (1250, 230)]

We may assume a linear curve fits the data, and so conclude that a house of 1250 ft² costs about $230K.

Page 5

Motivation

Given the following housing prices, i.e., given the right answers for each example in the data (training data):

[Scatter plot: Price (in 1000s of dollars) against Size (feet²)]

This is Supervised Learning, a Regression Problem: predict real-valued output. Remember that classification (not regression) refers to predicting discrete-valued output.

Page 6

Motivation

Given the following training set of housing prices:

Size in feet² (x) | Price ($) in 1000s (y)
2104 | 460
1416 | 232
1534 | 315
852 | 178
… | …

Notation:
m = number of training examples
x's = "input" variable/features
y's = "output" variable/target variable
(x, y): a training example
(xi, yi): the i-th training example

For example: x1 = 2104, y1 = 460.

Our job is to learn from this data how to predict prices.

Page 7

In this lecture:

• Part 1: Motivation (Regression Problems)
→ Part 2: Linear Regression
• Part 3: The Cost Function
• Part 4: The Gradient Descent Algorithm
• Part 5: The Normal Equation
• Part 6: Linear Algebra overview
• Part 7: Using Octave
• Part 8: Using R


Page 8

Linear Regression

[Scatter plot: data points (x, y) with a fitted straight line h(x)]

Training Set → Learning Algorithm → Hypothesis h

h maps x's to y's: Size of house → h → Estimated Price

How to represent h?

h(x) = θ0 + θ1x

• This is a linear function.
• Also called linear regression with one variable.
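As a small illustration (the θ values here are arbitrary, not fitted), this hypothesis can be written directly in Octave, which we use later in this lecture:

theta0 = 1; theta1 = 0.5;        % example parameter values
h = @(x) theta0 + theta1 .* x;   % h(x) = theta0 + theta1*x
h(2)                             % ans = 2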

Page 9

Linear Regression with Multiple Features

Suppose we have the following features:

x1 (Size ft²) | x2 (bedrooms) | x3 (floors) | x4 (Age) | y (Price)
2104 | 5 | 1 | 45 | 460
1416 | 3 | 2 | 40 | 232
1534 | 3 | 2 | 30 | 315
852 | 2 | 1 | 36 | 178
… | … | … | … | …

A hypothesis function h(x) might be:

h(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + … + θnxn

or, with x0 = 1:

h(x) = θ0x0 + θ1x1 + θ2x2 + θ3x3 + … + θnxn

e.g., h(x) = 80 + 0.1·x1 + 0.01·x2 + 3·x3 − 2·x4

Linear regression with multiple features is also called multiple linear regression.
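With multiple features, the hypothesis is conveniently computed as a vector product. A minimal Octave sketch, using the example θs above and the first training row:

theta = [80; 0.1; 0.01; 3; -2];  % [theta0; theta1; theta2; theta3; theta4]
x = [1; 2104; 5; 1; 45];         % x0 = 1, then Size, bedrooms, floors, Age
h = theta' * x                   % 80 + 0.1*2104 + 0.01*5 + 3*1 - 2*45 = 203.45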

Page 10

Polynomial Regression

Suppose we have the same features:

x1 (Size ft²) | x2 (bedrooms) | x3 (floors) | x4 (Age) | y (Price)
2104 | 5 | 1 | 45 | 460
1416 | 3 | 2 | 40 | 232
1534 | 3 | 2 | 30 | 315
852 | 2 | 1 | 36 | 178
… | … | … | … | …

Another hypothesis function h(x) might be polynomial:

h(x) = θ0x0 + θ1x1 + θ2x2² + θ3x3³ + … + θnxnⁿ   (x0 = 1)

Page 11

Polynomial Regression

[Scatter plot: Price (y) against Size (x), with three candidate curves fitted to the data:]

h(x) = θ0 + θ1x + θ2x²
h(x) = θ0 + θ1x + θ2x² + θ3x³
h(x) = θ0 + θ1x + θ2√x

• We may combine features (e.g., size = width × depth).
• We have the option of which features and which models (quadratic, cubic, …) to use.
• Deciding which features and models best fit our data and application is beyond the scope of this course, but there are several algorithms for this.
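One common way to realize such models is to build the powers of a feature as new columns and then reuse ordinary linear regression, since the hypothesis stays linear in the θs. A hedged Octave sketch for the cubic model:

sz = [2104; 1416; 1534; 852];                % the Size feature
X = [ones(length(sz),1), sz, sz.^2, sz.^3];  % columns: x0 = 1, x, x^2, x^3
% h(x) = theta0 + theta1*x + theta2*x^2 + theta3*x^3 is linear in the thetas,
% so gradient descent or the Normal Equation (later parts) apply unchanged.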

Page 12

In this lecture:

• Part 1: Motivation (Regression Problems)
• Part 2: Linear Regression
→ Part 3: The Cost Function
• Part 4: The Gradient Descent Algorithm
• Part 5: The Normal Equation
• Part 6: Linear Algebra overview
• Part 7: Using Octave
• Part 8: Using R


Page 13

Understanding the θ Values

[Scatter plot: data points (x, y) with a fitted line h(x)]

h(x) = θ0 + θ1x

θi's: Parameters

How to choose the θi's?

Page 14

Understanding the θ Values

[Three plots on 0..3 axes, one per parameter choice:]

h(x) = 1 + 0.5x    (θ0 = 1, θ1 = 0.5)
h(x) = 0 + 0.5x    (θ0 = 0, θ1 = 0.5)
h(x) = 1.5 + 0·x   (θ0 = 1.5, θ1 = 0)

Page 15

The Cost Function

[Scatter plot: data points (x, y) with the line h(x)]

Idea: choose θ0, θ1 so that h(x) is close to y for our training examples (x, y).

y is the actual value; h(x) is the estimated value.

J(θ0, θ1) = (1 / (2m)) ∑i=1..m ( h(xi) − yi )²

This cost function is also called a squared error function. The cost function J aims to find the θ0, θ1 that minimize the error.
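A minimal Octave sketch of this cost (the name computeCost is ours, not from the slides):

function J = computeCost(X, y, theta)
  % X: m-by-(n+1) matrix of inputs; y: m-by-1 actual values;
  % theta: (n+1)-by-1 parameters
  m = length(y);
  errors = X * theta - y;            % h(xi) - yi for all i at once
  J = (1 / (2*m)) * sum(errors.^2);  % squared error cost
end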

Page 16

Understanding the Cost Function

Let's try different values of the θs and see how the cost function J(θ0, θ1) behaves. Take the training examples (1,1), (2,2), (3,3), fix θ0 = 0, and vary θ1, so h(x) = θ1·x and m = 3:

J(1) = (1 / (2·3)) [0² + 0² + 0²] = 0
J(0.5) = (1 / (2·3)) [(0.5−1)² + (1−2)² + (1.5−3)²] ≈ 0.58
J(0) = (1 / (2·3)) [1² + 2² + 3²] ≈ 2.3

[Plot: J(θ1) against θ1, a bowl-shaped curve with its minimum at θ1 = 1]

Given our training data, θ1 = 1 is the best value. But this figure plots only θ1; the problem becomes more complicated if we also have θ0, i.e., J(θ0, θ1).
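The three values above can be reproduced with the computeCost sketch above (θ0 = 0 is dropped, so X is just the column of x values):

X = [1; 2; 3];  y = [1; 2; 3];   % the toy training set
computeCost(X, y, 1)             % ans = 0
computeCost(X, y, 0.5)           % ans ≈ 0.58
computeCost(X, y, 0)             % ans ≈ 2.33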

Page 17

Cost Function Intuition II

Hypothesis:  h(x) = θ0 + θ1x

Parameters:  θ0, θ1

Cost Function:  J(θ0, θ1) = (1 / (2m)) ∑i=1..m ( h(xi) − yi )²

Our Goal:  minimize J(θ0, θ1)

Remember that our goal is to find the values of θ0 and θ1 that minimize J.

Page 18

Cost Function Intuition II

Given any dataset, when we plot the cost function J(θ0, θ1), we may get a 3D surface like this:

[3D surface plot of J(θ0, θ1)]

Given a dataset, our goal is to find the minimum of J(θ0, θ1).

Page 19

Cost Function Intuition II

We may also draw the cost function using contour figures:

[Contour plot of J(θ0, θ1)]

Page 20

Cost Function Intuition II

[Contour plot of J(θ0, θ1) with a point marked, and the corresponding line h(x) = 800 + 0.15·x drawn over the data]

Page 21

Cost Function Intuition II

[Contour plot of J(θ0, θ1) with a point marked, and the corresponding line h(x) = 360 + 0·x drawn over the data]

Page 22

Cost Function Intuition II

[Contour plot of J(θ0, θ1) with a point near the minimum]

Is there any way/algorithm to find the θs automatically? Yes, e.g., the Gradient Descent Algorithm.

Page 23

In this lecture:

• Part 1: Motivation (Regression Problems)
• Part 2: Linear Regression
• Part 3: The Cost Function
→ Part 4: The Gradient Descent Algorithm
• Part 5: The Normal Equation
• Part 6: Linear Algebra overview
• Part 7: Using Octave
• Part 8: Using R


Page 24

The Gradient Descent Algorithm

Repeat until convergence {
  temp0 := θ0 − α · ∂/∂θ0 J(θ0, θ1)
  temp1 := θ1 − α · ∂/∂θ1 J(θ0, θ1)
  θ0 := temp0
  θ1 := temp1
}

α is the learning rate; ∂/∂θj is the partial derivative, where j is 0 or 1.

Start with some initial values of θ0 and θ1, and keep changing θ0 and θ1 to reduce J(θ0, θ1), until hopefully we end up at a minimum.
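To make the loop concrete, here is a runnable Octave sketch of the same update pattern on a deliberately simple J(θ0, θ1) = θ0² + θ1² (chosen for illustration; the regression cost comes next):

alpha = 0.1;               % learning rate
theta0 = 3;  theta1 = -2;  % some initial values
for iter = 1:100           % "repeat until convergence" (fixed count here)
  temp0 = theta0 - alpha * (2*theta0);  % d/dtheta0 J = 2*theta0
  temp1 = theta1 - alpha * (2*theta1);  % d/dtheta1 J = 2*theta1
  theta0 = temp0;  theta1 = temp1;      % simultaneous update via the temps
end
[theta0, theta1]           % both approach 0, the minimum of this J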

Page 25

Problems of the Gradient Descent Algorithm

[Two plots of J(θ1): with a small α the descent takes many tiny steps; with a large α the steps overshoot the minimum]

If α is too small, gradient descent can be slow.

If α is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge.

Depending on the starting point, gradient descent may converge to the global minimum or to a local minimum.

Page 26

The Gradient Descent for Linear Regression

Repeat until convergence {
  θ0 := θ0 − α · (1/m) · ∑i=1..m ( h(xi) − yi )
  θ1 := θ1 − α · (1/m) · ∑i=1..m ( h(xi) − yi ) · xi
}

This is the simplified version for linear regression, updating θj for j = 0 and 1 simultaneously.

Next, we will use Linear Algebra to numerically find the optimal θs (the Normal Equation), without needing an iterative algorithm like Gradient Descent. However, Gradient Descent scales better for bigger datasets.
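A runnable Octave sketch of exactly these two updates, vectorized, on a tiny made-up dataset:

X = [1 1; 1 2; 1 3];  y = [1; 2; 3];  % x0 = 1 column plus one feature
theta = zeros(2,1);  alpha = 0.1;  m = length(y);
for iter = 1:1500
  errors = X*theta - y;                         % h(xi) - yi
  theta = theta - alpha * (1/m) * (X'*errors);  % theta0 and theta1 together
end
theta                                 % approaches [0; 1], i.e., h(x) = x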

Page 27

In this lecture:

• Part 1: Motivation (Regression Problems)
• Part 2: Linear Regression
• Part 3: The Cost Function
• Part 4: The Gradient Descent Algorithm
→ Part 5: The Normal Equation
• Part 6: Linear Algebra overview
• Part 7: Using Octave
• Part 8: Using R


Page 28

Minimizing θs numerically

Another way to numerically estimate the optimal values of the θs (using linear algebra). Given the following features:

x1 (Size ft²) | x2 (bedrooms) | x3 (floors) | x4 (Age) | y (Price in $1000s)
2104 | 5 | 1 | 45 | 460
1416 | 3 | 2 | 40 | 232
1534 | 3 | 2 | 30 | 315
852 | 2 | 1 | 36 | 178
… | … | … | … | …

Convert the features and the target values into matrices:

X = [ 2104 5 1 45
      1416 3 2 40
      1534 3 2 30
       852 2 1 36 ]

y = [ 460
      232
      315
      178 ]

Page 29

Minimizing θs numerically

X = [ 1 2104 5 1 45
      1 1416 3 2 40
      1 1534 3 2 30
      1  852 2 1 36 ]

y = [ 460
      232
      315
      178 ]

Here we add x0 = 1 to every example, so as to represent θ0.

Page 30

The Normal Equation

To obtain the optimal (minimizing) values of the θs, use the following Normal Equation:

θ = (Xᵀ·X)⁻¹ · Xᵀ · y

In Octave: pinv(X'*X)*X'*y

Where:
θ: the set of θs we want to find
X: the set of features in the training set
y: the output values we try to predict
Xᵀ: the transpose of X
(XᵀX)⁻¹: the inverse of the matrix XᵀX
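A quick Octave check on synthetic data (made up here so the expected answer is known):

X = [1 1; 1 2; 1 3; 1 4];  % x0 = 1 plus one feature
y = [3; 5; 7; 9];          % generated from y = 1 + 2x
theta = pinv(X'*X)*X'*y    % returns [1; 2]: theta0 = 1, theta1 = 2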

Page 31

Gradient Descent vs. Normal Equation

(m training examples, n features)

Gradient Descent:
• Needs to choose the learning rate (α)
• Needs many iterations
• Works well even when n is large

Normal Equation:
• No need to choose (α)
• No need to iterate
• Needs to compute (XᵀX)⁻¹, which takes about O(n³)
• Slow if n is very large
• Some matrices (e.g., singular) are non-invertible

Recommendation: use the Normal Equation if the number of features is less than about 1000; otherwise use Gradient Descent.

Page 32

In this lecture:

• Part 1: Motivation (Regression Problems)
• Part 2: Linear Regression
• Part 3: The Cost Function
• Part 4: The Gradient Descent Algorithm
• Part 5: The Normal Equation
→ Part 6: Linear Algebra overview
• Part 7: Using Octave
• Part 8: Using R


Page 33

Matrix Vector Multiplication

[1 3; 4 0; 2 1] × [1; 5] = [16; 4; 7]

because:
1×1 + 3×5 = 16
4×1 + 0×5 = 4
2×1 + 1×5 = 7

Page 34

Matrix Matrix Multiplication

[1 3 2; 4 0 1] × [1 3; 0 1; 5 2] = [11 10; 9 14]

Multiply by the first column:
[1 3 2; 4 0 1] × [1; 0; 5] = [11; 9]

Multiply by the second column:
[1 3 2; 4 0 1] × [3; 1; 2] = [10; 14]

Page 35

Matrix Inverse

If A is an m×m matrix, and if it has an inverse, then:

A × A⁻¹ = A⁻¹ × A = I

The multiplication of a matrix with its inverse produces an identity matrix:

[3 4; 2 16] × [0.4 −0.1; −0.05 0.075] = [1 0; 0 1]

Some matrices do not have inverses, e.g., if all cells are zeros.

For more, please review Linear Algebra.

Page 36

Matrix Transpose

Let A be an m×n matrix, and let B = Aᵀ. Then B is an n×m matrix, and Bij = Aji.

Example:

A = [1 2 0; 3 5 9]    Aᵀ = [1 3; 2 5; 0 9]
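All of the examples in this part can be checked in Octave (the matrices below are the ones on the slides):

[1 3; 4 0; 2 1] * [1; 5]          % matrix-vector: [16; 4; 7]
[1 3 2; 4 0 1] * [1 3; 0 1; 5 2]  % matrix-matrix: [11 10; 9 14]
A = [3 4; 2 16];
inv(A)                            % [0.4 -0.1; -0.05 0.075]
A * inv(A)                        % the 2-by-2 identity matrix
B = [1 2 0; 3 5 9];
B'                                % transpose: [1 3; 2 5; 0 9]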

Page 37

In this lecture:

• Part 1: Motivation (Regression Problems)
• Part 2: Linear Regression
• Part 3: The Cost Function
• Part 4: The Gradient Descent Algorithm
• Part 5: The Normal Equation
• Part 6: Linear Algebra overview
→ Part 7: Using Octave
• Part 8: Using R


Page 38

Octave

• Download from https://www.gnu.org/software/octave/
• A high-level scientific programming language
• A free alternative to (and compatible with) Matlab
• Helps solve linear and nonlinear problems numerically

Online Tutorial: https://www.youtube.com/playlist?list=PLnnr1O8OWc6aAjSc50lzzPVWgjzucZsSD

Page 39

Compute the normal equation using Octave

Given:

X = [ 1 2104 5 1 45
      1 1416 3 2 40
      1 1534 3 2 30
      1  852 2 1 36 ]

y = [ 460
      232
      315
      178 ]

In Octave:
load X.txt
load y.txt
theta = pinv(X'*X)*X'*y
save theta.txt theta

θ = [ 188.4
        0.4
      −56
      −93
       −3.7 ]

These are the values of the θs we need to use in our hypothesis function h(x):

h(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + θ4x4
h(x) = 188.4 + 0.4·x1 − 56·x2 − 93·x3 − 3.7·x4

Page 40

In this lecture:

• Part 1: Motivation (Regression Problems)
• Part 2: Linear Regression
• Part 3: The Cost Function
• Part 4: The Gradient Descent Algorithm
• Part 5: The Normal Equation
• Part 6: Linear Algebra overview
• Part 7: Using Octave
→ Part 8: Using R


Page 41

R (programming language)

Free software environment for statistical computing and graphics that is supported by the R Foundation for Statistical Computing.

R comes with many functions that can do sophisticated stuff, and the ability to install additional packages to do much more.

Download R: https://cran.r-project.org/

A very good IDE for R is RStudio: https://www.rstudio.com/products/rstudio/download/

R basics tutorial: https://www.youtube.com/playlist?list=PLjgj6kdf_snYBkIsWQYcYtUZiDpam7ygg

Page 42

Compute the normal equation using R

Given:

X = [ 1 2104 5 1 45
      1 1416 3 2 40
      1 1534 3 2 30
      1  852 2 1 36 ]

y = [ 460
      232
      315
      178 ]

In R:
X = as.matrix(read.table("~/x.txt", header=F, sep=","))
Y = as.matrix(read.table("~/y.txt", header=F, sep=","))
thetas = solve( t(X) %*% X ) %*% t(X) %*% Y
write.table(thetas, file="~/thetas.txt", row.names=F, col.names=F)

t(X): the transpose of matrix X.
solve(X): the inverse of matrix X.

Page 43

References

[1] Andrew Ng’s course about Machine Learning https://www.youtube.com/channel/UCMoXOGX9mgrYNEwpcIQUcag

[2] Sami Ghawi, Mustafa Jarrar: Lecture Notes on Introduction to Machine Learning, Birzeit University, 2018

[3] Mustafa Jarrar: Lecture Notes on Decision Tree Machine Learning, Birzeit University, 2018

[4] Mustafa Jarrar: Lecture Notes on Linear Regression Machine Learning, Birzeit University, 2018

Page 44

Extra Example

Given the following melon prices in previous years (with the grape and watermelon prices as the features):

Year | Grape Price | Watermelon Price | Melon Price
2013 | 5 | 3 | 4
2014 | 6 | 2 | 3
2015 | 7 | 2 | 3

X = [ 1 5 3
      1 6 2
      1 7 2 ]

y = [ 4
      3
      3 ]

Page 45

X = [ 1 5 3        Xᵀ = [ 1 1 1
      1 6 2               5 6 7
      1 7 2 ]             3 2 2 ]

X·Xᵀ = [ 35 37 42
         37 41 47
         42 47 54 ]

(X·Xᵀ)⁻¹ = [   5  −24   17
             −24  126  −91
              17  −91   66 ]

Note that the Normal Equation itself uses XᵀX, as in the Octave line:

In Octave: theta = pinv(X'*X) * (X'*y)
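Running that line completes the example (the resulting θs below are our own computation, not shown on the slide):

X = [1 5 3; 1 6 2; 1 7 2];
y = [4; 3; 3];
theta = pinv(X'*X)*X'*y  % theta ≈ [1; 0; 1]
% Here X is square and invertible, so the fit is exact:
% melon price = 1 + 0*(grape price) + 1*(watermelon price) for all three years.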