CSCI 447/547 MACHINE LEARNING
Linear Regression
Outline
Linear Models
1D Ordinary Least Squares (OLS)
Solution of OLS
Interpretation
Anscombe’s Quartet
Multivariate OLS
OLS Pros and Cons
Optional Reading
Terminology
Features (Covariates or predictors)
Labels (Variates or targets)
Regression
Classification
Types of Machine Learning
Unsupervised
Finding structure in data
Supervised
Predict from given data
[Figure: height vs. weight scatter plots illustrating the two supervised tasks – separating women vs. men clusters (classification) and predicting weight from height (regression)]
Classification – categorical output data (Logistic Regression)
OLS Regression (Prediction) – continuous output data
What is a Linear Model?
Predict Housing Prices
Depends on:
Area
# of bedrooms
# of bathrooms
Hypothesis is that the relationship is linear:
Price = k1(Area) + k2(#bed) + k3(#bath)
In general: y_i = a_0 + a_1 x_1 + a_2 x_2 + \cdots
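A minimal Python sketch of this pricing model; the coefficient values are invented purely for illustration:

```python
# Hypothetical linear pricing model: Price = k1*(Area) + k2*(#bed) + k3*(#bath).
# The coefficients k1, k2, k3 below are made up for illustration only.
def predict_price(area, n_bed, n_bath, k1=150.0, k2=10_000.0, k3=7_500.0):
    return k1 * area + k2 * n_bed + k3 * n_bath

print(predict_price(1200, 3, 2))  # 1200*150 + 3*10000 + 2*7500 = 225000.0
```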
Why Use Linear Models?
Interpretable
Relationships are easy to see
Low Complexity
Prevents overfitting
Scalable
Scale up to more data, larger problems
Baseline
Can benchmark other methods against them
Examples of Use
MNIST dataset – handwritten digits
Best performance – neural networks with regularization: 99.79% accurate, takes about a day to train, more difficult to build
Logistic Regression: 92.5% accurate, takes seconds to train, can be built with less expertise (sketched below)
Building Blocks of Later Techniques
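A rough sketch of that logistic-regression baseline, assuming scikit-learn and its OpenML copy of MNIST; the exact accuracy and runtime will vary with solver and hardware:

```python
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Download MNIST: 70,000 28x28 digit images flattened to 784 features.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X_train, X_test, y_train, y_test = train_test_split(
    X / 255.0, y, test_size=10_000, random_state=0)

clf = LogisticRegression(max_iter=200)  # multinomial logistic regression
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))  # roughly 0.92
```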
Optional Reading
Definition of 1-Dimensional OLS
The Problem Statement
i is an observation, we have N of them
i = 1…N
x is the independent variable (feature)
y is the dependent variable (output variable)
y = ax + b, where a and b are constants
\hat{y}_i = a x_i + b, or equivalently y_i = a x_i + b + \varepsilon_i (with noise term \varepsilon_i)
Two unknowns – want to solve for a and b
The Loss Function
L = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
Goal is to minimize this function
Using \hat{y}_i = a x_i + b, the equation becomes:
L = \sum_{i=1}^{N} (y_i - a x_i - b)^2
So this is the equation we want to minimize
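A minimal NumPy sketch of this loss (function and variable names are mine):

```python
import numpy as np

def ols_loss(a, b, x, y):
    """Squared-error loss L = sum_i (y_i - (a*x_i + b))^2."""
    residuals = y - (a * x + b)
    return np.sum(residuals ** 2)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.2])
print(ols_loss(2.0, 0.0, x, y))  # loss of the candidate line y = 2x: about 0.06
```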
Solution of OLS
Derivation
L = \sum_{i=1}^{N} (y_i - a x_i - b)^2
Want to minimize L
Take the derivative of the loss function with respect to each variable and set it to zero:
\frac{dL}{da} = 0, \quad \frac{dL}{db} = 0
\frac{dL}{da} = \sum_{i=1}^{N} 2(y_i - a x_i - b)(-x_i) = 0
\Rightarrow \sum_{i=1}^{N} x_i y_i - a \sum_{i=1}^{N} x_i^2 - b \sum_{i=1}^{N} x_i = 0
Solution of OLS
Derivation
\frac{dL}{db} = \sum_{i=1}^{N} 2(y_i - a x_i - b)(-1) = 0
\Rightarrow \sum_{i=1}^{N} y_i - a \sum_{i=1}^{N} x_i - bN = 0
b = \frac{1}{N} \sum_{i=1}^{N} y_i - \frac{a}{N} \sum_{i=1}^{N} x_i
This is the closed-form solution for b (in terms of a)
Solution of OLS
Derivation
From the first equation,
\frac{dL}{da} = \sum_{i=1}^{N} x_i y_i - a \sum_{i=1}^{N} x_i^2 - b \sum_{i=1}^{N} x_i = 0
Substituting the expression for b:
\sum_{i=1}^{N} x_i y_i = a \sum_{i=1}^{N} x_i^2 + \sum_{i=1}^{N} x_i \left( \frac{1}{N} \sum_{j=1}^{N} y_j - \frac{a}{N} \sum_{j=1}^{N} x_j \right)
a = \frac{\sum_{i=1}^{N} x_i y_i - \frac{1}{N} \left( \sum_{i=1}^{N} x_i \right) \left( \sum_{i=1}^{N} y_i \right)}{\sum_{i=1}^{N} x_i^2 - \frac{1}{N} \left( \sum_{i=1}^{N} x_i \right)^2}
This is the closed-form solution for a
Solution of OLS
Optimal Choices
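A minimal NumPy sketch of these closed-form choices (function name is mine):

```python
import numpy as np

def ols_1d(x, y):
    """Closed-form 1D OLS from the derivation above."""
    N = len(x)
    a = (np.sum(x * y) - np.sum(x) * np.sum(y) / N) \
        / (np.sum(x ** 2) - np.sum(x) ** 2 / N)
    b = np.mean(y) - a * np.mean(x)   # b = (1/N)sum(y) - (a/N)sum(x)
    return a, b

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.1, 4.9, 7.0])
print(ols_1d(x, y))  # approximately (2.0, 1.0): slope 2, intercept 1
```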
Interpretation
Interpretation of a and b
a is the slope of the line (the tangent of the angle θ): the effect of the independent variable on the dependent variable
b is the intercept of the line
x – independent variable, y – dependent variable
[Figure: fitted line y = ax + b making angle θ with the x-axis]
Interpretation
Interpretation of L
L = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
Expresses how well the solution captures the variation in the data
R^2 = 1 - \mathrm{MSE} / \mathrm{Var}(y)
R^2 \in [0, 1]
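A minimal sketch of computing R^2 this way (names are mine):

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - MSE / Var(y): fraction of the variance captured by the fit."""
    mse = np.mean((y - y_hat) ** 2)
    return 1.0 - mse / np.var(y)

y = np.array([1.0, 3.0, 5.0, 7.0])
y_hat = np.array([1.2, 2.9, 5.1, 6.8])
print(r_squared(y, y_hat))  # 0.995: the fit captures almost all the variation
```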
Anscombe's Quartet
Same values for mean, variance and best fit line
R2 values are the same for each example
But … linear regression may not be the best for the last three examples
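A small sketch reproducing this check, assuming seaborn's bundled copy of the quartet (fetched on first use):

```python
import numpy as np
import seaborn as sns

df = sns.load_dataset("anscombe")  # columns: dataset, x, y
for name, g in df.groupby("dataset"):
    a, b = np.polyfit(g["x"], g["y"], deg=1)  # least-squares line
    print(f"{name}: slope={a:.2f} intercept={b:.2f} "
          f"mean(y)={g['y'].mean():.2f} var(y)={g['y'].var():.2f}")
# All four datasets print nearly identical numbers,
# even though their scatter plots look completely different.
```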
Multivariable OLS
Definition of Model
Data Matrix
The Loss Function
Multivariable OLS
i = an observation
N = number of observations, i = 1…N
M = number of features
x_i = [x_{i1}, x_{i2}, \ldots, x_{iM}]
y_i – dependent variable
Data matrix:
X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1M} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{NM} \end{bmatrix}
Multivariable OLS
As in the 1D case (y = ax + b \cdot 1), add a column of all 1's to the left of the data matrix so the bias term is included
\hat{y}_i = B_0 + B_1 x_{i1} + B_2 x_{i2} + \cdots + B_M x_{iM} = x_i \cdot B
B = \begin{bmatrix} B_0 \\ \vdots \\ B_M \end{bmatrix}, \quad \hat{y} = XB
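A minimal sketch of building the augmented data matrix in NumPy:

```python
import numpy as np

# Toy data matrix: N = 3 observations, M = 2 features.
X_raw = np.array([[2.0, 3.0],
                  [1.0, 5.0],
                  [4.0, 2.0]])

# Prepend a column of 1's so that B_0 acts as the bias: y_hat = X @ B.
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])
print(X)
# [[1. 2. 3.]
#  [1. 1. 5.]
#  [1. 4. 2.]]
```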
Multivariable OLS
Loss Function
L = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
Still want to minimize L
L = \sum_{i=1}^{N} \left( y_i - (B_0 + B_1 x_{i1} + \cdots + B_M x_{iM}) \right)^2
L = \sum_{i=1}^{N} (y_i - x_i B)^2
In norm notation – the squared L2 norm of the residual vector:
L = \|y - XB\|_2^2 = (y - XB)^T (y - XB)
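A quick numeric check (toy values) that the three forms of L agree:

```python
import numpy as np

X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 5.0]])  # bias column included
y = np.array([3.5, 4.0, 5.0])
B = np.array([1.0, 1.0])

r = y - X @ B                  # residual vector y - XB
print(np.sum(r ** 2))          # sum-of-squares form: 1.25
print(np.linalg.norm(r) ** 2)  # squared L2-norm form: 1.25
print(r @ r)                   # matrix form (y - XB)^T (y - XB): 1.25
```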
Optimization
A Few Facts from Matrix Calculus
\frac{d(ax)}{dx} = a, \quad \frac{d(ax^2)}{dx} = 2ax
The vector analogues used below: \frac{d(a^T x)}{dx} = a and \frac{d(x^T A x)}{dx} = 2Ax for symmetric A
Optimization
Minimizing the Loss
L = (y - XB)^T (y - XB)
Set \frac{dL}{dB} = 0:
\frac{d}{dB} \left[ (y - XB)^T (y - XB) \right] = 0
Expanding, using (XY)^T = Y^T X^T:
\frac{d}{dB} \left[ y^T y - y^T XB - B^T X^T y + B^T X^T X B \right] = 0
-(X^T y) - (X^T y) + 2(X^T X)B = 0
X^T y = (X^T X) B
B = (X^T X)^{-1} X^T y
(assuming X^T X is invertible, which holds when X has full column rank, i.e., none of its columns is a linear combination of the others)
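A minimal sketch of this normal-equation solution; solving the linear system is the usual numerically safer alternative to forming the inverse explicitly:

```python
import numpy as np

def ols_fit(X, y):
    """Solve the normal equations (X^T X) B = X^T y for B."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data generated exactly from y = 1 + 2*x1 + 3*x2 (bias column included).
X = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 2.0, 2.0],
              [1.0, 3.0, 1.0]])
y = X @ np.array([1.0, 2.0, 3.0])
print(ols_fit(X, y))  # recovers [1. 2. 3.]
```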
OLS Pros and Cons
OLS
Pros
Efficient to compute
Unique minimum
Stable under perturbation of data
Easy to interpret
Cons
Influenced by outliers
(X^T X)^{-1} may not exist, e.g. when the features are not linearly independent (see the sketch below)
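A small sketch of that last con: with a duplicated feature column, X^T X is singular, and a pseudoinverse-based solver still returns a least-squares fit:

```python
import numpy as np

# Third column duplicates the second => X^T X is singular,
# so (X^T X)^{-1} does not exist and np.linalg.solve would raise LinAlgError.
X = np.array([[1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 5.0, 5.0]])
y = np.array([5.0, 7.0, 11.0])

# lstsq uses the pseudoinverse and returns the minimum-norm solution.
B, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(rank)  # 2 < 3 columns: X is rank-deficient
print(B)
```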
Summary
Linear Models
1D Ordinary Least Squares (OLS)
Solution of OLS
Interpretation
Anscombe’s Quartet
Multivariate OLS
OLS Pros and Cons