CSCI 447/547 MACHINE LEARNING
Linear Regression
Outline
Linear Models
1D Ordinary Least Squares (OLS)
Solution of OLS
Interpretation
Anscombe’s Quartet
Multivariate OLS
OLS Pros and Cons
Optional Reading
Terminology
Features (Covariates or predictors)
Labels (Variates or targets)
Regression
Classification
Types of Machine Learning
Unsupervised
Finding structure in data
Supervised
Predict from given data
[Figure: height vs. weight scatter plots illustrating the two supervised tasks – separating women vs. men clusters (classification) and predicting weight from height (regression)]
Classification – categorical output data (Logistic Regression)
OLS Regression (Prediction) – continuous output data
What is a Linear Model?
Predict Housing Prices
Depends on:
Area
# of bedrooms
# of bathrooms
Hypothesis is that the relationship is linear:
Price = k1(Area) + k2(#bed) + k3(#bath)
In general: y_i = a_0 + a_1 x_1 + a_2 x_2 + \cdots
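A minimal Python sketch of this pricing model; the coefficient values are invented purely for illustration:

```python
# Hypothetical linear pricing model: Price = k1*(Area) + k2*(#bed) + k3*(#bath).
# The coefficients k1, k2, k3 below are made up for illustration only.
def predict_price(area, n_bed, n_bath, k1=150.0, k2=10_000.0, k3=7_500.0):
    return k1 * area + k2 * n_bed + k3 * n_bath

print(predict_price(1200, 3, 2))  # 1200*150 + 3*10000 + 2*7500 = 225000.0
```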
Why Use Linear Models?
Interpretable
Relationships are easy to see
Low Complexity
Prevents overfitting
Scalable
Scale up to more data, larger problems
Baseline
Can benchmark other methods against them
Examples of Use
MNIST dataset – handwritten digits
Best performance – neural networks with regularization: 99.79% accurate, takes about a day to train, more difficult to build
Logistic Regression: 92.5% accurate, takes seconds to train, can be built with less expertise (sketched below)
Building Blocks of Later Techniques
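A rough sketch of that logistic-regression baseline, assuming scikit-learn and its OpenML copy of MNIST; the exact accuracy and runtime will vary with solver and hardware:

```python
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Download MNIST: 70,000 28x28 digit images flattened to 784 features.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X_train, X_test, y_train, y_test = train_test_split(
    X / 255.0, y, test_size=10_000, random_state=0)

clf = LogisticRegression(max_iter=200)  # multinomial logistic regression
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))  # roughly 0.92
```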
Optional Reading
Definition of 1-Dimensional OLS
The Problem Statement
i is an observation, we have N of them
i = 1…N
x is the independent variable (feature)
y is the dependent variable (output variable)
y = ax + b, where a and b are constants
\hat{y}_i = a x_i + b, or equivalently y_i = a x_i + b + \varepsilon_i (with noise term \varepsilon_i)
Two unknowns – want to solve for a and b
The Loss Function
L = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
Goal is to minimize this function
Using \hat{y}_i = a x_i + b, the equation becomes:
L = \sum_{i=1}^{N} (y_i - a x_i - b)^2
So this is the equation we want to minimize
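A minimal NumPy sketch of this loss (function and variable names are mine):

```python
import numpy as np

def ols_loss(a, b, x, y):
    """Squared-error loss L = sum_i (y_i - (a*x_i + b))^2."""
    residuals = y - (a * x + b)
    return np.sum(residuals ** 2)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.2])
print(ols_loss(2.0, 0.0, x, y))  # loss of the candidate line y = 2x: about 0.06
```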
Solution of OLS
Derivation
L = \sum_{i=1}^{N} (y_i - a x_i - b)^2
Want to minimize L
Take the derivative of the loss function with respect to each variable and set it to zero:
\frac{dL}{da} = 0, \quad \frac{dL}{db} = 0
\frac{dL}{da} = \sum_{i=1}^{N} 2(y_i - a x_i - b)(-x_i) = 0
\Rightarrow \sum_{i=1}^{N} x_i y_i - a \sum_{i=1}^{N} x_i^2 - b \sum_{i=1}^{N} x_i = 0
Solution of OLS
Derivation
\frac{dL}{db} = \sum_{i=1}^{N} 2(y_i - a x_i - b)(-1) = 0
\Rightarrow \sum_{i=1}^{N} y_i - a \sum_{i=1}^{N} x_i - bN = 0
b = \frac{1}{N} \sum_{i=1}^{N} y_i - \frac{a}{N} \sum_{i=1}^{N} x_i
This is the closed-form solution for b (in terms of a)
Solution of OLS
Derivation
From the first equation,
\frac{dL}{da} = \sum_{i=1}^{N} x_i y_i - a \sum_{i=1}^{N} x_i^2 - b \sum_{i=1}^{N} x_i = 0
Substituting the expression for b:
\sum_{i=1}^{N} x_i y_i = a \sum_{i=1}^{N} x_i^2 + \sum_{i=1}^{N} x_i \left( \frac{1}{N} \sum_{j=1}^{N} y_j - \frac{a}{N} \sum_{j=1}^{N} x_j \right)
a = \frac{\sum_{i=1}^{N} x_i y_i - \frac{1}{N} \left( \sum_{i=1}^{N} x_i \right) \left( \sum_{i=1}^{N} y_i \right)}{\sum_{i=1}^{N} x_i^2 - \frac{1}{N} \left( \sum_{i=1}^{N} x_i \right)^2}
This is the closed-form solution for a
Solution of OLS
Optimal Choices
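A minimal NumPy sketch of these closed-form choices (function name is mine):

```python
import numpy as np

def ols_1d(x, y):
    """Closed-form 1D OLS from the derivation above."""
    N = len(x)
    a = (np.sum(x * y) - np.sum(x) * np.sum(y) / N) \
        / (np.sum(x ** 2) - np.sum(x) ** 2 / N)
    b = np.mean(y) - a * np.mean(x)   # b = (1/N)sum(y) - (a/N)sum(x)
    return a, b

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.1, 4.9, 7.0])
print(ols_1d(x, y))  # approximately (2.0, 1.0): slope 2, intercept 1
```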
Interpretation
Interpretation of a and b
a is the slope of the line (the tangent of the angle θ): the effect of the independent variable on the dependent variable
b is the intercept of the line
x – independent variable, y – dependent variable
[Figure: fitted line y = ax + b making angle θ with the x-axis]
Interpretation
Interpretation of L
L = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
Expresses how well the solution captures the variation in the data
R^2 = 1 - \mathrm{MSE} / \mathrm{Var}(y)
R^2 \in [0, 1]
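A minimal sketch of computing R^2 this way (names are mine):

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - MSE / Var(y): fraction of the variance captured by the fit."""
    mse = np.mean((y - y_hat) ** 2)
    return 1.0 - mse / np.var(y)

y = np.array([1.0, 3.0, 5.0, 7.0])
y_hat = np.array([1.2, 2.9, 5.1, 6.8])
print(r_squared(y, y_hat))  # 0.995: the fit captures almost all the variation
```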
Anscombe's Quartet
Same values for mean, variance and best fit line
R2 values are the same for each example
But … linear regression may not be the best for the last three examples
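A small sketch reproducing this check, assuming seaborn's bundled copy of the quartet (fetched on first use):

```python
import numpy as np
import seaborn as sns

df = sns.load_dataset("anscombe")  # columns: dataset, x, y
for name, g in df.groupby("dataset"):
    a, b = np.polyfit(g["x"], g["y"], deg=1)  # least-squares line
    print(f"{name}: slope={a:.2f} intercept={b:.2f} "
          f"mean(y)={g['y'].mean():.2f} var(y)={g['y'].var():.2f}")
# All four datasets print nearly identical numbers,
# even though their scatter plots look completely different.
```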
Multivariable OLS
Definition of Model
Data Matrix
The Loss Function
Multivariable OLS
i = an observation
N = number of observations, i = 1…N
M = number of features
x_i = [x_{i1}, x_{i2}, \ldots, x_{iM}]
y_i – dependent variable
Data matrix:
X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1M} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{NM} \end{bmatrix}
Multivariable OLS
As in the 1D case (y = ax + b \cdot 1), add a column of all 1's to the left of the data matrix so the bias term is included
\hat{y}_i = B_0 + B_1 x_{i1} + B_2 x_{i2} + \cdots + B_M x_{iM} = x_i \cdot B
B = \begin{bmatrix} B_0 \\ \vdots \\ B_M \end{bmatrix}, \quad \hat{y} = XB
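A minimal sketch of building the augmented data matrix in NumPy:

```python
import numpy as np

# Toy data matrix: N = 3 observations, M = 2 features.
X_raw = np.array([[2.0, 3.0],
                  [1.0, 5.0],
                  [4.0, 2.0]])

# Prepend a column of 1's so that B_0 acts as the bias: y_hat = X @ B.
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])
print(X)
# [[1. 2. 3.]
#  [1. 1. 5.]
#  [1. 4. 2.]]
```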
Multivariable OLS
Loss Function
L = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
Still want to minimize L
L = \sum_{i=1}^{N} \left( y_i - (B_0 + B_1 x_{i1} + \cdots + B_M x_{iM}) \right)^2
L = \sum_{i=1}^{N} (y_i - x_i B)^2
In norm notation – the squared L2 norm of the residual vector:
L = \|y - XB\|_2^2 = (y - XB)^T (y - XB)
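A quick numeric check (toy values) that the three forms of L agree:

```python
import numpy as np

X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 5.0]])  # bias column included
y = np.array([3.5, 4.0, 5.0])
B = np.array([1.0, 1.0])

r = y - X @ B                  # residual vector y - XB
print(np.sum(r ** 2))          # sum-of-squares form: 1.25
print(np.linalg.norm(r) ** 2)  # squared L2-norm form: 1.25
print(r @ r)                   # matrix form (y - XB)^T (y - XB): 1.25
```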
Optimization
A Few Facts from Matrix Calculus
\frac{d(ax)}{dx} = a, \quad \frac{d(ax^2)}{dx} = 2ax
The vector analogues used below: \frac{d(a^T x)}{dx} = a and \frac{d(x^T A x)}{dx} = 2Ax for symmetric A
Optimization
Minimizing the Loss
L = (y - XB)^T (y - XB)
Set \frac{dL}{dB} = 0:
\frac{d}{dB} \left[ (y - XB)^T (y - XB) \right] = 0
Expanding, using (XY)^T = Y^T X^T:
\frac{d}{dB} \left[ y^T y - y^T XB - B^T X^T y + B^T X^T X B \right] = 0
-(X^T y) - (X^T y) + 2(X^T X)B = 0
X^T y = (X^T X) B
B = (X^T X)^{-1} X^T y
(assuming X^T X is invertible, which holds when X has full column rank, i.e., none of its columns is a linear combination of the others)
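A minimal sketch of this normal-equation solution; solving the linear system is the usual numerically safer alternative to forming the inverse explicitly:

```python
import numpy as np

def ols_fit(X, y):
    """Solve the normal equations (X^T X) B = X^T y for B."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data generated exactly from y = 1 + 2*x1 + 3*x2 (bias column included).
X = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 2.0, 2.0],
              [1.0, 3.0, 1.0]])
y = X @ np.array([1.0, 2.0, 3.0])
print(ols_fit(X, y))  # recovers [1. 2. 3.]
```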
OLS Pros and Cons
OLS
Pros
Efficient to compute
Unique minimum
Stable under perturbation of data
Easy to interpret
Cons
Influenced by outliers
(X^T X)^{-1} may not exist, e.g. when the features are not linearly independent (see the sketch below)
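A small sketch of that last con: with a duplicated feature column, X^T X is singular, and a pseudoinverse-based solver still returns a least-squares fit:

```python
import numpy as np

# Third column duplicates the second => X^T X is singular,
# so (X^T X)^{-1} does not exist and np.linalg.solve would raise LinAlgError.
X = np.array([[1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 5.0, 5.0]])
y = np.array([5.0, 7.0, 11.0])

# lstsq uses the pseudoinverse and returns the minimum-norm solution.
B, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(rank)  # 2 < 3 columns: X is rank-deficient
print(B)
```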
Summary
Linear Models
1D Ordinary Least Squares (OLS)
Solution of OLS
Interpretation
Anscombe’s Quartet
Multivariate OLS
OLS Pros and Cons