Page 1

All you wanted to know about Regression…

COSC 526 Class 9

Arvind Ramanathan
Computational Science & Engineering Division
Oak Ridge National Laboratory, Oak Ridge
Ph: 865-576-7266
E-mail: [email protected]

Slides inspired by: Andrew Moore (CMU), Andrew Ng (Stanford)

Page 2

Introducing your guest instructor (Feb 10-12)

• Dr. Sreenivas (Rangan) Sukumar
• Staff member at ORNL:
  – Leader in graph analytics approaches
  – UTK grad…
  – "Healthcare Guru" at ORNL

Side bar:
• The class website will shortly move to a new location at EECS. The original links will still work – they will be redirected to the new location.
• Approved for space on the EECS website!
• The Hadoop server is working (finally) and your accounts are ready (UTK ID). More information on login procedures, as well as access to data, is forthcoming…

Page 3

Last class: Classification with SVMs

• We had a class variable y:
  – Categorical in nature
  – The inputs {x1, x2, …, xn} could be anything
• Formulated a quadratic programming problem that eventually allows us to classify:
  – Solved with stochastic gradient descent (SGD)
• Alterations for big datasets:
  – Minimum enclosing ball (MEB)
  – Shrinking the optimization problem
  – Incremental and decremental SVM learning

Page 4

This class: predicting a real valued y

• Instead of a categorical class value y, we will see how to predict a real-valued y
• Various regression algorithms:
  – Linear regression
  – Regression with varying noise
  – Non-linear regression
• Adapting regression for big data

Page 5

Part I: Linear Regression

Page 6

Regression

As a recent home buyer (or a buyer interested in the market):

  Living Area (sq. ft.)   Price ($1000s)
  2104                    400
  1600                    300
  2400                    370
  1416                    200
  3000                    540

[Figure: scatter plot of price vs. living area]

• Can we predict the prices of other houses as a function of their living area?

Linear regression helps us with this analysis…

Page 7

Linear regression

• Linear regression assumes that the expected value of the output, given some input, is linear
• Simplest way to think about this: y = wx for some unknown w

  Living Area (sq. ft.)   Price ($1000s)
  2104                    400
  1600                    300
  2400                    370
  1416                    200
  3000                    540

[Figure: the data with a fitted line of slope w through the origin]

Given the data, how do we estimate w?

Page 8

Some formalism…

• Assume that our data is formed by yᵢ = wxᵢ + εᵢ, where εᵢ is noise:
  – The noise signals are independent
  – Each is drawn from a Normal distribution: εᵢ ~ N(0, σ²)
• p(y | w, x) is therefore a normal distribution with:
  – Mean wx
  – Variance σ²

Page 9

Linear Regression (1)

• We have a bunch of data {(x1, y1), (x2, y2), …, (xn, yn)}, all of which is evidence about w
• How do we infer w from the data?
• Bayes rule to our rescue:
  – Maximum likelihood estimate (MLE) of w
  – Why MLE? Because you can do it on a computer!

Page 10

MLE of w

• For which value of w is the observed data most likely?
  – i.e., for what w is p(y1, …, yn | w, x1, …, xn) maximized?
  – i.e., for what w is ∏ᵢ p(yᵢ | w, xᵢ) maximized?

Since we assumed the data came from a normal distribution, we know this likelihood explicitly.
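Written out explicitly (a reconstruction; the slide's equation images did not survive extraction):

```latex
w_{\mathrm{MLE}} = \arg\max_w \prod_{i=1}^{n} p(y_i \mid w, x_i)
                 = \arg\max_w \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}
                   \exp\!\left( -\frac{(y_i - w x_i)^2}{2\sigma^2} \right)
```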

Page 11

MLE of w

• Now apply the log-likelihood trick: maximizing the likelihood is the same as maximizing its logarithm
• Equivalently, we can minimize the sum of squared errors

Now we are in familiar territory…
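Concretely, taking logs (again reconstructing the slide's missing equations):

```latex
\log L(w) = \mathrm{const} - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - w x_i)^2
\quad\Longrightarrow\quad
w_{\mathrm{MLE}} = \arg\min_w E(w), \qquad E(w) = \sum_{i=1}^{n} (y_i - w x_i)^2
```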

Page 12

All we have to do is …

• Take the derivative of E(w) w.r.t. w and set it to 0
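Carrying out that derivative (reconstructed):

```latex
\frac{dE}{dw} = -2 \sum_i x_i \,(y_i - w x_i) = 0
\quad\Longrightarrow\quad
w = \frac{\sum_i x_i y_i}{\sum_i x_i^2}
```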

Page 13

What do we mean by this (graphically)?

• If x = sq. ft. and y = price, then out(2104) = w · 2104 is the average price for x = 2104 sq. ft.
• If x = height and y = weight, then out(60) = w · 60 is the average weight for all people 60 in. tall.

Page 14

Multi-linear Regression

• Now, instead of a single x, let's say x comes from a d-dimensional space

  Living Area (sq. ft.)   No. of rooms   Price ($1000s)
  2104                    2              400
  1600                    2              300
  2400                    3              370
  1416                    2              200
  3000                    4              540

How do we think of doing regression?
• Remember there are d dimensions (d = 2 here)
• Can we visualize our data in a way that is easy to "regress"?

Page 15

Matrix algebra to our rescue…

• out(x) = wᵀx = w1x[1] + w2x[2] + … + wdx[d]
• How do we learn w?
• Let's define a cost function: J(w) = Σₖ (yₖ − wᵀxₖ)²

Page 16

MLE is very similar to the simple regression story…

• The MLE is given by: w = (XᵀX)⁻¹ Xᵀy
• XᵀX is a d x d matrix:
  – whose (i, j)th element is Σₖ xₖ[i] xₖ[j]
• Xᵀy is a d-element vector:
  – whose ith element is Σₖ xₖ[i] yₖ
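A minimal numpy sketch of this closed-form solution, using the house data from earlier (variable names are mine; `np.linalg.solve` is used instead of forming the inverse, for numerical stability):

```python
import numpy as np

# Columns: living area (sq. ft.), no. of rooms; target: price ($1000s)
X = np.array([[2104, 2], [1600, 2], [2400, 3], [1416, 2], [3000, 4]], dtype=float)
y = np.array([400, 300, 370, 200, 540], dtype=float)

# Normal equations: solve (X^T X) w = X^T y rather than computing (X^T X)^-1
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)      # learned weights
print(X @ w)  # fitted prices for the training examples
```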

Page 17

How to solve this on a computer?

• Let's say I have an initial guess for w
• I need to search for a w that makes J(w) smaller
• Idea: use gradient descent! (a sketch follows below)

Repeat until convergence:
  For every j = 1…d:
    Calculate the gradient: ∂J/∂wⱼ
    Update: wⱼ ← wⱼ − α ∂J/∂wⱼ
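A sketch of batch gradient descent for this cost (the step size and iteration count are illustrative choices, not from the slides; in practice you would scale the features first):

```python
import numpy as np

def gradient_descent(X, y, alpha=1e-8, iters=100_000):
    """Minimize J(w) = sum_k (y_k - w.x_k)^2 by batch gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        residual = y - X @ w             # prediction errors, shape (m,)
        grad = -2.0 * (X.T @ residual)   # dJ/dw, computed over the full dataset
        w -= alpha * grad                # step downhill
    return w
```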

Page 18

Problem(s) with gradient descent

• It will converge: for linear regression, the cost has a single global minimum, so GD converges to the solution!
• It takes a long time if training examples are large in number:
  – Each iteration scans through the entire training dataset
  – Can do stochastic gradient descent (SGD) in a similar way to what we discussed last time…

Page 19

Pesky detail…

• We always talked about the line as if it passed through the origin (0)
• What if this is not the case?

[Figure: price vs. living area, a fit through the origin vs. a fit with a non-zero intercept]

Page 20

Let’s fake it… neat trick!

• Create a fake input x0 that always has the value 1

  x1     x2   y                x0   x1     x2   y
  2104   2    400              1    2104   2    400
  1600   2    300              1    1600   2    300
  2400   3    370      =>      1    2400   3    370
  1416   2    200              1    1416   2    200
  3000   4    540              1    3000   4    540

  y = w1x1 + w2x2               y = w0x0 + w1x1 + w2x2
                                  = w0 + w1x1 + w2x2
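In numpy this trick is one line (a sketch; the names are mine):

```python
import numpy as np

X = np.array([[2104, 2], [1600, 2], [2400, 3], [1416, 2], [3000, 4]], dtype=float)

# Prepend the fake input x0 = 1 to every row; w[0] then plays the role of w0
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])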

Page 21

Let’s say we know something about the noise added to each data point

• E.g., I know the variance of the noise added to each data point…

  xi    σi²    yi
  0.5   4      0.5
  1     1      1
  2     0.25   1
  2     4      3
  3     0.25   2

Now, how do we do the MLE?

Page 22

MLE with varying noise

Assuming independence among the noise terms, plug in the Gaussian density and simplify;
setting d(LL)/dw = 0 for the minimum gives the weighted least-squares estimate.
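Carrying the algebra through (reconstructed from the surrounding text):

```latex
w_{\mathrm{MLE}} = \arg\min_w \sum_i \frac{(y_i - w x_i)^2}{\sigma_i^2}
\quad\Longrightarrow\quad
w = \frac{\sum_i x_i y_i / \sigma_i^2}{\sum_i x_i^2 / \sigma_i^2}
```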

Page 23

Weighted Regression

• We just saw "weighted regression"
• Points that have "higher confidence" (lower noise) matter more
• Each point is weighted by the inverse of its noise variance
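Applying this to the table from the previous slide (a sketch; the one-dimensional model y = wx is assumed):

```python
import numpy as np

x   = np.array([0.5, 1.0, 2.0, 2.0, 3.0])
var = np.array([4.0, 1.0, 0.25, 4.0, 0.25])  # known noise variance per point
y   = np.array([0.5, 1.0, 1.0, 3.0, 2.0])

# Weighted MLE for y = w*x: each point contributes with weight 1/sigma_i^2
w = np.sum(x * y / var) / np.sum(x**2 / var)
print(w)
```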

Page 24

Part II: Non-linear Regression

Page 25

Non-linear regression…

• Suppose y is related to a function of x in such a way that the predicted values have a non-linear relationship…

  xi    yi
  0.5   0.05
  1     2.5
  2     3
  3     2
  3     3

Assume a non-linear model relating y to x…

Page 26

Non-linear MLE

• Ugly, ugly algebra!!! What do we do?
  – Line search
  – Simulated annealing
  – GD and SGD
  – Newton's method
  – Expectation Maximization!

Page 27

Polynomial Regression…

• All this while, we were talking about linear regression
• But it may not be the best way to describe the data
• Be careful about how you fit the data…

Page 28

Suppose we add an additional term…

• Quadratic regression: each component is now called a term
• Each column is called a term column
• How many terms in a quadratic regression with p inputs?
  – 1 constant term
  – p linear terms
  – (p+1)C2 quadratic terms => O(p²) terms in total

Solving our MLE:
• Similar to our linear regression: w = (XᵀX)⁻¹(Xᵀy)
• Cost will be O(p⁶)
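A sketch of building those term columns (my helper, not from the slides); the expanded matrix can then be fed to the same normal-equations solver:

```python
import numpy as np
from itertools import combinations_with_replacement

def quadratic_features(X):
    """Expand rows into [constant, linear terms, all degree-2 products].
    With p inputs this gives 1 + p + C(p+1, 2) term columns."""
    m, p = X.shape
    cols = [np.ones(m)]                                # 1 constant term
    cols += [X[:, i] for i in range(p)]                # p linear terms
    cols += [X[:, i] * X[:, j]                         # (p+1)C2 quadratic terms
             for i, j in combinations_with_replacement(range(p), 2)]
    return np.column_stack(cols)
```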

Page 29

Generalizing: p inputs, Qth degree polynomial… how many terms?

• = the number of unique terms of the form x1^q1 x2^q2 … xp^qp with q1 + … + qp ≤ Q
• = the number of lists of non-negative integers [q0, q1, …, qp] with q0 + q1 + … + qp = Q
• = (Q+p)CQ terms!!
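A quick sanity check (my example, not on the slide): for p = 2 inputs and degree Q = 2,

```latex
\binom{Q + p}{Q} = \binom{4}{2} = 6
= \underbrace{1}_{\text{constant}} + \underbrace{2}_{\text{linear}} + \underbrace{3}_{\text{quadratic}}
```

which matches the term count from the previous slide.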

Page 30

Notes of caution…

• Is a polynomial with p = 2 better than one with p = 5?
• A linear fit underfits the data:
  – the data shows structure not captured by the model
• A high-degree polynomial fit overfits the data:
  – the model fits the training data too strongly…

Moral of the story:
• Selecting the model is important
• Even more important is the selection of the features!!

Page 31

Locally Weighted Regression (LWR)

• An approach to reduce the dependency on selecting features:
  – Many datasets don't have linear descriptors
• We have seen this before:
  – In the weighted regression model
• How do we choose the right weights?

Page 32

Using the Kernel Trick once again…

• out(x) = Σⱼ wⱼ Φⱼ(x), where Φ(x) is the kernel function

How do we estimate w?

Page 33

Using the Kernel Trick once again…

• out(x) = Σⱼ wⱼ Φⱼ(x), where Φⱼ(x) = exp(−||x − cⱼ||² / (2 KW²)) is a Gaussian kernel (basis) function

• All cⱼ are held constant. We will just initialize them at random, or on a uniformly spaced grid in d dimensions…
• KW, the kernel width, is also held constant. It will be some value that ensures good overlap between the basis functions…

Page 34

How do we estimate w?

• Same as before…
  – Given the Q basis functions, define a matrix Z such that Zₖⱼ = Φⱼ(xₖ)
  – Here xₖ is the kth input vector…
• Now, we solve: w = (ZᵀZ)⁻¹(Zᵀy)
• How do we find the cⱼ and KW?
  – Use BGD / SGD…
  – Other methods will also work

These Φⱼ are also referred to as radial basis functions (RBFs).
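A compact sketch under these assumptions (Gaussian basis functions, centers on a uniform grid, fixed kernel width; all names and the 1-D setting are mine):

```python
import numpy as np

def rbf_design_matrix(x, centers, kw):
    """Z[k, j] = exp(-(x_k - c_j)^2 / (2 * kw^2)) for 1-D inputs."""
    d2 = (x[:, None] - centers[None, :]) ** 2
    return np.exp(-d2 / (2.0 * kw**2))

x = np.array([0.5, 1.0, 2.0, 3.0, 3.0])
y = np.array([0.05, 2.5, 3.0, 2.0, 3.0])

centers = np.linspace(x.min(), x.max(), 4)  # uniformly spaced grid of c_j
kw = 0.8                                    # chosen so the bumps overlap well
Z = rbf_design_matrix(x, centers, kw)
w = np.linalg.solve(Z.T @ Z, Z.T @ y)       # same normal equations as before
```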

Page 35

What are good radial basis function choices?

• We talked about overlaps…

[Figure: three RBF fits over living area vs. no. of rooms, with basis functions of different widths]

Just about right overlap…   Too little overlap?   Too much overlap?

Page 36

Robust Regression…

• Best quadratic fit:
  – what is the problem here? (a few outlying points drag the fit away from the rest of the data)
• What would we want?
  – a better fit to the varying data!
  – How can we find the better-fitting curve?

Page 37

LOESS-based Robust Regression

• After the initial fit, score each data point by how well it is fitted

[Figure: fitted curve with points annotated "good data point", "good data point", "not that bad either", "this is a horrible data point"]

Repeat until convergence:
  For every k = 1…m:
    Let (xₖ, yₖ) be the kth data point
    Let ŷₖ be the estimate at the kth data point
    Let wₖ be a weight that is large if the data point is fitted well and very small if it is not
  Redo the regression with the weighted data points (a concrete sketch follows below)

How do we know we have converged? Use expectation maximization (EM).
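One concrete version of that loop (the weight function is my illustrative choice; the slides do not pin one down):

```python
import numpy as np

def robust_fit(X, y, iters=10):
    """Iteratively reweighted least squares: fit, downweight points
    with large residuals, and refit."""
    w_pts = np.ones(len(y))                  # start with uniform weights
    theta = None
    for _ in range(iters):
        W = np.diag(w_pts)
        theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # weighted regression
        resid = np.abs(y - X @ theta)
        scale = np.median(resid) + 1e-12                   # robust residual scale
        w_pts = 1.0 / (1.0 + (resid / scale) ** 2)         # horrible points -> tiny weight
    return theta
```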

Page 38

Multilinear Interpolation

• How do we create a piecewise linear fit to the data?

• Create a set of "knot points" equally spaced along the data…
• Assume that the data points are generated by a noisy function that is allowed to bend only at these knot points…
• We can then do a linear regression for every segment identified here…

Page 39

How to find the best fit?

• With some algebraic manipulations…

[Figure: knot points q1 … q6 along the x-axis, with heights (e.g., h2, h3) to be estimated at each knot]


Page 41

Can we do classification with this?

• Map y to {0, 1} – negative and positive class
• Function: the logistic/sigmoid function
• Note that g(θᵀx) → 1 as θᵀx → ∞
• and g(θᵀx) → 0 as θᵀx → −∞
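The function itself (standard form, reconstructing the slide's image):

```latex
h_\theta(x) = g(\theta^T x), \qquad g(z) = \frac{1}{1 + e^{-z}}
```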

Page 42

How do we do MLE on this?
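The slide's derivation did not survive extraction; the standard steps (following the Ng notes these slides credit) are:

```latex
L(\theta) = \prod_{i=1}^{m} h_\theta(x_i)^{y_i} \left(1 - h_\theta(x_i)\right)^{1 - y_i},
\qquad
\ell(\theta) = \sum_{i=1}^{m} y_i \log h_\theta(x_i) + (1 - y_i) \log\!\left(1 - h_\theta(x_i)\right)
```

Maximizing ℓ(θ) by gradient ascent gives the update θⱼ := θⱼ + α Σᵢ (yᵢ − hθ(xᵢ)) xᵢⱼ.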

Page 43

Another approach to maximize L(θ)

• Use Newton's method: find a zero of the derivative of L(θ)

Hessian: an n x n matrix that keeps track of all the second partial derivatives
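The resulting update (standard; reconstructed):

```latex
\theta := \theta - H^{-1} \nabla_\theta \ell(\theta)
```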

Page 44

Generalizing further…

• Regression: y | x; θ is Gaussian
• Classification: y | x; θ is Bernoulli
• Begin by defining an exponential family of distributions, with a natural parameter η, a sufficient statistic T(y), and a log partition function a(η)
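In symbols (the standard form these labels refer to):

```latex
p(y; \eta) = b(y) \exp\!\left(\eta^{T} T(y) - a(\eta)\right)
```

where η is the natural parameter, T(y) the sufficient statistic, and a(η) the log partition function.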

Page 45

Bernoulli and Gaussian as specific GLMs

Page 46

Softmax Regression

• Instead of a response variable y taking values in {0, 1}, we can let y take one of k values {1, 2, …, k}
• Ex.: mail classification = {spam, personal mail, work mail, advertisement}
• This is a GLM with a multinomial response…
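The resulting hypothesis (standard softmax form):

```latex
p(y = i \mid x; \theta) = \frac{e^{\theta_i^T x}}{\sum_{j=1}^{k} e^{\theta_j^T x}}
```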

Page 47

Part III: What do we do with Big Data?

Page 48

Can we make Regression Faster?

• Regression costs at least O(p²m):
  – where p is the number of features (columns)
  – and m is the number of training examples
• Usually only a subset of the p features, k of them, is relevant: k << p
• What can we do to exploit this?
  – Variance inflation factor (VIF) regression: O(pm)

Page 49

VIF regression

• Evaluation step:
  – approximate the partial correlation of each candidate variable (feature xi) with y, using a small pre-sampled set of data
  – [stagewise regression]
• Search step:
  – Test each xi sequentially using an α-investing rule

D. Lin, D. P. Foster, and L. H. Ungar, "VIF Regression," arXiv, 2012

Page 50

Other standard approaches also work…

• MapReduce
• Gather/Apply/Scatter (GAS) [to be seen in the future]
• Spark!

What you need to know:
• Regression is one of the most commonly used ML algorithms
• It comes in many flavors and can be generalized using GLMs
• Research still needs to be carried out for big datasets