More Machine Learning


Page 1: More Machine Learning

More Machine Learning

Linear Regression
Squared Error
L1 and L2 Regularization
Gradient Descent

Page 2: More Machine Learning

Recall: Key Components of Intelligent Agents

Representation Language: Graph, Bayes Nets

Inference Mechanism: A*, variable elimination, Gibbs sampling

Learning Mechanism: Maximum Likelihood, Laplace Smoothing, many more: linear regression, perceptron, k-Nearest Neighbor, …

Evaluation Metric: Likelihood, many more: squared error, 0-1 loss, conditional likelihood, precision/recall, …

Page 3: More Machine Learning

Recall: Types of Learning

The techniques we have discussed so far are examples of a particular kind of learning:

Supervised: the training examples included the correct labels or outputs. Vs. Unsupervised (or semi-supervised, or distantly-supervised, …): None (or some, or only part, …) of the labels in the training data are known.

Parameter Estimation: We only tried to learn the parameters in the BN, not the structure of the BN graph. Vs. Structure learning: The BN graph is not given as an input, and the learning algorithm’s job is to figure out what the graph should look like.

The distinctions below aren’t actually about the learning algorithm itself, but rather about the type of model being learned:

Classification: the output is a discrete value, like Happy or not Happy, or Spam or Ham. Vs. Regression: the output is a real number.

Generative: The model of the data represents a full joint distribution over all relevant variables. Vs. Discriminative: The model assumes some fixed subset of the variables will always be “inputs” or “evidence”, and it creates a distribution for the remaining variables conditioned on the evidence variables.

Parametric vs. Nonparametric: I will explain this later.

We won’t talk much about structure learning, but we will cover some other kinds of learning (regression, unsupervised, discriminative, nonparametric, …) in later lectures.

Page 4: More Machine Learning

Regression vs. Classification

Our NBC spam detector was a classifier: the output Y was one of two options, Ham or Spam.

More generally, classifiers give an output from a (usually small) finite (or countably infinite) set of options. E.g., predicting who will win the presidency in the next election is a classification problem (finite set of possible outcomes: US citizens).

Regression models give a real number as output. E.g., predicting what the temperature will be tomorrow is a regression problem. Any real number greater than or equal to 0 (Kelvin) is a possible outcome.

Page 5: More Machine Learning

Quiz: regression vs. classification

For each prediction task below, determine whether regression or classification is more appropriate.

Task Regression or Classification?

Predict who will win the Super Bowl next year

Predict the gender of a baby when it’s born

Predict the weight of a child one year from now

Predict the average life expectancy of all babies born today

Predict the price of Apple, Inc.’s stock at the close of trading tomorrow

Predict whether Microsoft or Apple will have a higher valuation at the close of trading tomorrow

Page 6: More Machine Learning

Answers: regression vs. classification

For each prediction task below, determine whether regression or classification is more appropriate.

Task Regression or Classification?

Predict who will win the Super Bowl next year C

Predict the gender of a baby when it’s born C

Predict the weight of a child one year from now R

Predict the average life expectancy of all babies born today R

Predict the price of Apple, Inc.’s stock at the close of trading tomorrow R

Predict whether Microsoft or Apple will have a higher valuation at the close of trading tomorrow C

Page 7: More Machine Learning

Concrete Example

[Scatter plot: House Price ($) vs. Square Footage for a set of houses.]

Suppose I want to buy a house that’s 2000 square feet. Predict how much it will cost.

(Prediction marked on the plot: about $175,000.)

Page 8: More Machine Learning

More realistic data

[Scatter plot: Reported Crime Statistics for U.S. Counties; Violent Crime per Capita vs. Percentage of the population under the federal poverty level.]

Page 9: More Machine Learning

Linear Regression

Suppose there are N input variables, X1, …, XN (all real numbers).

A linear regression is a function that looks like this:

Y = w0 + w1X1 + w2X2 + … + wNXN

The wi variables are called weights or parameters. Each one is a real number.

The set of all functions that look like this (one function for each choice of weights w0 through wN) is called the Hypothesis Class for linear regression.
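As a concrete sketch (illustrative code, not from the slides), one member of this hypothesis class can be written as a small Python function; the names below are arbitrary:

```python
def linear_hypothesis(weights, inputs):
    """Evaluate Y = w0 + w1*X1 + ... + wN*XN.

    weights: [w0, w1, ..., wN]   (N + 1 real numbers)
    inputs:  [X1, ..., XN]       (N real numbers)
    """
    y = weights[0]                               # the intercept weight w0
    for w_i, x_i in zip(weights[1:], inputs):
        y += w_i * x_i
    return y

# Each choice of weights picks out one function in the hypothesis class:
print(linear_hypothesis([100, -2], [10]))        # 100 + (-2)*10 = 80
```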

Page 10: More Machine Learning

Hypotheses

[Scatter plot: House Price ($) vs. Square Footage, with several candidate lines drawn through the data.]

In this example, there is only one input variable: X1 is square footage. The hypothesis class is all functions Y = w0 + w1 * (square footage). Several example elements of the hypothesis class are drawn, for instance:

Y = 100 + 900*X1
Y = 55100 + 900*X1
Y = 80000 + 270*X1

Page 11: More Machine Learning

Learning for Linear Regression

Linear regression tells us a whole set of possible functions to use for prediction.

How do we choose the best one from this set?

This is the learning problem for linear regression:

Input: a set of training examples, where each example contains a value for (X1, …, XN, Y)

Output: a set of weights (w0, …, wN) for the “best-fitting” linear regression model.

Page 12: More Machine Learning

Quiz: Learning for Linear Regression

X    Y
10   80
30   40
15   70
55  -10

For the data on the left, what’s the best fit linear regression model?

Page 13: More Machine Learning

Answer: Learning for Linear Regression

X    Y
10   80
30   40
15   70
55  -10

For the data on the left, what’s the best fit linear regression model?

Using the first two points:
80 = w0 + w1 * 10
40 = w0 + w1 * 30

Subtracting the second equation from the first:
80 - 40 = w1 * 10 - w1 * 30
40 = w1 * (-20)
w1 = -2

Substituting back into the first equation:
80 = w0 + (-2) * 10
w0 = 100

Y = 100 + (-2) * X
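As a quick check (a sketch of my own, not part of the slide), the same answer falls out of solving the two-equation system numerically, and the resulting line fits all four points because this data is exactly linear:

```python
import numpy as np

# Two equations from the first two points: Y = w0 + w1*X
A = np.array([[1.0, 10.0],
              [1.0, 30.0]])
b = np.array([80.0, 40.0])
w0, w1 = np.linalg.solve(A, b)
print(w0, w1)                    # 100.0 -2.0

# The same line passes through every point in the dataset:
for x, y in [(10, 80), (30, 40), (15, 70), (55, -10)]:
    assert np.isclose(w0 + w1 * x, y)
```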

Page 14: More Machine Learning

Linear Regression with Noisy Data

[Scatter plot: House Price ($) vs. Square Footage, with noisy data.]

In the previous example, we could use only two points and find a line that passed through all of the remaining points.

In this example, points are only “approximately” linear. No single line passes through all points exactly. We’ll need a more complex algorithm to handle this.

Page 15: More Machine Learning

Quadratic Loss (a.k.a. “Squared Error”)

Let’s write our training data D with this notation (M examples, each with N inputs and one output):

X11  X12  …  X1N  Y1
X21  X22  …  X2N  Y2
…    …    …  …    …
XM1  XM2  …  XMN  YM

Define

$$\mathrm{LOSS}(f, D) = \sum_{j=1}^{M} \left(Y_j - f(X_{j1}, \ldots, X_{jN})\right)^2$$

Intuitively, this is how much error the function makes on the training data.
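A minimal Python sketch of this loss (illustrative names; each data row is (X_j1, …, X_jN, Y_j)):

```python
def quadratic_loss(weights, data):
    """Sum of squared errors of the linear model Y = w0 + w1*X1 + ... + wN*XN.

    weights: [w0, w1, ..., wN]
    data:    list of rows (X_j1, ..., X_jN, Y_j)
    """
    total = 0.0
    for row in data:
        inputs, y = row[:-1], row[-1]
        prediction = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
        total += (y - prediction) ** 2
    return total

# On the earlier quiz dataset, the exact-fit line Y = 100 - 2*X has zero loss:
print(quadratic_loss([100, -2], [(10, 80), (30, 40), (15, 70), (55, -10)]))   # 0.0
```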

Page 16: More Machine Learning

Objective Function

The goal of a linear regression is to find the best linear function. We’ll say that “best” means the one with the least amount of quadratic loss.

Mathematically, we say we want the f* that satisfies:

$$f^* = \arg\min_f \mathrm{LOSS}(f, D)$$

We call LOSS the objective function for our training algorithm, since it’s the function we’re trying to minimize.

Page 17: More Machine Learning

Closed-form Solution for 1 input variable

To minimize the LOSS function, we’ll take the partial derivatives, and set them to zero. For w0:

$$\frac{\partial \mathrm{LOSS}}{\partial w_0} = -2 \sum_j \left(Y_j - w_0 - w_1 X_{1j}\right)$$

Set this expression equal to zero:

$$\sum_j Y_j - M w_0 - w_1 \sum_j X_{1j} = 0 \quad\Longrightarrow\quad w_0 = \frac{1}{M}\sum_j Y_j - \frac{w_1}{M}\sum_j X_{1j}$$

Page 18: More Machine Learning

Closed-form Solution for 1 input variable

To minimize the LOSS function, we’ll take the partial derivatives, and set them to zero. For w1:

$$\frac{\partial \mathrm{LOSS}}{\partial w_1} = -2 \sum_j X_{1j}\left(Y_j - w_0 - w_1 X_{1j}\right)$$

Set this expression equal to zero:

$$\sum_j X_{1j} Y_j - w_0 \sum_j X_{1j} - w_1 \sum_j X_{1j}^2 = 0$$

Page 19: More Machine Learning

“Closed-form” Result

Substituting the expression for w0 into the second equation gives:

$$w_1 = \frac{\sum_j X_{1j} Y_j - \frac{1}{M}\left(\sum_j Y_j\right)\left(\sum_j X_{1j}\right)}{\sum_j X_{1j}^2 - \frac{1}{M}\left(\sum_j X_{1j}\right)^2}$$

Page 20: More Machine Learning

Quiz: Learning for Linear Regression

X    Y
10   80
30   40
15   70
55  -10

Using the closed-form solution for Quadratic Loss, compute w0 and w1 for this dataset.

$$w_1 = \frac{\sum_i X_{1i} Y_i - \frac{1}{M}\left(\sum_i Y_i\right)\left(\sum_i X_{1i}\right)}{\sum_i X_{1i}^2 - \frac{1}{M}\left(\sum_i X_{1i}\right)^2}$$

$$w_0 = \frac{1}{M}\sum_i Y_i - \frac{w_1}{M}\sum_i X_{1i}$$

Page 21: More Machine Learning

Answer: Learning for Linear Regression

X    Y
10   80
30   40
15   70
55  -10

Using the closed-form solution for Quadratic Loss, compute w0 and w1 for this dataset.

$$w_1 = \frac{\sum_i X_{1i} Y_i - \frac{1}{M}\left(\sum_i Y_i\right)\left(\sum_i X_{1i}\right)}{\sum_i X_{1i}^2 - \frac{1}{M}\left(\sum_i X_{1i}\right)^2} = \frac{800 + 1200 + 1050 - 550 - \frac{1}{4}(180)(110)}{100 + 900 + 225 + 3025 - \frac{1}{4}(110)^2} = \frac{-2450}{1225} = -2$$

$$w_0 = \frac{1}{M}\sum_i Y_i - \frac{w_1}{M}\sum_i X_{1i} = \frac{1}{4}(180) - \frac{(-2)}{4}(110) = 45 + 55 = 100$$

Note that w1, w0 match what we calculated before!
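The same arithmetic, written as a short Python sketch (illustrative code, not from the slides):

```python
# Closed-form least-squares fit for one input variable, on the quiz dataset.
X = [10, 30, 15, 55]
Y = [80, 40, 70, -10]
M = len(X)

sum_x, sum_y = sum(X), sum(Y)                 # 110, 180
sum_xy = sum(x * y for x, y in zip(X, Y))     # 2500
sum_xx = sum(x * x for x in X)                # 4250

w1 = (sum_xy - sum_y * sum_x / M) / (sum_xx - sum_x ** 2 / M)
w0 = sum_y / M - w1 * sum_x / M
print(w1, w0)                                 # -2.0 100.0
```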

Page 22: More Machine Learning

Overfitting and Regularization

It is very common to use a technique called regularization to combat overfitting for linear methods.

Regularization changes the objective function for training by adding a penalty for the size of the weights:

$$\mathrm{LOSS}(f, D) = \sum_{j=1}^{M}\left(Y_j - f(X_{j1}, \ldots, X_{jN})\right)^2 + \lambda \sum_{i} |w_i|^p$$

When p=1, this is called L1 regularization. When p=2, this is called L2 regularization. 1 and 2 are by far the two most commonly-used values of p.

The second term, the penalty on the weights, is the "parameter loss"; λ controls how heavily it counts relative to the fit-to-data term.
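A hedged sketch of this regularized objective in Python (the function name and the choice of λ below are illustrative; in practice the intercept w0 is often left out of the penalty):

```python
def regularized_loss(weights, data, lam, p):
    """Quadratic loss plus an Lp penalty on the weights (p = 1 or p = 2)."""
    data_loss = 0.0
    for row in data:
        inputs, y = row[:-1], row[-1]
        prediction = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
        data_loss += (y - prediction) ** 2
    penalty = lam * sum(abs(w) ** p for w in weights)   # the "parameter loss" term
    return data_loss + penalty

# p=1 gives L1 regularization, p=2 gives L2 regularization.
data = [(10, 80), (30, 40), (15, 70), (55, -10)]
print(regularized_loss([100, -2], data, lam=0.1, p=2))  # 0.0 + 0.1*(100**2 + (-2)**2) = 1000.4
```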

Page 23: More Machine Learning

Gradient Descent

For more complex loss functions, it is often NOT POSSIBLE to find closed-form solutions.

Instead, people resort to “iterative methods” that iteratively find better and better parameter estimates, until they converge to the best setting.

We’ll go over one example of this kind of method, called “gradient descent”.

Page 24: More Machine Learning

Gradient Descent

Gradient Descent Algorithm

Create weights w^(i), starting with i = 0.
1. w^(0) ← some initial values (often zero)
2. While |LOSS(w^(i)) − LOSS(w^(i−1))| > ε:
       for each j: w_j^(i+1) ← w_j^(i) − α · ∂LOSS/∂w_j evaluated at w^(i)
       i ← i + 1
3. Return w^(i)

(α is the learning rate.)
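A minimal Python sketch of this algorithm for the one-variable quadratic loss (the learning rate, tolerance, and iteration cap below are illustrative choices, not values from the slides):

```python
def gradient_descent(X, Y, alpha=1e-4, epsilon=1e-12, max_iters=200_000):
    """Minimize LOSS(w0, w1) = sum_j (Y_j - w0 - w1*X_j)^2 by gradient descent."""
    w0, w1 = 0.0, 0.0                                   # step 1: initial values (often zero)
    prev_loss = float("inf")
    for _ in range(max_iters):
        # Partial derivatives of the quadratic loss at the current weights.
        grad_w0 = sum(-2 * (y - w0 - w1 * x) for x, y in zip(X, Y))
        grad_w1 = sum(-2 * x * (y - w0 - w1 * x) for x, y in zip(X, Y))
        w0 -= alpha * grad_w0                           # step 2: move against the gradient
        w1 -= alpha * grad_w1
        loss = sum((y - w0 - w1 * x) ** 2 for x, y in zip(X, Y))
        if abs(prev_loss - loss) < epsilon:             # stop when the loss stops improving
            break
        prev_loss = loss
    return w0, w1                                       # step 3: return the weights

# On the quiz dataset this converges to approximately w0 = 100, w1 = -2:
print(gradient_descent([10, 30, 15, 55], [80, 40, 70, -10]))
```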

Page 25: More Machine Learning

Quiz: Gradient

positive | about zero | negative

a

b

c

[Figure: LOSS as a function of w, with points a, b, and c marked on the curve.]

Check the boxes that apply.

Page 26: More Machine Learning

Answer: Gradient

positive | about zero | negative

a x

b x

c x

[Figure: LOSS as a function of w, with points a, b, and c marked on the curve.]

Check the boxes that apply.

Page 27: More Machine Learning

Quiz: Gradient

Where is the gradient largest?

a

b

c

Equal everywhere

[Figure: LOSS as a function of w, with points a, b, and c marked on the curve.]

Page 28: More Machine Learning

Answer: Gradient

Where is the gradient largest?

a

b

c

Equal everywhere x

[Figure: LOSS as a function of w, with points a, b, and c marked on the curve.]

Page 29: More Machine Learning

Quiz: Gradient Descent

Which point will allow gradient descent to reach the global minimum, if it is used as the initialization for parameter w?

a

b

c

[Figure: LOSS as a function of w, with points a, b, and c marked on the curve.]

Page 30: More Machine Learning

Answer: Gradient Descent

Which point will allow gradient descent to reach the global minimum, if it is used as the initialization for parameter w?

a

b

c x

[Figure: LOSS as a function of w, with points a, b, and c marked on the curve.]