Transcript of COMP 551 – Applied Machine Learning, Lecture 4: Linear classification

  • COMP 551 – Applied Machine Learning, Lecture 4: Linear classification

    Instructor: Joelle Pineau ([email protected])

    Class web page: www.cs.mcgill.ca/~jpineau/comp551

    Unless otherwise noted, all material posted for this course is copyright of the instructor, and cannot be reused or reposted without the instructor’s written permission.

  • Joelle Pineau 2

    Today’s Quiz

    1. What is meant by the term overfitting? What can cause overfitting? How can one avoid overfitting?

    2. Which of the following increases the chances of overfitting (assuming everything else is held constant):

    a) Reducing the size of the training set.

    b) Increasing the size of the training set.

    c) Reducing the size of the test set.

    d) Increasing the size of the test set.

    e) Reducing the number of features.

    f) Increasing the number of features.

    COMP-551: Applied Machine Learning

  • Joelle Pineau 3

    Evaluation

    • Use cross-validation for model selection.

    • Training set is used to select a hypothesis f from a class of hypotheses F (e.g. regression of a given degree).

    • Validation set is used to compare the best f from each hypothesis class across different classes (e.g. regressions of different degrees).

    – Must be untouched during the process of looking for f within a class F.

    • Test set: Ideally, a separate set of (labeled) data is withheld to get a true estimate of the generalization error. (Often the “validation set” is called the “test set”, without distinction.)

    COMP-551: Applied Machine Learning
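    To make this protocol concrete, here is a minimal NumPy sketch (not from the slides): the 60/20/20 split ratios, the candidate hypothesis classes (polynomial degrees), and all names are illustrative assumptions.

```python
import numpy as np

def train_valid_test_split(x, y, seed=0):
    # Shuffle once, then split 60/20/20 (illustrative ratios, not from the slides).
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_tr, n_va = int(0.6 * len(x)), int(0.2 * len(x))
    tr, va, te = idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]
    return (x[tr], y[tr]), (x[va], y[va]), (x[te], y[te])

def select_and_evaluate(x, y, degrees=(1, 2, 3, 5, 9)):
    # Training set: pick the best f within each class F (degree-d polynomials).
    # Validation set: compare the best f across the classes.
    # Test set: touched exactly once, for the final generalization estimate.
    (xtr, ytr), (xva, yva), (xte, yte) = train_valid_test_split(x, y)
    best_err, best_w = np.inf, None
    for d in degrees:
        w = np.polyfit(xtr, ytr, d)
        err = np.mean((np.polyval(w, xva) - yva) ** 2)
        if err < best_err:
            best_err, best_w = err, w
    test_err = np.mean((np.polyval(best_w, xte) - yte) ** 2)
    return best_w, test_err
```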

  • Joelle Pineau 4

    Validation vs Train error

    [Figure 2.11 from Hastie et al. (Ch. 2, Overview of Supervised Learning): Test and training error as a function of model complexity. Axes: Prediction Error vs. Model Complexity (Low to High); one curve for the Training Sample, one for the Test Sample. The low-complexity end corresponds to High Bias / Low Variance, the high-complexity end to Low Bias / High Variance.]

    …be close to f(x0). As k grows, the neighbors are further away, and then anything can happen.

    The variance term is simply the variance of an average here, and decreases as the inverse of k. So as k varies, there is a bias–variance tradeoff. More generally, as the model complexity of our procedure is increased, the variance tends to increase and the squared bias tends to decrease. The opposite behavior occurs as the model complexity is decreased. For k-nearest neighbors, the model complexity is controlled by k. Typically we would like to choose our model complexity to trade bias off with variance in such a way as to minimize the test error. An obvious estimate of test error is the training error (1/N) ∑i (yi − ŷi)². Unfortunately training error is not a good estimate of test error, as it does not properly account for model complexity.

    Figure 2.11 shows the typical behavior of the test and training error, as model complexity is varied. The training error tends to decrease whenever we increase the model complexity, that is, whenever we fit the data harder. However, with too much fitting, the model adapts itself too closely to the training data, and will not generalize well (i.e., have large test error). In that case the predictions f̂(x0) will have large variance, as reflected in the last term of expression (2.46). In contrast, if the model is not complex enough, it will underfit and may have large bias, again resulting in poor generalization. In Chapter 7 we discuss methods for estimating the test error of a prediction method, and hence estimating the optimal amount of model complexity for a given prediction method and training set.

    COMP-551: Applied Machine Learning

    [From Hastie et al. textbook]
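    The qualitative behavior in Figure 2.11 is easy to reproduce. A hedged sketch follows, with a made-up data-generating process and polynomial degree standing in for model complexity (all choices are illustrative, not from the textbook):

```python
import numpy as np

# Training error falls monotonically with model complexity;
# test error is U-shaped, as in Figure 2.11.
rng = np.random.default_rng(0)
x_tr, x_te = rng.uniform(-1, 1, 30), rng.uniform(-1, 1, 300)
f = lambda x: np.sin(3 * x)                      # assumed true function
y_tr = f(x_tr) + 0.3 * rng.normal(size=30)
y_te = f(x_te) + 0.3 * rng.normal(size=300)

for d in [1, 3, 5, 9, 12]:                       # "model complexity"
    w = np.polyfit(x_tr, y_tr, d)
    tr = np.mean((np.polyval(w, x_tr) - y_tr) ** 2)
    te = np.mean((np.polyval(w, x_te) - y_te) ** 2)
    print(f"degree {d:2d}: train {tr:.3f}  test {te:.3f}")
```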

  • Joelle Pineau 5

    Bias vs Variance

    • Gauss-Markov Theorem says: The least-squares estimates of the parameters w have the smallest variance among all linear unbiased estimates.

    • Insight: Find a lower-variance solution, at the expense of some bias. E.g. include a penalty for model complexity in the error, to reduce overfitting:

    Err(w) = ∑i=1:n ( yi - wTxi)² + λ |model_size|

    λ is a hyper-parameter that controls the size of the penalty.

    COMP-551: Applied Machine Learning

  • Joelle Pineau 6

    Ridge regression (aka L2-regularization)

    • Constrains the weights by imposing a penalty on their size:

    ŵridge = argminw { ∑i=1:n( yi - wTxi)2 + λ∑j=0:mwj2 }

    where λ can be selected manually, or by cross-validation.

    • Do a little algebra to get the solution: ŵridge = (XTX+λI)-1XTY

    – The ridge solution is not equivariant under scaling of the data, so typically need to normalize the inputs first.

    – Ridge gives a smooth solution, effectively shrinking the weights, but it rarely drives weights exactly to 0.

    COMP-551: Applied Machine Learning
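    As a minimal sketch of the closed-form solution above (assuming NumPy; the toy data and λ value are illustrative):

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed form from the slide: w_ridge = (X^T X + lambda*I)^(-1) X^T y.
    # Solving the linear system is preferable to forming the inverse explicitly.
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

# Toy usage: normalize the inputs first, since the ridge solution is
# not equivariant under scaling of the data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)
X = (X - X.mean(axis=0)) / X.std(axis=0)
print(ridge_fit(X, y, lam=1.0))
```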

  • Joelle Pineau 7

    Lasso regression (aka L1-regularization)

    • Constrains the weights by penalizing the absolute value of their

    size:

    ŵlasso = argminW { ∑i=1:n( yi - wTxi)2 + λ∑j=1:m|wj| }

    • Now the solution is non-linear in the outputs y, and there is no closed-form expression. Need to solve a quadratic programming problem instead.

    – More computationally expensive than Ridge regression.

    – Effectively sets the weights of less relevant input features to zero.

    COMP-551: Applied Machine Learning
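    The slides don't commit to a particular solver; as one hedged illustration of the shrink-vs-sparsify contrast, scikit-learn's Ridge and Lasso can be compared on made-up data (the penalty strengths are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two features are relevant; the other eight are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

ridge = Ridge(alpha=1.0).fit(X, y)   # shrinks all weights smoothly
lasso = Lasso(alpha=0.1).fit(X, y)   # sets weights of irrelevant features to 0

print("ridge:", np.round(ridge.coef_, 3))   # small but nonzero everywhere
print("lasso:", np.round(lasso.coef_, 3))   # exact zeros on the noise features
```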

  • Joelle Pineau 8

    Comparing Ridge and Lasso

    COMP-551: Applied Machine Learning

    [Figure: Ridge (left) vs. Lasso (right). Each panel shows contours of equal regression error and contours of equal model complexity penalty in the (w1, w2) plane.]

    Solving L1 regularization

    • The optimization problem is a quadratic program. There is one constraint for each possible sign of the weights (2ⁿ constraints for n weights).

    • For example, with two weights:

    min over w1, w2 of ∑j=1:m ( yj − w1 xj1 − w2 xj2 )²

    such that: w1 + w2 ≤ η,  w1 − w2 ≤ η,  −w1 + w2 ≤ η,  −w1 − w2 ≤ η

    • Solving this program directly can be done for problems with a small number of inputs.

    COMP-652, Lecture 3 - September 8, 2011 31

    Visualizing L1 regularization

    [Figure: contours of the error function and the diamond-shaped L1 constraint region in the (w1, w2) plane, with optimum w*.]

    • If λ is big enough, the circle is very likely to intersect the diamond at one of the corners.

    • This makes L1 regularization much more likely to make some weights exactly 0.

    COMP-652, Lecture 3 - September 8, 2011 32

    L2 Regularization for linear models revisited

    • Optimization problem: minimize error while keeping the norm of the weights bounded:

    min_w JD(w) = min_w (Φw − y)T(Φw − y)

    such that wTw ≤ η

    • The Lagrangian is:

    L(w, λ) = JD(w) − λ(η − wTw) = (Φw − y)T(Φw − y) + λwTw − λη

    • For a fixed λ, and η = λ⁻¹, the best w is the same as obtained by weight decay.

    COMP-652, Lecture 3 - September 8, 2011 27

    Visualizing regularization (2 parameters)

    [Figure: contours of the error function and the circular L2 constraint region in the (w1, w2) plane, with optimum w*.]

    w* = (ΦTΦ + λI)⁻¹ΦTy

    COMP-652, Lecture 3 - September 8, 2011 28


  • Joelle Pineau 9

    A quick look at evaluation functions

    • We call L(Y,fw(x)) the loss function.

    – Least-square / Mean squared-error (MSE) loss:

    L(Y, fw(X)) = ∑i=1:n ( yi - wTxi)2

    • Other loss functions?

    – Absolute error loss: L(Y, fw(X)) = ∑i=1:n | yi – wTxi |

    – 0-1 loss (for classification): L(Y, fw(X)) = ∑i=1:n I ( yi ≠ fw(xi) )

    • Different loss functions make different assumptions.

    – Squared error loss assumes the data can be approximated by a global linear model with Gaussian noise.

    COMP-551: Applied Machine Learning
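    A direct NumPy transcription of the three losses on this slide (the function and variable names are mine):

```python
import numpy as np

def squared_error_loss(y, y_hat):
    # Least-squares / MSE-style loss: sum_i (y_i - yhat_i)^2
    return np.sum((y - y_hat) ** 2)

def absolute_error_loss(y, y_hat):
    # Absolute error loss: sum_i |y_i - yhat_i|
    return np.sum(np.abs(y - y_hat))

def zero_one_loss(y, y_pred):
    # 0-1 loss for classification: sum_i I(y_i != f(x_i)),
    # i.e. the number of misclassified examples.
    return int(np.sum(y != y_pred))
```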

  • Joelle Pineau 10

    Next: Linear models for classification

    COMP-551: Applied Machine Learning

    [Figure 2.1 from Hastie et al.: Linear Regression of a 0/1 Response. A classification example in two dimensions. The classes are coded as a binary variable (BLUE = 0, ORANGE = 1), and then fit by linear regression. The line is the decision boundary defined by xTβ̂ = 0.5. The orange shaded region denotes that part of input space classified as ORANGE, while the blue region is classified as BLUE.]

    The set of points in IR² classified as ORANGE corresponds to {x : xTβ̂ > 0.5}, indicated in Figure 2.1, and the two predicted classes are separated by the decision boundary {x : xTβ̂ = 0.5}, which is linear in this case. We see that for these data there are several misclassifications on both sides of the decision boundary. Perhaps our linear model is too rigid, or are such errors unavoidable? Remember that these are errors on the training data itself, and we have not said where the constructed data came from. Consider the two possible scenarios:

    Scenario 1: The training data in each class were generated from bivariate Gaussian distributions with uncorrelated components and different means.

    Scenario 2: The training data in each class came from a mixture of 10 low-variance Gaussian distributions, with individual means themselves distributed as Gaussian.

    A mixture of Gaussians is best described in terms of the generative model. One first generates a discrete variable that determines which of…


  • Joelle Pineau 11

    Classification problems

    Given a data set D = {(xi, yi)}, i=1:n, with discrete yi, find a hypothesis which “best fits” the data.

    – If yi ∈ {0, 1} this is binary classification (a common special case).

    – If yi can take more than two values, the problem is called multi-class classification.

    – We usually study binary classifiers. Multi-class versions of many binary classification algorithms can be developed in a fairly straightforward way (we will see examples).

    COMP-551: Applied Machine Learning

  • Joelle Pineau 12

    Applications of classification

    • Text classification (spam filtering, news filtering, building web directories, etc.)

    – Features? Classes? Error (loss) function?

    • Image classification (face detection, object recognition, etc.)

    • Prediction of cancer recurrence.

    • Financial forecasting.

    • Many, many more!

    COMP-551: Applied Machine Learning

  • Joelle Pineau 13

    Simple example

    • Given “nucleus size”, predict cancer recurrence.

    • Univariate input: X = nucleus size.

    • Binary output: Y = {NoRecurrence = 0; Recurrence = 1}

    • Try: Minimize the least-square error.

    COMP-551: Applied Machine Learning

    [Figure: two histograms of nucleus size vs. count, one for the NoRecurrence group and one for the Recurrence group.]


  • Joelle Pineau 14

    Predicting a class from linear regression

    [Figure: solution by linear regression on the recurrence data. Univariate real input: nucleus size; output coding: non-recurrence = 0, recurrence = 1; the sum of squared errors is minimized by the red line.]

    • Here the red line is: Y’ = X (XTX)-1 XTY

    • How to get a binary output?

    1. Threshold the output:

    { y > t for Recurrence }

    2. Interpret output as probability:

    y = Pr (Recurrence)

    • Can we find a better model?

    COMP-551: Applied Machine Learning
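    A sketch of the two options above, applied to the least-squares fit Y’ = X(XTX)-1XTY (the threshold t = 0.5 and the reading of the raw score as a probability are illustrative assumptions):

```python
import numpy as np

def linear_regression_classifier(X, y, t=0.5):
    # Fit by least squares, then turn the real-valued output into a class.
    w = np.linalg.solve(X.T @ X, X.T @ y)
    score = X @ w                      # option 2: read as Pr(Recurrence); note the
                                       # score can fall outside [0, 1] -- one reason
                                       # to look for a better model
    label = (score > t).astype(int)    # option 1: predict Recurrence when y > t
    return w, score, label
```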


  • Joelle Pineau 15

    Modeling for binary classification

    • Two probabilistic approaches:

    1. Discriminative learning: Directly estimate P(y|x).

    2. Generative learning: Separately model P(x|y) and P(y). Use Bayes rule to estimate P(y|x):

    P(y=1 | x) = P(x | y=1) P(y=1) / P(x)

    • We will consider both, starting with discriminative learning.

    COMP-551: Applied Machine Learning


  • Joelle Pineau 16

    Probabilistic view of discriminative learning

    • Suppose we have 2 classes: y ∊ {0, 1}.

    • What is the probability of a given input x having class y = 1?

    • Consider Bayes rule (P(x) on top and bottom cancels out):

    P(y=1 | x) = P(x, y=1) / P(x)

    = P(x | y=1) P(y=1) / [ P(x | y=1) P(y=1) + P(x | y=0) P(y=0) ]

    = 1 / ( 1 + P(x | y=0) P(y=0) / ( P(x | y=1) P(y=1) ) )

    = 1 / ( 1 + exp(−a) ) = σ(a)

    where a = ln [ P(x | y=1) P(y=1) / ( P(x | y=0) P(y=0) ) ] = ln [ P(y=1 | x) / P(y=0 | x) ]

    • Here σ has a special form, called the logistic function, and 𝒂 is the log-odds ratio of the data being class 1 vs. class 0.

    COMP-551: Applied Machine Learning
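    A quick numeric check of the derivation above: plugging made-up class-conditional values into both the direct Bayes posterior and σ(a) gives the same number.

```python
import numpy as np

# All four probabilities below are illustrative, for a single fixed input x.
p_x_y1, p_y1 = 0.30, 0.40   # P(x|y=1), P(y=1)
p_x_y0, p_y0 = 0.10, 0.60   # P(x|y=0), P(y=0)

posterior = p_x_y1 * p_y1 / (p_x_y1 * p_y1 + p_x_y0 * p_y0)
a = np.log((p_x_y1 * p_y1) / (p_x_y0 * p_y0))   # log-odds
sigma = 1.0 / (1.0 + np.exp(-a))                # logistic function of a

print(posterior, sigma)   # identical, as the derivation says
```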


  • Joelle Pineau 17

    Discriminative learning: Logistic regression

    • Idea: Directly model the log-odds with a linear function:

    a = w0 + w1x1 + … + wmxm

    • The decision boundary is the set of points for which 𝒂 = 0.

    • The logistic function (= sigmoid curve): σ(wTx) = 1 / (1 + e^(−wTx))

    How do we find the weights? Need an optimization function.

    COMP-551: Applied Machine Learning
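    A small sketch of the logistic function and the a = 0 decision boundary (the weights and inputs are made up):

```python
import numpy as np

def sigmoid(a):
    # Logistic function: sigma(a) = 1 / (1 + exp(-a)).
    return 1.0 / (1.0 + np.exp(-a))

# The log-odds a = w0 + w1*x1 is linear in x, so the decision boundary
# {x : a = 0} is exactly where sigma(a) crosses 0.5.
w = np.array([-4.0, 2.0])                             # [w0, w1], illustrative
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # prepend 1 for the bias term
a = X @ w                                             # [-2, 0, 2]
print(sigmoid(a))   # ~[0.12, 0.5, 0.88]; crosses 0.5 at x1 = 2, where a = 0
```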


  • Joelle Pineau 18

    Fitting the weights

    • Recall: σ(wTxi) is the probability that yi = 1 (given xi), and 1 − σ(wTxi) is the probability that yi = 0.

    • For y ∊ {0, 1}, the likelihood function, Pr(x1,y1, …, xn,yn | w), is:

    ∏i=1:n σ(wTxi)^yi (1 − σ(wTxi))^(1−yi)    (samples are i.i.d.)

    • Goal: Minimize the negative log-likelihood (also called the cross-entropy error function):

    − ∑i=1:n [ yi log(σ(wTxi)) + (1−yi) log(1−σ(wTxi)) ]

    COMP-551: Applied Machine Learning
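    The negative log-likelihood above, written out in NumPy (the clipping constant is a numerical-stability assumption, not from the slides):

```python
import numpy as np

def cross_entropy_error(w, X, y, eps=1e-12):
    # - sum_i [ y_i log sigma(w^T x_i) + (1 - y_i) log(1 - sigma(w^T x_i)) ]
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # sigma(w^T x_i) for every row x_i
    p = np.clip(p, eps, 1 - eps)         # avoid log(0); stability choice only
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```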

  • Joelle Pineau 19

    Gradient descent for logistic regression

    • Error fn: Err(w) = − [ ∑i=1:n yi log(σ(wTxi)) + (1−yi) log(1−σ(wTxi)) ]

    • Take the derivative:

    ∂Err(w)/∂w = − [ ∑i=1:n yi (1/σ(wTxi)) (1−σ(wTxi)) σ(wTxi) xi + …

    COMP-551: Applied Machine Learning

    (Hint: ∂log(σ)/∂σ = 1/σ)

  • Joelle Pineau 20

    Gradient descent for logistic regression

    • Error fn: Err(w) = − [ ∑i=1:n yi log(σ(wTxi)) + (1−yi) log(1−σ(wTxi)) ]

    • Take the derivative:

    ∂Err(w)/∂w = − [ ∑i=1:n yi (1/σ(wTxi)) (1−σ(wTxi)) σ(wTxi) xi + …

    COMP-551: Applied Machine Learning

    (Hint: ∂σ/∂(wTx) = σ(1−σ))

  • Joelle Pineau 21

    Gradient descent for logistic regression

    • Error fn: Err(w) = − [ ∑i=1:n yi log(σ(wTxi)) + (1−yi) log(1−σ(wTxi)) ]

    • Take the derivative:

    ∂Err(w)/∂w = − [ ∑i=1:n yi (1/σ(wTxi)) (1−σ(wTxi)) σ(wTxi) xi + …

    COMP-551: Applied Machine Learning

    (Hint: ∂(wTx)/∂w = x)

  • Joelle Pineau 22

    Gradient descent for logistic regression

    • Error fn: Err(w) = − [ ∑i=1:n yi log(σ(wTxi)) + (1−yi) log(1−σ(wTxi)) ]

    • Take the derivative:

    ∂Err(w)/∂w = − [ ∑i=1:n yi (1/σ(wTxi)) (1−σ(wTxi)) σ(wTxi) xi +

    (1−yi) (1/(1−σ(wTxi))) (1−σ(wTxi)) σ(wTxi) (−1) xi ]

    COMP-551: Applied Machine Learning

    (Hint: ∂(1−σ)/∂(wTx) = −σ(1−σ))

  • Joelle Pineau 23

    Gradient descent for logistic regression

    • Error fn: Err(w) = − [ ∑i=1:n yi log(σ(wTxi)) + (1−yi) log(1−σ(wTxi)) ]

    • Take the derivative:

    ∂Err(w)/∂w = − [ ∑i=1:n yi (1/σ(wTxi)) (1−σ(wTxi)) σ(wTxi) xi +

    (1−yi) (1/(1−σ(wTxi))) (1−σ(wTxi)) σ(wTxi) (−1) xi ]

    = − ∑i=1:n xi ( yi (1−σ(wTxi)) − (1−yi) σ(wTxi) )

    = − ∑i=1:n xi ( yi − σ(wTxi) )

    COMP-551: Applied Machine Learning

  • Joelle Pineau 24

    Gradient descent for logistic regression

    • Error fn: Err(w) = − [ ∑i=1:n yi log(σ(wTxi)) + (1−yi) log(1−σ(wTxi)) ]

    • Take the derivative:

    ∂Err(w)/∂w = − [ ∑i=1:n yi (1/σ(wTxi)) (1−σ(wTxi)) σ(wTxi) xi +

    (1−yi) (1/(1−σ(wTxi))) (1−σ(wTxi)) σ(wTxi) (−1) xi ]

    = − ∑i=1:n xi ( yi (1−σ(wTxi)) − (1−yi) σ(wTxi) )

    = − ∑i=1:n xi ( yi − σ(wTxi) )

    • Now apply iteratively: wk+1 = wk + αk ∑i=1:n xi ( yi − σ(wkTxi) )

    • Can also apply other iterative methods, e.g. Newton’s method, coordinate descent, L-BFGS, etc.

    COMP-551: Applied Machine Learning
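    A minimal implementation of the iterative update above. The fixed step size, the iteration count, and the 1/n averaging of the gradient are my stability choices, not from the slides (the slide's αk allows a schedule):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, y, alpha=0.5, n_iters=2000):
    # Gradient ascent on the sum from the slide:
    # w_{k+1} = w_k + alpha_k * sum_i x_i (y_i - sigma(w_k^T x_i)),
    # here divided by n for a step size that doesn't depend on dataset size.
    n, m = X.shape
    w = np.zeros(m)
    for _ in range(n_iters):
        w = w + alpha * (X.T @ (y - sigmoid(X @ w))) / n
    return w

# Toy usage: 1-D input with a bias column (illustrative data).
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([np.ones(200), x1])
y = (x1 + 0.3 * rng.normal(size=200) > 0).astype(float)
print(fit_logistic(X, y))
```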

  • Joelle Pineau 25

    Modeling for binary classification

    • Two probabilistic approaches:

    1. Discriminative learning: Directly estimate P(y|x).

    2. Generative learning: Separately model P(x|y) and P(y). Use Bayes rule to estimate P(y|x):

    P(y=1 | x) = P(x | y=1) P(y=1) / P(x)

    COMP-551: Applied Machine Learning


  • Joelle Pineau 26

    COMP-551: Applied Machine Learning

    What you should know

    • Basic definition of linear classification problem.

    • Derivation of logistic regression.

    • Linear discriminant analysis: definition, decision boundary.

    • Quadratic discriminant analysis: basic idea, decision boundary.

    • LDA vs QDA pros/cons.

    • Worth reading further:

    – Under some conditions, linear regression for classification and LDA are the same (Hastie et al., p.109-110).

    – Relation between logistic regression and LDA (Hastie et al., 4.4.5).

  • Joelle Pineau 27

    Final notes

    You don’t yet have a team for Project #1? => Use myCourses.

    You don’t yet have a plan for Project #1? => Start planning!

    Feedback on tutorial 1?

    COMP-551: Applied Machine Learning