Logistic Regression Ram Akella Lecture 3 February 2, 2011 UC Berkeley Silicon Valley Center/SC.

Overview

Motivating example
Why not ordinary linear regression?
The logistic formulation
  Probability of “success”
  Odds of “success”
  Logit of “success”
The logistic regression model
Running the model
Interpreting the output
Evaluating the goodness of fit

The Aim of Classification Methods

Similar to ordinary regression models, except the response, Y, is categorical.

Y indicates the “group” membership of each observation (each category is a group). Y = C1, C2,…

Predictors X1, X2, … are continuous and/or categorical.

Aims:

Profiling (=Explanatory): What are the differences (in terms of X1, X2,…) between the various groups? (as indicated by Y)

Classification (=Prediction): Predict Y (group membership) on the basis of X1, X2,…

Example 1: Classifying Firm Status

Financial analysts are interested in predicting the future solvency of firms. In order to predict whether a firm will go bankrupt in the near future, it is useful to look at different ratio measures of financial health, such as:

Cash_Debt: cash flow / total debt
ROA: net income / total assets
Current: current assets / current liabilities
Assets_Sales: current assets / net sales

The response is Status: bankrupt / solvent.

Example 2: Profiling Customers by Beer Preference

A beer-maker would like to know the characteristics that distinguish customers who prefer light beer from those who prefer regular beer. The beer-maker would like to use this information to predict any customer’s beer preference based on gender, marital status, income and age.

Example: Beer Preference

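Consider the data on beer preferences (light or regular) of 100 customers, along with their age, income, gender and marital status. Suppose we code the response variable as

$$Y = \begin{cases} 1 & \text{if Light} \\ 0 & \text{if Regular} \end{cases}$$

Now we fit the multiple regression model

$$Y = \beta_0 + \beta_1\,\text{Gender} + \beta_2\,\text{Married} + \beta_3\,\text{Income} + \beta_4\,\text{Age} + \varepsilon$$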

Model assumptions:
Observations/residuals are independent
Residuals are normally distributed
Linear model is adequate
Variance of residuals is constant

Which assumptions are violated? What about predictions from this model?
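One way to see the problem with predictions is to fit ordinary least squares to a 0/1 response and inspect the fitted values. The sketch below uses a small made-up data set (the ages and responses are hypothetical, not the beer data from the lecture) and a single predictor for brevity.

```python
# Hypothetical toy data: ages of 11 customers and a 0/1 response
# (1 = prefers light beer, 0 = prefers regular beer).
ages = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
light = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]


def fit_simple_ols(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = fit_simple_ols(ages, light)
fitted = [b0 + b1 * a for a in ages]

# Fitted values at the extremes leave the [0, 1] range,
# so they cannot be read as probabilities.
print(f"fitted at age 20: {fitted[0]:.3f}")
print(f"fitted at age 70: {fitted[-1]:.3f}")
```

With a 0/1 response the residuals can also take only two values at each x, which is at odds with the normality and constant-variance assumptions.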

Different Formulation

Let π = Prob(Y=1). In the beer example, π is the probability that a customer prefers light beer. It follows that Prob(Y=0) = 1 − π.

In order to get rid of the 0/1 values, we can look at a function of π and treat it as the response of the model.

Logistic Regression

Logistic regression learns the conditional distribution P(y | x).

We will assume two classes, y = 0 and y = 1.

The parametric form for P(y = 1 | x, w) and P(y = 0 | x, w) is:

$$P(y = 1 \mid x, w) = \frac{1}{1 + e^{-w \cdot x}}$$

$$P(y = 0 \mid x, w) = 1 - \frac{1}{1 + e^{-w \cdot x}} = \frac{e^{-w \cdot x}}{1 + e^{-w \cdot x}}$$

where w is the parameter vector w = [w1, w2, …, wk].

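Logistic Regression

We can represent the log of the probability ratio as a linear combination of the features:

$$\log \frac{P(y = 1 \mid x, w)}{P(y = 0 \mid x, w)} = \log \frac{\dfrac{1}{1 + e^{-w \cdot x}}}{\dfrac{e^{-w \cdot x}}{1 + e^{-w \cdot x}}} = \log \frac{1}{e^{-w \cdot x}} = \log e^{w \cdot x} = w \cdot x$$

This is known as the log odds.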

Logistic Regression

A linear function wx, which ranges over (−∞, ∞), can be transformed to the range (0, 1) using the sigmoid function g(x, w):

$$g(x, w) = \frac{1}{1 + e^{-w \cdot x}}$$
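A minimal sketch of the sigmoid transformation, assuming x and w are plain Python lists of equal length. The names `sigmoid` and `g` and the example numbers are illustrative choices, not taken from the lecture.

```python
import math


def sigmoid(z):
    """Map a real number z into the unit interval; avoids overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def g(x, w):
    """P(y = 1 | x, w) for a feature list x and a weight list w."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))


# Hypothetical example: the first feature is fixed at 1.0 to act as an intercept.
x = [1.0, 2.0]
w = [-1.0, 0.75]

p1 = g(x, w)
# Taking the log odds of the probability recovers the linear function w.x = 0.5
log_odds = math.log(p1 / (1.0 - p1))
```

The last two lines check the relationship between the sigmoid and the log odds numerically: applying the logit to g(x, w) gives back wx.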

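Logistic Regression

Given P(y | x), we predict ŷ = 1 if the expected loss of predicting 0 is greater than the expected loss of predicting 1, where L(0,1) is the loss of predicting 0 when the true class is 1 and L(1,0) is the loss of predicting 1 when the true class is 0 (for now assume L(0,1) = L(1,0)).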

Logistic Regression

This assumed L(0,1) = L(1,0).

A similar derivation can be done for arbitrary L(0,1) and L(1,0), for example if we decide that one class is more important to detect than the other.
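The decision rule with unequal losses can be sketched as follows. The sketch assumes that correct predictions cost nothing, so the expected loss of predicting 0 is L(0,1)·P(y=1|x) and the expected loss of predicting 1 is L(1,0)·P(y=0|x). The function and parameter names are hypothetical.

```python
def predict(p1, loss_01=1.0, loss_10=1.0):
    """Return 1 if predicting 0 has the larger expected loss, else 0.

    p1      : P(y = 1 | x)
    loss_01 : L(0,1), loss of predicting 0 when the true class is 1
    loss_10 : L(1,0), loss of predicting 1 when the true class is 0
    """
    expected_loss_predict_0 = loss_01 * p1
    expected_loss_predict_1 = loss_10 * (1.0 - p1)
    return 1 if expected_loss_predict_0 > expected_loss_predict_1 else 0


def threshold(loss_01=1.0, loss_10=1.0):
    """Probability cutoff equivalent to the rule in predict()."""
    return loss_10 / (loss_01 + loss_10)


# Equal losses give the usual 0.5 cutoff.
print(threshold())            # 0.5
# Missing a class-1 case costs four times as much: the cutoff drops to 0.2.
print(threshold(4.0, 1.0))    # 0.2
```

Rearranging the inequality shows that unequal losses only move the probability cutoff; with equal losses the rule reduces to predicting 1 whenever wx > 0.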

Maximum Likelihood Learning

The likelihood function is the probability of the data (x, y) given the parameters w: p(x, y | w)

It is a function of the parameters.

Maximum likelihood learning finds the parameters that maximize this likelihood function.

A common trick is to work with the log-likelihood, i.e., take the logarithm of the likelihood function: log p(x, y | w)

Computing the Likelihood

In our framework, we assume each training example (xi, yi) is drawn independently from the same (but unknown) distribution P(x, y) (the i.i.d. assumption); hence, we can write:

$$\log P(x, y \mid w) = \log \prod_i P(x_i, y_i \mid w) = \sum_i \log P(x_i, y_i \mid w)$$

This is the function that we will maximize:

$$\arg\max_w \log P(x, y \mid w) = \arg\max_w \sum_i \log P(x_i, y_i \mid w) = \arg\max_w \sum_i \log \big[ P(y_i \mid x_i, w)\, P(x_i \mid w) \big]$$

Computing the Likelihood

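Further, P(x | w) = P(x) because x does not depend on w, so:

$$\arg\max_w \sum_i \log \big[ P(y_i \mid x_i, w)\, P(x_i) \big] = \arg\max_w \sum_i \log P(y_i \mid x_i, w)$$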

Computing the Likelihood

This can be written as:

Then the objective learning function is:
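A sketch of the standard forms these two statements refer to, assuming g(x, w) denotes P(y = 1 | x, w) and using J(w) as a hypothetical name for the objective. Because y takes only the values 0 and 1, both class probabilities fit in one expression:

$$P(y_i \mid x_i, w) = g(x_i, w)^{y_i} \, \big(1 - g(x_i, w)\big)^{1 - y_i}$$

Taking logarithms and summing over the training examples gives the log-likelihood objective:

$$J(w) = \sum_i \Big[ y_i \log g(x_i, w) + (1 - y_i) \log \big(1 - g(x_i, w)\big) \Big]$$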

Fitting the Logistic Regression with Gradient Ascent

Gradient Ascent Algorithm
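A minimal sketch of gradient ascent on the logistic log-likelihood, not necessarily the exact algorithm shown in the lecture. It relies on the standard gradient of the log-likelihood, whose j-th component is the sum over examples of (y_i − g(x_i, w))·x_ij. The step size, iteration count and toy data are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def log_likelihood(X, y, w):
    """Sum of y_i*log(g_i) + (1 - y_i)*log(1 - g_i) over the training set."""
    total = 0.0
    for xi, yi in zip(X, y):
        gi = sigmoid(dot(w, xi))
        total += yi * math.log(gi) + (1 - yi) * math.log(1.0 - gi)
    return total


def gradient_ascent(X, y, step=0.1, iters=5000):
    """Maximize the log-likelihood by repeated steps along its gradient."""
    w = [0.0] * len(X[0])
    n = len(X)
    for _ in range(iters):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = yi - sigmoid(dot(w, xi))      # y_i - g(x_i, w)
            for j, xij in enumerate(xi):
                grad[j] += err * xij
        # Step uphill; dividing by n keeps the step size independent of n.
        w = [wj + step * gj / n for wj, gj in zip(w, grad)]
    return w


# Hypothetical toy data; the leading 1.0 in each row is the intercept feature.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
y = [0, 0, 1, 0, 1, 1]

w_hat = gradient_ascent(X, y)
ll_start = log_likelihood(X, y, [0.0, 0.0])
ll_fit = log_likelihood(X, y, w_hat)
print(w_hat, ll_start, ll_fit)
```

The toy classes overlap on purpose: with perfectly separable data the likelihood has no finite maximizer and the weights would keep growing.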

Multi-class Case

Choose class K to be the “reference class” and represent each of the other classes as a logistic function of the odds of class k versus class K:

Conditional probability for class k ≠ K can be computed as:

For class K the conditional probability is:
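A sketch of the usual reference-class formulation, offered as an assumption about what the missing equations state. Each non-reference class k has its own weight vector w_k with log(P(y=k|x) / P(y=K|x)) = w_k·x, which gives P(y=k|x) = exp(w_k·x) / (1 + Σ_j exp(w_j·x)) and P(y=K|x) = 1 / (1 + Σ_j exp(w_j·x)), where the sum runs over the K−1 non-reference classes. The function name and the numbers are hypothetical.

```python
import math


def multiclass_probs(x, W):
    """Class probabilities with the last class (K) as the reference class.

    W holds one weight vector per non-reference class k = 1..K-1, so that
    log(P(y=k|x) / P(y=K|x)) = w_k . x
    """
    scores = [math.exp(sum(wj * xj for wj, xj in zip(w_k, x))) for w_k in W]
    denom = 1.0 + sum(scores)
    probs = [s / denom for s in scores]   # classes 1..K-1
    probs.append(1.0 / denom)             # reference class K
    return probs


# Hypothetical example with K = 3 classes and two features.
x = [1.0, 2.0]
W = [[0.5, 0.25], [-1.0, 0.25]]   # w_1.x = 1.0, w_2.x = -0.5
probs = multiclass_probs(x, W)
print(probs, sum(probs))
```

With K = 2 this reduces to the two-class model, since the single non-reference class gets probability e^{wx} / (1 + e^{wx}) = 1 / (1 + e^{−wx}).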

Example

A 1959 article presents data concerning the proportion of coal miners who exhibit the symptoms of severe pneumoconiosis and the number of years of exposure. y is the proportion of miners who have severe symptoms.

# years   # severe cases   # of miners   Proportion of severe cases, y
  5.8          0               98            0
 15.0          1               54            0.0185
 21.5          3               43            0.0698
 27.5          8               48            0.1667
 33.5          9               51            0.1765
 39.5          8               38            0.2105
 46.0         10               28            0.3571
 51.5          5               11            0.4545

Example

The fitted model is

$$\hat{y} = \frac{1}{1 + \exp(4.7965 - 0.0935x)}$$

Example

The covariance matrix is:

$$\mathrm{Var}(\hat{w}) = \begin{bmatrix} 0.323283 & -0.0083480 \\ -0.0083480 & 0.0002380 \end{bmatrix}$$
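The coefficients and covariance matrix for the pneumoconiosis data can be checked with a short Newton-Raphson fit of the grouped binomial likelihood. This is a sketch of one standard fitting method, not necessarily the software used for the lecture; the covariance estimate is the inverse of the negative Hessian at the solution.

```python
import math

years = [5.8, 15.0, 21.5, 27.5, 33.5, 39.5, 46.0, 51.5]
cases = [0, 1, 3, 8, 9, 8, 10, 5]
miners = [98, 54, 43, 48, 51, 38, 28, 11]


def fit_newton(x, y, n, iters=25):
    """Newton-Raphson for logit(p) = w0 + w1*x with y successes out of n."""
    w0, w1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, ni in zip(x, y, n):
            p = 1.0 / (1.0 + math.exp(-(w0 + w1 * xi)))
            r = yi - ni * p                 # observed minus expected count
            v = ni * p * (1.0 - p)          # binomial variance weight
            g0 += r
            g1 += r * xi
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system H * delta = gradient and update.
        w0 += (h11 * g0 - h01 * g1) / det
        w1 += (h00 * g1 - h01 * g0) / det
    # Inverse of the information matrix from the last iteration
    # (evaluated just before the final, negligible update).
    cov = [[h11 / det, -h01 / det], [-h01 / det, h00 / det]]
    return (w0, w1), cov


w, cov = fit_newton(years, cases, miners)
print(w)
print(cov)
```

The negative off-diagonal entries reflect the usual trade-off between intercept and slope when all x values are positive.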

Example

Logistic Regression Table

Predictor   w         Z       P      Odds Ratio   Conf Interval
Constant    -4.7964   -8.44
Years        0.093     6.06   0.00   1.10         1.07-1.13
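The entries of this table follow from the coefficients and the diagonal of the covariance matrix. The sketch below assumes the interval is a 95% interval (multiplier 1.96), which the table does not state, and uses the values 0.323283 and 0.000238 as the two variances.

```python
import math

w0, w1 = -4.7965, 0.0935
var_w0, var_w1 = 0.323283, 0.000238   # diagonal of the covariance matrix

se_w0 = math.sqrt(var_w0)
se_w1 = math.sqrt(var_w1)

# Wald z statistics: coefficient divided by its standard error
z_w0 = w0 / se_w0
z_w1 = w1 / se_w1

# Two-sided p-value from the standard normal distribution
p_w1 = math.erfc(abs(z_w1) / math.sqrt(2.0))

# Odds ratio and its confidence interval (assumed 95%, multiplier 1.96)
odds_ratio = math.exp(w1)
ci_low = math.exp(w1 - 1.96 * se_w1)
ci_high = math.exp(w1 + 1.96 * se_w1)

print(round(z_w0, 2), round(z_w1, 2))
print(round(odds_ratio, 2), round(ci_low, 2), round(ci_high, 2))
```

The interval is computed on the log-odds scale and then exponentiated, which is why it is not symmetric around the odds ratio.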

Interpretation of the Parameters

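Suppose we have a single regressor x. The linear predictor evaluated at xi is:

$$\hat{\eta}(x_i) = \hat{w}_0 + \hat{w}_1 x_i$$

If we increase the value of the regressor by one unit, then:

$$\hat{\eta}(x_i + 1) = \hat{w}_0 + \hat{w}_1 (x_i + 1)$$

The difference between the two predicted values is:

$$\hat{\eta}(x_i + 1) - \hat{\eta}(x_i) = \hat{w}_1$$

Since the linear predictor is the log odds, ŵ1 is the change in the log odds for a one-unit increase in x.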

The odds ratio

The odds ratio can be interpreted as the multiplicative change in the odds of success associated with a one-unit change in the value of the predictor variable. It is defined as:

$$\hat{O}_R = \frac{\text{odds}_{x_i + 1}}{\text{odds}_{x_i}} = \exp(\hat{w}_1)$$

Example

Following the pneumoconiosis data, we have the model:

$$\hat{y} = \frac{1}{1 + \exp(4.7965 - 0.0935x)}$$

The resulting odds ratio is

$$\hat{O}_R = \exp(\hat{w}_1) = \exp(0.0935) = 1.10$$

This implies that every additional year of exposure increases the odds of contracting a severe case of pneumoconiosis by 10%.

Overall Usefulness of the Model

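For maximum likelihood estimation, the fit of a model is measured by its deviance D (similar to the sum of squared errors in the case of least-squares estimation):

$$D = 2 \ln \frac{L(\text{saturated})}{L(FM)} = 2 \sum_{i=1}^{n} \left[ y_i \ln \frac{y_i}{n_i \hat{y}_i} + (n_i - y_i) \ln \frac{n_i - y_i}{n_i (1 - \hat{y}_i)} \right]$$

Here L(FM) is the likelihood of the fitted model, yi is the number of successes among the ni observations in group i, and ŷi is the fitted probability for that group.

We compare the deviance of a model to the deviance of the naïve model (no explanatory variables: simply classify each observation as belonging to the majority class).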

Overall Usefulness of the Model

If the ratio D/(n − p), where p is the number of estimated parameters (including the intercept) and n the number of observations, is much greater than unity, then the current model is not adequate.

Note: This test is similar in intent to the F-test for overall usefulness of a linear regression model.
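As a worked check, the deviance of the fitted pneumoconiosis model can be computed directly from the grouped data. The sketch assumes the coefficients −4.7965 and 0.0935 from the earlier example, treats each exposure group as one observation (n = 8) and counts p = 2 parameters; the helper name `term` is hypothetical.

```python
import math

years = [5.8, 15.0, 21.5, 27.5, 33.5, 39.5, 46.0, 51.5]
cases = [0, 1, 3, 8, 9, 8, 10, 5]
miners = [98, 54, 43, 48, 51, 38, 28, 11]
w0, w1 = -4.7965, 0.0935


def term(count, expected):
    """count * ln(count / expected), with the convention 0 * ln(0) = 0."""
    return 0.0 if count == 0 else count * math.log(count / expected)


def deviance(x, y, n, w0, w1):
    total = 0.0
    for xi, yi, ni in zip(x, y, n):
        p = 1.0 / (1.0 + math.exp(-(w0 + w1 * xi)))   # fitted probability
        total += term(yi, ni * p) + term(ni - yi, ni * (1.0 - p))
    return 2.0 * total


D = deviance(years, cases, miners, w0, w1)
n_groups, n_params = len(years), 2
ratio = D / (n_groups - n_params)
print(round(D, 2), round(ratio, 2))
```

A ratio close to one suggests that the single-predictor model fits these data adequately.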

Usefulness of Individual Predictors

Each estimated coefficient, ŵj, has a standard error, s(ŵj), associated with it.

To conduct the hypothesis test

H0: wj = 0 vs. Ha: wj ≠ 0

use the test statistic ŵj / s(ŵj) (called the Wald statistic).

The associated p-value indicates the statistical significance of the predictor xj, or the significance of the contribution of this predictor beyond the other predictors.

Evaluating & Comparing Classifiers

Evaluation of a classifier is necessary for two different purposes:

1. To obtain the complete specification of a particular model, i.e., to obtain numerical values of the parameters of a particular method.

2. To compare two or more fully specified classifiers in order to select the “best” classifier.

Useful criteria:
Reasonableness
Accuracy
Cost measures

Evaluation Criterion 1: Reasonableness

As in regression, time series, and other models, we expect the model to be reasonable:

Based on the analyst’s domain knowledge, is there a reasonable basis for a causal relationship between the predictor variables and Y (group membership)?

Are the predictor variables actually available for prediction in the future?

If the classifier implies a certain order of importance among the predictor variables (indicated by p-values for specific predictors), is this order reasonable?

Evaluation Criterion 2: Accuracy Measures

The idea is to compare the predictions with the actual responses (like forecast errors in time series, or residuals in regression models).

In regression/ time series etc. we displayed these as 3 columns (actual values, predicted/fitted values, errors) or plotted them on a graph.

In classification the predictions and actual values are displayed in a compact format called a classification/confusion matrix.

This can be done for the training and/or validation set.

Classification/confusion matrix

Example with two groups: Y = C1 or C2

                     Predicted
                     Ŷ = C1    Ŷ = C2
Actual    Y = C1     a         b
          Y = C2     c         d

a = # of obs that were classified correctly as group C1

Example: Beer Preference

The following classification matrix results from using a certain classifier on the data.

Training Data scoring - Summary Report

Cut off Prob.Val. for Success (Updatable): 0.5

Classification Confusion Matrix

                  Predicted Class
Actual Class      Regular    Light
Regular           27         3
Light             4          26

Classification Measures

Based on the classification matrix there are 5 popular measures:

1. The overall accuracy of a classifier is

$$P(\text{correct prediction}) = \frac{a + d}{a + b + c + d}$$

2. The overall error rate of a classifier is

$$\frac{b + c}{a + b + c + d} = 1 - \text{overall accuracy}$$

Accuracy Measures – cont.

3. The base accuracy of a dataset is the accuracy of the naïve rule:

Proportion of majority class

4. The base error rate is

1 – base accuracy

5. The lift of a classifier (aka its improvement) is

$$\frac{\text{Base error} - \text{overall error}}{\text{Base error}} \times 100\%$$
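The five measures can be computed from the four cells of the classification matrix. The sketch below applies them to the beer-preference matrix (a = 27, b = 3, c = 4, d = 26); the function name and dictionary keys are illustrative.

```python
def classification_measures(a, b, c, d):
    """Accuracy measures from a 2x2 classification matrix.

    Rows are actual classes (C1, C2), columns are predicted classes (C1, C2).
    """
    total = a + b + c + d
    accuracy = (a + d) / total
    error = (b + c) / total
    # Naive rule: always predict the majority (actual) class.
    base_accuracy = max(a + b, c + d) / total
    base_error = 1.0 - base_accuracy
    lift_pct = (base_error - error) / base_error * 100.0
    return {
        "accuracy": accuracy,
        "error": error,
        "base_accuracy": base_accuracy,
        "base_error": base_error,
        "lift_pct": lift_pct,
    }


m = classification_measures(27, 3, 4, 26)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

In this example both classes have 30 actual members, so the naïve rule is right half the time, and the classifier removes roughly three quarters of the base error.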

Accuracy Measures – cont.

Suppose the two groups are asymmetric in that it is more important to correctly predict membership in C1 than in C2. E.g., in a bankruptcy example, it may be more important to correctly predict a firm that is going bankrupt than to correctly predict a firm that is going to stay solvent. The classifier is essentially used as a system for detecting or signaling C1.

In such a case, the overall accuracy is not a good measure for evaluating the classifier.

Accuracy Measures for Unequal “Importance” of Groups

C1 = “important” group

                 Predicted
                 C1     C2
Actual    C1     a      b
          C2     c      d

Sensitivity of a classifier = its ability to correctly detect the important group members
= % of C1 members correctly classified = a / (a + b)

Specificity of a classifier = its ability to correctly rule out C2 members
= % of C2 members correctly classified = d / (c + d)

Accuracy Measures for Unequal “Importance” of Groups

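C1 = “important” group

                 Predicted
                 C1     C2
Actual    C1     a      b
          C2     c      d

c / (a + c) = false positive rate of classifier (the share of observations predicted as C1 that actually belong to C2)

b / (b + d) = false negative rate of classifier (the share of observations predicted as C2 that actually belong to C1)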
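A short sketch of these four measures, applied to the beer-preference matrix under the assumption that Regular plays the role of the important group C1 (the lecture does not say which class is the important one).

```python
def group_measures(a, b, c, d):
    """Measures for unequal group importance; rows actual, columns predicted."""
    return {
        "sensitivity": a / (a + b),          # C1 members correctly classified
        "specificity": d / (c + d),          # C2 members correctly classified
        "false_positive_rate": c / (a + c),  # predicted C1 but actually C2
        "false_negative_rate": b / (b + d),  # predicted C2 but actually C1
    }


# a = 27, b = 3, c = 4, d = 26 from the beer-preference classification matrix
measures = group_measures(27, 3, 4, 26)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

Sensitivity and specificity are computed along the rows (actual classes), while the two error rates as defined here are computed down the columns (predicted classes).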

Cost Sensitive Learning

There are two types of errors:

                      Predicted class
                      Yes                    No
Actual class   Yes    TP: True positive      FN: False negative
               No     FP: False positive     TN: True negative

Machine Learning methods usually minimize FP+FN

Direct marketing maximizes TP

Cost Sensitive Learning

In practice, false positive and false negative errors often incur different costs.

Examples:
Medical diagnostic tests: does X have leukemia?
Loan decisions: approve mortgage for X?
Web mining: will X click on this link?
Promotional mailing: will X buy the product?

Cost Sensitive Learning

Most learning schemes do not perform cost-sensitive learning.
They generate the same classifier no matter what costs are assigned to the different classes.
Example: standard decision tree learner

Simple methods for cost-sensitive learning:
Re-sampling of instances according to costs
Weighting of instances according to costs

Some schemes are inherently cost-sensitive
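A minimal sketch of why costs matter and of the instance-weighting idea. The cost values, labels and function names are hypothetical; the two example classifiers make the same number of errors but incur very different total costs.

```python
def total_cost(actual, predicted, cost_fn, cost_fp):
    """Sum the misclassification costs over all instances."""
    cost = 0.0
    for y, y_hat in zip(actual, predicted):
        if y == 1 and y_hat == 0:
            cost += cost_fn      # false negative
        elif y == 0 and y_hat == 1:
            cost += cost_fp      # false positive
    return cost


def instance_weights(actual, cost_fn, cost_fp):
    """Weight each training instance by the cost of misclassifying it."""
    return [cost_fn if y == 1 else cost_fp for y in actual]


actual = [1, 1, 0, 0, 0, 0]
classifier_a = [1, 0, 0, 0, 0, 0]   # one false negative
classifier_b = [1, 1, 1, 0, 0, 0]   # one false positive

# Assumed costs: a false negative is ten times as costly as a false positive.
cost_a = total_cost(actual, classifier_a, cost_fn=10.0, cost_fp=1.0)
cost_b = total_cost(actual, classifier_b, cost_fn=10.0, cost_fp=1.0)
weights = instance_weights(actual, cost_fn=10.0, cost_fp=1.0)
print(cost_a, cost_b, weights)
```

Weights like these can multiply each instance's contribution to a training objective, or serve as relative sampling frequencies when re-sampling.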

Lift Charts

A lift chart shows how much the classification improves on random classification.

The larger the area between the lift curve and the random classification line, the better the model.

How to construct a lift chart

Sort the samples by predicted probability, from highest to lowest.

Each point of the chart is the cumulative sum of the actual class (yes=1, no=0) over the samples sorted so far.

For the random classification line, calculate the average of the classes (yes=1, no=0). Each point of the line is this average multiplied by the number of samples considered so far.
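The construction steps can be sketched as follows, using hypothetical predicted probabilities and actual classes for six samples.

```python
def lift_curve(probs, actual):
    """Return (cumulative hits, random baseline) for a lift chart."""
    # Step 1: sort sample indices by predicted probability, highest first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Step 2: cumulative sum of the actual class over the sorted samples.
    cumulative, running = [], 0
    for i in order:
        running += actual[i]
        cumulative.append(running)

    # Step 3: random line = average class value times number of samples so far.
    average = sum(actual) / len(actual)
    baseline = [average * (k + 1) for k in range(len(actual))]
    return cumulative, baseline


probs = [0.9, 0.2, 0.8, 0.4, 0.6, 0.1]   # hypothetical model outputs
actual = [1, 0, 1, 0, 0, 1]              # yes = 1, no = 0

cumulative, baseline = lift_curve(probs, actual)
print(cumulative)   # [1, 2, 2, 2, 2, 3]
print(baseline)     # [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
```

Both lines end at the total number of positives; the model is useful to the extent that its curve rises above the baseline early on.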

ROC Curves

Stands for “receiver operating characteristic”.
Used in signal detection to show the trade-off between hit rate and false alarm rate over a noisy channel.
y axis shows percentage of true positives in sample
x axis shows percentage of false positives in sample
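A sketch of how the points of an ROC curve can be computed by lowering the cutoff one sample at a time. It assumes all predicted probabilities are distinct (ties would need to be grouped), and the data are hypothetical. The area under the curve is included as a common summary, although the lecture does not mention it.

```python
def roc_points(probs, actual):
    """List of (false positive rate, true positive rate) points."""
    positives = sum(actual)
    negatives = len(actual) - positives
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        # Lowering the cutoff past sample i classifies it as positive.
        if actual[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


probs = [0.9, 0.2, 0.8, 0.4, 0.6, 0.1]   # hypothetical model outputs
actual = [1, 0, 1, 0, 0, 1]

points = roc_points(probs, actual)
auc = area_under_curve(points)
print(points)
print(auc)
```

A classifier that ranks at random traces the diagonal, with an area of about 0.5.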

Cross-Validation

Ideally, if we have enough data, we’d set aside a validation set and use it to assess the predictive performance of the classifier.

When data are scarce: K-fold validation
Split the data into K roughly equally-sized parts.
For each part, fit a classifier to the other K-1 parts.
Use that classifier to classify the data in the left-out part.
Combine the misclassification errors resulting from the K classifiers.

Typical choices are K=5 or K=10. K=N is known as “leave one out”.

Larger K yields nearly unbiased CV estimates, but with high variance. The opposite is true for small K.
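The K-fold procedure can be sketched as follows. To keep the example short, the "classifier" is the naïve majority-class rule and the labels are hypothetical; any fitting routine could take its place. The folds are taken in order without shuffling, which is a simplifying assumption.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k roughly equal consecutive folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def majority_class(labels):
    """Naive classifier: always predict the most common training label."""
    return 1 if sum(labels) * 2 >= len(labels) else 0


def cv_error_rate(labels, k):
    """Misclassification rate estimated by k-fold cross-validation."""
    errors = 0
    for held_out in k_fold_indices(len(labels), k):
        held = set(held_out)
        # Fit on the other K-1 parts ...
        train = [labels[i] for i in range(len(labels)) if i not in held]
        prediction = majority_class(train)
        # ... and count the errors on the left-out part.
        errors += sum(1 for i in held_out if labels[i] != prediction)
    return errors / len(labels)


labels = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
folds = k_fold_indices(10, 3)
err = cv_error_rate(labels, 5)
print(folds)
print(err)
```

In practice the data are usually shuffled before splitting so that each fold resembles the whole sample.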

Case study (Levin 1998)

The case study involves the spring 1996 mailing campaign for a home equity loan (HEL) conducted by a major midwestern bank (MWB). We define as a responder (or a buyer) any customer who ended up taking an HEL and actually paying interest on it. The continuous choice of concern is the YTD interest that the bank is expected to collect, which varies between customers based on the size and the terms of the HEL.

The models were evaluated and compared based on three criteria:
Profitability
Goodness of fit
Prediction accuracy

Case Study

Profitability is measured in terms of the resulting total profit/return for the target mailing and audience and/or the average profit/return.

Goodness-of-fit exhibits how well the model is capable of discriminating between responders and non-responders and/or profitable and non-profitable people. In a binary model, it is measured by the actual response rate (the ratio of the number of buyers captured to the size of audience mailed), or the lift in the actual response rate attained by the model over a randomly selected mailing; in a continuous model by the average actual profit/return per customer.

Prediction accuracy is measured by the difference in the predicted profit/response measures versus the actual results.

Dataset

The dataset of the study consists of the spring 1996 mailing audience.

The number of non-buyers is much greater than the number of buyers.
Only a sample of the non-buyers, together with all the buyers, was included in the model training.

The log odds ratio should reflect the true proportion of the buyers and non-buyers.

Dataset

            Responders   Non Responders   Total
Training    201          14028            14229
Test        113          6964             7077
Total       314          20992            21306

Method

A logistic regression model was fitted with a different loss function for each class.
We change the loss function so that the buyers are detected correctly.
For each model fitted, the profit and the accuracy of the model were calculated.

Conclusion

The loss function will depend on:
Which class is more important to detect (if one class is more important than the other, the cost of failing to detect it correctly must be higher)
The accuracy we want to achieve
The profit we want to achieve (the maximum profit may not be achieved with the maximum accuracy)