Logistic regression is used when a few conditions are met:


Transcript of Logistic regression is used when a few conditions are met:

  • Logistic regression is used when a few conditions are met:

    1. There is a dependent variable.
    2. There are two or more independent variables.
    3. The dependent variable is binary, ordinal or categorical.

  • Medical applications

    Symptoms are absent, mild or severe
    Patient lives or dies
    Cancer, in remission, no cancer history

  • Marketing Applications

    Buys pickle / does not buy pickle
    Which brand of pickle is purchased
    Buys pickles never, monthly or daily

  • GLM and LOGISTIC are similar in syntax:

    PROC GLM DATA = dsname ;
      CLASS class_variable ;
      MODEL dependent = indep_var class_variable ;
    RUN ;

    PROC LOGISTIC DATA = dsname ;
      CLASS class_variable ;
      MODEL dependent = indep_var class_variable ;
    RUN ;

  • That was easy. So, why aren't we done and going for coffee now?

  • Why it's a little more complicated: The output from PROC LOGISTIC is quite different from the output from PROC GLM. If you aren't familiar with PROC GLM, the similarities don't help you, now do they?

  • Important Logistic Output

    Model fit statistics
    Global null hypothesis tests
    Odds ratios
    Parameter estimates

  • & a useful plot

  • A word from an unknown person on the Chronicle of Higher Ed forum: "Being able to find SPSS in the start menu does not qualify you to run a multinomial logistic regression."

  • Logarithms, probability & odds ratios, in five minutes or less

  • Points justifying the use of logistic regression: Really, if you look at the relationship of a dichotomous dependent variable and a continuous predictor, often the best-fitting line isn't a straight line at all. It's a curve.

  • You could try predicting the probability of an event, say, passing a course. That would be better than nothing, but the problem is that probability goes from 0 to 1, again restricting your range.

  • Maybe use the odds ratio? This is the ratio of the odds of an event happening versus not happening given one condition, compared to the odds given another condition. However, that only goes from 0 to infinity.
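
    In symbols (standard notation summarizing the last three slides, with p the probability of the event under one condition and p_0 under the other):

    p \in [0, 1], \qquad \text{odds} = \frac{p}{1-p} \in [0, \infty), \qquad \text{odds ratio} = \frac{p/(1-p)}{p_0/(1-p_0)} \in [0, \infty)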

  • When to use logistic regression: Basic example #1

    Your dependent variable (Y): there are two possible values, married or not. We are modeling the probability that an individual is married, yes or no.

    Your independent variable (X): degree in a computer science field = 1, degree in French literature = 0.

  • Step #1

    A. Find the PROBABILITY of the value of Y being a certain value divided by ONE MINUS THE PROBABILITY, for when X = 1

    p / (1 - p)

  • Step #2

    B. Find the PROBABILITY of the value of Y being a certain value divided by ONE MINUS THE PROBABILITY, for when X = 0

  • Step #3

    C. Divide A by B

    That is, take the odds of Y given X = 1 and divide them by the odds of Y given X = 0

  • Example! 100 people in computer science & 100 in French literature. 90 computer scientists are married: odds = 90/10 = 9. 45 French literature majors are married: odds = 45/55 = .818. Divide 9 by .818 and you get your odds ratio of 11, because that is 9/.818.
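
    The same arithmetic written out, using the numbers from the slide:

    \text{odds}_{\text{CS}} = \frac{90/100}{10/100} = 9, \qquad \text{odds}_{\text{FL}} = \frac{45/100}{55/100} \approx 0.818, \qquad \text{OR} = \frac{9}{0.818} \approx 11.0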

  • Just because that wasn't complicated enough

  • Now that you understand what the odds ratio is: the dependent variable in logistic regression is the LOG of the odds ratio (hence the name)

    Which has the nice property of extending from negative infinity to positive infinity.
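
    In standard notation (not on the slide), the quantity being modeled is the log odds, or logit, and the model is linear in it:

    \ln\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k, \qquad \ln\frac{p}{1-p} \in (-\infty, \infty)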

  • A table (try to contain your excitement). The natural logarithm (ln) of 11 is 2.398. I don't think this is a coincidence.

                  B     S.E.     Wald   df   Sig.   Exp(B)
    CS          2.398   .389   37.949    1   .000    11.00
    Constant    -.201   .201     .997    1   .318     .818

  • If the reference value for CS = 1, a positive coefficient means that when CS = 1, the outcome is more likely to occur. How much more likely? Look to your right.

  • The ODDS of getting married are 11 times GREATER if you are a computer science major

                  B     S.E.     Wald   df   Sig.   Exp(B)
    CS          2.398   .389   37.949    1   .000    11.00
    Constant    -.201   .201     .997    1   .318     .818
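
    Reading the two key columns together (values from the table above), Exp(B) is just e raised to the coefficient:

    e^{B_{\text{CS}}} = e^{2.398} \approx 11.0, \qquad e^{B_{\text{Constant}}} = e^{-0.201} \approx 0.818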

  • Actual syntax. Thank God! (Picture of God not available.)

  • PROC LOGISTIC data = datasetname descending ;

    By default the reference group is the first category.

    What if data are scored 0 = not dead, 1 = died?
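
    A minimal sketch of the difference, using a hypothetical dataset with died coded 0/1:

    * Without DESCENDING, SAS models the first ordered value, Pr(died = 0) ;
    PROC LOGISTIC DATA = mydata ;
      MODEL died = age ;
    RUN ;

    * With DESCENDING, SAS models Pr(died = 1) instead ;
    PROC LOGISTIC DATA = mydata DESCENDING ;
      MODEL died = age ;
    RUN ;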

  • CLASS categorical variables ;

    Any variables listed here will be treated as categorical variables, regardless of the format in which they are stored in SAS

  • MODEL dependent = independents ;

  • Dependent = Employed (0,1)

  • Independents

    County
    # Visits to program
    Gender
    Age

  • PROC LOGISTIC DATA = stats1 DESCENDING ;

    CLASS gender county ;
    MODEL job = gender county age visits ;

  • We will now enter real life

  • Table 1

  • Probability modeled is job=1. Note: 50 observations were deleted due to missing values for the response or explanatory variables.

  • This is bad:

    Model Convergence Status
    Quasi-complete separation of data points detected.

    Warning: The maximum likelihood estimate may not exist.
    Warning: The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood iteration. Validity of the model fit is questionable.

  • Complete separation

    X    Group
    0    0
    1    0
    2    0
    3    0
    4    1
    5    1
    6    1
    7    1

  • If you don't go to church you will never die

  • Quasi-complete separation: Like complete separation, BUT there are one or more X values where the points have both values:

    X    Group
    1    1
    2    1
    3    1
    4    1
    4    0
    5    0
    6    0

  • there is not a unique maximum likelihood estimate

  • For any dichotomous independent variable in a logistic regression, if there is a zero in the 2 x 2 table formed by that variable and the dependent variable, the ML estimate for the regression coefficient does not exist.

    Depressing words from Paul Allison
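
    One way to spot this before PROC LOGISTIC complains is to crosstab each categorical predictor against the dependent variable and look for empty cells; a sketch using the variable names from this example:

    * Any zero cell in these tables signals a separation problem ;
    PROC FREQ DATA = stats1 ;
      TABLES gender*job county*job ;
    RUN ;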

  • What the hell happened?

  • Solution?

    Collect more data.
    Figure out why your data are missing and fix that.
    Delete the category that has the zero cell.
    Delete the variable that is causing the problem.

  • Nothing was significant

    & I was sad

  • Let's try something else! Hey, there's still money in the budget!

  • Maybe it's the client's fault:

    Proc logistic descending data = stats ;
      Class difficulty gender ;
      Model job = gender age difficulty ;
    Run ;

  • Oh, joy !

  • This sort of sucks

  • Yep. Sucks.

  • Sucks. Totally.

  • Conclusion: Sometimes, even when you do the right statistical techniques, the data don't predict well. My hypothesis would be that employment is determined by other variables, say having particular skills, like SAS programming.

  • Take 2: Predicting passing grades

  • Proc Logistic data = nidrr ;
      Class group ;
      Model passed = group education ;
    Run ;

  • Yay! Better than nothing!

  • & we have a significant predictor

  • WHY is education negative?

  • Higher education, less failure

  • Now it's later: Comparing model fit statistics

  • The Mathematical Way: Comparing models

  • Akaike Information Criterion: Used to compare models. The SMALLER the better when it comes to AIC.
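
    For reference, the usual definitions (standard formulas, not shown on the slide), with L the maximized likelihood, k the number of estimated parameters, and n the sample size:

    \mathrm{AIC} = -2\ln L + 2k, \qquad \mathrm{SC} = -2\ln L + k\ln n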

  • New variable improves model

    Model with education and group:

    Criterion    Intercept Only    Intercept and Covariates
    AIC          193.107           178.488
    SC           196.131           187.560
    -2 Log L     191.107           172.488

    Model with pretest added:

    Criterion    Intercept Only    Intercept and Covariates
    AIC          193.107           141.250
    SC           196.131           153.346
    -2 Log L     191.107           133.250

  • The Visual Way: Comparing models

  • Reminder: Sensitivity is the percent of true positives, for example, the percentage of people you predicted would die who actually died. Specificity is the percent of true negatives, for example, the percentage of people you predicted would NOT die who survived.
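
    In formulas (standard definitions), with TP, FN, TN, FP the counts of true positives, false negatives, true negatives and false positives:

    \text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}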

  • [Plot comparing a perfect model and a useless model]

  • WHICH MODEL IS BETTER?

  • I'm unimpressed: Yeah, but can you do it again?

  • Data mining with SAS logistic regression

  • Data mining: sample & test

    Select sample
    Create estimates from sample
    Apply to hold-out group
    Assess effectiveness

  • Create sample

    proc surveyselect data = visual out = samp300
      rep = 1 method = SRS seed = 1958 sampsize = 315 ;
    run ;

  • Create Test Dataset

    proc sort data = samp300 ;
      by caseid ;
    proc sort data = visual ;
      by caseid ;
    data testdata ;
      merge samp300 (in = a) visual (in = b) ;
      if a and b then delete ;
    run ;

  • Create Test Dataset

    data testdata ;
      merge samp300 (in = a) visual (in = b) ;
      if a and b then delete ;  *** Deletes record if it is in the sample ;
    run ;

  • Create estimates

    ods graphics on ;
    proc logistic data = samp300 outmodel = test_estimates plots = all ;
      model vote = q6 totincome pctasian / stb rsquare ;
      weight weight ;
    run ;

  • Test estimates

    proc logistic inmodel = test_estimates plots = all ;
      score data = testdata ;
      weight weight ;
    run ;

    *** If no dataset is named, SCORE outputs to a dataset named Data1, Data2, etc. ;

  • Validate estimates

    proc freq data = data1 ;
      tables vote * i_vote ;
    proc sort data = data1 ;
      by i_vote ;
    run ;

  • What is stepwise logistic regression? That's a good question. Usually, all the independent variables are entered in a model simultaneously. In a stepwise model, the variable that has the largest zero-order correlation with the dependent is entered first. The variable that has the highest correlation with the remaining variance enters second.
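
    For completeness, a minimal sketch of how stepwise selection is requested in PROC LOGISTIC, reusing the variables from the employment example; the entry and stay significance levels are illustrative choices, not values from the talk:

    PROC LOGISTIC DATA = stats1 DESCENDING ;
      CLASS gender county ;
      MODEL job = gender county age visits / SELECTION = STEPWISE SLENTRY = 0.05 SLSTAY = 0.05 ;
    RUN ;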

  • [Venn diagram: dependent variable with predictors A and B — variance of the dependent variable explained by A only, variance explained by B only, and the common variance shared with A and B]

  • Example: Death certificates. The death certificate is an important medical document. Resident physician accuracy in completing death certificates is poor. Participants were randomized into an interactive workshop or provided with printed instruction material. A total of 200 residents completed the study, with 100 in each group.

  • Example

    Dependent: Cause of death statement is correct or incorrect
    Independent: Group
    Independent: Awareness of guidelines for death certificate completion
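
    A sketch of what that model could look like in SAS; the dataset and variable names here are hypothetical, not from the study:

    * Hypothetical dataset and variable names ;
    PROC LOGISTIC DATA = deathcert DESCENDING ;
      CLASS group ;
      MODEL correct = group awareness ;
    RUN ;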

  • [Venn diagram: death certificate score with predictors awareness and printed handouts — variance of the dependent variable explained by awareness only, variance explained by the print intervention only, and the common variance shared by awareness & intervention]

  • Bottom line

    Stepwise methods assign all of the shared variance to the first variable to enter the model.
    They take advantage of chance to maximize explained variance.
    Coefficients are not as stable as in non-stepwise models.
    & this is all we'll have to say about stepwise today.

    Logistic regression applications are common in medicine, with binary outcomes of died/survived or ordinal outcomes of symptoms which are absent, mild or severe. Most people might start with the underlying mathematics and go through a lot of proofs. I'm not most people, so I am just going to go over a few points justifying the use of logistic regression. Really, if you look at the relationship of a dichotomous dependent variable and a continuous predictor, it often looks more logistic than linear. The best-fitting line isn't a straight line at all. It's a curve. [Logistic regression is NOT what you would use to model how long a marriage lasted. That would be survival analysis.] Logistic regression is the statistical technique of choice when you have a single dependent variable and multiple independent variables from which you would like to predict it.

    With logistic regression, the first step is the odds of the value of Y you are modeling: the PROBABILITY of Y being that value divided by ONE MINUS THE PROBABILITY. Further, let's say that we have sampled 100 people in computer science and 100 people in French literature. We find that 90 of the computer scientists are married and 45 of the French literature people. So, if the probability of marriage is 90/100 and the probability of not being married is 10/100, then the odds are 9:1 for the computer scientists, that is, 9. For the French literature people, the probability of marriage is 45/100 and the probability of not being married is 55/100, so the odds are .818. So, 9/.818 = 11.00. This tells you that the odds of a computer scientist being married versus single are 11 times those of a French literature professor. Also, that you should study computer science instead of French.

    Let's take a look at two examples of logistic regression. The previous example of computer scientists and French literature professors used data I made up. This is a really good way to ensure that your results turn out nice, neat and highly significant. Unfortunately, making up your data does not sit well with journal reviewers, tenure committees or the National Institutes of Health.

    PROC LOGISTIC invokes the logistic procedure, and DATA = gives the dataset name. What about the DESCENDING? By default the reference group is the first category. Often, data are scored 0 or 1, for example, 0 = survived, 1 = died. The parameters are then predicting the probability of having not died. While your total model chi-square, model significance, and significance of your regression coefficients will all be the same regardless, it is a little counterintuitive for most people to think of this as the probability of not having died, and it gets a little convoluted explaining this to everyone. So, the DESCENDING option uses one as your reference group. Unintentionally listing a numeric variable on the CLASS statement makes each value be treated as a separate category, so make sure it is what you want to do. Just like in Ordinary Least Squares regression, you give the variable you want to predict and the variables you want to use as predictors.

    The "Probability modeled is job=1" is the result of the DESCENDING option. The probability of 1, or being employed, is modeled. Also, just in case I missed the fact that a substantial amount of my data is missing, SAS gives me a little note.

    Complete separation occurs when your data are all in one group above some X value and all in another group below that value, as in the table shown earlier. You might think this is great: you can perfectly predict which group someone falls in. If the score of X is over 3, then they're dead. Otherwise, they're alive. Let's say that X is how many times a week people went to church in 2010 and the GROUP variable is whether they died or not. You could come out and say that people who go to church less than three days a week never die, but you would look pretty stupid saying that. From a mathematical point of view, this is a problem because the model cannot converge to a maximum likelihood estimate; a single maximum does not exist.

    So, what happened? Well, when I looked at County by Job by Gender, I found that about a third of my sample were missing gender. When I looked at those who HAD data for gender, there were only two who had . for county, that is, did not report the county where they lived. Both of those two people had a job, so there was a zero cell.

    Collect more data. This usually solves the problem of having a zero cell. It's not always an option, though. Figure out why your data are missing and fix that. This is something you should always do whenever you can, regardless of whether you have problems with quasi-complete or complete separation. In this case, it was because one file we merged had names in all upper case, so the two did not match. When I re-ran the merge, our missing data were miraculously non-missing. Delete the category that has the zero cell. This CAN work if, as in our example, the cell that had a zero was people who did not give their county and did not get a job. That's kind of a garbage category anyway: not listed. If it were a lot of people, and related to the dependent variable, it could tell us our data were not missing at random. In this case it was only a few people. So, toss it. Delete the variable that is causing the problem. (If you have a dichotomous variable, it should be noted that the option above essentially means deleting the variable, too.) This may be advisable if you have a problem with the variable. For example, if none of the women in your sample are working, how likely is that? BUT then you have to ask if there is a problem with your sample: did you collect data from fighters in the UFC or Catholic priests? Does the variable even make sense for your research question?

    What you should not do is just ignore that note about quasi-separation and the model failing to converge. First of all, I should point out that these data weren't really from the counties named; I changed the data here for confidentiality purposes. This is the point where, if this were most statistical presentations, I would correct the problem by doing A, B, C and/or D, thereby resolving the problem and getting a useful model with all parameters significant except for that one non-significant effect that I can use to illustrate non-significance. Unfortunately, these were real data from a real study where we were trying to prove something. We did, in fact, collect more data, fix the problem of the missing data, delete the category of missing data for county, and still nothing was significant. I was sad.

    AIC: This is the Akaike Information Criterion. AIC is used for the comparison of models from different samples or non-nested models. Ultimately, the model with the smallest AIC is considered the best. SC: This is the Schwarz Criterion. Like AIC, SC penalizes for the number of predictors in the model, and the smallest SC is most desirable. -2 Log L: This is negative two times the log likelihood. The -2 Log L is used in hypothesis tests for nested models.

    Let's move on to a project where we are predicting if people pass or fail a course. This is another model with actual data, and a real point. We received funding to train direct care staff working with individuals with disabilities and chronic illness. As our end outcome, we have whether or not they passed a test of knowledge and best practices related to disability. It included everything from how to give a bed bath to nutrition. As our first attempt, we used education and which group (experimental or control) as the independent variables.

    Let's look at several tables. We'll skip over the first few tables on model fit statistics and look at them later. Here is our table that tests the null hypothesis that all of our regression coefficients are zero. If you are familiar with linear regression or Analysis of Variance, think of this as similar to the F-test. If you aren't familiar with those, just so you know, this is simply the test of whether ANY of your predictor variables are significantly different from zero. There are three different types of chi-square tests and, in this case as usually occurs, the three agree. The model is significant.

    Here we see the estimates of our coefficient values. Notice that education has a NEGATIVE estimate. Why? Because the default value modeled is 0, that is, the person did not pass. The probability of failing is negatively related to education. Being in the CONTROL group is positively related to failing. As you can see from the column on the far right, education is significantly related to the dependent variable but group is not (p > .05).

    In the following table, we see the odds ratio estimates. The point estimate for education is less than one; this tells us that people with more education are less likely to fail. Also, as you can see, 1.0 does not fall within the 95% confidence limits, so the odds are significantly different. The odds for the group variable are greater than one and, this time, 1.0 DOES fall within the confidence limits. Since the odds are greater than 1.0, it means that people in the control group are more likely to fail than those in the experimental group, but since 1.0 falls within the confidence interval, this is not significantly greater.

    Remember that I said we'd talk about those other tables later? Well, now it's later. We want to compare models. Below are the fit statistics for our model with education and group. The first column uses only the intercept to predict. Your model with the two covariates gives a smaller AIC than the intercept alone. This tells you that your model is better than having no predictors. If the AIC for the Intercept Only and the Intercept and Covariates models are almost the same, that is bad.

    Let's say, because it happens to be true, that we gave everyone a pre-test before they were trained. We then tried a different model with pretest, education and group. Check out our fit statistics above. The first column is exactly the same as in our previous model, because it is how well the data fit for the intercept only. The second column, including covariates, is now 141.25. Recall that in the previous model it was 178.49, and recall that smaller is better. So, we conclude here that my second model is much better than the first. Won't it always be that way, with adding more variables making the model better? No. It won't.

    The death certificate is an important medical document that impacts mortality statistics and health care policy. Resident physician accuracy in completing death certificates is poor. We assessed the impact of two educational interventions on the quality of death certificate completion by resident physicians. Two hundred and nineteen internal medicine residents were asked to complete a cause of death statement using a sample case of in-hospital death. Participants were randomized into one of two educational interventions: either an interactive workshop (group I) or printed instruction material (group II). A total of 200 residents completed the study, with 100 in each group. At baseline, competency in death certificate completion was poor. Only 19% of residents achieved an optimal test score.

    Improving Death Certificate Completion: A Trial of Two Training Interventions. Dhanunjaya R. Lakkireddy, MD; Krishnamohan R. Basarakodu, MD; James L. Vacek, MD; Ashok K. Kondur, MD; Srikanth K. Ramachandruni, MD; Dennis J. Esterbrooks, MD; Ronald J. Markert, PhD; Manohar S. Gowda, MD.

    Given limited time, in this session we are only going to cover non-stepwise logistic regression.