Week 2: Logistic Regression category


Transcript of Week 2: Logistic Regression category

Page 1: Week 2: Logistic Regression category

Week 2: Logistic Regression

Centering
- Can be done a number of ways; the most common method is to subtract the average (mean) for that variable from each raw score.
- It can be done to make the values more interpretable.
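A quick sketch of what mean-centering does, in plain Python (the scores below are made up for illustration):

```python
# Mean-centering: subtract the variable's mean from each raw score.
scores = [10, 14, 8, 16, 12]   # hypothetical raw scores

mean = sum(scores) / len(scores)        # 12.0
centered = [x - mean for x in scores]   # [-2.0, 2.0, -4.0, 4.0, 0.0]

# Centered scores always average to zero, so the intercept becomes the
# predicted value for a person at the mean of that variable.
```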

In a logistic regression:
- The DV is binary
- The IVs can be continuous and/or binary
- Can be used to answer different questions from those in multiple regression

In this plot, we can see where the dominant number of responses are / where responses are clustering - most people lie below the cut point of 12 for this assessment.
- We could use this to assess: what are the predictors for someone with scores above the cut point of 12?
- i.e., has depression vs. does not have depression

Logistic regression is used to predict group membership: sometimes our primary interest is in predicting a categorical outcome.

Yes or No outcomes:
• Has heart disease (CHD)
• Drinks more than 5 days a week
• Has a score above a cut-point on a scale (e.g., depressive symptoms: SMF cut-point = 12)

Used to predict a categorical DV on the basis of one or more IVs:
• IVs can be categorical or continuous
• DV must be categorical
• Used for predicting group membership (yes/no) or outcome (yes/no)

Analyses for categorical DV:

What are key differences between ANOVA and logistic regression? = The DV is binary in LR, whereas in ANOVA the DV is continuous

When there is only one predictor:
• Chi-square
• T-test / ANOVA
• Correlation - can also be done with a dichotomous variable (tetrachoric correlation)

When there are multiple predictors:
- Linear regression (continuous DV)
- Logistic regression (binary DV)

FOR DICHOTOMOUS/CATEGORICAL/BINARY DVs: same as multiple regression (still predicting an outcome), except now we are predicting which category people fall into on the DV (e.g., depression or no depression)

Page 2: Week 2: Logistic Regression category

Formulas (functions) behind a logistic regression

Linear function
The function of a straight line: Y = mX + c, or Y = c + bX

Where:
- Y = DV, X = IV
- m (or b) is the gradient (slope): the rate Y changes for every 1-unit change in X
- c = intercept (where the line crosses the y-axis), or constant
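The straight-line function can be sketched in a couple of lines of Python (slope and intercept values are made up):

```python
def linear(x, slope, intercept):
    # Y = c + b*X: b (slope) is the gradient, c the intercept/constant.
    return intercept + slope * x

y_at_3 = linear(3, 0.5, 2.0)   # 3.5
y_at_4 = linear(4, 0.5, 2.0)   # 4.0
# A 1-unit change in X always changes Y by exactly the slope, b = 0.5.
```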

Logistic function
• For different values on the x-axis, predicts membership on the Y axis (0 or 1).

Logistic function: a mathematical way of describing the shape of the line of best fit for an outcome coded 0 or 1
• Y is on a logarithmic (log-odds) scale
• Non-linear
• The coefficients are interpreted differently because the function is non-linear: the model gives the probability of getting either a 0 or a 1
• If the coefficient is positive, the y value increases
• If the coefficient is negative, the y value decreases
• Because it is a logarithmic scale, a 1-unit change on the IV doesn't correspond to a fixed change in the DV. To get around this, we convert the coefficients in a logistic model to odds ratios
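A minimal sketch of the logistic function in Python, with made-up coefficients, showing that predicted probabilities stay between 0 and 1 (the S-shaped curve):

```python
import math

def logistic(x, b0, b1):
    # Predicted probability that Y = 1: p = 1 / (1 + e^-(b0 + b1*x))
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

b0, b1 = -3.0, 0.25   # hypothetical intercept and coefficient

p_low = logistic(0, b0, b1)    # small probability at low X
p_high = logistic(20, b0, b1)  # high probability at high X
# Because b1 is positive, the probability of Y = 1 rises with X,
# but never leaves the 0-1 range.
```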

Quadratic function (parabola): Y = a + B1X + B2X²

[Figure: graph of a linear function]

Page 3: Week 2: Logistic Regression category

Forms of Logistic Regression
• Binomial form: the DV is one of two values (1 or 0) - either has the disorder or does not. This is what we will look at in this unit.
• Multinomial form: the categorical DV outcome can be 1, 2, 3, or more - categories are given a number so they can be categorised.
• Ordinal form: the categorical DV moves up in a progressive way (i.e., low, medium, high).

Can the approach used for MR model building be used for LR? = Yes, it can use standard, sequential and statistical model building

Coding
Relevant for categorical variables in LR (IVs and DV).

For the DV, SPSS automatically codes the target outcome:
• For binary logistic regression, the higher value on the DV in your dataset is assigned as the target outcome.
• SPSS predicts the higher value: if 1 is assigned to scores of 12+ on the questionnaire and 0 to scores below 12, the model will produce results that predict the 1 category, and the 0 category is used as the reference group.
• The highest numeric value will be used as the target category.
• It doesn't matter how the categories are coded, as long as the output is interpreted correctly.
• The lower value on the DV is thus assigned as the reference group.
• You can also manually recode the DV in your dataset if you are unhappy with the default target outcome.

Manual Recoding in SPSS
Best to recode into a different (new) variable instead of overwriting the original variable. The goal is to assign the higher value to the group of interest.
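The same recoding idea can be sketched outside SPSS, in Python (the raw scores are made up; the 12+ cutpoint follows the SMF example in these notes):

```python
# Recode raw questionnaire scores into a NEW binary variable, keeping the
# original intact. The group of interest (score of 12+) gets the higher
# value, 1, so it becomes the target category.
raw_scores = [5, 12, 18, 9, 14]   # hypothetical scores

binary_dv = [1 if score >= 12 else 0 for score in raw_scores]
# binary_dv == [0, 1, 1, 0, 1]; raw_scores is unchanged
```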

Categorical IVs
SPSS codes the IV at the analysis setup stage. If a categorical IV has 2 levels (e.g., gender), SPSS will assign a reference category automatically.

But if a categorical IV has more than 2 levels, you need to do this manually via dummy coding.

First, identify a suitable reference category. Usually the reference group is not of primary interest - the one with the least interest (e.g., control group vs. intervention).
1. Low SES will be the referent here
2. Middle SES
3. High SES
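Dummy coding with low SES as the reference can be sketched like this in Python (made-up data):

```python
# Dummy coding a 3-level IV (SES) with "low" as the reference category:
# each non-reference level gets its own 0/1 indicator, and a "low"
# participant is coded 0 on both indicators.
ses = ["low", "middle", "high", "low", "high"]   # hypothetical data

middle_vs_low = [1 if s == "middle" else 0 for s in ses]
high_vs_low   = [1 if s == "high" else 0 for s in ses]
# Each dummy's coefficient compares that level to the "low" baseline.
```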

Standard: variables are entered into the model together.

Sequential/hierarchical: binary and continuous variables are entered in blocks, as specified by the researcher and based on theory.

Statistical: can be based on forward or backward entry, using statistical criteria.

Page 4: Week 2: Logistic Regression category

Examples of categorical IVs:
- Continent of birth (1. Asia, 2. Europe, 3. Africa, 4. South America) - Asia is the referent category, so each of the other categories will be individually compared to it.
- Type of school: 1. Government, 2. Independent, 3. Catholic - Independent is the referent category. Each of the other categories will be individually compared to it (coefficients: Government compared to Independent, and Catholic compared to Independent).

Key point: understand your choice of reference group, because it shapes the interpretation of the DV coding and of the coefficients of the IVs.

What does target variable mean? = This is the outcome category of the dependent variable that is the focus of the research question

What is the reference category in logistic regression analyses? = The category of a categorical variable against which the other categories are compared when interpreting coefficients

Can a categorical DV have more than two categories in logistic regression? = In binary logistic regression the DV can have only two values

Can a categorical IV have more than one category? = Categorical IVs in a logistic regression can have two or more values

Assumptions when performing a Logistic Regression

1. Independent errors

The response (and error) for one individual must not influence, or be highly correlated with, the response of another individual. This assumption often fails when the sample shares common factors; this is called clustered data and calls for multilevel modelling.

2. Multicollinearity and Singularity

Collinearity: variables that are highly correlated.

Correlations between predictor variables can be either:
• Tetrachoric (for correlations between binary variables), or
• Pearson's (for correlations between continuous variables)

We want predictors with low intercorrelations, so each adds something unique to the model.

3. Linearity between continuous IVs and the log odds of the DV

There should be a linear relationship between each continuous predictor/IV and the log odds (logit) of the DV; on the probability scale this relationship appears as an S shape.

Can use these to examine this association:
• Scatter plot
• Examine line of best fit

4. Outliers

Are there extreme cases which could be influential? Can assess this using:
• Cook's distance and standardised residuals (beyond about ±3)
• Produced as SPSS output
• Can either remove or transform outliers, and report this in the results section

5. Sufficient data/information
• Need to ensure there is enough data in the set
• Need sufficient data representing all possible combinations relevant to the research questions
• Use a contingency table to check
• If cells are sparse, collect more data (at least 5 per cell) or collapse levels of the IV
• Example: smoking and tomato consumption as predictors of cancer risk
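The cell-count check can be sketched in Python with a contingency-table tally (the smoker/tomato counts below are hypothetical):

```python
from collections import Counter

# Check every IV-combination cell has enough cases (rule of thumb from
# these notes: at least 5 per cell). All data below are made up.
participants = (
    [("smoker", "low tomato")] * 8
    + [("smoker", "high tomato")] * 7
    + [("nonsmoker", "low tomato")] * 9
    + [("nonsmoker", "high tomato")] * 3
)

cells = Counter(participants)
sparse = {cell: n for cell, n in cells.items() if n < 5}
# sparse == {("nonsmoker", "high tomato"): 3} -> collect more data for
# that combination, or collapse levels of an IV.
```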

Page 5: Week 2: Logistic Regression category

Key Output

1. How good is the model overall?

a) Model improvement: χ² = (-2LL base) - (-2LL new)

The deviance statistic is -2LL (-2 × log likelihood). It follows a χ² distribution and tests model improvement from the base model when comparing nested models: it compares the deviance from one model to another and assesses how much it has improved; if the difference is significant on a χ² test, the model is significantly improved.

It's simply subtracting one model's deviance from the other's. A significant χ² says that putting those variables in the model is a significant improvement over the model without them.

-2 log likelihood is sometimes referred to as variance
• This is not accurate (see Andy Field)
• Think of it as a pseudo R-squared

Can also be calculated using:
χ² model / -2LL base = 29.310 / 136.663 = .214 = 21% improvement in the model

Nagelkerke's and Cox & Snell
• These statistics become more inflated as sample size increases - not ideal
• The measures given by SPSS are flawed
• Instead we can calculate our own using the equation above (χ² / -2LL base)

But do report "Model fit improved by 21%". Can always report -2LL and its change.
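The model-improvement arithmetic above, using the deviance values quoted in these notes, can be written out directly:

```python
# Model-improvement chi-square from the notes:
# chi2 = (-2LL base) - (-2LL new)
neg2ll_base = 136.663
chi_sq_model = 29.310
neg2ll_new = neg2ll_base - chi_sq_model   # deviance of the new model

# Proportional improvement (a hand-calculated pseudo R-squared):
improvement = chi_sq_model / neg2ll_base  # ~ .214, i.e. ~21%
```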

b) Classification accuracy
LR predicts group membership; we can compare predicted against actual membership to see how accurate the model is.

This table tells us how many cases that actually were in each group were predicted to be in the correct group.

Predicted membership is based on: (1) the predicted probability of Y = 1 for each person, given their scores on the predictors, and (2) a (user-defined) cut value.

Page 6: Week 2: Logistic Regression category

Is the classification accuracy good?
• The closer to 100%, the better; if it's below 50%, the model isn't particularly good
• Evaluate this for each DV category - does the model predict well for both groups?

Compare against the base rate (no predictors in the model):
• Proportion of additional classification accuracy
• = (hits + correct rejections - n of largest group) / (n of sample - n of largest group)
• So we're asking: how much better is the predictive model than simply assigning everyone to the larger (e.g., 0) category?
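The base-rate comparison formula above can be computed directly; the counts below are hypothetical, not from the lecture data:

```python
# Additional classification accuracy over the base rate, using the
# formula from these notes. All counts are made up.
hits = 30                # correctly predicted members of the target group
correct_rejections = 45  # correctly predicted members of the reference group
n_sample = 100
n_max_group = 60         # size of the larger outcome group (the base rate)

additional = (hits + correct_rejections - n_max_group) / (n_sample - n_max_group)
# (30 + 45 - 60) / (100 - 60) = 0.375
```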

2. How good are the individual predictors?

a) Significance
In MR: b, beta, sr², and a t-test for significance. Y = b0 + b1X1 + b2X2 + ... + bnXn

b = change in DV for a 1-unit change in the IV

In LR: b weights/coefficients, odds ratios, and the Wald test for the significance of those coefficients.
b = change in the log odds of Y = 1 for a 1-unit change in the IV. This is hard to interpret because a logarithmic scale isn't linear, so we convert the b coefficient to an odds ratio, which is easier to interpret.
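The coefficient-to-odds-ratio conversion, plus the Wald statistic ((b / SE)², tested against a χ² distribution), can be sketched as follows; b and its standard error are hypothetical values:

```python
import math

# Converting a logistic regression coefficient to an odds ratio, plus
# the Wald statistic used to test the coefficient's significance.
b, se = 0.47, 0.18   # hypothetical coefficient and standard error

odds_ratio = math.exp(b)   # odds of Y = 1 multiply by this per 1-unit IV change
wald = (b / se) ** 2       # compared against a chi-square distribution (1 df)
# b > 0 gives an odds ratio above 1; b < 0 would give one below 1.
```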

What the model correctly predicted vs. what it incorrectly predicted (green section vs. red section). NOTE: the baseline table is based purely on how many participants are in each group at baseline, with no information except the group distribution - i.e., what is the chance that a randomly picked person belongs to each outcome group, given only that limited information? Compare this to the classification table WITH the IVs included.