Week 2: Logistic Regression
Centering
- Can be done a number of ways; the most common method is to subtract the mean of the variable from each score.
- It can be done to make the values more interpretable (a centered score of 0 represents the average).
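As a minimal sketch (the scores below are made up, not from the lecture), mean-centering subtracts the variable's mean from every raw score:

```python
# Hypothetical assessment scores for illustration only.
scores = [10, 14, 18, 22, 16]

mean = sum(scores) / len(scores)       # average of the variable
centered = [x - mean for x in scores]  # each score minus the mean

# A centered score of 0 now means "exactly average", which aids interpretation;
# the centered values always sum to 0.
```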
In a logistic regression
- The DV is binary
- The IVs can be continuous and/or binary
- Can be used to answer different questions to those in multiple regression
In this plot, we can see where responses cluster: most people lie below the cut point of 12 for this assessment. We could use this to assess: what are the predictors for someone with scores above the cut point of 12?
- i.e. has depression vs does not have depression
Logistic regression is used to predict group membership: sometimes our primary interest is in predicting a categorical outcome.
Yes or No outcomes:
• Has heart disease (CHD)
• Drinks more than 5 days a week
• Has a score above a cut-point on a scale (e.g., depressive symptoms: SMF cut-point = 12)
Used to predict a categorical DV on the basis of one or more IVs:
• IVs can be categorical or continuous
• DV must be categorical
• Used for predicting group membership (yes/no) or outcome (yes/no)
Analyses for categorical DV:
What are key differences between ANOVA and logistic regression? = The DV is binary for LR and not for ANOVA
When there is only one predictor:
• Chi-square
• t-test / ANOVA
• Correlation (can also be done with a dichotomous variable: tetrachoric correlation)
When there are multiple predictors:
• Linear regression (continuous DV)
• Logistic regression (binary DV)
FOR DICHOTOMOUS/CATEGORICAL/BINARY DVs: same as multiple regression (still predicting an outcome), except now we are predicting which category people fall into on the DV (e.g., depression or no depression).
Formulas (function) behind a logistic regression
Linear function
The function of a straight line: Y = mX + c or Y = c + bX
Where:
- Y = DV, X = IV
- m or b is the gradient (slope): the rate Y changes for every 1-unit change in X
- c = the intercept (where the line crosses the y-axis), or constant
Logistic function
• For different values on the x-axis, predicts membership on the Y axis (0 or 1).
A mathematical way of describing the shape of the line of best fit for an outcome coded 0 or 1:
• Y is on a logarithmic (log-odds) scale
• Non-linear
• The coefficients are interpreted differently because the function is non-linear: they relate to the probability of getting either a 0 or 1
• If the coefficient is positive, the y value increases
• If the coefficient is negative, the y value decreases
• Because it is a logarithmic scale, a 1-unit change on the IV doesn't correspond to a fixed change in the DV. To get around this, we convert the coefficients in a logistic model to odds ratios.
Quadratic function (parabola): Y = a + B1X + B2X²
Linear Function: graph
Forms of Logistic Regression
• Binomial form: the DV is either one of two values (1 or 0), e.g. either has the disorder or does not. This is what we will look at in this unit.
• Multinomial form: the categorical DV outcome can be 1, 2, 3 or more; categories are given a number so they can be categorised.
• Ordinal form: the categorical DV moves up in a progressive way (i.e. low, medium, high).
Can the approach used for MR model building be used for LR? = Yes, it can use standard, sequential and statistical model building
Coding
Relevant for categorical variables in LR (IVs and DV).
For the DV, SPSS automatically codes the target outcome:
• For binary logistic regression, the higher value on the DV in your dataset is assigned as the target outcome.
• SPSS predicts the higher value: if 1 is assigned to scores of 12+ on the questionnaire and 0 to scores below 12, the model will produce results that predict the 1 category, and the 0 category is used as the reference group.
• The highest numeric value will be used as the target category.
• It doesn't matter how the categories are coded, as long as the output is interpreted correctly.
• The lower value on the DV is thus assigned to the reference group.
• You can also manually recode the DV in your dataset if you are unhappy with the default target outcome.
Manual recoding in SPSS
Best to recode into a different (new) variable instead of overwriting the original variable. The goal is to assign the higher value to the group of interest.
Categorical IVs
SPSS codes the IV at the analysis setup stage. If you have 2 levels of a categorical IV (e.g., gender), SPSS will assign a reference category automatically.
But if you have more than 2 levels of a categorical IV, you need to do this manually via dummy coding.
First, identify a suitable reference category. Usually the reference group is the one of least interest (e.g., control group vs intervention). Example:
1. Low SES (the referent here)
2. Middle SES
3. High SES
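Dummy coding a 3-level IV can be sketched like this (variable names are illustrative, not SPSS output):

```python
def dummy_code_ses(level):
    """Dummy-code a 3-level SES variable with 'low' as the reference category.

    The reference category scores 0 on every dummy; each other level gets
    its own 0/1 indicator, so k levels need k - 1 dummies.
    """
    return {
        "ses_middle": 1 if level == "middle" else 0,
        "ses_high": 1 if level == "high" else 0,
    }

# Each dummy's coefficient then compares that level against 'low'.
```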
Standard: variables are entered into the model together.
Sequential/hierarchical: binary and continuous variables are entered in blocks, as specified by the researcher and based on theory.
Statistical: entry can be forward or backward, based on some statistical criterion.
Examples of categorical IVs:
- Continent of birth (1. Asia, 2. Europe, 3. Africa, 4. South America): Asia is the referent category, so each of the other categories will be individually compared to it.
- Type of school: 1. Government, 2. Independent, 3. Catholic: Independent is the referent category. Each of the other categories will be individually compared to it (coefficients: Government compared to Independent, and Catholic compared to Independent).
Key point: understand your choice of reference group, because it affects the interpretation of the DV categories and your interpretation of the IV coefficients.
What does target variable mean? = This is the outcome category of the dependent variable that is the focus of the research question
What is the reference category in logistic regression analyses? = The category against which the other categories of a categorical variable are compared when interpreting coefficients
Can a categorical DV have more than one category in logistic regression? = In binary logistic regression the DV can only have two values (multinomial and ordinal forms allow more)
Can a categorical IV have more than one category? = Categorical IVs in a logistic regression can have two or more values
Assumptions when performing a logistic regression
1. Independent errors
The response (and error) for one individual must not influence, or be highly correlated with, the response of another individual. This assumption often fails when the sample shares common factors; such clustered data calls for multilevel modelling.
2. Multicollinearity and singularity
Collinearity: variables that are highly correlated.
Correlations between predictor variables can be either:
• Tetrachoric (for correlations between binary variables), or
• Pearson's (for correlations between continuous variables)
We want predictors with low intercorrelations so each adds something unique to the model.
3. Linearity between continuous IVs and the log odds of the DV
There should be a linear relationship between continuous predictors/IVs and the log odds (logit) of the DV; on the raw probability scale this relationship appears as an S shape.
Can use these to examine the association:
• Scatter plot
• Examine the line of best fit
4. Outliers
Are there extreme cases which could be influential? Can assess this using:
• Cook's distance and standardised residuals (e.g., values beyond about ±3)
• Produced as SPSS output
• Can either remove or transform outliers, and report this in the results section
5. Sufficient data/information
• Need to ensure there is enough data in the set
• Need sufficient data representing all possible combinations of the variables in the research question
• Use a contingency table to check
• If cells are sparse, collect more data (at least 5 per cell) or collapse levels of the IV
• Example: smoking and tomato consumption and risk of cancer
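The contingency-table check can be sketched with invented counts (none of these numbers come from the lecture):

```python
from collections import Counter

# Hypothetical (smoking status, outcome) observations; counts are made up.
observations = (
    [("smoker", "cancer")] * 8 + [("smoker", "no cancer")] * 12 +
    [("non-smoker", "cancer")] * 3 + [("non-smoker", "no cancer")] * 17
)

cells = Counter(observations)  # the contingency table: counts per combination
sparse = {cell: n for cell, n in cells.items() if n < 5}

# 'sparse' flags combinations with fewer than 5 cases; for those we would
# collect more data or collapse levels of the IV.
```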
Key Output
1. How good is the model overall?
a) Model improvement
χ² = (-2LLbase) - (-2LLnew)
Deviance statistic = -2LL (-2 × the log likelihood). This statistic follows a χ² distribution.
It tests model improvement from the base model when comparing nested models: it compares the deviance value of one model to another and assesses how much fit has improved; if the difference is significant on a χ² test, the model has significantly improved.
It is simply subtracting one model's deviance from the other's. A significant χ² test says that putting those variables in the model is a significant improvement over the model without them.
-2 log likelihood is sometimes referred to as variance. This is not accurate (see Andy Field); think of the derived ratio as a pseudo R-squared.
Can also be calculated using: χ²model / -2LLbase = 29.310 / 136.663 = .214 = 21% improvement in the model.
Nagelkerke's and Cox & Snell
• These statistics become more inflated as sample size increases, which is not ideal; the measures given by SPSS are flawed.
• Instead we can calculate our own using the equation above (χ² / -2LLbase).
• But do report, e.g., "Model fit improved by 21%". Can always report -2LL and its change.
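The calculation above can be reproduced directly. The base deviance (136.663) and the χ² (29.310) are the slide's figures; the new model's deviance is inferred as base minus χ²:

```python
# -2LL (deviance) values: base is from the slide; new is inferred from it.
neg2ll_base = 136.663
neg2ll_new = 107.353   # = 136.663 - 29.310

chi_sq = neg2ll_base - neg2ll_new   # model-improvement chi-square
pseudo_r2 = chi_sq / neg2ll_base    # ~.214, i.e. "model fit improved by 21%"
```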
b) Classification accuracy
LR predicts group membership; we can compare predicted against actual membership to see how accurate the model is.
The classification table tells us how many cases that were actually in each group were predicted to be in the correct group.
Predicted membership is based on: (1) the predicted probability of Y = 1 for each person given their scores on the predictors, and (2) a (user-defined) cut value.
Is the classification accuracy good?
• The closer to 100%, the better; if it is below 50%, the model isn't particularly good.
• Can evaluate this for each DV category: does the model predict well for both groups?
Compare against the base rate (no predictors in the model): the proportion of additional classification accuracy
= (hits + correct rejections - nmax group) / (nsample - nmax group)
So we are asking: how much better does the model with predictors classify than simply assigning everyone to the largest category?
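The formula above, computed with invented counts (all numbers are hypothetical):

```python
# Hypothetical classification results.
n_sample = 100           # total cases
n_max_group = 60         # size of the largest outcome group (the base-rate guess)
hits = 45                # correctly predicted cases in the target category
correct_rejections = 30  # correctly predicted cases in the reference category

# Proportion of additional classification accuracy over the base rate:
improvement = (hits + correct_rejections - n_max_group) / (n_sample - n_max_group)
# Here the model recovers 37.5% of the accuracy the base rate leaves on the table.
```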
2. How good are the individual predictors?
a) Significance
In MR: b, beta, sr², and a t-test for significance; Y = b0 + b1X1 + b2X2 + ... + bnXn
b = change in the DV for a 1-unit change in the IV
In LR: b weights/coefficients, odds ratios, and the Wald test for significance of those coefficients.
b = change in the log odds of Y = 1 for a 1-unit change in the IV. This is hard to interpret because a logarithmic scale isn't linear, so we convert the b coefficient to an odds ratio, which we can interpret more easily.
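The conversion from a coefficient to an odds ratio is just exponentiation (the coefficient below is made up):

```python
import math

b = 0.693  # hypothetical coefficient: change in log odds per 1-unit change in the IV

odds_ratio = math.exp(b)  # ~2.0: each 1-unit increase roughly doubles the odds of Y = 1

# An odds ratio above 1 corresponds to a positive b, below 1 to a negative b,
# and exactly 1 (b = 0) to no effect.
```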
The baseline classification table shows what the model correctly vs incorrectly predicted. This baseline is based purely on how many participants are in each group at the start, with no information except the grouping distribution: what is the chance that a randomly picked person belongs to each outcome group given only that? Compare this to the classification table with IVs included.