Session 10 (Logistic Regression)
Transcript of Session 10 (Logistic Regression)
8/17/2019 Session 10 (Logistic Regression)
http://slidepdf.com/reader/full/session-10-logistic-regression 1/19
URBN LOFTS1837 LOFT STREET, ANYTOWN, NY 50080
URBN LOFTS
URBNLOFTS
Logistic R GR SSION
8/17/2019 Session 10 (Logistic Regression)
http://slidepdf.com/reader/full/session-10-logistic-regression 2/19
URBN LOFTS1837 LOFT STREET, ANYTOWN, NY 50080
URBN LOFTS
Early uses of logistic regression were in
biomedical studies, for instance, to model whether
subjects have a particular condition such as lung
cancer. The past 25 years have seen much use in
social science research, for modeling opinions and
behavior decisions, and in business applications. In
credit-scoring, logistic regression is used to model
the probability that a subject is credit worthy.
8/17/2019 Session 10 (Logistic Regression)
http://slidepdf.com/reader/full/session-10-logistic-regression 3/19
URBN LOFTS1837 LOFT STREET, ANYTOWN, NY 50080
URBN LOFTS
For instance, the probability that a subject
pays a bill on time may use predictors such as the
size of the bill, annual income, occupation,
mortgage and debt obligations, percentage of bills
paid on time in the past, and other aspects of an
applicant's credit history.
8/17/2019 Session 10 (Logistic Regression)
http://slidepdf.com/reader/full/session-10-logistic-regression 4/19
URBN LOFTS1837 LOFT STREET, ANYTOWN, NY 50080
URBN LOFTS
Another area of increasing application is
genetics, such as to estimate quantitative trait loci
effects by modeling the probability that an
offspring inherits an allele of one type instead of
another type as a function of phenotypic values on
various traits for that offspring.
8/17/2019 Session 10 (Logistic Regression)
http://slidepdf.com/reader/full/session-10-logistic-regression 5/19
URBN LOFTS1837 LOFT STREET, ANYTOWN, NY 50080
URBN LOFTS
For binary response variable Y and an explanatory X, let
= = 1 = = 1 − = 0 = . The logistic
regression model is
Equivalently, the logit (log odds) has the linear relationship
= exp + 1 + exp +
=
1 − = +
8/17/2019 Session 10 (Logistic Regression)
http://slidepdf.com/reader/full/session-10-logistic-regression 6/19
URBN LOFTS1837 LOFT STREET, ANYTOWN, NY 50080
URBN LOFTS
• The sign determines whether π(x) is increasing ordecreasing as x increases.
• The rate of climb or descent increases as |β|
increases; as β → 0 the curve flattens to a horizontal
straight line.
• When β = 0, Y is independent of X.
• For quantitative x with β > 0, the curve for π( x ) has
the shape of the CDF of the logistic distribution. Sincethe logistic density is symmetric, π( x ) approaches 1
at the same rate that it approaches 0.
β
:
8/17/2019 Session 10 (Logistic Regression)
http://slidepdf.com/reader/full/session-10-logistic-regression 7/19
URBNLOFTS
1837 LOFT STREET, ANYTOWN, NY 50080
URBN LOFTS
• Exponentiating both sides of the equation shows that the oddsare an exponential function of x.
• This provides a basic interpretation for the magnitude of β : The
odds multiply by for every 1 -unit increase in x.
• In other words, is an odds ratio, the odds at X = x + 1 dividedby the odds at X = x.
• The intercept parameter α is not usually of particular interest.
• By centering the predictor about 0 [i.e., replacing x by (x — )],
α becomes the logit at x = , and thus α/(1 + α) = π( ). As in
ordinary regression, centering is also helpful in complex models
containing quadratic or interaction terms to reduce correlations
among model parameter estimates.
β:
8/17/2019 Session 10 (Logistic Regression)
http://slidepdf.com/reader/full/session-10-logistic-regression 8/19
URBNLOFTS
1837 LOFT STREET, ANYTOWN, NY 50080
URBN LOFTS
• To illustrate logistic regression, we re-analyze thehorseshoe crab data introduced last time.
• The binary response is whether a female crab has any
male crabs residing nearby (satellites). For crab i, let
yi = 1 if she has at least one satellite and yi = 0 if shehas none.
• Here, we use as a predictor the female crab's carapace
width.
• In the succeeding graph, it appears that yi = 1 tends to
occur relatively more often at higher x values, in fact,
all crabs with width >29 cm have satellites.
8/17/2019 Session 10 (Logistic Regression)
http://slidepdf.com/reader/full/session-10-logistic-regression 9/19
URBNLOFTS
1837 LOFT STREET, ANYTOWN, NY 50080
URBN LOFTS
8/17/2019 Session 10 (Logistic Regression)
http://slidepdf.com/reader/full/session-10-logistic-regression 10/19
URBNLOFTS
1837 LOFT STREET, ANYTOWN, NY 50080
URBN LOFTS
• In each of the eight width categories, we computed thesample proportion of crabs having satellites and the
mean width for the crabs in that category.
• A curve based on smoothing the data using the
generalized additive modeling method, assuming abinomial response and logit link is also in the graph
• This curve shows a roughly increasing trend and is more
informative than viewing the binary data alone.
• It suggests that an S-shaped regression function may
describe this relationship relatively well.
8/17/2019 Session 10 (Logistic Regression)
http://slidepdf.com/reader/full/session-10-logistic-regression 11/19
URBNLOFTS
1837 LOFT STREET, ANYTOWN, NY 50080
URBN LOFTS
8/17/2019 Session 10 (Logistic Regression)
http://slidepdf.com/reader/full/session-10-logistic-regression 12/19
URBNLOFTS
1837 LOFT STREET, ANYTOWN, NY 50080
URBN LOFTS
• The ML fit is
• Substituting x = 26.3 cm, the mean width level in this
sample,π(x) = 0.674.
• The estimated probability equals 0.5 when
• The estimated odds of a satellite multiply by exp( ) =
exp(0.497) = 1.64 for each 1-cm increase in width; that
is, there is a 64% increase.
x = −
=12.351
0.497= 24.8
8/17/2019 Session 10 (Logistic Regression)
http://slidepdf.com/reader/full/session-10-logistic-regression 13/19
URBNLOFTS
1837 LOFT STREET, ANYTOWN, NY 50080
URBN LOFTS
• The statistic z = /s.e. = 0.497/0.102 = 4.89 providesstrong evidence of a positive width effect (P < 0.0001).
• The equivalent Wald chi-squared statistic, z2 = 23.89,
has df = 1.
• The maximized log likelihoods equal —112.88 under H0:
= 0 and —97.23 for the full model.
• The likelihood-ratio statistic equals
—2[
—112.88
—(-97.23)] = 31.31, with df = 1.
• This provides even stronger evidence than the Wald
test.
8/17/2019 Session 10 (Logistic Regression)
http://slidepdf.com/reader/full/session-10-logistic-regression 14/19
URBNLOFTS
1837 LOFT STREET, ANYTOWN, NY 50080
URBN LOFTS
• The Wald 95% confidence interval for is 0.497 ±1.96(0.102), or (0.298, 0.697).
• The table reports a likelihood-ratio confidence interval
of (0.308, 0.709), based on the profile likelihood
function.
• The confidence interval for the effect on the odds per
1-cm increase in width equals (e0.308, e0.709) =
(1.36,2.03).
• We infer that a 1-cm increase in width has at least a
36% increase and at most a doubling in the odds of a
satellite.
8/17/2019 Session 10 (Logistic Regression)
http://slidepdf.com/reader/full/session-10-logistic-regression 15/19
URBNLOFTS
1837 LOFT STREET, ANYTOWN, NY 50080
URBN LOFTS
• In practice, there is no guarantee that a certain logistic
regression model fits the data well.
• For any type of binary data, one way to detect lack of fit
uses a likelihood-ratio test to compare the model to more
complex ones.
• A more complex model might contain a nonlinear effect.
• Models with multiple predictors would consider interaction.
If more complex models do not fit better, this provides
some assurance that the model chosen is reasonable.
8/17/2019 Session 10 (Logistic Regression)
http://slidepdf.com/reader/full/session-10-logistic-regression 16/19
URBN
LOFTS1837 LOFT STREET, ANYTOWN, NY 50080
URBN LOFTS
• For models with a continuous explanatory variable, X2 and
G2 do not approximate chi-squared distributions due to very
few counts for each value of x .
• One solution for this is to “bin” the continuous variable
into several categories and then perform a goodness-of-fit
test.
• If the test resulted to a good fit, then it can be inferred
that the model with the continuous explanatory variable
has also a good fit.
8/17/2019 Session 10 (Logistic Regression)
http://slidepdf.com/reader/full/session-10-logistic-regression 17/19
URBN
LOFTS1837 LOFT STREET, ANYTOWN, NY 50080
URBN LOFTS
8/17/2019 Session 10 (Logistic Regression)
http://slidepdf.com/reader/full/session-10-logistic-regression 18/19
URBN
LOFTS1837 LOFT STREET, ANYTOWN, NY 50080
URBN LOFTS
• In each width category, the fitted value for a "yes" response
is the sum of the estimated probabilities π(x) for all crabs
having width in that category
• The fitted value for a "no" response is the sum of 1 — π(x)
for those crabs.
• In this table we have an 8 by 2 table, to compute for thechi-squared statistic, we square the difference of the count
for observed and the fitted values.
8/17/2019 Session 10 (Logistic Regression)
http://slidepdf.com/reader/full/session-10-logistic-regression 19/19
URBN LOFTS
1837 LOFT STREET, ANYTOWN, NY 50080
URBN LOFTS
• Their values are X2 = 5.3 and G2 = 6.2.
• The constructed table has eight binomial samples, one for
each width setting.
• The model has two parameters, so df = 8 — 2 = 6.
• Neither X2 nor G2 shows evidence of lack of fit (P-value >
0.4).
• Thus, we can feel more comfortable about using the model
for the original ungrouped data.