Danstan Bagenda, PhD, Jan 2009
Logistic Regression in Stata
Danstan Bagenda, PhD MUSPH
Friday, January 22, 2010
Logistic Regression in STATA
The logistic regression programs in STATA use maximum likelihood estimation to generate the logit (the logistic regression coefficient, which corresponds to the natural log of the OR for each one-unit increase in the level of the regressor variable).
The resulting ORs are maximum-likelihood estimates (MLEs) of the uniform effect (OR) across strata of the model covariates. Thus they are pooled (uniform, common) estimates of the OR and in this sense are adjusted for all regressors included in the model.
STATA Logistic Regression Commands
The “logistic” command in STATA yields odds ratios.
. logistic low smoke age

Logistic regression                               Number of obs   =        189
                                                  LR chi2(2)      =       7.40
                                                  Prob > chi2     =     0.0248
Log likelihood = -113.63815                       Pseudo R2       =     0.0315

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       smoke |   1.997405    .642777     2.15   0.032     1.063027    3.753081
         age |   .9514394   .0304194    -1.56   0.119     .8936481    1.012968
------------------------------------------------------------------------------
STATA Logistic Regression Commands
The “logit” command in STATA yields the actual beta coefficients.

. logit low smoke age

Iteration 0:   log likelihood =   -117.336
Iteration 1:   log likelihood = -113.66733
Iteration 2:   log likelihood = -113.63815

Logit estimates                                   Number of obs   =        189
                                                  LR chi2(2)      =       7.40
                                                  Prob > chi2     =     0.0248
Log likelihood = -113.63815                       Pseudo R2       =     0.0315

------------------------------------------------------------------------------
         low |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       smoke |   .6918487   .3218061     2.15   0.032     .0611203    1.322577
         age |  -.0497793    .031972    -1.56   0.119    -.1124432    .0128846
       _cons |   .0609055   .7573199     0.08   0.936    -1.423414    1.545225
------------------------------------------------------------------------------
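As a quick sanity check (in Python rather than Stata, purely for illustration), exponentiating the logit coefficients reproduces the odds ratios reported by the "logistic" command:

```python
import math

# Coefficients from the logit output above
b_smoke = 0.6918487
b_age = -0.0497793

# Exponentiating a logit coefficient gives the odds ratio
# reported by the "logistic" command for the same data.
or_smoke = math.exp(b_smoke)   # ~1.9974
or_age = math.exp(b_age)       # ~0.9514
```

The two commands fit the same model; "logistic" simply reports exp(beta) instead of beta.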
STATA Commands for Multilevel Categorical Variables in Logistic Regression Models
Categorized continuous variables should be entered in regression models as a series of indicator variables. For each category, a variable is created in which observations falling in that category are coded "1" and all other observations are coded "0"; thus the variable is represented in the model as a series of indicator terms, with the reference category left out of the model.
Any categories of a variable that are left out of the model become part of the reference group (because those observations will be coded "0" for each indicator term left in the model).
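The coding scheme can be sketched in a few lines of Python (the data here are made up for illustration; category 1 plays the role of the reference):

```python
# Hypothetical race codes (1 = reference, 2, 3), as in the low-birth-weight data
race = [1, 2, 3, 2, 1]

# One indicator per non-reference category; category 1 is left out
# and becomes the reference group (coded 0 on every indicator).
irace_2 = [1 if r == 2 else 0 for r in race]
irace_3 = [1 if r == 3 else 0 for r in race]

print(irace_2)  # [0, 1, 0, 1, 0]
print(irace_3)  # [0, 0, 1, 0, 0]
```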
STATA Commands for Multilevel Categorical Variables in Logistic Regression Models
If categorized continuous variables are entered in models as if they were continuous, that is, as one term rather than a series of indicator variables, the program will treat the values as a continuous distribution, with each observation in a category having the same value. The resulting odds ratio will correspond to each one unit increase in the category coding. This will not produce a meaningful result unless the coding can be interpreted as linear increments from one category to another.
STATA has a convenient command that makes it unnecessary to create the indicator terms for multilevel categorical variables. The “xi” command creates a series of indicator variables for variables marked “i.variablename” by recognizing each value as a category. When used with the logistic or logit commands, STATA uses the lowest value as the reference category, which it drops out of the model. It is necessary to make sure that the variable coding reflects the desired categorization and reference level.
. xi: logistic low i.race
i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)

Logistic regression                               Number of obs   =        189
                                                  LR chi2(2)      =       5.01
                                                  Prob > chi2     =     0.0817
Log likelihood = -114.83082                       Pseudo R2       =     0.0214

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Irace_2 |   2.327536   1.078613     1.82   0.068     .9385073    5.772385
    _Irace_3 |   1.889234   .6571342     1.83   0.067     .9554577    3.735597
------------------------------------------------------------------------------
Note that the ORs for race from the logistic regression model are the same as the crude ORs from the stratified analysis; this is because race is entered as indicator variables, with each level compared to the reference category.
. tabodds low race, or

---------------------------------------------------------------------------
        race |  Odds Ratio      chi2     P>chi2      [95% Conf. Interval]
-------------+-------------------------------------------------------------
           1 |    1.000000         .          .             .           .
           2 |    2.327536      3.40     0.0652      0.922844    5.870358
           3 |    1.889234      3.37     0.0665      0.946595    3.770574
---------------------------------------------------------------------------
Test of homogeneity (equal odds): chi2(2)  =     4.98
                                  Pr>chi2  =   0.0830

Score test for trend of odds:     chi2(1)  =     3.57
                                  Pr>chi2  =   0.0588
Interpreting logistic regression model coefficients for continuous variables
When a logistic regression model contains a continuous independent variable, interpretation of the estimated coefficient depends on how it is entered into the model and the particular units of the variable
To interpret the coefficient, we assume that the logit is linear in the variable
The slope coefficient gives the change in the log odds for an increase of “1” unit in x.
Interpreting logistic regression model coefficients for continuous variables
Sometimes a one-unit increase may not be meaningful or considered important.
If instead we are interested in estimating the increased odds for, say, every 5-year increase, we can use the formula:
OR(c) = exp(c*ß1), with 95% CI = exp(c*ß1 ± 1.96*c*SEß1)
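Using the age coefficient and standard error from the earlier logit output, a small Python calculation (illustrative, not part of the Stata session) applies this formula for c = 5:

```python
import math

# Logit coefficient and SE for age (from the earlier logit output)
b, se = -0.0497793, 0.031972
c = 5  # 5-year increase

or_c = math.exp(c * b)                   # ~0.78 per 5 years of age
lo = math.exp(c * b - 1.96 * c * se)     # lower 95% confidence limit
hi = math.exp(c * b + 1.96 * c * se)     # upper 95% confidence limit
```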
P-values for Trend
The term trend generally refers to a monotonic, though not necessarily linear, association between increasing levels of exposure and the probability of the outcome.
Although examining effect estimates and confidence intervals over levels of exposure is the most informative manner of evaluating a dose-response trend, it has been conventional to report p-values for the null hypothesis of no monotonic association between exposure and disease, that is, a test for trend.
The p-value corresponding to the coefficient of a variable entered on an appropriate continuous scale in a regression model can be interpreted as a p-value for trend. For statistical hypothesis testing of trend in stratified analysis, see Rothman and Greenland.
. logit low ptl
Iteration 0:   log likelihood =   -117.336
Iteration 1:   log likelihood = -113.96986
Iteration 2:   log likelihood = -113.94631
Iteration 3:   log likelihood = -113.94631

Logit estimates                                   Number of obs   =        189
                                                  LR chi2(1)      =       6.78
                                                  Prob > chi2     =     0.0092
Log likelihood = -113.94631                       Pseudo R2       =     0.0289

------------------------------------------------------------------------------
         low |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         ptl |   .8018058   .3171535     2.53   0.011     .1801964    1.423415
       _cons |   -.964189   .1749607    -5.51   0.000    -1.307106   -.6212722
------------------------------------------------------------------------------
Interpretation
For every one-unit increase in the number of premature labors, the odds of LBW increase 2.2-fold (= exp(0.801)).
If instead we are interested in estimating the increased odds for every 3 premature labors, we can use the formula:
OR(3) = exp(3 * 0.801) = 11.06
For every increase of 3 additional premature labors, the odds of LBW increase 11.06 times.
Be careful: the validity of such statements may be questionable, since the additional risk of LBW at 7 premature labors may be quite different from that at 3 premature labors, but this is an unavoidable dilemma when continuous variables are modeled linearly in the logit.
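The arithmetic can be checked in Python (illustrative only; the coefficient is taken from the logit output above, and the slide's 11.06 comes from using the rounded coefficient 0.801):

```python
import math

b = 0.8018058  # ptl coefficient from the logit output

or_1 = math.exp(b)       # OR per one additional premature labor, ~2.23
or_3 = math.exp(3 * b)   # OR per three additional premature labors, ~11.08
```

Note that exp(3 * 0.801) = 11.06 as on the slide; the full-precision coefficient gives 11.08.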
Significance testing
Does the model that includes the variable in question tell more about the outcome variable than a model that does not include that variable?
In general you are comparing observed values of the response variable to the predicted values obtained from models with and without the variables in question
The likelihood-ratio test statistic is obtained by multiplying the difference between the two models' log likelihoods by -2.
It follows a chi-square distribution with degrees of freedom equal to the difference between the number of parameters in the 2 models.
. xi: logistic low i.race
i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)

Logistic regression                               Number of obs   =        189
                                                  LR chi2(2)      =       5.01
                                                  Prob > chi2     =     0.0817
Log likelihood = -114.83082                       Pseudo R2       =     0.0214

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Irace_2 |   2.327536   1.078613     1.82   0.068     .9385073    5.772385
    _Irace_3 |   1.889234   .6571342     1.83   0.067     .9554577    3.735597
------------------------------------------------------------------------------

. estimates store M1
LR Test
LR Test

. xi: logistic low i.race lwt
i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)

Logistic regression                               Number of obs   =        189
                                                  LR chi2(3)      =      11.41
                                                  Prob > chi2     =     0.0097
Log likelihood = -111.62955                       Pseudo R2       =     0.0486

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Irace_2 |   2.947821   1.438687     2.22   0.027     1.132586    7.672396
    _Irace_3 |    1.61705   .5767584     1.35   0.178     .8037528    3.253301
         lwt |   .9848922    .006342    -2.36   0.018     .9725401    .9974011
------------------------------------------------------------------------------

. lrtest M1

likelihood-ratio test (LR = -2[L1 - L2])           LR chi2(1)  =      6.40
(Assumption: M1 nested in .)                       Prob > chi2 =    0.0114
LR Test adding LWT to the model
-2 * (-114.83082 - (-111.62955)) = -2 * (-3.20127) = 6.40; P = 0.011
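This arithmetic, including the chi-square p-value on 1 degree of freedom, can be reproduced in Python (illustrative check; the identity P(chi2_1 > x) = erfc(sqrt(x/2)) keeps it within the standard library):

```python
import math

ll_reduced = -114.83082  # log likelihood, model with race only (M1)
ll_full = -111.62955     # log likelihood, model with race + lwt

# Likelihood-ratio statistic: -2 times the difference in log likelihoods
lr = -2 * (ll_reduced - ll_full)         # ~6.40

# Upper-tail chi-square probability with 1 df via the normal complement:
# P(chi2_1 > x) = erfc(sqrt(x / 2))
p = math.erfc(math.sqrt(lr / 2))         # ~0.0114
```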
Wald test
Obtained by comparing the maximum likelihood estimate of the slope parameter, ß1, to an estimate of its standard error:
W = ß1 / SE(ß1)
A z test; the distribution is approximately standard normal.
Gives approximately the same answer as the LR test in large samples, but the two may differ in smaller samples.
The LR test seems to perform better in most situations.
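For example, the Wald z statistic for smoke in the earlier logit output can be reproduced in Python (illustrative check, standard library only):

```python
import math

# smoke coefficient and SE from the earlier logit output
b, se = 0.6918487, 0.3218061

w = b / se                             # Wald z statistic, ~2.15
p = math.erfc(abs(w) / math.sqrt(2))   # two-sided normal p-value, ~0.032
```

These match the z = 2.15 and P>|z| = 0.032 shown in the Stata table.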
Variable selection
Once we have obtained a model that contains essential variables, examine the variables in the model more closely
Then check for interactions. For example, an interaction between smoking status and race would indicate that the slope coefficient for smoking differs by race/ethnicity.
Interaction assessment
Interaction can take many forms.
When interaction is present, the association between the risk factor and the outcome variable differs, or depends in some way, on the level of the covariate.
The covariate modifies the effect of the risk factor.
Consider a model with a dichotomous risk factor and the covariate age. If the association between the covariate (age) and the outcome variable is the same within each level of the risk factor, then there is no interaction between the covariate and the risk factor.
Graphically, no interaction yields a model with parallel lines.
Interaction assessment
The model needs to be hierarchically well formulated (HWF).
Decide whether the effect of each variable varies importantly across race/ethnic categories.
To make this decision, first note the magnitude of the difference between the ORs across strata.
The tests of homogeneity have low power for detecting moderate effect-measure modification. The p-value on this test indicates whether there is adequate statistical power in the data to detect a difference, but a high p-value does not mean there is no effect-measure modification.
In general, the significance level for heterogeneity worth exploring further should be set around 0.20-0.25.
Observe stratum-specific estimates using stratified analysis.
. cc low smoke, by(race)

            RACE |       OR        [95% Conf. Interval]     M-H Weight
-----------------+-------------------------------------------------
               1 |   5.757576     1.657574    25.1388        1.375      (exact)
               2 |        3.3     .4865385    23.45437       .7692308   (exact)
               3 |       1.25      .273495    5.278229       2.089552   (exact)
-----------------+-------------------------------------------------
           Crude |   2.021944     1.029092    3.965864                  (exact)
    M-H combined |   3.086381     1.49074     6.389949
-------------------------------------------------------------------
Test of homogeneity (M-H)       chi2(2) =     3.03   Pr>chi2 = 0.2197

Test that combined OR = 1:
                Mantel-Haenszel chi2(1) =     9.41   Pr>chi2 = 0.0022
Interaction assessment
Use product terms in a multivariable logistic regression model in order to identify potential effect-measure modification (interaction) while adjusting for confounders.
The p-value on the product term can be interpreted as a test of homogeneity.
To model a product term for two continuous variables, a term must be created for the product of the two variables. The product term is entered into the model, along with each of the two variables.
The “xi” command in STATA creates all of the required product terms for modeling interaction if at least one of the two variables is categorical. With this command the two variables do not have to be entered separately in the model because STATA does it for you.
When entering a product term between a categorical and a continuous variable in a logistic regression model, we evaluate whether the entire dose-response of the continuous variable differs across strata of the categorized variable.
. xi: logistic low i.smoke i.race
i.smoke           _Ismoke_0-1         (naturally coded; _Ismoke_0 omitted)
i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)

Logistic regression                               Number of obs   =        189
                                                  LR chi2(3)      =      14.70
                                                  Prob > chi2     =     0.0021
Log likelihood = -109.98736                       Pseudo R2       =     0.0626

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _Ismoke_1 |   3.052631    1.12711     3.02   0.003     1.480433    6.294481
    _Irace_2 |   2.956742   1.448758     2.21   0.027     1.131717    7.724832
    _Irace_3 |   3.030001   1.212926     2.77   0.006     1.382618    6.640233
------------------------------------------------------------------------------
. xi: logistic low i.smoke i.race i.race*i.smoke
i.smoke           _Ismoke_0-1         (naturally coded; _Ismoke_0 omitted)
i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)
i.race*i.smoke    _IracXsmo_#_#       (coded as above)

note: _Irace_2 dropped due to collinearity
note: _Irace_3 dropped due to collinearity
note: _Ismoke_1 dropped due to collinearity

Logistic regression                               Number of obs   =        189
                                                  LR chi2(5)      =      17.85
                                                  Prob > chi2     =     0.0031
Log likelihood = -108.40889                       Pseudo R2       =     0.0761

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _Ismoke_1 |   5.757576   3.444619     2.93   0.003     1.782322    18.59915
    _Irace_2 |   4.545455   3.419404     2.01   0.044     1.040507    19.85682
    _Irace_3 |   5.714286   3.397819     2.93   0.003     1.781648    18.32745
_IracXsm~2_1 |   .5731579   .5916338    -0.54   0.590     .0757938    4.334256
_IracXsm~3_1 |   .2171053   .1916638    -1.73   0.084     .0384784    1.224966
------------------------------------------------------------------------------
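A useful check (here in Python, purely for illustration): multiplying the reference-stratum OR for smoking by each interaction OR recovers the stratum-specific smoke ORs from the earlier cc low smoke, by(race) output (5.76, 3.3, 1.25):

```python
# ORs from the interaction model above
or_smoke_race1 = 5.757576   # _Ismoke_1: smoke OR in the reference (race 1) stratum
or_int_race2 = 0.5731579    # _IracXsmo_2_1
or_int_race3 = 0.2171053    # _IracXsmo_3_1

# Multiplying the reference-stratum OR by the interaction OR
# gives the stratum-specific smoke OR in each race stratum.
or_smoke_race2 = or_smoke_race1 * or_int_race2   # ~3.30
or_smoke_race3 = or_smoke_race1 * or_int_race3   # ~1.25
```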
Interaction assessment

Use a "chunk" test for the entire collection of interaction terms.
Use the LR test comparing the main-effects model with the fuller model.

. lrtest M2 M3

likelihood-ratio test                              LR chi2(2)  =      3.16
(Assumption: M2 nested in M3)                      Prob > chi2 =    0.2063

If the interaction terms are not significant, drop them from the model.
Interaction Assessment
If the interaction term is retained in the model, the estimated ORs for other variables confounded by race [or another modifier] should not be obtained from a model that enters race [or another modifier] in suboptimal form for the purpose of obtaining stratum-specific estimates.
In an analysis that aims to estimate effects of several variables, we may use several different models to estimate the effects of interest. In this case, our goal is not the elaboration of a “final model”.
Confounding Assessment
A confounder is a covariate that is associated both with the outcome variable of interest and with the primary independent variable (risk factor) of interest, but is not an intermediate variable in the causal pathway.
When both variables are present, the relationship is said to be confounded.
Assessment of confounding is only appropriate when there is no interaction.
Decisions regarding whether or not to adjust for potential confounding variables will depend on a combined assessment of prior knowledge, observed associations in the data, and sample size considerations.
Confounding Assessment

In practice we look for empirical evidence of confounding in data obtained from study populations. However, we must keep in mind that what we observe in such data may reflect selection and information bias affecting the observed confounder-disease-exposure associations in a similar way to how these biases affect the observed exposure-disease associations.
Therefore, it is necessary to rely on prior knowledge of relevant associations in source populations. For example, if a variable is a known confounder but does not appear to be one in the data, this should create uncertainty regarding the validity of the data. Adjusting for this factor may not change the relative risk point estimate, but it may influence the standard error for this estimate, thus appropriately reflecting our uncertainty.
If prior knowledge suggests a variable should not be a confounder but it appears to be one in the data, the confounding may have been introduced by the study methods (e.g., as a result of matching in a case-control design). In this case it would be appropriate to adjust for this factor, provided it is not an intervening (intermediate) variable.
Confounding Assessment

When prior knowledge regarding exposure-covariate associations is insufficient and the number of covariates to consider is small, it may be desirable to adjust for all variables that appear to be important risk factors for the outcome, as long as they are not intervening variables.
When a large number of potential confounders must be considered, the change-in-estimate variable selection strategy has been shown in simulation studies to produce more valid results for confounder detection than strategies that rely on p-values, unless the significance level for the p-value is raised to 0.2 or higher. There is more than one reasonable approach to variable selection in such situations; the important thing is that the criterion for selection should be explicit and consistently applied.
When examining continuous variables as potential confounders, careful assessment of the dose-response for the purpose of choosing the optimal scale or categorization should be carried out prior to the assessment of confounding. Inappropriate modeling of continuous variables may lead to incorrect decisions about whether or not to adjust for these variables and/or to inadequate control of confounding.
Confounding Assessment
Begin with a model containing the exposure-disease relationship. For example, suppose we are interested in the association between the number of premature labors and LBW:
. xi: logistic low i.ptl
i.ptl             _Iptl_0-3           (naturally coded; _Iptl_0 omitted)
note: _Iptl_3 != 0 predicts failure perfectly
      _Iptl_3 dropped and 1 obs not used

Logistic regression                               Number of obs   =        188
                                                  LR chi2(2)      =      15.12
                                                  Prob > chi2     =     0.0005
Log likelihood = -109.39993                       Pseudo R2       =     0.0646

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     _Iptl_1 |   5.756098   2.702085     3.73   0.000     2.293763    14.44467
     _Iptl_2 |   1.918699   1.785729     0.70   0.484     .3095962    11.89099
------------------------------------------------------------------------------
Confounding Assessment
. xi: logistic low i.smoke
i.smoke           _Ismoke_0-1         (naturally coded; _Ismoke_0 omitted)

Logistic regression                               Number of obs   =        189
                                                  LR chi2(1)      =       4.87
                                                  Prob > chi2     =     0.0274
Log likelihood = -114.9023                        Pseudo R2       =     0.0207

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _Ismoke_1 |   2.021944   .6462912     2.20   0.028     1.080668    3.783083
------------------------------------------------------------------------------
Confounding Assessment
. xi: logistic low i.smoke i.ht
i.smoke           _Ismoke_0-1         (naturally coded; _Ismoke_0 omitted)
i.ht              _Iht_0-1            (naturally coded; _Iht_0 omitted)

Logistic regression                               Number of obs   =        189
                                                  LR chi2(2)      =       8.88
                                                  Prob > chi2     =     0.0118
Log likelihood = -112.89597                       Pseudo R2       =     0.0378

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _Ismoke_1 |   2.037802   .6595424     2.20   0.028     1.080606    3.842878
      _Iht_1 |   3.421389   2.113223     1.99   0.046     1.019665    11.48015
------------------------------------------------------------------------------
Confounding Assessment

Examine the OR for the main risk factor to determine whether it is meaningfully different from the OR obtained when controlling for the potential confounder.
For the previous examples the ORs are 2.02 vs. 2.04.
Can also examine using the LR test.
May want to include the confounder if others would not trust the results without the adjustment.
May want to use a decision rule, depending on the subject matter (e.g., a 10% change in the OR), and report the criterion.
Several authors have suggested a backward-deletion strategy whereby you enter all main effects and confounders in the model and then eliminate, one by one, the confounder that makes the smallest difference in the exposure-effect estimate.
Biases resulting from multiple confounders may cancel each other out or produce results that are not easy to disentangle.
Model Checking
When should we run diagnostics, and why? This depends on what we want the model to do
If our goal is to obtain approximately valid summary estimates of effect for a few key relationships, less rigorous checking is required
If our goal is to predict outcomes for a given set of factors, more detailed checking is required
Model Checking
Assuming the model contains those variables that should be in the model, we want to know how effectively the model describes the data (goodness of fit).
A good-fitting model is not the same as a correct model.
Regression diagnostics can detect discrepancies between a model and data only within the range of the data, and only if there are enough observations for adequate diagnostic power.
A model may appear to fit well in the central range of the data, but produce poor predictions for covariate values that are not well represented in the data.
Model Checking
Computation and evaluation of overall measures Examination of individual components of the
summary statistics, often graphically Examination of other measures of the difference
between components of the observed values versus the predicted values
Model Checking
Check model results against results from stratified analysis
Log-likelihood Ratio Test (also called Deviance Test)
Tests of regression: test the hypothesis that all the regression coefficients (except the intercept) are zero.
The R2 is not recommended for use; it may give a distorted impression.
Model Checking

Tests of fit (see R&G, pp. 409-10 for test details)
Tests for nonrandom incompatibilities between a model and the data.
Compare the fit of an index model to a more elaborate reference model that contains it.
A small p-value suggests that the index model fits poorly relative to the reference model, that is, that the additional terms in the reference model improve the fit.
A large p-value does not mean that the index model fits well, only that the test did not detect an improvement in fit from the additional terms in the reference model.
Pearson residual (examining (observed - predicted)/SE)
Deviance residual
Hosmer-Lemeshow goodness-of-fit statistic: a grouping-based method based on values of the estimated probabilities.
Classification tables: the table is the result of cross-classifying the outcome variable with a dichotomous variable whose values are derived from the estimated logistic probabilities.
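A minimal sketch of the Pearson residual for one covariate pattern, in Python with made-up numbers (the function name and data here are hypothetical, for illustration only):

```python
import math

def pearson_residual(y, n, p):
    """(observed - expected) / binomial SE for one covariate pattern
    with n observations, y events, and fitted probability p."""
    return (y - n * p) / math.sqrt(n * p * (1 - p))

# Toy pattern: 20 observations, 8 events, fitted probability 0.3
# (expected events = 6), so the residual is modestly positive.
r = pearson_residual(y=8, n=20, p=0.3)   # ~0.976
```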
Goodness of fit

. xi: logistic low smoke age i.race ptl ht

Logistic regression                               Number of obs   =        189
                                                  LR chi2(6)      =      24.86
                                                  Prob > chi2     =     0.0004
Log likelihood = -104.90441                       Pseudo R2       =     0.1059

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       smoke |   2.645521   1.022067     2.52   0.012     1.240679    5.641093
         age |   .9541789   .0337091    -1.33   0.184     .8903458    1.022589
    _Irace_2 |   2.648104   1.327578     1.94   0.052     .9912907    7.074066
    _Irace_3 |   2.778709   1.164658     2.44   0.015     1.222007    6.318482
         ptl |   2.117295   .6982914     2.27   0.023     1.109308    4.041203
          ht |   3.413581   2.132187     1.97   0.049     1.003537    11.61146
------------------------------------------------------------------------------
Goodness of fit - Pearson

. lfit

Logistic model for low, goodness-of-fit test

       number of observations =       189
 number of covariate patterns =       108
            Pearson chi2(101) =    104.23
                  Prob > chi2 =    0.3929
Goodness of fit - Hosmer-Lemeshow

. lfit, group(10)

Logistic model for low, goodness-of-fit test

  (Table collapsed on quantiles of estimated probabilities)

       number of observations =       189
             number of groups =        10
      Hosmer-Lemeshow chi2(8) =      6.92
                  Prob > chi2 =    0.5456
Multicollinearity

Occurs when one or more independent variables can be approximately determined by some other variables in the model.
When there is multicollinearity, the estimated regression coefficients of the fitted model can be highly unreliable.
Multiple testing
Occurs from the many tests of significance that are typically carried out when selecting or eliminating variables from the model
Makes it more likely to obtain statistical significance even if no real association exists.
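A quick illustrative computation (assuming independent tests, which is an idealization): with 10 tests each at the 0.05 level, the chance of at least one spurious "significant" result is about 40%:

```python
# Probability of at least one false positive across k independent
# tests, each at significance level alpha (illustrative numbers).
alpha, k = 0.05, 10
p_any = 1 - (1 - alpha) ** k   # ~0.40
```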
Influential observations
Refers to data on individuals that may have a large influence on the estimated regression coefficients
Methods for assessing the possibility of influential observations should be considered
** Let's look at some logistic regression diagnostics.

predict pprob          /* predicted probabilities */
predict r, resid       /* Pearson residuals */
predict h, hat         /* leverage */
predict db, dbeta      /* Pregibon dbeta */
predict dx2, dx2       /* Hosmer & Lemeshow influence */

scatter h r, xline(0) msym(Oh) jitter(2)