Danstan Bagenda, PhD, Jan 2009
Logistic Regression in Stata
Danstan Bagenda, PhD MUSPH
Friday, January 22, 2010
Logistic Regression in STATA
The logistic regression programs in STATA use maximum likelihood estimation to generate the logit (the logistic regression coefficient, which corresponds to the natural log of the OR for each one-unit increase in the level of the regressor variable).
The resulting ORs are maximum-likelihood estimates (MLEs) of the uniform effect (OR) across strata of the model covariates. Thus they are pooled (uniform, common) estimates of the OR and in this sense are adjusted for all regressors included in the model.
STATA Logistic Regression Commands
The “logistic” command in STATA yields odds ratios.
. logistic low smoke age

Logistic regression                               Number of obs   =        189
                                                  LR chi2(2)      =       7.40
                                                  Prob > chi2     =     0.0248
Log likelihood = -113.63815                       Pseudo R2       =     0.0315

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       smoke |   1.997405    .642777     2.15   0.032     1.063027    3.753081
         age |   .9514394   .0304194    -1.56   0.119     .8936481    1.012968
------------------------------------------------------------------------------
STATA Logistic Regression Commands
The “logit” command in STATA yields the actual beta coefficients.

. logit low smoke age

Iteration 0:   log likelihood =   -117.336
Iteration 1:   log likelihood = -113.66733
Iteration 2:   log likelihood = -113.63815

Logit estimates                                   Number of obs   =        189
                                                  LR chi2(2)      =       7.40
                                                  Prob > chi2     =     0.0248
Log likelihood = -113.63815                       Pseudo R2       =     0.0315

------------------------------------------------------------------------------
         low |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       smoke |   .6918487   .3218061     2.15   0.032     .0611203    1.322577
         age |  -.0497793    .031972    -1.56   0.119    -.1124432    .0128846
       _cons |   .0609055   .7573199     0.08   0.936    -1.423414    1.545225
------------------------------------------------------------------------------
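As a quick sanity check (in Python rather than Stata, purely for illustration), exponentiating the logit coefficients reproduces the odds ratios reported by the "logistic" command:

```python
import math

# Coefficients from the logit output above
b_smoke = 0.6918487
b_age = -0.0497793

# Exponentiating a logit coefficient gives the odds ratio
# reported by the "logistic" command for the same data.
or_smoke = math.exp(b_smoke)   # ~1.9974
or_age = math.exp(b_age)       # ~0.9514
```

The two commands fit the same model; "logistic" simply reports exp(beta) instead of beta.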
STATA Commands for Multilevel Categorical Variables in Logistic Regression Models
Categorized continuous variables should be entered in regression models as a series of indicator variables. For each category, a variable is created in which observations falling in that category are coded "1" and all other observations are coded "0"; thus the variable is represented in the model as a series of indicator terms, with the reference category left out of the model.
Any categories of a variable that are left out of the model become part of the reference group (because those observations will be coded "0" for each indicator term left in the model).
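The coding scheme can be sketched in a few lines of Python (the data here are made up for illustration; category 1 plays the role of the reference):

```python
# Hypothetical race codes (1 = reference, 2, 3), as in the low-birth-weight data
race = [1, 2, 3, 2, 1]

# One indicator per non-reference category; category 1 is left out
# and becomes the reference group (coded 0 on every indicator).
irace_2 = [1 if r == 2 else 0 for r in race]
irace_3 = [1 if r == 3 else 0 for r in race]

print(irace_2)  # [0, 1, 0, 1, 0]
print(irace_3)  # [0, 0, 1, 0, 0]
```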
STATA Commands for Multilevel Categorical Variables in Logistic Regression Models
If categorized continuous variables are entered in models as if they were continuous, that is, as one term rather than a series of indicator variables, the program will treat the values as a continuous distribution, with each observation in a category having the same value. The resulting odds ratio will correspond to each one unit increase in the category coding. This will not produce a meaningful result unless the coding can be interpreted as linear increments from one category to another.
STATA has a convenient command that makes it unnecessary to create the indicator terms for multilevel categorical variables. The “xi” command creates a series of indicator variables for variables marked “i.variablename” by recognizing each value as a category. When used with the logistic or logit commands, STATA uses the lowest value as the reference category, which it drops out of the model. It is necessary to make sure that the variable coding reflects the desired categorization and reference level.
. xi: logistic low i.race
i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)

Logistic regression                               Number of obs   =        189
                                                  LR chi2(2)      =       5.01
                                                  Prob > chi2     =     0.0817
Log likelihood = -114.83082                       Pseudo R2       =     0.0214

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Irace_2 |   2.327536   1.078613     1.82   0.068     .9385073    5.772385
    _Irace_3 |   1.889234   .6571342     1.83   0.067     .9554577    3.735597
------------------------------------------------------------------------------
Note that the ORs for race from the logistic regression model are the same as the crude ORs from the stratified analysis; this is because race is entered as indicator variables, with each level compared to the reference category.
. tabodds low race, or

---------------------------------------------------------------------------
        race |  Odds Ratio      chi2     P>chi2      [95% Conf. Interval]
-------------+-------------------------------------------------------------
           1 |    1.000000         .          .             .           .
           2 |    2.327536      3.40     0.0652      0.922844    5.870358
           3 |    1.889234      3.37     0.0665      0.946595    3.770574
---------------------------------------------------------------------------
Test of homogeneity (equal odds): chi2(2)  =     4.98
                                  Pr>chi2  =   0.0830

Score test for trend of odds:     chi2(1)  =     3.57
                                  Pr>chi2  =   0.0588
Interpreting logistic regression model coefficients for continuous variables
When a logistic regression model contains a continuous independent variable, interpretation of the estimated coefficient depends on how it is entered into the model and the particular units of the variable
To interpret the coefficient, we assume that the logit is linear in the variable
The slope coefficient gives the change in the log odds for an increase of “1” unit in x.
Interpreting logistic regression model coefficients for continuous variables
Sometimes a one-unit increase may not be meaningful or considered important.
If instead we are interested in estimating the increased odds for, say, every 5-year increase, we can use the formula:
OR(c) = exp(c*ß1), with 95% CI = exp(c*ß1 ± 1.96*c*SEß1)
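Using the age coefficient and standard error from the earlier logit output, a small Python calculation (illustrative, not part of the Stata session) applies this formula for c = 5:

```python
import math

# Logit coefficient and SE for age (from the earlier logit output)
b, se = -0.0497793, 0.031972
c = 5  # 5-year increase

or_c = math.exp(c * b)                   # ~0.78 per 5 years of age
lo = math.exp(c * b - 1.96 * c * se)     # lower 95% confidence limit
hi = math.exp(c * b + 1.96 * c * se)     # upper 95% confidence limit
```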
P-values for Trend
The term trend generally refers to a monotonic, though not necessarily linear, association between increasing levels of exposure and the probability of the outcome.
Although examining effect estimates and confidence intervals over levels of exposure is the most informative manner of evaluating a dose-response trend, it has been conventional to report p-values for the null hypothesis of no monotonic association between exposure and disease, that is, a test for trend.
The p-value corresponding to the coefficient of a variable entered on an appropriate continuous scale in a regression model can be interpreted as a p-value for trend. For statistical hypothesis testing of trend in stratified analysis, see Rothman and Greenland.
. logit low ptl
Iteration 0:   log likelihood =   -117.336
Iteration 1:   log likelihood = -113.96986
Iteration 2:   log likelihood = -113.94631
Iteration 3:   log likelihood = -113.94631

Logit estimates                                   Number of obs   =        189
                                                  LR chi2(1)      =       6.78
                                                  Prob > chi2     =     0.0092
Log likelihood = -113.94631                       Pseudo R2       =     0.0289

------------------------------------------------------------------------------
         low |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         ptl |   .8018058   .3171535     2.53   0.011     .1801964    1.423415
       _cons |   -.964189   .1749607    -5.51   0.000    -1.307106   -.6212722
------------------------------------------------------------------------------
Interpretation
For every one-unit increase in the number of premature labors, the odds of LBW increase 2.2-fold (= exp(0.801)).
If instead we are interested in estimating the increased odds for every 3 premature labors, we can use the formula:
OR(3) = exp(3 * 0.801) = 11.06
For every increase of 3 additional premature labors, the odds of LBW increase 11.06 times.
Be careful: the validity of such statements may be questionable, since the additional risk of LBW at 7 premature labors may be quite different from that at 3 premature labors, but this is an unavoidable dilemma when continuous variables are modeled linearly in the logit.
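The arithmetic can be checked in Python (illustrative only; the coefficient is taken from the logit output above, and the slide's 11.06 comes from using the rounded coefficient 0.801):

```python
import math

b = 0.8018058  # ptl coefficient from the logit output

or_1 = math.exp(b)       # OR per one additional premature labor, ~2.23
or_3 = math.exp(3 * b)   # OR per three additional premature labors, ~11.08
```

Note that exp(3 * 0.801) = 11.06 as on the slide; the full-precision coefficient gives 11.08.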
Significance testing
Does the model that includes the variable in question tell more about the outcome variable than a model that does not include that variable?
In general you are comparing observed values of the response variable to the predicted values obtained from models with and without the variables in question
The likelihood-ratio test statistic is obtained by multiplying the difference between the two models' log likelihoods by -2.
It follows a chi-square distribution with degrees of freedom equal to the difference between the number of parameters in the 2 models.
. xi: logistic low i.race
i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)

Logistic regression                               Number of obs   =        189
                                                  LR chi2(2)      =       5.01
                                                  Prob > chi2     =     0.0817
Log likelihood = -114.83082                       Pseudo R2       =     0.0214

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Irace_2 |   2.327536   1.078613     1.82   0.068     .9385073    5.772385
    _Irace_3 |   1.889234   .6571342     1.83   0.067     .9554577    3.735597
------------------------------------------------------------------------------

. estimates store M1
LR Test
LR Test

. xi: logistic low i.race lwt
i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)

Logistic regression                               Number of obs   =        189
                                                  LR chi2(3)      =      11.41
                                                  Prob > chi2     =     0.0097
Log likelihood = -111.62955                       Pseudo R2       =     0.0486

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Irace_2 |   2.947821   1.438687     2.22   0.027     1.132586    7.672396
    _Irace_3 |    1.61705   .5767584     1.35   0.178     .8037528    3.253301
         lwt |   .9848922    .006342    -2.36   0.018     .9725401    .9974011
------------------------------------------------------------------------------

. lrtest M1

likelihood-ratio test (LR = -2[L1 - L2])           LR chi2(1)  =      6.40
(Assumption: M1 nested in .)                       Prob > chi2 =    0.0114
LR Test adding LWT to the model
-2 * (-114.83082 - (-111.62955)) = -2 * (-3.20127) = 6.40; P = 0.011
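This arithmetic, including the chi-square p-value on 1 degree of freedom, can be reproduced in Python (illustrative check; the identity P(chi2_1 > x) = erfc(sqrt(x/2)) keeps it within the standard library):

```python
import math

ll_reduced = -114.83082  # log likelihood, model with race only (M1)
ll_full = -111.62955     # log likelihood, model with race + lwt

# Likelihood-ratio statistic: -2 times the difference in log likelihoods
lr = -2 * (ll_reduced - ll_full)         # ~6.40

# Upper-tail chi-square probability with 1 df via the normal complement:
# P(chi2_1 > x) = erfc(sqrt(x / 2))
p = math.erfc(math.sqrt(lr / 2))         # ~0.0114
```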
Wald test
Obtained by comparing the maximum likelihood estimate of the slope parameter, ß1, to an estimate of its standard error:
W = ß1 / SE(ß1)
A z test; the distribution is approximately standard normal.
Gives approximately the same answer as the LR test in large samples, but the two may differ in smaller samples.
The LR test seems to perform better in most situations.
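For example, the Wald z statistic for smoke in the earlier logit output can be reproduced in Python (illustrative check, standard library only):

```python
import math

# smoke coefficient and SE from the earlier logit output
b, se = 0.6918487, 0.3218061

w = b / se                             # Wald z statistic, ~2.15
p = math.erfc(abs(w) / math.sqrt(2))   # two-sided normal p-value, ~0.032
```

These match the z = 2.15 and P>|z| = 0.032 shown in the Stata table.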
Variable selection
Once we have obtained a model that contains essential variables, examine the variables in the model more closely
Then check for interactions. For example, an interaction between smoking status and race would indicate that the slope coefficient for smoking differs by race/ethnicity.
Interaction assessment
Interaction can take many forms.
When interaction is present, the association between the risk factor and the outcome variable differs, or depends in some way, on the level of the covariate.
The covariate modifies the effect of the risk factor.
Consider a model with a dichotomous risk factor and the covariate age. If the association between the covariate (age) and the outcome variable is the same within each level of the risk factor, then there is no interaction between the covariate and the risk factor.
Graphically, no interaction yields a model with parallel lines.
Interaction assessment
The model needs to be hierarchically well formulated (HWF).
Decide whether the effect of each variable varies importantly across race/ethnic categories.
To make this decision, first note the magnitude of the difference between the ORs across strata.
The tests of homogeneity have low power for detecting moderate effect-measure modification. The p-value on this test indicates whether there is adequate statistical power in the data to detect a difference, but a high p-value does not mean there is no effect-measure modification.
In general, the significance level for heterogeneity worth exploring further should be set around 0.20-0.25.
Observe stratum-specific estimates using stratified analysis.
. cc low smoke, by(race)

            RACE |       OR        [95% Conf. Interval]     M-H Weight
-----------------+-------------------------------------------------
               1 |   5.757576     1.657574    25.1388        1.375      (exact)
               2 |        3.3     .4865385    23.45437       .7692308   (exact)
               3 |       1.25      .273495    5.278229       2.089552   (exact)
-----------------+-------------------------------------------------
           Crude |   2.021944     1.029092    3.965864                  (exact)
    M-H combined |   3.086381     1.49074     6.389949
-------------------------------------------------------------------
Test of homogeneity (M-H)       chi2(2) =     3.03   Pr>chi2 = 0.2197

Test that combined OR = 1:
                Mantel-Haenszel chi2(1) =     9.41   Pr>chi2 = 0.0022
Interaction assessment
Use product terms in a multivariable logistic regression model in order to identify potential effect-measure modification (interaction) while adjusting for confounders.
The p-value on the product term can be interpreted as a test of homogeneity.
To model a product term for two continuous variables, a term must be created for the product of the two variables. The product term is entered into the model, along with each of the two variables.
The “xi” command in STATA creates all of the required product terms for modeling interaction if at least one of the two variables is categorical. With this command the two variables do not have to be entered separately in the model because STATA does it for you.
When entering a product term between a categorical and a continuous variable in a logistic regression model, we evaluate whether the entire dose-response of the continuous variable differs across strata of the categorized variable.
. xi: logistic low i.smoke i.race
i.smoke           _Ismoke_0-1         (naturally coded; _Ismoke_0 omitted)
i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)

Logistic regression                               Number of obs   =        189
                                                  LR chi2(3)      =      14.70
                                                  Prob > chi2     =     0.0021
Log likelihood = -109.98736                       Pseudo R2       =     0.0626

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _Ismoke_1 |   3.052631    1.12711     3.02   0.003     1.480433    6.294481
    _Irace_2 |   2.956742   1.448758     2.21   0.027     1.131717    7.724832
    _Irace_3 |   3.030001   1.212926     2.77   0.006     1.382618    6.640233
------------------------------------------------------------------------------
. xi: logistic low i.smoke i.race i.race*i.smoke
i.smoke           _Ismoke_0-1         (naturally coded; _Ismoke_0 omitted)
i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)
i.race*i.smoke    _IracXsmo_#_#       (coded as above)

note: _Irace_2 dropped due to collinearity
note: _Irace_3 dropped due to collinearity
note: _Ismoke_1 dropped due to collinearity

Logistic regression                               Number of obs   =        189
                                                  LR chi2(5)      =      17.85
                                                  Prob > chi2     =     0.0031
Log likelihood = -108.40889                       Pseudo R2       =     0.0761

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _Ismoke_1 |   5.757576   3.444619     2.93   0.003     1.782322    18.59915
    _Irace_2 |   4.545455   3.419404     2.01   0.044     1.040507    19.85682
    _Irace_3 |   5.714286   3.397819     2.93   0.003     1.781648    18.32745
_IracXsm~2_1 |   .5731579   .5916338    -0.54   0.590     .0757938    4.334256
_IracXsm~3_1 |   .2171053   .1916638    -1.73   0.084     .0384784    1.224966
------------------------------------------------------------------------------
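A useful check (here in Python, purely for illustration): multiplying the reference-stratum OR for smoking by each interaction OR recovers the stratum-specific smoke ORs from the earlier cc low smoke, by(race) output (5.76, 3.3, 1.25):

```python
# ORs from the interaction model above
or_smoke_race1 = 5.757576   # _Ismoke_1: smoke OR in the reference (race 1) stratum
or_int_race2 = 0.5731579    # _IracXsmo_2_1
or_int_race3 = 0.2171053    # _IracXsmo_3_1

# Multiplying the reference-stratum OR by the interaction OR
# gives the stratum-specific smoke OR in each race stratum.
or_smoke_race2 = or_smoke_race1 * or_int_race2   # ~3.30
or_smoke_race3 = or_smoke_race1 * or_int_race3   # ~1.25
```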
Interaction assessment

Use a "chunk" test for the entire collection of interaction terms.
Use the LR test comparing the main-effects model with the fuller model.

. lrtest M2 M3

likelihood-ratio test                              LR chi2(2)  =      3.16
(Assumption: M2 nested in M3)                      Prob > chi2 =    0.2063

If the interaction terms are not significant, drop them from the model.
Interaction Assessment
If the interaction term is retained in the model, the estimated ORs for other variables confounded by race [or another modifier] should not be obtained from a model that enters race [or another modifier] in suboptimal form for the purpose of obtaining stratum-specific estimates.
In an analysis that aims to estimate effects of several variables, we may use several different models to estimate the effects of interest. In this case, our goal is not the elaboration of a “final model”.
Confounding Assessment
A confounder is a covariate that is associated both with the outcome variable of interest and with the primary independent variable (risk factor) of interest, but is not an intermediate variable in the causal pathway.
When both variables are present, the relationship is said to be confounded.
Assessment of confounding is only appropriate when there is no interaction.
Decisions regarding whether or not to adjust for potential confounding variables will depend on a combined assessment of prior knowledge, observed associations in the data, and sample size considerations.
Confounding Assessment

In practice we look for empirical evidence of confounding in data obtained from study populations. However, we must keep in mind that what we observe in such data may reflect selection and information bias affecting the observed confounder-disease-exposure associations in a similar way to how these biases affect the observed exposure-disease associations.
Therefore, it is necessary to rely on prior knowledge of relevant associations in source populations. For example, if a variable is a known confounder but does not appear to be one in the data, this should create uncertainty regarding the validity of the data. Adjusting for this factor may not change the relative risk point estimate, but it may influence the standard error for this estimate, thus appropriately reflecting our uncertainty.
If prior knowledge suggests a variable should not be a confounder but it appears to be one in the data, the confounding may have been introduced by the study methods (e.g., as a result of matching in a case-control design). In this case it would be appropriate to adjust for this factor, provided it is not an intervening (intermediate) variable.
Confounding Assessment

When prior knowledge regarding exposure-covariate associations is insufficient and the number of covariates to consider is small, it may be desirable to adjust for all variables that appear to be important risk factors for the outcome, as long as they are not intervening variables.
When a large number of potential confounders must be considered, the change-in-estimate variable selection strategy has been shown in simulation studies to produce more valid results for confounder detection than strategies that rely on p-values, unless the significance level for the p-value is raised to 0.2 or higher. There is more than one reasonable approach to variable selection in such situations; the important thing is that the criterion for selection should be explicit and consistently applied.
When examining continuous variables as potential confounders, careful assessment of the dose-response for the purpose of choosing the optimal scale or categorization should be carried out prior to the assessment of confounding. Inappropriate modeling of continuous variables may lead to incorrect decisions about whether or not to adjust for these variables and/or to inadequate control of confounding.
Confounding Assessment
Begin with a model containing the exposure-disease relationship. For example, suppose we are interested in the association between the number of premature labors and LBW:
. xi: logistic low i.ptl
i.ptl             _Iptl_0-3           (naturally coded; _Iptl_0 omitted)
note: _Iptl_3 != 0 predicts failure perfectly
      _Iptl_3 dropped and 1 obs not used

Logistic regression                               Number of obs   =        188
                                                  LR chi2(2)      =      15.12
                                                  Prob > chi2     =     0.0005
Log likelihood = -109.39993                       Pseudo R2       =     0.0646

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     _Iptl_1 |   5.756098   2.702085     3.73   0.000     2.293763    14.44467
     _Iptl_2 |   1.918699   1.785729     0.70   0.484     .3095962    11.89099
------------------------------------------------------------------------------
Confounding Assessment
. xi: logistic low i.smoke
i.smoke           _Ismoke_0-1         (naturally coded; _Ismoke_0 omitted)

Logistic regression                               Number of obs   =        189
                                                  LR chi2(1)      =       4.87
                                                  Prob > chi2     =     0.0274
Log likelihood = -114.9023                        Pseudo R2       =     0.0207

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _Ismoke_1 |   2.021944   .6462912     2.20   0.028     1.080668    3.783083
------------------------------------------------------------------------------
Confounding Assessment
. xi: logistic low i.smoke i.ht
i.smoke           _Ismoke_0-1         (naturally coded; _Ismoke_0 omitted)
i.ht              _Iht_0-1            (naturally coded; _Iht_0 omitted)

Logistic regression                               Number of obs   =        189
                                                  LR chi2(2)      =       8.88
                                                  Prob > chi2     =     0.0118
Log likelihood = -112.89597                       Pseudo R2       =     0.0378

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _Ismoke_1 |   2.037802   .6595424     2.20   0.028     1.080606    3.842878
      _Iht_1 |   3.421389   2.113223     1.99   0.046     1.019665    11.48015
------------------------------------------------------------------------------
Confounding Assessment

Examine the OR for the main risk factor to determine whether it is meaningfully different from the OR obtained when controlling for the potential confounder.
For the previous examples the ORs are 2.02 vs. 2.04.
Can also examine using the LR test.
May want to include the confounder if others would not trust the results without the adjustment.
May want to use a decision rule, depending on the subject matter (e.g., a 10% change in the OR), and report the criterion.
Several authors have suggested a backward-deletion strategy whereby you enter all main effects and confounders in the model and then eliminate, one by one, the confounder that makes the smallest difference in the exposure-effect estimate.
Biases resulting from multiple confounders may cancel each other out or produce results that are not easy to disentangle.
Model Checking
When should we run diagnostics, and why? This depends on what we want the model to do
If our goal is to obtain approximately valid summary estimates of effect for a few key relationships, less rigorous checking is required
If our goal is to predict outcomes for a given set of factors, more detailed checking is required
Model Checking
Assuming the model contains those variables that should be in the model, we want to know how effectively the model describes the data (goodness of fit).
A good-fitting model is not the same as a correct model.
Regression diagnostics can detect discrepancies between a model and data only within the range of the data, and only if there are enough observations for adequate diagnostic power.
A model may appear to fit well in the central range of the data, but produce poor predictions for covariate values that are not well represented in the data.
Model Checking
Computation and evaluation of overall measures Examination of individual components of the
summary statistics, often graphically Examination of other measures of the difference
between components of the observed values versus the predicted values
Model Checking
Check model results against results from stratified analysis
Log-likelihood Ratio Test (also called Deviance Test)
Tests of regression: test the hypothesis that all the regression coefficients (except the intercept) are zero.
The R2 is not recommended for use; it may give a distorted impression.
Model Checking

Tests of fit (see R&G, pp. 409-10 for test details)
Tests for nonrandom incompatibilities between a model and the data.
Compare the fit of an index model to a more elaborate reference model that contains it.
A small p-value suggests that the index model fits poorly relative to the reference model, that is, that the additional terms in the reference model improve the fit.
A large p-value does not mean that the index model fits well, only that the test did not detect an improvement in fit from the additional terms in the reference model.
Pearson residual (examining (observed - predicted)/SE)
Deviance residual
Hosmer-Lemeshow goodness-of-fit statistic: a grouping-based method based on values of the estimated probabilities.
Classification tables: the table is the result of cross-classifying the outcome variable with a dichotomous variable whose values are derived from the estimated logistic probabilities.
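A minimal sketch of the Pearson residual for one covariate pattern, in Python with made-up numbers (the function name and data here are hypothetical, for illustration only):

```python
import math

def pearson_residual(y, n, p):
    """(observed - expected) / binomial SE for one covariate pattern
    with n observations, y events, and fitted probability p."""
    return (y - n * p) / math.sqrt(n * p * (1 - p))

# Toy pattern: 20 observations, 8 events, fitted probability 0.3
# (expected events = 6), so the residual is modestly positive.
r = pearson_residual(y=8, n=20, p=0.3)   # ~0.976
```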
Goodness of fit

. xi: logistic low smoke age i.race ptl ht

Logistic regression                               Number of obs   =        189
                                                  LR chi2(6)      =      24.86
                                                  Prob > chi2     =     0.0004
Log likelihood = -104.90441                       Pseudo R2       =     0.1059

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       smoke |   2.645521   1.022067     2.52   0.012     1.240679    5.641093
         age |   .9541789   .0337091    -1.33   0.184     .8903458    1.022589
    _Irace_2 |   2.648104   1.327578     1.94   0.052     .9912907    7.074066
    _Irace_3 |   2.778709   1.164658     2.44   0.015     1.222007    6.318482
         ptl |   2.117295   .6982914     2.27   0.023     1.109308    4.041203
          ht |   3.413581   2.132187     1.97   0.049     1.003537    11.61146
------------------------------------------------------------------------------
Goodness of fit - Pearson

. lfit

Logistic model for low, goodness-of-fit test

       number of observations =       189
 number of covariate patterns =       108
            Pearson chi2(101) =    104.23
                  Prob > chi2 =    0.3929
Goodness of fit - Hosmer-Lemeshow

. lfit, group(10)

Logistic model for low, goodness-of-fit test

  (Table collapsed on quantiles of estimated probabilities)

       number of observations =       189
             number of groups =        10
      Hosmer-Lemeshow chi2(8) =      6.92
                  Prob > chi2 =    0.5456
Multicollinearity

Occurs when one or more independent variables can be approximately determined by some other variables in the model.
When there is multicollinearity, the estimated regression coefficients of the fitted model can be highly unreliable.
Multiple testing
Occurs from the many tests of significance that are typically carried out when selecting or eliminating variables from the model
Makes it more likely to obtain statistical significance even if no real association exists.
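A quick illustrative computation (assuming independent tests, which is an idealization): with 10 tests each at the 0.05 level, the chance of at least one spurious "significant" result is about 40%:

```python
# Probability of at least one false positive across k independent
# tests, each at significance level alpha (illustrative numbers).
alpha, k = 0.05, 10
p_any = 1 - (1 - alpha) ** k   # ~0.40
```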
Influential observations
Refers to data on individuals that may have a large influence on the estimated regression coefficients
Methods for assessing the possibility of influential observations should be considered
** Let's look at some logistic regression diagnostics.

predict pprob          /* predicted probabilities */
predict r, resid       /* Pearson residuals */
predict h, hat         /* leverage */
predict db, dbeta      /* Pregibon dbeta */
predict dx2, dx2       /* Hosmer & Lemeshow influence */

scatter h r, xline(0) msym(Oh) jitter(2)