REGRESSION MODELS AND POLYTOMOUS VARIABLES Joel Mefford [email protected] 03/02/2012.
POLYTOMOUS EXPOSURES AND OUTCOMES
Rothman, Greenland, and Lash: Ch. 17.
Polytomous Exposures and Outcomes
Categorization of continuous variables into polytomous categorical variables has the risks we have already discussed in the context of creating dichotomous variables from continuous variables, and in the context of misclassification bias:
• Biologically meaningful categories
• Residual confounding, especially if wide ranges of continuous measurements are lumped together into single categories
• Sparse or empty categories can make analysis difficult
• Modeling, describing, or adjusting for misclassification become complex problems
Polytomous Exposures and Outcomes
Tabular analyses using polytomous variables
• Conduct a series of pair-wise analyses using methods for dichotomous variables
• Global tests for independence or trends
• Graphical analyses
• Move on to regression
Polytomous Exposures and Outcomes
Fruit and vegetable servings per day: observed counts
            <=2     >2 to <=4   >4 to <=6   >6      total
cases       49      125         136         178     488
controls    28      111         140         209     488
total       77      236         276         387     976
Expected counts under independence, given the margins (row total * column total / N)
            <=2     >2 to <=4   >4 to <=6   >6      total
cases       38.5    118         138         193.5   488
controls    38.5    118         138         193.5   488
total       77      236         276         387     976
Polytomous Exposures and Outcomes
Contributions to the chi-square statistic, (observed - expected)^2 / expected
            <=2     >2 to <=4   >4 to <=6   >6
cases       2.864   0.415       0.029       1.242
controls    2.864   0.415       0.029       1.242
df (2-1)(4-1) = 3
P-value 0.003
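The expected counts and per-cell contributions above can be reproduced with a short sketch. It uses only the Python standard library; the function names are illustrative, and the closed-form tail probability is valid only for 3 degrees of freedom. Computed from the counts as transcribed here, the Pearson statistic is about 9.10, whose upper-tail probability on 3 df is about 0.03. The 0.003 on the slide may come from a different test (for example a trend test) or from a transcription slip; the transcript alone does not settle which.

```python
import math

# Observed counts from the table above (rows: cases, controls).
observed = [
    [49, 125, 136, 178],
    [28, 111, 140, 209],
]

def pearson_chi_square(table):
    """Return (statistic, df, expected counts) for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / N.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df, expected

def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variable with exactly 3 df."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

stat, df, expected = pearson_chi_square(observed)
p_value = chi2_sf_3df(stat)
print(round(stat, 2), df, round(p_value, 3))
```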
Polytomous Exposures and Outcomes
GWAS looking at relapse or relapse-free survival after chemotherapy with (busulfan + etoposide) and autologous bone-marrow transplantation (BMT) for AML.
• 314 AML patients who had chemotherapy and autologous BMT
• 78 patients relapsed within 12 months
• 199 patients did not relapse within 12 months
• 37 lost to follow-up (missing data)
Polytomous Exposures and Outcomes
Analysis 1:
Cox proportional hazards model:
Time = months of relapse-free survival after transplantation
Event = relapse
Parameter of interest: hazard ratio associated with the addition of a minor allele at a particular SNP
Dataset = all subjects
Adjustment covariates:
• 10 PCs to adjust for ancestry/relatedness
• 2 clinical prognostic scores
Polytomous Exposures and Outcomes
Analysis 2:
Trend test to look for an association between the number of minor alleles and the fraction of subjects who had a relapse of their leukemia within 12 months of transplantation
Polytomous Exposures and Outcomes
The “top hits” from the two analyses (the SNPs with the lowest p-values, i.e. the most suggestive results) are highly overlapping sets, although their rank orderings differ.
The results from the survival analyses with the adjustment covariates are most interesting going forward, but the simple trend tests did capture some of the same information.
Polytomous Exposures and Outcomes
status\genotype   AA    AB    BB    total
case              N11   N12   N13   R1
control           N21   N22   N23   R2
total             C1    C2    C3    N
T = sum_{column i} [ wi * (N1i*R2 - N2i*R1) ]
Under the null (no association): E[T] = 0
There is a formula for V[T]
T/sqrt(V[T]) -> N(0,1)
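A minimal sketch of this trend test follows. The slide does not give the formula for V[T]; the code assumes one commonly used binomial-sampling form, V[T] = (R1*R2/N) * [ N * sum_i(wi^2 * Ci) - (sum_i(wi * Ci))^2 ], and other versions (for example with N-1 in place of N) differ slightly. The genotype counts and the additive weights (0, 1, 2) are hypothetical.

```python
import math

def trend_test(cases, controls, weights):
    """Trend statistic for a 2 x k table, in the notation of the slide.

    T = sum_i w_i * (N1i*R2 - N2i*R1). The variance is an assumed
    binomial-sampling form; it is not taken from the slide.
    """
    r1, r2 = sum(cases), sum(controls)
    cols = [n1 + n2 for n1, n2 in zip(cases, controls)]
    n = r1 + r2
    t = sum(w * (n1 * r2 - n2 * r1) for w, n1, n2 in zip(weights, cases, controls))
    sum_wc = sum(w * c for w, c in zip(weights, cols))
    sum_w2c = sum(w * w * c for w, c in zip(weights, cols))
    var_t = (r1 * r2 / n) * (n * sum_w2c - sum_wc ** 2)
    z = t / math.sqrt(var_t)
    # Two-sided p-value from the standard normal approximation.
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return t, var_t, z, p_two_sided

# Hypothetical genotype counts (AA, AB, BB) and additive scores.
cases = [30, 40, 30]
controls = [50, 40, 10]
weights = [0, 1, 2]

t, var_t, z, p = trend_test(cases, controls, weights)
print(t, var_t, round(z, 3), p)
```

With weights (0, 1, 2) the statistic tests for a per-allele (additive) trend; other weights, such as (0, 1, 1), would target a dominant pattern instead.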
REGRESSION TOPICS
Rothman, Greenland, and Lash: Ch. 20.
Regression
Why use regression models? How about stratified analyses of tabular data?
• Control for confounding
• Assess effect modification
• Summarize disease association of several predictor variables, e.g. OR_MH
• Model-free
• Assumption: homogeneity within each stratum
Regression
Limitations of Stratification
• Adjustment only for categorical covariates
• Categorization of continuous variables: loss of information; residual confounding
• Sparse data
• Inefficiency
Regression
What are regression functions? E[Y|X] or g(E[Y|X])
• Y is the outcome variable
• X is the predictor or a vector of predictors
• g() is a transformation or “link function”
Need to define Y, X, and the population over which the expectation or average is taken:
• target population
• source population
• sample
Regression
Generally we assume that:
• the function E[Y|X] has a particular form
• the errors, the differences between actual observations and their expected values, have particular properties:
  - independence
  - mean = 0
  - a specified distribution
We may make other assumptions
These assumptions form the “model”.
Regression
There are regression models designed for use with many types of outcome variables and explanatory variables
• Continuous variables
• Indicator variables
• Unordered polytomous variables
• Ordinal variables
• …
Regression
There seems to be a relationship between two variables
Regression: E[Y | X] ?
Regression
E[Y | X = xi] for xi strata of X
Regression
Assume a linear relationship between X and Y (model)
Regression
A regression model may summarize some aspect of the relationship between variables without completely describing the relationship
Regression
Continuous explanatory variable with categorical outcome variable:
Regression
We could use a linear model for a dichotomous outcome: linear risk model
E[1{outcome=1}] = Pr(outcome = 1)
Regression
We could use a linear model for a dichotomous exposure and continuous outcome:
Intervention Effects and Regression
Intervention effects:
E[Y | set(X=x1), Z=z] - E[Y | set(X=x0), Z=z]
E[Y | set(X=x1), Z=z] / E[Y | set(X=x0), Z=z]
where the expectation is over the target population
In practice, what we can calculate with standard regression analysis is:
Ave(Y | X=x1, Z=z) - Ave(Y | X=x0, Z=z')
Ave(Y | X=x1, Z=z) / Ave(Y | X=x0, Z=z')
or equivalently:
E[Y | X=x1, Z=z] - E[Y | X=x0, Z=z']
E[Y | X=x1, Z=z] / E[Y | X=x0, Z=z']
where the expectation is over the sample
Intervention Effects and Regression
If we want to use the regression association measures as estimates of the potential intervention effects, we need to assume:
E[Y | X=x, Z=z] = E[Y | set(X=x), Z=z]
No Confounding Assumption
“no residual confounding of X and Y given Z”
Intervention Effects and Regression
Regression standardization
E[Y | X=x, Z=z]
Different values of Z correspond to different strata in which you may consider the Y~X association.
You can define an overall measure of the Y~X association by taking a weighted average over the different strata or levels of Z, resulting in a marginal or population-averaged effect:
EW[Y | X=x] = Σ{z in Z}( w(z) * E[Y | X=x, Z=z] )
Different choices for weights w(z):
• w(z) = proportion of Z=z in the source population...
• or in a different target population
• or in a standard population
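The weighted average EW[Y | X=x] can be illustrated with a small sketch. The stratum labels, stratum-specific risks, and weights are hypothetical numbers chosen only to show the arithmetic; in practice the stratum-specific values would come from a fitted regression model.

```python
def standardized_mean(stratum_means, weights):
    """Weighted average over strata z of E[Y | X=x, Z=z], with weights w(z)."""
    total = sum(weights.values())
    return sum(weights[z] * stratum_means[z] for z in stratum_means) / total

# Hypothetical stratum-specific risks E[Y | X=x, Z=z] for two levels of Z.
risk_exposed = {"young": 0.10, "old": 0.30}
risk_unexposed = {"young": 0.05, "old": 0.20}

# Hypothetical weights w(z): proportion of each stratum in a standard population.
w = {"young": 0.6, "old": 0.4}

std_exposed = standardized_mean(risk_exposed, w)
std_unexposed = standardized_mean(risk_unexposed, w)
std_difference = std_exposed - std_unexposed
std_ratio = std_exposed / std_unexposed
print(std_exposed, std_unexposed, std_difference, std_ratio)
```

Changing `w` to the stratum proportions of a different population changes the standardized measures even though the stratum-specific risks stay the same.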
Model Specification and Model Fitting
Specification: What is the functional FORM of the relationship between Y and X?
E[Y | X0, X1] = a + b0*X0 + b1*X1
Fitting: Using data to estimate the various constants in the generic functional form of a model.
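As a sketch of what fitting means, the following estimates the constants of a specified linear form by ordinary least squares. To keep the arithmetic in closed form it uses a single predictor, E[Y | X] = a + b*X, rather than the two-predictor form above; the data points are hypothetical.

```python
def fit_simple_linear(xs, ys):
    """Least-squares estimates (a, b) for the model E[Y | X] = a + b*X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

# Hypothetical observations.
xs = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

a_hat, b_hat = fit_simple_linear(xs, ys)
print(a_hat, b_hat)
```

The specification step fixed the form a + b*X before any data were seen; the fitting step only chose the values of a and b.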
Building Blocks
vacuous models: E[Y] = a
constant models: E[Y | X] = a
linear models: E[Y | X0, X1] = a + b0*X0 + b1*X1
Is E[Y | X0, X1] = a + b0*X0 + b1*(X0^2) a linear model?
exponential models:
E[Y|X] = exp(a + bX) = exp(a)*exp(bX)
log(E[Y|X]) = a + bX
more generally: g(E[Y|X]) = a + bX
Variable Transformations
Transformations:
covariates:
• reduce leverage of outlying covariate values
• change units of effect estimates
outcome variables:
• change scale of model (e.g. loglinear models)
• make outcome distribution more “Normal” (t-tests)
Variable Transformations
Millns et al (1995) Is it necessary to transform nutrient variables prior to statistical analysis? AJ Epi 141(3):251-262
Outcome transformations vs. generalized linear models
Outcome variables may be transformed:
accelerated failure time model (eq. 20-18, Rothman et al page 396)
E[ln(Y)] = α + β1X1
Instead of transforming an outcome variable to account for features of its distribution and then using linear regression, we may use alternatives to linear regression that can accommodate special aspects of the distribution of Y.
Namely, the variance of Y may be constrained by the expected value of Y.
Linear Regression: Y continuous
Var[Y|X] independent of E[Y|X]
E[Y|X] = α + β1X1 + β2X2 + … + βkXk
Logistic Regression: Y dichotomous
Var[Y|X] = E[Y|X] * (1 - E[Y|X]), where E[Y|X] = Pr(Y=1|X)
Log(odds) = α + β1X1 + β2X2 + … + βkXk
Poisson: Y count
Var[count|X] = E[count|X]
Log(rate) = α + β1X1 + β2X2 + … + βkXk
Generalized Linear Models
A broad class of models (including linear, logistic, and Poisson regression):
• The distribution of the outcome Y has a special form: “Exponential dispersion family”
• There is a linear model for a transformed version of the expected value of Y, a “mean function”:
  g(E[Y|X]) = Xβ
  where g() is a “link function”
• The variance of Y can be expressed as a function of the expected value of Y:
  Var(Y|X) = V(g^-1(Xβ))
• There are general methods to solve many forms of these models and extensions of these models
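The link, inverse link, and variance function for the three models named above can be written out directly. This is an illustrative sketch, not a fitting routine: the dictionary keys and the helper name are made up for the example, and the variance functions are stated up to a dispersion constant (for the linear model that constant is the residual variance).

```python
import math

# Each entry: link g, inverse link g^-1, and variance function V(mu).
FAMILIES = {
    "linear": {          # Gaussian outcome, identity link
        "link": lambda mu: mu,
        "inverse_link": lambda eta: eta,
        "variance": lambda mu: 1.0,            # constant: independent of the mean
    },
    "logistic": {        # binomial outcome, logit link
        "link": lambda mu: math.log(mu / (1 - mu)),
        "inverse_link": lambda eta: 1 / (1 + math.exp(-eta)),
        "variance": lambda mu: mu * (1 - mu),
    },
    "poisson": {         # count outcome, log link
        "link": math.log,
        "inverse_link": math.exp,
        "variance": lambda mu: mu,
    },
}

def mean_and_variance(family, eta):
    """Map a linear predictor eta = X*beta to E[Y|X] and V(E[Y|X])."""
    fam = FAMILIES[family]
    mu = fam["inverse_link"](eta)
    return mu, fam["variance"](mu)

for name in FAMILIES:
    print(name, mean_and_variance(name, 0.5))
```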
Generalized Linear Models in Stata
For example: logistic regression is Family = binomial, link = logit. Choosing the Family here specifies the probability model for Y|X, and thus the mean and variance functions
Logistic Regression
If we use a logistic model, we do not have the problem of suggesting risks greater than 1 or less than 0 for some values of X:
E[1{outcome = 1}] = exp(a+bX) / [1 + exp(a+bX)]
Logistic Regression
The logistic model is a linear model, on a different scale than the linear risk model:
log( Pr(outcome=1) / [1 - Pr(outcome=1)] ) = a + bX
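The contrast between the two scales can be checked numerically. In this sketch the coefficients a and b and the X values are hypothetical; the linear risk model is taken to be Pr(outcome = 1) = a + bX, which is an assumption about the form intended on the earlier slide.

```python
import math

def linear_risk(a, b, x):
    """Linear risk model: risk modeled directly as a + b*x (can leave [0, 1])."""
    return a + b * x

def logistic_risk(a, b, x):
    """Logistic model: exp(a+bx) / (1 + exp(a+bx)), always strictly between 0 and 1."""
    return 1 / (1 + math.exp(-(a + b * x)))

def logit(p):
    """Log-odds, the scale on which the logistic model is linear."""
    return math.log(p / (1 - p))

a, b = 0.1, 0.2                      # hypothetical coefficients
xs = [-5, 0, 5, 10]                  # hypothetical values of X

linear_values = [linear_risk(a, b, x) for x in xs]
logistic_values = [logistic_risk(a, b, x) for x in xs]
print(linear_values)                 # includes values below 0 and above 1
print(logistic_values)               # all strictly between 0 and 1
print(logit(logistic_risk(a, b, 3)), a + b * 3)   # the same number on the log-odds scale
```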
Extensions to Logistic Regression
More than 2 outcome categories: unordered categories
* polytomous logistic model = multinomial logistic model
* one category is designated the reference category, y0
* for each alternative category, yi, there is a
  - linear model for the log-odds of outcome yi vs. y0
  - log-odds(Y=yi | X=x) = ai + bi*x
  - Odds(Y=yi | X=x) / Odds(Y=yi | X=x*) = exp( (x-x*)*bi )
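A short sketch of the polytomous model shows how the category probabilities follow from the per-category linear models, and that the odds ratio for yi vs. y0 depends only on (x - x*) and bi. The category names and coefficient values are hypothetical.

```python
import math

def multinomial_probs(x, coefs):
    """Category probabilities under a polytomous logistic model.

    coefs maps each non-reference category yi to (ai, bi), where
    log-odds(Y=yi vs. y0 | X=x) = ai + bi*x. The reference category is "y0".
    """
    odds = {cat: math.exp(a + b * x) for cat, (a, b) in coefs.items()}
    denom = 1 + sum(odds.values())
    probs = {"y0": 1 / denom}
    probs.update({cat: o / denom for cat, o in odds.items()})
    return probs

# Hypothetical coefficients (ai, bi) for two non-reference categories.
coefs = {"y1": (-1.0, 0.5), "y2": (-2.0, 1.2)}

p_low = multinomial_probs(1.0, coefs)    # x* = 1
p_high = multinomial_probs(3.0, coefs)   # x  = 3

# Odds of y1 vs. y0 at each x, and their ratio.
odds_low = p_low["y1"] / p_low["y0"]
odds_high = p_high["y1"] / p_high["y0"]
odds_ratio = odds_high / odds_low
print(odds_ratio, math.exp((3.0 - 1.0) * 0.5))
```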
Extensions to Logistic Regression
Clarke et al (2008) Mobility disability and the urban built environment. AJ Epi 168(5)
Extensions to Logistic Regression
More than 2 outcome categories: ordered categories
* y0 < y1 < y2
* various models possible
* cumulative odds = proportional odds model available in Stata
Pr(Y > yi | X=x) / Pr(Y <= yi | X=x) = exp(ai + bx) = exp(ai) * exp(bx)
so a unit increase in x will multiply
Pr(Y > y0 | X=x) / Pr(Y <= y0 | X=x)
and
Pr(Y > y1 | X=x) / Pr(Y <= y1 | X=x)
by the same factor: exp(b)
thus the name “proportional odds”
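The proportional odds property can be verified with a few lines of arithmetic. The cutpoint intercepts a0, a1 and the slope b below are hypothetical; the intercepts must decrease (a0 > a1) so that Pr(Y > y0) exceeds Pr(Y > y1).

```python
import math

def cumulative_odds(x, a_i, b):
    """Pr(Y > yi | X=x) / Pr(Y <= yi | X=x) = exp(ai + b*x)."""
    return math.exp(a_i + b * x)

def category_probs(x, intercepts, b):
    """Probabilities of y0, y1, ..., from the cumulative odds at each cutpoint."""
    exceed = [o / (1 + o) for o in (cumulative_odds(x, a, b) for a in intercepts)]
    bounds = [1.0] + exceed + [0.0]     # Pr(Y > "below y0") = 1, Pr(Y > top) = 0
    return [bounds[i] - bounds[i + 1] for i in range(len(bounds) - 1)]

intercepts = [0.5, -1.0]   # hypothetical a0, a1
b = 0.7                    # hypothetical common slope

# A unit increase in x multiplies the cumulative odds at every cutpoint by exp(b).
ratios = [cumulative_odds(2.0, a, b) / cumulative_odds(1.0, a, b) for a in intercepts]
probs = category_probs(1.0, intercepts, b)
print(ratios, math.exp(b))
print(probs)
```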
Extensions to Logistic Regression
Ordinal outcomes:
Poisson Regression
We can have counts as an outcome
A Poisson distribution can model the number of independent events occurring per unit of observation when the expected number of events per unit of observation is λ
Prob(counts Y = k) = λ^k * exp(-λ) / k!
Poisson Regression
Think of a linear model for the log of the number of counts per unit of observation:
log( E[Y | X=x] / units of observation ) = a + bx
=> log( E[Y | X=x] ) - log(units of observation) = a + bx
=> log( E[Y | X=x] ) = log(units of observation) + a + bx
Note that log(units of observation) enters the model like a covariate, but its coefficient is fixed at 1. To distinguish factors like this from other covariates, it is called an offset.
For equal units of observation: E[Y|X=1] / E[Y|X=0] = exp(b)
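The role of the offset can be seen by computing expected counts for groups with different amounts of observation. The coefficients and person-time totals in this sketch are hypothetical.

```python
import math

def expected_count(units, x, a, b):
    """E[Y | X=x] under log(E[Y|X=x]) = log(units of observation) + a + b*x."""
    return math.exp(math.log(units) + a + b * x)

a, b = -6.0, 0.4            # hypothetical coefficients

# Hypothetical groups with different amounts of observation (e.g. person-years).
unexposed = expected_count(10000, 0, a, b)
exposed = expected_count(2500, 1, a, b)

# Dividing by the units of observation recovers rates; their ratio is exp(b).
rate_ratio = (exposed / 2500) / (unexposed / 10000)

# With equal units of observation, the ratio of expected counts is also exp(b).
count_ratio = expected_count(1000, 1, a, b) / expected_count(1000, 0, a, b)
print(rate_ratio, count_ratio, math.exp(b))
```

Because the coefficient on the offset is fixed at 1, doubling the units of observation doubles the expected count without changing the rate.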
Poisson Regression
With a Poisson distribution, the expected number of counts is equal to the variance of the number of counts
Typically the model is applied to aggregated units of observation (groups or strata) for which total counts, total units of observation, and group-level covariates are recorded
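The equality of mean and variance can be checked numerically from the Poisson probabilities, Prob(Y = k) = λ^k * exp(-λ) / k!. The value of λ is hypothetical, and the sums are truncated at k = 59, beyond which the probabilities are negligible for this λ.

```python
import math

def poisson_pmf(k, lam):
    """Prob(Y = k) = lam^k * exp(-lam) / k!"""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 3.5                      # hypothetical expected count per unit of observation
ks = range(60)                 # truncation point; the remaining tail is negligible here

total = sum(poisson_pmf(k, lam) for k in ks)
mean = sum(k * poisson_pmf(k, lam) for k in ks)
variance = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in ks)
print(total, mean, variance)
```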
Poisson Regression
Johnson et al (2005) Geographic Prediction of Human Onset of West Nile Virus Using Dead Crow Clusters: An Evaluation of Year 2002 Data in New York State. AJ Epi 163(2):171-181.
Poisson Regression
Collapsing covariates into group-level covariates can introduce bias and lose information
Ungrouped Poisson Regression methods have been developed:
• use individual, time-varying covariate information
• estimate effects of covariates on rates
Parametric proportional hazards models with exponential event distribution
Poisson Regression
Loomis et al (2005) Poisson regression analysis of ungrouped data. Occup Environ Med 62:325-329.
Further Extensions for Regression
Typically it is assumed that observations are independent
Generalized Estimating Equations
• Do not necessarily assume that observations are independent, but if not, a particular correlation structure for the observations is generally assumed:
  - Independent: special case
  - Block-independent: observations are partitioned into blocks. Observations within a block are correlated, say with a fixed correlation r
  - Autoregressive: observations within a block correspond to sequential observations. Consecutive observations are more closely correlated than non-consecutive observations
• GEEs are useful in cases such as longitudinal studies where each subject has a set of observations, or studies where several families are studied together and blocks represent families
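The three correlation structures can be written as within-block correlation matrices. This sketch only builds the matrices; it does not fit a GEE. The autoregressive case is assumed to be first-order, with correlation r^|i-j| between observations i and j, which is one common choice rather than something stated on the slide. Block size and r are hypothetical.

```python
def independence_corr(size):
    """Independent observations: identity correlation matrix."""
    return [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]

def block_corr(size, r):
    """Fixed correlation r between every pair of observations in a block."""
    return [[1.0 if i == j else r for j in range(size)] for i in range(size)]

def autoregressive_corr(size, r):
    """Assumed first-order autoregressive form: correlation r^|i-j|."""
    return [[r ** abs(i - j) for j in range(size)] for i in range(size)]

size, r = 4, 0.5               # hypothetical block size and correlation
for name, matrix in [
    ("independent", independence_corr(size)),
    ("block", block_corr(size, r)),
    ("autoregressive", autoregressive_corr(size, r)),
]:
    print(name)
    for row in matrix:
        print("  ", row)
```

In the autoregressive matrix the correlation falls from 0.5 for consecutive observations to 0.125 for observations three steps apart, while the block structure keeps it at 0.5 throughout.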
Non-independent observations may also be studied using hierarchical models where:
Observations are drawn from blocks with some distribution that depends on some covariates
Blocks are drawn from some higher level model that may depend on other covariates