REGRESSION MODELS AND POLYTOMOUS VARIABLES. Joel Mefford, 03/02/2012.


POLYTOMOUS EXPOSURES AND OUTCOMES

Rothman, Greenland, and Lash: Ch. 17.

Polytomous Exposures and Outcomes

Categorization of continuous variables into polytomous categorical variables carries the risks we have already discussed in the context of creating dichotomous variables from continuous variables and misclassification bias:

Biologically meaningful categories
Residual confounding, especially if wide ranges of continuous measurements are lumped together into single categories
Sparse or empty categories can make analysis difficult
Modeling, describing, or adjusting for misclassification become complex problems


Tabular analyses using polytomous variables

Conduct a series of pair-wise analyses using methods for dichotomous variables
Global tests for independence or trends
Graphical analyses
Move on to regression


Fruit and vegetable servings per day (observed counts)

            <=2    >2 to <=4    >4 to <=6    >6     total
cases        49       125          136       178     488
controls     28       111          140       209     488
total        77       236          276       387     976



Expected counts under independence (row total x column total / grand total):

            <=2    >2 to <=4    >4 to <=6    >6     total
cases       38.5      118          138      193.5    488
controls    38.5      118          138      193.5    488
total        77       236          276       387     976


Contribution of each cell to the chi-square statistic, (observed - expected)^2 / expected:

            <=2      >2 to <=4    >4 to <=6    >6
cases       2.864      0.415        0.029      1.242
controls    2.864      0.415        0.029      1.242

Chi-square statistic = sum of the cell contributions = 9.10

df = (2-1)(4-1) = 3

P-value 0.028
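The global test of independence for the fruit-and-vegetable table can be retraced step by step. The sketch below is only an illustration using the counts from the slides; the variable names are made up here, and the closed-form tail probability is valid only for exactly 3 degrees of freedom.

```python
import math

# Observed counts from the fruit-and-vegetable table.
observed = [
    [49, 125, 136, 178],   # cases
    [28, 111, 140, 209],   # controls
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Cell contributions (O - E)^2 / E, summed to give the Pearson statistic.
contributions = [
    [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]
chi2 = sum(sum(row) for row in contributions)
df = (len(observed) - 1) * (len(observed[0]) - 1)


def chi2_upper_tail_3df(x):
    """Upper-tail probability of a chi-square variable with exactly 3 df."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


p_value = chi2_upper_tail_3df(chi2)
print("expected (cases):", expected[0])
print("chi-square:", round(chi2, 3), "df:", df, "p:", round(p_value, 4))
```

The expected counts and cell contributions printed here should match the tables on the preceding slides.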


GWAS looking at relapse or relapse-free survival after chemotherapy (busulfan + etoposide) and autologous bone-marrow transplantation (BMT) for AML.

• 314 AML patients who had chemotherapy and autologous BMT
• 78 patients relapsed within 12 months
• 199 patients did not relapse within 12 months
• 37 lost to follow-up (missing data)


Analysis 1:

Cox proportional hazards model:
Time = months of relapse-free survival after transplantation
Event = relapse
Parameter of interest: hazard ratio associated with the addition of a minor allele at a particular SNP
Dataset = all subjects
Adjustment covariates:
10 PCs to adjust for ancestry/relatedness
2 clinical prognostic scores


Analysis 2:

Trend test to look for an association between the number of minor alleles and the fraction of subjects who had a relapse of their leukemia within 12 months of transplantation


The “top hits” from the two analyses, i.e. the SNPs with the lowest p-values, are highly overlapping sets. The rank orderings of the “top hits” were different, though.

The results from the survival analyses with the adjustment covariates are most interesting going forward, but the simple trend tests did capture some of the same information.


status\genotype AA AB BB

case N11 N12 N13

control N21 N22 N23



status\genotype    AA     AB     BB     total
case               N11    N12    N13    R1
control            N21    N22    N23    R2
total              C1     C2     C3     N

Trend test statistic, with a weight (score) wi for each genotype column i:

T = sum_{column i} [ wi * (N1i*R2 - N2i*R1) ]

Under the null (no association): E[T] = 0

There is a formula for V[T]

T/sqrt(V[T]) -> N(0,1)
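As a rough sketch of how the trend statistic can be computed, the function below uses the standard Cochran-Armitage variance, V[T] = (R1*R2/N) * [ sum_i wi^2*Ci*(N - Ci) - 2*sum_{i<j} wi*wj*Ci*Cj ], which is one common form of the "formula for V[T]" (the slides do not say which form was used). It is applied to the fruit-and-vegetable counts from the earlier table with equally spaced scores 0, 1, 2, 3; those scores and the function name are assumptions made for illustration.

```python
import math


def trend_test(cases, controls, weights):
    """Cochran-Armitage trend test: returns T, V[T], z and a two-sided p-value."""
    r1, r2 = sum(cases), sum(controls)
    cols = [a + b for a, b in zip(cases, controls)]
    n = r1 + r2
    k = len(weights)

    # T = sum_i w_i * (N1i * R2 - N2i * R1)
    t = sum(w * (a * r2 - b * r1) for w, a, b in zip(weights, cases, controls))

    # Null variance of T with the table margins held fixed.
    s1 = sum(w * w * c * (n - c) for w, c in zip(weights, cols))
    s2 = sum(
        weights[i] * weights[j] * cols[i] * cols[j]
        for i in range(k)
        for j in range(i + 1, k)
    )
    var_t = (r1 * r2 / n) * (s1 - 2 * s2)

    z = t / math.sqrt(var_t)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return t, var_t, z, p


# Counts from the fruit-and-vegetable table; scores 0..3 are an assumption.
cases = [49, 125, 136, 178]
controls = [28, 111, 140, 209]
t, var_t, z, p = trend_test(cases, controls, [0, 1, 2, 3])
print(t, var_t, round(z, 3), round(p, 4))
```

For a genotype table the same function would take the three genotype columns, with weights such as 0, 1, 2 for the number of minor alleles.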

REGRESSION TOPICS

Rothman, Greenland, and Lash: Ch. 20.

Regression
Why use regression models? How about stratified analyses of tabular data?

Control for confounding
Assess effect modification
Summarize disease association of several predictor variables, e.g. OR_MH (Mantel-Haenszel odds ratio)
Model-free
Assumption: homogeneity within each stratum

Regression Limitations of Stratification

Adjustment only for categorical covariates

Categorization of continuous variables: loss of information; residual confounding

Sparse data

Inefficiency

Regression
What are regression functions? E[Y|X] or g(E[Y|X])

Y is the outcome variable
X is the predictor or a vector of predictors
g() is a transformation or “link function”

Need to define Y, X, and the population over which the expectation or average is taken:
target population
source population
sample

Regression

Generally we assume that:
the function E[Y|X] has a particular form
the errors, the differences between actual observations and their expected values, have particular properties:
independence
mean = 0
a specified distribution

We may make other assumptions

These assumptions form the “model”.

Regression
There are regression models designed for use with many types of outcome variables and explanatory variables:

Continuous variables
Indicator variables
Unordered polytomous variables
Ordinal variables
…


Regression There seems to be a relationship between two variables

Regression: E[Y | X ] ?

Regression E[Y | X= xi] for xi strata of X


Assume a linear relationship between X and Y (model)


Regression
A regression model may summarize some aspect of the relationship between variables without completely describing the relationship

Regression Continuous explanatory variable with categorical outcome variable:

Regression
We could use a linear model for a dichotomous outcome:

linear risk model
E[1{outcome=1} | X] = Pr(outcome = 1 | X) = a + bX

Regression
We could use a linear model for a dichotomous exposure and continuous outcome:

Intervention Effects and Regression
Intervention effects:

E[Y | set(X=x1), Z=z] - E[Y | set(X=x0), Z=z]   (difference)
E[Y | set(X=x1), Z=z] / E[Y | set(X=x0), Z=z]   (ratio)

where the expectation is over the target population

Intervention Effects and Regression
If we want to use the regression association measures as estimates of the potential intervention effects, we need to assume:

E[Y | X=x, Z=z] = E[Y | set(X=x), Z=z]

No Confounding Assumption

“no residual confounding of X and Y given Z”

Intervention Effects and Regression

Regression standardization

E[Y | X=x, Z=z]
Different values of Z correspond to different strata in which you may consider the Y~X association.

You can define an overall measure of the Y~X association by taking a weighted average over the different strata or levels of Z, resulting in a marginal or population-averaged effect:

E_W[Y | X=x] = Σ_{z in Z} ( w(z) * E[Y | X=x, Z=z] )

Different choices for weights w(z):
w(z) = proportion of Z=z in the source population
or in a different target population
or in a standard population
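A small numerical sketch of the weighted average may help. All stratum-specific means and weights below are hypothetical values chosen for illustration, not results from any study; the function name is also made up.

```python
# Hypothetical stratum-specific means E[Y | X=x, Z=z] for a binary X (0/1)
# in three strata of Z. These numbers are illustrative assumptions.
stratum_means = {
    "z1": {0: 0.10, 1: 0.20},
    "z2": {0: 0.20, 1: 0.30},
    "z3": {0: 0.40, 1: 0.60},
}

# Hypothetical weights w(z): proportion of each stratum in a standard population.
weights = {"z1": 0.5, "z2": 0.3, "z3": 0.2}


def standardized_mean(x, means, w):
    """E_W[Y | X=x] = sum over z of w(z) * E[Y | X=x, Z=z], with w normalized."""
    total_weight = sum(w.values())
    return sum(w[z] * means[z][x] for z in means) / total_weight


m1 = standardized_mean(1, stratum_means, weights)
m0 = standardized_mean(0, stratum_means, weights)
print("standardized difference:", m1 - m0)
print("standardized ratio:", m1 / m0)
```

Changing the weights (for example to the stratum proportions of a different target population) changes the standardized measures even though the stratum-specific means stay the same.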

Model Specification and Model Fitting

Specification: What is the functional FORM of the relationship between Y and X?

E[Y | X0, X1] = a + b0*X0 + b1*X1

Fitting: Using data to estimate the various constants in the generic functional form of a model.
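To make the distinction concrete, here is a minimal sketch in which the specification is the linear form above and the fitting step is ordinary least squares solved through the normal equations. The data are hypothetical and noise-free (generated from a = 1, b0 = 2, b1 = -0.5), so the fit simply recovers those constants; real data would not fit exactly, and in practice a statistics package would be used rather than this hand-rolled solver.

```python
def fit_linear(rows, y):
    """Least-squares estimates of (a, b0, b1, ...) via the normal equations."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    p = len(X[0])

    # Augmented matrix [X'X | X'y].
    A = [
        [sum(xi[j] * xi[k] for xi in X) for k in range(p)]
        + [sum(xi[j] * yi for xi, yi in zip(X, y))]
        for j in range(p)
    ]

    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [v - factor * u for v, u in zip(A[r], A[col])]

    return [A[j][p] / A[j][j] for j in range(p)]


# Hypothetical covariate values (X0, X1) and noise-free outcomes.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
y = [1 + 2 * x0 - 0.5 * x1 for x0, x1 in rows]

a, b0, b1 = fit_linear(rows, y)
print(a, b0, b1)
```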

Variable Transformations

Transformations:

covariates:
reduce leverage of outlying covariate values
change units of effect estimates

outcome variables:
change scale of model (e.g. loglinear models)
make outcome distribution more “Normal” (t-tests)

Variable Transformations

Millns et al. (1995) Is it necessary to transform nutrient variables prior to statistical analysis? AJ Epi 141(3):251-262

Outcome transformations vs. generalized linear models

Outcome variables may be transformed:
accelerated failure time model (eq. 20-18, Rothman et al., page 396)
E[ln(Y)] = α + β1X1

Instead of transforming an outcome variable to account for features of its distribution and then using linear regression, we may use alternatives to linear regression that can accommodate special aspects of the distribution of Y.

Namely, the variance of Y may be constrained by the expected value of Y.

Linear Regression: Y continuous
Var[Y|X] independent of E[Y|X]

E[Y|X] = α + β1X1 + β2X2 + … + βkXk

Logistic Regression: Y dichotomous
Var[Y|X] = Pr(Y=1|X) * (1 - Pr(Y=1|X))

log(odds of Y=1 | X) = α + β1X1 + β2X2 + … + βkXk

Poisson: Y count
Var[count|X] = E[count|X]

log(rate | X) = α + β1X1 + β2X2 + … + βkXk

Generalized Linear Models

A broad class of models (including linear, logistic, and Poisson regression):

The distribution of the outcome Y has a special form: an “exponential dispersion family”

There is a linear model for a transformed version of the expected value of Y, a “mean function”:
g(E[Y|X]) = Xβ
where g() is a “link function”

The variance of Y can be expressed as a function of the expected value of Y:
Var(Y|X) = V(g^-1(Xβ))

There are general methods to solve many forms of these models and extensions of these models
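The link and variance functions of the three models named above can be written side by side. This is only a sketch: the dictionary layout and function names are invented for illustration, and the linear model's variance is shown as the constant 1, standing in for a free dispersion parameter.

```python
import math

# For each model: (link g, inverse link g^-1, variance function V(mu)).
families = {
    # Identity link; variance does not depend on the mean
    # (1.0 stands in for a separately estimated dispersion parameter).
    "linear": (lambda mu: mu, lambda eta: eta, lambda mu: 1.0),
    # Logit link; binomial variance mu * (1 - mu).
    "logistic": (
        lambda mu: math.log(mu / (1 - mu)),
        lambda eta: 1 / (1 + math.exp(-eta)),
        lambda mu: mu * (1 - mu),
    ),
    # Log link; Poisson variance equal to the mean.
    "poisson": (math.log, math.exp, lambda mu: mu),
}


def mean_and_variance(family, eta):
    """Map a linear predictor eta = X*beta to E[Y|X] and Var(Y|X)."""
    link, inverse_link, variance = families[family]
    mu = inverse_link(eta)
    return mu, variance(mu)


for name in families:
    print(name, mean_and_variance(name, 0.5))
```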

Generalized Linear Models in Stata

For example: logistic regression is family = binomial, link = logit. Choosing the family here specifies the probability model for Y|X, and thus the mean and variance functions.

Logistic Regression
If we use a logistic model, we do not have the problem of suggesting risks greater than 1 or less than 0 for some values of X:

E[1{outcome = 1} | X] = exp(a+bX) / [1 + exp(a+bX)]

Logistic Regression
The logistic model is a linear model, on a different scale than the linear risk model:

log( Pr(outcome=1) / [1 - Pr(outcome=1)] ) = a + bX
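The contrast between the two scales can be checked numerically. The coefficients a and b below are hypothetical values picked so that the linear risk model leaves the interval [0, 1] at extreme X; the function names are likewise illustrative.

```python
import math

a, b = 0.2, 0.3  # hypothetical intercept and slope


def linear_risk(x):
    """Linear risk model: Pr(outcome=1 | X=x) = a + b*x (not bounded)."""
    return a + b * x


def logistic_risk(x):
    """Logistic model: Pr(outcome=1 | X=x) = exp(a+bx) / (1 + exp(a+bx))."""
    return math.exp(a + b * x) / (1 + math.exp(a + b * x))


def log_odds(p):
    """log( p / (1 - p) ), the scale on which the logistic model is linear."""
    return math.log(p / (1 - p))


for x in (-5, 0, 5):
    p = logistic_risk(x)
    print(x, round(linear_risk(x), 3), round(p, 3), round(log_odds(p), 3))
```

At x = 5 and x = -5 the linear risk model gives values outside [0, 1], while the logistic risks stay inside (0, 1) and their log-odds equal a + bx.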

Extensions to Logistic Regression

More than 2 outcome categories: unordered categories

* polytomous logistic model = multinomial logistic model
* one category is designated the reference category, y0
* for each alternative category, yi, there is a linear model for the log-odds of outcome yi vs. y0:
  - log-odds(Y=yi | X=x) = ai + bi*x
  - Odds(Y=yi | X=x) / Odds(Y=yi | X=x*) = exp( (x-x*)*bi )
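The polytomous logistic model above can be sketched with three outcome categories. The coefficients (ai, bi) are hypothetical, and the helper names are invented for this illustration.

```python
import math

# Hypothetical (a_i, b_i) for categories y1 and y2, each versus reference y0.
coefficients = {"y1": (-0.5, 0.8), "y2": (-1.0, 0.3)}


def probabilities(x):
    """Category probabilities implied by the log-odds models a_i + b_i * x."""
    odds = {cat: math.exp(a_i + b_i * x) for cat, (a_i, b_i) in coefficients.items()}
    denominator = 1 + sum(odds.values())
    probs = {"y0": 1 / denominator}
    probs.update({cat: o / denominator for cat, o in odds.items()})
    return probs


def odds_vs_reference(category, x):
    """Odds of Y = category versus Y = y0 at covariate value x."""
    p = probabilities(x)
    return p[category] / p["y0"]


x, x_star = 2.0, 0.5
odds_ratio = odds_vs_reference("y1", x) / odds_vs_reference("y1", x_star)
print(probabilities(x))
print(odds_ratio, math.exp((x - x_star) * coefficients["y1"][1]))
```

The two numbers on the last line agree, which is the odds-ratio identity exp((x - x*)*bi) for category y1.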


Clarke et al (2008) Mobility disability and the urban built environment. AJEpi 168(5)

Extensions to Logistic Regression
More than 2 outcome categories: ordered categories

* y0 < y1 < y2
* various models possible
* cumulative odds = proportional odds model, available in Stata

Pr(Y > yi | X=x) / Pr(Y <= yi | X=x) = exp(ai + bx) = exp(ai) * exp(bx)

so a unit increase in x will multiply

Pr(Y > y0 | X=x) / Pr(Y <= y0 | X=x)

and

Pr(Y > y1 | X=x) / Pr(Y <= y1 | X=x)

by the same factor: exp(b)

thus the name “proportional odds”
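A short numerical check of the proportional odds property, under assumed values: the intercepts a0, a1 and the slope b below are hypothetical, with a0 > a1 so that Pr(Y > y0) exceeds Pr(Y > y1).

```python
import math

intercepts = [1.0, -0.5]  # hypothetical a0 (for Y > y0) and a1 (for Y > y1)
b = 0.7                   # hypothetical common slope


def cumulative_odds(i, x):
    """Pr(Y > y_i | X=x) / Pr(Y <= y_i | X=x) = exp(a_i + b*x)."""
    return math.exp(intercepts[i] + b * x)


def category_probabilities(x):
    """Pr(Y=y0), Pr(Y=y1), Pr(Y=y2) recovered from the cumulative odds."""
    exceed = [o / (1 + o) for o in (cumulative_odds(i, x) for i in range(2))]
    return [1 - exceed[0], exceed[0] - exceed[1], exceed[1]]


x = 1.2
ratio_0 = cumulative_odds(0, x + 1) / cumulative_odds(0, x)
ratio_1 = cumulative_odds(1, x + 1) / cumulative_odds(1, x)
print(ratio_0, ratio_1, math.exp(b))
print(category_probabilities(x))
```

Both cumulative odds are multiplied by exp(b) for a unit increase in x, regardless of the cutpoint.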

Extensions to Logistic Regression Ordinal outcomes:

Poisson Regression

We can have counts as an outcome

A Poisson distribution can model the number of independent events occurring per unit of observation when the expected number of events per unit of observation is λ

Prob(counts Y = k) = λ^k * exp(-λ) / k!

Poisson Regression

Think of a linear model for the log of the number of counts per unit of observation:

log( E[Y | X=x] / units of observation ) = a + bx
=> log( E[Y | X=x] ) - log(units of observation) = a + bx
=> log( E[Y | X=x] ) = log(units of observation) + a + bx

Note that log(units of observation) is used in the model like a covariate, but it has a coefficient fixed at 1. To distinguish factors like this from other covariates, it is called an offset.

For a given number of units of observation:
E[Y|X=1] / E[Y|X=0] = exp(b)
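The role of the offset can be illustrated with made-up numbers. Below, the baseline rate (0.002 events per person-year) and the rate ratio (1.5) are hypothetical, and person-years stand in for the "units of observation".

```python
import math

a = math.log(0.002)  # hypothetical log baseline rate per person-year
b = math.log(1.5)    # hypothetical log rate ratio for X=1 versus X=0


def expected_count(x, person_years):
    """E[Y | X=x] from log E[Y] = log(person_years) + a + b*x.

    log(person_years) is the offset: it enters with coefficient 1.
    """
    return math.exp(math.log(person_years) + a + b * x)


exposed = expected_count(1, 1000)
unexposed = expected_count(0, 1000)
print(unexposed, exposed, exposed / unexposed)

# The rate (count per person-year) does not depend on the amount of follow-up.
print(expected_count(1, 5000) / 5000)
```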

Poisson Regression

A Poisson distribution can model the number of independent events occurring per unit of observation when the expected number of events per unit of observation is λ

Prob(counts Y = k) = λ^k * exp(-λ) / k!

With a Poisson distribution, the expected number of counts is equal to the variance of the number of counts

Typically the model is applied to aggregated units of observation (groups or strata) for which total counts, total units of observation, and group-level covariates are recorded
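The equality of mean and variance can be verified directly from the probability formula. The value of λ below is arbitrary, and the sum is truncated at k = 59, which leaves a negligible tail for this λ.

```python
import math


def poisson_pmf(k, lam):
    """Prob(Y = k) = lam^k * exp(-lam) / k!"""
    return lam ** k * math.exp(-lam) / math.factorial(k)


lam = 3.5  # arbitrary illustrative value of the expected count
support = range(60)  # truncation point; the remaining tail mass is negligible here
probs = [poisson_pmf(k, lam) for k in support]

total = sum(probs)
mean = sum(k * p for k, p in zip(support, probs))
variance = sum((k - mean) ** 2 * p for k, p in zip(support, probs))
print(total, mean, variance)
```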

Poisson Regression

Johnson et al (2005) Geographic Prediction of Human Onset of West Nile Virus Using Dead Crow Clusters: An Evaluation of Year 2002 Data in New York State. AJ Epi 163(2):171-181.


Poisson Regression

Typically the model is applied to aggregated units of observation (groups or strata) for which total counts, total units of observation,and group-level covariates are recorded

Collapsing covariates into group-level covariates can introduce bias and lose information

Ungrouped Poisson regression methods have been developed:
use individual, time-varying covariate information
estimate effects of covariates on rates

Parametric proportional hazards models with exponential event distribution

Poisson Regression

Loomis et al (2005) Poisson regression analysis of ungrouped data. OccEnvirnMed 62:325-329.


Further Extensions for Regression
Typically it is assumed that observations are independent

Generalized Estimating Equations (GEEs)

Do not necessarily assume that observations are independent, but if not, a particular correlation structure for the observations is generally assumed:

Independent: special case
Block-independent: observations are partitioned into blocks. Observations within a block are correlated, say with a fixed correlation r
Autoregressive: observations within a block correspond to sequential observations. Consecutive observations are more closely correlated than non-consecutive observations

GEEs are useful in cases such as longitudinal studies where each subject has a set of observations, or studies where several families are studied together and blocks represent families
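The working correlation structures listed above can be written out as within-block correlation matrices. This is a sketch only: the function names are made up, and the autoregressive case is shown in its first-order form, where the correlation between observations i and j is r^|i-j| (an assumption; the slides do not specify the order).

```python
def independent(n):
    """Identity matrix: no correlation between observations in a block."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def fixed_correlation(n, r):
    """Every pair of distinct observations in a block has correlation r."""
    return [[1.0 if i == j else r for j in range(n)] for i in range(n)]


def autoregressive(n, r):
    """First-order autoregressive: correlation r^|i-j| decays with separation."""
    return [[r ** abs(i - j) for j in range(n)] for i in range(n)]


for row in autoregressive(4, 0.5):
    print(row)
for row in fixed_correlation(3, 0.3):
    print(row)
```

In the autoregressive matrix, consecutive observations have correlation r while observations two steps apart have r^2, matching the description that consecutive observations are more closely correlated.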

Non-independent observations may also be studied using hierarchical models where:

Observations are drawn from blocks with some distribution that depends on some covariates

Blocks are drawn from some higher level model that may depend on other covariates
