Lecture 18: Review Lecture

Ani [email protected]

15 May 2007

Types of Biostatistics

- 1) Descriptive Statistics
  - Exploratory Data Analysis
    - often not in literature
  - Summaries
    - "Table 1" in a paper
  - Goal: visualize relationships, generate hypotheses

Types of Biostatistics

- 2) Inferential Statistics
  - Confirmatory Data Analysis
    - Methods Section of paper
  - Goal: quantify relationships, test hypotheses

Approach to Modeling

A general approach for most statistical modeling is to:

- Define the Population of Interest
- State the Scientific Questions & Underlying Theories
- Describe and Explore the Observed Data
- Define the Model
  - Probability part (models the randomness / noise)
  - Systematic part (models the expectation / signal)

Approach to Modeling

- Estimate the Parameters in the Model
  - Fit the Model to the Observed Data
- Make Inferences about Covariates
- Check the Validity of the Model
  - Verify the Model Assumptions
- Re-define, Re-fit, and Re-check the Model if necessary
- Interpret the results of the Analysis in terms of the Scientific Questions of Interest

Stem-and-Leaf Plots

- Age in years (10 observations)

25, 26, 29, 32, 35, 36, 38, 44, 49, 51

Age Interval   Observations
20-29          5 6 9
30-39          2 5 6 8
40-49          4 9
50-59          1

Grouping:Frequency Distribution Tables

- Shows the number of observations for each range of data
- Intervals can be chosen in ways similar to stem-and-leaf displays

Age Interval   Frequency
20-29          3
30-39          4
40-49          2
50-59          1

Histograms

- Pictures of the frequency or relative frequency distribution

[Figure: Histogram of Age. x-axis: Age Category (1-4); y-axis: Frequency (1-4)]

Box-and-Whisker Plots

[Figure: Box Plot of Age. y-axis: Age in Years (25-50)]

- IQR = 44 – 29 = 15
- Upper Fence = 44 + 15*1.5 = 66.5
- Lower Fence = 29 – 15*1.5 = 6.5
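The fence arithmetic can be checked with a short Python sketch using the ten ages from the stem-and-leaf slide. The quartile rule used here (medians of the lower and upper halves) is an assumption: the slides do not say which convention gives 29 and 44, and other conventions give slightly different quartiles.

```python
from statistics import median

ages = [25, 26, 29, 32, 35, 36, 38, 44, 49, 51]  # data from the stem-and-leaf slide

def quartiles_by_halves(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumption: this "median of halves" rule is only one of several
    quartile conventions; it reproduces Q1 = 29 and Q3 = 44 for these data.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[-half:]   # skips the middle value when n is odd
    return median(lower), median(upper)

q1, q3 = quartiles_by_halves(ages)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [a for a in ages if a < lower_fence or a > upper_fence]

print(q1, q3, iqr, lower_fence, upper_fence, outliers)
```

No age falls outside the fences, so the box plot would show no outlying points.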

2 Continuous Variables

- Scatterplot
  - Scatterplots visually display the relationship between two continuous variables

[Figure: Age by Height in cm. x-axis: Age in Years (25-50); y-axis: Height in Centimeters (150-190)]

Why is the power of a test important?

- Power indicates the chance of finding a “significant” difference when there really is one
  - Low power: likely to obtain non-significant results even when significant differences exist
- High power is desirable!
  - Low power is usually caused by small sample size

We’re not always right

Errors in Hypothesis Testing α

- Aim: to keep Type I error small by specifying a small rejection region
- α is set before performing a test, usually at 0.05

Errors in Hypothesis Testing β

- Aim: To keep Type II error small and thus power high

β: Probability of Type II Error

- The value of β is usually unknown since it depends on a specified alternative value.
- β depends on sample size and α.
- Before data collection, scientists decide
  - the test they will perform
  - α
  - the desired β
- They will use this information to choose the sample size
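As a rough illustration of how α and the desired β lead to a sample size, here is a minimal Python sketch. It assumes a two-sided one-sample z test with known standard deviation and uses the usual normal-approximation formula; the function name, the effect size `delta`, and `sigma` are hypothetical and do not come from the lecture.

```python
from math import ceil
from statistics import NormalDist

def sample_size_one_sample_z(delta, sigma, alpha=0.05, beta=0.20):
    """Smallest n for a two-sided one-sample z test (normal approximation).

    n = ((z_{1-alpha/2} + z_{1-beta}) * sigma / delta)^2
    delta: difference to detect (hypothetical)
    sigma: known standard deviation (hypothetical)
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(1 - beta)
    return ceil(((z_alpha + z_beta) * sigma / delta) ** 2)

n_80 = sample_size_one_sample_z(delta=0.5, sigma=1.0, beta=0.20)  # power 0.80
n_90 = sample_size_one_sample_z(delta=0.5, sigma=1.0, beta=0.10)  # power 0.90
print(n_80, n_90)
```

Asking for a smaller β (higher power) at the same α and effect size requires a larger sample.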

P-Values

- Definition: The p-value for a hypothesis test is the probability of obtaining by chance alone, when H0 is true, a value of the test statistic as extreme or more extreme (in the appropriate direction) than the one actually observed.

Steps of Hypothesis Testing

- Define the null hypothesis, H0.
- Define the alternative hypothesis, Ha, where Ha is usually of the form “not H0”.
- Define the type 1 error, α, usually 0.05.
- Calculate the test statistic
- Calculate the P-value
- If the P-value is less than α, reject H0. Otherwise fail to reject H0.
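The steps above can be sketched in Python for a two-sided z test. The function name and the numbers passed to it are hypothetical, chosen only to walk through the steps; a t test or another statistic would follow the same pattern with a different reference distribution.

```python
from statistics import NormalDist

def two_sided_z_test(estimate, null_value, std_error, alpha=0.05):
    """Return (z, p_value, reject_h0) for H0: parameter == null_value."""
    z = (estimate - null_value) / std_error          # test statistic
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))     # area in both tails
    return z, p_value, p_value < alpha

# Hypothetical numbers, chosen only to illustrate the steps.
z, p, reject = two_sided_z_test(estimate=1.2, null_value=0.0, std_error=0.5)
print(round(z, 2), round(p, 4), reject)
```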

Why use linear regression?

- Linear regression is very powerful. It can be used for many things:
  - Binary X
  - Continuous X
  - Categorical X
  - Adjustment for confounding
  - Interaction
  - Curved relationships between X and Y

SLR: Y = β0 + β1X1

- Linear regression is used for continuous outcome variables
  - β0: mean outcome when X=0 (Center!)
  - Binary X = “dummy variable” for group
    - β1: mean difference in outcome between groups
  - Continuous X
    - β1: mean difference in outcome corresponding to a 1-unit increase in X
    - Center X to give meaning to β0
  - Test β1=0 in the population
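To see why the slope for a binary "dummy variable" is the mean difference between groups, here is a small Python sketch that fits the least-squares line by hand. The data and the helper `fit_slr` are hypothetical illustrations, not part of the lecture.

```python
def fit_slr(x, y):
    """Least-squares intercept and slope for Y = b0 + b1*X."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical outcomes for two groups (X = 0 and X = 1).
group = [0, 0, 0, 1, 1, 1]
outcome = [10.0, 12.0, 14.0, 15.0, 17.0, 19.0]

b0, b1 = fit_slr(group, outcome)
mean0 = sum(outcome[:3]) / 3   # mean outcome when X = 0
mean1 = sum(outcome[3:]) / 3   # mean outcome when X = 1
print(b0, b1, mean0, mean1 - mean0)
```

The fitted intercept equals the mean of the X = 0 group and the fitted slope equals the difference in group means.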

Assumptions of Linear Regression

- L  Linear relationship
- I  Independent observations
- N  Normally distributed around line
- E  Equal variance across X’s

In Simple Linear Regression

- In simple linear regression (SLR):
  - One Predictor / Covariate / Explanatory Variable: X
- In multiple linear regression (MLR):
  - Same Assumptions as SLR, (i.e. L.I.N.E.), but:
  - More than one Covariate: X1, X2, X3, …, Xp

Model:
- Y ~ N(µ, σ²)
- µ = E(Y | X) = β0 + β1X1 + β2X2 + β3X3 + ... + βpXp

Regression Methods

Nested models

- One model is nested within another if the parent model contains one set of variables and the extended model contains all of the original variables plus one or more additional variables.

Difference in assessing variables: “nested models”

- other predictor(s)
  - assess with t test if single variable defines predictor
  - assess with F test (today) if two or more variables are needed to define the predictor
- potential confounder(s)
  - compare CI of primary predictor to see whether new parameter is significantly different

The F test

Fobs = [ (RSS_parent − RSS_nested) / (# of new variables added) ] / [ RSS_nested / (residual df_nested) ]

( )4.4

228.49

28.496.69

Fobs =−

=

What is Fcr?

H0: all new β’s = 0 in population

HA: at least one new β is not 0 in population

The F test: notes

- The F test can be used to compare any two nested models
- If only one variable is added, it’s easier to compare the models using the t test for that variable
  - t² = F if one variable is added
- For any regression, the estimated variance of the residuals is RSS/(residual df)
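A minimal Python sketch of the observed F statistic for two nested linear models follows. The argument names are my own: `rss_extended` and `resid_df_extended` stand for the RSS and residual df of the model with the added variables. The numbers are hypothetical and are not the ones from the lecture example.

```python
def f_statistic(rss_parent, rss_extended, n_added, resid_df_extended):
    """Observed F for comparing a parent model with an extended model.

    rss_parent:        residual sum of squares of the smaller (parent) model
    rss_extended:      residual sum of squares after adding the new variables
    n_added:           number of new variables added
    resid_df_extended: residual degrees of freedom of the extended model
    """
    numerator = (rss_parent - rss_extended) / n_added
    denominator = rss_extended / resid_df_extended   # estimated residual variance
    return numerator / denominator

# Hypothetical values, not taken from the lecture.
f_obs = f_statistic(rss_parent=100.0, rss_extended=80.0,
                    n_added=2, resid_df_extended=20)
print(f_obs)  # compare with the critical value of F on (2, 20) df
```

The denominator is the RSS/(residual df) quantity mentioned in the notes above, so F compares the improvement per added variable with the remaining residual variance.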

Nested Models

- Comparing nested models
  - 1 new variable: use t test for that variable
  - 2+ new variables: use F test
- Categorical predictor
  - set one group as reference
  - create dummy variables for the other groups
  - include/exclude all dummy variables
  - evaluate categorical predictor with F test

Effect Modification

- In linear regression, effect modification is a way of allowing the association between the primary predictor and the outcome to change with the level of another predictor.
  - If the 3rd predictor is binary, that results in a graph in which the two lines (for the two groups) are no longer parallel.

Splines and Quadratic Terms

- Splines are used to allow the regression line to bend
  - the breakpoint is arbitrary and decided graphically or by hypothesis
  - the actual slope above and below the breakpoint is usually of more interest than the coefficient for the spline (i.e. the change in slope)
- Quadratic term allows for curvature in the model
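The relationship between the spline coefficient and the two slopes can be sketched in Python. This assumes a linear spline coded as (X − breakpoint)+, which is a common coding but is not spelled out on the slide; the coefficients and breakpoint below are hypothetical.

```python
def spline_term(x, knot):
    """Linear spline term (x - knot)+ : zero below the breakpoint."""
    return max(0.0, x - knot)

def predict(x, b0, b1, b2, knot):
    """Mean outcome for Y = b0 + b1*X + b2*(X - knot)+ ."""
    return b0 + b1 * x + b2 * spline_term(x, knot)

# Hypothetical coefficients and breakpoint, for illustration only.
b0, b1, b2, knot = 2.0, 0.5, -0.25, 40.0

slope_below = predict(31, b0, b1, b2, knot) - predict(30, b0, b1, b2, knot)
slope_above = predict(51, b0, b1, b2, knot) - predict(50, b0, b1, b2, knot)
print(slope_below, slope_above)   # b1 and b1 + b2
```

Under this coding the slope below the breakpoint is b1, the slope above it is b1 + b2, and b2 by itself is only the change in slope.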

Logistic regression

- For binary outcomes
- Model the log odds of the probability, which we also call the logit
- Baseline term interpreted as log odds
- Other coefficients are log odds ratios

Logistic regression model

log[ odds(Relief | Tx) ] = log[ P(relief | Tx) / P(no relief | Tx) ] = β0 + β1Tx

where: Tx = 0 if Placebo
            1 if Drug

Then…

- log( odds(Relief|Drug) ) = β0 + β1
- log( odds(Relief|Placebo) ) = β0
- log( odds(R|D) ) – log( odds(R|P) ) = β1

And…

- Thus: log[ odds(R|D) / odds(R|P) ] = β1
- And: OR = exp(β1) = e^β1 !!
- So: exp(β1) = odds ratio of relief for patients taking the Drug -vs- patients taking the Placebo.

Logistic Regression

Logit estimates                                   Number of obs   =         70
                                                  LR chi2(1)      =       2.83
                                                  Prob > chi2     =     0.0926
Log likelihood = -46.99169                        Pseudo R2       =     0.0292

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        drug |   .8137752   .4889211     1.66   0.096    -.1444926    1.772043
       _cons |  -.2876821    .341565    -0.84   0.400    -.9571372    .3817731
------------------------------------------------------------------------------

Estimates:

log( odds(relief) ) = β̂0 + β̂1Drug
                    = -0.288 + 0.814(Drug)

Therefore: OR = exp(0.814) = 2.26 !
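The odds ratio, its confidence interval, and the Wald z and p-value can be reproduced from the coefficient and standard error in the output above. This Python sketch uses only those two printed numbers; the variable names are mine.

```python
from math import exp
from statistics import NormalDist

# Coefficient and standard error for `drug`, copied from the logit output.
coef, se = 0.8137752, 0.4889211

odds_ratio = exp(coef)

# Wald 95% interval on the log-odds scale, then exponentiated to the OR scale.
z_crit = NormalDist().inv_cdf(0.975)
ci_log = (coef - z_crit * se, coef + z_crit * se)
ci_or = (exp(ci_log[0]), exp(ci_log[1]))

# Wald z statistic and two-sided p-value for H0: beta1 = 0.
z = coef / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(round(odds_ratio, 2), [round(v, 2) for v in ci_or],
      round(z, 2), round(p_value, 3))
```

The interval for the log odds ratio includes 0, so the interval for the odds ratio includes 1, in line with the p-value of 0.096.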

Adding other variables

- What if Pr(relief) = function of Drug or Placebo AND Age?
- We could easily include age in a model such as:

log( odds(relief) ) = β0 + β1Drug + β2Age

Logistic Regression

- As in MLR, we can include many additional covariates.
- For a Logistic Regression model with p predictors:

log( odds(Y=1) ) = β0 + β1X1 + ... + βpXp

where: odds(Y=1) = Pr(Y=1) / (1 − Pr(Y=1)) = Pr(Y=1) / Pr(Y=0)

Types of interpretation

- β0 + β1 = ln(odds) (for X=1)
- β1 = difference in log odds
- e^(β0+β1) = odds (for X=1)
- e^β1 = odds ratio
- But we started with P(Y=1). Can we find that?

More useful math

- odds = probability / (1 − probability)
- probability = odds / (1 + odds)
- so probability (for X=1) = e^(β0+β1) / (1 + e^(β0+β1))
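The conversions between odds and probability are easy to check numerically. In this Python sketch the coefficients `b0` and `b1` are hypothetical values chosen for illustration; only the two conversion formulas come from the slide.

```python
from math import exp, log

def prob_to_odds(p):
    """odds = probability / (1 - probability)"""
    return p / (1 - p)

def odds_to_prob(odds):
    """probability = odds / (1 + odds)"""
    return odds / (1 + odds)

# Hypothetical coefficients for a model with one binary X.
b0, b1 = -1.0, 0.5

p_x0 = odds_to_prob(exp(b0))        # probability when X = 0
p_x1 = odds_to_prob(exp(b0 + b1))   # probability when X = 1
print(round(p_x0, 4), round(p_x1, 4))

# Converting the probability back to log odds recovers b0 + b1.
log_odds_back = log(prob_to_odds(p_x1))
print(round(log_odds_back, 6))
```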

Nested models

- Adding a single new variable to the model
- null model: ln( p / (1 − p) ) = β0 + β1(Age − 30)
- full model: ln( p / (1 − p) ) = β0 + β1(Age − 30) + β2(Multivitamin)

Comparing nested models that differ by one variable

- Compare models with p-value or CI
- What method is this?
  - The Wald test, a test that applies the CLT, like
    - Z test comparing proportions in 2x2 table
    - analogous to the t test for linear regression
- H0: the new variable is not needed
  - or H0: βnew = 0 in the population

Conclusion from the Wald test

- The p-value for multivitamin is 0.007 (<0.05) and the CI for the multivitamin coefficient does not include 0 (CI for OR doesn’t include 1)
- Reject H0
- Conclude that the larger model is better: after adjusting for age, multivitamin use is still an important predictor of physician visits in the population

Interpretation - log odds

- β0: the log odds of not visiting a physician for a 30-year-old person who reports not regularly taking multivitamins
- β1: the log odds ratio of not visiting a physician for a one year increase in age, controlling for multivitamin use
- β2: the log odds ratio of not visiting a physician for those who take multivitamins compared with those who do not, adjusting for age

Interpretation – odds and odds ratio

- exp{β0}: the odds of not visiting a physician for a 30-year-old person who reports not regularly taking multivitamins
- exp{β1}: after adjusting for multivitamin use, the odds of not visiting a physician change by a factor of exp{β1} = 1.001 for each additional year of age
  - additional age is associated with lower frequency of physician visits in these students, but the association is not statistically significant (p>0.05)
- exp{β2}: the odds ratio of not visiting a physician for those who take multivitamins compared with those who do not is exp{β2} = 0.46, adjusting for age
  - taking multivitamins is associated with regular physician visits (p=0.007)

Interpretation In General

- Also: log[ odds(Y=1 | X1+1, X2) / odds(Y=1 | X1, X2) ] = β1
- And: OR = exp(β1) !!
  - exp(β1) is the multiplicative change in odds for a 1 unit increase in X1 provided X2 is held constant.
- The result is similar for X2

CHD by smoking and coffee

- Yi = 1 if CHD case, 0 if control
- COFi = 1 if Coffee Drinker, 0 if not
- SMKi = 1 if Smoker, 0 if not
- pi = Pr(Yi = 1)
- ni = Number observed at pattern i of Xs

Logistic Regression Model

- Yi are from a Binomial (ni, pi) distribution
- Yi are independent
- log odds (Yi=1) (or, logit(Yi=1)) is a function of
  - Coffee
  - Smoking
  - and coffee x smoking interaction

Logistic Regression Model

log[ pi / (1 − pi) ] = β0 + β1COFi + β2SMKi + β3COFi*SMKi

- Which implies that Pr(Yi=1) is the logistic function

pi = e^(β0 + β1X1i + β2X2i + β3X1iX2i) / (1 + e^(β0 + β1X1i + β2X2i + β3X1iX2i))

Interpretations

- exp{β1}: odds ratio of being a CHD case for coffee drinkers -vs- non-drinkers among non-smokers
- exp{β1 + β3}: odds ratio of being a CHD case for coffee drinkers -vs- non-drinkers among smokers

Interpretations

- exp{β2}: odds ratio of being a CHD case for smokers -vs- non-smokers among non-coffee drinkers
- exp{β2 + β3}: odds ratio of being a CHD case for smokers -vs- non-smokers among coffee drinkers

Interpretations

- e^β0 / (1 + e^β0): fraction of cases among non-smoking non-coffee drinking individuals in the sample (determined by sampling plan)
- exp{β3}: ratio of odds ratios

exp{β3} Interpretations

- exp{β3}: factor by which odds ratio of being a CHD case for coffee drinkers -vs- nondrinkers is multiplied for smokers as compared to non-smokers

or

- exp{β3}: factor by which odds ratio of being a CHD case for smokers -vs- non-smokers is multiplied for coffee drinkers as compared to non-coffee drinkers

Some Special Cases

- Given

log[ Pr(Y=1) / Pr(Y=0) ] = β0 + β1COF + β2SMK + β3COF*SMK

- If β1 = β2 = β3 = 0
  - Neither smoking nor coffee drinking is associated with increased risk of CHD

Some Special Cases

- Given

log[ Pr(Y=1) / Pr(Y=0) ] = β0 + β1COF + β2SMK + β3COF*SMK

- If β1 = β3 = 0
  - Smoking, but not coffee drinking, is associated with increased risk of CHD

Some Special Cases

- If β3 = 0
  - Smoking and coffee drinking are both associated with risk of CHD but the odds ratio of CHD-smoking is the same at all levels of coffee
  - Smoking and coffee drinking are both associated with risk of CHD but the odds ratio of CHD-coffee is the same at all levels of smoking.

Confounding

- In epidemiological terms, Z is a “confounder” of the relationship of Y with X if Z is related to both X and Y and Z is not in the causal pathway between X and Y
- In statistical terms, Z is a “confounder” of the relationship of Y with X if the X coefficient changes when Z is added to a regression of Y on X

Confounding

- For example, consider the two models

Y = β0 + β1X + ε1

Y = α0 + α1X + α2Z + ε2

- then Z is a confounder of the X, Y relationship if β1 ≠ α1

Look at Confidence Intervals

- Without Smoking: OR = e^0.79 = 2.2
- 95% CI for log(OR): 0.79 ± 1.96(0.33) = (0.13, 1.44)
- 95% CI for OR: (e^0.13, e^1.44) = (1.14, 4.24)

Look at Confidence Intervals

- With Smoking (adjusting for smoking): OR = e^0.53 = 1.7
- 95% CI for log(OR): 0.53 ± 1.96(0.35) = (-0.17, 1.22)
- 95% CI for OR: (e^-0.17, e^1.22) = (0.85, 3.39)
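The recipe on these two slides (build the interval on the log scale, then exponentiate) can be written as a small Python helper. The function name is hypothetical. Because the inputs here are the rounded values 0.79, 0.33, 0.53, and 0.35, the limits it prints can differ from the slide values in the second decimal place; the slides were presumably computed from unrounded estimates.

```python
from math import exp

def wald_ci_for_or(log_or, se, z_crit=1.96):
    """95% Wald interval: build it on the log scale, then exponentiate."""
    lo, hi = log_or - z_crit * se, log_or + z_crit * se
    return exp(log_or), (exp(lo), exp(hi))

# Rounded estimates quoted on the two confidence-interval slides.
or_unadj, ci_unadj = wald_ci_for_or(0.79, 0.33)   # without smoking
or_adj, ci_adj = wald_ci_for_or(0.53, 0.35)       # adjusting for smoking

print(round(or_unadj, 1), [round(v, 2) for v in ci_unadj])
print(round(or_adj, 1), [round(v, 2) for v in ci_adj])
```

The unadjusted interval lies entirely above 1, while the adjusted interval includes 1.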

Conclusion

- So, ignoring smoking, the CHD and coffee OR is 2.2 (95% CI: 1.14 - 4.26)
- Adjusting for smoking gives more modest evidence for a coffee effect
- In this case-control study, smoking is a weak-to-moderate confounder of the coffee-CHD association

Interaction Model

Model 3

Variable         Est    se     z
Intercept       -1.0   .30   -3.4
Coffee           .69   .45    1.5
Smoking          1.3   .55    2.4
Coffee*Smoking  -.43   .73   -.59

Testing Interaction Term

- Z = -0.59, p-value = 0.554
- 95% Confidence interval for exp{β1 + β3}
  - (0.42, 3.99)
- Both of the above suggest that there is little evidence that smoking is an effect modifier!
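The stratum-specific odds ratios and the Wald test for the interaction can be recomputed from the Model 3 estimates. This Python sketch uses the rounded estimates from the table (0.69, 1.3, -0.43, and se 0.73), so the p-value agrees with the slide only to about two decimals; the variable names are mine.

```python
from math import exp
from statistics import NormalDist

# Estimates and standard error from the Model 3 table (rounded as shown).
b_coffee, b_smoking, b_inter = 0.69, 1.3, -0.43
se_inter = 0.73

or_coffee_nonsmokers = exp(b_coffee)            # exp{b1}
or_coffee_smokers = exp(b_coffee + b_inter)     # exp{b1 + b3}
or_smoking_noncoffee = exp(b_smoking)           # exp{b2}
or_smoking_coffee = exp(b_smoking + b_inter)    # exp{b2 + b3}
ratio_of_ors = exp(b_inter)                     # exp{b3}

# Wald test of H0: b3 = 0 (no effect modification).
z = b_inter / se_inter
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(round(or_coffee_nonsmokers, 2), round(or_coffee_smokers, 2))
print(round(or_smoking_noncoffee, 2), round(or_smoking_coffee, 2))
print(round(ratio_of_ors, 2), round(z, 2), round(p_value, 2))
```

Either pair of odds ratios differs by the same factor exp{b3}, which is the "ratio of odds ratios" interpretation.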

Likelihood Ratio Test

- The Likelihood Ratio Test will help decide whether or not additional term(s) “significantly” improve the model fit
- Likelihood Ratio Test (LRT) statistic for comparing nested models is
  - -2 times the difference between the log likelihoods (LLs) for the Null -vs- Extended models
  - the statistic obtained is identical to the one from an analysis of variance test for linear regression models

Likelihood Ratio Test

Deviance is a term used for the difference in -2*log likelihood relative to the best possible value from a perfectly predicting model.

Change in deviance is the same as change in -2LL.
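A minimal Python sketch of the LRT statistic for one added term follows. The extended-model log likelihood is the value printed in the drug/placebo logit output earlier in these notes; the null-model log likelihood is NOT reported there and is an assumption, back-calculated so that the statistic matches the reported LR chi2(1) = 2.83. The 1-df p-value uses the fact that a chi-square with 1 df is a squared standard normal.

```python
from math import sqrt
from statistics import NormalDist

def lrt_statistic(ll_null, ll_extended):
    """-2 times the difference in log likelihoods (null minus extended)."""
    return -2 * (ll_null - ll_extended)

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square with 1 df.

    A chi-square(1) variable is a squared standard normal, so
    P(X > x) = 2 * (1 - Phi(sqrt(x))).
    """
    return 2 * (1 - NormalDist().cdf(sqrt(x)))

ll_extended = -46.99169   # log likelihood reported in the logit output
ll_null = -48.40669       # ASSUMED: back-calculated so that LR chi2(1) = 2.83

lrt = lrt_statistic(ll_null, ll_extended)
p_value = chi2_sf_1df(lrt)
print(round(lrt, 2), round(p_value, 4))
```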

LRT Example

Model comparisons using likelihood ratio test

Summary: Unadjusted ORs

- The odds of CHD were estimated to be 3.4 times higher among smokers compared to non-smokers
  - 95% CI: (1.7, 7.9)
- The odds of CHD were estimated to be 2.2 times higher among coffee drinkers compared to non-coffee drinkers
  - 95% CI: (1.1, 4.3)

Summary: Adjusted ORs

- Controlling for the potential confounding of smoking, the coffee odds ratio was estimated to be 1.7 with 95% CI: (.85, 3.4).
- Hence, the evidence in these data is insufficient to conclude coffee has an independent effect on CHD beyond that of smoking.

Comparing the models

- Models C and F are both nested in Model A
- Models C and F cannot be directly compared to one another, but we can see which has a smaller p-value when compared to Model A
  - C vs. A: χ² = 26.5 with 2 df
  - F vs. A: χ² = 21.7 with 3 df
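The two p-values can be obtained from the chi-square statistics and degrees of freedom above. This Python sketch implements the chi-square upper tail for integer df with closed-form finite sums, so it needs only the standard library; the function name `chi2_sf` is my own, and a statistics package would normally be used instead.

```python
from math import erfc, exp, gamma, sqrt

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square with integer df.

    Uses the closed-form finite sums for even and odd degrees of freedom.
    """
    half = x / 2
    if df % 2 == 0:
        terms = (half ** k / gamma(k + 1) for k in range(df // 2))
        return exp(-half) * sum(terms)
    tail = erfc(sqrt(half))   # the 1-df tail probability
    terms = (half ** (k - 0.5) / gamma(k + 0.5)
             for k in range(1, (df - 1) // 2 + 1))
    return tail + exp(-half) * sum(terms)

p_c_vs_a = chi2_sf(26.5, 2)   # Model C vs. Model A
p_f_vs_a = chi2_sf(21.7, 3)   # Model F vs. Model A
print(p_c_vs_a, p_f_vs_a)
```

Both p-values are very small, and the one for C vs. A is the smaller of the two.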

What next?

- Model C improves prediction beyond gender alone (Model A) more than Model F.
- Model C should be the next parent model, and we should test the new variables in Model F to see if they continue to improve prediction within the context of Model C.
- When a tentative final model is identified, the assumptions of logistic regression should be checked.

Flexibility in linear models

- A spline allows the “slope” for a continuous predictor to change at a given point; the coefficient is for the difference in log odds ratio
- An interaction term allows the odds ratio for one variable to differ by the value of a second variable; the coefficient is for the difference in log odds ratio

Poisson regression model

- Log-linear model for mean rate

where p is the number of predictors in the model

- Random component:
- Here:

Exponentiating Poisson regression models

Interpreting Poisson regression parameters

Modelling rates

- Of key interest in Poisson regression models is to make inference about rates of events
- We are often interested in whether the rate of cancer, or some other disease, varies by population subgroups such as gender, race, or age

Person-years

- In defining rates, it is crucial to state what denominator we have in mind
- For disease, we are usually interested in disease rate per person, per year
- If the HIV incidence rate is 5 per 1 million person years, that means we expect to see 5 new cases of HIV per 1 million persons per year

Modelling Danish Cancer cases with an offset

- We observed Danish cancer cases in 6 age groups over a period of 4 years
- The model:

predicts log rates per 10,000 person years

Interpretation of coefficients

More about offsets

- The purpose of an offset is to specify the denominator of the predicted rates
- We should always try to use an offset if we suspect the underlying population sizes vary for the observed counts
- Typically, we’ll use log(N) as the offset, where N is the sample size or number of person years generating each count
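How a log(N) offset turns a rate into an expected count can be sketched in Python. The rate of 5 per 1 million person years is the example from the person-years slide; the population sizes, the rate ratio of 2, and the function name are hypothetical.

```python
from math import exp, log

def expected_count(person_years, linear_predictor):
    """Poisson mean with offset: log(mu) = log(N) + linear predictor."""
    return exp(log(person_years) + linear_predictor)

# Rate from the person-years slide: 5 cases per 1 million person years.
log_rate = log(5 / 1_000_000)

# Hypothetical populations of different sizes, each followed for one year.
mu_small = expected_count(200_000, log_rate)
mu_large = expected_count(2_000_000, log_rate)
print(mu_small, mu_large)

# Hypothetical log rate ratio for a binary covariate.
b1 = log(2.0)
rate_ratio = expected_count(200_000, log_rate + b1) / mu_small
print(rate_ratio)
```

The expected counts differ tenfold only because the denominators differ; the underlying rate is the same, which is what the offset accounts for.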

Poisson regression for cohort studies

- Log-linear regression can be used to estimate relative risks for cohort studies (but not case-control)
- Relative risk is like relative rate, but we are comparing risks (probability of disease) instead of rates (expected cases per person-year) across groups
- Could also estimate relative risk by transforming results from logistic regression
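One way to carry out that transformation is to convert the fitted log odds in each group to a probability and take the ratio. This Python sketch uses the drug/placebo logit estimates from earlier in these notes; treating that study as one where risks are estimable is an assumption made here for illustration.

```python
from math import exp

def risk_from_log_odds(log_odds):
    """Invert the logit: probability = odds / (1 + odds)."""
    odds = exp(log_odds)
    return odds / (1 + odds)

# Drug/placebo estimates from the logit output earlier in these notes.
b0, b1 = -0.2876821, 0.8137752

risk_placebo = risk_from_log_odds(b0)        # Pr(relief | placebo)
risk_drug = risk_from_log_odds(b0 + b1)      # Pr(relief | drug)

relative_risk = risk_drug / risk_placebo
odds_ratio = exp(b1)
print(round(relative_risk, 2), round(odds_ratio, 2))
```

The relative risk is closer to 1 than the odds ratio here because the outcome is common in both groups.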

Grand summary

- Exploratory analysis includes graphs and tables – good to get a feel for the data
- Confirmatory analysis is useful for making definitive conclusions
- Linear models provide us with a framework in which to perform confirmatory analysis in many settings

Grand summary: linear models

- Linear regression: for continuous (normal) outcomes
- Logistic regression: for binary outcomes
- Poisson regression: for counts

Grand summary: modelling

- In all generalized linear models, we can use the following tools to make models more flexible:
  - Adjust for confounders using additive covariates
  - Allow effect modification by interaction terms
  - Curved and bent lines through polynomials and splines

Grand summary: testing

- We can test significance of a single predictor using z-test (or t-test for linear regression)
- Test significance of several covariates using a pair of nested models by a likelihood ratio test
- Know how to interpret p-values and confidence intervals!
