Lecture 18: Review Lecture -...
Types of Biostatistics
- 1) Descriptive Statistics
  - Exploratory Data Analysis
    - often not in literature
  - Summaries
    - "Table 1" in a paper
  - Goal: visualize relationships, generate hypotheses
Types of Biostatistics
- 2) Inferential Statistics
  - Confirmatory Data Analysis
    - Methods Section of paper
  - Goal: quantify relationships, test hypotheses
Approach to Modeling
A general approach for most statistical modeling is to:
- Define the Population of Interest
- State the Scientific Questions & Underlying Theories
- Describe and Explore the Observed Data
- Define the Model
  - Probability part (models the randomness / noise)
  - Systematic part (models the expectation / signal)
Approach to Modeling
- Estimate the Parameters in the Model
  - Fit the Model to the Observed Data
- Make Inferences about Covariates
- Check the Validity of the Model
  - Verify the Model Assumptions
- Re-define, Re-fit, and Re-check the Model if necessary
- Interpret the results of the Analysis in terms of the Scientific Questions of Interest
Stem-and-Leaf Plots
- Age in years (10 observations)
25, 26, 29, 32, 35, 36, 38, 44, 49, 51

Age Interval   Observations
20-29          5 6 9
30-39          2 5 6 8
40-49          4 9
50-59          1
Grouping: Frequency Distribution Tables
- Shows the number of observations for each range of data
- Intervals can be chosen in ways similar to stem-and-leaf displays

Age Interval   Frequency
20-29          3
30-39          4
40-49          2
50-59          1
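The stem-and-leaf display and the frequency table can both be produced from the ten ages listed earlier. The following is a minimal sketch; the function names `stem_and_leaf` and `decade_frequencies` are illustrative and not part of the lecture.

```python
from collections import Counter

# Ages from the stem-and-leaf example (10 observations).
ages = [25, 26, 29, 32, 35, 36, 38, 44, 49, 51]


def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted leaves (ones digits)."""
    plot = {}
    for v in sorted(values):
        plot.setdefault(v // 10, []).append(v % 10)
    return plot


def decade_frequencies(values):
    """Count observations in each 10-year interval, keyed by its lower bound."""
    counts = Counter((v // 10) * 10 for v in values)
    return dict(sorted(counts.items()))


for low, count in decade_frequencies(ages).items():
    print(f"{low}-{low + 9}: {count}")
```

The 10-year interval width mirrors the slide; other widths would only change the divisor.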
Histograms
- Pictures of the frequency or relative frequency distribution
[Figure: Histogram of Age. x-axis: Age Category (1 to 4); y-axis: Frequency]
Box-and-Whisker Plots
[Figure: Box Plot of Age. y-axis: Age in Years]
- IQR = 44 - 29 = 15
- Upper Fence = 44 + 15*1.5 = 66.5
- Lower Fence = 29 - 15*1.5 = 6.5
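The fence arithmetic can be checked against the ten ages from the stem-and-leaf example. This sketch assumes the quartiles are the medians of the lower and upper halves of the sorted data, which reproduces the slide's 29 and 44; other quartile conventions give slightly different values. The function name `tukey_fences` is illustrative.

```python
from statistics import median

ages = [25, 26, 29, 32, 35, 36, 38, 44, 49, 51]


def tukey_fences(values, k=1.5):
    """Return (Q1, Q3, IQR, lower fence, upper fence).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves
    of the sorted data (the middle value is excluded when n is odd).
    """
    xs = sorted(values)
    half = len(xs) // 2
    q1 = median(xs[:half])
    q3 = median(xs[len(xs) - half:])
    iqr = q3 - q1
    return q1, q3, iqr, q1 - k * iqr, q3 + k * iqr


q1, q3, iqr, lower, upper = tukey_fences(ages)
print(f"IQR = {q3} - {q1} = {iqr}; fences = ({lower}, {upper})")
```

Observations outside the fences would be drawn as individual points; here all ages fall inside.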
2 Continuous Variables
- Scatterplot
  - Scatterplots visually display the relationship between two continuous variables
[Figure: Age by Height in cm. x-axis: Age in Years; y-axis: Height in Centimeters]
Why is the power of a test important?
- Power indicates the chance of finding a "significant" difference when there really is one
- Low power: likely to obtain non-significant results even when real differences exist
- High power is desirable!
- Low power is usually caused by small sample size
We’re not always right
Errors in Hypothesis Testing α
- Aim: to keep Type I error small by specifying a small rejection region
- α is set before performing a test, usually at 0.05
Errors in Hypothesis Testing β
- Aim: to keep Type II error small and thus power high
β: Probability of Type II Error
- The value of β is usually unknown since it depends on a specified alternative value.
- β depends on sample size and α.
- Before data collection, scientists decide
  - the test they will perform
  - α
  - the desired β
- They will use this information to choose the sample size
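To illustrate how α and the desired β translate into a sample size, here is a sketch for one specific setting that the lecture does not spell out: a two-sided z test comparing two group means with a known common standard deviation, using the standard normal-approximation formula. The function name `n_per_group` and the inputs `delta` (difference to detect) and `sigma` are assumptions for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sigma, alpha=0.05, beta=0.20):
    """Approximate sample size per group for a two-sided, two-sample z test.

    Assumes equal group sizes and a known common standard deviation sigma;
    delta is the difference in means under the specified alternative.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the rejection region
    z_beta = z.inv_cdf(1 - beta)        # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)


print(n_per_group(delta=1.0, sigma=1.0))  # large difference: few subjects needed
print(n_per_group(delta=0.5, sigma=1.0))  # smaller difference: many more needed
```

Halving the difference to detect roughly quadruples the required sample size, which is why small studies tend to have low power.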
P-Values
- Definition: The p-value for a hypothesis test is the probability of obtaining, by chance alone when H0 is true, a value of the test statistic as extreme as or more extreme than (in the appropriate direction) the one actually observed.
Steps of Hypothesis Testing
- Define the null hypothesis, H0.
- Define the alternative hypothesis, Ha, where Ha is usually of the form "not H0".
- Define the Type I error, α, usually 0.05.
- Calculate the test statistic
- Calculate the P-value
- If the P-value is less than α, reject H0. Otherwise fail to reject H0.
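The last three steps can be sketched for a test statistic that is approximately standard normal under H0 (a z test). The function name `z_test` and the example statistics are hypothetical.

```python
from statistics import NormalDist


def z_test(z_stat, alpha=0.05):
    """Two-sided p-value and decision for a z statistic.

    Assumes the statistic is approximately N(0, 1) when H0 is true.
    """
    tail = 1 - NormalDist().cdf(abs(z_stat))
    p_value = 2 * tail  # both directions count as "as extreme or more extreme"
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    return p_value, decision


for z in (2.5, 1.0):  # hypothetical test statistics
    p, decision = z_test(z)
    print(f"z = {z}: p = {p:.4f} -> {decision}")
```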
Why use linear regression?
- Linear regression is very powerful. It can be used for many things:
  - Binary X
  - Continuous X
  - Categorical X
  - Adjustment for confounding
  - Interaction
  - Curved relationships between X and Y
SLR: Y = β0 + β1X1
- Linear regression is used for continuous outcome variables
- β0: mean outcome when X=0 (Center!)
- Binary X = "dummy variable" for group
  - β1: mean difference in outcome between groups
- Continuous X
  - β1: mean difference in outcome corresponding to a 1-unit increase in X
  - Center X to give meaning to β0
- Test β1=0 in the population
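The effect of centering can be seen with a small least-squares fit. The ages and heights below are hypothetical (the lecture's scatterplot data are not available), and `fit_slr` is an illustrative helper, not a library function.

```python
from statistics import mean


def fit_slr(x, y):
    """Ordinary least squares for Y = b0 + b1*X; returns (b0, b1)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Hypothetical data: age in years, height in cm.
age = [25, 30, 35, 40, 45, 50]
height = [172, 168, 175, 170, 166, 169]

b0_raw, b1_raw = fit_slr(age, height)

# Center age at its mean so the intercept refers to an average-aged person.
age_centered = [a - mean(age) for a in age]
b0_ctr, b1_ctr = fit_slr(age_centered, height)

print(f"raw:      b0 = {b0_raw:.2f} (height at age 0), b1 = {b1_raw:.3f}")
print(f"centered: b0 = {b0_ctr:.2f} (height at mean age), b1 = {b1_ctr:.3f}")
```

Centering leaves the slope unchanged and turns the intercept into the mean outcome at the mean of X, which is usually a more meaningful quantity than the mean outcome at X=0.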
Assumptions of Linear Regression
- L: Linear relationship
- I: Independent observations
- N: Normally distributed around line
- E: Equal variance across X's
In Simple Linear Regression
- In simple linear regression (SLR):
  - One Predictor / Covariate / Explanatory Variable: X
- In multiple linear regression (MLR):
  - Same Assumptions as SLR (i.e. L.I.N.E.), but:
  - More than one Covariate: X1, X2, X3, …, Xp
Model:
- Y ~ N(µ, σ²)
- µ = E(Y | X) = β0 + β1X1 + β2X2 + β3X3 + ... + βpXp
Regression Methods
Nested models
- One model is nested within another if the parent model contains one set of variables and the extended model contains all of the original variables plus one or more additional variables.
Difference in assessing variables: "nested models"
- other predictor(s)
  - assess with t test if single variable defines predictor
  - assess with F test (today) if two or more variables are needed to define the predictor
- potential confounder(s)
  - compare CI of primary predictor to see whether new parameter is significantly different
The F test
F_obs = [ (RSS_parent - RSS_nested) / (# of new variables added) ] / [ RSS_nested / residual df_nested ]
[Worked example: F_obs = 4.4; the RSS values in the calculation are not recoverable]
What is Fcr?
H0: all new β's = 0 in population
HA: at least one new β is not 0 in population
The F test: notes
- The F test can be used to compare any two nested models
- If only one variable is added, it's easier to compare the models using the t test for that variable
  - t² = F if one variable is added
- For any regression, the estimated variance of the residuals is RSS/(residual df)
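The F statistic for two nested linear models can be computed directly from their residual sums of squares. In this sketch the RSS values and degrees of freedom are hypothetical, and `f_statistic` is an illustrative helper; the denominator is the estimated residual variance, RSS/(residual df), of the larger model.

```python
def f_statistic(rss_parent, rss_extended, n_added, resid_df_extended):
    """F statistic for comparing a parent model with its extended model.

    rss_parent:        residual sum of squares of the smaller (parent) model
    rss_extended:      residual sum of squares after adding the new variables
    n_added:           number of new variables added
    resid_df_extended: residual degrees of freedom of the extended model
    """
    numerator = (rss_parent - rss_extended) / n_added
    denominator = rss_extended / resid_df_extended  # estimated residual variance
    return numerator / denominator


# Hypothetical values: two added variables reduce RSS from 120 to 100.
f_obs = f_statistic(rss_parent=120.0, rss_extended=100.0,
                    n_added=2, resid_df_extended=50)
print(f"F_obs = {f_obs:.2f} on 2 and 50 df")
```

F_obs is then compared with the critical value of the F distribution with (number of added variables, residual df) degrees of freedom.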
Nested Models
- Comparing nested models
  - 1 new variable: use t test for that variable
  - 2+ new variables: use F test
- Categorical predictor
  - set one group as reference
  - create dummy variable for other groups
  - include/exclude all dummy variables
  - evaluate categorical predictor with F test
Effect Modification
- In linear regression, effect modification is a way of allowing the association between the primary predictor and the outcome to change with the level of another predictor.
  - If that other predictor is binary, the result is a graph in which the two lines (for the two groups) are no longer parallel.
Splines and Quadratic Terms
- Splines are used to allow the regression line to bend
  - the breakpoint is arbitrary and decided graphically or by hypothesis
  - the actual slope above and below the breakpoint is usually of more interest than the coefficient for the spline (i.e. the change in slope)
- Quadratic term allows for curvature in the model
Logistic regression
- For binary outcomes
- Model the log odds of the probability, which we also call the logit
- Baseline term interpreted as log odds
- Other coefficients are log odds ratios
Logistic regression model
log[ odds(Relief|Tx) ] = log[ P(relief|Tx) / P(no relief|Tx) ] = β0 + β1Tx

where: Tx = 0 if Placebo
       Tx = 1 if Drug
Then…
- log( odds(Relief|Drug) ) = β0 + β1
- log( odds(Relief|Placebo) ) = β0
- log( odds(R|D) ) - log( odds(R|P) ) = β1
And…
- Thus: log[ odds(R|D) / odds(R|P) ] = β1
- And: OR = exp(β1) = e^β1 !!
- So: exp(β1) = odds ratio of relief for patients taking the Drug -vs- patients taking the Placebo.
Logistic Regression
Logit estimates                                   Number of obs   =         70
                                                  LR chi2(1)      =       2.83
                                                  Prob > chi2     =     0.0926
Log likelihood = -46.99169                        Pseudo R2       =     0.0292

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        drug |   .8137752   .4889211     1.66   0.096    -.1444926    1.772043
       _cons |  -.2876821    .341565    -0.84   0.400    -.9571372    .3817731
------------------------------------------------------------------------------
Estimates:
log( odds(relief) ) = β̂0 + β̂1Drug
                    = -0.288 + 0.814(Drug)
Therefore: OR = exp(0.814) = 2.26 !
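The z statistic, p-value, and odds ratio for drug can be reproduced from the coefficient and standard error in the logit output. This is a sketch of the arithmetic only; it does not refit the model.

```python
from math import exp
from statistics import NormalDist

# Coefficient and standard error for drug, copied from the logit output.
coef_drug = 0.8137752
se_drug = 0.4889211

z = coef_drug / se_drug                       # Wald z statistic
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
odds_ratio = exp(coef_drug)                   # OR of relief, Drug -vs- Placebo

print(f"z = {z:.2f}, p = {p_value:.3f}, OR = {odds_ratio:.2f}")
```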
Adding other variables
- What if Pr(relief) = function of Drug or Placebo AND Age
- We could easily include age in a model such as:
log( odds(relief) ) = β0 + β1Drug + β2Age
Logistic Regression
- As in MLR, we can include many additional covariates.
- For a Logistic Regression model with p predictors:
log( odds(Y=1) ) = β0 + β1X1 + ... + βpXp
where: odds(Y=1) = Pr(Y=1) / (1 - Pr(Y=1)) = Pr(Y=1) / Pr(Y=0)
Types of interpretation
- β0 + β1 = ln(odds) (for X=1)
- β1 = difference in log odds
- e^(β0+β1) = odds (for X=1)
- e^β1 = odds ratio
- But we started with P(Y=1). Can we find that?
More useful math
- odds = probability / (1 - probability)
- probability = odds / (1 + odds)
- so probability (for X=1) = e^(β0+β1) / (1 + e^(β0+β1))
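These conversions can be applied to the drug/placebo logit estimates shown earlier (intercept -0.2876821, drug coefficient 0.8137752) to recover the estimated probability of relief in each group. The helper names are illustrative.

```python
from math import exp


def prob_to_odds(p):
    """odds = probability / (1 - probability)"""
    return p / (1 - p)


def odds_to_prob(odds):
    """probability = odds / (1 + odds)"""
    return odds / (1 + odds)


def predicted_prob(b0, b1, x):
    """Probability implied by a one-predictor logistic model at X = x."""
    return odds_to_prob(exp(b0 + b1 * x))


# Estimates from the logit output for the drug/placebo example.
b0, b1 = -0.2876821, 0.8137752

p_placebo = predicted_prob(b0, b1, 0)
p_drug = predicted_prob(b0, b1, 1)
print(f"P(relief | Placebo) = {p_placebo:.3f}")
print(f"P(relief | Drug)    = {p_drug:.3f}")
```

Dividing the odds in the drug group by the odds in the placebo group returns exp(β1), the odds ratio.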
Nested models
- Adding a single new variable to the model
- null model: ln( p / (1 - p) ) = β0 + β1(Age - 30)
- full model: ln( p / (1 - p) ) = β0 + β1(Age - 30) + β2(Multivitamin)
Comparing nested models that differ by one variable
- Compare models with p-value or CI
- What method is this?
  - The Wald test, a test that applies the CLT, like
    - Z test comparing proportions in 2x2 table
    - analogous to the t test for linear regression
- H0: the new variable is not needed
  - or H0: βnew = 0 in the population
Conclusion from the Wald test
- The p-value for multivitamin is 0.007 (<0.05) and the CI for the multivitamin coefficient does not include 0 (CI for OR doesn't include 1)
- Reject H0
- Conclude that the larger model is better: after adjusting for age, multivitamin use is still an important predictor of physician visits in the population
Interpretation - log odds
- β0: the log odds of not visiting a physician for a 30-year-old person who reports not regularly taking multivitamins
- β1: the log odds ratio of not visiting a physician for a one year increase in age controlling for multivitamin use
- β2: the log odds ratio of not visiting a physician for those who take multivitamins compared with those who do not, adjusting for age
Interpretation - odds and odds ratio
- exp{β0}: the odds of not visiting a physician for a 30-year-old person who reports not regularly taking multivitamins
Interpretation - odds and odds ratio
- exp{β1}: after adjusting for multivitamin use, the odds of not visiting a physician change by a factor of exp{β1}=1.001 for each additional year of age
  - additional age is associated with lower frequency of physician visits in these students, but the association is not statistically significant (p>0.05)
Interpretation - odds and odds ratio
- exp{β2}: the odds ratio of not visiting a physician for those who take multivitamins compared with those who do not is exp{β2}=0.46, adjusting for age
  - taking multivitamins is associated with regular physician visits (p=0.007)
Interpretation In General
- Also: log[ odds(Y=1 | X1+1, X2) / odds(Y=1 | X1, X2) ] = β1
- And: OR = exp(β1) !!
  - exp(β1) is the multiplicative change in odds for a 1 unit increase in X1 provided X2 is held constant.
- The result is similar for X2
CHD by smoking and coffee
- Yi = 1 if CHD case, 0 if control
- COFi = 1 if Coffee Drinker, 0 if not
- SMKi = 1 if Smoker, 0 if not
- pi = Pr(Yi = 1)
- ni = Number observed at pattern i of Xs
Logistic Regression Model
- Yi are from a Binomial(ni, pi) distribution
- Yi are independent
- log odds(Yi=1) (or, logit(Yi=1)) is a function of
  - Coffee
  - Smoking
  - and coffee x smoking interaction
Logistic Regression Model
log( pi / (1 - pi) ) = β0 + β1COFi + β2SMKi + β3COFi*SMKi
- Which implies that Pr(Yi=1) is the logistic function
pi = e^(β0 + β1Xi1 + β2Xi2 + β3Xi1Xi2) / ( 1 + e^(β0 + β1Xi1 + β2Xi2 + β3Xi1Xi2) )
Interpretations
- exp{β1}: odds ratio of being a CHD case for coffee drinkers -vs- non-drinkers among non-smokers
- exp{β1 + β3}: odds ratio of being a CHD case for coffee drinkers -vs- non-drinkers among smokers
Interpretations
- exp{β2}: odds ratio of being a CHD case for smokers -vs- non-smokers among non-coffee drinkers
- exp{β2 + β3}: odds ratio of being a CHD case for smokers -vs- non-smokers among coffee drinkers
Interpretations
- e^β0 / (1 + e^β0): fraction of cases among non-smoking non-coffee drinking individuals in the sample (determined by sampling plan)
- exp{β3}: ratio of odds ratios
exp{β3} Interpretations
- exp{β3}: factor by which odds ratio of being a CHD case for coffee drinkers -vs- non-drinkers is multiplied for smokers as compared to non-smokers
or
- exp{β3}: factor by which odds ratio of being a CHD case for smokers -vs- non-smokers is multiplied for coffee drinkers as compared to non-coffee drinkers
Some Special Cases
- Given log[ Pr(Y=1) / Pr(Y=0) ] = β0 + β1COF + β2SMK + β3COF*SMK
- If β1 = β2 = β3 = 0
  - Neither smoking nor coffee drinking is associated with increased risk of CHD
Some Special Cases
- Given log[ Pr(Y=1) / Pr(Y=0) ] = β0 + β1COF + β2SMK + β3COF*SMK
- If β1 = β3 = 0
  - Smoking, but not coffee drinking, is associated with increased risk of CHD
Some Special Cases
- If β3 = 0
  - Smoking and coffee drinking are both associated with risk of CHD but the odds ratio of CHD-smoking is the same at all levels of coffee
  - Smoking and coffee drinking are both associated with risk of CHD but the odds ratio of CHD-coffee is the same at all levels of smoking.
Confounding
- In epidemiological terms, Z is a "confounder" of the relationship of Y with X if Z is related to both X and Y and Z is not in the causal pathway between X and Y
- In statistical terms, Z is a "confounder" of the relationship of Y with X if the X coefficient changes when Z is added to a regression of Y on X
Confounding
- For example, consider the two models
  Y = β0 + β1X + ε1
  Y = α0 + α1X + α2Z + ε2
- then Z is a confounder of the X, Y relationship if β1 ≠ α1
Look at Confidence Intervals
- Without Smoking: OR = e^0.79 = 2.2
- 95% CI for log(OR): 0.79 ± 1.96(0.33) = (0.13, 1.44)
- 95% CI for OR: (e^0.13, e^1.44) = (1.14, 4.24)
Look at Confidence Intervals
- With Smoking (adjusting for smoking): OR = e^0.53 = 1.7
- 95% CI for log(OR): 0.53 ± 1.96(0.35) = (-0.17, 1.22)
- 95% CI for OR: (e^-0.17, e^1.22) = (0.85, 3.39)
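The two intervals follow the same recipe: build the interval on the log(OR) scale, then exponentiate the endpoints. The sketch below uses the rounded estimates and standard errors shown on the slides, so its endpoints differ slightly from the slide values, which appear to come from unrounded estimates. The function name `or_with_ci` is illustrative.

```python
from math import exp


def or_with_ci(log_or, se, z=1.96):
    """Odds ratio and 95% CI from a log odds ratio and its standard error."""
    lower, upper = log_or - z * se, log_or + z * se  # CI on the log(OR) scale
    return exp(log_or), (exp(lower), exp(upper))


# Rounded coffee coefficients and standard errors from the slides.
unadj_or, unadj_ci = or_with_ci(0.79, 0.33)  # without smoking in the model
adj_or, adj_ci = or_with_ci(0.53, 0.35)      # adjusting for smoking

print(f"unadjusted: OR = {unadj_or:.2f}, 95% CI ({unadj_ci[0]:.2f}, {unadj_ci[1]:.2f})")
print(f"adjusted:   OR = {adj_or:.2f}, 95% CI ({adj_ci[0]:.2f}, {adj_ci[1]:.2f})")
```

The unadjusted interval excludes 1 while the adjusted interval includes it, which is the comparison the slides draw on.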
Conclusion
- So, ignoring smoking, the CHD and coffee OR is 2.2 (95% CI: 1.14 - 4.24)
- Adjusting for smoking gives more modest evidence for a coffee effect
- In this case-control study, smoking is a weak-to-moderate confounder of the coffee-CHD association
Interaction Model
Model 3

Variable          Est     se      z
Intercept        -1.0    .30   -3.4
Coffee            .69    .45    1.5
Smoking           1.3    .55    2.4
Coffee*Smoking   -.43    .73   -.59
Testing Interaction Term
- Z = -0.59, p-value = 0.554
- 95% confidence interval for exp{β1 + β3}
  - (0.42, 3.99)
- Both of the above suggest that there is little evidence that smoking is an effect modifier!
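The stratum-specific odds ratios implied by the interaction model can be computed from the Model 3 estimates (intercept -1.0, coffee .69, smoking 1.3, coffee*smoking -.43). This is a sketch using the rounded table values; the variable names are illustrative.

```python
from math import exp

# Rounded Model 3 estimates: intercept, coffee, smoking, coffee*smoking.
b0, b1, b2, b3 = -1.0, 0.69, 1.3, -0.43

or_coffee_nonsmokers = exp(b1)       # coffee -vs- no coffee, among non-smokers
or_coffee_smokers = exp(b1 + b3)     # coffee -vs- no coffee, among smokers
or_smoking_nondrinkers = exp(b2)     # smoking -vs- not, among non-coffee drinkers
or_smoking_drinkers = exp(b2 + b3)   # smoking -vs- not, among coffee drinkers
ratio_of_ors = exp(b3)               # ratio of odds ratios

# Fraction of cases among non-smoking, non-coffee-drinking individuals.
baseline_fraction = exp(b0) / (1 + exp(b0))

print(f"coffee OR: {or_coffee_nonsmokers:.2f} (non-smokers), {or_coffee_smokers:.2f} (smokers)")
print(f"smoking OR: {or_smoking_nondrinkers:.2f} (non-drinkers), {or_smoking_drinkers:.2f} (drinkers)")
print(f"ratio of odds ratios: {ratio_of_ors:.2f}")
```

Either pair of stratum-specific odds ratios differs by the same factor exp{β3}, which is why the interaction can be read from either side.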
Likelihood Ratio Test
- The Likelihood Ratio Test will help decide whether or not additional term(s) "significantly" improve the model fit
- Likelihood Ratio Test (LRT) statistic for comparing nested models is
  - -2 times the difference between the log likelihoods (LLs) for the Null -vs- Extended models
  - the test obtained is analogous to an analysis of variance test for linear regression models
Likelihood Ratio Test
Deviance is a term used for the difference in -2*log likelihood relative to the best possible value from a perfectly predicting model.

Change in deviance is the same as change in -2LL.
LRT Example
Model comparisons using likelihood ratio test
Summary: Unadjusted ORs
- The odds of CHD was estimated to be 3.4 times higher among smokers compared to non-smokers
  - 95% CI: (1.7, 7.9)
- The odds of CHD was estimated to be 2.2 times higher among coffee drinkers compared to non-coffee drinkers
  - 95% CI: (1.1, 4.3)
Summary: Adjusted ORs
- Controlling for the potential confounding of smoking, the coffee odds ratio was estimated to be 1.7 with 95% CI: (.85, 3.4).
- Hence, the evidence in these data is insufficient to conclude coffee has an independent effect on CHD beyond that of smoking.
Comparing the models
- Models C and F are both nested in Model A
- Models C and F cannot be directly compared to one another, but we can see which has a smaller p-value when compared to Model A
  - C vs. A: X² = 26.5 with 2 df
  - F vs. A: X² = 21.7 with 3 df
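The p-values behind this comparison can be sketched with the standard library alone. `chi2_sf` below is an illustrative helper (not a library function) that evaluates the chi-square upper tail for integer degrees of freedom using closed forms; a statistics package would normally supply this. As a sanity check, the same helper is applied to the drug/placebo logit output shown earlier (LR chi2(1) = 2.83, Prob > chi2 = 0.0926).

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square with integer df >= 1."""
    y = x / 2.0
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(y))  # exact tail for df = 1
    else:
        a, q = 1.0, math.exp(-y)             # exact tail for df = 2
    # Each pass adds the term that raises the degrees of freedom by 2.
    while 2 * a < df:
        q += y ** a * math.exp(-y) / math.gamma(a + 1)
        a += 1.0
    return q


def lrt_statistic(ll_null, ll_extended):
    """-2 times the difference in log likelihoods, Null -vs- Extended."""
    return -2.0 * (ll_null - ll_extended)


p_c_vs_a = chi2_sf(26.5, 2)  # Model C compared with Model A
p_f_vs_a = chi2_sf(21.7, 3)  # Model F compared with Model A
p_drug = chi2_sf(2.83, 1)    # check against the earlier logit output

print(f"C vs. A: p = {p_c_vs_a:.2e}")
print(f"F vs. A: p = {p_f_vs_a:.2e}")
print(f"drug/placebo LR test: p = {p_drug:.4f}")
```

Both comparisons give very small p-values, with the C vs. A p-value the smaller of the two.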
What next?
- Model C improves prediction beyond gender alone (Model A) more than Model F.
- Model C should be the next parent model, and we should test the new variables in Model F to see if they continue to improve prediction within the context of Model C.
- When a tentative final model is identified, the assumptions of logistic regression should be checked.
Flexibility in linear models
- A spline allows the "slope" for a continuous predictor to change at a given point; the coefficient is for the difference in log odds ratio
- An interaction term allows the odds ratio for one variable to differ by the value of a second variable; the coefficient is for the difference in log odds ratio
Poisson regression model
- Log-linear model for mean rate
where p is the number of predictors in the model
- Random component:
- Here:
Exponentiating Poisson regression models
Interpreting Poisson regression parameters
Modelling rates
- Of key interest in Poisson regression models is to make inference about rates of events
- We are often interested in whether the rate of cancer, or some other disease, varies by population subgroups such as gender, race, or age
Person-years
- In defining rates, it is crucial to state what denominator we have in mind
- For disease, we are usually interested in disease rate per person, per year
- If the HIV incidence rate is 5 per 1 million person-years, that means we expect to see 5 new cases of HIV per 1 million persons per year
Modelling Danish Cancer cases with an offset
- We observed Danish cancer cases in 6 age groups over a period of 4 years
- The model:
predicts log rates per 10,000 person-years
Interpretation of coefficients
More about offsets
- The purpose of an offset is to specify the denominator of the predicted rates
- We should always try to use an offset if we suspect the underlying population sizes vary for the observed counts
- Typically, we'll use log(N) as the offset, where N is the sample size or number of person-years generating each count
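The role of the offset can be seen in the simplest case, a Poisson model with one binary predictor, where the maximum likelihood estimates reduce to the observed log rates. The counts and person-years below are hypothetical, and the names are illustrative.

```python
from math import exp, log

# Hypothetical data: cases and person-years in two groups.
cases = {"unexposed": 20, "exposed": 30}
person_years = {"unexposed": 10_000, "exposed": 5_000}

rate_unexposed = cases["unexposed"] / person_years["unexposed"]
rate_exposed = cases["exposed"] / person_years["exposed"]

# With a single binary predictor, the fitted coefficients equal these log rates.
b0 = log(rate_unexposed)                 # log rate per person-year, unexposed
b1 = log(rate_exposed / rate_unexposed)  # log rate ratio, exposed -vs- unexposed


def expected_count(n_person_years, x):
    """log(E[count]) = log(N) + b0 + b1*x, where log(N) is the offset."""
    return exp(log(n_person_years) + b0 + b1 * x)


print(f"rate ratio = {exp(b1):.1f}")
print(f"rate per 10,000 person-years (unexposed) = {exp(b0) * 10_000:.1f}")
print(f"expected cases, exposed group = {expected_count(5_000, 1):.1f}")
```

Without the offset, the model would compare the raw counts 20 and 30 and miss that the exposed group generated its cases from half as many person-years.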
Poisson regression for cohort studies
- Log-linear regression can be used to estimate relative risks for cohort studies (but not case-control)
- Relative risk is like relative rate, but we are comparing risks (probability of disease) instead of rates (expected cases per person-year) across groups
- Could also estimate relative risk by transforming results from logistic regression
Grand summary
- Exploratory analysis includes graphs and tables - good to get a feel for the data
- Confirmatory analysis is useful for making definitive conclusions
- Linear models provide us with a framework in which to perform confirmatory analysis in many settings
Grand summary: linear models
- Linear regression: for continuous (normal) outcomes
- Logistic regression: for binary outcomes
- Poisson regression: for counts
Grand summary: modelling
- In all generalized linear models, we can use the following tools to make models more flexible:
  - Adjust for confounders using additive covariates
  - Allow for effect modification using interaction terms
  - Fit curved and bent lines through polynomials and splines
Grand summary: testing
- We can test significance of a single predictor using z-test (or t-test for linear regression)
- Test significance of several covariates using a pair of nested models by a likelihood ratio test
- Know how to interpret p-values and confidence intervals!