HST 190: Introduction to Biostatistics
Transcript of HST 190: Introduction to Biostatistics
Lecture 6: Methods for binary data
Binary data
• So far, we have focused on settings where the outcome is continuous
• Now, we consider the setting where our outcome of interest is binary, meaning it takes values 1 or 0.
§ In particular, we consider the 2x2 contingency table tabulating pairs of binary observations $(X_1, Y_1), \ldots, (X_n, Y_n)$
• Consider two populations
§ IV drug users who report sharing needles
§ IV drug users who do not report sharing needles
• Is the rate of positive tuberculin skin test equal in both populations?
§ To address this question, we sample 40 patients who report sharing and 60 patients who do not, to compare rates of positive tuberculin test
§ Data cross-classified according to these two binary variables

2x2 table               Positive   Negative   Total
Report sharing             12         28        40
Don't report sharing       11         49        60
Total                      23         77       100
Chi-square test for contingency tables
• The chi-square test is a test of association between two categorical variables.
• In general, its null and alternative hypotheses are
§ $H_0$: the relative proportions of individuals in each category of variable #1 are the same across all categories of variable #2; that is, the variables are not associated (i.e., statistically independent).
§ $H_1$: the variables are associated
o Notice the alternative is always two-sided
• In our example, this means
§ $H_0$: reported needle sharing is not associated with PPD (tuberculin skin test) result
• The chi-square test compares observed counts in the table to counts expected if there is no association (i.e., under $H_0$)
§ Expected counts are obtained using the marginal totals of the table.
• Recall the independence rule $P(A \cap B) = P(A)P(B)$, so from 100 people, assuming independence, we expect
$$P(\text{share} \cap \text{positive}) = P(\text{share})\,P(\text{positive}) = \frac{40}{100} \cdot \frac{23}{100} = 0.092$$
§ Then, we'd expect $0.092 \times 100 = 9.2$ positive sharers, instead of 12
• Similarly, there will likely be some discrepancy between observed and expected counts for the other three cells in the table.
§ The chi-square test assesses: are these differences too large to be the result of sampling variability?
• Steps of the chi-square test
1) Complete the observed-data table
2) Compute table of expected counts
3) Calculate the $X^2$ statistic
4) Get p-value from the chi-square table
• This method is valid only if all expected counts are ≥ 5
§ the test relies on an approximation that does not hold in small samples
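1) Complete observed data table

O_11   O_12   O_1.
O_21   O_22   O_2.
O_.1   O_.2   n

2) Complete table of expected counts
$$E_{ij} = \frac{O_{i\cdot} \times O_{\cdot j}}{n} = \frac{(O_{i1} + O_{i2})(O_{1j} + O_{2j})}{n}$$

E_11   E_12   E_1.
E_21   E_22   E_2.
E_.1   E_.2   n

3) Calculate chi-square test statistic
$$X^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \frac{(O_{11} - E_{11})^2}{E_{11}} + \frac{(O_{12} - E_{12})^2}{E_{12}} + \frac{(O_{21} - E_{21})^2}{E_{21}} + \frac{(O_{22} - E_{22})^2}{E_{22}}$$
§ swap $(O_{ij} - E_{ij})$ with $(|O_{ij} - E_{ij}| - 0.5)$ for the Yates continuity correction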
4) Get p-value from chi-square distribution
§ Under the null hypothesis $H_0$: no association between the two factors, the $X^2$ statistic follows a chi-square distribution with 1 degree of freedom. This is often written as $X^2 \sim \chi^2_1$
o continuous and positive-valued, defined by one parameter, the degrees of freedom (df)
§ p-value comes from the right tail, but is inherently 'two-sided'
o matlab: 1-chi2cdf(x,1)
[Figure: chi-square density with 1 df; the area to the right of $\chi^2_{1,0.95} = 3.84$ is 0.05]
• Thus, at the $\alpha$ level, $H_0$ is rejected if $X^2 > \chi^2_{1,1-\alpha}$
• Using the 2x2 contingency table, an alternate formula for the Yates-corrected test statistic is
$$X^2 = \frac{n\left(|ad - bc| - \frac{n}{2}\right)^2}{(a+b)(c+d)(a+c)(b+d)}$$
$$X^2 = \frac{100\,(|12(49) - 28(11)| - 50)^2}{(40)(60)(23)(77)} = 1.24 < 3.84 = \chi^2_{1,0.95}$$
• ⇒ Fail to reject $H_0$

2x2 table               Positive       Negative       Total
Report sharing          a = 12         b = 28         a + b = 40
Don't report sharing    c = 11         d = 49         c + d = 60
Total                   a + c = 23     b + d = 77     n = 100
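The four steps can be sketched in MATLAB for the needle-sharing table. This is only an illustrative sketch: the variable names are made up here, and it assumes the Statistics and Machine Learning Toolbox is available for chi2cdf (the same function the slides call).

```matlab
% Observed 2x2 table from the slides
% rows: report sharing / don't report; columns: positive / negative
O = [12 28; 11 49];
n = sum(O(:));

% Step 2: expected counts under H0 = (row total * column total) / n
E = sum(O, 2) * sum(O, 1) / n;

% The chi-square approximation needs every expected count to be >= 5
assert(all(E(:) >= 5), 'Expected counts too small; consider Fisher''s exact test');

% Step 3: test statistic without and with the Yates continuity correction
X2       = sum((O(:) - E(:)).^2 ./ E(:));
X2_yates = sum((abs(O(:) - E(:)) - 0.5).^2 ./ E(:));

% Step 4: right-tail p-value from the chi-square distribution with 1 df
p_yates = 1 - chi2cdf(X2_yates, 1);

fprintf('E(1,1) = %.1f, Yates X2 = %.2f, p = %.3f\n', E(1,1), X2_yates, p_yates);
```

The cell-by-cell Yates statistic computed this way should match the shortcut formula in terms of a, b, c, d, since the two are algebraically equivalent for a 2x2 table.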
Fisher's exact test
What happens if some expected counts are < 5? Instead of the chi-square test, use Fisher's exact test (see Rosner 10.3)
• Like the chi-square test, Fisher's exact test examines the significance of the association (contingency) between the two kinds of classification: rows and columns.
• Both row and column totals (a+c, b+d, a+b, c+d) are assumed to be fixed, not random.
• We then consider all possible tables that could give the row and column totals observed, and the corresponding probability of each configuration (it helps to realize that the first count, a, has a hypergeometric distribution under the null)
• Finally, the p-value is computed by adding up the probabilities of the tables as extreme as or more extreme than the observed one.
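The enumeration described above can be sketched with the hypergeometric density. This sketch reuses the needle-sharing counts purely for illustration (the chi-square test was already valid there), assumes the toolbox function hygepdf, and adopts one common convention for "as extreme or more extreme": tables whose probability is no larger than that of the observed table. Other conventions exist, so treat the two-sided p-value as one reasonable choice rather than the definitive one.

```matlab
% Cell counts from the needle-sharing table (used here only as an example)
a = 12; b = 28; c = 11; d = 49;
n = a + b + c + d;
rowTotal1 = a + b;   % fixed margin: number who report sharing
colTotal1 = a + c;   % fixed margin: number of positive tests

% Every value the top-left cell can take with these margins held fixed
aVals = max(0, rowTotal1 + colTotal1 - n) : min(rowTotal1, colTotal1);

% Probability of each possible table under H0
% hygepdf(x, M, K, N): population size M, K items with the trait, N draws
probs = hygepdf(aVals, n, rowTotal1, colTotal1);

% Two-sided p-value: sum over tables no more probable than the observed one
% (small relative tolerance guards against floating-point ties)
pObs   = hygepdf(a, n, rowTotal1, colTotal1);
pValue = sum(probs(probs <= pObs * (1 + 1e-7)));

fprintf('Fisher exact two-sided p-value: %.4f\n', pValue);
```

Recent toolbox releases also provide a fishertest function that could be used to cross-check this hand computation.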
What if we are interested in a variable that has more than two categories?
Example: Test for association between eye color and presence or absence of a mutant allele at some genetic locus.
Eye color categories: blue, green, brown, hazel, gray
Genetic categories: 0 copies of mutant allele, ≥ 1 copy of mutant allele
Chi-square test for contingency tables, RxC
The chi-square test can be used for variables with more than two categories. Data are presented in an RxC table, a generalization of the 2x2 table:
R = # rows, C = # columns (doesn't matter which variable is which)

                         blue   green   brown   hazel   gray   Total
Mutant allele absent       3      7      21      15      15      61
Mutant allele present      6     10      18      14      17      65
Total                      9     17      39      29      32     126
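The chi-square test for an RxC table is the same as for a 2x2 table except:
• This method can only be used if no more than 1/5 of cells have expected count < 5 AND if no cell has expected count < 1.
• Under $H_0$, the $X^2$ test statistic follows a chi-square distribution on (R-1)(C-1) degrees of freedom
$$X^2 = \frac{(O_{11} - E_{11})^2}{E_{11}} + \frac{(O_{12} - E_{12})^2}{E_{12}} + \cdots + \frac{(O_{RC} - E_{RC})^2}{E_{RC}}$$
$$X^2 \sim \chi^2_{(R-1)(C-1)}$$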
Again, we have to obtain marginal totals to determine the expected count for each cell. For example, the expected counts would be calculated as follows

                         blue    green   brown   hazel   gray    Total
Mutant allele absent     4.36    8.23    18.88   14.04   15.49     61
Mutant allele present    4.64    8.77    20.12   14.96   16.51     65
Total                    9       17      39      29      32       126

$$E_{11} = \frac{61 \times 9}{126} = 4.36, \ \ldots, \ E_{RC} = \frac{65 \times 32}{126} = 16.51$$
• Under $H_0$, $X^2 \sim \chi^2_4$
• $$X^2 = \frac{(3 - 4.36)^2}{4.36} + \frac{(7 - 8.23)^2}{8.23} + \cdots + \frac{(17 - 16.51)^2}{16.51} = 1.80$$
MATLAB: 1-chi2cdf(1.8,4), p-value = 0.77
Conclusion: No evidence for association between eye color and mutant alleles.
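The eye-color calculation can be reproduced along the following lines. This is a sketch with illustrative variable names; it again assumes chi2cdf from the Statistics and Machine Learning Toolbox.

```matlab
% Observed 2x5 table: rows = mutant allele absent / present,
% columns = blue, green, brown, hazel, gray
O = [ 3  7 21 15 15;
      6 10 18 14 17];
n = sum(O(:));

% Expected counts from the marginal totals
E = sum(O, 2) * sum(O, 1) / n;

% Validity checks from the slides: share of cells with E < 5, and min(E)
fracSmall = mean(E(:) < 5);
minE      = min(E(:));

% Test statistic, degrees of freedom, and right-tail p-value
X2 = sum((O(:) - E(:)).^2 ./ E(:));
df = (size(O, 1) - 1) * (size(O, 2) - 1);
p  = 1 - chi2cdf(X2, df);

fprintf('X2 = %.2f on %d df, p = %.2f (%.0f%% of cells with E < 5, min E = %.2f)\n', ...
        X2, df, p, 100 * fracSmall, minE);
```

Note that two of the ten expected counts fall below 5, which sits exactly at the 1/5 limit stated above.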
Two-sample comparison of proportions
What if we are interested in estimating and quantifying uncertainty about the difference in proportions between two groups?
• e.g., want an estimate and CI of the difference in proportions of positive tuberculosis skin tests between needle sharers and non-sharers
The approach is similar to two-sample estimation for continuous data questions, with subtle differences!
• Whereas we have previously considered the difference in means of continuous two-sample data, we now compare two populations' unknown proportions $p_1$ and $p_2$.
• Suppose we want to know whether two communities have the same obesity rate.
§ You draw random samples from both; in the first city, 20 out of 100 are obese, while in the second 24 out of 150 are obese.
• Goals:
§ estimate and compute the 95% C.I. for the difference in proportions
§ conduct a significance test at level $\alpha = 0.05$ for a difference
• Before, we saw that if a random experiment has two possible outcomes, "success" and "failure", and we do $n$ independent repetitions with identical success probability $p$, then $X \sim \text{Bin}(n, p)$ is the number of successes.
§ Now, we observe $X_1 \sim \text{Bin}(n_1, p_1)$ and $X_2 \sim \text{Bin}(n_2, p_2)$ and then make inference about $p_1 - p_2$.
• Estimation is identical to the two-sample continuous case: difference of sample proportions, $\hat{p}_1 - \hat{p}_2$
• If $n_1\hat{p}_1(1 - \hat{p}_1) \ge 5$ and $n_2\hat{p}_2(1 - \hat{p}_2) \ge 5$, the associated $100(1 - \alpha)\%$ CI is given by
$$\hat{p}_1 - \hat{p}_2 \pm z_{1-\alpha/2} \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$$
• For example, consider two samples
§ $n_1 = 100$, $X_1 = 20$, $\hat{p}_1 = \frac{20}{100} = 0.20$, $n_1\hat{p}_1(1 - \hat{p}_1) = 16 \ge 5$
§ $n_2 = 150$, $X_2 = 24$, $\hat{p}_2 = \frac{24}{150} = 0.16$, $n_2\hat{p}_2(1 - \hat{p}_2) = 20.16 \ge 5$
• Then the 95% CI for the difference is
$$(0.20 - 0.16) \pm 1.96 \sqrt{\frac{0.2(0.8)}{100} + \frac{0.16(0.84)}{150}} = 0.04 \pm 1.96(0.050) = 0.04 \pm 0.10 = (-0.06, 0.14)$$
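As a quick check of the interval above, a sketch along these lines should give the same numbers. Variable names are illustrative, and norminv is assumed from the Statistics and Machine Learning Toolbox.

```matlab
% Obesity example from the slides
n1 = 100; x1 = 20;
n2 = 150; x2 = 24;
p1 = x1 / n1;
p2 = x2 / n2;

% Normal-approximation conditions stated in the slides
ok = (n1 * p1 * (1 - p1) >= 5) && (n2 * p2 * (1 - p2) >= 5);

% Unpooled standard error and 95% confidence interval
alpha = 0.05;
z  = norminv(1 - alpha / 2);
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2);
ci = (p1 - p2) + [-1 1] * z * se;

fprintf('difference = %.2f, 95%% CI = (%.2f, %.2f)\n', p1 - p2, ci(1), ci(2));
```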
Hypothesis testing for difference of proportions
• Now, consider $H_0$: $p_1 = p_2$ versus $H_1$: $p_1 \ne p_2$
§ Under $H_0$, we can pool the two samples to calculate the standard error, letting $\hat{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2}$
• Then if $n_1\hat{p}_1(1 - \hat{p}_1) \ge 5$ and $n_2\hat{p}_2(1 - \hat{p}_2) \ge 5$, under $H_0$ we form the Z-test statistic
$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
• It has an approximate N(0,1) distribution when the null is true.
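• Continuing the same example,
§ $n_1 = 100$, $X_1 = 20$, $\hat{p}_1 = \frac{20}{100} = 0.20$, $n_1\hat{p}_1(1 - \hat{p}_1) = 16 \ge 5$
§ $n_2 = 150$, $X_2 = 24$, $\hat{p}_2 = \frac{24}{150} = 0.16$, $n_2\hat{p}_2(1 - \hat{p}_2) = 20.16 \ge 5$
§ $\hat{p} = \frac{X_1 + X_2}{n_1 + n_2} = \frac{20 + 24}{100 + 150} = 0.176$
• The test statistic is then
$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{0.20 - 0.16}{\sqrt{0.176(0.824)\left(\frac{1}{100} + \frac{1}{150}\right)}} = 0.81$$
• From table or MATLAB, $P(Z > 0.81) = 0.21$, so the p-value is $2(0.21) = 0.42 > 0.05$ ⇒ do not reject $H_0$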
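The pooled z-test can be sketched the same way. The names below are illustrative, and normcdf is assumed from the Statistics and Machine Learning Toolbox.

```matlab
% Obesity example from the slides
n1 = 100; x1 = 20;
n2 = 150; x2 = 24;
p1 = x1 / n1;
p2 = x2 / n2;

% Pooled proportion under H0: p1 = p2
pPool = (x1 + x2) / (n1 + n2);

% Z statistic using the pooled standard error
z = (p1 - p2) / sqrt(pPool * (1 - pPool) * (1 / n1 + 1 / n2));

% Two-sided p-value from the standard normal distribution
pValue = 2 * (1 - normcdf(abs(z)));

fprintf('pooled p = %.3f, z = %.2f, two-sided p-value = %.2f\n', pPool, z, pValue);
```

Notice that the test uses the pooled standard error while the confidence interval uses the unpooled one, which is one of the "subtle differences" from the continuous case.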
Chi-square tests for contingency tables allow us to test for association between two categorical variables.
“Is there statistical evidence of an association between daily aspirin and peptic ulcer disease?”
How do we estimate the magnitude of the association between two categorical variables?
“How much higher is the rate of peptic ulcer disease among daily aspirin users?”
Odds ratio and relative risk
• Consider two categorical variables:
§ “disease” vs “no disease”
§ “exposure” vs “no exposure”
• “Exposure” could be a treatment, risk factor, or other factor
§ no assumption is made about whether it increases or decreases disease risk
• Prospective study: Suppose for now that we enroll patients based on exposure status (vs. based on disease status)
§ e.g., 100 smokers and 100 nonsmokers
Measures of Effect for Categorical Data
After we sample a specified number of exposed and unexposed individuals, we classify them by disease status as shown below
Three ways to quantify magnitude of association:
1. Risk difference (RD) = same as difference of proportions
2. Relative risk (RR) or 'risk ratio'
3. Odds ratio

                Disease
Exposure      +       -       Total
+             a       b       a+b
-             c       d       c+d
Total         a+c     b+d     n
Risk Difference
Risk Difference = $p_1 - p_2$, where
$p_1 = P(\text{disease} \mid \text{exposed})$
$p_2 = P(\text{disease} \mid \text{unexposed})$
$$\text{estimated Risk Difference} = \frac{a}{a + b} - \frac{c}{c + d}$$
Risk Ratio
Relative Risk (Risk Ratio) = $\frac{p_1}{p_2}$
$$\text{estimated Relative Risk} = \frac{\left(\frac{a}{a + b}\right)}{\left(\frac{c}{c + d}\right)}$$
Risk Difference vs. Ratio
Suppose that you enroll 100 smokers and 100 nonsmokers in your study:

              disease
smoke       +       -       Total
+           30      70      100
-           15      85      100
Total       45      155     200

$$\text{Risk difference} = \frac{30}{100} - \frac{15}{100} = 0.15$$
$$\text{Relative risk} = \frac{30/100}{15/100} = 2$$
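For a prospective table like the smoking example, both measures follow directly from the cell counts. A minimal sketch, with illustrative variable names:

```matlab
% Smoking example from the slides (rows: smokers / nonsmokers;
% columns: disease / no disease), sampled by exposure status
a = 30; b = 70;
c = 15; d = 85;

% Estimated risks in each exposure group
p1 = a / (a + b);   % P(disease | exposed)
p2 = c / (c + d);   % P(disease | unexposed)

RD = p1 - p2;       % risk difference
RR = p1 / p2;       % relative risk (risk ratio)

fprintf('RD = %.2f, RR = %.1f\n', RD, RR);
```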
Complicating factors
Measuring “effect size”: why does it get more complicated?
• Time
§ We often measure a rate ratio instead of a risk ratio
§ More on this aspect when we discuss survival analysis
• Effect Modification and Confounding
§ Our estimates typically need to be adjusted for other factors
• Sampling
§ Depending on how you enroll patients in your study, it may not be possible to estimate a risk difference or risk ratio even in principle
Risk Difference vs. Ratio
Suppose you conduct a case-control study by enrolling 100 patients with disease and 100 without, and then determine which have smoked:

              disease
smoke       +       -       Total
+           25      10      35
-           75      90      165
Total       100     100     200

• Can't estimate $p_1$ and $p_2$ if you pre-specify the number of subjects with disease → can't estimate RD or RR.
• Need to know how the data in your table were sampled!
Retrospective sampling
• A case-control study (or retrospective study) samples patients based on disease status, then classifies them according to exposure
• Case-control studies are often performed for cost and efficiency, particularly when the disease or outcome is rare: there is no need to follow subjects through their entire lifetime and collect huge samples.
• There is a measure of effect size that can be computed regardless of whether patients are enrolled based on exposure status or disease status…
Odds
• If $p = P(\text{event})$, then define the odds of the event as $\frac{p}{1 - p}$
§ Probability = 0.2 ⇒ Odds = 0.25
§ Probability = 0.5 ⇒ Odds = 1
§ Probability = 0.75 ⇒ Odds = $\frac{0.75}{0.25}$ = 3
§ Probability = 0.99 ⇒ Odds = $\frac{0.99}{0.01}$ = 99
• Odds can range from 0 to infinity
§ When we randomly sample patients based on exposure status, we can estimate $P(\text{disease} \mid \text{exposed})$ and $P(\text{disease} \mid \text{unexposed})$
§ If we instead perform a case-control study, we can't. We can only estimate $P(\text{exposed} \mid \text{disease})$ and $P(\text{exposed} \mid \text{no disease})$
Odds ratio
Imagine a table showing all individuals in the population (the table you “wish” you could see)

                Disease
Exposure      +       -       Total
+             a       b       a+b
-             c       d       c+d
Total         a+c     b+d     n

Let $p_1 = P(\text{disease} \mid \text{exposed})$ and $p_2 = P(\text{disease} \mid \text{unexposed})$, then the ratio of both exposure groups' odds of disease is:
$$OR = \frac{\text{Odds of disease for exposed}}{\text{Odds of disease for unexposed}} = \frac{p_1 / (1 - p_1)}{p_2 / (1 - p_2)} = \frac{\dfrac{a/(a + b)}{b/(a + b)}}{\dfrac{c/(c + d)}{d/(c + d)}} = \frac{ad}{cb}$$
If we instead consider $P(\text{exposed} \mid \text{disease})$ and $P(\text{exposed} \mid \text{no disease})$, then the ratio of both disease groups' odds of exposure is:
$$OR = \frac{\text{Odds of exposure for diseased}}{\text{Odds of exposure for nondiseased}} = \frac{\dfrac{a/(a + c)}{c/(a + c)}}{\dfrac{b/(b + d)}{d/(b + d)}} = \frac{ad}{cb}$$
Therefore, the OR is a measure of association that is numerically identical in either study design.
[Figure: odds $p/(1 - p)$ plotted against $p$]
• Therefore, sampling by exposure, estimating $p_1$ and $p_2$, and computing the odds ratio is estimating the same quantity as estimating the odds ratio (of “exposure probabilities”) in a case-control study.
• So what if RR is of interest?
§ If the disease is rare, $p_1, p_2$ are small, so $\frac{p}{1 - p} \approx p$ for small $p$ and $\frac{1 - p_1}{1 - p_2} \approx 1$ ⇒
$$OR = \frac{p_1 / (1 - p_1)}{p_2 / (1 - p_2)} \approx \frac{p_1}{p_2} = RR$$
OR approximates RR for a rare outcome
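A small numerical illustration of the rare-disease approximation follows. The baseline risks are hypothetical values chosen here for illustration (they do not come from the slides); the relative risk is held fixed at 2 while the disease becomes rarer.

```matlab
% Hypothetical risks in the unexposed group, from common to rare
p2 = [0.25 0.10 0.01 0.001];
RR = 2;              % relative risk held fixed (assumed for illustration)
p1 = RR * p2;        % corresponding risks in the exposed group

% Odds ratio implied by each pair of risks
OR = (p1 ./ (1 - p1)) ./ (p2 ./ (1 - p2));

% Print how the OR approaches the RR as the outcome gets rarer
for i = 1:numel(p2)
    fprintf('p2 = %.3f: RR = %.2f, OR = %.3f\n', p2(i), RR, OR(i));
end
```

With a common outcome the OR overstates the RR noticeably, and the gap shrinks as the baseline risk falls.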
Takeaways
• Cannot estimate RR and RD in a case-control study (unless you have additional data).
• Can estimate the odds ratio from either a “prospective” or case-control study, and we estimate it the same way in either one.
• Odds ratio approximates RR for a rare disease.
Interpreting odds ratio
• Difficult to give an “everyday” interpretation of what the odds ratio's precise value means
• $OR > 1$ → exposure associated with higher disease risk
• $OR < 1$ → exposure associated with lower disease risk
• $OR = 1$ → no association of exposure and disease status
Inference on odds ratio
• To perform a hypothesis test or generate a CI for the OR, we
1) Compute the logarithm of the estimated OR [ln(OR)]
2) Make inference on ln(OR)
3) Translate conclusions into statements about the OR
• Why the log of the OR?
§ The sampling distribution of ln(OR) approximates a normal distribution more closely than that of the OR itself
o Hence, methods based on the normal approximation work better for ln(OR)
§ To see this, compare sampling distributions of OR vs. ln(OR): we simulate a population with fixed rates of exposure and disease. For three different sample sizes, we randomly draw 10,000 samples and compute OR and ln(OR) for each
[Figure: histograms of OR and ln(OR) for each of the three sample sizes]
Code to recreate in MATLAB
Sample_Size = [50, 200, 1000];   % Define the sample sizes
Prob1 = 0.75; Prob2 = 0.5;       % Set the binomial probabilities for X and U
figure;
for i = 1:length(Sample_Size)
    X = binornd(1, Prob1, Sample_Size(i), 10000);   % Generate 10,000 trials of X
    U = binornd(1, Prob2, Sample_Size(i), 10000);   % Generate 10,000 trials of U
    OR = (sum(X,1) .* sum(1-U,1)) ./ (sum(U,1) .* sum(1-X,1));   % Calculate the Odds Ratio
    LOR = log(OR);   % Calculate the log of the Odds Ratio
    % Plot the Odds Ratio (left column)
    subplot(length(Sample_Size), 2, 2*i-1); hist(OR, 20); xlim([min(OR) max(OR)]);
    xlabel('Odds Ratio'); ylabel(['Sample Size ' num2str(Sample_Size(i))]);
    % Plot the Log Odds Ratio (right column)
    subplot(length(Sample_Size), 2, 2*i); hist(LOR, 20); xlim([min(LOR) max(LOR)]);
    xlabel('Log Odds Ratio');
end
% Set the title for the figure (sgtitle needs R2018b or later; older releases used suptitle)
sgtitle('Odds Ratio Demonstration');
Confidence interval for OR
• If the expected count in each cell of the 2x2 table is ≥ 5, then the sample estimate of the true population ln(OR) approximately follows the distribution
$$\ln(\widehat{OR}) \sim N\left(\ln(OR), \ \frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}\right)$$
• Another way of writing this result is
$$\text{Var}\left[\ln(\widehat{OR})\right] \approx \frac{1}{n_1\hat{p}_1(1 - \hat{p}_1)} + \frac{1}{n_2\hat{p}_2(1 - \hat{p}_2)}$$

                Disease
Exposure      +       -       Total
+             a       b       a+b
-             c       d       c+d
Total         a+c     b+d     n
• Therefore, to get a $100(1 - \alpha)\%$ CI for the population OR we use a two-step process:
1) CI for ln(OR): $\ln(\widehat{OR}) \pm z_{1-\alpha/2} \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}} = (c_1, c_2)$
2) CI for OR: $(e^{c_1}, e^{c_2})$
• Importantly, the CI is not symmetric around the estimated OR
• Consider an outbreak of gastroenteritis in a school following lunch. 263 students ate lunch in the cafeteria that day. Sandwiches are suspected
§ How strong is the association, if any, between consumption of the sandwich and illness? Provide a 95% CI for the odds ratio

                  Ill?
Ate sandwich?   Yes     No      Total
Yes             109     116     225
No              4       34      38
Total           113     150     263

§ $OR = \frac{ad}{bc} = \frac{109(34)}{4(116)} = 7.99 \Rightarrow \ln(\widehat{OR}) = \ln(7.99) = 2.078$
§ Step 1: $2.078 \pm z_{1-\alpha/2} \sqrt{\frac{1}{109} + \frac{1}{116} + \frac{1}{4} + \frac{1}{34}} = (1.01, 3.146)$
§ Step 2: 95% CI for OR is $(e^{1.01}, e^{3.146}) = (2.75, 23.2)$
§ Because the CI does not contain 1, reject the null of no association at the 0.05 level
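The two-step interval for the sandwich example can be sketched as below. Names are illustrative and norminv is assumed from the Statistics and Machine Learning Toolbox; small differences from the slide values come from rounding of intermediate results.

```matlab
% Gastroenteritis example (rows: ate sandwich yes / no; columns: ill yes / no)
a = 109; b = 116;
c = 4;   d = 34;

% Point estimate and its logarithm
OR    = (a * d) / (b * c);
logOR = log(OR);

% Standard error of ln(OR) from the four cell counts
se = sqrt(1/a + 1/b + 1/c + 1/d);

% Step 1: 95% CI on the log scale; Step 2: exponentiate the endpoints
z     = norminv(0.975);
ciLog = logOR + [-1 1] * z * se;
ciOR  = exp(ciLog);

fprintf('OR = %.2f, 95%% CI = (%.2f, %.1f)\n', OR, ciOR(1), ciOR(2));
```

The resulting interval is visibly asymmetric around the point estimate, as expected for a CI built on the log scale.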
Multiple 2x2 tables
• What if we have a confounding variable associated with exposure and outcome, such that there are several 2x2 tables, each corresponding to one level of the confounding variable?
• Can we pool the counts in the tables into one table?
§ Not so fast. This can seriously bias our results…
• For example, Percutaneous Nephrolithotomy (PN) was compared with several other procedures, classified as “open” procedures (OP), for treatment of renal calculi

         Successful   Unsuccessful   Total
PN          289           61          350
OP          273           77          350
Total       562          138          700

289/350 = 0.826 chance of success for PN
273/350 = 0.780 chance of success for OP
• Percutaneous treatment clearly looks superior; the estimated odds ratio for success based on having (vs. not having) percutaneous treatment is
$$OR = \frac{289(77)}{61(273)} = 1.34 > 1$$
• However, if results are stratified based on stone size, percutaneous treatment looks worse!
§ Large stones: $OR = \frac{55(71)}{25(192)} = 0.81 < 1$
§ Small stones: $OR = \frac{234(6)}{36(81)} = 0.48 < 1$

Large stones
         Suc.    Unsuc.   Total
PN        55       25       80
OP       192       71      263
Total    247       96      343

Small stones
         Suc.    Unsuc.   Total
PN       234       36      270
OP        81        6       87
Total    315       42      357
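The reversal can be checked directly from the three tables. A minimal sketch, in which the helper oddsRatio is a name introduced here for illustration:

```matlab
% Stratum tables from the slides (rows: PN, OP; columns: success, failure)
large = [ 55 25; 192 71];
small = [234 36;  81  6];

% Pooling simply adds the cell counts across strata
pooled = large + small;

% Cross-product odds ratio ad/(bc) for a 2x2 table
oddsRatio = @(T) (T(1,1) * T(2,2)) / (T(1,2) * T(2,1));

% Overall success rate within each stratum, regardless of procedure
successLarge = sum(large(:,1)) / sum(large(:));
successSmall = sum(small(:,1)) / sum(small(:));

fprintf('OR pooled = %.2f, large = %.2f, small = %.2f\n', ...
        oddsRatio(pooled), oddsRatio(large), oddsRatio(small));
fprintf('Success rate: large stones %.0f%%, small stones %.0f%%\n', ...
        100 * successLarge, 100 * successSmall);
```

The pooled odds ratio is above 1 even though both stratum-specific odds ratios are below 1, while the stratum success rates differ markedly.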
• Percutaneous treatment is associated with a higher success rate (OR > 1) overall, yet with a lower success rate (OR < 1) for each type of stone separately
§ How is that possible?
• This is the result of confounding by a factor associated with both the treatment and the outcome (what is it?)
§ PN was used mostly for small stones, which had a higher success rate in general (88%). OPs were used mostly for large stones, which had lower success rates (72%)
§ Pooling the data allowed the stone-size effect to mask the difference in treatment effectiveness
• Confounding may occur whenever there is a factor that is associated with both treatment assignment and outcome
§ Confounding leading to the opposite conclusion in aggregated data is called Simpson's Paradox (closely related to the Ecological Fallacy).
• No statistical procedure “automatically” protects you from confounding. Adjustment for confounding requires understanding of the science
• After a study is conducted, certain statistical techniques can be used to adjust for it (discussed over the next two lectures)
§ Stratification
§ Matching
§ (Logistic) Regression adjustment
Stratification
• If you stratify data into multiple 2x2 tables (strata) based on a confounder, and believe they share a common OR, you can estimate this OR using the Mantel-Haenszel Method (MH)
• This method is valid if the relationship between exposure and disease is the same in each stratum (even though baseline risk may differ)
§ If the relationship is not the same in each stratum, then it does not make sense to combine the data for doing inference
• Follow two steps:
1) Test whether the ORs are the same in each stratum
2) If so, proceed with inference for the common OR, using all the tables
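Chi-square test for homogeneity
• To see if the ORs are the same in each stratum, we use the chi-square test for homogeneity
• Given $k$ strata (tables), we test the hypotheses
§ $H_0$: $OR_1 = OR_2 = \cdots = OR_k$ (homogeneity)
§ $H_1$: at least one of the ORs is different
• Test statistic is
$$X^2_{HOM} = \sum_{i=1}^{k} w_i \left(\ln \widehat{OR}_i - \overline{\ln OR}\right)^2$$
§ $w_i = \left(\frac{1}{a_i} + \frac{1}{b_i} + \frac{1}{c_i} + \frac{1}{d_i}\right)^{-1}$, $\quad \overline{\ln OR} = \frac{\sum_{i=1}^{k} w_i \ln \widehat{OR}_i}{\sum_{i=1}^{k} w_i}$
§ Under the null, $X^2_{HOM} \sim \chi^2_{k-1}$
• If we reject $H_0$, stop here. Otherwise, estimate the common OR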
• In the renal calculi example, test of homogeneity by stone size
§ Large stones: $\ln \widehat{OR}_1 = \ln \frac{55(71)}{25(192)} = -0.206$
o $w_1 = \left(\frac{1}{55} + \frac{1}{25} + \frac{1}{192} + \frac{1}{71}\right)^{-1} = 12.91$
§ Small stones: $\ln \widehat{OR}_2 = \ln \frac{234(6)}{36(81)} = -0.731$
o $w_2 = \left(\frac{1}{234} + \frac{1}{36} + \frac{1}{81} + \frac{1}{6}\right)^{-1} = 4.74$
§ $\overline{\ln OR} = \frac{12.91(-0.206) + 4.74(-0.731)}{12.91 + 4.74} = -0.347$
$$X^2_{HOM} = 12.91(-0.206 + 0.347)^2 + 4.74(-0.731 + 0.347)^2 = 0.956 < 3.84 = \chi^2_{1,0.95}$$
§ We fail to reject the null hypothesis that the odds ratios are equal, and continue
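The homogeneity test for the two stone-size strata can be sketched as follows. Variable names are illustrative and chi2cdf is assumed from the Statistics and Machine Learning Toolbox; the statistic may differ from the slide value in the third decimal because the slide rounds intermediate quantities.

```matlab
% Stratum tables stacked along the third dimension
% (rows: PN, OP; columns: success, failure)
tables = cat(3, [55 25; 192 71], [234 36; 81 6]);
k = size(tables, 3);

logOR = zeros(k, 1);
w     = zeros(k, 1);
for i = 1:k
    T = tables(:, :, i);
    logOR(i) = log((T(1,1) * T(2,2)) / (T(1,2) * T(2,1)));  % stratum ln(OR)
    w(i)     = 1 / sum(1 ./ T(:));                          % inverse-variance weight
end

% Weighted average of the stratum log odds ratios
logORbar = sum(w .* logOR) / sum(w);

% Homogeneity statistic and its right-tail p-value on k-1 df
X2hom  = sum(w .* (logOR - logORbar).^2);
pValue = 1 - chi2cdf(X2hom, k - 1);

fprintf('X2_HOM = %.3f on %d df, p = %.3f\n', X2hom, k - 1, pValue);
```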
Mantel-Haenszel odds ratio estimator
• If we conclude homogeneity across strata, then the Mantel-Haenszel Estimator of the common Odds Ratio is
$$\widehat{OR}_{MH} = \frac{\sum_{i=1}^{k} a_i d_i / n_i}{\sum_{i=1}^{k} b_i c_i / n_i}$$
• We can now use hypothesis tests and confidence intervals for the common OR (via the ln(OR)). First, check that
§ $\sum_{i=1}^{k} (a_i + c_i)(a_i + b_i)/n_i \ge 5$
§ $\sum_{i=1}^{k} (a_i + c_i)(c_i + d_i)/n_i \ge 5$
§ $\sum_{i=1}^{k} (b_i + d_i)(a_i + b_i)/n_i \ge 5$
§ $\sum_{i=1}^{k} (b_i + d_i)(c_i + d_i)/n_i \ge 5$
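• Under these conditions, the $100(1 - \alpha)\%$ CI for ln(OR) is
$$\ln \widehat{OR}_{MH} \pm z_{1-\alpha/2} \left(\sum_{i=1}^{k} w_i\right)^{-1/2} = (L, U)$$
§ Where $w_i = \left(\frac{1}{a_i} + \frac{1}{b_i} + \frac{1}{c_i} + \frac{1}{d_i}\right)^{-1}$
• The CI for the OR is then $(e^L, e^U)$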
Hypothesis testing for MH
• Finally, we may wish to test the null hypothesis of no association between two variables, controlling for a confounder: $H_0$: $OR = 1$ versus $H_1$: $OR \ne 1$
• To do the test, we need to calculate 3 quantities:
§ $O = \sum_{i=1}^{k} O_i = \sum_{i=1}^{k} a_i$
§ $E = \sum_{i=1}^{k} E_i = \sum_{i=1}^{k} \frac{(a_i + b_i)(a_i + c_i)}{n_i}$
§ $V = \sum_{i=1}^{k} V_i = \sum_{i=1}^{k} \frac{(a_i + b_i)(c_i + d_i)(a_i + c_i)(b_i + d_i)}{n_i^2 (n_i - 1)}$ (must be ≥ 5)
• $X^2_{MH} = \frac{(|O - E| - 0.5)^2}{V}$, which follows a $\chi^2_1$ distribution if $H_0$ is true
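The three quantities can be accumulated stratum by stratum. The sketch below applies the formulas to the renal calculi strata from the earlier slides; the slides do not report this test statistic, so the printed value is simply what the formulas give. Names are illustrative and chi2cdf is assumed from the Statistics and Machine Learning Toolbox.

```matlab
% Renal calculi strata (rows: PN, OP; columns: success, failure)
tables = cat(3, [55 25; 192 71], [234 36; 81 6]);
k = size(tables, 3);

O = 0; E = 0; V = 0;
for i = 1:k
    T = tables(:, :, i);
    a = T(1,1); b = T(1,2); c = T(2,1); d = T(2,2);
    n = a + b + c + d;
    O = O + a;                                   % observed count in cell a
    E = E + (a + b) * (a + c) / n;               % its expectation under H0
    V = V + (a + b) * (c + d) * (a + c) * (b + d) / (n^2 * (n - 1));  % its variance
end

% The chi-square approximation requires V >= 5
assert(V >= 5, 'Variance too small for the chi-square approximation');

% Continuity-corrected Mantel-Haenszel statistic and p-value on 1 df
X2mh   = (abs(O - E) - 0.5)^2 / V;
pValue = 1 - chi2cdf(X2mh, 1);

fprintf('O = %d, E = %.2f, V = %.2f, X2_MH = %.2f, p = %.3f\n', O, E, V, X2mh, pValue);
```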
• Returning to the renal calculi example,
$$\widehat{OR}_{MH} = \frac{\frac{55(71)}{343} + \frac{234(6)}{357}}{\frac{25(192)}{343} + \frac{36(81)}{357}} = 0.69$$
§ compromise between the two stratum-specific ORs (0.81 and 0.48)
• To compute the 95% CI, first verify the conditions given previously (they are messy to show, but in this case met)
$$\ln \widehat{OR}_{MH} \pm z_{1-\alpha/2} \sqrt{1/(12.91 + 4.74)} = (-0.84, 0.10)$$
• Thus, the 95% CI for the OR is $(e^{-0.84}, e^{0.10}) = (0.43, 1.10)$
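The estimator and its interval can be reproduced with a short sketch like the one below. Names are illustrative and norminv is assumed from the Statistics and Machine Learning Toolbox; the weights are the same inverse-variance weights used in the homogeneity test.

```matlab
% Renal calculi strata (rows: PN, OP; columns: success, failure)
tables = cat(3, [55 25; 192 71], [234 36; 81 6]);
k = size(tables, 3);

num = 0; den = 0; wSum = 0;
for i = 1:k
    T = tables(:, :, i);
    n = sum(T(:));
    num  = num  + T(1,1) * T(2,2) / n;   % a_i * d_i / n_i
    den  = den  + T(1,2) * T(2,1) / n;   % b_i * c_i / n_i
    wSum = wSum + 1 / sum(1 ./ T(:));    % w_i = (1/a + 1/b + 1/c + 1/d)^(-1)
end

% Mantel-Haenszel estimate of the common odds ratio
ORmh = num / den;

% 95% CI on the log scale, then exponentiate the endpoints
z     = norminv(0.975);
ciLog = log(ORmh) + [-1 1] * z / sqrt(wSum);
ciOR  = exp(ciLog);

fprintf('OR_MH = %.2f, 95%% CI = (%.2f, %.2f)\n', ORmh, ciOR(1), ciOR(2));
```

Because the interval contains 1, the stratified data do not show a significant treatment difference at the 0.05 level, even though both stratum-specific estimates favor the open procedures.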