SticiGui Confidence Intervals


Transcript of SticiGui Confidence Intervals

  • Chapter 26

    Confidence Intervals

    This chapter continues our study of estimating population parameters from random samples. In Chapter 25, Estimating Parameters from Simple Random Samples, we studied estimators that assign a number to each possible random sample, and the uncertainty of such estimators, measured by their RMSE. (The RMSE is the square root of the expected value of the squared difference between the estimator and the parameter: a measure of the typical size of the error.) Instead of assigning a single number to each sample and reporting the size of a typical error, the methods in this chapter assign an interval to each sample and report the confidence level that the interval contains the parameter. Confidence is a technical term related to probability. Just as the RMSE of an estimator measures the long-run average size of the error in repeated sampling, but the error for any particular sample could be smaller or larger than the RMSE, the confidence level is the long-run fraction of intervals that contain the parameter in repeated sampling, but the interval for any particular sample might or might not contain the parameter.

    The statement "the interval [92%, 94%] contains the population percentage at confidence level 90%" does not mean that the probability that the population percentage is between 92% and 94% is 90%. (The event that the interval [92%, 94%] contains the population percentage is not random: Either the population percentage is between 92% and 94%, or it is not.) Rather, the statement means that if we were to take samples of size n repeatedly and compute a 90% confidence interval for the population percentage from each sample, the long-run fraction of intervals that contain the population percentage would converge to 90%.

    The length of the confidence interval and the confidence level measure how accurately we are able to estimate the parameter from a sample. If a short interval has high confidence, the data allow us to estimate the parameter accurately. Higher confidence generally requires a longer interval, ceteris paribus, and shorter intervals generally have lower confidence levels. Conventional values for the confidence level of confidence intervals include 68%, 90%, 95%, and 99%, but sometimes other values are used. It is crucial to know the confidence level associated with a confidence interval: The interval by itself is meaningless.

    Conservative confidence intervals for percentages

  • In this section, we develop conservative confidence intervals for the population percentage based on the sample percentage, using Chebychev's inequality and an upper bound on the SD of lists that contain only the numbers 0 and 1. Conservative means that the chance that the procedure produces an interval that contains the population percentage is at least as large as claimed. (Later in this chapter we will consider approximate confidence intervals.)

    Consider a 0-1 box of N tickets. The population percentage p is the fraction of tickets labeled "1":

    $$p = 100\% \times (\#\text{tickets in the population labeled "1"})/N.$$

    The population percentage is also the population mean of the numbers on all the tickets in the box, ave(box). The sample percentage $\hat{p}$ of a simple random sample (random sample without replacement) of size n from the population of N tickets is

    $$\hat{p} = 100\% \times (\#\text{tickets in the sample labeled "1"})/n.$$

    The sample percentage is the sample mean of the labels on the tickets in the sample. The expected value of the sample percentage is the population percentage p, and the SE of the sample percentage is

    $$\mathrm{SE}(\hat{p}) = f \times \sqrt{p(1-p)}/\sqrt{n} \le f \times 50\%/\sqrt{n},$$

    where f is the finite population correction

    $$f = \sqrt{(N-n)/(N-1)}.$$

    The inequality holds because $p(1-p)$ is at most 1/4, which it attains when p = 50%. Thus $f \times 50\%/\sqrt{n}$ is an upper bound on the SE of the sample percentage.
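    The bound on the SE can be computed directly. The following is a minimal Python sketch; the function names `finite_population_correction` and `se_bound` are illustrative choices, not part of SticiGui, and percentages are written as fractions (50% is 0.5). The population and sample sizes in the last line are arbitrary example values.

```python
import math

def finite_population_correction(N, n):
    """f = sqrt((N - n) / (N - 1)) for a simple random sample of n from N tickets."""
    return math.sqrt((N - n) / (N - 1))

def se_bound(N, n):
    """Upper bound f * 50% / sqrt(n) on the SE of the sample percentage."""
    return finite_population_correction(N, n) * 0.5 / math.sqrt(n)

# Example values (assumed for illustration): 600 tickets, sample of 40.
print(round(se_bound(600, 40), 4))
```

    The bound does not depend on the contents of the box, only on N and n, which is what makes the resulting intervals usable before p is known.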

    Figure 26-1 shows what happens if we center an interval at the sample percentage, and extend the interval down and up from the sample percentage by twice the upper bound on the SE of the sample percentage. When the interval includes the population percentage, we say the interval covers the truth. The interval is random, because it is centered at the sample percentage, which is random. The chance that the random interval will contain the true population percentage is called the coverage probability of the interval. Take a few samples by clicking Take Sample to get the feel of the tool, then increase Samples to Take to 1000 and click Take Sample again. The actual percentage of intervals that cover will vary, but almost always it will be larger than 75%, sometimes nearly 100%. The empirical percentage of intervals that cover is an estimate of the coverage probability of the procedure. Vary the sample size and put a few different lists of zeros and ones into the Population box at the right of the figure, and try a few different sample sizes for each population. You should find that the fraction of intervals that cover the true population percentage stays above 75% (almost without fail), no matter what the population of zeros and ones is.

    Figure 26-1: Conservative Confidence Interval for the Population Percentage

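  • Why do these random intervals cover the true population percentage so often? We can show that they should using Chebychev's inequality. Because the SE of the sample percentage $\hat{p}$ satisfies

    $$\mathrm{SE}(\hat{p}) \le f \times 50\%/\sqrt{n},$$

    the event

    $$|\hat{p} - p| \le k \times \mathrm{SE}(\hat{p})$$

    is a subset of the event

    $$|\hat{p} - p| \le k f \times 50\%/\sqrt{n}.$$

    It follows that

    $$P(|\hat{p} - p| \le k \times \mathrm{SE}(\hat{p})) \le P(|\hat{p} - p| \le k f \times 50\%/\sqrt{n}).$$

    Chebychev's inequality guarantees that the chance the sample percentage differs from its expected value p by more than k times its standard error is at most $1/k^2$, so

    $$1 - 1/k^2 \le P(|\hat{p} - p| \le k \times \mathrm{SE}(\hat{p})) \le P(|\hat{p} - p| \le k f \times 50\%/\sqrt{n}).$$

    That is,

    $$P(|\hat{p} - p| \le k f \times 50\%/\sqrt{n}) \ge 1 - 1/k^2.$$

    Therefore, in the long run in repeated sampling, the fraction of trials in which the sample percentage is within $2f \times 50\%/\sqrt{n}$ of the population percentage p converges to a number that is 75% or larger. Whenever $\hat{p}$ is within $2f \times 50\%/\sqrt{n}$ of the population percentage p, an interval centered at $\hat{p}$ extending down and up by $2f \times 50\%/\sqrt{n}$ will contain p. That is, the interval

    $$\hat{p} \pm 2f \times 50\%/\sqrt{n},$$

    which is shorthand for

    $$[\hat{p} - 2f \times 50\%/\sqrt{n},\ \hat{p} + 2f \times 50\%/\sqrt{n}],$$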

    contains p at least 75% of the time, in the long run. Similarly, the fraction of trials in which the sample percentage $\hat{p}$ is within $3f \times 50\%/\sqrt{n}$ of p converges to a number that is 88.89% or larger, so the long-run fraction of intervals $\hat{p} \pm 3f \times 50\%/\sqrt{n}$ that contain p will be 88.89% or larger.

    [Figure 26-1 controls: Sample from Box without replacement; initial box 1, 1, 0, 0, 0, 1; SD(Box): 0.5; Ave(Box): 0.5; Sample Size: 3; Samples to take: 1; Intervals: ± 2 × Bound on SE (0-1 box only)]

  • The fraction of trials in which $\hat{p}$ is within $4f \times 50\%/\sqrt{n}$ of p converges to a number that is 93.75% or larger, so the long-run fraction of intervals $\hat{p} \pm 4f \times 50\%/\sqrt{n}$ that contain p will be 93.75% or larger, etc.

    In general, if we go down and up from the sample percentage by $k f \times 50\%/\sqrt{n}$, then in the long run in repeated trials, the resulting intervals will include the true population percentage at least $1 - 1/k^2$ of the time.

    Change the Intervals: value in Figure 26-1 to 3 and to 4 to confirm empirically that this is true.
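    The same empirical check can be run outside the interactive figure. The sketch below is an assumption-laden stand-in for the tool, not its actual implementation: the function names are made up for illustration, sampling is done with Python's `random.sample` (without replacement), and the box is the six-ticket list that the figure shows initially.

```python
import math
import random

def conservative_interval(sample, N, k=2.0):
    """Interval p_hat +/- k * f * 0.5 / sqrt(n) for a simple random sample."""
    n = len(sample)
    p_hat = sum(sample) / n
    f = math.sqrt((N - n) / (N - 1))      # finite population correction
    half = k * f * 0.5 / math.sqrt(n)
    return p_hat - half, p_hat + half

def coverage_fraction(box, n, k=2.0, trials=2000, seed=0):
    """Fraction of simulated intervals that contain the population percentage."""
    rng = random.Random(seed)             # fixed seed so the run is repeatable
    p = sum(box) / len(box)
    covered = 0
    for _ in range(trials):
        sample = rng.sample(box, n)       # draws without replacement
        lo, hi = conservative_interval(sample, len(box), k)
        if lo <= p <= hi:
            covered += 1
    return covered / trials

box = [1, 1, 0, 0, 0, 1]
for k in (2, 3, 4):
    print(k, coverage_fraction(box, n=3, k=k), ">=", 1 - 1 / k**2)
```

    For any 0-1 box, the printed fractions should sit at or above the Chebychev bound in the last column, usually well above it.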

    The interval $\hat{p} \pm k f (50\%/\sqrt{n})$ is random: Its center depends on the sample percentage $\hat{p}$, which in turn depends on which units (here, tickets) happen to be in the random sample. The probability is in the random sampling procedure, not in the parameter. The parameter is the same, no matter what sample we happen to get: the parameter is a property of the population, not the sample. It is the interval that varies with the random sample. Before the data are collected, the coverage probability is the chance that sampling will result in an interval that contains the parameter.

    Taking the sample determines the interval, leaving nothing to chance: The interval the procedure produced either does or does not contain the population percentage. (One could say that after collecting the data, the chance that the interval covers the parameter is either 0 or 100%.) Typically, we never learn whether the interval covers the parameter, but our ignorance is not a probability (at least, not according to the frequency theory of probability used in this book).

    The interval the procedure gives for any particular set of data is called a confidence interval. The confidence level of a confidence interval is equal to the coverage probability of the procedure before the data are collected.

    Confidence is a word statisticians reserve for this idea. If, before collecting the data, the procedure we are using has a P% chance of producing an interval that covers the true population percentage, then, after collecting the data, the interval the procedure produced is called a P% confidence interval.

    Coverage Probability and Confidence Level

    Consider a population parameter, and a procedure that produces random intervals. Suppose that the probability that the procedure produces an interval that contains the parameter is P%.

    1. The procedure is said to have coverage probability P%.
    2. The interval the procedure produces for any particular sample is called a P% confidence interval for the parameter, or a confidence interval for the parameter with confidence level P%.

    In repeated sampling, about P% of confidence intervals with confidence level P% will contain (cover) the parameter. About (100 − P)% of the intervals will not cover the parameter. For any particular sample, unless the population parameter is known, we will not know whether the confidence interval covers the parameter.

    Chapter 25, Estimating Parameters from Simple Random Samples, summarized the uncertainty of an estimate of a parameter by the mean squared error or root mean squared error of the estimator, which are measures of the average error of the estimator in repeated sampling. A confidence interval is a different way of expressing the uncertainty in an estimate: a range of values that contains the parameter with specified confidence level.

    The interpretation of confidence level for a particular interval is analogous to the interpretation of RMSE for a particular value of the estimate: The RMSE is the square root of the long-run average squared error of the estimator in repeated sampling, but for any particular sample, the error could be larger or smaller than the RMSE, and we will not know which unless we know the true value of the parameter. The confidence level measures the long-run fraction of intervals that contain the parameter in repeated sampling, but for any particular sample, the confidence interval either will or will not contain the parameter, and we will not know which unless we know the true value of the parameter.

  • We can use the approach developed in this section to construct confidence intervals for the population percentage p with other nominal confidence levels, by extending the interval up and down from the sample percentage $\hat{p}$ by larger or smaller amounts. The longer the intervals, the larger the nominal confidence level: the larger the chance that an interval will contain p. The shorter the intervals, the smaller the chance that an interval will contain p. In particular, if we choose k so that

    $$1 - 1/k^2 = P\%,$$

    then the interval

    $$[\hat{p} - k f \times 50\%/\sqrt{n},\ \hat{p} + k f \times 50\%/\sqrt{n}]$$

    is a (nominal) P% confidence interval for the population percentage p.

    Conversely, to get a nominal P% conservative confidence interval for the population percentage using a simple random sample, we should take an interval that extends down and up from the sample percentage by $k f \times 50\%/\sqrt{n}$, with

    $$k = (1 - P/100)^{-1/2}.$$
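    Solving $1 - 1/k^2 = P/100$ for k is a one-line computation. The helper below is a small illustrative sketch (the name `chebychev_k` is hypothetical); P is given in percent, as in the text.

```python
import math

def chebychev_k(P):
    """Multiplier k with 1 - 1/k^2 = P/100, for a nominal level of P percent."""
    if not 0 <= P < 100:
        raise ValueError("P must be at least 0 and less than 100")
    return 1 / math.sqrt(1 - P / 100)

for P in (75, 90, 95, 99):
    print(P, round(chebychev_k(P), 3))
```

    Note how quickly k grows as P approaches 100: the price of a guarantee that holds for every 0-1 box is a long interval at high nominal levels.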

    The actual coverage probability of the interval

    $$[\hat{p} - k f \times 50\%/\sqrt{n},\ \hat{p} + k f \times 50\%/\sqrt{n}],$$

    where $\hat{p}$ is the sample percentage, is greater than $(1 - 1/k^2)$, for two reasons. First, the standard error of the sample percentage is less than $f(50\%/\sqrt{n})$ unless the population percentage p is 50%. Second, the distribution of the sample percentage is that of a hypergeometric random variable divided by the sample size, n, and such a distribution cannot attain the bound in Chebychev's inequality: Even for the true SE of the sample percentage,

    $$\mathrm{SE}(\hat{p}) = f \times \sqrt{p(1-p)}/\sqrt{n},$$

    the chance that the sample percentage is within $k \times \mathrm{SE}(\hat{p})$ of the population percentage p is greater than $1 - 1/k^2$:

    $$P(|\hat{p} - p| \le k \times \mathrm{SE}(\hat{p})) > 1 - 1/k^2.$$

    As a result, confidence intervals for the population percentage based on Chebychev's inequality and the upper bound of 50% for the SD of a list of zeros and ones are conservative: the actual confidence level is greater than the nominal confidence level, $(1 - 1/k^2)$. The next section develops a procedure that is not conservative, but that is approximate: The confidence level could be larger or smaller than the nominal level. (The nominal confidence level is close to the actual confidence level when the sample size n is large.)

    A population percentage cannot be less than 0%. If the lower endpoint of a confidence interval for a population percentage is negative, it is completely legitimate to replace the lower endpoint by zero: It does not decrease the confidence level. Similarly, a population percentage cannot be greater than 100%. If the upper endpoint of a confidence interval for a population percentage is greater than 100%, it is legitimate to replace the upper endpoint by 100%. The confidence level remains the same. Similarly, if we are constructing a confidence interval for a quantity that cannot be negative (height, weight, or age, for instance), removing negative values from a confidence interval cannot reduce the coverage probability or confidence level.

    Confidence Intervals for Restricted Parameters

    If some values of a parameter are known to be impossible, excluding those values from a confidence interval does not reduce the confidence level of the confidence interval.

    Conversely, including impossible values of a parameter in a confidence interval does not increase the confidence level.

  • For example, if a confidence interval for a parameter that must be positive has a lower endpoint that is negative, the lower endpoint can be replaced with zero. The confidence level remains the same.

    In particular, if the lower endpoint of a confidence interval for a population percentage is negative, the lower endpoint can be replaced with zero. If the upper endpoint of a confidence interval for a population percentage is greater than 100%, the upper endpoint can be replaced with 100%.

    Whenever you use a confidence interval, it is crucial to report the confidence level. Otherwise, it is impossible to interpret the result. The choice of the confidence level is essentially arbitrary, but the choice should be made before collecting the data. Common values of the confidence level are 68%, 90%, 95%, and 99%. There is a tradeoff between precision (the length of the confidence interval) and confidence level: Ceteris paribus, higher confidence levels require longer confidence intervals.

    The following exercise checks your ability to compute a conservative confidence interval for the population percentage.

    Exercise 26-1. The entering class at North Southcentral University contains 600 students. The dean's office seeks to determine the percentage of entering students who have credit cards. The dean's office will take a simple random sample of 40 entering students, interview them, and compute the sample percentage. The office would like to construct a conservative 75% confidence interval for the percentage of entering students who have credit cards. The center of the interval will be the sample percentage.

    The interval should extend up and down from the sample percentage by ____.

    The sample is taken, and the sample percentage is observed to be 86%.

    The lower endpoint of the confidence interval should be ____ and the upper endpoint should be ____.

    The probability that this interval contains the percentage of students in the entering class who have credit cards ____.

    The confidence level of this interval ____.
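    One way to check the arithmetic for the numbers that appear in this version of Exercise 26-1 (the exercise is dynamic, so other versions will differ) is the short sketch below. It assumes k = 2, since $1 - 1/2^2 = 75\%$, writes percentages as fractions, and applies the endpoint truncation described above; the variable names are illustrative.

```python
import math

N, n = 600, 40     # class size and sample size quoted in the exercise
k = 2.0            # 1 - 1/k^2 = 75%
p_hat = 0.86       # observed sample percentage, as a fraction

f = math.sqrt((N - n) / (N - 1))           # finite population correction
half_width = k * f * 0.5 / math.sqrt(n)    # k times the bound on the SE

# A population percentage lies in [0%, 100%], so truncating is legitimate.
lower = max(0.0, p_hat - half_width)
upper = min(1.0, p_hat + half_width)

print(f"half-width: {100 * half_width:.2f}%")
print(f"interval: [{100 * lower:.2f}%, {100 * upper:.2f}%]")
```

    Here the untruncated upper endpoint exceeds 100%, so the truncation matters; it changes the reported interval but not the confidence level.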

    Conservative confidence intervals for population means of bounded boxes

    Recall that percentages are just means of special lists of numbers, lists that contain only zeros and ones. We can find confidence intervals for the means of more general lists of numbers, too.

    In the previous section we exploited the fact that the SD of a 0-1 box is at most 1/2 to construct conservative confidence intervals for the population mean of a 0-1 box, that is, the population percentage. The approach can be used not only for 0-1 boxes, but whenever we can find a bound on the SD of the box, so that we can apply Chebychev's inequality. For any box of numbered tickets whatsoever, the sample mean of a simple random sample or random sample with replacement is an unbiased estimator of the population mean of the numbers on the tickets, and the SE of the sample mean is proportional to the SD of the box.

    For instance, suppose we know that the numbers on the tickets in the box are all between a and b, with $a \le b$. Then SD(box) is at most $(b-a)/2$. In the special case that a = 0 and b = 1, this implies that the SD of a 0-1 box is at most 50%, as we have seen already.

  • That in turn implies that if all the numbers in a box are between a and b, the SE of the sample mean of a simple random sample of n draws from the box is at most $f(b-a)/(2\sqrt{n})$, where f is the finite population correction. And the SE of the sample mean of n draws with replacement from the box is at most $(b-a)/(2\sqrt{n})$.

    Sampling from a Bounded Box

    Suppose all the numbers in a box are between a and b, with $a \le b$. Then:

    SD(box) is at most $(b-a)/2$.
    The SE of the sample mean of n draws with replacement from the box is at most $(b-a)/(2\sqrt{n})$.
    The SE of the sample mean of a simple random sample of size n from the box is at most $f(b-a)/(2\sqrt{n})$, where f is the finite population correction.

    With a bound on the SE, we can use Chebychev's inequality the same way we did for the population percentage to get a confidence interval for the population mean of the numbers on the tickets in a box:

    Conservative Confidence Intervals for the Population Mean of a Bounded List

    Suppose all the numbers in a box are between a and b, where $a \le b$.

    For a simple random sample of size n, the chance that the random interval

    $$[(\text{sample mean}) - k f (b-a)/(2\sqrt{n}),\ (\text{sample mean}) + k f (b-a)/(2\sqrt{n})]$$

    includes the mean of the numbers in the box is at least $1 - 1/k^2$, where f is the finite population correction $\sqrt{(N-n)/(N-1)}$, N is the population size, and n is the sample size.

    For random sampling with replacement, the chance that the random interval

    $$[(\text{sample mean}) - k (b-a)/(2\sqrt{n}),\ (\text{sample mean}) + k (b-a)/(2\sqrt{n})]$$

    includes the mean of the numbers in the box is at least $1 - 1/k^2$.

    In both cases, if the lower endpoint of the interval is less than a, it can be replaced by a, and if the upper endpoint of the interval is greater than b, it can be replaced by b.

    These are conservative procedures for constructing confidence intervals: the probability that the intervals they produce cover the true population mean is greater than the probability they claim, $1 - 1/k^2$ (the nominal coverage probability).
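    The two bounded-box intervals differ only in the finite population correction, so one function can cover both cases. This is a minimal sketch under that reading of the box above; the function name, the `N=None` convention for sampling with replacement, and the sample values are all assumptions made for illustration.

```python
import math

def conservative_mean_interval(sample, a, b, k, N=None):
    """Chebychev-based interval for the mean of a box whose numbers lie in [a, b].

    N=None means the draws were made with replacement; otherwise the sample
    is treated as a simple random sample from a box of N tickets.
    """
    n = len(sample)
    mean = sum(sample) / n
    f = 1.0 if N is None else math.sqrt((N - n) / (N - 1))
    half = k * f * (b - a) / (2 * math.sqrt(n))
    # Endpoints outside [a, b] can be replaced by a or b without loss.
    return max(a, mean - half), min(b, mean + half)

scores = [2, 4, 6, 8]   # made-up data known to lie between 0 and 10
print(conservative_mean_interval(scores, 0, 10, k=2))         # with replacement
print(conservative_mean_interval(scores, 0, 10, k=2, N=10))   # simple random sample
```

    With only four observations the with-replacement interval spans the whole range [a, b], which illustrates how uninformative the conservative bound can be for small samples.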

    Approximate confidence intervals for percentages

  • Confidence intervals for the population percentage based on Chebychev's inequality and the upper bound of 50% for the SD of lists of zeros and ones are conservative: Their true confidence level is greater than their nominal confidence level, $(1 - 1/k^2)$. We could use shorter intervals and still have confidence level $(1 - 1/k^2)$, or we could claim a confidence level higher than $(1 - 1/k^2)$.

    How much shorter could the interval be, or how large a confidence level could we claim? It is possible to figure these things out precisely, but we shall follow a standard approximate approach instead, one that we can extend to other situations. We shall use the central limit theorem to develop a procedure that produces shorter confidence intervals for a given nominal confidence level. The new procedure will be approximate instead of conservative: the coverage probability will be close to the nominal coverage probability when the sample size is large, but could be smaller or larger depending on the population percentage, and could be quite different from the nominal coverage probability for small samples from pathological populations.

    We shall assume throughout the rest of this chapter that either

    the sample is drawn with replacement, or
    the sample size n is much, much smaller than the population size N.

    With this assumption, we can neglect the finite population correction and act as if the tickets in the sample were drawn independently. (See Chapter 22, Standard Error.) When the tickets are drawn independently, the central limit theorem tells us that as the sample size grows, the normal curve is a better and better approximation to the probability histogram of the sample percentage (and to the probability histogram of the sample mean). The normal approximation to the probability that the sample percentage is in the interval

    $$[p - 1.15\sqrt{p(1-p)}/\sqrt{n},\ p + 1.15\sqrt{p(1-p)}/\sqrt{n}]$$

    is equal to the area under the normal curve for the corresponding range of values in standard units, [−1.15, 1.15]. The area under the normal curve between −1.15 and 1.15 is about 75%:

    [Normal curve tool: selected area 74.99%; lower endpoint −1.15; upper endpoint 1.15]

  • This is much larger than the bound of $(1 - 1/(1.15)^2) = 24.4\%$ that Chebychev's inequality gives. When the sample percentage $\hat{p}$ is within

    $$1.15\sqrt{p(1-p)}/\sqrt{n}$$

    of p, p is within

    $$1.15\sqrt{p(1-p)}/\sqrt{n}$$

    of the sample percentage, so the probability that the interval

    $$I = [\hat{p} - 1.15\sqrt{p(1-p)}/\sqrt{n},\ \hat{p} + 1.15\sqrt{p(1-p)}/\sqrt{n}]$$

    contains the population percentage p is about 75%: The coverage probability of I is approximately 75%.
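    The two numbers being compared here, the normal-curve area between −1.15 and 1.15 and the Chebychev bound for k = 1.15, can be reproduced with Python's standard library. This is only an illustrative check using `statistics.NormalDist`; it is not how the page's normal curve tool is implemented.

```python
from statistics import NormalDist

z = 1.15
standard_normal = NormalDist()   # mean 0, SD 1

# Area under the normal curve between -z and z.
normal_area = standard_normal.cdf(z) - standard_normal.cdf(-z)

# Chebychev's guarantee for the same multiple of the SE.
chebychev_bound = 1 - 1 / z**2

print(f"normal area: {normal_area:.4f}")
print(f"Chebychev bound: {chebychev_bound:.4f}")
```

    The gap between the two values is the room the approximate method exploits: for the same interval length it can claim a much higher nominal level, at the cost of relying on the normal approximation.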

    Unfortunately, we cannot construct I from the sample alone: the sample determines the center of I, but to find the length of I we need to know $p(1-p)$, which is tantamount to knowing p. If we knew p, we would not be estimating it.

    If the sample size n is large, the sample standard deviation s,

    $$s = \sqrt{(n/(n-1)) \times \hat{p}(1-\hat{p})},$$

    where $\hat{p}$ is the sample percentage, is likely to be close to the SD of the population; when that happens,

    $$s/\sqrt{n}$$

    is close to $\mathrm{SE}(\hat{p})$, the standard error of the sample percentage. Therefore, if the sample size is large, but either the sample is small compared to the population or the sample is taken with replacement, the probability that the random interval

    $$[\hat{p} - 1.15 s/\sqrt{n},\ \hat{p} + 1.15 s/\sqrt{n}]$$

    contains the population percentage p is about 75%. This interval has not only a random center (the sample percentage), but also a random length (the length depends on the observed value of s, and s is random, because it depends on the random sample).
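    A sketch of this approximate procedure in Python follows. It generalizes the multiplier 1.15 to an arbitrary nominal level by looking up the normal quantile with `statistics.NormalDist().inv_cdf`; that generalization, the function name, and the made-up sample are assumptions for illustration. The sample is treated as independent draws from a 0-1 box.

```python
import math
from statistics import NormalDist

def approximate_interval(sample, level=0.75):
    """Normal-approximation interval for the population percentage (as a fraction)."""
    n = len(sample)
    p_hat = sum(sample) / n
    s = math.sqrt(n / (n - 1) * p_hat * (1 - p_hat))   # sample SD of a 0-1 list
    z = NormalDist().inv_cdf(0.5 + level / 2)          # about 1.15 when level = 0.75
    half = z * s / math.sqrt(n)
    # A percentage cannot fall outside [0, 1], so truncate the endpoints.
    return max(0.0, p_hat - half), min(1.0, p_hat + half)

sample = [1] * 50 + [0] * 450    # hypothetical sample: 50 ones in 500 draws
print(approximate_interval(sample, level=0.96))
```

    Unlike the conservative interval, both the center and the length here are computed from the sample, so the nominal level is only as good as the normal approximation and the estimate s.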

    Figure 26-2 lets you try the procedure yourself. Each time you click the Take Sample button, a sample is drawn with replacement from the numbers in the box on the right (initially set to a random list of zeros and ones). The sample size initially is set to 30. The controls at the bottom of the figure allow you to change the size of each sample, the number of samples that are taken each time you click the button, and the width of the interval, as a multiple of the estimated SE or the conservative bound on the SE. (The estimated SE is $s/\sqrt{n}$ because we are sampling with replacement; the bound is $0.5/\sqrt{n}$.) A label in the bottom right corner reports the fraction of intervals that cover the population percentage. Intervals that cover are green; those that do not cover are red. A small black dot marks the middle of each interval (the sample percentage). A blue vertical line marks the true population percentage p.

    Figure 26-2: Approximate confidence intervals for the population mean and percentage

    [Figure 26-2 controls: Sample from Box with replacement; initial box 0, 0, 1, 0, 1; SD(Box): 0.49; Ave(Box): 0.4]

  • Take a few samples to get the feel of the tool, then increase the Samples to take to 1000, and click the Take Sample button again. The actual percentage of intervals that cover will vary, but should be reasonably close to 75%. Increase Sample size to 200 and try again: the percentage of intervals that cover should be closer to 75%. Try putting a few different lists of zeros and ones into the Population box at the right of the figure, and try a few different sample sizes for each population. When the sample size is large, the fraction of intervals that cover the true population percentage will be very close to 75%.

    The following exercises check your ability to compute conservative and approximate confidence intervals for the population percentage, and your ability to determine which method is more appropriate.


  • Exercise 26-2. I would like to know the fraction of UC Berkeley undergraduates who commute to school from their parents' homes. I send email to students with campus computer accounts until 100 have responded; 3 of the responders were commuters.

    An approximate 95% confidence interval for the fraction of UC Berkeley undergraduates who commute to school is from ____ to ____.

    Exercise 26-3. I would like to know the fraction of homes in Alameda County, California, that have assessed values of $700,000 or more. I take a simple random sample of size 500 from the Alameda County property tax records (somehow). The sample percentage of homes assessed at $700,000 or more is 10%.

    An approximate 96% confidence interval for the percentage of homes assessed at $700,000 or more is from ____ to ____.

    Exercise 26-4. A random sample with replacement of size 20 was taken from a box of tickets. Each ticket in the box is numbered either zero or one. Six of the tickets in the sample are labeled "1"; the rest are labeled "0."

    The sample percentage is ____.

    The sample size ____ large enough to justify assuming that s is close to SD(box) and using a confidence interval based on the normal distribution.

    The SE of the sample percentage is ____.

    A conservative 70% confidence interval for the population percentage is from ____ to ____.

  • Exercise 26-5. A restaurateur plans to change the menu in her restaurant, which specializes in game meats. She is trying to decide whether or not to offer venison goulash on the new menu. Each day for a month, she picks people at random as they come into the restaurant, and asks them whether they would order venison goulash if it were offered. On busy days, she picks more people; on quiet days, she picks fewer people. Suppose that in effect she has a simple random sample of 160 people who eat at her restaurant. Suppose further that the number of diners is much, much larger than the sample. In the sample, 118 say they would order venison goulash if it were offered.

    The sample percentage of diners who say they would order venison goulash is ____.

    The bootstrap estimate of the population standard deviation is ____.

    A 98% confidence interval for the percentage of diners who would say they would order venison goulash among the population of people who eat at that restaurant would go from ____ to ____.

    Approximate Confidence Intervals for the Population Mean

    Suppose that we seek a confidence interval for the mean of a population (box) of numbers, based on a random sample from the population. The sample mean is an unbiased estimator of the population mean (E(sample mean) = ave(box)), so it is reasonable to center a confidence interval at the sample mean. How wide should we make an interval centered at the sample mean, for the interval to have a specified probability of covering the population mean?

    If we knew the SD of the population or had an upper bound on the SD of the population, we could use Chebychev's inequality to construct a conservative confidence interval for the population mean, as we did earlier in the chapter: the standard error of the sample mean is

    $$\mathrm{SE}(\text{sample mean}) = \mathrm{SD}(\text{box})/\sqrt{n},$$

    where n is the sample size. So, for example, the coverage probability of the random interval

    $$[(\text{sample mean}) - 2\,\mathrm{SD}(\text{box})/\sqrt{n},\ (\text{sample mean}) + 2\,\mathrm{SD}(\text{box})/\sqrt{n}]$$

    is at least 75%.

    Typically, however, the SD of the population is not known, so we cannot construct this interval.

  • Moreover, typically we cannot use the conservative approach based on Chebychev's inequality, because there is no upper bound on the SD of a general list of numbers analogous to the upper bound of 50% for the SD of lists that contain only zeros and ones. (As we have seen, if all the numbers are bounded between a and b, with $a \le b$, then $\mathrm{SD}(\text{box}) \le (b-a)/2$; but typically we do not know such lower and upper bounds a and b.)

    However, the approximate approach to constructing confidence intervals, based on the normal curve, works if the sample size is sufficiently large. The central limit theorem tells us that the probability histogram of the average of n draws with replacement from a box follows the normal curve increasingly well as the number of draws n increases. We also know that the sample standard deviation s is increasingly likely to be an accurate estimate of the SD of the population as n increases. As a result, the probability that the sample mean is within $z \times s/\sqrt{n}$ of the population mean is approximately the same as the area under the normal curve between −z and z. For any fixed population (box), the approximation improves as the sample size n increases, for random sampling with replacement. Example 26-1 illustrates calculating an approximate confidence interval for the population mean. The example is dynamic: It will tend to change when you reload the page.

    Example 26-1: Approximate Confidence Interval for the Population Mean

    To assess scholastic performance, a state administers an achievement test to a simple random sample of 160 high school seniors. There are 40,000 high school seniors in the state. The mean score of the students who took the exam is 104.34 points, and the sample standard deviation of their scores is 13.9 points. Find an approximate 98% confidence interval for the average of the population scores that would have been obtained had every high school senior in the state been administered the achievement test.

    Solution. The sample size (160) is a sufficiently small fraction of the population size (40,000) that treating the sample as if it were drawn with replacement is reasonable. The sample size is sufficiently large that the normal approximation to the distribution of the sample mean should be reasonably accurate, and that the sample standard deviation should be close to the standard deviation of the population. The area under the normal curve between −2.326 and 2.326 is 98%.

    Thus, an approximate 98% confidence interval would be centered at the sample mean, and extend down and up from the sample mean by 2.326 times the estimated standard error of the sample mean. The estimated standard error of the sample mean is

    $$13.9/\sqrt{160} = 1.099 \text{ points}.$$

    The confidence interval thus should extend down and up from the sample mean by

    $$2.326 \times 1.099 \text{ points} \approx 2.556 \text{ points},$$

    so the confidence interval is

    [101.784 points, 106.896 points]
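    The arithmetic in this example can be checked with a few lines of Python. This is a sketch using the numbers quoted in this version of the example (it is dynamic on the original page); the normal quantile comes from `statistics.NormalDist`, and the variable names are illustrative.

```python
import math
from statistics import NormalDist

n = 160                # sample size
sample_mean = 104.34   # points
sample_sd = 13.9       # points
level = 0.98           # nominal confidence level

# Multiplier z with area `level` under the normal curve between -z and z.
z = NormalDist().inv_cdf(0.5 + level / 2)

se = sample_sd / math.sqrt(n)   # estimated SE of the sample mean
low, high = sample_mean - z * se, sample_mean + z * se

print(f"z = {z:.3f}, estimated SE = {se:.3f}")
print(f"[{low:.3f}, {high:.3f}]")
```

    The finite population correction is ignored here, in line with the solution's remark that 160 is a small fraction of 40,000.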

  • The following exercise checks your ability to calculate approximate confidence intervals for the population mean. The exercise is dynamic: The question will tend to change when you reload the page.

    Exercise 26-6. To determine the average lifetime of their light-emitting diode (LED) light bulbs, a manufacturer takes a simple random sample of 110 bulbs from a manufacturing lot of 34,000 bulbs. The mean lifetime of the bulbs in the sample is 93.49 thousand hours, and the sample standard deviation of their lifetimes is 8.56 thousand hours.

    An approximate 98% confidence interval for the average lifetime of the bulbs in the manufacturing lot would extend from ____ thousand hours (low) to ____ thousand hours (high).

    Exact Confidence Intervals for Percentages

    We have seen two methods for constructing confidence intervals for a population percentage: a conservative method based on Chebychev's inequality and a bound on SD(box), and an approximate method based on the normal approximation. Conservative means that the coverage probability is at least as high as claimed but could be substantially higher for some populations. Approximate means that the coverage probability is roughly as high as claimed but could be substantially lower (or substantially higher) for some populations. This section develops a third method, which is exact. Exact means that the probability that the random interval covers the true population percentage is just what it is claimed to be (depending on the value of the population percentage it can be a bit higher, simply because the binomial distribution is a discrete distribution).

    These intervals are rather different from the confidence intervals presented earlier in this chapter, which were of the form (estimate ± uncertainty). Instead, each of the endpoints is computed from the data, separately. The resulting interval usually is not symmetric around the sample percentage.

    We assume that a sample of size n is drawn at random with replacement from a 0-1 box. We want to find a confidence interval for p, the percentage of tickets labeled "1" in the box. Let X be the number of tickets in the sample that are labeled "1." If the true percentage of tickets labeled "1" in the 0-1 box is p, then X has a binomial probability distribution with parameters n and p. We will construct a confidence interval for p by looking at the values of p that are plausible, given the observed value of X. The approach is similar to the approach we took in Chapter 19, Probability Meets Data, and very closely related to hypothesis testing, discussed in Chapter 27, Hypothesis Testing: Does Chance Explain the Results?.

    Suppose the observed value of X is x. If p were very, very small (close to zero), it would be unlikely to see x or more ones in the sample unless x = 0. So seeing x ones in the sample is evidence that p is not too small. Conversely, if p were very, very large (close to one), it would be unlikely to see x or fewer ones in the sample unless x = n. So observing that X = x limits the plausible range of values of p.

    Suppose we want a confidence interval for p with confidence level $1 - \alpha$. Let $p^-$ be the smallest value of q for which

    $$\alpha/2 \le P(X \ge x \text{ if } p = q) = {}_nC_x\, q^x (1-q)^{n-x} + {}_nC_{x+1}\, q^{x+1} (1-q)^{n-x-1} + \cdots + {}_nC_n\, q^n (1-q)^0.$$

    Similarly, let $p^+$ be the largest value of q for which

    $$\alpha/2 \le P(X \le x \text{ if } p = q) = {}_nC_x\, q^x (1-q)^{n-x} + {}_nC_{x-1}\, q^{x-1} (1-q)^{n-x+1} + \cdots + {}_nC_0\, q^0 (1-q)^n.$$

    Then the interval $[p^-, p^+]$ is a $1 - \alpha$ confidence interval for p. Intervals constructed this way can be much shorter than the conservative intervals based on Chebychev's inequality and the upper bound on SD(box), but they are still guaranteed to attain at least their nominal confidence level. Confidence intervals based on the normal approximation are generally not much shorter, but their actual confidence level can be substantially lower than their nominal confidence level.
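    The endpoints $p^-$ and $p^+$ have no closed form, but because the two binomial tail probabilities are monotone in q they can be found by bisection. The following is one possible sketch, not the method used by SticiGui; the function names, the tolerance, and the example values n = 20, x = 6 are assumptions for illustration, and p is treated as a fraction between 0 and 1.

```python
from math import comb

def upper_tail(n, x, q):
    """P(X >= x) when X is Binomial(n, q)."""
    return sum(comb(n, i) * q**i * (1 - q)**(n - i) for i in range(x, n + 1))

def lower_tail(n, x, q):
    """P(X <= x) when X is Binomial(n, q)."""
    return sum(comb(n, i) * q**i * (1 - q)**(n - i) for i in range(0, x + 1))

def exact_interval(n, x, alpha, tol=1e-10):
    """Endpoints of the exact 1 - alpha interval, found by bisection."""
    # Lower endpoint: smallest q with P(X >= x | q) >= alpha/2.
    # The upper tail increases with q; for x = 0 it is 1 for every q.
    if x == 0:
        p_minus = 0.0
    else:
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if upper_tail(n, x, mid) >= alpha / 2:
                hi = mid
            else:
                lo = mid
        p_minus = hi
    # Upper endpoint: largest q with P(X <= x | q) >= alpha/2.
    # The lower tail decreases with q; for x = n it is 1 for every q.
    if x == n:
        p_plus = 1.0
    else:
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if lower_tail(n, x, mid) >= alpha / 2:
                lo = mid
            else:
                hi = mid
        p_plus = lo
    return p_minus, p_plus

print(exact_interval(20, 6, alpha=0.05))
```

    The resulting interval is generally not symmetric about x/n, in keeping with the remark that each endpoint is computed from the data separately.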

  • ConfidenceIntervalsforPopulationPercentiles

    WecanalsousearandomsamplewithreplacementtofindaconfidenceintervalforaPERCENTILEofapopulation.WeshallworkoutthedetailsfortheMEDIANotherpercentilescanbetreatedsimilarly.Unliketheconservativeandapproximateconfidenceintervalsandlikeexactconfidenceintervalsforthepopulationpercentagewejustsawandtheseintervalsarenotoftheform(estimateuncertainty).Instead,theendpointsoftheintervalsaretwoofthedata.Andthisapproachalsoleadstoexactconfidenceintervals:Thenominalcoverageprobabilityisequal[+]totheactualcoverageprobability.

    Tobegin,supposewehavearandomsampleofsize10

    {X1,X2,,X10}

    takenwithreplacementfromapopulationwithmedianm.Sortthedataintoincreasingorder:letX(1)bethesmallestdatum,X(2)bethesecondsmallest,etc.,andletX(10)bethelargestdatum.(Thesorteddataarecalledtheorderstatistics.)LetA1betheeventthatthefourthsmallestdatum,X(4),islessthanorequaltothemedian,andletA2betheeventthattheseventhsmallestdatum,X(7),isgreaterthanorequaltothemedian.TheeventA1occursunless7ormoredataaregreaterthanthepopulationmedian,soA1cistheeventthat7ormoredataaregreaterthanthepopulationmedian.Similarly,theeventA2occursunless7ormoredataarelessthanthepopulationmedian,soA2cistheeventthat7ormoredataarelessthanthepopulationmedian.LetA=A1A2betheeventthatthefourthandseventhorderstatisticsbracketthemedian.WeshallfindalowerboundontheprobabilityofA.

    Notethatifsevenormoredataarelessthanthemedian,thenitisnotthecasethatsevenormoredataaregreaterthanthemedian,soA1candA2caredisjoint.Hence,

    P(Ac)=P((A1A2)c)

    =P(A1cA2c)

    =P(A1c)+P(A2c),

    andthus

    P(A)=1P(Ac)=1P(A1c)P(A2c).

    WearedoneifwecanfindupperboundsforP(A1c)andP(A2c).

    Recallthatthemedianisthesmallestnumberthatatleast50%ofthepopulationarelessthanorequalto.Itfollowsthattheprobabilitythatanumberdrawnatrandomfromthepopulationisstrictlylessthanthemedianisatmost50%(andpossiblyless),andthattheprobabilitythatanumberdrawnatrandomfromthepopulationisstrictlygreaterthanthemedianisatmost50%(andpossiblyless).Thedataaredrawnfromthepopulationindependently,sothenumberofdatathatarelessthanthepopulationmedianhasaBINOMIALPROBABILITYDISTRIBUTIONwithntrialsandp50%,asdoesthenumberofdatathataregreaterthanthepopulationmedian.

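  • Let $Y$ be a random variable with a Binomial distribution with parameters $n = 10$ and $p = 50\%$. Thus $P(A_1^c) \le P(Y \ge 7)$, and $P(A_2^c) \le P(Y \ge 7)$. However, $P(Y \ge 7) = P(Y \le 3)$, so

    $$P(A) \ge 1 - P(Y \le 3 \text{ or } Y \ge 7) = P(4 \le Y \le 6).$$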

    Thus the probability that the interval $[X_{(4)}, X_{(7)}]$ contains the population median is at least as large as the probability of observing 4, 5, or 6 successes in 10 independent trials with probability 50% of success in each trial, the highlighted area in FIGURE 26-3:

    Figure 26-3: Binomial probability histogram

    The interval from the fourth smallest datum to the seventh smallest datum is therefore a 65.6% confidence interval for the population median.
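
    The 65.6% figure can be checked by summing binomial probabilities directly. The following is a minimal Python sketch, not part of the original text; the function name `median_ci_coverage` is a hypothetical label chosen for illustration, and the interval is assumed to run from the k-th smallest datum to the (n+1-k)-th smallest datum, as in the example above.

```python
from math import comb


def median_ci_coverage(n, k):
    """Lower bound on the chance that [X_(k), X_(n+1-k)] covers the median.

    With Y ~ Binomial(n, 1/2), the bound is P(k <= Y <= n - k).
    (Hypothetical helper; the name and signature are illustrative only.)
    """
    if k < 1 or 2 * k > n + 1:
        raise ValueError("need 1 <= k and k <= n + 1 - k")
    # Each outcome y has probability C(n, y) / 2^n when p = 1/2.
    return sum(comb(n, y) for y in range(k, n - k + 1)) / 2**n


# Sample of size 10, interval from the 4th to the 7th smallest datum:
print(median_ci_coverage(10, 4))  # 0.65625, i.e. (210 + 252 + 210) / 1024
```

    Trying other values of `k` for a fixed `n` shows the trade-off between the length of the interval and its confidence level: a smaller `k` gives a longer interval with higher confidence.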

    The same idea can be used to find confidence intervals for other percentiles: The probability distribution of the number of data that are less than the $100q$th percentile is Binomial with number of trials equal to the number of data, $n$, and probability of success at most $q$, and the probability distribution of the number of data that are greater than the $100q$th percentile is Binomial with number of trials equal to the number of data, $n$, and probability of success at most $1 - q$.
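
    The extension to other percentiles can be sketched the same way. The code below is an illustration under stated assumptions, not the author's implementation: the helper names `binom_upper_tail` and `percentile_ci_coverage` are hypothetical, `q` is a fraction between 0 and 1, and the interval is assumed to run from the j-th smallest datum to the k-th smallest datum with j no larger than k. It uses the two binomial bounds just described, together with the fact that a binomial upper-tail probability does not decrease when the success probability increases.

```python
from math import comb


def binom_upper_tail(n, p, m):
    """P(Y >= m) for Y ~ Binomial(n, p)."""
    return sum(comb(n, y) * p**y * (1 - p) ** (n - y)
               for y in range(max(m, 0), n + 1))


def percentile_ci_coverage(n, q, j, k):
    """Lower bound on the chance that [X_(j), X_(k)] covers the 100q-th percentile.

    (Hypothetical helper; names and argument order are illustrative only.)
    """
    if not (0 < q < 1 and 1 <= j <= k <= n):
        raise ValueError("need 0 < q < 1 and 1 <= j <= k <= n")
    # X_(j) exceeds the percentile only if at least n - j + 1 data are
    # greater than it; each datum is greater with chance at most 1 - q.
    miss_low = binom_upper_tail(n, 1 - q, n - j + 1)
    # X_(k) falls below the percentile only if at least k data are
    # less than it; each datum is less with chance at most q.
    miss_high = binom_upper_tail(n, q, k)
    # The two ways of missing are disjoint when j <= k, so the bounds add.
    return 1 - miss_low - miss_high


# The median example (q = 0.5, n = 10, 4th to 7th order statistics):
print(percentile_ci_coverage(10, 0.5, 4, 7))
```

    For q = 0.5 this reduces to the calculation for the median, which gives a quick consistency check on the general formula.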

    The following exercise checks whether you can find a confidence interval for a population median.

    Exercise 26-7. Consider finding a 96.5% confidence interval for the median of a population from a random sample with replacement of size 15.

    The confidence interval should go from the ? datum to the ? datum.


    Summary

    Suppose we have a procedure for calculating an interval from every possible sample of size $n$ from a population of size $N$ (a box of $N$ numbered tickets). Let $t$ be a parameter of the population. Suppose that if the procedure is applied to a random sample of size $n$, the chance that the resulting interval will contain $t$ is $P$%. Then the interval that results from applying the procedure to any particular random sample of size $n$ is a $P$% CONFIDENCE INTERVAL FOR $t$. Once the random sample has been drawn, the resulting interval either covers (contains) or does not cover $t$: the probability that the interval covers $t$ is either 0 or 100%. The probability that the interval will cover $t$ before the sample is drawn is called the CONFIDENCE LEVEL of the interval after the sample is drawn.

    Confidence intervals provide an alternative to reporting a single "best estimate" of a parameter and a summary measure of the uncertainty of the estimate. It is possible to construct conservative confidence intervals for the population percentage from simple random samples or random samples with replacement from 0-1 BOXES: For a simple random sample of size $n$, the chance that the random interval

    $$\left[\phi - \frac{k f}{2\sqrt{n}},\ \phi + \frac{k f}{2\sqrt{n}}\right]$$

    covers the population percentage $p$ is at least $1 - 1/k^2$, where $\phi$ is the sample percentage, $f$ is the finite population correction $\sqrt{(N-n)/(N-1)}$, $N$ is the population size, and $n$ is the sample size. For random sampling with replacement, the chance that the random interval

    $$\left[\phi - \frac{k}{2\sqrt{n}},\ \phi + \frac{k}{2\sqrt{n}}\right]$$

    includes the population percentage $p$ is at least $1 - 1/k^2$. These are conservative procedures for constructing confidence intervals, because the probability that the intervals they produce cover the true population percentage $p$ (the actual coverage probability) is greater than the probability they claim, $1 - 1/k^2$ (the nominal coverage probability). These procedures can be extremely pessimistic, especially when the sample size $n$ is large and when the true population percentage $p$ is far from 50%: the intervals then are much wider than they need to be for the actual coverage probability to be $1 - 1/k^2$.
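
    As a rough illustration of the conservative procedure, the sketch below computes the interval and its nominal coverage $1 - 1/k^2$. It is a minimal example under stated assumptions, not the author's code: the function name `conservative_ci` is hypothetical, the sample percentage `phi` is expressed as a fraction between 0 and 1, and passing no population size `N` is taken to mean sampling with replacement (so $f = 1$).

```python
from math import sqrt


def conservative_ci(phi, n, k, N=None):
    """Conservative interval phi +/- k*f/(2*sqrt(n)) and its nominal coverage.

    phi: sample percentage as a fraction; n: sample size; k: multiplier (> 1);
    N: population size for a simple random sample, or None for sampling
    with replacement. (Hypothetical helper, for illustration only.)
    """
    # Finite population correction; equal to 1 when sampling with replacement.
    f = 1.0 if N is None else sqrt((N - n) / (N - 1))
    half_width = k * f / (2 * sqrt(n))
    nominal_coverage = 1 - 1 / k**2
    return phi - half_width, phi + half_width, nominal_coverage


# Sample percentage 50%, n = 100 draws with replacement, k = 2:
low, high, level = conservative_ci(0.5, 100, 2)
print(low, high, level)
```

    With k = 2 the nominal coverage is only 75%, which gives a sense of how wide these intervals must be to reach conventional confidence levels such as 95%.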

    Suppose that the random sample is drawn with replacement. When the sample size $n$ is large, the central limit theorem ensures that the probability histogram of the sample percentage can be approximated accurately by the normal curve. The expected value of the sample percentage $\phi$ is $p$ and the SE of the sample percentage is $SD(box)/\sqrt{n}$, where $SD(box)$ is the population SD, $\sqrt{p(1-p)}$, the SD of the list of numbers on the tickets in the box. When $n$ is large, the SD of the sample, $s^*$, tends to be an accurate estimate of $SD(box)$, and the chance that the random interval

    $$\left[\phi - \frac{z s^*}{\sqrt{n}},\ \phi + \frac{z s^*}{\sqrt{n}}\right]$$

    contains $p$ is approximately equal to the area under the normal curve between $\pm z$. Taking $z = 1.96$, for example, gives approximate 95% confidence intervals. The coverage probability of this procedure typically is not exactly the area under the normal curve between $\pm z$, but as the sample size grows, the coverage probability approaches that area.
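
    The approximate interval for the population percentage can be sketched as follows. This is an illustrative example under stated assumptions: the function name `approx_ci_percentage` is hypothetical, the sample is assumed to be a list of 0s and 1s drawn with replacement, and percentages are expressed as fractions. For a list of 0s and 1s, the SD of the sample equals the square root of the sample fraction times one minus the sample fraction.

```python
from math import sqrt


def approx_ci_percentage(sample, z=1.96):
    """Approximate interval phi +/- z*s_star/sqrt(n) for a sample of 0s and 1s.

    (Hypothetical helper, for illustration only.)
    """
    n = len(sample)
    phi = sum(sample) / n            # sample percentage, as a fraction
    s_star = sqrt(phi * (1 - phi))   # SD of the sample for a 0-1 list
    half_width = z * s_star / sqrt(n)
    return phi - half_width, phi + half_width


# 100 draws, half of them 1s: the interval is 0.5 +/- 1.96 * 0.5 / 10.
sample = [1] * 50 + [0] * 50
print(approx_ci_percentage(sample))
```

    Because the approximation relies on the central limit theorem, a sketch like this is only appropriate when the sample size is large.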

    Approximate confidence intervals for the population mean can be constructed similarly, but then it is more common to use

    $$s = s^* \sqrt{n/(n-1)}$$

    to estimate $SD(box)$ than to use $s^*$. Let $M$ denote the sample mean. For random sampling with replacement, if the sample size $n$ is large, the chance that the random interval

    $$\left[M - \frac{z s}{\sqrt{n}},\ M + \frac{z s}{\sqrt{n}}\right]$$

    covers the population mean is approximately equal to the area under the normal curve between $\pm z$. Again, the coverage probability is not exactly the area under the normal curve between $\pm z$, but it approaches that area as the sample size grows.
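
    A corresponding sketch for the population mean is given below. It is a minimal illustration, not the author's code: the function name `approx_ci_mean` is hypothetical, and the example data are made up purely to show the call. The code computes $s^*$ with `statistics.pstdev` (which divides by $n$) and then rescales it to the sample standard deviation $s$.

```python
from math import sqrt
from statistics import mean, pstdev


def approx_ci_mean(sample, z=1.96):
    """Approximate interval M +/- z*s/sqrt(n) for the population mean.

    (Hypothetical helper, for illustration only.)
    """
    n = len(sample)
    m = mean(sample)                  # sample mean M
    s_star = pstdev(sample)           # SD of the sample (divides by n)
    s = s_star * sqrt(n / (n - 1))    # sample standard deviation s
    half_width = z * s / sqrt(n)
    return m - half_width, m + half_width


# Made-up data, far too small for the normal approximation to be
# trustworthy; it only demonstrates how the function is called.
print(approx_ci_mean([1, 2, 3, 4, 5]))
```

    The rescaled value `s` agrees with what `statistics.stdev` returns, so that function could be used directly; the two-step version is shown to mirror the formula above.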

    Confidence intervals can be constructed for population parameters other than percentages and means. For example, one can construct confidence intervals for percentiles of a population using the fact that for random sampling with replacement, the number of data that are less than the $100q$th percentile has a binomial distribution with parameters $n$ and $p$ at most $q$, and the number of data that are greater than the $100q$th percentile has a binomial distribution with parameters $n$ and $p$ at most $1 - q$.

    Key Terms

    approximate confidence interval; bootstrap estimate of the standard deviation; Chebychev's inequality; confidence interval; confidence level; conservative confidence interval; coverage probability; expected value; finite population correction $f$; normal approximation; normal curve; parameter; population mean; population percentage; population SD; probability; sample mean; sample percentage; sample standard deviation $s$; standard deviation (SD); standard deviation of the sample $s^*$

    1997-2015. P.B. Stark. All rights reserved. Last generated 5/29/2015, 8:04:49 PM. Content last modified 21 January 2013 08:37 PST.