Statistics level 1


  • 7/28/2019 Statistics level 1

    1/98

    STAT115 INTRODUCTION TO BIOSTATISTICS 2013

    Advances in our understanding of factors which affect health and wellbeing come through research in the health sciences. Examples of such research include surveys to describe patterns of disease in a community or risk factors for disease such as diet and smoking; studies trying to find out whether a newly developed treatment works; studies of factors which may prevent disease such as physical activity; studies of barriers to improving health such as reasons for declining vaccination rates in children, prevention of smoking. Biostatistics (statistics applied in the health sciences) is a vital tool in our mission to improve health and wellbeing for all people.

    STAT115 provides an introduction to the core principles and methods of biostatistics. In this course you will gain an understanding of how statistics is used to answer research questions: how to look for patterns in data, how to test hypotheses about disease causation and prevention and improvement in well-being. The understanding and skills gained in STAT115 can be a starting point for a career in biostatistics or can be used to assist understanding of research in other disciplines including physiology, anatomy, human nutrition, sports science, and psychology.

    GENERAL INFORMATION AND ADMINISTRATION

    Lecturers

    Dr Tilman Davies, Room 516, Science III Building.

    Ms Megan Drysdale, Room 231, Science III Building.

    Dr Katrina Sharples, Dept of Preventive and Social Medicine, Adams Building.

    Support Classes

    We have students from a range of backgrounds in the course. If you are concerned about your mathematical skills, we have a Support Class available on Tuesday evenings from 6pm to 8pm (commencing week 1) in the North CAL lab, ground floor Science III building, opposite the Science Library. In order to check whether this is appropriate for you, have a look at the "Basics Booklet" in Appendix 1 of these notes. The Support Class is designed for those who struggle with the material in the booklet.

    Practice resources

    Check out the following:

    1. Basics Booklet in Appendix 1
    2. MATHERCIZE http://mathercize.otago.ac.nz, login password is plus

    References

    There is no set text for the course as this course booklet contains all material necessary. If further reference materials are desired, two useful texts are:

    Clark, M. J. and Randal, J. R. A First Course in Applied Statistics. Pearson.
    MacGillivray, H. Utts & Heckard's Mind on Statistics. Cengage Learning.

    Multiple copies of both references are in the Science Library on close reserve at the Loans

    Desk.


    Course content (in approximate lecture order)

    Introduction: research methods and study design; designed experiments versus observational studies; case control, cohort and intervention studies. (Tilman, 2 lectures)

    Data description and presentation: the use of R; histograms, box-and-whisker plots, measures of centre and spread of data, measures of disease frequency and association. (Megan, 6 lectures)

    Probability: the nature of random variation; diagnostic tests; probability distributions including the binomial and normal distributions. (Tilman, 8 lectures)

    Estimation: sampling distributions; confidence intervals for means, differences and proportions. (Tilman, 5 lectures)

    Hypothesis testing: classical procedures for means, proportions, and differences; the p-value; statistical vs clinical significance; power and sample size. (Tilman, 3 lectures)

    Analysis of variance: completely randomised design; multiple comparisons. (Tilman, 3 lectures)

    Categorical data: tests for association; rates, relative risk and risk differences, odds ratios; confidence intervals for relative risk and odds ratio. (Megan, 4 lectures)

    Regression and correlation: the simple linear regression model; tests on the slope; predictions; confidence intervals for predictions; correlation. (Megan, 4 lectures)

    Multiple regression: tests on the estimated parameters; dummy variables for qualitative predictors; parallel regressions and control of confounding. (Megan, 4 lectures)

    Ethics and Study design: Ethical issues, bias and confounding. (Katrina, 7 lectures)


    NOTES

    1. Bring your student ID card.
    2. Be on time so as not to disturb others and also to avoid delaying the start of the next session.
    3. Bring your calculator. You are not allowed to use your cellphone as a calculator.
    4. Open R before you log on to the Resources Page.
    5. Be mindful when scrolling that you do not inadvertently change your selected answer.
    6. Any issues arising during the test must be brought immediately to the attention of the supervisor so that remedial action can be taken and you complete the test before leaving the room. There is no comeback once you have left the venue.
    7. There is no resit for these mastery tests.

    IMPORTANT - The only devices to be in operation during any testing period are:

    1. the lab computer with just the test and R active on screen;
    2. a calculator

    The use of other devices, e.g. cellphone, tablet, laptop etc., or using other programmes on the computer, e.g. email, may be deemed as an attempt to cheat.

    Security

    You are strongly advised to ID your calculator and other personal devices as these frequently get left behind in the computer laboratory.

    Exam format


    STAT115
    Introduction to Biostatistics

    Section 1: Introduction
    Lecture 1

    Section 1
    Biostatistics and research: an overview

    Course aim:
    An introduction to the core biostatistical methods essential to the health sciences
      scientific method
      design of research studies
      description and analysis of data

    Learning aims and objectives
    By the end of the course students should
      be aware of the appropriate use of common study designs and their strengths and weaknesses
      be able to describe the information contained in a data set
      be able to carry out common statistical data analyses
      be able to interpret the results of common statistical analyses in the context of the particular study design used
      be able to critically evaluate selected research articles published in health sciences journals

    Goal of health sciences professions
    To improve the health and well-being of individuals and communities
    This involves
      treatment of disease
      prevention of disease
      promotion of health
    In order to do this we need knowledge about
      causes of disease
      diagnosis
      disease processes
      effectiveness of treatments
      societal factors which affect health

    Examples of current gaps in knowledge
    Common diseases:
      Diabetes
      Cancer
    New diseases
      HIV, SARS, avian influenza
    Exposures
      Vitamin D deficiency
      Smoking
    New technologies
      Cell phones, 3D technology

    Research
    A process for providing answers to questions for which the answer is not immediately available

    Examples of general health research questions
      What are the genetic events which lead to childhood cancer?
      Can we develop a vaccine to prevent SARS?
      Can we develop vaccines against cancer cells?
      How do we stop people smoking?
      Can a new drug improve survival in people with colorectal cancer?
      How can we prevent childhood overweight and obesity?
      What are the main factors affecting quality of life of people with a chronic illness?

    Research provides a systematic process for answering these questions

    Invasive potential of CIN3
    Epidemiologists
      Margaret McCredie, Charlotte Paul, David Skegg
    Statistician
      Katrina Sharples
    Department of Preventive and Social Medicine, University of Otago
    Gynaecological oncologist
      Ron Jones
      National Women's Hospital, Auckland
    Pathologist
      Judith Baranyai
      Lab Plus, Auckland
    Cytologist
      Gabrielle Medley
      Melbourne Pathology, Victoria, Australia

    Dr Green's clinical study
    Clinical study of the natural history of carcinoma in situ (CIS) (1965-74)
      Carried out at National Women's Hospital, Auckland, New Zealand
      Aim: to investigate Dr Green's hypothesis that CIS was not a precursor of invasive cancer
      Involved withholding or delaying treatment of curative intent for a group of women diagnosed with CIS

    It has since been the subject of a Judicial Inquiry (1987-88)
      concluded that the study was unethical
      recommended that the histological and other material kept at National Women's should be available for properly planned and approved research and teaching

    [Figure: HPV transmission model. Labels: HPV infection; virus production; absence of virus production; viral DNA integration; E6-E7 production. Sources: http://www.bioacademy.gr/lab/lab.php?lb=36&pg=6 and Smith MA, Canfell K. Int. J. Cancer: 123, 1854-1863 (2008)]

    Research Aim
    To estimate the long term risk of cervical cancer in
      i) women whose CIN3 lesion was minimally disturbed and
      ii) women who had persisting CIN3

    Our study
      Women diagnosed with CIS at National Women's Hospital between 1 Jan 1955 and 31 Dec 1976 (1063 women)
      Information on smears and procedures extracted from hospital notes
      Endpoint: invasive cancer of cervix or vaginal vault

    Initial treatment of CIN3 lesion

    Time until adequate treatment
    Follow-up censored after treatment

    Invasive cancer of cervix or vaginal vault: initial treatment punch or wedge biopsy
    No censoring

    Blakely T, Shaw C, Atkinson J, Tobias M, Bastiampillai N, Sloane K, Sarfati D, Cunningham R. 2010. Cancer Trends: Trends in Incidence by Ethnic and Socioeconomic Group, New Zealand 1981-2004. Wellington: University of Otago and Ministry of Health.

    Hill et al. J Epidemiol Community Health 2010;64:117-123.

    Possible reasons for poorer survival after diagnosis
      Different disease sub-type
      Different stage of disease at diagnosis
      Differences in access to or uptake of treatment
      Co-morbidities
      Poorer follow-up

    PIPER Aims
    To compare progression free survival in patients diagnosed with colon cancer and rectal cancer (CRC) according to: location of residence (urban or rural & distance from treating centre); ethnicity; and socio-economic deprivation of area of residence.
    To identify differences in patient presentation, diagnostic evaluation, treatment and follow-up which contribute to differences in outcome by rurality, ethnicity or socio-economic deprivation

    PIPER Study design
      National study including 6323 patients
      Data obtained by reviewing patients' notes and hospital databases
      Analyses will compare the demographic and disease characteristics, treatment delivery and follow-up among the different ethnic groups
      Overall goal is to identify areas for intervention in order to improve outcomes
      A secondary goal is to set up prospective data collection to allow research into better treatments

    [Diagram: Scientific method: Hypothesis -> Predictions -> Experiment/observation (carry out research) -> Accept/reject/refine theory]

    Research
    The objective for most research studies is to use data from a sample to draw inference about a larger population.

    Steps in the research process
      Development of the research question
      Design of the study
      Collection of information
      Data description and analysis
      Interpretation of results

    Research questions relevant to course
    Epidemiology: the study of distribution and determinants of disease frequency
    Clinical research: the study of questions relating to care of patients
    Descriptive questions:
      What is the distribution of a disease?
      What is the natural history of a disease?
    Analytic questions:
      What are the causes of a disease?
      Will this approach prevent disease?
      Does this treatment improve outcome?

    Introduction to study design
    1. Descriptive studies (studies which describe things)
    2. Analytic studies (studies which test hypotheses)
         Experimental studies
         Observational studies
         Examples of types of analytic study
    3. Summary
         Classification of research designs
         Classification of common study types

    Descriptive studies
    Aim: to describe, for example:
      the characteristics of people with a disease (person, place, time)
      lifestyle patterns in a population
      attitudes to health care
    Descriptive studies are often called surveys or cross-sectional studies
    Descriptive studies generally use a sample from a population

    Example: What are the serum cholesterol levels of New Zealanders?
    Method: Select a subgroup (sample) of people and measure their serum cholesterol levels
    Random sampling
      choose the sample in such a way that every individual in the population has a known chance of being selected
      in a simple random sample, everyone has an equal chance of being chosen
      this method is the best way of obtaining a sample which is representative of the population
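A simple random sample can be sketched in a few lines of code. The example below is a minimal illustration in Python (the course itself works in R); the population of 10,000 ID numbers, the sample size of 200 and the seed are all invented for the illustration.

```python
import random

# Hypothetical sampling frame: ID numbers standing in for a population of 10,000 people.
population_ids = list(range(1, 10_001))

rng = random.Random(115)  # fixed seed so the sketch is repeatable

# random.sample draws without replacement; every set of 200 IDs is equally likely,
# so each person has the same chance (200/10,000 = 2%) of being chosen.
sample_ids = rng.sample(population_ids, k=200)

print(len(sample_ids), "people selected; first five IDs:", sorted(sample_ids)[:5])
```

Drawing without replacement from a complete list of the population is what gives everyone an equal chance; in practice the hard part is usually obtaining that complete list.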

    Suppose we want to estimate mean cholesterol in the population:

    Sample mean = Population (true) mean + Error
    Error = Systematic error + Random error

    Random error:
      due to natural biological variability
      increasing the sample size will reduce the random fluctuations in the sample mean
    Systematic error (= bias)
      due to aspects of the design or conduct of the study which systematically distort the results
      occurs if a sample is not representative of the population
      cannot be reduced by increasing the sample size
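The difference between the two kinds of error can be seen in a small simulation. This is a rough sketch in Python (the course itself works in R), under invented assumptions: a made-up population of 100,000 cholesterol values, and a deliberately unrepresentative sampling scheme that only reaches the half of the population with the highest values.

```python
import random
import statistics

rng = random.Random(2013)

# Hypothetical population of 100,000 cholesterol values (mmol/L); the numbers are invented.
population = [rng.gauss(5.5, 1.0) for _ in range(100_000)]
true_mean = statistics.fmean(population)

def sample_means(pop, n, repeats=500):
    """Means of `repeats` simple random samples of size n."""
    return [statistics.fmean(rng.sample(pop, n)) for _ in range(repeats)]

# Random error: the spread of the sample mean shrinks as the sample size grows.
spread_small = statistics.stdev(sample_means(population, 25))
spread_large = statistics.stdev(sample_means(population, 400))

# Systematic error: sample only from the half of the population with the highest values
# (an unrepresentative sample). A bigger sample does not remove the bias.
high_half = sorted(population)[50_000:]
bias_small = statistics.fmean(sample_means(high_half, 25)) - true_mean
bias_large = statistics.fmean(sample_means(high_half, 400)) - true_mean

print(f"random error (SD of sample mean): n=25 {spread_small:.3f}, n=400 {spread_large:.3f}")
print(f"systematic error (bias):          n=25 {bias_small:.3f}, n=400 {bias_large:.3f}")
```

With the larger samples the sample means cluster more tightly, but in the biased scheme they cluster around the wrong value.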

    Analytic studies
    Purpose: to test hypotheses, about, for example:
      causes of disease
      methods for prevention of disease
      the effects of treatments

    Analytic studies
    Experimental studies
      the researcher intervenes and records the result of their intervention
      the aim is to control all other factors to isolate the effects of the intervention
      best way to study causation
    Observational studies
      the investigator does not intervene, simply observes a naturally occurring process, and collects information
      ideal is to get as close as possible to the information that would have been obtained if the experimental study could have been done

    Example: Options for studying the relationship between smoking and heart disease

    Experiment (randomised controlled trial)
      Participants (non-smokers) -> Random allocation -> Smoke / Don't smoke -> Follow-up -> Heart disease?

    Observational study (cohort study)
      Participants -> Smokers / Non-smokers -> Follow-up -> Heart disease?

    Observational study (case-control study)
      People with heart disease (cases) -> Smokers?
      People without heart disease (controls) -> Smokers?
      Comparison: smoking among cases = ? smoking among controls

    STAT115
    Introduction to Biostatistics

    Section 1: Introduction
    Lecture 2


    Common analytic study designs
    Experimental:
      Randomised controlled trial
    Observational:
      Cohort study
      Case-control study

    Randomised controlled trial
    The gold standard analytic study
    Characteristics of an RCT:
      select a group of people
      randomly allocate them to either an intervention group(s) or a control group
      follow participants up over time, and measure outcome
    A control group is used to isolate the effects of the intervention
    Random allocation, or randomisation, means every person has the same chance of being in each group. This gives the best chance of getting two groups which are comparable in all respects
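Random allocation can be illustrated with a short sketch. The Python below (the course itself works in R) shuffles a list of made-up participant IDs and splits it in half; the helper name `randomise` and the twenty IDs are hypothetical, and real trials typically use more elaborate schemes.

```python
import random

def randomise(participants, seed=None):
    """Randomly split participants into two equal-sized groups (hypothetical helper)."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)           # every ordering is equally likely
    half = len(shuffled) // 2
    return {"intervention": shuffled[:half], "control": shuffled[half:]}

# Twenty made-up participant IDs.
groups = randomise(range(1, 21), seed=1)
print("intervention:", sorted(groups["intervention"]))
print("control:     ", sorted(groups["control"]))
```

Because the split depends only on the shuffle, no characteristic of a participant can influence which group they end up in.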

    Randomised controlled trial
      Used to evaluate new treatments or preventive strategies
      Often not ethical in studies of disease causation

    Example RCT: LIPID study (NEJM, 1998)
    Does treatment with pravastatin reduce the risk of death in patients with coronary heart disease?
    Study participants:
      9014 patients
      age 31-75
      coronary heart disease
      cholesterol 155-271 mg/decilitre

    LIPID study
      Participants (n=9014) -> Randomisation -> Placebo (n=4502) / Pravastatin (n=4512) -> Follow-up 6 years
      Placebo: 8.3% died
      Pravastatin: 6.4% died
    As participants were recruited to the study they were allocated to either pravastatin or control according to a random number sequence

    LIPID trial results
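The two percentages in the LIPID flow diagram can be compared as a risk difference and a relative risk, both of which appear later in the course. The following is a minimal Python sketch (the course itself works in R), assuming that 8.3% refers to the placebo group and 6.4% to the pravastatin group.

```python
# Proportions who died, as quoted in the LIPID flow diagram.
risk_placebo = 0.083      # assumed to be the placebo group (n=4502)
risk_pravastatin = 0.064  # assumed to be the pravastatin group (n=4512)

risk_difference = risk_placebo - risk_pravastatin   # absolute difference
relative_risk = risk_pravastatin / risk_placebo     # risk on treatment relative to placebo

print(f"risk difference: {risk_difference * 100:.1f} percentage points")
print(f"relative risk:   {relative_risk:.2f}")
```

A relative risk below 1 means the risk of death was lower in the treated group than in the placebo group.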

    Randomised controlled trial
    Advantages:
      experiment - the best way to test an hypothesis
      if the trial is well conducted, differences in outcome can be attributed to the intervention
    Disadvantages:
      may not be ethical or feasible

    Cohort study
    Observational study, generally carried out to test hypotheses
    Characteristics:
      participants are selected before disease has developed
      followed over time to determine development of disease
      information is collected about exposures at baseline and during follow-up

    Example: British Doctors Study
    Aim: to investigate the relationship between smoking and lung cancer
      Doctors on the British medical register (men n=24,389)
      Sent a questionnaire about their smoking habits
      Smokers (n=21,296) -> Lung cancer?
      Non-smokers (n=3093) -> Lung cancer?
    Found smokers had a 14 fold higher risk of lung cancer than the non-smokers

    Case-control study
    Observational study, generally carried out to test hypotheses
    Characteristics
      participants are chosen on the basis of their disease status: a group with disease (cases) and a group without (controls)
      information is collected from people with and without disease about exposures that occurred in the past
      longitudinal (retrospective)

    Example: Risk of venous thromboembolism after air travel
      Population
      Cases: deep vein thrombosis (n=210)
        Long distance flight (n=31)
        No long distance flight (n=179)
      Controls: no deep vein thrombosis (n=210), a random sample of people who have not had a deep vein thrombosis
        Long distance flight (n=16)
        No long distance flight (n=194)

    Case-control study
    Findings:
      odds ratio = 2.1, 95% confidence interval (1.1 to 4.0)
      Air travel doubled the odds of venous thromboembolism
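The odds ratio quoted in the findings can be reproduced from the four counts in the diagram. The Python sketch below (the course itself works in R) also computes an approximate 95% confidence interval on the log scale; that interval method is an assumption, since the slide does not say how its interval was calculated.

```python
import math

# Counts from the flow diagram: exposure = long distance flight.
cases_exposed, cases_unexposed = 31, 179        # people with deep vein thrombosis
controls_exposed, controls_unexposed = 16, 194  # people without

odds_cases = cases_exposed / cases_unexposed
odds_controls = controls_exposed / controls_unexposed
odds_ratio = odds_cases / odds_controls

# Approximate 95% confidence interval on the log scale (Woolf's method; an assumption,
# the slide does not say how its interval was calculated).
se_log_or = math.sqrt(1 / cases_exposed + 1 / cases_unexposed
                      + 1 / controls_exposed + 1 / controls_unexposed)
lower = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
upper = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

print(f"odds ratio = {odds_ratio:.1f}, 95% CI ({lower:.1f} to {upper:.1f})")
```

The odds of a long distance flight are 31/179 among cases and 16/194 among controls, and the odds ratio is the first divided by the second.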

    Cohort vs case-control studies
    Cohort study
    Advantages
      closest observational study to randomised controlled trial
      good for examining common outcomes
      can evaluate the effect of exposure on multiple outcomes
    Disadvantages
      long duration needed if the disease takes a long time to develop after exposure
      if the disease is rare, the number of participants needs to be very large

    Cohort vs case-control studies
    Case-control study
    Advantages
      relatively quick
      smaller than cohort studies, particularly for rare diseases
      can examine the effects of multiple exposures
    Disadvantages
      events have already occurred so the potential for bias is higher

    Classification of research designs
    i) Classification by purpose of the study
         descriptive (describe things) vs. analytic (testing hypotheses)
    ii) Classification by form of the design
         experimental (researcher intervenes) vs. observational (researcher observes)
    iii) Classification by time
         cross-sectional (information collected about one point in time) vs. longitudinal

    Classification of common study types
    Randomised controlled trial
      analytic, experimental, longitudinal (prospective)
    Cohort study
      analytic, observational, longitudinal (usually prospective)
    Case-control studies
      analytic, observational, longitudinal (retrospective)

    These classifications provide a useful framework for thinking about the strengths and weaknesses of different study designs, but they will not always work

    STAT115
    Introduction to Biostatistics

    Megan Drysdale
    University of Otago
    Course Co-ordinator: Dr Tilman Davies

    Section 2: Data Description and Presentation
    Lecture 3

    Data and variables

    There are two types of measurement of interest in many scientific studies...

    1 First, the outcomes measured on each experimental unit (plant, animal, person) provide values of what is called a response variable.
    2 Second, the characteristics or levels of exposure that explain at least some of the differences in the observed values of the response variable are called explanatory variables.

    Data forming the response and exposure variables can be either categorical or numerical (otherwise known as qualitative and quantitative respectively).

    Iron levels in newborn children

    Anthony-Sivan et al. (2012) measured cord ferritin levels in 140 pregnant women in Israel.

    Mothers in the stress group (n = 63) were in the first trimester during the period that the area was under rocket attack.

    Mothers in the control group became pregnant 3-4 months after the attacks ended.

    Results indicated that cord ferritin levels tended to be lower in the stress group.

    Question: what are the response and explanatory variables?

    Types of data: categorical data

    Categorical data takes on values in a fixed number of categories. The simplest kind involves just two categories. For example, a person could be...

    male/female

    smoker/non-smoker

    diabetic/non-diabetic

    Such data are also called binary data, dichotomous data, yes/no data and 0-1 data.
    The last is particularly important, for example 0 would typically represent non-diabetic and 1 would represent diabetic.

    Example: Altruism gene?

    Subjects had two versions of a particular gene (COMT)

    Participants earned money, and then had the opportunity to give it to a poor child in a developing country

    Study claimed that participants carrying the Val version of the gene gave more money

    Types of data: categorical data (multiple)

    We often need to use more than two categories.

    Data are nominal if there is no natural (or relevant) ordering...

    Blood group: A/B/AB/O

    Ethnicity: Maori/Pacific Island/Caucasian/Asian

    Data are ordinal if there is a natural ordering...

    Degree of pain: minimal/moderate/severe/unbearable

    Megan is an awesome lecturer: strongly agree/agree/neutral/disagree/strongly disagree

    Even in this case it can be misleading to code the categories as integer values (e.g. 0,1,2,3 for Degree of pain). Is unbearable three times more severe than moderate?

    Example: Beer, anyone?

    Survey data collected to analyse drinking habits in NZ.

    Age categories on the form were 18-24, 25-34, 35-49, 50+
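Turning an age in years into one of the survey's categories is an example of recoding a numerical variable as an ordinal one. The following is a minimal Python sketch (the course itself works in R); the helper name `age_category` and the list of ages are invented for the illustration.

```python
def age_category(age):
    """Map an age in whole years to the survey's categories (hypothetical helper)."""
    if age < 18:
        raise ValueError("the survey form has no category for under-18s")
    if age <= 24:
        return "18-24"
    if age <= 34:
        return "25-34"
    if age <= 49:
        return "35-49"
    return "50+"

ages = [19, 24, 25, 41, 50, 73]  # made-up ages
print([age_category(a) for a in ages])
```

Note that the recoding loses information: once ages are binned, a 50 year old and a 73 year old can no longer be told apart.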

    Types of data: discrete numerical data

    With discrete data, observations take only certain numerical values, typically integers or whole numbers. For example:

    number of possums caught in traps

    number of children in a family (0,1,2,3,4,...)

    It is important to note that these are not like categorical data as the numerical representations are always consistent... e.g. 3 children is three times as many as one.

    This type of data can be treated as though it is categorical if we must, but this discards information about the magnitude of the relationships between successive outcomes.

    Types of data: continuous numerical data

    Here recorded values or observations result from some form of measurement. For example:
    height, age, blood pressure, serum cholesterol, oxygen levels in a lake,...

    Often no restriction on values other than that caused by accuracy of equipment for recording values.

    Types of data: continuous numerical data

    Often the values show a pattern similar to what is called the bell-shaped normal curve with many values clustered around a central point and few values in the tails.

    Commonly used measures: ratios, proportions

    Ratio: fraction given by one quantity over another. Both quantities have the same units.

    Example: In a class with 10 boys and 20 girls, the ratio of boys to girls is 10/20 (= 1/2) and the ratio of girls to boys is 20/10 (= 2)

    Proportion: fraction of one quantity when compared to the whole.

    Example: In a class with 10 boys and 20 girls, the proportion of boys is 10/(10+20) = 1/3
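The class example can be checked with exact fractions. A small Python sketch (the course itself works in R); the variable names are only illustrative.

```python
from fractions import Fraction

boys, girls = 10, 20

ratio_boys_to_girls = Fraction(boys, girls)        # part compared with another part
ratio_girls_to_boys = Fraction(girls, boys)
proportion_boys = Fraction(boys, boys + girls)     # part compared with the whole

print(ratio_boys_to_girls, ratio_girls_to_boys, proportion_boys)  # 1/2 2 1/3
```

A ratio can exceed 1, whereas a proportion always lies between 0 and 1 because the numerator is part of the denominator.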

    Commonly used measures: percentages

    Proportions are often expressed in terms of percentages.

    To convert proportions to percentages, multiply by 100 and add a % sign.
    To convert percentages to proportions, divide by 100 and remove the % sign.

    Example: 30% = 0.3, 56% = 0.56 etc.

    Careful with percentages

    NZ Herald, Friday, June 8, 2012:

    Of Australia's top 200 listed companies, 12.7 per cent had female directors by the beginning of August, compared to 9.3 per cent for the top 100 listed companies here.

    12.7% = 0.127...

    0.127 × 200 companies = 25.4 companies with female directors? Hmmm...
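One way to see why the quoted figure looks odd is to list every percentage that a whole number of companies out of 200 could produce. A short Python sketch (the course itself works in R), assuming the percentage was meant to be a share of exactly 200 companies:

```python
companies = 200

# Which whole numbers of companies would round to 12.7% of 200?
matches = [k for k in range(companies + 1) if round(100 * k / companies, 1) == 12.7]
print(matches)  # [] : no whole number of companies gives 12.7%

# The nearest achievable percentages are 25/200 and 26/200.
print(100 * 25 / companies, 100 * 26 / companies)  # 12.5 13.0
```

Since no count of companies gives 12.7%, the percentage must have been calculated from something other than a simple count out of 200.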

    Commonly used measures: rates

    Rates are like ratios for quantities with different units.

    Number of new cases of HIV in NZ per year (this is the incidence of HIV)

    Number of people with HIV in NZ at a certain time (this is the prevalence of HIV)

    Number of children per family

    Number of yawns per lecture

    Usual practice is to simplify rates to a per unit measure...

    13 deaths over 5 years is 2.6 deaths per year

    Question: Is 13 deaths per 5 years better than 10 deaths per 4 years?
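Putting both rates on a per-year basis makes the comparison direct. A minimal Python sketch (the course itself works in R); the helper name `rate_per_year` is hypothetical.

```python
def rate_per_year(events, years):
    """Events per single year, so rates over different periods can be compared."""
    return events / years

rate_a = rate_per_year(13, 5)   # 2.6 deaths per year
rate_b = rate_per_year(10, 4)   # 2.5 deaths per year

print(rate_a, rate_b, "first rate is higher" if rate_a > rate_b else "first rate is not higher")
```

On a per-year basis, 13 deaths in 5 years is slightly worse than 10 deaths in 4 years, even though the second period is shorter.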

    Commonly used measures: scores

    Continuous phenomena are often scored and binned in survey data as ordinal categories. For example:

    Responses in a question about back pain might be on a scale of 1 (no pain) to 5 (unbearable pain).

    Levels of agreement in a survey (e.g. course evaluation) might be labelled a great deal / somewhat / not much / not at all

    In both cases, the responses might be numbered (e.g. 0,1,2,3,4), but care must be taken interpreting these as numerical data.

    Computing with Data: R

    R is an open source (free) software package designed for statistical analysis.

    It was designed and first developed in NZ, but is now used around the world.

    It is powerful, but takes a bit of learning.


    Anyone on Facebook at the moment?

    Using R: At home

    You can use R on Macs, Windows and Linux. R is available as a free download from

    http://cran.r-project.org/

    This is a self-extracting executable that will install everything for you.

    A bit of a demo...

    To give you an initial look, I'll now

    1 Start up R

    2 Import some data

    3 Recode a numerical variable as a categorical variable
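The third step of the demo can be sketched in a few lines of R. This is only an illustration: the pressure values, the cut points and the labels below are made up for the example and are not the data used in the lecture.

```r
# Hypothetical blood pressures (mm of Hg); illustrative values only.
pressure <- c(72, 88, 91, 104, 118, 85)

# cut() recodes a numerical variable into intervals (a categorical factor).
# By default each interval is (lower, upper], i.e. closed on the right.
pressure_group <- cut(pressure,
                      breaks = c(59.5, 79.5, 99.5, 119.5),
                      labels = c("low", "medium", "high"))

# Count how many values fall into each category.
table(pressure_group)
```

The result is a factor, so R will treat it as categorical data in later tables and plots.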


    STAT115

    Introduction to Biostatistics

    Megan Drysdale

    University of Otago

    Course Co-ordinator: Dr Tilman Davies

    Section 2: Data Description and Presentation

    Lecture 4

    Statistics and samples

    In statistics we often (informally) talk about samples from a population. What do we mean?

    By population we mean a set of individuals or entities or subjects that we wish to make inferences about. This could be a real population (e.g. males in Dunedin), but doesn't need to be.

    By sample we mean a subset of the population, usually selected at random.

    A statistic is a quantity that we can compute from a sample (hence the subject name!)

    Describing numerical data

    Graphs can be used to summarise sample data, though many graphs can be highly misleading. Today we shall look at summarising numerical data graphically using histograms, and how to make these with R.

    In subsequent lectures we will talk about particular values which summarise numerical data, including:

    1 mean; median; mode

    2 standard deviation; interquartile range

    These statistics measure the centre and the variability of the sample data respectively.

    And then we will look at box and whisker plots, another way of displaying numerical data.

    Example: hypertension data

    In a hypertension study, 56 men who are heavy smokers (smoked for 25 years) have their blood pressures measured (in mm of Hg). Blood pressures are classified into intervals to form a frequency table, and interval frequencies (f_j) are obtained.

    Pressure (mm of Hg)    Frequency f_j
    59.5 (69.5)             2
    69.5 (79.5)             7
    79.5 (84.5)             9
    84.5 (89.5)            10
    89.5 (94.5)            11
    94.5 (99.5)             7
    99.5 (109.5)            8
    109.5 (119.5)           2
    Total                  56 (sample size)


    Computing proportions

    Pressure (mm of Hg)    f_j    Proportion
    59.5 (69.5)             2      2/56 = 0.036
    69.5 (79.5)             7      7/56 = 0.125
    79.5 (84.5)             9      9/56 = 0.161
    84.5 (89.5)            10     10/56 = 0.179
    89.5 (94.5)            11     11/56 = 0.196
    94.5 (99.5)             7      7/56 = 0.125
    99.5 (109.5)            8      8/56 = 0.143
    109.5 (119.5)           2      2/56 = 0.036
    Total                  56

    ...and percentages

    Pressure (mm of Hg)    f_j    Proportion        %
    59.5 (69.5)             2      2/56 = 0.036     3.6%
    69.5 (79.5)             7      7/56 = 0.125    12.5%
    79.5 (84.5)             9      9/56 = 0.161    16.1%
    84.5 (89.5)            10     10/56 = 0.179    17.9%
    89.5 (94.5)            11     11/56 = 0.196    19.6%
    94.5 (99.5)             7      7/56 = 0.125    12.5%
    99.5 (109.5)            8      8/56 = 0.143    14.3%
    109.5 (119.5)           2      2/56 = 0.036     3.6%
    Total                  56     56/56 = 1.0     100%
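The proportion and percentage columns can be reproduced in R from the frequencies alone. This is a small sketch; the interval labels are names chosen here for readability and are not part of any data file.

```r
# Interval frequencies f_j from the hypertension table.
freq <- c(2, 7, 9, 10, 11, 7, 8, 2)
names(freq) <- c("59.5-69.5", "69.5-79.5", "79.5-84.5", "84.5-89.5",
                 "89.5-94.5", "94.5-99.5", "99.5-109.5", "109.5-119.5")

n <- sum(freq)               # sample size (56)
proportion <- freq / n       # each frequency as a fraction of the whole
percent <- 100 * proportion  # proportions expressed as percentages

round(proportion, 3)
round(percent, 1)
```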

    Excel Bar Chart

    [Figure: 3D Excel bar chart of % frequency (1 dp), y-axis from 0% to 20%, with one bar for each pressure interval from 59.5 (69.5) to 109.5 (119.5).]

    Issues

    3D effect doesn't help us understand the data (in general, avoid 3D effects wherever possible).

    Labelling issues....

    Bars 1, 2, 7 and 8 actually cover a 10mm range while the others cover a 5mm range. This makes us overestimate the proportion with low or high blood pressure.


    Switching to R

    1 Start up R.

    2 Import the data file BPdata.txt (available on the resources page)

    3 Read the data into R

    4 Create a histogram using the command:

    hist(Dataset$Pressure, scale="frequency", breaks="Sturges", col="darkgray")

    About the Histogram

    Notice that

    Y-axis gives the raw frequencies. We can change options to give proportions or percentages.

    Equally spaced bins make it easier to compare areas.

    Number of bins chosen automatically (to make it look good) but we can alter that.

    No distracting 3d graphics!

    Modifying the histogram

    Suppose we want to specify the actual bins (values) to use in the histogram.

    The command to generate the histogram was

    hist(Dataset$Pressure, scale="frequency", breaks="Sturges", col="darkgray")

    Actually, a much simpler version would have also worked:

    hist(Dataset$Pressure)


    Modifying the histogram

    Adding options fine-tunes the presentation of the graph.

    1 The simplest histogram:

    hist(Dataset$Pressure)

    2 Colouring in the boxes:

    hist(Dataset$Pressure, col="darkgray")

    3 Specifying the number of bins:

    hist(Dataset$Pressure, col="darkgray", breaks=5)

    4 Setting the breaks between the bins:

    hist(Dataset$Pressure, col="darkgray", breaks=c(59.5,69.5,79.5,84.5,89.5,94.5,99.5,109.5,119.5))

    5 Technical point: R uses c( , , , ) to encode a vector (a simple array of values).

    Reading a histogram

    The heights of the first two and last two rectangles are halved but their bases are doubled from 5 to 10mm. Area is proportional to frequency.

    A histogram with rectangle heights proportional to class frequencies would give a misleading picture of the data.

    You will find that most of the histograms produced by statistical packages like R have class intervals of equal length and you can decide the number of intervals you want in the graph.

    Usually between 5 and 20 intervals of equal length are chosen for a good summary of the data.
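The "area is proportional to frequency" idea can be checked numerically. The raw blood pressure file is not reproduced here, so the sketch below makes an assumption: each observation is placed at the midpoint of its interval, which is enough to recover the bin counts. With unequal bins, hist() reports a density (height) for each bar, and height times width gives back the proportion.

```r
breaks <- c(59.5, 69.5, 79.5, 84.5, 89.5, 94.5, 99.5, 109.5, 119.5)
freq   <- c(2, 7, 9, 10, 11, 7, 8, 2)

# Assumption: stand-in data with every value at its interval midpoint.
mids   <- (head(breaks, -1) + tail(breaks, -1)) / 2
pseudo <- rep(mids, times = freq)

# plot = FALSE returns the histogram calculations without drawing.
h <- hist(pseudo, breaks = breaks, plot = FALSE)

widths <- diff(breaks)        # 10, 10, 5, 5, 5, 5, 10, 10
areas  <- h$density * widths  # bar area = height x width

h$counts   # the original frequencies
areas      # equal to freq / 56, the proportions
```

The 10mm bars have half the height they would have if height alone showed frequency, which is exactly the correction described above.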

    Shape and skewness

    Many numerical data sets look roughly like a blob in the middle with twotails extending out either side.

    If the left tail is longer than the right we say the data is left-skewed.

    If the right tail is longer than the left we say the data is right-skewed.

    Data that is neither left nor right skewed is symmetric.


    STAT 115 Final exam marks

    The file Stat115Grades2011.txt contains the (anonymised) internal assessment and final exam marks for STAT 115 students in 2011.

    Import these data in R, produce histograms with varying numbers of bins, and a histogram with bins 0-49, 50-59, 60-64, 65-69, 70-74, 75-79, 80-84, 85-89, 90-100.

    The R command to use is:

    hist(Grades$Final, breaks = c(0,50,60,65,70,75,80,85,90,100))

    Grades histogram

    In this histogram, the boxes correspond to letter grades.

    These data are left-skewed since the left tail is longer.


    STAT115

    Introduction to Biostatistics

    Megan Drysdale

    University of Otago

    Course Co-ordinator: Dr Tilman Davies

    Section 2: Data Description and Presentation

    Lecture 5

    Statistics and samples

    By sample we mean a subset of the population, usually selected at random.

    A statistic is a quantity that we can compute from a sample (hence the subject name!)

    In this lecture we will study some standard summary statistics, quantities computed from the sample which in some way summarise properties of that sample.

    Measures of central tendency: the mean

    The sample mean is the average of the quantities in the sample. Compute it by summing up all of the values and dividing by the sample size.

    Example

    Six patients lived the following years after diagnosis of HIV

    Datum    Symbol
    1.8      x_1
    3.2      x_2
    6.8      x_3
    4.6      x_4
    2.8      x_5
    7.9      x_6

    Sample mean = (1.8 + 3.2 + 6.8 + 4.6 + 2.8 + 7.9)/6 years = 27.1/6 years = 4.52 years.

    Notes on the mean

    1 In general, the mean of a sample of size n is given by

    $\frac{1}{n}\sum_{i=1}^{n} x_i = (x_1 + x_2 + \cdots + x_n)/n.$

    2 Be careful with a calculator. To compute the mean of 1,6,5 enter (1 + 6 + 5) ÷ 3, not 1 + 6 + 5 ÷ 3.

    3 The mean may be different to all of the values, but it will be within the same range of values.

    4 We sometimes use $\bar{x}$ to denote the mean of $x_1, \ldots, x_n$.


    Computing the sample mean in R

    In R the command

    mean()

    computes the mean.

    You can also use

    summary()

    which computes the mean and other summary statistics.
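As a quick illustration, here are both commands applied to the six survival times from the HIV example. The variable name `years` is chosen here for the sketch.

```r
# Years lived after diagnosis for the six patients in the example.
years <- c(1.8, 3.2, 6.8, 4.6, 2.8, 7.9)

# Sum of the values divided by the sample size.
sample_mean <- mean(years)
round(sample_mean, 2)

# Minimum, quartiles, median, mean and maximum in one call.
summary(years)
```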

    The sample median

    The median is the middle value, once the sample has been sorted from smallest to largest.

    Example

    To find the median of 1,6,4,8,3 we first sort the values:

    1,3,4,6,8

    and then take the middle value (4).

    Example

    To find the median of 5,3,7,5,3,6,3 we first sort the values:

    3,3,3,5,5,6,7

    and then take the middle value (5).

    Sample median calculation

    Suppose that the sample size is n.

    If n is odd then there is a unique middle element in the sorted list (element (n + 1)/2).

    If n is even then there are two middle elements, and the median equals their average.

    Example

    To find the median of 7,6,1,4 we sort:

    1,4,6,7

    and then average the two middle elements (4,6), to give a median of 5.

    Median examples

    Consider the data:

    Data: 95 86 78 90 62 73 89

    We first sort the data:

    Sorted: 62 73 78 86 89 90 95

    and then take the middle (4th) entry: 86.

    Note that if the value 62 had been replaced by, say, 5, then the median would be unchanged, whereas the mean would have been changed substantially. This is important, since data sets often contain outlier values which are due to experimental or measurement error. They can be radically different from the other values, and should be ignored.
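The effect of replacing 62 by 5 is easy to confirm in R. The sketch below uses the seven values from the example; the variable names are chosen here.

```r
x <- c(95, 86, 78, 90, 62, 73, 89)

# Replace the smallest value (62) by an extreme value (5).
x_outlier <- replace(x, x == 62, 5)

median(x)          # middle value of the sorted data
median(x_outlier)  # same middle value: the median ignores how small the minimum is

mean(x)            # 573 / 7
mean(x_outlier)    # 516 / 7, pulled down by the single extreme value
```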


    NZ weekly income

    In June 2011 (latest reported data), in NZ,

    the median weekly income (from all sources) was $550;

    the mean weekly income (from all sources) was $703.

    Why do you think there is such a difference?

    Mode

    If the data is discrete or if it has been binned, then we can talk about the mode of the data. The mode is the most frequently occurring value.

    Note for later in the course: it is much more common to talk about the mode of a density, rather than a data set.

    Example: grades

    Here is the (default) histogram for STAT115 exam results in 2011.

    In R we compute that the mean of these marks is 63.3 and the median of these marks is 70.

    Example: blood pressure

    Here again is the (default) histogram for the blood pressure data we looked at earlier.

    In R we compute that the mean of these pressures is 89.32 and the median of these pressures is 89.50.


    A nice mathematical property

    Consider the data set 95, 86, 78, 90, 62, 73, 89.

    The mean is the value x which minimises the sum of squares

    $(x-95)^2 + (x-86)^2 + (x-78)^2 + (x-90)^2 + (x-62)^2 + (x-73)^2 + (x-89)^2.$

    The median is a value x which minimises the sum of absolute differences

    $|x-95| + |x-86| + |x-78| + |x-90| + |x-62| + |x-73| + |x-89|.$

    Try and convince yourself that this holds in general.
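One way to convince yourself is to let R search for the minimising value numerically. This is a sketch rather than a proof: it uses the built-in optimize() function to search between the smallest and largest data values, and the answers agree with mean() and median() only up to the search tolerance.

```r
x_data <- c(95, 86, 78, 90, 62, 73, 89)

# The two quantities to be minimised, as functions of a candidate value x.
sum_sq  <- function(x) sum((x - x_data)^2)
sum_abs <- function(x) sum(abs(x - x_data))

# One-dimensional numerical search over the range of the data.
best_sq  <- optimize(sum_sq,  interval = range(x_data))$minimum
best_abs <- optimize(sum_abs, interval = range(x_data))$minimum

c(best_sq,  mean(x_data))    # the two numbers should be very close
c(best_abs, median(x_data))  # likewise
```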

    Sources of variability

    Variability reflects differences in the values collected for different units being measured, for example people, or animals or plants or companies or readings on different days etc. Two sets of values can have the same mean and median yet show quite different patterns.

    If data are highly variable there are problems analysing the data and it will be necessary to select larger samples.

    We look at several ways to quantify variability in a sample.

    Range

    The range is the difference between the largest and smallest values in the sample.

    Example

    The range of the sample 95, 86, 78, 90, 62, 73, 89 is 95 − 62 = 33.

    While the range does tell us something about the sample, it is affected a lot by outliers and random noise. For this reason, we don't often use the range to tell us something about the underlying population.

    Sample variance and sample standard deviation

    The sample variance $S^2$ is defined by the formula

    $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2.$

    Although the divisor is (n − 1) rather than n in this equation, we can see that $S^2$ is effectively the average of the squared deviations of the individual data values ($x_i$) from their mean ($\bar{x}$).

    The sample variance is an overall measure of the extent to which the $x_i$ values differ from their mean ($\bar{x}$).

    If you didn't take squares, the values above the mean would cancel those below the mean, and you would end up with 0.


    Standard deviation

    A convenient alternative to the sample variance is the sample standard deviation.

    $S = \sqrt{S^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$

    The standard deviation, S, is measured in the same units as the original data (taking the square root cancels the squaring).

    Example: In the hypertension example, the values in the data are measured in mm (of Hg). Hence the variance is measured in mm² while the standard deviation is measured in mm.

    Variance example

    Find the sample variance and standard deviation of 11, 18, 14, 15, 12.

    The mean is $\bar{x}$ = (11 + 18 + 14 + 15 + 12)/5 = 14.

    x_i     x_i − x̄             (x_i − x̄)²
    11      (11 − 14) = −3      (−3)² = 9
    18      (18 − 14) = 4       4² = 16
    14      (14 − 14) = 0       0² = 0
    15      (15 − 14) = 1       1² = 1
    12      (12 − 14) = −2      (−2)² = 4
    70      0                   30

    Hence $S^2$ = 30/(5 − 1) = 7.5

    and S = $\sqrt{7.5}$ = 2.74.

    Computing sample mean and variance in R

    You can compute the mean in R using mean()

    You can compute the variance in R using var()

    You can compute the standard deviation in R using sd()
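As a check on the worked example above, the sketch below computes the variance of 11, 18, 14, 15, 12 step by step and then with the built-in commands. The intermediate variable names are chosen here.

```r
x <- c(11, 18, 14, 15, 12)

# Step by step, following the table in the worked example.
deviations <- x - mean(x)            # -3, 4, 0, 1, -2
sum_sq_dev <- sum(deviations^2)      # 30
s2_by_hand <- sum_sq_dev / (length(x) - 1)

# Built-in versions; var() and sd() both use the n - 1 divisor.
var(x)
sd(x)
```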

    And on a technical note...

    The formula for the sample variance is

    $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2.$

    Why (n − 1)?

    The variance of the whole population is defined as

    $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2,$

    where $\mu$ is the population mean.

    However, if you take a random sample of size n and use this variance formula you will, on average, get an amount that is $\frac{n-1}{n}$ times what you want.

    The $\frac{1}{n-1}$ term in the $S^2$ formula corrects for this. (see STAT 261)
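The (n − 1)/n factor can be seen exactly on a tiny example. The sketch below rests on assumptions made for illustration: the "population" is just the five numbers 11, 18, 14, 15, 12, and samples of size 2 are drawn with replacement, so that all 25 possible samples are equally likely and can be listed in full.

```r
population <- c(11, 18, 14, 15, 12)
N <- length(population)

# Population variance (divide by N).
sigma2 <- sum((population - mean(population))^2) / N

# Every possible sample of size n = 2, drawn with replacement (25 rows).
n <- 2
samples <- expand.grid(rep(list(population), n))

# Variance of each sample using divisor n, and using divisor n - 1.
divide_by_n   <- apply(samples, 1, function(s) sum((s - mean(s))^2) / n)
divide_by_nm1 <- apply(samples, 1, var)

sigma2                 # what we want
mean(divide_by_n)      # (n - 1)/n times what we want
mean(divide_by_nm1)    # equal to sigma2 on average
```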


    STAT115

    Introduction to Biostatistics

    Megan Drysdale

    University of Otago

    Course Co-ordinator: Dr Tilman Davies

    Section 2: Data Description and Presentation

    Lecture 6

    Statistics and samples

    By sample we mean a subset of the population, usually selected at random.

    A statistic is a quantity that we can compute from a sample (hence the subject name!)

    In the last lecture we studied some standard summary statistics, quantities computed from the sample which in some way summarise properties of that sample. Previously, we had looked at histograms as a way to represent sample data.

    Today we look at the interquartile range and its graphical equivalent, box and whisker plots.

    Range

    Recall that the range of a sample is the difference between the maximum and minimum values.

    The median is the value which is (informally) in the middle of the sample values.

    The upper quartile is the value which is 3/4 of the way up the sample values.

    The lower quartile is the value which is 1/4 of the way up the sample values.

    [Figure: the sorted sample split at the lower quartile, median and upper quartile; 25% of the values lie below the lower quartile, 50% between the two quartiles, and 25% above the upper quartile.]

    Interquartile range

    The interquartile range (IQR) is the difference between the upper quartile and the lower quartile.

    [Figure: the data divided into four blocks of 25% by LQ, median and UQ; the interquartile range spans LQ to UQ and the range spans all of the data.]

    The upper and lower quartiles are produced by the summary command in R.


    Fiddly details

    We usually can't get exactly 25% of the data points below some value, so we do something in between.

    Example: Here are the maximum temperatures in Dunedin for the last 14 days.

    10, 11, 12, 11, 11, 11, 11, 11, 11, 11, 13, 14, 13, 12

    We sort the values:

    10, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 13, 13, 14

    Lower quartile: (14+1)/4 = 3.75. We want (0.25) times the 3rd value + (0.75) times the 4th, or

    0.25 × 11 + 0.75 × 11 = 11

    Upper quartile: (14+1) × (3/4) = 11.25. We want (0.75) times the 11th value + (0.25) times the 12th value, or

    0.75 × 12 + 0.25 × 13 = 12.25

    General instructions:

    To find the lower quartile from a sample of size n

    1 Sort the numbers in increasing order.

    $x_{(1)}, x_{(2)}, x_{(3)}, \ldots, x_{(n)}$

    2 Let m be the whole number part of (n+1)/4 and let r be the fractional part (e.g. if n = 6 then (n+1)/4 = 1.75, m = 1 and r = 0.75).

        1 If r = 0 then return $x_{(m)}$.
        2 If r = 0.25 then return $0.75x_{(m)} + 0.25x_{(m+1)}$.
        3 If r = 0.5 then return $0.5x_{(m)} + 0.5x_{(m+1)}$.
        4 If r = 0.75 then return $0.25x_{(m)} + 0.75x_{(m+1)}$.

    Note: different software and textbooks use different rules here.
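The instructions above can be written as a short R function. This is a sketch of the rule as stated in these notes, not of what summary() does by default; the function name `lecture_quartile` is made up here, and it assumes a sample of at least three values so that the positions used exist. All four cases collapse into one formula, (1 − r) × x(m) + r × x(m+1).

```r
# Quartile by the rule in these notes: position p * (n + 1) in the sorted data.
# p = 0.25 gives the lower quartile, p = 0.75 the upper quartile.
lecture_quartile <- function(x, p = 0.25) {
  x <- sort(x)
  position <- p * (length(x) + 1)
  m <- floor(position)   # whole number part
  r <- position - m      # fractional part
  if (r == 0) {
    return(x[m])
  }
  (1 - r) * x[m] + r * x[m + 1]
}

# Maximum temperatures in Dunedin from the example.
temps <- c(10, 11, 12, 11, 11, 11, 11, 11, 11, 11, 13, 14, 13, 12)

lecture_quartile(temps, 0.25)  # lower quartile
lecture_quartile(temps, 0.75)  # upper quartile
```

R's quantile() function offers several rules through its type argument, so results from summary() may differ slightly from this hand calculation.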

    Skinks

    Thirty-two traps were placed in each of three habitats: pasture, replanted forest and tussock on Stephens Island. The data are the counts of skinks per trap totalled over a ten-day period in each habitat.

    Pasture      4  3  0  2  2  1  4  1  2  5  0  1  5  6  5  6
                 11 3  1  1  4  8  5  14 6  8  10 7  4  8  1  3  6

    Replanted    15 24 31 8  4  18 14 33 11 16 20 1  17 12 27 26
                 18 6  12 16 11 8  13 12 11 8  10 17 29 3  12 5

    Tussock      14 23 15 14 5  16 10 16 14 10 7  10 8  12 19 17
                 7  12 29 10 11 11 10 10 6  13 7  10 8  12 6  12

    The summary command in R gives

       Pasture           Replanted          Tussock
    Min.   : 0.000    Min.   : 1.00    Min.   : 5.0
    1st Qu.: 2.000    1st Qu.: 9.50    1st Qu.: 9.5
    Median : 4.500    Median :12.50    Median :11.0
    Mean   : 4.875    Mean   :14.62    Mean   :12.0
    3rd Qu.: 6.250    3rd Qu.:18.00    3rd Qu.:14.0
    Max.   :14.000    Max.   :33.00    Max.   :29.0

    Box plots

    A box plot gives a graphical summary of some of these numbers.


    Box plots step one

    Draw a box and a middle line at the upper quartile, median and lower quartile.

    [Figure: box for the Tussock data with the upper quartile, median and lower quartile marked.]

    Box plots step two

    Upper (3rd) quartile is 14.0. IQR is 4.5.

    Starting at the top of the box, measure up a length of 1.5 × IQR. Here, that gives 14.0 + 1.5 × 4.5 = 20.75.

    Don't draw the line there. Instead find the largest data point that is at most 20.75. The sorted values in this example are

    5 6 6 7 7 7 8 8 10 10 10 10 10 10 10 11
    11 12 12 12 12 13 14 14 14 15 16 16 17 19 23 29

    The largest value that is at most 20.75 equals 19.

    We draw the top line at 19.

    Box plots step two

    [Figure: Tussock box plot with the top line drawn at 19; upper quartile, median and lower quartile marked.]

    Box plots step three

    In the same way we subtract 1.5 × IQR from the lower quartile. In this example, we get

    9.5 − 1.5 × 4.5 = 2.75

    The smallest data value that is at least as large as 2.75 equals 5. We draw the bottom line here.

    [Figure: Tussock box plot with the bottom line drawn at 5; upper quartile, median and lower quartile marked.]


    Box plots step three

    If there are any data values outside the range of the bottom and top line, then we plot crosses or dots for each one of them.

    5 6 6 7 7 7 8 8 10 10 10 10 10 10 10 11
    11 12 12 12 12 13 14 14 14 15 16 16 17 19 23 29

    In this case, 5 is the minimum value (none below that), but 23 and 29 are both above the top line.
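The box plot steps for the Tussock data can be retraced by hand in R. The sketch below follows the steps in these notes, using the quartiles that summary() reports; R's own boxplot() builds the box from "hinges", which can differ slightly from these quartiles, so treat this as an illustration of the method rather than a copy of what boxplot() does internally.

```r
tussock <- c(14, 23, 15, 14, 5, 16, 10, 16, 14, 10, 7, 10, 8, 12, 19, 17,
             7, 12, 29, 10, 11, 11, 10, 10, 6, 13, 7, 10, 8, 12, 6, 12)

# Step one: the box (quartiles as reported by summary()).
q   <- quantile(tussock, c(0.25, 0.75))
lq  <- q[[1]]
uq  <- q[[2]]
iqr <- uq - lq

# Steps two and three: measure 1.5 x IQR beyond the box, then pull the
# lines back to the nearest data values inside those limits.
upper_limit <- uq + 1.5 * iqr
lower_limit <- lq - 1.5 * iqr
top_line    <- max(tussock[tussock <= upper_limit])
bottom_line <- min(tussock[tussock >= lower_limit])

# Anything beyond the lines is plotted as an individual point.
outliers <- sort(tussock[tussock > top_line | tussock < bottom_line])

c(lq = lq, uq = uq, iqr = iqr)
c(upper_limit = upper_limit, lower_limit = lower_limit)
c(top_line = top_line, bottom_line = bottom_line)
outliers
```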

    [Figure: Tussock box plot with the outlying values 23 and 29 plotted as separate points above the top line.]

    Summary of box plots

    [Figure: annotated Tussock box plot. Labels, from top to bottom: outliers (extreme values); largest value at most 1.5 × IQR above UQ; upper quartile; median; lower quartile; smallest value at least 1.5 × IQR below LQ.]

    Comparing variables

    Box plots provide some information about the centre, range and symmetry of the data. We can easily put multiple box plots on one graph. In R, use boxplot() to see box plots for all variables in a data set.

    Combining histograms and box plots

    Box plots will soon be superseded by violin plots.

    To make these, use the commands

    library(UsingR)

    violinplot(Skinks)


    STAT115

    Introduction to Biostatistics

    Megan Drysdale

    University of Otago

    Course Co-ordinator: Dr Tilman Davies

    Section 2: Data Description and Presentation

    Lecture 7

    Overview

    In the last few lectures we looked at statistics and graphs used to summarise the data in a sample.

    Today we will look at two statistics used in epidemiology, incidence and prevalence. They get special treatment because

    They are very widely used, and you'll need to know exactly what they mean,

    They're often confused, or at least confusing.

    Proportions and rates

    Measures of disease frequency are typically presented in the form numerator/denominator.

    Recall:

    Proportion. Expresses the value as a fraction of the whole. Example: In STAT 115 this year there are 310 female students and 138 male students. The proportion of female students is

    310/(310 + 138) = 0.69

    The denominator and numerator have the same units.

    Rate. A fraction where the numerator and denominator have different units (e.g. children per family, new cases per year). Usually, the denominator is a measure of time.

    Prevalence

    Prevalence gives the frequency of existing cases of disease. It is useful for measuring the disease burden in a community and is often measured in a cross-sectional survey.

    Example: The proportion of students in this class who currently have a cold.

    Example: The proportion of Otago students who had swine flu at 3pm last Tuesday.


    Prevalence of eye disease

    In a survey of eye disease among 2477 people aged 52-85 in Framingham, Massachusetts, there were 310 with cataracts and 22 blind.

    Prevalence of cataracts = 310/2477 = 0.125 = 125 cases per 1000 people = 12.5%

    Prevalence of blindness = 22/2477 = 0.009 = 9 cases per 1000 people = 0.9%

    Prevalences and timelines

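    In the following diagram shaded lines indicate the time each person has the disease.

    [Figure: disease timelines for subjects 1 to 5 plotted against time, with the prevalence read off at four time points.]

    Prevalence at the four time points: 1/5, 2/5, 3/5, 2/5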

    Note on prevalence

    Prevalence is the proportion of people in a population who have the disease at a given point in time.

    The time point may refer to calendar time, or to a fixed point in the course of events.

    Example: the proportion of people free from back pain 2 months after back injury. (The time point here is relative to an event, rather than an absolute time.)

    Incidence

    Incidence measures the frequency of new cases of a disease. As such, it is useful for looking at the causes of disease.

    Example

    How many people in this lecture theatre currently have a cold? (Prevalence)

    How many people in this lecture theatre got a cold this week? (Incidence)


    Cumulative incidence

    Cumulative incidence is the proportion of people who become diseased during a specified period of time.

    Cumulative incidence = (number of new cases of disease during time period) / (size of total population at risk)

    This provides an estimate of the probability, or risk, that an individual will develop the disease during the specified period of time.

    The period of time could be one day, one week, one year, five years etc.

    Cumulative incidence example

    A study of heart disease was made in Evans County, Georgia.

    There were 609 men aged 40-76 who had no detected heart disease in 1960.

    These men were followed for 7 years and 71 cases of heart disease were detected during this period.

    Number of cases = 71. Population size = 609.

    Cumulative incidence = 71/609 = 0.117 cases per person = 11.7% over the 7 year period.

    NOTE: For the cumulative incidence to be interpretable, the time period must be specified.

    Incidence rate motivation

    Cumulative incidence calculations assume that the population at the start of the study is exactly the same as at the end of the study.

    In practice, people are lost to follow up, and people enter the study at different times.

    We therefore compute a rate of incidence which is not dependent on the exact study time, but instead summarises the incidence per year.

    People-time

    People-time is the total time each person in the population is

    part of the study, and

    at risk (for the disease being studied).

    It is the same if we follow 16 people for one year or 4 people for four years, or 1 person for 16 years.


    People-time at risk

    Subject    Outcome                  Total time at risk (years)
    A          - (lost to follow up)    2.0
    B          x                        3.0
    C          -                        5.0
    D          -                        4.0
    E          x                        2.5

    [Figure: timeline from Jan 1997 to Jan 2002 marking the initiation of follow-up for each subject; x = development of disease.]

    Number of person-years at risk = 16.5.

    Incidence rate:

    Incidence rate = (number of new cases of disease) / (total person-time at risk)

    In the previous example

    The number of new cases was two.

    The number of person-years at risk = 16.5

    Incidence rate = 2/16.5 = 0.121,

    which is 0.121 cases per person-year of observation, or 12.1 cases per 100 person-years of observation.
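The same calculation in R, using the five subjects from the people-time table. The vector names are chosen here for the sketch.

```r
# Years at risk for each subject, and whether they developed the disease.
time_at_risk <- c(A = 2.0, B = 3.0, C = 5.0, D = 4.0, E = 2.5)
new_case     <- c(A = FALSE, B = TRUE, C = FALSE, D = FALSE, E = TRUE)

person_years   <- sum(time_at_risk)              # total person-time at risk
incidence_rate <- sum(new_case) / person_years   # cases per person-year

round(incidence_rate, 3)
round(100 * incidence_rate, 1)   # cases per 100 person-years
```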

    Hepatitis

    Example. Frequency of hepatitis in two regions.

    Location    New cases of hepatitis    Reporting period    Population
    Region A    58                        1985                25,000
    Region B    35                        1984-1985           7,000

    In Region A,

    Number of cases = 58

    Person-years = 1 × 25000 = 25000

    Incidence rate = 58/25000 cases per person-year

    = 2.32 cases per 1000 people per year.

    Hepatitis

    Example. Frequency of hepatitis in two regions.

    Location    New cases of hepatitis    Reporting period    Population
    Region A    58                        1985                25,000
    Region B    35                        1984-1985           7,000

    In Region B,

    Number of cases = 35

    Person-years = 2 × 7000 = 14000

    Incidence rate = 35/14000 cases per person-year

    = 2.50 cases per 1000 people per year.
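Putting both regions in one small data frame makes the comparison explicit. This is a sketch; the column names are chosen here, and person-years are taken as reporting years times population, as in the calculations above.

```r
hepatitis <- data.frame(
  region     = c("A", "B"),
  cases      = c(58, 35),
  years      = c(1, 2),          # length of the reporting period
  population = c(25000, 7000)
)

# Person-years at risk and incidence rate per 1000 people per year.
hepatitis$person_years  <- hepatitis$years * hepatitis$population
hepatitis$rate_per_1000 <- 1000 * hepatitis$cases / hepatitis$person_years

hepatitis
```

Region A reports more cases, but once the population and reporting period are taken into account Region B has the slightly higher rate.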


    Strokes in women aged 30-55

    A study in the United States measured the incidence rate of stroke in a group of 118,539 women aged 30-55 years. The women were free from stroke in 1986, and were followed for 8 years.

    Smoking         Num. cases    Person-years of    Stroke incidence rate
    category        of stroke     observation        (per 100,000
                                  (over 8 years)     person-years)
    Never smoked     70           395,594            17.7
    Ex-smoker        65           232,712            27.9
    Smoker          139           280,141            49.6
    Total           274           908,447            30.2

    (Total) incidence rate = (274/908,447) × 100,000 = 30.2 cases of stroke per 100,000 person-years of observation.

    Average follow-up per woman (person-time) = 908,447/118,539 = 7.7 years.
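The rate column, the total rate and the average follow-up can all be recomputed from the case counts and person-years in the table. The sketch below uses column names chosen here.

```r
stroke <- data.frame(
  group        = c("Never smoked", "Ex-smoker", "Smoker"),
  cases        = c(70, 65, 139),
  person_years = c(395594, 232712, 280141)
)

# Incidence rate per 100,000 person-years for each smoking category.
stroke$rate_per_100000 <- 1e5 * stroke$cases / stroke$person_years

# Overall rate and average follow-up for the 118,539 women.
total_cases        <- sum(stroke$cases)
total_person_years <- sum(stroke$person_years)
total_rate         <- 1e5 * total_cases / total_person_years
mean_follow_up     <- total_person_years / 118539

round(stroke$rate_per_100000, 1)
round(total_rate, 1)
round(mean_follow_up, 1)
```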

    Notes on incidence rate

    The denominator for measures of incidence should include only those whoare at risk of developing the disease. It should exclude

    those who already have the disease

    those who cannot develop the disease

    Failure to do this will lead to an underestimate of the true incidence since fewer will develop the condition.

    For example, when studying the incidence of endometrial cancer we should exclude women who have had a hysterectomy.

    Incidence versus prevalence

    Example: Disease A

    [Figure: disease timelines for subjects 1 to 5 over a period of t years, with time L marked on the time axis.]

    Cumulative incidence = 5/5 in t years.

    Incidence rate = 5/(5t) cases per person-year.

    Prevalence at time L = 2/5.

    Incidence versus prevalence

    Example: Disease B

    [Figure: disease timelines for subjects 1 to 5 over a period of t years, with time L marked on the time axis.]

    Cumulative incidence = 5/5 in t years.

    Incidence rate = 5/(5t) cases per person-year.

    Prevalence at time L = 5/5.


    Prevalence versus incidence

    Prevalence depends on the

    incidence rate, as well as the

    duration of disease.

    Adult onset diabetes has a low incidence rate but a long duration, as the disease is neither curable nor fatal. Hence prevalence is high relative to incidence.

    A cold has a (very) high incidence, but the duration is short, so prevalence is low relative to incidence.


    STAT115

    Introduction to Biostatistics

    Megan Drysdale

    University of Otago

    Course Co-ordinator: Dr Tilman Davies

    Section 2: Data Description and Presentation

    Lecture 8

    Afterword: units

    There are different expressions for the units used with incidence rate.

    The simplest (I think) is new cases per person-year. This is the same as

    x cases per person per year.

    where x is some fraction.

    Sometimes you see

    cases per 100 people per year (multiply x by 100 for this)

    cases per 100,000 people per year (multiply x by 100,000 for this)

    The following mean the same

    0.002 cases per person per year

    0.2 cases per 100 people per year

    200 cases per 100,000 people per year

    Afterword: units again

    In the same way we can change the period of time. The following are the same

    0.002 cases per person-year

    0.002 cases per person per year

    0.2 cases per 100 people per year

    0.02 cases per person per 10 years

    2 cases per 100 people per 10 years
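These conversions are just multiplications, which a few lines of R make explicit. The variable names below are chosen here for the sketch.

```r
rate <- 0.002   # cases per person per year (= cases per person-year)

per_100_per_year     <- rate * 100      # change the number of people
per_100000_per_year  <- rate * 100000
per_person_per_10yrs <- rate * 10       # change the period of time
per_100_per_10yrs    <- rate * 100 * 10 # change both

c(per_100_per_year, per_100000_per_year, per_person_per_10yrs, per_100_per_10yrs)
```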

    Sometimes we get sloppy....

    Example. Frequency of hepatitis in two regions.

    Location    New cases of hepatitis    Reporting period    Population
    Region A    58                        1985                25,000
    Region B    35                        1984-1985           7,000

    In Region A,

    Number of cases = 58

    Person-years = 1 × 25000 = 25000

    Incidence rate = 58/25000 per year

    = 2.32 per 1000 per year.


    Sometimes we get sloppy....

    Example. Frequency of hepatitis in two regions.

    Location    New cases of hepatitis    Reporting period    Population
    Region A    58                        1985                25,000
    Region B    35                        1984-1985           7,000

    In Region A,

    Number of cases = 58

    Person-years = 1 × 25000 = 25000

    Incidence rate = 58/25000 cases per person-year

    = 2.32 cases per 1000 people per year.

    Overview

    In the last few lectures we looked at statistics and graphs used to summarise the data in a sample.

    Today we will look at statistics which are used to assess disease association. This is an extremely important field of statistical inference. The study of associations between diseases and different factors of groups is an important step in identifying causes and/or potential treatments.

    Example association

    Data from a cohort study of oral contraceptive (OC) use and bacteria in the urine among women aged 16-49 years over 3 years.

    Recall from lecture two: a cohort study is one where we start with a complete population.

    In this case, the researchers visited every household in a specific Italian-American neighbourhood.

    Data

                    Bacteria Present
                    Yes     No      Total
    OC Use   Yes     27     455      482
             No      77    1831     1908
    Total           104    2286     2390

    Bacteria Present is called the Disease Category (the outcome variable)

    OC use is called the Exposure Category (explanatory variable).

    Cumulative incidence

    OC users: 27/482 = 0.056, or 56 cases per 1000 in the 3 years.

    Non users: 77/1908 = 0.040 or 40 cases per 1000 in the 3 years.


    Measuring association

    We will look at two different ways to measure association in the data:

    1 The Relative effect (also known as the relative risk). This expresses (or more correctly, estimates) the incidence rate of the disease within one population (group) relative to the rate in the other population (group).

    2 The Absolute effect expresses the absolute difference in incidence between the two groups.

    Both measures are still statistics and might still reflect random error. Later in the course you'll learn how to model and test whether a given association is statistically significant or not.

    Relative risk

    The relative effect (RR) (also called relative risk) is the ratio of the (cumulative) incidence in the exposed group (Ie) to the incidence in the unexposed group (I0).

    RR = Ie / I0

    If RR is...

    > 1 (exposure → disease)
    = 1 (no association between exposure and disease)
    < 1 (exposure is protective).

    Indicates how much more likely disease is to develop in the exposed group than in the unexposed group.

    Good measure of strength of an association, and the usual measure in studies of causation of disease.

    We can also calculate ratios of prevalences, but the interpretation is different.

    Attributable risk

    The absolute effect, or attributable risk (AR), is the difference in incidence between exposed and unexposed groups:

    AR = Ie − I0.

    Assuming a cause-effect relationship between exposure and disease, we say:

    If AR is...

    > 0: AR equals the number of cases attributed to exposure
    = 0: no association between exposure and disease
    < 0: −AR is the number of cases prevented by exposure.

    AR has the same units as the incidence rate (cases per person-time).

    Association with OC

                      Bacteria Present
                      Yes     No      Total
    OC Use    Yes     27      455     482
              No      77      1831    1908
    Total             104     2286    2390

    For the three year period:

    Ie = 27/482 = 56 per 1000        I0 = 77/1908 = 40 per 1000

    RR = 56/40 = 1.4,        AR = 56 − 40 = 16 per 1000.
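As a rough check on the OC example, here is a minimal Python sketch of the relative and attributable risk calculation. The function names are invented for illustration. It works with the unrounded incidences, so the attributable risk comes out near 15.7 per 1000 rather than the 16 per 1000 obtained from the rounded values 56 and 40.

```python
def cumulative_incidence(cases, total):
    """Proportion of the group that developed the outcome."""
    return cases / total

def relative_risk(i_exposed, i_unexposed):
    """Ratio of incidence in the exposed group to the unexposed group."""
    return i_exposed / i_unexposed

def attributable_risk(i_exposed, i_unexposed):
    """Difference in incidence between exposed and unexposed groups."""
    return i_exposed - i_unexposed

# OC users: 27 cases out of 482; non-users: 77 cases out of 1908.
i_e = cumulative_incidence(27, 482)
i_0 = cumulative_incidence(77, 1908)

rr = relative_risk(i_e, i_0)
ar = attributable_risk(i_e, i_0)

print(f"RR = {rr:.1f}")
print(f"AR = {ar * 1000:.1f} per 1000 over the 3 years")
```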


    Does infra-red treatment help with arthritis?

    A randomised trial of the effectiveness of infra-red stimulation compared with placebo on pain caused by cervical osteoarthritis (degenerative joint disease in the neck) carried out over two months. The control group were given a placebo treatment.

                              Treatment    Control
    Improvement in pain       18           8
    No improvement in pain    7            17
    Total                     25           25

    Improvement/No improvement is the Disease Category (the outcome variable)

    Treatment/control is the Exposure Category (explanatory variable).

    Does infra-red treatment help with arthritis?

                              Treatment    Control
    Improvement in pain       18           8
    No improvement in pain    7            17
    Total                     25           25

    Cumulative incidence over the two months:

    Ie = 18/25 I0 = 8/25.

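    Relative risk is

    RR = Ie / I0 = (18/25) / (8/25) = 2.25 ≈ 2.3.

    Absolute effect (attributable risk) is

    AR = Ie − I0 = 18/25 − 8/25 = 10/25 = 0.40, i.e. 10 more patients improving per 25 treated.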

    Variations on a theme

    Sometimes the setup or presentation of data and relative risk varies.

    1 Relative and absolute risk can be computed in terms of prevalence rather than just cumulative incidence. The resulting quantities have a slightly different interpretation, but the computation is the same. See the Framingham study.

    2 Relative and absolute risk can also be computed from incidence rates, for the same reasons that we often look at incidence rates instead of cumulative incidence (e.g. variable risk times for people in the sample). See the hormone and heart study.

    3 It can be informative to compare relative risk of different diseases or conditions. We will see that in common diseases, a smaller relative risk can lead to a larger absolute risk. See the British doctors study.

    Framingham study: relative risk from prevalence

    Prevalence of coronary heart disease (CHD) at initial examination among 4469 persons 30-62 years of age in the Framingham Study.

               Number examined    Number with CHD    Prevalence per 1,000
    Males      2024               48                 23.7
    Females    2445               28                 11.5

    Compute relative and absolute risk using prevalence per 1000 (note 23.7 = 48/2024 × 1000).

    Consider males as the exposed group.

    RR = 23.7/11.5 = 2.1        AR = 23.7 − 11.5 = 12.2.

    Heart disease is about twice as common among the males, and there are 12.2 more cases of heart disease per 1000 men than per 1000 women.


    Hormone and heart study: relative risk from incidence rates

    Data from a cohort study of postmenopausal hormone use and coronary heart disease among female nurses.

                                Coronary heart disease
                                Yes     No      Person-years
    Postmenopausal      Yes     30      -       54,308.7
    hormone use         No      60      -       51,477.5

    Incidence rate:

    Users: 30/54308.7 = 55 per 100,000 person-years
    Non-users: 60/51477.5 = 117 per 100,000 person-years

    Attributable Risk:

    55 − 117 = −62 cases of CHD per 100,000 person-years
    Hormone use prevents 62 cases per 100,000 person-years

    Relative Risk: 55/117 = 0.47
    The risk of CHD among users is 0.47 times the risk in non-users (a 53% reduction in risk).
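The same two measures can be computed from incidence rates instead of cumulative incidence. The sketch below is illustrative only (the variable names are my own) and uses the unrounded rates, so the attributable risk lands between −62 and −61 per 100,000 person-years rather than exactly at the −62 obtained from the rounded rates 55 and 117.

```python
PER = 100_000  # report rates per 100,000 person-years

# Cases divided by person-years of follow-up, from the table above.
users_rate = 30 / 54_308.7
non_users_rate = 60 / 51_477.5

# Hormone users are treated as the exposed group.
rr = users_rate / non_users_rate
ar = (users_rate - non_users_rate) * PER

print(f"Users:     {users_rate * PER:.0f} per 100,000 person-years")
print(f"Non-users: {non_users_rate * PER:.0f} per 100,000 person-years")
print(f"RR = {rr:.2f}, AR = {ar:.1f} per 100,000 person-years")
```

A negative AR together with an RR below 1 both point the same way: in these data the exposure is associated with fewer cases.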

    Doctors study: comparing relative risk

    Relative and attributable risks of mortality from lung cancer and coronary heart disease among cigarette smokers in a cohort study in British male physicians.

    Annual mortality rate per 100,000

                    Lung cancer    Heart disease
    Smokers         140            669
    Non-smokers     10             413

    RR              14.0           1.6
    AR              130            256

    (Risk is per 100,000 per year)

    Heart disease is more common, therefore a smaller relative increase in risk produces more people with disease.
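To see the contrast between the two measures side by side, the table can be recomputed with a short Python sketch. The dictionary layout and the helper name `compare` are assumptions made for this illustration; the rates are copied from the table.

```python
# Annual mortality rates per 100,000, copied from the table above.
rates = {
    "lung cancer": {"smokers": 140, "non_smokers": 10},
    "heart disease": {"smokers": 669, "non_smokers": 413},
}

def compare(disease):
    """Return (relative risk, attributable risk) for smokers vs non-smokers."""
    exposed = rates[disease]["smokers"]
    unexposed = rates[disease]["non_smokers"]
    return exposed / unexposed, exposed - unexposed

for disease in rates:
    rr, ar = compare(disease)
    print(f"{disease}: RR = {rr:.1f}, AR = {ar} per 100,000 per year")
```

Lung cancer has the much larger relative risk, but heart disease has the larger attributable risk because its baseline rate is so much higher.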

    Notes

    Relative risks

    provide information on the strength of an association;

    can be used to assist in assessment of the likelihood of a causal association.

    Attributable risks

    measure the impact of an exposure (assuming that it is causal).

    If a disease is common, a small relative risk will translate to a large attributable risk. [see previous example]


    STAT115

    Introduction to Biostatistics

    Dr Tilman Davies

    University of Otago

    Section 3. Probability

    Lecture 9

    First of all: What is Probability?

    There are varying definitions of probability

    Statisticians can be split into two main groups who have differing views on probability.

    Frequentists consider probability to be the relative frequency in the long run of outcomes.

    Bayesians consider probability to be a way to represent an individual's degree of belief in a statement given the evidence.

    Consider these statements.

    We can quantify these probabilities (Frequentist)
    What is the probability I win lotto tonight?
    What is the probability I roll a 6?

    Based on personal and subjective belief (Bayesian)
    What is the chance I pass Stat110?
    What is the chance I will do an OE after graduating?

    Some definitions

    Experiment = process by which observations/measurements are obtained
    e.g. tossing a fair die

    Event = outcome of experiment
    e.g. rolling a 2, rolling a 5, etc.

    Sample space = the set of all possible outcomes
    In this case, {1, 2, 3, 4, 5, 6}


    Conditions for a valid probability

    Each probability is between 0 and 1

    The sum of the probabilities over all possible events is 1
    In other words, the sum of the total probability for all possible outcomes is equal to 1

    If the event A cannot happen then Pr(A) = 0

    If the event A is certain to happen then Pr(A) = 1

    Calculating Probabilities

    Probability of an event A is

    Pr(A) = (number of trials for which A is true) / (total number of trials) = nA / N

    We don't always need to conduct the experiments to measure these, as we can make logical arguments.
    For example, flipping a coin: Pr(heads) = Pr(tails) = 1/2
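The relative-frequency definition Pr(A) = nA / N can be illustrated with a small simulation. This is only a sketch: the function name, the fixed seed, and the number of trials are arbitrary choices made for this example, and a simulated coin is of course not a real experiment.

```python
import random

def relative_frequency_of_heads(n_trials, seed=0):
    """Estimate Pr(heads) as nA / N from simulated fair coin flips."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    n_heads = sum(rng.random() < 0.5 for _ in range(n_trials))
    return n_heads / n_trials

for n in (10, 1_000, 100_000):
    print(n, relative_frequency_of_heads(n))
```

With more trials the relative frequency tends to settle near the value 1/2 obtained by the logical argument.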

    Some more things to know

    Complementary Events

    Two events are complementary if exactly one of them is always true.
    For example, the coin flip. Will always be either heads or tails.

    Ā (read "not A") is called the complement of A.

    Pr(A) + Pr(Ā) = 1

    The Rules

    Addition Rule

    Pr(A or B) = Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B)

    Example
    A = Rolling an even number {2,4,6}
    B = Rolling 3 or less {1,2,3}
    Pr(A or B) = Pr(A) + Pr(B) − Pr(A ∩ B)
               = 3/6 + 3/6 − 1/6
               = 5/6

    Intuitively, this makes sense!
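One way to convince yourself of the addition rule is to count outcomes directly. The sketch below represents events as Python sets over the sample space of a fair die; the helper `pr` is a made-up name and assumes all six outcomes are equally likely.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}

def pr(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

A = {2, 4, 6}  # rolling an even number
B = {1, 2, 3}  # rolling 3 or less

direct = pr(A | B)                    # count outcomes in A or B
by_rule = pr(A) + pr(B) - pr(A & B)   # addition rule

print(direct, by_rule)
```

Subtracting Pr(A ∩ B) removes the outcome 2, which would otherwise be counted twice.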


    The Rules

    Multiplication Rule

    Pr(A and B) = Pr(A ∩ B) = Pr(A) Pr(B | A)

    Please read Pr(B | A) as "probability of B given A"

    Example
    A = Rolling an even number {2,4,6}
    B = Rolling 3 or less {1,2,3}
    Pr(A and B) = Pr(A) × Pr(B | A)
                = 3/6 × 1/3
                = 1/6

    If you think carefully, this makes sense too!
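The multiplication rule can be checked the same way by counting. In this sketch Pr(B | A) is found by restricting attention to the outcomes in A; the helper names are invented for illustration and equally likely outcomes are assumed.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}  # rolling an even number
B = {1, 2, 3}  # rolling 3 or less

def pr(event):
    """Probability of an event on a fair die."""
    return Fraction(len(event), len(sample_space))

def pr_given(event, given):
    """Pr(event | given): the share of outcomes in 'given' that are also in 'event'."""
    return Fraction(len(event & given), len(given))

pr_b_given_a = pr_given(B, A)       # only 2 is both even and 3 or less
by_rule = pr(A) * pr_b_given_a      # multiplication rule
direct = pr(A & B)                  # count outcomes in both A and B

print(pr_b_given_a, by_rule, direct)
```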

    Addition rule - special case

    Mutually Exclusive Events

    There is no intersection between the two events.
    In other words, events are mutually exclusive if they cannot both occur.
    Back to coin flipping: getting heads and tails.

    In this case the addition rule simplifies to

    Pr(A or B) = Pr(A ∪ B) = Pr(A) + Pr(B)

    Because A and B cannot both occur, Pr(A ∩ B) = 0

    Multiplication rule - special case

    Independent Events

    When the occurrence of one event does not affect the outcome of another event.
    For example, flipping 3 heads in a row.

    In this case, Pr(B | A) = Pr(B).

    The multiplication rule simplifies to

    Pr(A and B) = Pr(A ∩ B) = Pr(A) Pr(B)

    Think about it:
    A = First 3 flips are heads.
    B = The 4th flip is heads.
    Pr(B | A) = Pr(B)..... A and B are independent, for sure!

    Blood donor example

    The probability of being in each of the 4 blood groups (Dunedindonor centre)

    Blood Type    Probability
    A             0.38
    B             0.11
    AB            0.04
    O             0.47


    Blood donor example - Addition Rule

    What is the probability that a person is either A or B?

    Pr(A or B) = Pr(A) + Pr(B)

    = 0.38 + 0.11

    = 0.49

    Note that A and B are mutually exclusive, that is they cannot both occur, and Pr(A and B) = 0.

    Blood donor example - Multiplication Rule

    What is the probability that 3 randomly selected people have blood group O?

    Pr(O) Pr(O) Pr(O) = 0.47³

    = 0.104

    (under the assumption of independence)
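Both blood donor calculations can be reproduced in a few lines. This sketch stores the table as a dictionary (a layout chosen just for this example) and relies on the same two assumptions as the slides: blood groups are mutually exclusive, and the three people are selected independently.

```python
import math

# Probabilities from the Dunedin donor centre table above.
blood_groups = {"A": 0.38, "B": 0.11, "AB": 0.04, "O": 0.47}

# A valid set of probabilities over all outcomes sums to 1.
total = sum(blood_groups.values())

# Addition rule for mutually exclusive events.
pr_a_or_b = blood_groups["A"] + blood_groups["B"]

# Multiplication rule for three independent people.
pr_three_o = blood_groups["O"] ** 3

print(f"Pr(A or B) = {pr_a_or_b:.2f}")
print(f"Pr(O, O, O) = {pr_three_o:.3f}")
```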

    Hospital Patients

    A survey of hospital patients shows that the probability a patient has high blood pressure given he/she is diabetic is 0.85. If 10% of the patients are diabetic and 25% have high blood pressure:

    Find the probability a patient has both diabetes and high blood pressure.

    Are the conditions of diabetes and high blood pressure independent?

    Hospital Patients - Relevant information

    A survey of hospital patients shows that the probability a patient has high blood pressure given he/she is diabetic is 0.85. If 10% of the patients are diabetic and 25% have high blood pressure

    Let A be the event "A patient has high blood pressure"

    Let B be the event "A patient is diabetic"

    Pr(A|B) = 0.85

    Pr(B) = 0.10

    Pr(A) = 0.25


    Hospital Patients - Question 1

    Pr(A|B) = 0.85

    Pr(B) = 0.10

    Pr(A) = 0.25

    Find the probability a patient has both diabetes and high blood pressure.

    Pr(A ∩ B) = Pr(A | B) Pr(B)

    = 0.85 × 0.10

    = 0.085

    Hospital Patients - Question 2

    Pr(A|B) = 0.85

    Pr(B) = 0.10

    Pr(A) = 0.25

    Are the conditions of diabetes and high blood pressure independent?

    Remember when discussing the special case of the multiplication rule we said if A and B are independent then:

    Pr(A | B) = Pr(A)

    We can use this to test for independence.

    Pr(A | B) = 0.85

    Pr(A) = 0.25

    Pr(A) ≠ Pr(A | B)

    ⇒ A and B are not independent
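The two hospital questions can be worked through in code as a quick check. The variable names below are illustrative. The sketch also shows an equivalent test of independence: compare Pr(A ∩ B) with Pr(A) Pr(B), which would be equal if the events were independent.

```python
import math

pr_a_given_b = 0.85  # Pr(high blood pressure | diabetic)
pr_b = 0.10          # Pr(diabetic)
pr_a = 0.25          # Pr(high blood pressure)

# Question 1: multiplication rule.
pr_a_and_b = pr_a_given_b * pr_b

# Question 2: independence would require Pr(A | B) = Pr(A),
# or equivalently Pr(A and B) = Pr(A) Pr(B).
independent = math.isclose(pr_a_given_b, pr_a)
product_if_independent = pr_a * pr_b

print(f"Pr(A and B) = {pr_a_and_b:.3f}")
print(f"Pr(A) Pr(B) = {product_if_independent:.3f}")
print("Independent?", independent)
```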

    Summary

    Calculating a probability.

    Pr(A) = (number of trials for which A is true) / (total number of trials) = nA / N

    Addition Rule.
    Pr(A or B) = Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B)

    Multiplication Rule.
    Pr(A and B) = Pr(A ∩ B) = Pr(A) Pr(B | A)

    Read Pr(B | A) as "probability of B given A".

    Addition Rule - Mutually Exclusive
    Pr(A ∪ B) = Pr(A) + Pr(B)

    Multiplication Rule - Independent Events
    Pr(A ∩ B) = Pr(A) Pr(B)

    Questions

    For each question, assume a standard deck of 52 playing cards:

    What is the probability the first card drawn is a 4 or a 5?

    Say the first card drawn is a 5. What is the probability the second card drawn is an Ace?
    (Think of this as "What is the probability the second card drawn is an Ace, given the first card drawn was a 5?")

    Different question: What is the probability the first card drawn is a 5 and the second card drawn is an Ace?


    Fair Die Example - Visualisation

    A fair die is thrown.

    Define Event A: a number greater than 3 is thrown

    Define Event B: an even number is thrown

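    [Venn diagram: circle A contains 4, 5 and 6; circle B contains 2, 4 and 6; the overlap contains 4 and 6; 1 and 3 lie outside both circles.]

    A ∪ B = {2, 4, 5, 6}, therefore Pr(A ∪ B) = 4/6 = 2/3

    A ∩ B = {4, 6}, therefore Pr(A ∩ B) = 2/6 = 1/3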

    Event A: a number greater than 3 is thrown

    Event B: an even number is thrown

    Calculate Pr(A ∩ B): Multiplication rule

    Brief thought exercise:

    Pr(B | A) = Pr(even number, given we have rolled a 4, 5, or 6) = 2/3

    Pr(A ∩ B) = Pr(A) Pr(B | A)

    Pr(A ∩ B) = 1/2 × 2/3

    Pr(A ∩ B) = 1/3

    The same result we observed in our diagram.

    Find Pr(A ∪ B)

    Use addition rule

    Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B)

    Pr(A ∪ B) = 1/2 + 1/2 − 1/3

    Pr(A ∪ B) = 2/3

    The same result we observed in our diagram.

    Tree Diagrams

    Useful for examining combinations of several random variables.

    Always use A and Ā for your Event names.

    Choose what letters you use wisely!

    Rules:

    Add vertically.
    Multiply across.


    Basic Tree Diagrams

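    [Tree diagram: the first split is into A and Ā; each of these then splits into B and B̄.]

    Tree Diagrams

    First branch    Second branch      End of path
    A   Pr(A)       B   Pr(B | A)      Pr(A ∩ B)
                    B̄   Pr(B̄ | A)      Pr(A ∩ B̄)
    Ā   Pr(Ā)       B   Pr(B | Ā)      Pr(Ā ∩ B)
                    B̄   Pr(B̄ | Ā)      Pr(Ā ∩ B̄)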

    Independent Stages

    Stephens Island is an uninhabited island in Cook Strait where tuatara are being re-established. For some years three locations have been visited on the island, and tuatara have been found at each of the different locations with probability 0.4. At any visit, X represents the number of locations out of three at which tuatara are observed. X can take values 0, 1, 2, or 3. Find the probabilities that 0, 1, 2, or 3 locations have tuatara on a visit. T is the event "location has tuatara" and T̄ is the complementary event "location has no tuatara".

    Define Event T: Location has tuatara
    Define Event T̄: Location does NOT have tuatara

    Tree Diagrams - Building Step 1

    Location 1
    T̄ (0.6)
    T (0.4)


    Tree Diagrams - Building Step 2

    Location 1    Location 2
    T̄ (0.6)       T̄ (0.6)
                  T (0.4)
    T (0.4)       T̄ (0.6)
                  T (0.4)

    Tree Diagrams - Building Step 3

    Location 1    Location 2    Location 3
    T̄ (0.6)       T̄ (0.6)       T̄ (0.6)
                                T (0.4)
                  T (0.4)       T̄ (0.6)
                                T (0.4)
    T (0.4)       T̄ (0.6)       T̄ (0.6)
                                T (0.4)
                  T (0.4)       T̄ (0.6)
                                T (0.4)

    Tree Diagrams - Building Step 4

    Location 1    Location 2    Location 3    X
    T̄ (0.6)       T̄ (0.6)       T̄ (0.6)       X = 0
                                T (0.4)       X = 1
                  T (0.4)       T̄ (0.6)       X = 1
                                T (0.4)       X = 2
    T (0.4)       T̄ (0.6)       T̄ (0.6)       X = 1
                                T (0.4)       X = 2
                  T (0.4)       T̄ (0.6)       X = 2
                                T (0.4)       X = 3

    Probabilities of observing tuatara

    Find the probability of seeing tuatara at the first site observed.

    This was given in the problem statement:

    Pr(T) = 0.4

    Also found by following one T branch in the tree.

    Find the probability of seeing tuatara at the first two sites

    (we assume independence of each site)

    Pr(T ∩ T) = Pr(T) Pr(T) = 0.4 × 0.4 = 0.16

    This is also found by multiplying the probabilities following two T branches on the tree diagram (multiply across rule).


    Probabilities of observing tuatara

    Find the probability of seeing tuatara at all three sites.
    (again assume independence)

    Pr(T ∩ T ∩ T) = Pr(T) Pr(T) Pr(T) = 0.4 × 0.4 × 0.4 = 0.064

    This is also found by multiplying the probabilities along three T branches on the tree diagram (multiply across rule).

    Add down rule

    Find the probability of seeing tuatara at two of the three sites:


    Pr(X = 2) = Pr(TTT̄ ∪ TT̄T ∪ T̄TT)

    = Pr(TTT̄) + Pr(TT̄T) + Pr(T̄TT)

    = 0.096 + 0.096 + 0.096

    = 0.288


    Find the probability of seeing tuatara at one of the three sites

    Take advantage of the fact that all possibilities add to 1

    Pr(X = 2) = 0.288

    Pr(X = 0) = 0.6 × 0.6 × 0.6 = 0.216

    Pr(X = 3) = 0.4 × 0.4 × 0.4 = 0.064

    Pr(X = 1) = 1 − 0.288 − 0.216 − 0.064

    = 0.432
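The "multiply across, add down" procedure can be sketched by enumerating every path through the tree. This is a minimal illustration under the same independence assumption as the slides; the variable names are my own.

```python
from collections import defaultdict
from itertools import product

P_TUATARA = 0.4   # Pr(T) at each location
N_LOCATIONS = 3

distribution = defaultdict(float)

# Each path is one route through the tree, e.g. (True, False, True) = T, not-T, T.
for path in product([True, False], repeat=N_LOCATIONS):
    path_probability = 1.0
    for has_tuatara in path:
        # Multiply across: one factor per branch along the path.
        path_probability *= P_TUATARA if has_tuatara else 1 - P_TUATARA
    # Add down: paths with the same number of T branches give the same X.
    distribution[sum(path)] += path_probability

for x in sorted(distribution):
    print(f"Pr(X = {x}) = {distribution[x]:.3f}")
```

Enumerating the paths also gives Pr(X = 1) directly, without using the fact that the four probabilities sum to 1.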

    Summary

    Conditional Probability

    Pr(A ∩ B) = Pr(A) Pr(B | A)

    Pr(B | A) = Pr(A ∩ B) / Pr(A)

    Introduction to Tree diagrams

    Multiply across, Add down

    Tree Diagrams

    First branch    Second branch      End of path
    A   Pr(A)       B   Pr(B | A)      Pr(A ∩ B)
                    B̄   Pr(B̄ | A)      Pr(A ∩ B̄)
    Ā   Pr(Ā)       B   Pr(B | Ā)      Pr(Ā ∩ B)
                    B̄   Pr(B̄ | Ā)      Pr(Ā ∩ B̄)

    Questions

    100 Otago students were asked if they like L&P.
    Of the 75 males (M) surveyed: 50 said they like L&P (L).
    Of the 25 females (M̄) surveyed: 20 said they like L&P (L).

    Find:

    Pr(L)
    Pr(L | M)
    Pr(L | M̄)

    Are the events M and L independent?


    Section 3. Probability

    Lecture 11


    Sensitivity and Specificity

    Given the following Events:

    D: some condition (D) is present.
    T: the related test (T) for the presence of D is positive.
    Note: This test result may or may not be correct.

    SENSITIVITY = Pr(T | D)
    Think of this as the probability of a positive test result, given the person actually has the disease.

    SPECIFICITY = Pr(T̄ | D̄)
    Think of this as the probability of a negative test result, given the person does NOT have the disease.

    SENSITIVITY and SPECIFICITY will appear very naturally in your tree diagrams!


    PPV and NPV

    POSITIVE PREDICTIVE VALUE = Pr(D | T)
    The proportion of patients with positive test results who are correctly diagnosed.

    NEGATIVE PREDICTIVE VALUE = Pr(D̄ | T̄)
    The proportion of patients with negative test results who are correctly diagnosed.

    Notice the different order of the D and the T events.

    These will NOT appear naturally in your tree diagrams.
    See slide later for how to calculate these.

    Screening Programmes

    A patient with certain symptoms consults her doctor to be checked for a cancer, and she undergoes a biopsy.

    With this test there is a probability of 0.90 that a woman with the cancer shows a positive biopsy, and a probability of only 0.001 that a healthy woman incorrectly shows a positive biopsy.

    Historical information also suggests that the prevalence of this cancer in the population is 1 in 10000.

    Find the probability that a woman has the cancer given the biopsy says she does (i.e. does the biopsy diagnose true patient status?).

    Let C be the event "woman has the cancer" and T be the event "test is positive".

    Pr(C) = 1/10000 = 0.0001 (disease prevalence)

    Pr(T | C) = 0.90 (conditional probability)

    Pr(T | C̄) = 0.001

    Tree Diagrams

    [Tree diagram built up over several slides; the completed tree is shown below.]

    Status        Test             End of path
    C̄ (0.9999)    T̄ (0.999)        Pr(T̄ and C̄) = 0.99890
                  T (0.001)        Pr(T and C̄) = 0.00100
    C (0.0001)    T̄ (0.10)         Pr(T̄ and C) = 0.00001
                  T (0.90)         Pr(T and C) = 0.00009

    Pr(T) = Pr(T ∩ C) + Pr(T ∩ C̄) = 0.00009 + 0.00100 = 0.00109

    Positive Predictive value

    Find the positive predictive value Pr(C | T). To calculate this we use the conditional probability formula

    Pr(C | T) = Pr(C ∩ T) / Pr(T)

    Pr(C | T) = 0.00009 / 0.00109

    Pr(C | T) = 0.083

    Only 8.3% of those women identified as having the disease actually do.


    Negative Predictive value

    Find the negative predictive value Pr(C̄ | T̄). To calculate this we use the conditional probability formula

    Pr(C̄ | T̄) = Pr(C̄ ∩ T̄) / Pr(T̄)

    Pr(C̄ | T̄) = 0.99890 / 0.99891

    Pr(C̄ | T̄) ≈ 0.99999

    More than 99.99% of those women identified as not having the disease do not have it.
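The whole screening calculation (tree, Pr(T), PPV and NPV) can be condensed into one small function. This is a sketch only: the function name and argument names are made up for illustration, and the specificity 0.999 is obtained as 1 − 0.001 from the false positive probability given in the problem.

```python
def predictive_values(prevalence, sensitivity, specificity):
    """Return (PPV, NPV) from the four end-of-path probabilities of the tree."""
    true_pos = prevalence * sensitivity               # Pr(T and C)
    false_neg = prevalence * (1 - sensitivity)        # Pr(not T and C)
    false_pos = (1 - prevalence) * (1 - specificity)  # Pr(T and not C)
    true_neg = (1 - prevalence) * specificity         # Pr(not T and not C)

    ppv = true_pos / (true_pos + false_pos)   # Pr(C | T)
    npv = true_neg / (true_neg + false_neg)   # Pr(not C | not T)
    return ppv, npv

ppv, npv = predictive_values(prevalence=0.0001, sensitivity=0.90, specificity=0.999)

print(f"PPV = {ppv:.3f}")
print(f"NPV = {npv:.5f}")
```

Trying a larger prevalence in the same function shows how strongly the positive predictive value depends on how common the disease is.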

    Classification table

    Sometimes the information is presented in a different manner.

    Table layout for 2 Random Variables

    Closure of the squid fishery in the sub Antartic islands due to Hookersea lion bycatch is a costly issue fo