Unit 8 Textbook


  • 7/30/2019 Unit 8 Textbook


    UNIT 8: USING STATISTICS FOR SCIENCE

    BTEC (Extended) Diploma

    Applied Science (Forensics) Level 3

    Steve Bishop

    November 2012


    Contents

    1 BE ABLE TO USE STATISTICAL TECHNIQUES TO INVESTIGATE SCIENTIFIC PROBLEMS
        Statistical techniques
            Measures of location
            Measures of dispersion
        Normal distribution 1
            Confidence limits
            Shapes of distributions
        The normal distribution 2
            Finding probabilities with negative values of z
        Standardising a normal distribution
        Probability introduction
        Conditional probability
        Statistics and probability questions

    2 BE ABLE TO PERFORM STATISTICAL TESTS TO INVESTIGATE SCIENTIFIC PROBLEMS
        Chi-squared (χ²) test
            Practice questions
        Type I and type II errors
        The angel of death: guilty or not guilty?
        Student's t-test
            t-test for matched pairs
            Independent samples
            Independent t-test

    STATISTICAL TABLES


    1 BE ABLE TO USE STATISTICAL TECHNIQUES

    TO INVESTIGATE SCIENTIFIC PROBLEMS

    Probability: addition and multiplication rules; conditional probability, eg lottery,

    Mendelian inheritance

    Frequency distributions: discrete data; continuous data (grouped and ungrouped)

    Shape of distributions: unimodal distributions (normal distributions and skewed

    distributions); bimodal distributions (qualitative explanation)

    Statistical data calculations: calculation of the mean, x̄; mode; median; calculation of the standard deviation, s; using ICT

    equipment to calculate the standard deviation; entering statistical data into ICT

    equipment; retrieving statistical information from ICT equipment; standard error of

    the mean; confidence limits

    Normal distribution: mean; variance; use of tables of the cumulative distribution

    function; application of the normal distribution in science

    Sampling: random sampling (quadrat in field sampling); population and sample (Gallup or Mori poll); standard error of the mean (the uncertainty in the average

    value of a set of measurements, eg the calorific value of oil)

    P1 carry out statistical calculations to investigate a scientific problem

    M1 perform a calculation using probability to investigate a scientific problem

    D1 interpret shapes of distributions in scientific data


    Statistical techniques

    Measures of location

    There are three types of average: the mean (x̄), mode and median. These are known as measures of location. Each provides a single value that represents the data.

    Now try this

    Find the mean, median and mode of the following data:

    (a) 1, 1, 1, 3, 4, 5, 6

    (b) 0, 1, 1, 1, 1, 1, 1, 9

    Which of the three measures of location is most affected by an extreme value?

    When might the mode be of more use than either the median or the mean?

    What is the advantage of the mean?

    Measures of location do not tell us how spread out our data are, that is, how dispersed they are.

    Be able to use statistical techniques to investigate scientific problems: frequency distributions; shape of distributions; statistical data calculations (mean, mode, median and standard deviation); samples and populations (standard error of the mean); using spreadsheets and calculators.


    Measures of dispersion

    One measure of dispersion or spread is the range. Another is the standard deviation, s or σ.

    The formula for the standard deviation is:

    $$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

    This is not quite as scary as it looks! It involves a few simple steps:

    1. Find the mean.

    2. Subtract it from each value to find the deviation, then square each deviation.

    3. Total the squared deviations.

    4. Divide the total from step 3 by the number of data points less one.

    5. Take the square root of the answer from step 4. This is the sample standard deviation.

    Done manually, it is best done in a table:

    Example

    These are the number of break-ins in a housing estate over a twelve-month period.

    Find the mean and the standard deviation of the data:

    1, 3, 3, 4, 2, 0, 0, 3, 4, 3, 0, 1

    1. Find the mean

    $$\bar{x} = \frac{1+3+3+4+2+0+0+3+4+3+0+1}{12} = \frac{24}{12} = 2$$

    2. Subtract the mean from each value and square the result, $(x - \bar{x})^2$

    Here s is the standard deviation, x is an individual data point, x̄ (x bar) is the mean and n is the number of data points.


    x     x − x̄          (x − x̄)²
    1     1 − 2 = −1     1
    3     3 − 2 = 1      1
    3     3 − 2 = 1      1
    4     4 − 2 = 2      4
    2     2 − 2 = 0      0
    0     0 − 2 = −2     4
    0     0 − 2 = −2     4
    3     3 − 2 = 1      1
    4     4 − 2 = 2      4
    3     3 − 2 = 1      1
    0     0 − 2 = −2     4
    1     1 − 2 = −1     1
    Total                26

    3. Find the total of the squared deviations

    $$\sum (x - \bar{x})^2 = 26$$

    4. Divide the above by the number of data points (n) less 1 (n − 1)

    12 − 1 = 11

    $$\frac{\sum (x - \bar{x})^2}{n - 1} = \frac{26}{11} = 2.3636363\ldots$$

    (Don't round yet!)

    5. To find the standard deviation, take the square root of the answer above

    $$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{2.363636\ldots} = 1.5374$$

    = 1.54 (3 sig fig)
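    The five steps can also be checked with a few lines of code. The sketch below is only an illustration in Python (a language the textbook itself does not use); the variable names are our own, and the data are the break-in figures from the example.

```python
import math

# Number of break-ins per month, from the worked example.
data = [1, 3, 3, 4, 2, 0, 0, 3, 4, 3, 0, 1]

n = len(data)
mean = sum(data) / n                             # step 1: the mean
squared_devs = [(x - mean) ** 2 for x in data]   # step 2: squared deviations
total = sum(squared_devs)                        # step 3: total them
variance = total / (n - 1)                       # step 4: divide by n - 1
s = math.sqrt(variance)                          # step 5: square root

print(mean, total, round(s, 2))  # 2.0 26.0 1.54
```

    Python's standard library function statistics.stdev(data) uses the same n − 1 formula, so it should give the same result.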


    This can be done on a spreadsheet using the Insert Function command.

    Enter the data in a column.

    Ensure the correct data points are chosen.

    Choose Insert Function > Statistical > STDEV.

    Alternatively, type the function statement directly: =STDEV(cell range)


    Normal distribution 1

    The standard deviation can be used to find the confidence interval for a set of measurements.

    We expect 95% of measured values to lie within 2 standard deviations above and

    below the mean.

    The distribution of the height of 1000 people might look like this.

    The shape is known as a bell shape.

    The mean, median and mode will all have the same value.

    It is symmetrical around the mean value.

    Many biological variables, such as weight, height, blood pressure and life span, have this same distribution shape.

    Given enough data points the curve will be a smooth bell shape.


    On a normal distribution:

    68% of the data items will be within 1 standard deviation of the mean

    95.5% of the data items will be within 2 standard deviations of the mean

    99.7% of the data items will be within 3 standard deviations of the mean

    However, the sample mean is only an estimate of the true value, because we usually have only a small sample of values. There will be a sampling error, as we cannot always sample the whole population.

    We therefore need to calculate the standard error of the mean:

    $$\text{Standard error} = \frac{s}{\sqrt{n}}$$
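    As a quick illustration of the formula, the sketch below computes the standard error in Python. The function name is our own, and the inputs (s = 1.5374, n = 12) are borrowed from the break-ins example purely for demonstration.

```python
import math

def standard_error(s, n):
    """Standard error of the mean: s divided by the square root of n."""
    return s / math.sqrt(n)

# Sample standard deviation and sample size from the break-ins example.
se = standard_error(1.5374, 12)
print(round(se, 3))  # 0.444
```

    Because n sits under a square root, quadrupling the sample size only halves the standard error.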

    Any data point that is more than 3 standard deviations from the mean is considered to be an outlier.

    If the whole population is sampled then this is known as a census.


    Now try this

    Complete the following table

    Sample size    Mean (cm)    Standard deviation    Standard error of the mean
    10             150          2
    100            150          2
    1000           150          2
    10 000         150          2

    What happens to the standard error of the mean as the sample size increases?

    Confidence limits

    To find how confident we can be in the data we can find the confidence limits; these are related to the standard error.

    For data that is normally distributed, approximately 95% lies within 2 standard deviations of the mean. The 95% confidence level is adequate for most scientific investigations.

    95% confidence limits = mean ± 1.96 × standard error of the mean
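    The calculation of the two limits can be sketched in Python as follows. This is an illustration only: the function name is our own, and the inputs (mean 2, s = 1.5374, n = 12) are reused from the break-ins example and treated as if the data were normally distributed.

```python
import math

def confidence_limits_95(mean, s, n):
    """Return the lower and upper 95% confidence limits for the mean."""
    se = s / math.sqrt(n)    # standard error of the mean
    margin = 1.96 * se       # half-width of the interval
    return mean - margin, mean + margin

lower, upper = confidence_limits_95(2.0, 1.5374, 12)
print(round(lower, 2), round(upper, 2))  # 1.13 2.87
```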

    In forensic situations or in clinical trials a 99.7% confidence limit is often required.


    Now try these

    1. The diameter of a piece of wire is measured using a micrometer. The

    following results in mm were obtained:

    2.34, 2.34, 2.35, 2.37, 2.38

    Calculate the mean and standard deviation.

    2. The mean of five diameter values of a piece of wire is 2.36 mm and the

    standard deviation is 0.018 mm.

    What is the standard error of the mean?

    A piece of wire is measured as 2.39 mm. Can you be 95% confident that it is a correct measurement of the diameter?


    3. The volumes of acid to determine the end point of a titration are the

    following:

    (a) Calculate the mean and standard deviation

    (b) Find the standard error of the mean.

    (c) What are the 95% confidence limits, assuming the data is normally

    distributed?

    Shapes of distributions


    The normal distribution 2

    The normal distribution is a very important distribution.

    It is described by:

    $$X \sim N(\bar{x}, s^2)$$

    where x̄ is the mean and s² is the variance.

    It has the following features:

    it is bell-shaped

    it is symmetrical about the mean

    it extends from −∞ to +∞

    the maximum value of f(x) is $\frac{1}{s\sqrt{2\pi}}$

    the total area under the curve is 1

    Approximately 95% of the distribution lies within 2 SDs of the mean (between −2s and +2s).

    Approximately 99.7% of the distribution lies within 3 SDs of the mean (between −3s and +3s).


    The probability that X lies between a and b is written as: P(a < X < b)


    Now try these

    Draw sketches to illustrate your answers

    If Z ~ N(0, 1), find

    1. P(Z < 0.87) 2. P(Z > 0.87) 3. P(Z < 0.544) 4. P(Z > 0.544)


    Finding probabilities with negative values of z

    For negative values of Z we use Φ(−z) = 1 − Φ(z), remembering that the curve is symmetrical and that the total area under the curve is 1.

    [Figure: standard normal curves with the areas Φ(−a), 1 − Φ(a) and Φ(a) shaded]

    By symmetry, P(Z < −a) = Φ(−a) = 1 − Φ(a) = P(Z > a).

    Similarly, P(Z > −a) = Φ(a).

    Example

    Find (a) P(Z < 0.411) (b) P(Z > −0.411) (c) P(Z > 0.411) (d) P(Z < −0.411)

    Solution

    (a) P(Z < 0.411) = Φ(0.411) = 0.6595 [from tables: 0.6591 + 0.0004]

    (b) P(Z > −0.411) = P(Z < 0.411) = 0.6595 (from (a))

    (c) P(Z > 0.411) = 1 − Φ(0.411) = 1 − 0.6595 = 0.3405

    (d) P(Z < −0.411) = P(Z > 0.411) = 0.3405
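    Where tables are not to hand, Φ(z) can be computed directly. The sketch below is one possible way of doing so in Python, using the standard identity Φ(z) = ½[1 + erf(z/√2)]; the function name phi is our own. It reproduces the four parts of the example, and the results should agree with the table values to about three decimal places.

```python
import math

def phi(z):
    """Cumulative distribution function of the standard normal, Φ(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

a = 0.411
print(round(phi(a), 4))        # (a) P(Z < 0.411)
print(round(1 - phi(-a), 4))   # (b) P(Z > -0.411), equal to (a) by symmetry
print(round(1 - phi(a), 4))    # (c) P(Z > 0.411)
print(round(phi(-a), 4))       # (d) P(Z < -0.411), equal to (c) by symmetry
```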


    Now try these

    5. P(Z > −0.314) 6. P(Z < −0.314) 7. P(Z > 0.111) 8. P(Z > −0.111)

    P(a < Z < b) = Φ(b) − Φ(a)

    Example:

    Find P(0.345 < Z < 1.751)

    = Φ(1.751) − Φ(0.345) = 0.9600 − 0.6350

    = 0.3250
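    The same subtraction can be sketched in Python. As before, phi is our own helper built on math.erf rather than anything from the textbook, and the answer should match the table-based result to about three decimal places.

```python
import math

def phi(z):
    """Cumulative distribution function of the standard normal, Φ(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_between(a, b):
    """P(a < Z < b) for Z ~ N(0, 1), found as Φ(b) − Φ(a)."""
    return phi(b) - phi(a)

p = prob_between(0.345, 1.751)
print(round(p, 3))  # approximately 0.325
```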

    Now try these

    Find

    (a) P(0.35 < Z < 1.50)

    (b) P(0.45 < Z < 1.51)

    (c) P(0.354 < Z < 1.541)

    (d) P(0.349 < Z < 1.716)

    Answers

    Now try these

    1. 0.8078

    2. 1 − 0.8078 = 0.1922

    3. 0.7068

    4. 1 − 0.7068 = 0.2932

    5. Φ(0.314) = 0.6231, so by symmetry 0.6231

    6. 1 − 0.6231 = 0.3769

    7. 1 − 0.5442 = 0.4558

    8. 0.5442

    Now try these

    (a) Φ(1.50) − Φ(0.35) = 0.9332 − 0.6368 = 0.2964

    (b) 0.9345 − 0.6736 = 0.2609

    (c) 0.9383 − 0.6382 = 0.3001

    (d) 0.9569 − 0.6363 = 0.3206


    Standardising a normal distribution

    To standardise X, where X ~ N(μ, σ²), subtract the mean and then divide by the standard deviation:

    $$Z = \frac{X - \mu}{\sigma}$$

    where Z ~ N(0, 1)

    Example

    If X ~ N(100, 25), find P(X > 110)

    Solution

    First standardise the random variable. The variance is 25, so the standard deviation is 5:

    $$P(X > 110) = P\left(Z > \frac{110 - 100}{5}\right) = P(Z > 2)$$

    P(Z > 2) = 1 − P(Z ≤ 2)

    = 1 − 0.9772

    = 0.0228
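    The standardising step can be sketched in Python as below. The helper names are our own; note that, as in the N(μ, σ²) notation, the function takes the variance, so it has to take a square root to get the standard deviation.

```python
import math

def phi(z):
    """Cumulative distribution function of the standard normal, Φ(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_greater(x, mean, variance):
    """P(X > x) for X ~ N(mean, variance), found by standardising."""
    z = (x - mean) / math.sqrt(variance)   # Z = (X − μ) / σ
    return 1.0 - phi(z)

p = prob_greater(110, 100, 25)
print(round(p, 4))  # 0.0228
```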

    Now try these

    1. If X ~ N(116, 64), find P(X < 100)

    2. If X ~ N(100, 16), find P(X > 90)

    [Figure: the area above x = 110 under X ~ N(100, 25) corresponds to the area above z = 2 under Z ~ N(0, 1)]


    Example

    Lengths of a murder victim's hair are normally distributed with a mean length of 150 cm and a standard deviation of 10 cm.

    Find the probability that the length of a randomly selected strand is shorter than 165 cm.

    Solution

    Here X ~ N(150, 10²)

    (a) This means we have to find P(X < 165)


    Now try these 2

    1. The masses of packages from a particular machine are normally distributed with a mean of 200 g and a standard deviation of 2 g. Find the probability that a randomly selected package from the machine weighs

    (a) less than 197 g

    (b) more than 200.5 g

    (c) between 198.5 and 199.5 g.

    2. The heights of boys at a particular age follow a normal distribution with mean 150.3 cm and variance 25 cm². Find the probability that a boy picked at random from this group has height

    (a) less than 153 cm

    (b) more than 158 cm

    (c) between 150 cm and 158 cm

    (d) more than 10 cm difference from the mean height.

    Answers

    Now try these

    1. 0.0228

    2. 0.9938

    Now try these 2

    1. (a) 0.0668 (b) 0.4013 (c) 0.1747

    2. (a) 0.7054 (b) 0.0618 (c) 0.4621 (d) 0.0456


    Probability introduction

    Random events happen by chance. Probability is a measure of how likely they are.

    It is measured on a scale from 0 (impossible) to 1 (certain).

    A random event has various outcomes. In a trial (or experiment) the things that happen are called outcomes.

    Events are groups of one or more outcomes.

    When all outcomes are equally likely, the probability of an event is determined by counting the outcomes.

    $$P(\text{event}) = \frac{\text{Number of outcomes where event happens}}{\text{Total number of possible outcomes}}$$

    Example

    A bag contains 10 balls: 4 are red, 3 are blue, 2 are white and 1 is black.

    What is the probability of picking a blue ball? a white ball? a green ball? a ball that is not red?

    A sample space is the set of all possible outcomes.

    Example

    Complete the following sample space for the score on rolling 2 dice:

    Scores   1   2   3   4   5   6
    1
    2
    3
    4
    5
    6

    Find the probability of scoring: a total of 12; a total of 7; a score of less than 4.
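    Counting equally likely outcomes lends itself to a short program. The sketch below is an illustration in Python (not part of the original unit): it lists all 36 outcomes for two dice and applies the counting formula, with names of our own choosing and "score" taken to mean the total of the two dice.

```python
from fractions import Fraction
from itertools import product

# Every equally likely outcome for two dice, recorded as the total score.
totals = [a + b for a, b in product(range(1, 7), repeat=2)]

def probability(event):
    """P(event) = outcomes where the event happens / total outcomes."""
    favourable = sum(1 for t in totals if event(t))
    return Fraction(favourable, len(totals))

p12 = probability(lambda t: t == 12)
p7 = probability(lambda t: t == 7)
p_lt4 = probability(lambda t: t < 4)
print(p12, p7, p_lt4)  # 1/36 1/6 1/12
```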

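    Venn diagrams can be used to show which outcome corresponds to which event.

    [Venn diagrams: two overlapping circles A and B. In the first, the area in the middle is shaded to show A and B, P(A ∩ B). In the second, the whole of both circles is shaded to show A or B, P(A ∪ B).]

    P(A ∪ B) = P(A) + P(B) − P(A ∩ B)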


    Example

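    If you roll a dice, event A is an even number, and event B is a number > 4, then the Venn diagram would be:

    [Venn diagram: circle A contains 2, 4 and 6; circle B contains 5 and 6; 6 lies in the overlap; 1 and 3 lie outside both circles]

    Find: P(A); P(B); P(A ∩ B); P(A′)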
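    The addition rule, P(A or B) = P(A) + P(B) − P(A and B), can be checked for these two events with Python sets. This is a sketch with names of our own; the set operators & and | play the roles of "and" and "or".

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}
A = {x for x in outcomes if x % 2 == 0}   # an even number
B = {x for x in outcomes if x > 4}        # a number greater than 4

def p(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
assert p(A | B) == p(A) + p(B) - p(A & B)

print(p(A), p(B), p(A & B), p(A | B), p(outcomes - A))  # 1/2 1/3 1/6 2/3 1/2
```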


    Conditional probability

    In a small prison there are 100 prisoners: 50 are imprisoned for burglary, 29 for arson and 34 for other crimes.

    First draw a Venn diagram. Work out how many are in for both burglary and arson: (50 + 29 + 34) − 100 = 13. These must have been counted twice, so they are the ones in for both. So those in for burglary only must be 50 − 13 = 37, and arson only 29 − 13 = 16. Place these numbers on the Venn diagram.

    [Venn diagram: Burglary (50) and Arson (29) overlap; 37 burglary only, 13 both, 16 arson only, 34 outside both circles]

    1. What is the probability of choosing someone from the prison who is not in for burglary or arson?

    2. What is the probability of choosing someone who is in for burglary and arson?

    3. What is the probability of choosing someone who is in for arson?

    4. What is the probability that this person is in for burglary as well?

    This last question is known as conditional probability. It is often phrased as "What is the probability of choosing someone who is convicted of burglary given that they are convicted of arson?"

    This is written as: P(B|A) (the probability of B given A).

    From the diagram, the answer is straightforward: 13/29.

    The 13 represents those in for burglary and arson, and the 29 represents those in for arson.

    So this can be written as:

    $$P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)}$$
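    The prison example can be worked through in Python using exact fractions. The sketch below uses our own variable names, with A standing for arson and B for burglary as in the text.

```python
from fractions import Fraction

total = 100        # prisoners
burglary = 50
arson = 29
other = 34

# Prisoners counted twice are in for both burglary and arson.
both = (burglary + arson + other) - total

p_a_and_b = Fraction(both, total)    # P(A and B)
p_a = Fraction(arson, total)         # P(A)
p_b_given_a = p_a_and_b / p_a        # P(B|A) = P(A and B) / P(A)

print(both, p_b_given_a)  # 13 13/29
```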


    Now try these

    1. Two dice are thrown. What is the probability that the total is: (i) 7; (ii) a prime

    number; (iii) 7, given that it is a prime number.

    2. A forensics company is worried about the high turnover of its employees and decides to investigate whether they are more likely to stay if they are given training. On 1st January one year the company employed 256 people (excluding those about to retire). During the year a record was kept of who received training as well as who left the company. The results are summarised below:

                          Still employed   Left company   Total
    Given training        109              43             152
    Not given training    60               44             104
    Total                 169              87             256

    Find the probability that a randomly selected employee:

    (i) received training

    (ii) did not leave the company

    (iii) received training and did not leave the company

    (iv) did not leave the company, given that the person had received training

    (v) received training, given the person had not left the company.

    3. 100 cars are entered for a road-worthiness test which is in two parts, mechanical and electrical. A car passes only if it passes both parts. Half the cars fail the electrical test and 62 pass the mechanical. 15 pass the electrical but fail the mechanical test. Find the probability that a car chosen at random:

    (i) passes overall (ii) fails one test only (iii) given that it has failed, failed the mechanical test only.


    Statistics and probability questions

    For each task show all your workings. Give the final answer where appropriate to 3 significant figures. Hand in your completed working and solution.

    Task 1

    A. Use the data from your titration experiment.

    (a) Find the mean and median of the volume of HCl used to determine the end point.

    (b) Determine the standard deviation using an appropriate method.

    B. On a particular corpse some unidentified tissue has been found. A sample of 11 cells has been taken and measured. The diameters (in µm) are as follows:

    123, 126, 129, 122, 125, 128, 125, 124, 125, 126, 122

    (a) Find the mean, median and mode of the diameters.

    (b) Determine the standard deviation manually and by ICT (if you use a spreadsheet include a screen shot).

    (c) Calculate the standard error of the mean. What are the 95% confidence limits?

    (P1)

    Task 2

    You have been investigating the probability of certain ballistic trace evidence being found at a crime scene. The probability of type A is 0.3 and the probability of type B is 0.5. The conditional probability P(A|B) = 0.25.

    (a) Find the probability of finding A and B at a crime scene.

    (b) Find the probability of finding A or B at the scene.

    (P1 part; M1 part)

    Task 3

    A forensic anthropologist has asked your advice. She was investigating the lifespan of insects on a human corpse. The mean lifespan for one insect is 144 days and the standard deviation is 16 days. Find the probability that one insect will live less than 140 days and another more than 156 days.

    (M1 part, D1 part)


    Task 4

    The following distributions have had their labels removed.

    [Figure: four distribution curves labelled A, B, C and D]

    Identify:

    (a) the bimodal distribution

    (b) the positively skewed distribution

    (c) the negatively skewed distribution and

    (d) the normal distribution.

    Which distribution matches the following:

    (i) An easy science examination

    (ii) The salary of workers in a large laboratory

    (iii) The heights of males and females in the UK

    (iv) The mass of males in a large science laboratory.

    (D1 part)


    2 BE ABLE TO PERFORM STATISTICAL TESTS

    TO INVESTIGATE SCIENTIFIC PROBLEMS

    Chi-squared test: (χ² = Σ (O − E)²/E, where O is the observed frequency and E is the expected frequency); degrees of freedom; contingency tables; science related applications of the Chi-squared test, eg colour blindness, psychology, genetics, drug tests, any other science related test

    P2 perform a chi-squared test to support a scientific hypothesis

    M2 interpret the results of the chi-squared test

    D2 evaluate the validity of the interpretation of the results of the chi-squared test

    The t-test: independent samples; related samples (matched pairs); applications,

    eg equal number of seeds in two different composts, test whether a particular

    fertilizer improves yield of tomatoes, any other science related test

    P3 perform a t-test on data collected from a laboratory experiment

    M3 interpret the results of the t-test

    D3 evaluate the validity of the interpretation of the results of the t-test

    Correlation testing: graphical test, eg line of best fit; linear regression, eg using a

    calculator in linear regression mode; testing for power law, eg radioactivity

    experiments, electrical experiments, any other science related example

    P4 carry out an appropriate correlation method to investigate data collected from a

    laboratory experiment.

    M4 interpret the results of the correlation.

    D4 evaluate the validity of the interpretation of the results of the correlation.


    Chi-squared (χ²) test

    χ² is pronounced "kai-squared", and is sometimes written chi-squared. The χ² test helps discover if there is any connection between two variables that can be arranged into categories (eg colours, countries, gender). (It cannot be used with continuous data.)

    Example 1

    50 men and 50 women are interviewed.

    43 men can name over 15 clubs in the Premier League.

    27 women can name over 15 clubs in the Premier League.

    Is there a connection between gender and football interest (assuming being able to name over 15 clubs means that the person has an interest in football)?

    1. Define the null and alternative hypotheses

    H0: there is no difference between genders

    H1: there is a difference between genders.

    2. Arrange the data into a contingency table

             Interested in football   Not interested in football   Total
    Men      43                       7                            50
    Women    27                       23                           50
    Total    70                       30                           100

    This is a 2 × 2 table: there are 2 categories for each variable.

    $$\chi^2 = \sum \frac{(O - E)^2}{E}$$

    where O is the observed values and E is the expected values.

    The contingency table above gives the observed values. We now have to find the expected values.

    3. Find the expected values

    Each expected value is found by multiplying the column total by the row total and dividing by the grand total:

    expected value = (column total × row total) ÷ overall total


    Hence for men:

    interested in football we would expect: (70 × 50) ÷ 100 = 35

    not interested in football: (30 × 50) ÷ 100 = 15

    For women:

    interested in football: (70 × 50) ÷ 100 = 35

    not interested in football: (30 × 50) ÷ 100 = 15

    The expected table would then read:

             Interested in football   Not interested in football   Total
    Men      35                       15                           50
    Women    35                       15                           50
    Total    70                       30                           100

    The totals will remain unchanged.

    4. Calculate the residual table

    The residual is the difference between the observed and the expected values.

    Observed   −   Expected   =   Residual
    43   7         35  15         +8  −8
    27  23         35  15         −8  +8

    In 2 × 2 tables the numbers will always be the same, with only the signs differing.

    5. Calculate χ²

    $$\chi^2 = \sum \frac{(O - E)^2}{E}$$

    where O is the observed values and E is the expected values.

    The residual table gives the values of (O − E).

    So,

    $$\chi^2 = \frac{(+8)^2}{35} + \frac{(-8)^2}{15} + \frac{(-8)^2}{35} + \frac{(+8)^2}{15} = 12.19$$

    We now have to decide if the χ² value is high enough to conclude that it is unlikely to get such a number by chance.

    To do this we have to look at the concept of degrees of freedom.
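    Steps 3 to 5 can be sketched in Python as a check on the arithmetic. The function names below are our own and the code is only an illustration of the method described above, applied to the observed table from Example 1.

```python
def expected_table(observed):
    """Expected value for each cell: row total × column total ÷ grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_squared(observed):
    """Sum of (O − E)² / E over every cell of the contingency table."""
    expected = expected_table(observed)
    return sum((o - e) ** 2 / e
               for o_row, e_row in zip(observed, expected)
               for o, e in zip(o_row, e_row))

# Rows: men, women. Columns: interested, not interested.
observed = [[43, 7], [27, 23]]
print(expected_table(observed))          # [[35.0, 15.0], [35.0, 15.0]]
print(round(chi_squared(observed), 2))   # 12.19
```

    The same two functions work for larger tables, such as 3 × 3, because they only rely on row and column totals.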


    Degrees of freedom

    In a 2 × 2 contingency table, once the totals are known, the value of one entry determines all the others:

                          Total
             43           50
                          50
    Total    70    30     100

    However, in a 3 × 3 table we need 4 values before we can know what all the other values are:

                                Total
             37    22           70
              8    10           20
                                60
    Total    60    50    40

    The one value in the first example and the four values in the second example are called the degrees of freedom.

    The degrees of freedom can be calculated using:

    degrees of freedom = (r − 1) × (c − 1)

    where r = the number of rows and c = the number of columns.

    Knowing the degrees of freedom, the χ² value and a table of critical values we can find out if there is any relationship between gender and interest in football.

    One-tail   5%      2.5%    1.25%   0.5%    0.25%   0.05%
    Two-tail   10%     5%      2.5%    1%      0.5%    0.1%
    p          0.9     0.95    0.975   0.99    0.995   0.999
    ν = 1      2.706   3.841   5.024   6.635   7.879   10.83
    ν = 2      4.605   5.991   7.378   9.210   10.60   13.82
    ν = 3      6.251   7.815   9.348   11.34   12.84   16.27
    ν = 4      7.779   9.488   11.14   13.28   14.86   18.47

    With one degree of freedom, a test at the 5% level gives a critical value of 3.841. This means that, if there were no relationship, we would expect a χ² value greater than 3.841 only 5% of the time.

    As the χ² value is 12.19, which is greater than 3.841, we can say that at the 5% level we are confident that there is a relationship between interest in football and gender.
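    The final decision step can be sketched in Python as follows. The dictionary simply copies the 5% critical values for 1 to 4 degrees of freedom from the table, and the function names are our own.

```python
# 5% critical values of chi-squared for 1 to 4 degrees of freedom.
CRITICAL_5_PERCENT = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

def degrees_of_freedom(rows, cols):
    """Degrees of freedom of a contingency table: (r − 1) × (c − 1)."""
    return (rows - 1) * (cols - 1)

def significant_at_5_percent(chi2, rows, cols):
    """True if chi2 exceeds the 5% critical value for the table size."""
    return chi2 > CRITICAL_5_PERCENT[degrees_of_freedom(rows, cols)]

print(degrees_of_freedom(2, 2), degrees_of_freedom(3, 3))   # 1 4
print(significant_at_5_percent(12.19, 2, 2))                # True
```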


    Example 2

    A sociologist wants to know if middle-class men are more likely to change babies' nappies than working-class men. The sociologist interviews 40 middle-class and 60 working-class men. 17 middle-class men change nappies and 13 working-class men change nappies.

    1. Define the null and alternative hypotheses

    H0: There is no connection between social class and nappy changing

    H1: The two variables are related.

    2. Arrange the data into a contingency table

    3. Find the expected values

    4. Calculate the residual table

    Observed   −   Expected   =   Residual
    17  23         12  28         +5  −5
    13  47         18  42         −5  +5

    5. Calculate χ²

    So,

    $$\chi^2 = \frac{(+5)^2}{12} + \frac{(-5)^2}{28} + \frac{(-5)^2}{18} + \frac{(+5)^2}{42} = 4.96$$

    6. Find the degrees of freedom

    Degrees of freedom = (2 − 1) × (2 − 1) = 1

    7. Use the tables

    If H0 is true, the chance that χ² will be 3.841 or more is 5%. Here χ² = 4.96, so this suggests that we reject H0 and conclude that there is some connection between social class and nappy changing.


    Now try these

    1. Find the expected values for the following tables.

    (a)  18  32      (b)  25  16
          8  42           22  37

    (c)  40  60  60
         60  50  50
         20  50  10

    2. Find the residual tables for the tables in question 1.

    3. Calculate χ² for the tables in question 1.

    4. How many degrees of freedom will there be for each of the following contingency tables?

    (a) 5 × 3 (b) 7 × 5 (c) 6 × 2 (d) 10 × 17

    5. The table below shows the results of a drug test on an infection. Is there any evidence that treatment is related to cure?

                 Treated   Not treated
    Cured        24        57
    Not cured    53        257

    6. Murder Inc., a forensic science firm, carried out a survey to find out the political affiliation of its employees. Carry out a χ² test on the table to determine whether there is any association between political affiliation and type of work.

                    Lab-based   Non lab-based   Total
    Conservative    22          16              38
    Labour          53          8               61
    LibDem          20          11              31
    Total           95          35              130

    7. A researcher in genetics is investigating whether eye colour bears any relationship to place of residence. From the table below, is there any evidence of such a relationship?

                   Brown   Blue   Other
    Leicester      72      80     28
    Bournemouth    20      62     18
    Aberdeen       67      120    44


    Answers

    1.

    (a) 13 37        (b) 19.27 21.73
        13 37            27.73 31.27

    (c) 48 64 48
        48 64 48
        24 32 24

    2.

    (a) +5 -5        (b) +5.73 -5.73
        -5 +5            -5.73 +5.73

    (c)  -8  -4 +12
        +12 -14  +2
         -4 +18 -14

    3. (a) $\chi^2$ = 25/13 + 25/37 + 25/13 + 25/37 = 5.20 (b) 5.45 (c) 29.69

    4. (a) 4 × 2 = 8 (b) 6 × 4 = 24 (c) 5 × 1 = 5 (d) 9 × 16 = 144

    5. $\chi^2$ = 6.38 with 1 degree of freedom, significant at 2%, so there is evidence of an association.

    6. $\chi^2$ = 11.52 with 2 degrees of freedom, significant at 1%, so there is evidence of an association.

    7. $\chi^2$ = 13.5 with 4 degrees of freedom, significant at 1%, so there is evidence of an association.
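
    The answers to questions 1 to 4 can be checked with a short script. This is a minimal sketch in Python, assuming the usual rule that each expected value is row total × column total ÷ grand total. The function names are illustrative only and do not come from the text.

```python
def expected_values(observed):
    """Expected count for each cell: row total x column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_squared(observed):
    """Sum of (observed - expected)^2 / expected over every cell."""
    expected = expected_values(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


def degrees_of_freedom(observed):
    """(number of rows - 1) x (number of columns - 1)."""
    return (len(observed) - 1) * (len(observed[0]) - 1)


# Table (a) from question 1
table_a = [[18, 32], [8, 42]]
print(expected_values(table_a))        # [[13.0, 37.0], [13.0, 37.0]]
print(round(chi_squared(table_a), 2))  # 5.2
print(degrees_of_freedom(table_a))     # 1
```

    The same three functions work for a table of any size, so tables (b) and (c) can be checked by changing the list of rows.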

    Practice questions

    1. Is there a connection at the 5% level between burglary and house type?

                Burglary    No burglary    Total
    House       3           2
    Bungalow    4           1
    Total

    2. Is there a connection between the type of area and fatal traffic accidents (figures in thousands) at the 5% level?

                Fatal    Non-fatal    Total
    Motorway    5        15           20
    Urban       4        24           28
    Rural       3        12           15
    Total       12       51           63

    Solutions

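    1. Degrees of freedom: 1

    Chi-square = 0.476

    For significance at the 5% level, chi-square should be greater than or equal to 3.84.

    0.476 < 3.84, so the result is not significant: there is no evidence of a connection between burglary and house type.

    2. Degrees of freedom: 2

    Chi-square = 0.88

    For significance at the 5% level, chi-square should be greater than or equal to 5.99.

    0.88 < 5.99, so the result is not significant: there is no evidence of a connection between type of area and fatal traffic accidents.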
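
    As a check on solution 1, the working can be written out in full. This sketch assumes the usual rule that each expected value is row total × column total ÷ grand total; the row totals are 5 and 5, the column totals are 7 and 3, and the grand total is 10.

```latex
E_{\text{burglary}} = \frac{5 \times 7}{10} = 3.5
\qquad
E_{\text{no burglary}} = \frac{5 \times 3}{10} = 1.5

\chi^2 = \frac{(3 - 3.5)^2}{3.5} + \frac{(2 - 1.5)^2}{1.5}
       + \frac{(4 - 3.5)^2}{3.5} + \frac{(1 - 1.5)^2}{1.5}
       = 0.071 + 0.167 + 0.071 + 0.167 = 0.476
```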

    Type I and type II errors

    There are four possible conclusions when conducting a significance test:

    True situation    Our conclusion    Outcome
    H0 is true        Accept H0         Correct decision
    H0 is true        Reject H0         Wrong decision (Type I error)
    H0 is false       Accept H0         Wrong decision (Type II error)
    H0 is false       Reject H0         Correct decision

    A type I error is known as a false positive. For example, a court finding a person guilty of a crime they did not commit.

    The probability of a type I error is the same as the significance level.

    A type II error is a false negative. For example, a court finding a person not guilty of a crime they did commit.

    A third type of error has also been proposed, type III: rejecting the null hypothesis for the wrong reason!

    Justice System - Trial

                                          Defendant Innocent    Defendant Guilty
    Reject Presumption of Innocence
    (Guilty Verdict)                      Type I Error          Correct
    Fail to Reject Presumption of
    Innocence (Not Guilty Verdict)        Correct               Type II Error

    Statistics - Hypothesis Test

                                          Null Hypothesis True  Null Hypothesis False
    Reject Null Hypothesis                Type I Error          Correct
    Fail to Reject Null Hypothesis        Correct               Type II Error
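
    The statement that the probability of a type I error equals the significance level can be illustrated by simulation. The sketch below is only an illustration under stated assumptions: it draws samples from a standard normal population for which H0 (mean = 0) really is true, applies a two-tail z-test with known standard deviation 1 and the 5% critical value 1.96, and counts how often H0 is wrongly rejected. The function name, sample size, number of trials and seed are all arbitrary choices.

```python
import random
from math import sqrt


def type_i_error_rate(trials=20000, n=25, z_critical=1.96, seed=1):
    """Fraction of trials in which a true H0 (mean = 0) is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # H0 is true: every observation comes from a normal(0, 1) population
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        sample_mean = sum(sample) / n
        # z = (sample mean - 0) / (sigma / sqrt(n)) with sigma = 1
        z = sample_mean / (1.0 / sqrt(n))
        if abs(z) > z_critical:
            rejections += 1  # a false positive, i.e. a type I error
    return rejections / trials


rate = type_i_error_rate()
print(rate)  # expected to be close to 0.05
```

    Every rejection counted here is a false positive, so the printed proportion should settle near the chosen significance level of 0.05.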

    The angel of death: guilty or not guilty?

    Kristen Gilbert was a nurse on Ward C at the Veterans Affairs Medical Center in Northampton, Massachusetts, USA. She earned the nickname Angel of Death as she was often the first to notice that a patient was going into cardiac arrest. She was calm and competent and would be able to administer the correct drug to save the patient.

    However, there were growing suspicions about her behaviour. There had been a high number of deaths on her particular ward, as well as shortages of the stimulant drug epinephrine (adrenaline), which can be used to cause cardiac arrest.

    A hospital investigation found nothing untoward. Some staff were still concerned, so a

    second investigation took place, this time involving statistician Stephen Gehlbach. Gehlbach

    plotted the annual number of deaths, broken down by shift and year (below). Gilbert started

    to work on Ward C in March 1990 and stopped working at the hospital in February 1996.

    Total deaths at the hospital, by shift and year [source: Devlin & Lorden (2007, p. 16)]

    What pattern does the bar chart show?

    Is there evidence to secure a conviction? Could it be a coincidence? To determine this we

    can use a chi-squared test.

    Here is the data the investigators had:

    Gilbert present    Death on shift
                       Yes    No      Total
    Yes                40     217
    No                 34     1350
    Total

    Perform a chi-squared test to support the following one-tail hypothesis at the 0.01 level (P2):

    HA: Significantly more patients will be found to die on a shift where the subject is

    working than on shifts when the subject is not working.

    State clearly your conclusion.

    What are the implications of the result? (M2)

    How valid are your results? How valid is your interpretation? Is Kristen Gilbert really guilty or not guilty? (D2)

    Bibliography

    K. M. Pyrek (2009). Kristen Gilbert Case Explored in New Book. Forensic Nurse [online]. http://www.forensicnursemag.com/webx/391webx1.html [accessed 22 Jan 2010]

    K. Devlin and G. Lorden (2007). The Numbers behind Numb3rs. Plume: New York.

    Student's t-test

    Student was W. S. Gosset. He published his test under the pseudonym Student because he was working for the brewers Guinness as a statistician and Guinness did not want the competition knowing that they were using statistics to help improve the brewing process.

    The test is used to compare samples from two different batches.

    This may be beer brewed under different circumstances, soil from

    different areas or evidence from two different crime scenes.

    It is usually used with small (

    4. Calculate the standard deviation of the differences

    $s = \sqrt{\frac{\sum (D - \bar{D})^2}{n - 1}} = 1.51$

    5. Calculate the standard error

    $SE = \frac{s}{\sqrt{n}} = \frac{1.51}{\sqrt{10}} = 0.478$

    6. Calculate the value of t

    $t = \frac{\bar{D}}{SE} = \frac{\bar{D}}{0.478} = 5.0$

    7. Calculate the number of degrees of freedom and find the critical value

    Number of pairs of data − 1 = n − 1

    10 − 1 = 9

    8. From the table, with 9 degrees of freedom, one-tail at the 0.05 level: t_critical = 1.833

    9. Determine if there is a difference or not

    t > t_critical (5.0 > 1.833)

    So, the null hypothesis is rejected and the alternative hypothesis is accepted.

    The ninhydrin does make a positive difference.

    t-test for matched pairs

    1. Set up the null and alternative hypotheses (H0 and HA) and determine if it is a one- or two-tail test.

    2. Calculate the differences between the pairs in the samples (D).

    3. Calculate the mean of the differences: $\bar{D} = \frac{\sum D}{n}$

    4. Calculate the standard deviation of the differences: $s = \sqrt{\frac{\sum (D - \bar{D})^2}{n - 1}}$

    5. Calculate the standard error of the differences: $SE = \frac{s}{\sqrt{n}}$

    6. Calculate the value of t: $t = \frac{\bar{D}}{SE}$

    7. Calculate the number of degrees of freedom: number of pairs of data − 1 = n − 1

    8. Find the critical value from the table.

    9. Determine if there is a difference or not.

    If t < critical value then there is no significant difference between the two sets of data and the null hypothesis is accepted.

    If t ≥ critical value then the null hypothesis is rejected: the two sets of data differ significantly.
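
    The matched-pairs steps can be sketched in Python as follows. This is an illustration rather than a definitive implementation: the function name and the before/after readings are hypothetical, and the standard deviation is taken with the n − 1 divisor (which is what `statistics.stdev` computes).

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(before, after):
    """Return (t, degrees of freedom) for matched pairs."""
    differences = [a - b for b, a in zip(before, after)]  # step 2: D
    n = len(differences)
    d_bar = mean(differences)                             # step 3: mean of D
    s = stdev(differences)                                # step 4: sd, n - 1 divisor
    se = s / sqrt(n)                                      # step 5: standard error
    t = d_bar / se                                        # step 6: t
    return t, n - 1                                       # step 7: n - 1


# Hypothetical readings, invented for illustration only
before = [10, 12, 9, 11, 13]
after = [12, 15, 10, 14, 15]

t, df = paired_t(before, after)
print(round(t, 2), df)
```

    The value of t is then compared with the critical value from the table for the returned degrees of freedom (steps 8 and 9).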

    Independent samples

    If there is no before and after relationship between the samples then the independent

    samples test is used.

    $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$

    Example

    Some brown dog hairs were found on the clothing of a victim at a crime scene involving a

    dog.

    Five of the hairs were measured and their widths were 46, 57, 54, 51 and 38 µm.

    A suspect is the owner of a dog with similar brown hairs. A sample of the hairs has been taken and their widths measured: 31, 35, 50, 35 and 36 µm.

    Is it possible that the hairs found on the victim were left by the suspect's dog? Test at the 5% level.

    [From D. Lucy, Introduction to Statistics for Forensic Scientists. Chichester: Wiley, 2005, p. 44.]

    Solution

    1. Calculate the mean and standard deviation for the data sets $x_1$ and $x_2$

                          Dog A    Dog B
                          46       31
                          57       35
                          54       50
                          51       35
                          38       36
    Total                 246      187
    Mean                  49.2     37.4
    Standard deviation    7.463    7.301

    2. Calculate the magnitude of the difference between the two means, $|\bar{x}_1 - \bar{x}_2|$

    49.2 − 37.4 = 11.8

    3. Calculate the standard error in the difference, $\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$

    $\sqrt{\frac{7.463^2}{5} + \frac{7.301^2}{5}} = \sqrt{11.14 + 10.66} = 4.669 \approx 4.67$ (3 sf)

    4. Calculate the value of t:

    t = difference between the means ÷ standard error in the difference

    11.8 ÷ 4.669 = 2.527 ≈ 2.53 (3 sig fig)

    5. Calculate the degrees of freedom = $n_1 + n_2 - 2$

    5 + 5 − 2 = 8

    6. Find the critical value from the table for the significance level you are working to.

    At the 0.05 level t_crit = 2.306

    If t < critical value then there is no significant difference between the two sets of data

    If t ≥ critical value then there is a significant difference between the two sets of data

    2.53 > 2.306, so at the 0.05 level there is a significant difference between the two data sets.

    So it is unlikely that the hairs found on the victim came from the suspect's dog.
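
    The dog hair calculation can be reproduced with a few lines of Python. This is a minimal sketch: the function and variable names are illustrative, and `statistics.stdev` is used because it applies the n − 1 divisor that gives the standard deviations 7.463 and 7.301 quoted in the table.

```python
from math import sqrt
from statistics import mean, stdev


def independent_t(sample_1, sample_2):
    """Return (t, degrees of freedom) for two independent samples."""
    n1, n2 = len(sample_1), len(sample_2)
    # Step 2: magnitude of the difference between the means
    difference = abs(mean(sample_1) - mean(sample_2))
    # Step 3: standard error in the difference
    se = sqrt(stdev(sample_1) ** 2 / n1 + stdev(sample_2) ** 2 / n2)
    # Steps 4 and 5: t and degrees of freedom
    return difference / se, n1 + n2 - 2


crime_scene = [46, 57, 54, 51, 38]  # hair widths in micrometres
suspect_dog = [31, 35, 50, 35, 36]

t, df = independent_t(crime_scene, suspect_dog)
print(round(t, 3), df)  # 2.527 8
print(t > 2.306)        # True: exceeds the 0.05 critical value for 8 degrees of freedom
```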

    Independent t-test

    1. Calculate the mean and standard deviation for the data sets $x_1$ and $x_2$.

    2. Calculate the magnitude of the difference between the two means: $|\bar{x}_1 - \bar{x}_2|$

    3. Calculate the standard error in the difference: $\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$

    4. Calculate the value of t: t = difference between the means ÷ standard error in the difference [step 2 ÷ step 3]

    5. Calculate the degrees of freedom = $n_1 + n_2 - 2$

    6. Find the critical value for the particular significance level you are working to.

    7. If t < critical value then there is no significant difference between the two sets of data and the null hypothesis is accepted.

    If t ≥ critical value then the null hypothesis is rejected: the two sets of data differ significantly.

    STATISTICAL TABLES
