Sampling distributions, p-value, significance & confidence

  • 8/7/2019 Sampling distributions, p-value, significance & confidence

    Sampling Distributions, p-value, Significance & Confidence Levels

    Data Analysis: Tests of significance based on the T, F & Z distributions and the Chi-Square test

    Sampling & Sampling Distributions

    Characteristics of a sample are called statistics.

    Characteristics of a population are called parameters.

    We try to estimate the population parameter from the sample statistic.

    Estimates are subject to two types of error: Type I and Type II.

    Characteristic symbols:

    Characteristic     | Population | Sample
    Size               | N          | n
    Mean               | μ          | x-bar
    Standard deviation | σ          | s

    Sampling Distributions: The Concept

    Suppose you, a team of 4 students, have been asked to collect a sample of 40 from a city with a population of 1,00,000 using a probability sampling method.

    Each student collects a sample of 10 men aged 20-25.

    Your objective is to find the mean height of the samples so as to infer the mean height of the population.

    For each sample, the mean height and standard deviation are calculated.

    The mean heights and standard deviations for the 4 samples are different.

    A probability distribution of all the possible means of the samples is called the sampling distribution of the mean.
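
    The four-student exercise can be imitated with a small simulation. This is only a sketch: the population of heights is made up (normal, mean 170 cm, standard deviation 7 cm), and the names population and samples are illustrative, not part of the slides.

```python
import random
from statistics import mean, stdev

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: heights (cm) of 100,000 men; the 170 and 7 are made-up values.
population = [rng.gauss(170.0, 7.0) for _ in range(100_000)]

# Each of the 4 students draws a simple random sample of 10 men.
samples = [rng.sample(population, 10) for _ in range(4)]

for i, s in enumerate(samples, start=1):
    print(f"student {i}: mean = {mean(s):.1f} cm, sd = {stdev(s):.1f} cm")

sample_means = [mean(s) for s in samples]
print(f"population mean = {mean(population):.1f} cm")
```

    Each student gets a different mean and standard deviation, although all four samples come from the same population.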

    Examples of Population, Sample, Sample Statistic & Sampling Distributions

    Population | Sample | Sample statistic | Sampling distribution
    Water in a river | 10 gallons of water | Mean no. of parts of impurity per million parts of water | Sampling distribution of the mean
    All IPL teams | Group of 5 players | Median height | Sampling distribution of the median
    All parts produced by a manufacturing process | 50 parts | Proportion defective | Sampling distribution of the proportion

    Standard Error (2)

    Each sample would give a different sample mean.

    This variability in the sample statistic is due to chance, i.e. the differences are solely due to the elements we happened to choose for the samples.

    The standard deviation of the distribution of sample means measures the extent to which we expect the means from different samples to vary because of this chance error in the sampling process. So it is called the standard error.

    The standard error indicates not only the size of the chance error, but also the accuracy we are likely to get if we use a sample statistic to estimate a population parameter.

    A distribution of sample means with less spread (a smaller standard error) gives a better estimate of the population parameter.

    Sampling & Normal Distribution

    Experience of Five Bike Owners with Tyre Life

    Owner              | C | D | E | F | G
    Tyre life (months) | 3 | 3 | 7 | 9 | 14

    The population consists of only five people. We will take all possible samples of the owners in groups of 3.

    Compute the sample means x-bar, then compute the mean of the sampling distribution.

    Sampling & Normal Distribution

    Calculation of Sample Mean Tyre Life, n = 3

    The table with the calculation is shown below:

    Sample of three | Sample data | Sum | Sample mean
    EFG             | 7+9+14      | 30  | 10
    DFG             | 3+9+14      | 26  | 8.6667
    DEG             | 3+7+14      | 24  | 8
    DEF             | 3+7+9       | 19  | 6.3333
    CFG             | 3+9+14      | 26  | 8.6667
    CEG             | 3+7+14      | 24  | 8
    CEF             | 3+7+9       | 19  | 6.3333
    CDF             | 3+3+9       | 15  | 5
    CDE             | 3+3+7       | 13  | 4.3333
    CDG             | 3+3+14      | 20  | 6.6667
    Total           |             |     | 72

    Mean of the sampling distribution = 72/10 = 7.2
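
    The enumeration of all samples of three owners can be reproduced in a few lines. A minimal sketch using the tyre-life data from the slide; the variable names are illustrative, and the last line measures the spread of the sample means, i.e. their standard error.

```python
from itertools import combinations
from statistics import mean, pstdev

tyre_life = {"C": 3, "D": 3, "E": 7, "F": 9, "G": 14}  # months, from the slide

# Every possible sample of 3 owners out of 5 (10 samples in all).
sample_means = {}
for owners in combinations(tyre_life, 3):
    values = [tyre_life[o] for o in owners]
    sample_means["".join(owners)] = mean(values)

for name, m in sample_means.items():
    print(f"{name}: {m:.4f}")

means = list(sample_means.values())
mean_of_means = mean(means)
population_mean = mean(tyre_life.values())
standard_error = pstdev(means)  # spread of the sample means around their own mean

print(f"mean of the sampling distribution = {mean_of_means:.1f}")
print(f"population mean = {population_mean:.1f}")
```

    The mean of the sampling distribution matches the population mean (36/5 = 7.2).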

    Sampling & Normal Distribution

    [Figure: population distribution, and sampling distribution of the mean with n=3. Both plot probability against tyre life in months.]

    Sampling & Normal Distribution

    [Figure: sampling distribution of the mean with n=2, and sampling distribution of the mean with n=4. Both plot probability against tyre life in months.]

    Sampling & Normal Distribution

    [Figure: sampling distribution of the mean with n=8, and sampling distribution of the mean with n=20, plotted as probability.]

    If the population size is increased to 40 and we take larger sample sizes of 8 and 20:

    Calculate x-bar and s

    Plot the distributions
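
    The effect of a larger sample size can be checked numerically. The sketch below assumes a made-up population of 40 tyre lives (random whole months between 3 and 14) and draws repeated random samples instead of enumerating every one; it prints the spread of the sample means rather than plotting it.

```python
import random
from statistics import mean, pstdev

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical population of 40 tyre lives in months (made-up values).
population = [rng.randint(3, 14) for _ in range(40)]

def sample_means(n, draws=2000):
    """Means of `draws` random samples of size n, drawn without replacement."""
    return [mean(rng.sample(population, n)) for _ in range(draws)]

spread = {n: pstdev(sample_means(n)) for n in (8, 20)}
for n, sd in spread.items():
    print(f"n={n}: spread of sample means = {sd:.3f}")
```

    The sample means cluster more tightly around the population mean for n=20 than for n=8.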

    P-value

    Consider an experiment where you've measured values in two samples, and the means are different. How sure are you that the population means are different as well? There are two possibilities:

    The populations have different means.

    The populations have the same mean, and the difference you observed is a coincidence of random sampling.

    The P value is a probability, with a value ranging from zero to one.

    It is the answer to this question: If the populations really have the same mean overall, what is the probability that random sampling would lead to a difference between sample means as large as (or larger than) the one you observed?
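
    One way to make this question concrete is an exact permutation test, which is not named in the slides and is used here only as an illustration. It assumes the two populations are identical, so every split of the pooled values into two groups is equally likely. The measurements are made-up numbers.

```python
from itertools import combinations
from statistics import mean

# Hypothetical measurements for two samples (made-up numbers).
group_a = [12.1, 14.3, 13.8, 15.2, 14.9]
group_b = [11.0, 12.4, 11.9, 13.1, 12.2]

observed = abs(mean(group_a) - mean(group_b))
pooled = group_a + group_b
n_a = len(group_a)

# Count the splits of the pooled values whose difference in means is at
# least as large as the observed one (small tolerance for float rounding).
extreme = 0
total = 0
for idx in combinations(range(len(pooled)), n_a):
    chosen = set(idx)
    a = [pooled[i] for i in chosen]
    b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
    if abs(mean(a) - mean(b)) >= observed - 1e-12:
        extreme += 1
    total += 1

p_value = extreme / total
print(f"observed difference = {observed:.2f}, P = {p_value:.4f}")
```

    The P value is the fraction of splits that produce a difference at least as large as the one observed.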

    P value

    Many people misunderstand what question a P value answers.

    If the P value is 0.03, that means that there is a 3% chance of observing a difference as large as you observed even if the two population means are identical.

    It is tempting to conclude, therefore, that there is a 97% chance that the difference you observed reflects a real difference between populations and a 3% chance that the difference is due to chance. Wrong.

    What you can say is that random sampling from identical populations would lead to a difference smaller than you observed in 97% of experiments and larger than you observed in 3% of experiments.

    You have to choose. Would you rather believe in a 3% coincidence? Or that the population means are really different?

    Statistical hypothesis testing

    The P value is a fraction.

    The steps of statistical hypothesis testing are:

    Set a threshold P value before you do the experiment. The threshold value (called alpha) is traditionally almost always set to 0.05.

    Define the null hypothesis. If you are comparing two means, the null hypothesis is that the two populations have the same mean.

    Do the appropriate statistical test to compute the P value.

    Compare the P value to the preset threshold value.

    If the P value is less than the threshold, state that you "reject the null hypothesis" and that the difference is "statistically significant".

    If the P value is greater than the threshold, state that you "do not reject the null hypothesis" and that the difference is "not statistically significant".
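
    The final comparison step is simple enough to write down directly. A minimal sketch; the function name decide is illustrative, and the P value is assumed to come from whichever statistical test is appropriate.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Compare a P value with the preset threshold alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a P value must lie between 0 and 1")
    if p_value < alpha:
        return "reject the null hypothesis (statistically significant)"
    return "do not reject the null hypothesis (not statistically significant)"

print(decide(0.03))  # below the 0.05 threshold
print(decide(0.20))  # above the 0.05 threshold
```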

    Significance Level

    The term significant is seductive, and it is easy to misinterpret it.

    A result is said to be statistically significant when the result would be surprising if the populations were really identical. A result is said to be statistically significant when the P value is less than a preset threshold value.

    It is easy to read far too much into the word significant because the statistical use of the word has a meaning entirely distinct from its usual meaning. Just because a difference is statistically significant does not mean that it is important or interesting.

    And a result that is not statistically significant (in the first experiment) may turn out to be very important.

    Significance Level

    If a result is statistically significant, there are two possible

    explanations:

    The populations are identical, so there really is no difference. You

    happened to randomly obtain larger values in one group and

    smaller values in the other, and the difference was large enough to

    generate a P value less than the threshold you set. Finding a

    statistically significant result when the populations are identical is

    called making a Type I error.

    The populations really are different, so your conclusion is correct.

    Significance Level

    There are also two explanations for a result that is not statistically

    significant:

    The populations are identical, so there really is no difference. Any

    difference you observed in the experiment was a coincidence. Your

    conclusion of no significant difference is correct.

    The populations really are different, but you missed the difference

    due to some combination of small sample size, high variability and

    bad luck. The difference in your experiment was not large enough to

    be statistically significant. Finding results that are not statistically

    significant when the populations are different is called making a Type

    II error.
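
    Both kinds of error can be seen in a simulation. This is a sketch under stated assumptions, none of which come from the slides: normally distributed data, a known standard deviation of 1, 20 values per group, a two-sample z-test, and a true difference of 0.5 for the "populations really are different" case.

```python
import random
from statistics import NormalDist, mean

def two_sample_z_p(a, b, sigma):
    """Two-sided P value for a difference in means, assuming sigma is known."""
    se = sigma * (1 / len(a) + 1 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def rejection_rate(true_diff, n=20, sigma=1.0, alpha=0.05, trials=2000, seed=0):
    """Fraction of simulated experiments in which the null hypothesis is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        a = [rng.gauss(0.0, sigma) for _ in range(n)]
        b = [rng.gauss(true_diff, sigma) for _ in range(n)]
        if two_sample_z_p(a, b, sigma) < alpha:
            rejections += 1
    return rejections / trials

# Identical populations: every rejection is a Type I error.
type1_rate = rejection_rate(true_diff=0.0)
# Different populations: every failure to reject is a Type II error.
type2_rate = 1 - rejection_rate(true_diff=0.5)

print(f"Type I error rate  = {type1_rate:.3f}")
print(f"Type II error rate = {type2_rate:.3f}")
```

    With identical populations the Type I error rate stays close to the threshold alpha. The Type II error rate depends on the sample size, the variability and the size of the true difference.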

    Confidence interval of a mean

    Statistical calculations produce two kinds of results that

    help you make inferences about the populations from the

    samples. You've already learned about P values. The

    second kind of result is a confidence interval.

    95% confidence interval of a mean

    Although the calculation is exact, the mean you calculate

    from a sample is only an estimate of the population mean.

    How good is the estimate? It depends on how large your

    sample is and how much the values differ from one

    another.

    Statistical calculations combine sample size and variability to generate a

    confidence interval for the population mean.

    You can calculate intervals for any desired degree of confidence, but 95%

    confidence intervals are used most commonly. If you assume that your

    sample is randomly selected from some population, you can be 95% sure

    that the confidence interval includes the population mean.

    More precisely, if you generate many 95% CIs from many data sets, you expect the CIs to include the true population mean in 95% of the cases and not to include the true mean value in the other 5%.

    Since you don't know the population mean, you'll never know for sure

    whether or not your confidence interval contains the true mean.
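
    The "95% of the cases" statement can be checked by simulation, where the true mean is known by construction. A sketch only: the population (normal, mean 50, standard deviation 10) and the sample size of 25 are made up, and the interval uses the known standard deviation with a normal critical value.

```python
import random
from statistics import NormalDist, mean

rng = random.Random(7)  # fixed seed so the run is repeatable
true_mean, sigma, n = 50.0, 10.0, 25   # hypothetical population and sample size
z = NormalDist().inv_cdf(0.975)        # about 1.96 for 95% confidence
half_width = z * sigma / n ** 0.5

trials = 2000
covered = 0
for _ in range(trials):
    sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
    m = mean(sample)
    # Does this experiment's interval contain the true mean?
    if m - half_width <= true_mean <= m + half_width:
        covered += 1

coverage = covered / trials
print(f"{coverage:.1%} of the intervals contain the true mean")
```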

    Confidence interval of a mean

    Why 95%?

    There is nothing special about 95%. It is just convention that

    confidence intervals are usually calculated for 95% confidence.

    In theory, confidence intervals can be computed for any degree

    of confidence. If you want more confidence, the intervals will be

    wider. If you are willing to accept less confidence, the intervals

    will be narrower.
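
    The trade-off between confidence and width shows up when the same sample is used at several confidence levels. A sketch with made-up data; it uses the normal approximation for the critical value, whereas a small sample like this one would normally call for the t distribution, which gives slightly wider intervals.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical sample of measurements (made-up numbers).
sample = [26.1, 24.8, 27.3, 25.5, 26.9, 25.2, 26.4, 27.0, 24.9, 26.6]
m = mean(sample)
se = stdev(sample) / len(sample) ** 0.5  # estimated standard error of the mean

widths = {}
for level in (0.90, 0.95, 0.99):
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided critical value
    lo, hi = m - z * se, m + z * se
    widths[level] = hi - lo
    print(f"{level:.0%} CI: {lo:.2f} to {hi:.2f}")
```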

    Sampling Distribution of the mean when the population is normally distributed

    The sampling distribution has a mean equal to the population mean:

    $\mu_{\bar{x}} = \mu$

    The sampling distribution has a standard deviation (standard error) equal to the population standard deviation divided by the square root of the sample size:

    $\sigma_{\bar{x}} = \sigma / \sqrt{n}$
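
    Both properties can be checked by simulation. A sketch assuming a made-up normal population with mean 100 and standard deviation 10, and samples of size 25, so the standard error should be close to 10 / 5 = 2.

```python
import random
from statistics import mean, pstdev

rng = random.Random(3)  # fixed seed so the run is repeatable
mu, sigma, n = 100.0, 10.0, 25   # hypothetical population parameters and sample size

# Draw 5000 samples of size n and record each sample mean.
sample_means = [
    mean([rng.gauss(mu, sigma) for _ in range(n)])
    for _ in range(5000)
]

print(f"mean of sample means = {mean(sample_means):.2f} (population mean {mu})")
print(f"sd of sample means   = {pstdev(sample_means):.2f} (sigma / sqrt(n) = {sigma / n ** 0.5})")
```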

    Test of Significance or Test of Hypothesis

    The theory of hypothesis testing begins with an assumption about a parameter of the population.

    The assumption is termed a hypothesis, made on the basis of sample observation.

    The validity of the hypothesis is tested by analyzing the sample.

    The procedure is called a Test of Significance or Test of Hypothesis.

    The conventional approach is to set up two different hypotheses, constructed so that if one hypothesis is accepted, the other is rejected, and vice versa.

    The hypotheses are: i) Null Hypothesis ($H_0$) ii) Alternate Hypothesis ($H_a$)

    Example:

    $H_0: \mu = 100$, $H_a: \mu \neq 100$

    Methods to Test Hypothesis

    Z-Test

    T-Test

    F-Test

    Chi-Square Test

    ANOVA

    Z-value

    For a sample: $z = \dfrac{x - \bar{x}}{s}$

    For a population: $z = \dfrac{x - \mu}{\sigma}$

    Z = (difference between observed value and expected value) / standard deviation

    The z-value tells us how many standard deviations above or below the mean our data value x is.

    Positive z-values are above the mean.

    Negative z-values are below the mean.

    Z-value

    The area covered between the mean and the Z-value is the probability.

    For every Z-value, there is a table which gives the corresponding area and therefore the probability.

    Probability = 0.4875, meaning that a proportion 0.4875 of values would lie between the mean and the limit specified by z = 2.24.

    [Figure: normal curve with 0.4875 of the area between the mean and Z = 2.24.]
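
    The table lookup can be reproduced with the standard normal distribution in Python's standard library. A small sketch; the variable names are illustrative.

```python
from statistics import NormalDist

z = 2.24
# Area under the standard normal curve between the mean (z = 0) and z.
area = NormalDist().cdf(z) - 0.5
print(f"area between the mean and z = {z}: {area:.4f}")
```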

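    Standard Deviations & Z-Value

    [Figure: normal curves showing the central area for each Z-value.]

    Z = 1 (between -1σ and +1σ): 0.6827 of area

    Z = 2 (between -2σ and +2σ): 0.9545 of area

    Z = 3 (between -3σ and +3σ): 0.9973 of area

    Z = 1.64 (between -1.64σ and +1.64σ): about 0.90 of area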
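
    The central areas for these Z-values can be computed from the standard normal distribution instead of being read from a table. A minimal sketch; central_area is an illustrative helper name.

```python
from statistics import NormalDist

def central_area(z: float) -> float:
    """Area under the standard normal curve between -z and +z."""
    std_normal = NormalDist()
    return std_normal.cdf(z) - std_normal.cdf(-z)

for z in (1, 1.64, 2, 3):
    print(f"Z = {z}: {central_area(z):.4f} of area")
```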

    P = 0.05 and Z-value

    [Figure: standard normal curve with the acceptance region between Z = -1.96 and Z = +1.96 and a region of rejection in each tail.]
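
    The rejection regions translate into a simple rule: reject the null hypothesis when the Z-value falls outside -1.96 to +1.96. The sketch below applies it to a one-sample test of a hypothesised mean of 100. The sample mean of 103, the known standard deviation of 10 and the sample size of 50 are made-up numbers, and z_test is an illustrative name.

```python
from statistics import NormalDist

def z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-sided one-sample z-test, assuming sigma is known."""
    z = (sample_mean - mu0) / (sigma / n ** 0.5)
    critical = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 when alpha = 0.05
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, critical, p_value, abs(z) > critical

z, critical, p_value, reject = z_test(sample_mean=103.0, mu0=100.0, sigma=10.0, n=50)
print(f"z = {z:.2f}, critical value = {critical:.2f}, P = {p_value:.3f}")
print("in the region of rejection" if reject else "in the acceptance region")
```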

    Z-value example

    For a sample of females, the mean BMI (body mass index) was 26.20 and the standard deviation was 6.57.

    A person with a BMI of 19.2 has a z score of:

    $z = \dfrac{x - \bar{x}}{s} = \dfrac{19.2 - 26.20}{6.57} = -1.07$

    So this person has a BMI 1.07 standard deviations below the mean.
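
    The same arithmetic as a short check, using the figures from the slide; the function name z_score is illustrative.

```python
def z_score(x: float, sample_mean: float, sample_sd: float) -> float:
    """Number of standard deviations by which x lies above (+) or below (-) the mean."""
    return (x - sample_mean) / sample_sd

z = z_score(19.2, 26.20, 6.57)
print(f"z = {z:.2f}")
```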