Normality,Sampling & Hypothesis Testing and sample size estimation Jobayer Hossain, PhD Larry...

Normality, Sampling & Hypothesis Testing and Sample Size Estimation

Jobayer Hossain, PhD

Larry Holmes, Jr, PhD

October 23, 2008

RESEARCH STATISTICS

Bell-shaped Histogram

The left half of a bell-shaped (symmetric) histogram is the mirror image of the right half.

Normal Distribution

The Normal Distribution is a density curve based on the following formula:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty$$

– It’s completely defined by two parameters: the mean µ and the standard deviation σ.

A density function describes the overall pattern of a distribution.

The total area under the curve is always 1.0.

The normal distribution is symmetrical.

– What does this mean? The mean, median, and mode are all the same.

The beauty of the Normal Distribution

The 68-95-99.7 Rule :

In the normal distribution with mean µ and standard deviation σ:

68% of the observations fall within σ of the mean µ.

95% of the observations fall within 2σ of the mean µ.

99.7% of the observations fall within 3σ of the mean µ.

No matter what µ (mean) and σ (standard deviation) are, the area between µ-σ and µ+σ is about 68%; the area between µ-2σ and µ+2σ is about 95%; and the area between µ-3σ and µ+3σ is about 99.7%.

Almost all values fall within 3 standard deviations. This is called the 68-95-99.7 rule.
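The three percentages can be checked numerically. The sketch below is only an illustration: it uses the standard identity P(|Z| < k) = erf(k/√2) for a standard normal Z, and the function name `prob_within` is made up for this example.

```python
import math

def prob_within(k: float) -> float:
    """Probability that a normal variable falls within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

# The result does not depend on the particular mean or standard deviation.
for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {prob_within(k):.4f}")
```

The printed values round to 68%, 95%, and 99.7%.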

68-95-99.7 Rule

[Figure: normal distribution curve marked at µ±σ, µ±2σ, and µ±3σ, with 68%, 95%, and 99.7% of the data inside the respective ranges. Graph illustrating normal distribution by SDs. Credit: SU]

Normal Distribution

Standardizing and z-Scores

If x is an observation from a distribution that has mean µ and standard deviation σ, the standardized value of x is

$$z = \frac{x - \mu}{\sigma}$$

A standardized value is often called a z-score. If x is a normal variable with mean µ and standard deviation σ, then z is a standard normal variable with mean 0 and standard deviation 1.

The density function of a standard normal variable is

$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}, \quad -\infty < z < \infty$$
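A short sketch of standardizing a single observation may help. The numbers (a reading of 150 from a distribution with mean 130 and standard deviation 10) are hypothetical and chosen only for illustration. `NormalDist` is from the Python standard library.

```python
from statistics import NormalDist

def z_score(x: float, mu: float, sigma: float) -> float:
    """Standardized value of x for a distribution with mean mu and SD sigma."""
    return (x - mu) / sigma

# Hypothetical values, not taken from the slides.
z = z_score(150, mu=130, sigma=10)

# Proportion of a normal population lying below this observation.
proportion_below = NormalDist().cdf(z)
print(z, round(proportion_below, 4))
```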

Normal Distribution

Let x1, x2, …, xn be n independent normal random variables, each with mean µ and standard deviation σ. Then their sum ∑xi is also normal, with mean nµ and standard deviation σ√n. The distribution of the sample mean x̄ is also normal, with mean µ and standard deviation σ/√n.

The standardized score of the mean is

$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$

The mean of this standardized random variable is 0 and its standard deviation is 1.
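The claim that the sample mean has standard deviation σ/√n can be illustrated by simulation. This is a rough sketch under assumed values (µ = 130, σ = 10, n = 25, 2000 repetitions, a fixed seed); none of these numbers come from the slides.

```python
import random
import statistics

def simulate_sample_means(mu, sigma, n, reps, seed=0):
    """Draw `reps` samples of size n from Normal(mu, sigma); return each sample's mean."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(reps)
    ]

means = simulate_sample_means(mu=130, sigma=10, n=25, reps=2000)
print(statistics.fmean(means))   # should be close to mu = 130
print(statistics.stdev(means))   # should be close to sigma / sqrt(n) = 2
```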

Are the data normally distributed?

1. Look at the histogram! Does it appear bell shaped?

2. Compute descriptive summary measures—are mean, median, and mode similar?

3. Do 2/3 of observations lie within 1 std dev of the mean? Do 95% of observations

lie within 2 std dev of the mean?

4. Look at a normal probability plot—is it approximately linear?

5. Or look at a normal quantile (q-q) plot.

6. Run tests of normality (such as the Kolmogorov-Smirnov (K-S) or Shapiro-Wilk W statistic).

• To perform a K-S test or Shapiro-Wilk test for normality in SPSS: Analyze -> Descriptive statistics -> Explore -> select the variable in the dependent list -> select Plots -> select Normality plots with tests -> Continue -> OK
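Checks 2 and 3 in the list above are easy to compute by hand. The following sketch uses only the Python standard library and a simulated sample. The helper name `quick_normality_checks` and the data are hypothetical, and the sketch is not a substitute for the formal tests in step 6.

```python
import random
import statistics

def quick_normality_checks(data):
    """Compare mean with median, and count observations within 1 and 2 SDs of the mean."""
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)

    def share_within(k):
        return sum(abs(x - mean) <= k * sd for x in data) / len(data)

    return {
        "mean": mean,
        "median": statistics.median(data),
        "within_1sd": share_within(1),   # roughly 2/3 for normal data
        "within_2sd": share_within(2),   # roughly 95% for normal data
    }

rng = random.Random(1)
sample = [rng.gauss(0, 1) for _ in range(1000)]  # simulated, not real data
checks = quick_normality_checks(sample)
print(checks)
```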

Normal quantile plot

[Figure: Normal q-q plot. x-axis: Theoretical Quantiles; y-axis: Sample Quantiles.]

q-q plot of 100 sample observations from a normal distribution with mean 0 and standard deviation 1

If the points lie on or close to a straight diagonal line, this indicates that the data are normal.

Points far away from the overall pattern indicate outliers.

Systematic deviations from a straight line indicate deviation from normality.
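The coordinates behind a q-q plot can be sketched without any plotting library. The plotting positions (i + 0.5)/n used here are one common convention and are an assumption, as are the helper names and the simulated data. A correlation close to 1 between the two sets of quantiles corresponds to points lying near a straight line.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def qq_points(data):
    """Pair each sorted observation with the matching standard normal quantile."""
    n = len(data)
    ordered = sorted(data)
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

def pearson_r(pairs):
    """Pearson correlation between the x and y coordinates of the points."""
    xs, ys = zip(*pairs)
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(42)
sample = [rng.gauss(0, 1) for _ in range(100)]  # simulated normal observations
points = qq_points(sample)
r = pearson_r(points)
print(round(r, 3))
```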

Population and Sample

Population: The entire collection of individuals, objects or measurements that we want information about.

Sample: A subset (part) of the population that we select to examine in order to gather information.

– The primary objective is to create a sample whose distribution is similar to the distribution of the population, that is, a subset of the population whose center, spread, and shape are as close as possible to those of the population.

– Methods of sampling: random sampling, stratified sampling, systematic sampling, cluster sampling, multistage sampling, area sampling, quota sampling, etc.


Random Sample: A simple random sample of size n from a population is a subset of n elements from that population, chosen in such a way that every possible subset of n units has the same chance of being selected. As a consequence, every unit of the population has the same chance of being selected.

Example: Consider a population of 5 numbers (1, 2, 3, 4, 5).

How many random samples (without replacement) of size 2 can

we draw from this population ?

(1,2), (1,3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (3,4), (3,5), (4,5)


The population mean of the five numbers in the previous slide is 3.

The averages of the 10 samples of size 2 are 1.5, 2, 2.5, 3, 2.5, 3, 3.5, 3.5, 4, 4.5. The mean of these 10 averages is (1.5 + 2 + 2.5 + 3 + 2.5 + 3 + 3.5 + 3.5 + 4 + 4.5)/10 = 3, which is the same as the population mean.
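The enumeration in this example can be reproduced in a few lines. This sketch lists every sample of size 2 drawn without replacement from (1, 2, 3, 4, 5) and averages the sample means; the variable names are only illustrative.

```python
from itertools import combinations
from statistics import fmean

population = [1, 2, 3, 4, 5]

# Every possible sample of size 2 without replacement.
samples = list(combinations(population, 2))
sample_means = [fmean(s) for s in samples]

print(len(samples))          # number of possible samples
print(sample_means)          # the average of each sample
print(fmean(sample_means))   # mean of the sample averages
print(fmean(population))     # population mean
```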

Why do we need randomness in sampling?

It reduces the possibility of subjective and other biases.

The mean and variance (computed with divisor n − 1) of a random sample are unbiased estimates of the population mean and variance, respectively.

Sampling error and bias

Sampling Variability and standard error

If we repeat an experiment or measurement on the same number of subjects, the statistic varies as the sample varies. This variability is known as sampling variability.

Standard error (SE) measures the sampling variability or the precision of an estimate.

– It indicates how precisely one can estimate a population value from a given sample.

– For a large sample, approximately 68% of the time the sample estimate will be within one SE of the population value.

Parameter vs Statistic

Parameter:

– Any statistical characteristic of a population.

– Population mean, population median, population standard deviation, and difference of two population means are examples of parameters.

e.g. The mean systolic BP of all AIDHC employees is 112 mm Hg.

– Parameters describe the distribution of a population

– Parameters are fixed and usually unknown

Parameter vs Statistic

Statistic: Any statistical characteristic of a sample.

– Sample mean, sample median, sample standard deviation, sample proportion, odds ratio, and sample correlation coefficient are some examples of statistics.

– e.g. The mean systolic BP of a sample of 50 AIDHC employees, or the difference of mean systolic BP for a sample of 25 women and 25 men at AIDHC.

– Statistics describe the distribution of a sample

– The value of a statistic is known and varies between samples

– Statistics are used for making inferences about parameters

Statistical Inference

Statistical inference is the process by which we acquire information about populations from samples.

Two types of estimates for making inferences:

– Point estimation, e.g. mean SBP

– Interval estimation, e.g. CI

[Figure: statistical inference runs from the sample to the population.]

Elements/Steps in hypothesis testing

Hypothesis testing steps:

– 1. Null (Ho) and alternative (H1) hypothesis specification

– 2. Selection of significance level (alpha): 0.05 or 0.01

– 3. Calculating the test statistic, e.g. t, F, Chi-square

– 4. Calculating the probability value (p-value) or confidence interval

– 5. Describing the result and statistic in an understandable way.
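The five steps can be sketched end to end with a one-sample z-test. This is an illustration under loud assumptions: the population standard deviation is treated as known (σ = 20), the sample size (n = 25) and sample mean (120) are hypothetical, and a z statistic is used instead of t, F, or Chi-square only because it needs nothing beyond the Python standard library.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z_test(sample_mean, mu0, sigma, n):
    """Two-sided z-test of H0: mu = mu0 against H1: mu != mu0 (sigma assumed known)."""
    se = sigma / sqrt(n)                   # standard error of the mean
    z = (sample_mean - mu0) / se           # step 3: test statistic
    p = 2 * NormalDist().cdf(-abs(z))      # step 4: two-sided p-value
    return z, p

alpha = 0.05                               # step 2: significance level
z, p = one_sample_z_test(sample_mean=120, mu0=130, sigma=20, n=25)  # hypothetical data
reject_h0 = p < alpha

# Step 5: report the result in plain terms.
print(f"z = {z:.2f}, p = {p:.4f}, reject H0 at alpha = {alpha}: {reject_h0}")
```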

What is a Hypothesis?

A hypothesis is an assumption about the population parameter.

– A parameter is a characteristic of the population, like its mean or variance.

– The parameter (e.g. the mean) must be identified before analysis.

Example: we assume the mean SBP of men at AIDH is 135 mm Hg.

The Null Hypothesis, H0

States the assumption (numerical) to be tested, e.g. the mean SBP of AIDH employees = 130 mm Hg

Begin with the assumption that the null hypothesis is TRUE (similar to the notion of innocent until proven guilty).

• Refers to the status quo

• Always contains the ‘=’ sign

• The null hypothesis may or may not be rejected.

The Alternative Hypothesis, H1

Is the opposite of the null hypothesis, e.g. the mean SBP of AIDH employees is not 130 mm Hg

Challenges the status quo

Never contains the ‘=’ sign

The alternative hypothesis may or may not be accepted

Is generally the hypothesis that is believed to be true by the researcher

Identify the Problem

Steps:

– State the Null Hypothesis (H0: µ = 130)

– State its opposite, the Alternative Hypothesis (H1: µ < 130)

Hypotheses are mutually exclusive & exhaustive

Sometimes it is easier to form the alternative hypothesis first.

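Hypothesis Testing Process

[Figure: flow from the population to a sample and back to a decision about the null hypothesis.]

Population: assume the population mean SBP is 130 mm Hg (null hypothesis).

Sample: the sample mean is 120.

Is $\bar{X} = 120$ likely if $\mu = 130$? No, not likely!

Decision: REJECT the null hypothesis.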

Hypothesis Testing

                        Real Situation
Decision        Ho is true          Ho is false
Reject Ho       Type I error        Correct decision
Accept Ho       Correct decision    Type II error

α = P(Type I Error), β = P(Type II Error)

• Goal: Keep α, β reasonably small

Reduce the probability of one error and the other one goes up.

α & β Have an Inverse Relationship


Factors Affecting Type II Error, β

True Value of Population Parameter
– β Increases When Difference Between Hypothesized Parameter & True Value Decreases

Significance Level α
– β Increases When α Decreases

Population Standard Deviation σ
– β Increases When σ Increases

Sample Size n
– β Increases When n Decreases
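Each of these four effects can be checked numerically. The sketch below assumes a lower-tailed one-sample z-test of H0: µ = µ0 against H1: µ < µ0 with known σ, for which β = Φ(z_α − (µ0 − µ_true)√n/σ). The baseline numbers (µ0 = 130, true mean 125, σ = 20, n = 25, α = 0.05) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def type_ii_error(mu0, mu_true, sigma, n, alpha):
    """beta for a lower-tailed z-test of H0: mu = mu0 vs H1: mu < mu0 (sigma known)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)     # critical value for the chosen alpha
    shift = (mu0 - mu_true) * sqrt(n) / sigma     # true difference in standard-error units
    return NormalDist().cdf(z_alpha - shift)      # P(fail to reject H0 | mu = mu_true)

base = type_ii_error(mu0=130, mu_true=125, sigma=20, n=25, alpha=0.05)

# Change one factor at a time; each change should increase beta.
smaller_difference = type_ii_error(130, 128, 20, 25, 0.05)
smaller_alpha = type_ii_error(130, 125, 20, 25, 0.01)
larger_sigma = type_ii_error(130, 125, 30, 25, 0.05)
smaller_n = type_ii_error(130, 125, 20, 16, 0.05)

print(round(base, 3), round(smaller_difference, 3), round(smaller_alpha, 3),
      round(larger_sigma, 3), round(smaller_n, 3))
```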

How to choose between Type I and Type II errors

The choice depends on the cost of the error.

Choose a small type I error when the cost of rejecting the maintained hypothesis or standard treatment is high.

Choose a large type I error when you have an interest in changing the standard treatment.

Point Estimation

• A point estimate draws inference about a population by estimating the value of an unknown parameter using a single value or a point.

[Figure: a point estimator computed from the sample distribution is used to estimate the unknown parameter of the population distribution.]

Interval Estimation

• An interval estimator draws inferences about a population by estimating the value of an unknown parameter using an interval.

[Figure: an interval estimator computed from the sample distribution is used to estimate the unknown parameter of the population distribution.]

Confidence Interval (CI)

point estimate ± (measure of how confident we want to be) × (standard error)

– Point estimate: the value of the statistic in my sample (e.g., mean)

– Measure of confidence: critical value for a statistic

– Standard error: standard error of the statistic

What effect does a larger sample size have on the confidence interval? It reduces the standard error and makes the CI narrower, indicating a more precise estimate.
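The effect of sample size on the interval can be seen in a small sketch. It assumes a confidence interval for a mean built with a normal critical value (a large-sample simplification; a t critical value would usually be used for small samples), and the sample mean, SD, and sample sizes are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def mean_ci(sample_mean, sd, n, confidence=0.95):
    """point estimate +/- critical value * standard error, using a normal critical value."""
    critical = NormalDist().inv_cdf(0.5 + confidence / 2)
    se = sd / sqrt(n)
    return sample_mean - critical * se, sample_mean + critical * se

# Hypothetical sample: mean 120, SD 20.
low_25, high_25 = mean_ci(120, 20, n=25)
low_100, high_100 = mean_ci(120, 20, n=100)

print(f"n = 25:  ({low_25:.2f}, {high_25:.2f})")
print(f"n = 100: ({low_100:.2f}, {high_100:.2f})")
```

Quadrupling the sample size halves the standard error, so the second interval is half as wide as the first.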

P-Value versus the Confidence Interval

Two main ways to assess study precision and the role of chance in a study.

– The p-value measures (as a probability) the evidence against the null hypothesis: it is the probability, computed assuming the null hypothesis is true, of a result at least as extreme as the one observed.

– Testing at a significance level of 0.05 means that, when the null hypothesis is true, about 5 of 100 experiments would appear significant just by chance (“Type I error”).

P-Value versus the Confidence Interval

– A confidence interval (CI) is an interval, computed from the sample, that is expected to contain the value of the parameter with a specified level of confidence.

– CI measures the precision of an estimate (when sampling variability is high, the interval is wide to reflect the uncertainty of the estimate).

– A 95% CI implies that if one repeats a study 100 times, the true measure of association will lie inside the CI in about 95 of the 100 repetitions. If the null-hypothesis value of a parameter does not lie within the 95% CI, the result is significant at the 5% level of significance.

Procedures for sample size calculation

Selection of primary variables of interest and formulation of hypotheses

Information on the standard deviation (if numeric) or proportion (if categorical)

A tolerance level of significance (α)

Selection of a reasonable test statistic

Power or confidence level

A scientifically or clinically meaningful effect/difference

Useful links for sample size Calculation

1)http://hedwig.mgh.harvard.edu/sample_size/size.html

2)http://www.stat.uiowa.edu/~rlenth/Power/index.html

3)http://cct.jhsph.edu/javamarc/index.htm

4)http://stat.ubc.ca/~rollin/stats/ssize/index.html

5)http://statpages.org/#Power

Example: Sample Size for Mean using CI

What sample size is needed to be 95% confident of being correct within ± 6? A previous study suggested that the standard deviation is 40.

$$n = \frac{z^2\,(SD)^2}{\text{Error}^2} = \frac{1.96^2 \times 40^2}{6^2} \approx 170.7, \text{ rounded up to } 171$$

Example: Sample Size for Proportion using CI

What sample size is needed to be within ± 5% with 95% confidence when estimating the proportion of AIDHC employees who have already had a flu shot? Suppose a very small sample showed that 40% of AIDHC employees had already had a flu shot.

$$n = \frac{Z^2\,p(1-p)}{\text{error}^2} = \frac{1.96^2 \times (0.40)(0.60)}{0.05^2} = 368.79 \approx 369$$
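Both worked examples can be reproduced with a few lines of code. This is a sketch of the two confidence-interval formulas only. The function names are illustrative, the critical value comes from the standard normal distribution (about 1.96 for 95% confidence), and the result is rounded up to the next whole subject.

```python
from math import ceil
from statistics import NormalDist

def n_for_mean(sd, error, confidence=0.95):
    """Sample size so that the CI half-width for a mean is at most `error`."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return ceil((z * sd / error) ** 2)

def n_for_proportion(p, error, confidence=0.95):
    """Sample size so that the CI half-width for a proportion is at most `error`."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return ceil(z ** 2 * p * (1 - p) / error ** 2)

print(n_for_mean(sd=40, error=6))            # first example: SD 40, within +/- 6
print(n_for_proportion(p=0.40, error=0.05))  # second example: 40% prior, within +/- 5%
```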

Credits

Thanks are due to Faith Goa of the Golden State University for the implied permission to utilize some of the illustrations from their slides on “Fundamentals of Hypothesis Testing” for education purposes only.

Other sources consulted during the preparation of these slides are herein acknowledged as well.

Questions
