Sampling distribution concepts

UNIT-V

Population & Sample

Population

Population in statistics means the whole of the information that comes under the purview of a statistical investigation.

It is the totality of all the observations of a statistical inquiry.

It is also known as the “UNIVERSE”.

A population may be finite or infinite.

Sample

A part of the population selected for study is called a SAMPLE.

Hence, a sample is a group of items selected from a population in such a way that the group represents the population.

The number of individuals included in a finite sample is called the SIZE OF THE SAMPLE.

Parameter & Statistic

Parameter

Any statistical measure (such as mean, mode, S.D.) computed from population data is known as a PARAMETER.

Statistic

Any statistical measure computed from sample data is known as a STATISTIC.

A statistic computed from a sample drawn from the parent population plays an important role in:

A) The Theory of Estimation

B) Testing of Hypothesis

Notations used

Statistical Measure    Population    Sample
Mean                   µ             x̄
Standard deviation     σ             s
Size                   N             n

Sampling & Sampling Theory

Sampling

It is the process of selecting a sample from the population.

Sampling can also be defined as the process of drawing a sample from the population & computing a suitable statistic in order to estimate the parameter of the parent population & to test the significance of the statistic computed from that sample.

Sampling theory

Sampling theory is based on sampling.

It deals with statistical inferences drawn from sampling results, which are of three types:

i. Statistical Estimation,

ii. Tests of significance, and

iii. Statistical inference

Objects of Sampling theory

1. To estimate the population parameter on the basis of the sample statistic.

2. To set the limits of accuracy & degree of confidence of the estimates of the population parameter computed on the basis of the sample statistic.

3. To test significance about the population characteristic on the basis of the sample statistic.

Methods of Sampling

Random (Probability) Sampling:

1. Simple Random Sampling

2. Stratified Sampling

3. Systematic Sampling

4. Multi-stage Sampling

Non-random Sampling:

1. Judgment Sampling

2. Quota Sampling

3. Convenience Sampling

Random Sampling Methods

Simple Random sampling

This method refers to the sampling technique in which each and every item of the population is given an equal chance of being included in the sample;

The selection is free from personal bias;

This method is also known as the method of chance selection.

It is sometimes also referred to as “representative sampling” (if the sample is chosen at random and the size of the sample is sufficiently large, it will represent all groups in the population);

It is a probability sampling method because every item of the population has an equal opportunity of being selected in the sample;

Methods of obtaining a Simple Random Sample:

1. Lottery method

2. Table of random numbers (a number of random number tables are available, such as Tippett’s table; Fisher and Yates numbers; Kendall and Babington Smith numbers)
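In software, a pseudo-random number generator usually plays the role of the lottery or the random number table. The following is a minimal sketch, assuming a hypothetical population of items numbered 1 to 1000; the function name `simple_random_sample` and the seed are illustrative choices, not part of the original material.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct items so that every item has an equal chance of selection."""
    rng = random.Random(seed)          # a seed makes the draw reproducible
    return rng.sample(population, n)   # sampling without replacement

# Hypothetical population: items numbered 1..1000
population = list(range(1, 1001))
sample = simple_random_sample(population, 10, seed=42)
print(sample)
```

Because the draw is made without replacement, no item can appear twice in the sample.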

Stratified Sampling

It is one of the restricted random methods which, by using available information concerning the data, attempts to design a more efficient sample than that obtained by the simple random procedure;

The process of stratification requires that the population be divided into homogeneous groups or classes called strata;

Then a sample is taken from each group by the simple random method;

The resulting sample is called a stratified sample.


A stratified sample may be either proportional or disproportionate.

In a proportional stratified sampling plan, the number of items drawn from each stratum is proportional to the size of that stratum.

For example, if the population is divided into 4 strata, their respective sizes being 15%, 10%, 20% and 55% of the population, and a sample of 1000 is to be drawn, the desired proportional sample may be obtained in the following manner:

From stratum one      1000 × 0.15 = 150 items
From stratum two      1000 × 0.10 = 100 items
From stratum three    1000 × 0.20 = 200 items
From stratum four     1000 × 0.55 = 550 items
Sample size                         1000 items

Disproportionate Stratified sampling includes procedures of taking an equal number of items from each stratum irrespective of its size.
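The proportional allocation in the example above can be reproduced with a few lines of code. This is a minimal sketch; the function name `proportional_allocation` is an illustrative assumption, and simple rounding is used, which in general may leave the allocated total one or two items away from the target sample size.

```python
def proportional_allocation(shares, sample_size):
    """Items to draw from each stratum, proportional to the stratum's share."""
    return [round(sample_size * share) for share in shares]

# Strata shares from the example: 15%, 10%, 20% and 55% of the population
shares = [0.15, 0.10, 0.20, 0.55]
allocation = proportional_allocation(shares, 1000)
print(allocation)        # items per stratum
print(sum(allocation))   # total sample size
```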

Systematic Sampling

This method is popularly used in such cases where a complete list of the population from which sampling is to be drawn is available;

The method is to select every kth item from the list where ‘k’ refers to the sampling interval;

k = size of population / sample size (N/n);

The starting point between the first & the kth is selected at random


For example, if a complete list of 1000 students is available and we want to draw a sample of 200 students, then k = 1000/200 = 5, which means we must take every 5th item.

The first item, between one and five, shall be selected at random.

Let it be three; we then go on adding 5 to obtain the remaining numbers of the desired sample: 3, 8, 13, 18 and so on, up to 998.
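The student example can be sketched in code as follows. The function name `systematic_sample` and its parameters are illustrative assumptions; the starting point is fixed at 3 here to match the example, and is drawn at random when not supplied.

```python
import random

def systematic_sample(population, n, start=None, seed=None):
    """Select every k-th item, where k = N // n, from a random start in 1..k."""
    k = len(population) // n                      # sampling interval
    if start is None:
        start = random.Random(seed).randint(1, k)
    return population[start - 1::k][:n]

students = list(range(1, 1001))                   # list of 1000 students
sample = systematic_sample(students, 200, start=3)
print(sample[:4], sample[-1])
```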

Cluster Sampling

It differs from stratified sampling in that each stratum consists of homogeneous items, whereas clusters are mutually exclusive groups whose items are not necessarily homogeneous;

Multi-stage sampling is a type of cluster sampling;

Multi-stage Sampling

As the name suggests this method refers to a sampling procedure which is carried out in several stages;

The material is regarded as made up of a number of first stage sampling units, each made up of a number of second stage units;

At first the first stage units are sampled by some suitable method such as random sampling, then, a sample of second stage is selected from each of the selected first stage units again by some suitable method which may be the same or different from the method employed for the first stage units.

Non-Random Sampling Methods

Judgment Sampling

In this method of sampling the choice of sample items depends exclusively on the judgment of the investigator;

This method, though simple, is not scientific;

This method is used in solving many types of economic & business problems, such as:

i. When the sample size is small;

ii. When an estimate is needed quickly, since judgment sampling makes estimates available quickly;

Quota Sampling

It is a type of judgment sampling;

In a quota sample, quotas are set up according to given criteria but within quotas the selection of sample items depends on personal judgment.

Convenience Sampling

It is also known as the Chunk;

A chunk is a fraction of one population taken for investigation because of its convenient availability;

Hence chunk is selected neither by probability nor by judgment but by convenience;

Convenience samples are sometimes called accidental samples because those entering into the sample enter by ‘accident’;

Errors in Sampling: Discrepancies between a statistical measure of the population (Parameter) & that of the sample drawn from the same population (Statistic).

Sampling Errors

These are of two types:

a. Biased errors arise due to any bias in selection, estimation, etc.

b. Unbiased errors arise due to chance factors

They occur primarily due to the following reasons:

1. Faulty selection of the sample

2. Substitution

Non-Sampling Errors

These may arise in the following ways:

1. Due to negligence & carelessness on the part of the investigator;

2. Due to incomplete investigation & sample survey;

3. Due to negligence & non-response on the part of the respondents;

4. Errors in data processing.

Principles of Sampling

Principle of “Statistical Regularity”: This principle lays down that a moderately large number of items chosen at random from a large group are almost sure on an average to possess the characteristics of the large group.

Principle of “Inertia of Large Numbers”: this principle is a corollary of the above principle.

It states that, other things being equal, the larger the size of the sample, the more accurate the results are likely to be.

Theory of Estimation

Statistical estimation is the procedure of using a sample statistic to estimate a population parameter.

A statistic used to estimate a parameter is called an estimator, and

The value taken by the estimator is called an estimate.

For example, the sample mean is an estimator of the population mean, and a particular value it takes (say 7.65) is an estimate.

Statistical estimation is divided into two major categories:

Point Estimation

In point estimation, a single statistic is used to provide an estimate of the population parameter;

A change in the sample will cause a deviation in the estimate;

Interval Estimation

An interval estimate is a range of values within which a researcher can say with some confidence that the population parameter falls;

This range is called a confidence interval;
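The difference between the two kinds of estimate can be seen in a short computation. This is a minimal sketch, assuming a large sample with known population standard deviation so that the normal (Z) interval x̄ ± z·σ/√n applies; the numbers (x̄ = 11000, σ = 1200, n = 200) and the function name `z_confidence_interval` are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Interval estimate of the population mean when sigma is known."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    margin = z * sigma / sqrt(n)                     # z times the standard error
    return sample_mean - margin, sample_mean + margin

point_estimate = 11000                               # the sample mean itself
low, high = z_confidence_interval(point_estimate, sigma=1200, n=200)
print(point_estimate, round(low, 2), round(high, 2))
```

The point estimate is a single number, while the interval estimate attaches a margin around it that shrinks as the sample size grows.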

Qualities of a good estimator:

A good estimator is one which is as close to the true value of the parameter as possible.

A good estimator must possess the following characteristics:

i. Unbiasedness

ii. Consistency

iii. Efficiency and

iv. Sufficiency

Unbiasedness: this is a desirable property for a good estimator to have; for example, the sample mean is an unbiased estimator of the population mean because the mean of the sampling distribution of sample means taken from the same population is equal to the population mean itself;

Efficiency: it refers to the size of the standard error of the statistic; if two statistics computed from samples of the same size are compared to decide which is the better estimator, the statistic that has the smaller standard error (the standard deviation of its sampling distribution) is selected;

Consistency: a statistic is a consistent estimator if, as the sample size increases, it becomes almost certain that the value of the statistic comes very close to the value of the population parameter;

Sufficiency: an estimator is sufficient if it makes so much use of the information in the sample that no other estimator could extract from the sample additional information about the population parameter being estimated;
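Unbiasedness can be illustrated by simulation: the average of many sample means should settle near the population mean. The sketch below assumes a hypothetical population of the numbers 1 to 100 (mean 50.5); the function name, sample size, number of trials and seed are all illustrative choices.

```python
import random
from statistics import mean

def mean_of_sample_means(population, n, trials, seed=0):
    """Average the means of many simple random samples of size n."""
    rng = random.Random(seed)
    sample_means = [mean(rng.sample(population, n)) for _ in range(trials)]
    return mean(sample_means)

population = list(range(1, 101))        # hypothetical population
population_mean = mean(population)      # 50.5
estimate = mean_of_sample_means(population, n=10, trials=2000)
print(population_mean, estimate)
```

Any single sample mean may be far from 50.5, but their long-run average stays close to it, which is what unbiasedness asserts.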

Hypothesis Testing

Hypothesis testing is based on hypothesis;

“Hypothesis” is an assumption about an unknown population parameter;

Hypothesis testing is a well defined procedure which helps in deciding objectively whether to accept or reject the hypothesis based on the information available from the sample;

Hypothesis Testing Procedure

STEP 1: SET NULL & ALTERNATIVE HYPOTHESIS

The assumption which we want to test is called the NULL hypothesis;

It is symbolized as Ho;

The null hypothesis is set with no difference (i.e. status quo) & is considered true unless and until it is disproved by the collected sample data;

Example, Ho: µ = 500, “the null hypothesis is that the population mean is equal to 500”

The alternative hypothesis, generally denoted by H1 or Ha, is the logical opposite of the null hypothesis;

H1: µ ≠ 500 (or H1: µ > 500, or H1: µ < 500)

In other words, when the null hypothesis is found to be true, the alternative hypothesis must be false, and vice versa;

Rejection of the null hypothesis indicates that the difference is statistically significant & acceptance of the null hypothesis indicates that the difference is due to chance;

STEP 2: SET UP A SUITABLE SIGNIFICANCE LEVEL

The level of significance, generally denoted by ‘α’, is the probability of rejecting the null hypothesis even when it is true;

The level of significance is also known as the size of the rejection region or size of the critical region;

It is generally specified before any samples are drawn, so that the results obtained will not influence the direction to be taken;

Any level of significance can be adopted; in practice we take either the 5% or the 1% level of significance;

When we take the 5% level of significance, there are about 5 chances out of 100 that we would reject the null hypothesis when it should be accepted, i.e. we are about 95% confident that we have made the right decision;

When the null hypothesis is rejected at α=0.05, the test result is said to be significant;

When the null hypothesis is rejected at α=0.01, the test result is said to be highly significant;
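For a two-tailed test based on the normal distribution, each level of significance corresponds to a critical value that bounds the rejection region. The following sketch computes those values with Python's standard library instead of a printed table; the function name `two_tailed_critical_z` is an illustrative assumption.

```python
from statistics import NormalDist

def two_tailed_critical_z(alpha):
    """Critical Z value that leaves alpha/2 in each tail of the standard normal."""
    return NormalDist().inv_cdf(1 - alpha / 2)

z_05 = two_tailed_critical_z(0.05)   # 5% level of significance
z_01 = two_tailed_critical_z(0.01)   # 1% level of significance
print(round(z_05, 2), round(z_01, 2))
```

A smaller α pushes the critical value further out, so stronger evidence is needed before the null hypothesis is rejected.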

STEP3: DETERMINATION OF A SUITABLE TEST STATISTIC

Many of the test statistics that we shall encounter will have the following form:

Test statistic = (Sample statistic - Hypothesized population parameter) / (Standard error of the sample statistic)

STEP 4: SET THE DECISION RULE

The next step for the researcher is to establish a critical region;

Acceptance region: the region in which the null hypothesis is accepted;

Rejection region: the region in which the null hypothesis is rejected;

STEP5: COLLECT THE SAMPLE DATA

Data are now collected;

Appropriate sample statistics are computed;

STEP6: ANALYSE THE DATA

This involves selection of an appropriate probability distribution for a particular test;

For example, when the sample is small (n<30) the use of normal probability distribution (Z) is not an accurate choice, (t) distribution needs to be used in this case;

Some commonly used testing procedures are

Z, t, F & Chi square

STEP 7: ARRIVE AT A STATISTICAL CONCLUSION & BUSINESS IMPLICATION

Statistical conclusion is a decision to accept or reject a null hypothesis;

This depends on whether the computed test statistic falls in acceptance region or rejection region;

Types of Errors in Hypothesis Testing

Decision      Condition: Ho true     Condition: Ho false
Accept Ho     Correct decision       Type II error (β)
Reject Ho     Type I error (α)       Correct decision

Z-test

Hypothesis testing for large samples, i.e. n >= 30;

Based on the assumption that the population from which the sample is drawn has a normal distribution;

As a result, the sampling distribution of the mean is also normally distributed;

Application:

1. For testing hypothesis about a single population mean;

2. Hypothesis testing for the difference between two population means;

3. Hypothesis testing for attributes.

Formula for single population mean (infinite population)

$Z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}}$

Where,
µ = population mean
x̄ = sample mean
σ = population standard deviation
n = sample size

Q: A marketing research firm conducted a survey 10 years ago & found that the average household income of a particular geographic region was Rs 10000. Mr. Gupta, who recently joined the firm as a VP, expresses doubts about this figure. To verify the data, the firm decides to take a random sample of 200 households, which yields a sample mean of Rs 11000. Assume that the population S.D. is Rs 1200. Verify Mr. Gupta’s doubts using α = 0.05.

Step 1: Set null & alternative hypothesis

Ho: µ = 10000

H1: µ ≠ 10000

Step 2: Determine the appropriate statistical test

Since the sample size is >= 30, the Z-test can be used for hypothesis testing

Step 3: Set the level of significance

The level of significance is known (α = 0.05)

Step 4: Set the decision rule

The acceptance region covers 95% of the area & the rejection region 5%

The critical value can be read from the table (±1.96)

Step 5: Collect the sample data

A sample of 200 respondents yields a sample mean of Rs 11000

Step 6: Analyze the data

n = 200; µ = 10000; x̄ = 11000; σ = 1200

$Z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \dfrac{11000 - 10000}{1200 / \sqrt{200}} = 11.79$

Step 7: Arrive at a statistical conclusion & business implication

The Z value is 11.79, which is greater than +1.96, hence the null hypothesis is rejected and the alternative hypothesis is accepted. Hence Mr. Gupta’s doubt about household income was right.
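The arithmetic of this worked example can be checked with a short script. This is a minimal sketch using the figures from the question; the function name `z_statistic` is an illustrative assumption, and 1.96 is the two-tailed critical value at α = 0.05 used above.

```python
from math import sqrt

def z_statistic(sample_mean, mu0, sigma, n):
    """Z = (sample mean - hypothesized mean) / (sigma / sqrt(n))."""
    return (sample_mean - mu0) / (sigma / sqrt(n))

z = z_statistic(sample_mean=11000, mu0=10000, sigma=1200, n=200)
reject_null = abs(z) > 1.96          # two-tailed decision rule at the 5% level
print(round(z, 2), reject_null)
```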

Formula for single population mean (finite population)

$Z = \dfrac{\bar{x} - \mu}{\dfrac{\sigma}{\sqrt{n}} \sqrt{\dfrac{N - n}{N - 1}}}$

where N = population size

When the population standard deviation is not known:

$Z = \dfrac{\bar{x} - \mu}{s / \sqrt{n}}$

where s = sample standard deviation

Hypothesis testing for the difference between two population means

$Z = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$

Hypothesis testing for attributes

$Z = \dfrac{x - \mu}{\sqrt{npq}}$

Where,
n = sample size
µ = np
p = probability of happening
q = probability of not happening

Q: In 600 throws of a 6-faced dice, odd points appeared 360 times. Would you say that the dice is fair at the 5% level of significance?

Ho: the dice is fair

p = q = ½; n = 600; np = 300; x = 360

$Z = \dfrac{x - np}{\sqrt{npq}} = \dfrac{360 - 300}{\sqrt{600 \times \frac{1}{2} \times \frac{1}{2}}} = 4.9$

Z is greater than 1.96 (at 5%), so Ho is rejected. Hence, the dice is not fair.
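The dice example can be verified numerically. This is a minimal sketch using the figures from the question; the function name `attribute_z` is an illustrative assumption.

```python
from math import sqrt

def attribute_z(x, n, p):
    """Z = (x - np) / sqrt(npq) for the number of occurrences of an attribute."""
    q = 1 - p
    return (x - n * p) / sqrt(n * p * q)

z = attribute_z(x=360, n=600, p=0.5)   # 360 odd results in 600 throws
reject_null = abs(z) > 1.96            # 5% level of significance
print(round(z, 1), reject_null)
```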

t-test

Given by W.S. Gosset in 1908 under the pen name “Student”;

The t-test can be applied when:

1. A researcher draws a small random sample (n<30) to estimate the population mean (µ);

2. The population standard deviation (σ) is unknown;

3. The population is normally distributed

Application of t-test

Hypothesis testing for single population mean;

Hypothesis testing for the difference between two independent population means;

Hypothesis testing for the difference between two dependent population means;

Hypothesis testing for single population mean

$t = \dfrac{\bar{x} - \mu}{s / \sqrt{n}}$

With degrees of freedom (n-1)

Where,
µ = population mean
x̄ = sample mean
s = sample standard deviation
n = sample size

Q: Royal tyre has launched a new brand of tyres for tractors & claims that under normal circumstances the average life of the tyres is 40000 km. A retailer wants to test this claim & has taken a random sample of 8 tyres. He tests the life of the tyres under normal circumstances. The results obtained are:

Tyre    1        2        3        4        5        6        7        8
Km      35 000   38 000   42 000   41 000   39 000   41 500   43 000   38 500

Use α = 0.05 for testing the hypothesis

Step 1: Set null & alternative hypothesis

Null hypothesis: Ho: µ = 40 000

Alternative hypothesis: H1: µ ≠ 40 000

Step 2: Determine the appropriate statistical test

The sample size is less than 30, so the t-test will be an appropriate test

Step 3: Set the level of significance

The level of significance is α = 0.05

Step 4: Set the decision rule

The t distribution value for a two-tailed test is t0.025 = 2.365 for 7 degrees of freedom. So if the computed t value is outside the ±2.365 range, the null hypothesis will be rejected; otherwise it is accepted.

Step 5: Collect the sample data

The lives (in km) of the 8 sampled tyres are 35 000, 38 000, 42 000, 41 000, 39 000, 41 500, 43 000 and 38 500.

Step 6: Analyze the data

x̄ = 39750; µ = 40000; s = 2618.61; n = 8; df = n - 1 = 7; table value of t0.025,7 = 2.365

$t = \dfrac{\bar{x} - \mu}{s / \sqrt{n}} = \dfrac{39750 - 40000}{2618.61 / \sqrt{8}} = -0.27$

Step 7: Arrive at a statistical conclusion & business implication

The observed t value is -0.27, which falls within the acceptance region, & hence the null hypothesis is accepted, i.e. Ho: µ = 40 000
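The tyre example can be recomputed directly from the eight observations. This is a minimal sketch using Python's standard library; the critical value 2.365 is the table value quoted in the example, and the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

km = [35000, 38000, 42000, 41000, 39000, 41500, 43000, 38500]
mu0 = 40000                          # claimed average life under Ho

x_bar = mean(km)                     # sample mean
s = stdev(km)                        # sample standard deviation (n - 1 divisor)
t = (x_bar - mu0) / (s / sqrt(len(km)))
df = len(km) - 1

reject_null = abs(t) > 2.365         # two-tailed, alpha = 0.05, df = 7
print(x_bar, round(s, 2), round(t, 2), df, reject_null)
```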

Hypothesis testing for the difference between two independent population means

$t = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sigma \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$

σ can be estimated by pooling the two sample variances & computing the pooled standard deviation

$\sigma = s_{pooled} = \sqrt{\dfrac{s_1^2 (n_1 - 1) + s_2^2 (n_2 - 1)}{n_1 + n_2 - 2}}$
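The pooled computation can be sketched as follows. The two samples are hypothetical values chosen only for illustration, the function name `pooled_t` is an assumption, and the null hypothesis is taken to be µ1 - µ2 = 0.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(sample1, sample2, hypothesized_diff=0.0):
    """t statistic for two independent samples using the pooled standard deviation."""
    n1, n2 = len(sample1), len(sample2)
    s1_sq, s2_sq = variance(sample1), variance(sample2)
    s_pooled = sqrt((s1_sq * (n1 - 1) + s2_sq * (n2 - 1)) / (n1 + n2 - 2))
    std_error = s_pooled * sqrt(1 / n1 + 1 / n2)
    t = ((mean(sample1) - mean(sample2)) - hypothesized_diff) / std_error
    return t, n1 + n2 - 2            # statistic and degrees of freedom

# Hypothetical data, not taken from the original material
sample_a = [10.0, 12.0, 14.0, 16.0, 18.0]
sample_b = [11.0, 12.0, 13.0, 14.0, 15.0]
t, df = pooled_t(sample_a, sample_b)
print(round(t, 4), df)
```

The computed t is then compared with the table value for n1 + n2 - 2 degrees of freedom, in the same way as in the single-mean test.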

F-test

The F-distribution is named after R.A. Fisher, who first studied it in 1924;

This distribution is usually defined in terms of the ratio of the variances of two normally distributed populations

The quantity

$F = \dfrac{s_1^2 / \sigma_1^2}{s_2^2 / \sigma_2^2}$

is distributed as an F-distribution with (n1 - 1) & (n2 - 1) degrees of freedom

Where

$s_1^2 = \dfrac{\sum (x_1 - \bar{x}_1)^2}{n_1 - 1}$

$s_2^2 = \dfrac{\sum (x_2 - \bar{x}_2)^2}{n_2 - 1}$
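When the null hypothesis is that the two population variances are equal (σ1² = σ2²), the quantity above reduces to the ratio of the two sample variances. The sketch below computes that ratio and its degrees of freedom; the samples are hypothetical and the function name `f_ratio` is an illustrative assumption.

```python
from statistics import variance

def f_ratio(sample1, sample2):
    """F = s1^2 / s2^2 under Ho: equal population variances, with its df pair."""
    f = variance(sample1) / variance(sample2)   # sample variances use n - 1
    return f, len(sample1) - 1, len(sample2) - 1

# Hypothetical data, not taken from the original material
sample_1 = [10.0, 12.0, 14.0, 16.0, 18.0]
sample_2 = [11.0, 12.0, 13.0, 14.0, 15.0]
f, df1, df2 = f_ratio(sample_1, sample_2)
print(f, df1, df2)
```

The computed ratio is compared with the F table value for (n1 - 1, n2 - 1) degrees of freedom at the chosen level of significance.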

Chi Square test

Chi square is related to categorical data (as counting of frequencies from one or more variables);

Some researchers place chi-square in the category of Non-parametric tests

The χ² test was developed by Karl Pearson in 1900;

The symbol χ is the Greek letter “chi”;

The χ² distribution is a function of its degrees of freedom;

Being a sum of squared quantities, χ² can never take a negative value;

χ² is a continuous probability distribution with range zero to infinity;

$\chi^2 = \sum \dfrac{(O - E)^2}{E}$

With df = (r-1)(c-1), where O = observed frequency, E = expected frequency, r = number of rows and c = number of columns

$E = \dfrac{\text{row total} \times \text{column total}}{\text{grand total}}$

Decision rule

If χ² calculated > χ² critical, reject the null hypothesis;

If χ² calculated < χ² critical, accept the null hypothesis;

Conditions to apply chi-square test

Data should not be in % or ratios; rather they should be expressed in original units;

The sample should consist of at least 50 observations, should be drawn randomly, & the individual observations in a sample should be independent of each other;
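The formula and decision rule for chi-square can be sketched in code for a contingency table. The 2 × 2 table of observed frequencies below is hypothetical (100 observations in original units), the function name `chi_square_statistic` is an illustrative assumption, and 3.841 is the standard table value of χ² at the 5% level for 1 degree of freedom.

```python
def chi_square_statistic(observed):
    """Chi-square = sum of (O - E)^2 / E, with E = row total x column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total   # expected frequency
            chi2 += (o - e) ** 2 / e
    df = (len(observed) - 1) * (len(observed[0]) - 1)          # (r-1)(c-1)
    return chi2, df

# Hypothetical observed frequencies, not taken from the original material
observed = [[30, 20],
            [20, 30]]
chi2, df = chi_square_statistic(observed)
reject_null = chi2 > 3.841           # critical value at 5% for df = 1
print(chi2, df, reject_null)
```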
