Goals in Statistic

Chapter Three Describing Data: Numerical Measures

GOALSWhen you have completed this chapter, you will be able to:ONECalculate the arithmetic mean, median, mode, weighted mean, and the geometric mean.TWO Explain the characteristics, uses, advantages, and disadvantages of each measure of location.THREEIdentify the position of the arithmetic mean, median, and mode for both a symmetrical and a skewed distribution.

Goals

3- 1

FOURCompute and interpret the range, the mean deviation, the variance, and the standard deviation of ungrouped data.

Describing Data: Numerical Measures

FIVEExplain the characteristics, uses, advantages, and disadvantages of each measure of dispersion.

SIXUnderstand Chebyshev’s theorem and the Empirical Rule as they relate to a set of observations.

Goals

Chapter Three 3- 2

Characteristics of the Mean

It is calculated by summing the values and dividing by the number of values.

It requires the interval scale.All values are used.It is unique.The sum of the deviations from the mean is 0.

The Arithmetic Mean is the most widely used measure of location and shows the central value of the data.

The major characteristics of the mean are: A verag e J oe

3- 3

Population Mean

N

X

where µ is the population meanN is the total number of observations.X is a particular value. indicates the operation of adding.

For ungrouped data, the

Population Mean is the sum of all the population values divided by the total number of population values:

3- 4

Example 1

500,484

000,73...000,56

N

X

Find the mean mileage for the cars.

A Parameter is a measurable characteristic of a population.

The Kiers family owns four cars. The following is the current mileage on each of the four cars.

56,000

23,000

42,000

73,000

3- 5

Sample Mean

n

XX

where n is the total number of values in the sample.

For ungrouped data, the sample mean is the sum of all the sample values divided by the number of sample values:

3- 6

Example 2

4.155

77

5

0.15...0.14

n

XX

A statistic is a measurable characteristic of a sample.

A sample of five executives received the following bonus last year ($000):

14.0, 15.0, 17.0, 16.0, 15.0

3- 7

Properties of the Arithmetic Mean

Every set of interval-level and ratio-level data has a mean.

All the values are included in computing the mean.

A set of data has a unique mean.

The mean is affected by unusually large or small data values.

The arithmetic mean is the only measure of location where the sum of the deviations of each value from the mean is zero.

Properties of the Arithmetic Mean 3- 8

Example 3

0)54()58()53()( XX

Consider the set of values: 3, 8, and 4. The mean is 5. Illustrating the fifth

property

3- 9

Weighted Mean

)21

)2211

...(

...(

n

nnw

www

XwXwXwX

The Weighted Mean of a set of numbers X1, X2, ..., Xn, with

corresponding weights w1, w2, ...,wn, is computed from the following

formula:

3- 10

Example 4

89.0$50

50.44$1515155

)15.1($15)90.0($15)75.0($15)50.0($5

wX

During a one hour period on a hot Saturday afternoon cabana boy Chris served fifty drinks. He sold five drinks for $0.50, fifteen for $0.75, fifteen for $0.90, and fifteen for $1.10.

Compute the weighted mean of the price of the drinks.

3- 11

The Median

There are as many values above the median as below it in the data array.

For an even set of values, the median will be the arithmetic average of the two middle numbers and is

found at the (n+1)/2 ranked observation.

The Median is the midpoint of the values after they have been ordered from the smallest to the largest.

3- 12

The ages for a sample of five college students are:21, 25, 19, 20, 22.

Arranging the data in ascending order gives:

19, 20, 21, 22, 25.

Thus the median is 21.

The median (continued)

3- 13

Example 5

Arranging the data in ascending order gives:

73, 75, 76, 80

Thus the median is 75.5.

The heights of four basketball players, in inches, are: 76, 73, 80, 75.

The median is found at the (n+1)/2 =

(4+1)/2 =2.5th data point.

3- 14

Properties of the Median

There is a unique median for each data set.It is not affected by extremely large or small values and is therefore a valuable measure of location when such values occur.It can be computed for ratio-level, interval-level, and ordinal-level data.It can be computed for an open-ended frequency distribution if the median does not lie in an open-ended class.

Properties of the Median

3- 15

The Mode: Example 6

Example 6: The exam scores for ten nursing students are: 81, 93, 84, 75, 68, 87, 81, 75, 81, 87. Because the score of 81 occurs the most often, it is the mode.

Data can have more than one mode. If it has two modes, it is referred to as bimodal, three modes, trimodal, and the like.

The Mode is another measure of location and represents the value of the observation that appears most frequently.

3- 16

Symmetric distribution: A distribution having the same shape on either side of the center

Skewed distribution: One whose shapes on either side of the center differ; a nonsymmetrical distribution.

Can be negatively skewed, or positively skewed

The Relative Positions of the Mean, Median, and Mode

3- 17

The Relative Positions of the Mean, Median, and Mode: Symmetric Distribution

Zero skewness Mean

=Median

=Mode

M o d e

M ed ia n

M ea n

3- 18

The Relative Positions of the Mean, Median, and Mode: Right Skewed Distribution

• Positively skewed: Mean and median are to the right of the mode.

Mean>Median>Mode

M o d e

M ed ia n

M ea n

3- 19

Negatively Skewed: Mean and Median are to the left of the Mode.

Mean<Median<Mode

The Relative Positions of the Mean, Median, and Mode: Left Skewed Distribution

M o d eM ea n

M ed ia n

3- 20

Dispersion refers to the spread or variability in the data.

Measures of dispersion include the following: range, mean deviation, variance, and standard deviation.

Range = Largest value – Smallest value

Measures of Dispersion

0

5

10

15

20

25

30

0 2 4 6 8 10 12

3- 21

The following represents the scores of the 25 master of Nursing students during the first long examination.

65.0 60.0 59.0 81.0 73.048.0 95.0 63.0 92.0 53.078.0 56.0 79.0 95.0 78.080.0 89.0 79.0 97.0 69.065.0 77.0 80.0 83.0 75.0

Example 9

Highest score:97 Lowest score:48

Range = Highest value – lowest value= 97 - 48= 49

3- 22

Mean Deviation

The arithmetic mean of the

absolute values of the

deviations from the arithmetic

mean.

The main features of the mean deviation are:

All values are used in the calculation.

It is not unduly influenced by large or small values.

Mean Deviation

M D = X - X

n

3- 23

The weights of a sample of crates containing books for the bookstore (in pounds ) are:

103, 97, 101, 106, 103Find the mean deviation.

X = 102

The mean deviation is:

4.25

541515

102103...102103

n

XXMD

Example 10

3- 24

Variance: the arithmetic mean of the squared

deviations from the mean.

Standard deviation: The square root of the variance.

Variance and standard Deviation

3- 25

Not influenced by extreme values.All values are used in the calculation.

The major characteristics of the

Population Variance are:

Population Variance

3- 26

Population Variance formula:

(X - )2

N =

X is the value of an observation in the population

m is the arithmetic mean of the population

N is the number of observations in the population

Population Standard Deviation formula:

2Variance and standard deviation

3- 27

Sample variance (s2)

s 2 =(X - X ) 2

n -1

Sample standard deviation (s)

2ss

Sample variance and standard deviation

3- 28

40.75

37

n

XX

30.515

2.2115

4.76...4.77

1

2222

n

XXs

Example 11

The hourly wages earned by a sample of five students are:

$7, $5, $11, $8, $6.

Find the sample variance and standard deviation.

30.230.52 ss

3- 29

Chebyshev’s theorem: For any set of observations, the minimum proportion of the values that lie within k standard deviations of the mean is at

least:

where k is any constant greater than 1.

2

11

k

Chebyshev’s theorem

3- 30

Empirical Rule: For any symmetrical, bell-shaped distribution:

About 68% of the observations will lie within 1s of the mean

About 95% of the observations will lie within 2s of the mean

Virtually all the observations will be within 3s of the mean

Interpretation and Uses of the Standard Deviation

3- 31

Bell -Shaped Curve showing the relationship between and .

-m 3s -m s -1m s m +1m s +m s +3m s

68%

95%99.7%

Interpretation and Uses of the Standard Deviation

3- 32

The Mean of Grouped Data

n

fXX

The Mean of a sample of data organized in a frequency

distribution is computed by the following formula:

3- 33

From the example on hours spent in studying, we have

Hours studying Class Midpoint(X) Frequency(f) fX

30.2 – 35.1 32.65 1 32.65

25.2 – 30.1 27.65 3 82.95

20.2 – 25.1 22.65 7 158.55

15.2 – 20.1 17.65 11 194.15

10.2 – 15.1 12.65 8 101.2

∑fX=569.5

Therefore, the mean hours spent in studying is 569.5 / 30 = 18.98 hours

The Median of Grouped Data

)(2 if

CFn

LMedian

where L is the lower limit of the median class, CF is the cumulative frequency preceding the median class, f is the frequency of the median class, and i is the class width.

The Median of a sample of data organized in a frequency distribution is computed by:

3- 35

Finding the Median Class

To determine the median class for grouped data

Construct a cumulative frequency distribution.

Divide the total number of data values by 2.

Determine which class will contain this value. For example, if n=30, 30/2 = 15, then determine which class will contain the 15th value.

3- 36

From the example on hours spent in studying

Hours Studying

Lower Limit

f Cumulative Frequency

30.2 – 35.1 30.2 1 30

25.2 – 30.1 25.2 3 29

20.2 – 25.1 20.2 7 26

15.2 – 20.1 15.2 11 19

10.2 – 15.1 10.2 8 8

n/2 = 30/2 = 15. The second class (15.2 – 20.1) having a cumulative frequency of 19 is the median class. The lower limit is 15.2. The cumulative frequency (CF) that precedes the median class is 8. The frequency (f) of the median class is 11.The class width, i = 5.

Therefore, the median is:Median = 15.2 + (15-8)(5) = 18.4 11

The Mode of Grouped Data

The Mode for grouped data is approximated by the formula

3- 39

))((21

1 idd

dLMode

21

1

dd

dLbMo (i)

whereL = lower limit of the modal class intervald1 = diff. between the freq. of the modal CI

and the next class lower in valued2 = diff. between the freq. of the modal CI

and the next class higher in value i = class width

From the example on the hours spent in studying, we have

• the modal class is 15.2 – 20.1 with the highest frequency of 11(L = 15.2)

• d1 =11-8 = 3• d2 = 11-7 = 4• i = 5• Therefore, the mode is

Mode = 15.2 +[3/(3+4)](5) = 17.3Since the Mean = 18.98 > median = 18.4> mode = 17.3,

therefore the distribution of the number of hours spent in studying is positively skewed.

Variance for Grouped Data

222

)1(

nn

fxfxns

wheren = sample sizef = frequencyx = class marks

From the example on hours spent in studying, we have

Hours studying X f fX fX2

30.2 – 35.1 32.65 1 32.65 1066.02

25.2 – 30.1 27.65 3 82.95 2293.57

20.2 – 25.1 22.65 7 158.55 3591.16

15.2 – 20.1 17.65 11 194.15 3426.75

10.2 – 15.1 12.65 8 101.2 1280.18

∑fX=569.5 ∑fX2=11657.68

19.29870

25.25399

)130(30

)5.569()65.11657(30 22

s

And the standard deviation is

S = (29.19)1/2 = 5.4

Therefore, the variance is

Chapter FourOne-Sample Tests of Hypothesis

GOALSWhen you have completed this chapter, you will be able to:

ONEDefine a hypothesis and hypothesis testing.TWO Describe the five step hypothesis testing procedure.THREEDistinguish between a one-tailed and a two-tailed test of hypothesis.FOURConduct a test of hypothesis about a population mean.

Chapter Ten continued


FIVE Conduct a test of hypothesis about a population proportion.SIXDefine Type I and Type II errors.

One-Sample Tests of Hypothesis

What is a Hypothesis?

What is a Hypothesis?

A HYPOTHESIS is defined by Webster as “a tentative theory or supposition provisionally adopted to explain certain facts and to guide in the investigation of others”.

A statistical hypothesis is an assertion or statement that may or may not be true concerning one or more population.

Example of Statistical Hypotheses

• A leading drug in the treatment of hypertension has an advertised therapeutic success rate of 84%. A medical researcher believes he has found a new drug for treating hypertensive patients that has higher therapeutic success rate than the leading drug but with fewer side effects. He should assume that it is no better than the leading drug and then set out to reject this contention. The two statements: the new drug is no better than the old one (p = 0.84) and the new drug is better than the old one (p > 0.84) are examples of statistical hypothesis.

What is Hypothesis Testing?

Hypothesis testing

Based on sample evidence and

probability theory

Used to determine whether the hypothesis is a reasonable statement and should not be rejected, or is unreasonable

and should be rejected

Types of Hypothesis1. Null Hypothesis (Ho) - the hypothesis that two or more variables are not related or that two or more statistics (e.g. means for two different groups) are not significantly different. It is a negation of the theory that the researcher would like to derive. In the above example, the statement ‘the new drug is no better than the old one’ is an example of a null hypothesis. It is usually constructed to enable the researcher to evaluate his own theory or the research hypothesis. In other words, the null hypothesis is stated with the sole purpose of rejecting it, thereby accepting the research hypothesis. Equality symbol (=) is commonly used in stating the null hypothesis.

Types of Hypothesis2. Alternative Hypothesis (H1) – the hypothesis derived from the theory of the investigator and generally state a specified relationship between two or more variables or that two or more statistics significantly differ. In other words, it is the operational statement of the investigator’s research hypothesis. The second statement ‘the new drug is better than the old one’ (above example) is an example of alternative hypothesis. The symbols commonly used are >, < and .

Two ways of stating the alternative hypothesis: Predictive H1 (One-tailed or directional) – specifies the type of relationship existing between two or more variables (e.g. direct or inverse relationship) or specifies the direction of the difference between two or more statistics (e.g. 1> 2 or 1 < 2).Non-predictive H1 (Two-tailed or non-directional )– does not specify the type of relationship or the direction of the difference ( e.g. 1 2)

One-Tailed and Two-Tailed TestsA test of any hypothesis where the alternative hypothesis is predictive such asHo: 1 = 2

H1 : 1 > 2 or 1 < 2

is called a one-tailed test. The rejection region for the alternative hypothesis 1 > 2 lies entirely in the right tail of the distribution, while the rejection region for the alternative hypothesis 1 < 2 lies entirely in the left tail.

A test of any hypothesis where the alternative hypothesis is non-predictive such asHo: 1 = 2

H1 : 1 2

is called a two-tailed test, since the rejection region is split into two equal parts placed in each tail of the distribution.

More Examples of Stating HypothesisExample1. Suppose you want to study the association between job satisfaction of the employees and the labor turnover in a certain private hospital.

Ho: There is no significant relationship between job satisfaction and the labor turnover in a certain private school or stated differently, job satisfaction does not significantly affect labor turnover

H1: There is an inverse relationship between job satisfaction and labor turnover, more specifically, when job satisfaction decreases, labor turnover increases.(predictive)

More Examples . . .Example 2. A researcher is conducting a study to determine if suicide incidence among teenagers can be attributed to drug use.

Ho: There is no significant difference between the suicide rates of teenagers who use drugs and those who do not.

H1: Suicidal rates of teenagers who use drugs are significantly higher than the suicidal rates of non-users.(predictive)

H1 : There is a significant difference between the suicide rates of teenagers who use drugs and those who do not.(non-predictive)

More Examples . . .Example 3. A social researcher is conducting a study to determine if the level of women’s participation in community extension programs of the barangay can be affected by their educational attainment, occupation, income, civil status, and age.

Ho : The level of women’s participation in community extension programs is not affected by their educational attainment, occupation, income, civil status and age.

H1 : The level of women’s participation in community extension programs is affected by their educational attainment, occupation, income, civil status and age.

Hypothesis Testing

D o n o t re jec t n u ll R e jec t n u ll an d accep t a lte rn a te

S tep 5 : Take a sam p le , a rrive a t a d ec is ion

S tep 4 : F orm u la te a d ec is ion ru le

S tep 3 : Id en tify th e tes t s ta tis t ic

S tep 2 : S e lec t a leve l o f s ig n ifican ce

S tep 1 : S ta te n u ll an d a lte rn a te h yp o th eses

Three possibilities

regarding means

H0: m = 0H1: m = 0

H0: m = 0H1: m > 0

H0: m = 0H1: m < 0

Step One: State the null and alternate hypotheses

The null hypothesis

always contains equality.

3 hypotheses about means

Step Two: Select a Level of Significance.

The probability of rejecting the null hypothesis when it is actually true; the

level of risk in so doing.

Rejecting the null hypothesis when it is actually true ( ).a

Accepting the null hypothesis when it is actually false ( ).b

Level of Significance

Type I Error

Type II Error

Step Two: Select a Level of Significance.

Researcher Null Accepts RejectsHypothesis Ho Ho

Ho is true

Ho is false

Correctdecision

Type I error( )a

Type IIError

( )b

Correctdecision

Risk table

Step Three: Select the test statistic.

A value, determined from sample information, used to determine whether or not to reject the null hypothesis.

Examples: z, t, F, c2

Test statistic z Distribution as a test statistic

n/

X

z

The z value is based on the sampling distribution of X, which is normally distributed when the sample is reasonably large (recall Central Limit Theorem).

Step Four: Formulate the decision rule.Critical value: The dividing point between the region where the null hypothesis is rejected and the region where it is not rejected.

0 1.65

D o not

re ject

[P robability = .95]

R egion of

re jection

[P robability= .05]

C ritica l va lue

Sampling DistributionOf the Statistic z, aRight-Tailed Test, .05Level of Significance

Reject the null hypothesis and accept the alternate hypothesis if

Computed -z < Critical -z

or

Computed z > Critical z

Decision Rule

Decision Rule

Using the p-Value in Hypothesis Testing

If the p-Value is larger than or equal to the significance level, a, H0 is not rejected.

p-ValueThe probability, assuming that the null hypothesis is true, of finding a value of the test statistic at least as extreme as the computed value for the test

Calculated from the probability distribution function or by computer

Decision Rule

If the p-Value is smaller than the significance level, a, H0 is rejected.

> .0 5 .1 0p

> .0 1 .0 5p

Interpreting p-values

SOME evidence Ho is not true

> .0 0 1 .0 1p

STRONG evidence Ho is not true VERY STRONG evidence Ho is not true

Step Five: Make a decision.

Movie

Test Concerning Means ( is known or n 30) – a sample mean is to be compared with the population mean.a.) Ho: = o b.) Ho: = o c.) Ho: = o

H1: < o H1: > o H1: o

Test Statistic:

Rejection Region:a.) Z < -Z b.) Z > -Z c.) Z < -Z Z > Z/2

nx

z o )(

An Example A private university hypothesized that the mean starting monthly salary of its

graduates is P8000 with a standard deviation of P1500. A sample of 25 employed graduates showed an average starting salary of P7500. Test this hypothesis at 5% level of significance. Solution:

Ho: = 8000H1: 8000Level of significance: /2 = 0.05/2 = 0.025 (two-tailed)

Rejection Region: Z 1.96 and Z -1.96 (refer to Table A.1 Appendix)

Computation of the Test Statistic: Given: = 7500, = 900, n = 25

Decision: Since the value of the test statistic (z = -1.67) is greater than the tabular value of –1.96, then do not reject Ho and conclude that the claim of the private university is true.

67.11500

25)80007500()(

nx

z o

Test Concerning Means ( is unknown and n < 30)a.) Ho: = o b.) Ho: = o c.) Ho: = o

H1: < o H1: > o H1: o

Test Statistic:

Rejection Region: a.) t < -t b.) .) t > -t c.) .) t < -t/2 and t > t/2

s

nxt o )(

An Example A perfume company claims that their best selling brand, Blue Ginger, has an average purity of 87.9%. Twenty bottles were examined to have an average purity of 85.6% with a standard deviation of 8.3%. Test at the 1% significance level whether the perfume company is telling the truth.Solution:HO : The average purity of blue ginger is equal to 87.9% ( = 0.879).H1 : The average purity of blue ginger is less than 87.9%, or ( < 0.879).Significance Level : 1%Rejection Region: t - 2.539 (with = 0.01 and degrees of freedom of 19, i.e.20 – 1 = 19

Computation of the Test Statistic: t – test

Decision: Since –1.24 is outside the rejection region, then we accept HO: at the 1% significance level. Therefore, the average purity of blue ginger is equal or greater than 87.9%. Therefore, the perfume company is telling the truth.

856.0x 879.0O 083.0s 20n

n

sx

t O

20

083.0879.0856.0

24.1

Test Concerning Proportion Tests of hypotheses concerning proportions are required in many areas. A drug manufacturer is certainly interested in knowing what fraction of the patients recover after taking the medicine. In business, manufacturing firms are concerned about the proportion of defective when shipment is made. A social researcher might be interested in determining the proportion of people favoring the legalization of abortion in the Philippines.

Steps in testing a proportion1. Ho: p = po

2. H1: p < po , p > po or p po

Choose a level of significance of size .Establish the rejection regionComputation: where p= sample proportion

p0 = population proportionn = sample size

Make a decision that is, reject Ho if z falls in the rejection region, otherwise do not reject Ho.

n

pp

ppz

OO

O

)1(

An Example A television station intends to stop airing a “telenovela” show, “Maganda ang Buhay”, if less than 35% of its intended televiewers watch the said program. A random sample of 1500 households yields an average viewing percentage of 32.9%. Should the television company pull out the airing of the show? Use the 5% significance level.

Solution:HO: The average viewing percentage of the telenovela “Maganda ang Buhay” is equal 35% (p = .35). H1: The average viewing percentage of the telenovela “Maganda ang Buhay” is less than 35% (p <.35) Significance Level: 5% Rejection Region:

zz 65.1z

Computation of Test Statistics: z – test

Decision: Since –1.71 is within the rejection region, then we reject HO at the 5% significance level. Therefore, the average viewing percentage of the telenovela “Maganda ang Buhay” is less than 35% and the television company should pull out the airing of the show.

1500n 329.0p 35.0Op

n

pp

ppz

OO

O

)1(

1500

)35.01(35.0

35.0329.0

71.1

Exercises1. A sample of size 60 has a mean of 12.8 and a standard deviation of 2.5. Test the

hypothesis that the population mean is 12 using a.) a one-tailed test at 0.01 level and b.) a two-tailed test at 0.05 level.

2. Suppose it is known that the mean annual income of assembly line workers in a certain plant is P100,000 with a standard deviation of P7000. You suspect that workers with active union interests have higher than the average incomes and take a random sample of 85 of these active members, obtaining a mean of P105,000. Can you say that active union members have significantly higher incomes? Use 5% level of significance.

3. Last year the employees of the city sanitation department donated an average of 250 pesos to the volunteer rescue squad. Test the hypothesis at 0.01 level that the average contribution this year is still 250 if a random sample of 15 employees showed an average of 265 pesos with a standard deviation of 15 pesos. Assume that the donations are approximately normally distributed.4. Suppose that in the past 40% of all adults favored capital punishment. Do we have reason to believe that the proportion of adults favoring capital punishment today has increased if, in a random sample of 200 adults 120 favor capital punishment? Use a 0.05 level of significance.

5. The average height of males in the freshmen class of a certain university has been 65.8 inches, with a standard deviation of 3.2 inches. Is there reason to believe that there has been an increase in the average height if a random sample of 40 males in the present freshmen class have an average height of 67.5 inches? Use a 1% level of significance. 6.A researcher knows that the average height of Filipino women is 1.525 meters. A random sample of 25 women was taken and was found to have a mean height of 1.572 meters, with a standard deviation of .12 meters. Is there reason to believe that the 25 women in the sample are significantly taller than the others at .05 level of significance?

7. A random sample of 100 recorded deaths in the Philippines during the past year showed an average life span of 71.8 years with a standard deviation of 8.9 years. Does this seem to indicate that the average life span today is greater than 70 years? Use a 0.05 level of significance.8. A mayoralty candidate in a certain city expects that approximately 60% of the voters will favor him in the coming election. To support his claim, he let a social researcher conduct a survey consisting of 100 randomly selected voters in the different barangays comprising the city. Results showed that 70 interviewed voters favor this particular mayoralty candidate. Is this sufficient evidence to conclude that the proportion of voters favoring this candidate in the coming election is higher than what he expected. Use .01 level of significance.

Chapter Five

Two-Sample Tests of HypothesisGOALS

When you have completed this chapter, you will be able to:

TWOConduct a test of hypothesis regarding the difference in two population proportions.

THREEConduct a test of hypothesis about the mean difference between paired or dependent observations.

ONEConduct a test of hypothesis about the difference between two independent population means.

Chapter Eleven continued

Two Sample Tests of Hypothesis


FOURUnderstand the difference between dependent and independent samples.

Difference – of – Means Test ( 1 and 2 are known) – two sample means are being compared.

a.) Ho: 1 = 2 b.) Ho: 1 = 2 c.) Ho: 1

= 2

H1: 1 < 2 H1: 1 > 2 H1: 1 2

Test Statistic:

2

22

1

21

021 )(

nn

dxxz

Rejection Region:a.) Z < -Z b.) Z > Z c.) Z < -Z/2 or Z > Z/2

An Example A university investigation team conducted a study to determine whether car ownership of students affects their academic performance. A random sample of 50 students who are non-car owners showed a grade point average (GPA) of 85 with a standard deviation of 10.2 while a group of 60 car owners got an average grade of 80 with a standard deviation of 8.9. Do the data provide sufficient evidence to indicate that the non-car owners are better than car owners in terms of academic performance? Test using 5 % level of significance.

Solution:Let 1 = mean grade for non-car owners

2 = mean grade for car owners

Ho: 1 = 2

H1: 1 > 2 (one-tailed)

Level of significance: = 0.05Rejection Region: Z 1.645 ( refer to Table A.1 Appendix)

Computation of the Test Statistic:Given:

60,9.8,80

50,2.10,85

2222

111 1

nsx

nsx

The value of the z statistic will be

Decision: Since the value of the test statistic (z= 2.71) is greater than the tabular value of 1.645, then reject Ho. The data provide sufficient evidence to indicate non-car owners have better academic performance than car owners.

71.2

60

)9.8(

50

)2.10(

0)8085()(22

2

22

1

21

021

nn

dxxz

Difference – of – Means Test ( 1 = 2 = but unknown)

a.) Ho: 1 = 2 b.) Ho: 1 = 2 c.) Ho: 1 = 2

H1: 1 < 2 H1: 1 > 2 H1: 1 2

Test Statistic:

where

with degrees of freedom v = n1 + n2 -2

21

021

11

)(

nnSp

dxxt

2

)1()1(

21

222

211

nn

snsnSp

Rejection Region: a.) t < -t b.) t > t c.) t < -t/2

t > t/2

An Example A businessman is planning to put up either a computer store or a video store. He randomly selected 10 computer stores and 10 video stores. The average weekly incomes are P45,000 and P53,000 respectively with corresponding standard deviations of P17,500 and P15,000. At the 1% significance level, indicate whether he should be convinced to put up a computer store or a video store.

Solution:Let 1 = mean weekly income of a computer stores

2 = mean weekly income of video stores

Ho: 1 = 2

H1: 1 ≠ 2 (two-tailed)

Significance Level: 1%Rejection Region: and and t> 2.878

18,005.tt 18,005.tt

878.2t

Computation of Test Statistics: t – test

2121

222

211

21

112

)1()1(nnnn

snsn

xxt

1021 nn

000,451 Px 000,532 Px 500,171 P 000,152 P

1.1

10

1

10

1

21010

)000,15)(110()500,17)(110(

000,53000,4522

t

Decision: Since –1.1 is not within the rejection region, then we accept HO : at the 1% significance level. Therefore, there is no significant difference whether the businessman puts up either a computer store or video store. Therefore, opening a computer store is as profitable as opening a video store.

Testing the Difference of Two Proportions

a) Ho: P1 = P 2 b.) Ho: P1 = P2 c.) Ho: P1 = P2

H1: P1 < P2 H1: P1 > P2 H1: P1 P2

Test Statistic:

where Rejection Region:

a.) Z < -Z b.) Z > Z c.) Z < -Z/2 and Z > Z/2

21

21

11)1(

nnpp

ppz

2

22

1

11 ,

n

xp

n

xp

21

2211

nn

pnpnp

An Example

Consider random samples of 85 married couples last year and another 100 couples this year. Data show that 70 of last year’s married couples and 95 of this year’s married couples are entrepreneurs. Using the 1% significance level, can it be stated that the proportion of entrepreneurs has significantly increased from last year to this year?

Solution: Let P1 and P2 be the true proportion of married couples last year and this year who are entrepreneurs, respectively.

HO: The proportion of entrepreneurs has not significantly increased from last year to this year, or (p1 = p2 ).

H1: The proportion of entrepreneurs significantly increased from last year to this year, or . (p1 < p2 ).

Significance Level : 1%Rejection Region:

2/zz

005.zz 33.2z

Computation of Test Statistics: z – test

21

21

11)1(

nnpp

ppz

851 n 1002 n 823.085

701 p 95.0

100

952 p

89.010085

)95.0(100)823.0(85

21

2211

nn

pnpnp

73.2

100

1

85

1)89.01)(89.0(

95.0823.0

z

Decision: Since –2.73 is within the rejection region, then reject HO : at the 1% significance level. Therefore, the proportion of entrepreneurs has significantly increased from last year to this year.

Dependent Samples

Samples are called dependent if any one of the following cases is true.

1. Before and after experiments - an experimental variable is being measured in terms of its effectiveness. This can be done by comparing the observations taken from the same group before and after the introduction of the experimental variable. If significant difference exists, then the experimental variable is said to be effective.

2.Two different groups matched pair by pair with respect to some relevant characteristics. The primary purpose of matching is to ensure that observations may differ because of the experimental variable being tested.

• A study is conducted to determine the effect of fraternity membership to the performance of the students. One way to measure the effect of fraternity membership (experimental variable) is to consider say for instance 30 students and their performance (grades) before and after membership to fraternity are being compared (case 1). Another way is to compare the performance of two groups of students, those who are fraternity members and those who are non-fraternity members. The students in the first group should be carefully matched with the students in the second group with respect to some relevant characteristics such as IQ level, study habits, gender, etc. This will ensure that if a significant difference in the performance exists, then this could solely be attributed to the fact that the first group of students are fraternity members and the second group are non-members.

a.) Ho: 1 = 2 b.) Ho: 1 = 2 c.) Ho: 1 = 2

H1: 1 < 2 H1: 1 > 2 H1: 1 2

Test Statistic:

where:

Sd

nddt

)( 0

)1(

)()( 2

nn

ddnSd i

ei

with v = n –1, n is the number of pairs

Rejection Region:a.) t < -t b.) t > t c.) t < -t/2 and t > t/2

Example. If you wished to measure the effectiveness of a new diet you would weigh the dieters at the start and at the finish of the program

Chapter SixAnalysis of Variance


ONE Understand the purpose of Analysis of Variance.TWO

Discuss the general idea of analysis of variance.

THREE

Organize data into a one-way

Chapter Twelve continuedAnalysis of Variance


FOURDefine and understand the terms treatments and blocks.FIVEConduct a test of hypothesis among three or more treatment means.

Purpose of ANOVA

• The Analysis of Variance (ANOVA for short) is a technique designed to test whether or not more than two sample means differ significantly from each other.

• In the preceding chapter, we learned that the t-test or the z-test is used to test for the significance of difference between two sample means. ANOVA, therefore, which can be used to test for the equality of several means simultaneously is an extension of the t-test or the z-test which can handle only two means at a time.

Illustration• Suppose for example, three methods of teaching (A,

B, C) are being compared in terms of effectiveness and are being employed to three groups of students. If the researcher uses the t-test, he will have to test separately for the following pairs: A and B, A and C, and B and C. In other words, the researcher would be using the t-test formula three times which would mean spending so much time and effort. Of course there is a possibility that none of the pairs are significantly different and so it is a waste of time and effort on the part of the researcher.

Purpose of ANOVA• Now, if the researcher uses ANOVA, he could take all

the three sample means simultaneously and the test stops if the conclusion arrived at is that of no significant difference. However, if the conclusion arrived at in using ANOVA shows significant difference between the three sample means (A, B, C), then the t-test must be used to find which pair of means differ significantly. Another way is to use the Duncan’s Multiple Range Test (DMRT) which is beyond the scope of this manual.

Why the name analysis of variance? • The phrase analysis of variance means that we will be

analyzing the total variation in a set of observations that can be attributed to specific sources or causes of variation. With reference to the above example, two such specific sources of variations might be (1) actual differences in the three methods that can be shown in the varying performance of the three group of students (treatment), and (2) chance, which in problems like this is usually called experimental error.

The populations have equal standard deviations.

ANOVA requires the following conditions

Underlying assumptions for ANOVA

The sampled populations follow the normal distribution.

The samples are independent

Steps in ANOVA

• The following are the steps and computations involved in the Analysis of Variance technique.

Ho: 1 = 2 = 3 = . . . = k ( for k sample means)

H1 : At least two means differ significantly

Test Statistic: Rejection Region: F > F tabular (, k-1, n-k)where

MSE

MSTrF

1)(

k

SSTrTreatmentMeanSquareMSTr

n

GrandTotal

n

ct

n

ct

n

ct

tesTreatmenSumofSquarSSTr

k

k22

2

22

1

21 )(

...

)(

Computational Formulas

kn

SSEErrorMeanSquareMSE

)(

SSTrTSSesErrorSumofSquarSSE )(

n

GrandTotalxSquaresTotalSumofTSS i

n

i

22

1

)()(

Source of Variation Degrees of Freedom(df)

Sum of Squares(SS)

Mean Square(MS)

FValue

Treatment k-1 SSTr MSTr F = MSTr/ MSE

Error n-k SSE MSE

Total n-1 TSS

Analysis of Variance (ANOVA) Table

ECP Restaurants specialize in meals for families. The owner recently developed a new meat loaf dinner. Before making it a part of the regular menu she decides to test it in several of her restaurants.

Example 1

She would like to know if there is a difference in the mean number of dinners sold per day at Restaurants A, B, and C. Use the .05 significance level.

Number of Dinners Sold by Restaurant

RestaurantDay

A B C

Day 1Day 2Day 3Day 4Day 5

13121412

10121311

1816171717

Example 1 continued

Solution:Ho: mA = mB = mC H1: At least 2 means differ significantly

Level of significance is .05.

Example 2 continued

Rejection Region:The numerator degrees of freedom, k-1, equal 3-1 or 2. The denominator degrees of freedom, n-k, equal 13-3 or 10. The value of F at 2 and 10 degrees of freedom is 4.10. Thus, H0 is rejected if F>4.10

Example 2 continued

Using the data provided, the ANOVA calculations follow.

Computations

= 132 + 122 + . . .+ 172 - (182) 2 / 13= 2634 – 2548 = 86

n

GrandTotalxSquaresTotalSumofTSS i

n

i

22

1

)()(

25.76254825.262413

)182()

5

85

4

46

4

51(

)(...

)(

2222

22

2

22

1

21

n

GrandTotal

n

ct

n

ct

n

ct

tesTreatmenSumofSquarSSTr

k

k

75.925.7686)( SSTrTSSesErrorSumofSquarSSE

ANOVA TableSource of Variation

Sum of Squares

Degrees of

Freedom

Mean Square

F

Treatments 76.25 3-1=2

76.25/2=38.125 38.125

.975= 39.103

Error 9.75 13-3=10

9.75/10=.975

Total 86.00 13-1=12

Example 1 continued

Example 1 continued

The ANOVA tables on the next two slides are from the SPSS and EXCEL systems

The mean number of meals sold at the three locations is not the same. Specifically, as shown in the Duncan’s Multiple Range Test, the meals sold at Restaurant C is significantly higher than A and B, but A and B do not differ significantly.

Since an F of 39.103 > the critical F of 4.10, the decision is to reject the null hypothesis and conclude that

At least two of the treatment means are not the same.

Example 1 continuedANOVAvolsold

Sum of Squares df Mean Square F Sig.Between Groups 76.250 2 38.125 39.103 .000

Within Groups 9.750 10 .975Total 86.000 12

volsoldDuncanRestaurant N Subset for alpha = 0.05

1 22 4 11.50001 4 12.75003 5 17.0000Sig. .094 1.000Means for groups in homogeneous subsets are displayed.

SUMMARY

Groups Count Sum Average Variance

Aynor 4 51 12.75 0.92

Loris 4 46 11.50 1.67

Lander 5 85 17.00 0.50

ANOVA

Source of Variation SS df MS F P-value F crit

Between Groups 76.25 2 38.13 39.10 2E-05 4.10

Within Groups 9.75 10 0.98

Total 86.00 12

Anova: Single Factor

Example 2 continued

Some common Post Hoc Tests are Duncan’s Multiple Range Test of DMRT and Tukey’s Test.

When I reject the null hypothesis that the

means are equal, I want to know which treatment

means differ.

Chapter SevenLinear Regression and Correlation


ONEDraw a scatter diagram.TWO Understand and interpret the terms dependent variable and independent variable.THREECalculate and interpret the coefficient of correlation, the coefficient of determination, and the standard error of estimate.FOURConduct a test of hypothesis to determine if the population coefficient of correlation is different from zero.

Goals

Chapter Thirteen continuedLinear Regression and Correlation


FIVE Calculate the least squares regression line and interpret the slope and intercept values.SIXConstruct and interpret a confidence interval and prediction interval for the dependent variable.SEVENSet up and interpret an ANOVA table.

Goals

Correlation Analysis

The Independent Variable provides the basis for estimation. It is the predictor variable.

Correlation Analysis is a group of statistical techniques to measure the association between two variables.

A Scatter Diagram is a chart that portrays the relationship between two variables.

The Dependent Variable is the variable being predicted or estimated.

Advertising Minutes and $ Sales

0

5

10

15

20

25

30

70 90 110 130 150 170 190

Advertising Minutes

Sale

s ($

thou

sand

s)

The Coefficient of Correlation, r

Negative values indicate an inverse relationship and positive values indicate a direct relationship.

The Coefficient of Correlation (r) is a measure of the strength of the relationship between two variables.

-1 10

P earson's r

Also called Pearson’s r and Pearson’s product moment correlation coefficient.

It requires interval or ratio-scaled data.

It can range from -1.00 to 1.00.

Values of -1.00 or 1.00 indicate perfect and strong correlation.

Values close to 0.0 indicate weak correlation.

Perfect Negative Correlation

0 1 2 3 4 5 6 7 8 9 10

10 9 8 7 6 5 4 3 2 1 0

X

Y

0 1 2 3 4 5 6 7 8 9 10

10 9 8 7 6 5 4 3 2 1 0

X

Y

Perfect Positive Correlation

0 1 2 3 4 5 6 7 8 9 10

10 9 8 7 6 5 4 3 2 1 0

X

Y

Zero Correlation

0 1 2 3 4 5 6 7 8 9 10

10 9 8 7 6 5 4 3 2 1 0

X

Y

Strong Positive Correlation

Formula for r

We calculate the coefficient of correlation from the following formula.

])()(][)()([

))((2222 yynxxn

yxxynr

Coefficient of Determination

It is the square of the coefficient of correlation. It ranges from 0 to 1.It does not give any information on the direction of the relationship between the variables.

The coefficient of determination (r2) is the proportion of the total variation in the dependent variable (Y) that is explained or accounted for by the variation in the independent variable (X).

Example

Suppose an investigator wants to determine the extent to which a relationship exists between size of income and the number of years of education the individual has completed. The following table shows the data of ten individuals. Compute and interpret r.

Individual # Y, Income (P,000) X, Education (years) 1 45 20

2 63 19 3 36 16 4 52 20 5 29 12 6 33 14 7 48 16 8 55 18 9 72 20 10 66 22

SolutionPreliminary computations of the above data: n = 10 x = 177 y = 499 xy= 9,173 x2

= 3,221 y2 = 26,793 Therefore,

])()(][)()([

))((2222 yynxxn

yxxynr

83.])499()26793(10][)177()3221(10[

)499)(177()9173(1022

r

The value of r2 which is (.83) 2 = .6889 means that 68.89% of the total variability in income is being explained by its linear relationship with education.

Testing for the Significance of rThe sample correlation coefficient r is a value computed from a random sample of n pairs of measurements. Different random samples of size n from the same population will generally produce different values of r. Therefore, we need to test for the significance of the computed r value. The null hypothesis will be = 0 against the alternative that 0. The rejection of Ho will lead to a conclusion that the existing relationship is significant at alpha level. However, failure to reject Ho would imply that the relationship is not significant and thus it can be attributed to chance.From the above result, test the null hypothesis that there is no linear relationship between income and education.. Use a 0.05 level of significance.

Solution: Ho: = 0H1 : 0 (two-tailed)

Level of Significance: 0.05 / 2 = 0.025Rejection Region: t > 2.306 and t < -2.306

Computation of the Test Statistic:

Decision: Since the value of the test statistic (t = 4.26) is greater

than the tabular value of 2.306, therefore reject the null hypothesis of no linear relationship and conclude that there is a significant relationship between education and income. Specifically, the higher the educational attainment, the higher the income earned

26.4)83(.1

)8(83.

1

)2(22

r

nrt

Regression Analysis

The least squares criterion is used to determine the equation. That is the term S(Y – Y’)2 is minimized.

In Regression Analysis we use the independent variable (X) to estimate the dependent variable (Y).

The relationship between the

variables is linear.

Both variables must be at least

interval scale.

The Y values are statistically independent. This means that in the selection of a sample, the Y values chosen for a particular X value do not depend on the Y values for any other X values.

For each value of X, there is a group of Y values, and these Y values are normally distributed.

Assumptions Underlying Linear Regression

The means of these normal distributions of Y values all lie on the straight line of regression.

The standard deviations of these normal distributions are the same.

Regression Analysis

The regression equation is Y(hat)= a + bX (1)

where Y(hat) is the predicted value of Y for any X.

a is the Y-intercept. It is the estimated Y value when X=0

b is the slope of the line, or the average change in Y’ for each change of one unit in X

The least squares principle is used to obtain a and b.

Deterministic ModelModel (1) is called a deterministic model. It

gives an exact relationship between x and y. This model expresses that y is determined exactly by x and for a given value of x there is one and only one value of y.

Probabilistic ModelHowever, in many cases the relationship between variables is

not exact. For instance, if y is food expenditure and x is income, then model (1) would state that food expenditure is determined by income only and that all households with the same income will spend the same amount of food. But food expenditure is determined by many other variables such as household size, taste and preference which explains why different households with the same income spend different amounts of money for food. Hence, to take these variables into considerations and to make our model complete, we add another term to the right side of model(1) called the random error term ( ).

• The complete regression model is written asy = A + Bx + (2)

The regression model (2) is called a probabilistic model or a statistical relationship.

The random error term () is included in the model to represent the following two phenomena.

• Missing or omitted variables. As mentioned earlier, food expenditure is affected by many variables other than income. The random error term is included to capture the effect of all those missing variables that have not been included in the model.

• Random variation. Human behavior is unpredictable. For example, a household may have many parties during the month and may spend more than usual on food during that month. This variation in food expenditure may be called random variation.

Regression Analysis

The least squares principle is used to obtain a and b. The equations to determine a and b are:

22 )()(

))((

xxn

yxxynb

• An ExampleWe have already established a significant linear relationshipbetween income and education (i.e. number of years ineducation). We may now formulate an equation allowing us topredict the income of a person given the number of years heattended for his education. Referring to the above example,

wefind thatn = 10 Xi = 177 Yi = 499 Xi2 = 3221 Yi2 = 2679 XiYi =

9173 Y(bar) = 49.9 X(bar) = 17.7

Therefore,

And

The model for prediction purposes is given by

87.3)177()3221(10

)499)(177()9173(10

)()(

))((222

xxn

yxxynb

59.18)7.17(87.39.49 xbya

xy 87.358.18

The equation can be interpreted as per additional year in

educational attainment, there corresponds 3.87 or 3,870 pesos

increase in the income of a person.

We can use the regression equation to estimate values of Y.

The estimated income of a person who have spent 16 years of education will be:

34.43)16(87.358.18 Income

Standard Error of Estimate (SEE) – a measure of the variability of the regression line, i.e. the

dispersion around the regression line – it tells how much variation there is in the dependent

variable between the raw value and the expected value in the regression

– this SEE allows us to generate the confidence interval on the regression line as we did in the estimation of means

ExerciseIt is generally known that the number of road accidents is

inversely proportional with road width. The following data show the results of a study indicating the number of accidents occurring annually at roads with different widths:

Road width (in feet) (X)75 52 60 33 22 40 70 35 55 80Number of accidents (Y)40 84 55 92 90 86 38 88 78 32

– Draw the scatter diagram.– Find the correlation coefficient and test for its significance.– Establish a regression equation of the form Y = a + bX– Predict the number of accidents for a 50 ft road width.

Goals in Statistic

Documents

Transcript of Goals in Statistic