Statistical thinking

I.4 Sampling Lecture Notes

1. Statistical Thinking

Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write. – H. G. Wells, author of “War of the Worlds”

Definition: Statistics is the science of collecting, analyzing, and interpreting data in such a way that the conclusions can be objectively evaluated.

2. Three Phases of Statistics

• Collect the data
• Analyze the data
– order the data
– graphical displays
– numerical calculations (such as mean and standard deviation)
• Interpret the results
– use proper statistical techniques to substantiate or refute hypothesized statements
– match data to the appropriate technique
– determine whether the proper assumptions are satisfied

3. Two types of statistics

• Descriptive statistics – summarize and describe a characteristic for some group

• Inferential statistics – estimate, infer, predict, or conclude something about a larger group

4. Examples

Descriptive         Inferential
Batting Average     Polls
Yards Per Carry     Medical Studies
Test Scores         Market Surveys


5. Two types of data

• Quantitative data – values recorded on a natural numerical scale
• Qualitative data – classified into categories

6. Quantitative Data

• Weight of subjects in medical sample
• Height of buildings in Chicago
• Temperatures per day at Antarctica Weather Station

7. Qualitative Data

• Gender of subjects in medical sample
• Political affiliation of respondents in a poll survey
• Class (fresh, soph, jr, sr) of Math 101 students

8. Vocabulary

• The population is the entire set of objects (people or things) under consideration.
• A sample is a subset of the population that is available for the analysis.
• A bias is a favoring of certain outcomes over others.
• A census collects data from each member of the population.
• A statistic is a statement of numerical information about a sample.
• A parameter is a statement of numerical information about a population.
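To make the vocabulary concrete, here is a minimal Python sketch. The population of test scores is invented for illustration and is not data from these notes; the names `population`, `sample`, `parameter`, and `statistic` are likewise only illustrative.

```python
import random
import statistics

# Hypothetical population: test scores for all 500 students in a course.
rng = random.Random(0)  # fixed seed so the run is repeatable
population = [rng.randint(40, 100) for _ in range(500)]

# A census uses every member of the population, so its mean is a parameter.
parameter = statistics.mean(population)

# A sample is a subset of the population; its mean is a statistic.
sample = rng.sample(population, 25)
statistic = statistics.mean(sample)

print(f"parameter (population mean): {parameter:.2f}")
print(f"statistic (sample mean):     {statistic:.2f}")
```

The two printed values will usually be close but not equal, which is the gap that inferential statistics tries to quantify.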

9. Census versus Sample

Would you use a census or a sample to determine the following:

• Project the winner of an election
• Calculate a baseball player’s batting average
• Predict whether it will rain tomorrow
• Test whether the soup is too salty
• Calculate Shaq’s free throw average
• Use a market study to determine a new flavor of toothpaste
• Report the Dow Jones Average
• Generalize a medical study to other groups
• The average score on the first test

10. Dealing with bias

Bias in some form occurs in the collecting of most, if not all, sets of data.

The bias may come from

• the portion of the population surveyed
• the phrasing of the questions

11. Examples

• “Dewey defeats Truman” projection of Chicago Tribune based on 1948 telephone poll

• “Are you in favor of Illinois banning cell phones in cars? Dial *91 on your cellular phone to vote.”

• “Do you feel budget cuts are more important than humanitarian programs that would need to be cut to obtain a balanced budget?”

12. Methods for Choosing Samples

• Judgement Sample

– Use the opinion of person(s) deemed qualified to choose members of the sample.

– Example: to investigate study habits of athletes, ask their coaches and teachers.

• Simple Random Selection

– Use random numbers to select the sample.


– Page 315 Random Digit Table:

72985547555515086461

• Stratified Sampling

– Divide the population into relatively homogeneous groups, draw a sample from each group, and take their union.
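The last two methods can be sketched in Python. This is only an illustration under stated assumptions: the 40-student population is hypothetical, and Python's pseudo-random generator stands in for the printed random digit table.

```python
import random

rng = random.Random(1)  # fixed seed for a repeatable illustration

# Hypothetical population: 40 students labeled by class year.
years = ["fresh", "soph", "jr", "sr"] * 10
population = [(f"student{i:02d}", year) for i, year in enumerate(years)]

# Simple random selection: every member is equally likely to be chosen.
simple_sample = rng.sample(population, 8)

# Stratified sampling: split into homogeneous groups (here, class year),
# draw a sample from each group, and take the union.
strata = {}
for name, year in population:
    strata.setdefault(year, []).append((name, year))

stratified_sample = []
for year, members in strata.items():
    stratified_sample.extend(rng.sample(members, 2))

print(simple_sample)
print(stratified_sample)
```

The stratified sample always contains exactly two students from each class year, while the simple random sample may over- or under-represent a year by chance.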

13. Goals of a good sample

• from the correct population
• chosen in an unbiased way
• large enough to reflect total population

14. Normal Distribution of Random Events

Toss a coin 100 times and count the number of heads.

How many heads would you expect?

• about 50
• exactly 50

It does not seem reasonable that the count will be exactly 50.

We would not be surprised if the number of heads turned out to be 48 or 51 or even 55.

We would be surprised to see 80 heads, and would begin to suspect that the coin was not fair.
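A quick simulation gives a feel for how much the count varies. This is a minimal sketch using Python's pseudo-random generator as a stand-in for a physical coin; the function name `count_heads` and the seed are arbitrary choices.

```python
import random

rng = random.Random(2024)  # arbitrary seed, chosen only for repeatability

def count_heads(n_tosses, p_heads=0.5):
    """Toss a coin n_tosses times and return the number of heads."""
    return sum(rng.random() < p_heads for _ in range(n_tosses))

# Repeat the 100-toss experiment a few times and look at the counts.
counts = [count_heads(100) for _ in range(10)]
print(counts)
```

Most runs should show counts scattered around 50 rather than landing on it exactly.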

15. Coin Toss Data

Experiment: A coin is tossed n = 100 times.

The experiment is repeated 1000 times.

Here are the results:


16. Frequency Table: No. of Heads

Heads  Freq    Heads  Freq    Heads  Freq
1      0       45     54      58     27
...    0       46     49      59     19
34     0       47     54      60     11
35     2       48     66      61     11
36     2       49     89      62     5
37     2       50     70      63     4
38     2       51     77      64     2
39     5       52     85      65     0
40     14      53     62      66     0
41     16      54     57      67     1
42     25      55     52      68     0
43     30      56     40      ...    0
44     31      57     36      100    0

mean = 50.296

standard deviation = 5.100
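The reported mean and standard deviation can be recomputed from the frequency table. The sketch below copies the nonzero frequencies into a dictionary (the name `freq` is just a label) and assumes the standard deviation divides by the number of repetitions rather than by one less.

```python
import math

# Frequency table from the notes: number of heads -> how many of the
# 1000 repetitions produced that count. Counts not listed have frequency 0.
freq = {
    35: 2, 36: 2, 37: 2, 38: 2, 39: 5, 40: 14, 41: 16, 42: 25, 43: 30,
    44: 31, 45: 54, 46: 49, 47: 54, 48: 66, 49: 89, 50: 70, 51: 77,
    52: 85, 53: 62, 54: 57, 55: 52, 56: 40, 57: 36, 58: 27, 59: 19,
    60: 11, 61: 11, 62: 5, 63: 4, 64: 2, 67: 1,
}

total = sum(freq.values())  # number of repetitions
mean = sum(h * f for h, f in freq.items()) / total
# Population-style variance: divide by the number of repetitions.
variance = sum(f * (h - mean) ** 2 for h, f in freq.items()) / total
std_dev = math.sqrt(variance)

print(total, round(mean, 3), round(std_dev, 3))
```

The frequencies should sum to the 1000 repetitions, and the computed values should agree with the figures reported above to three decimal places.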

17. Coin Toss Histogram

[Figure: histogram of the number of heads per 100 tosses over the 1000 repetitions; horizontal axis from 30 to 70.]

18. Sampling Distributions

If we could examine all possible samples of size n of a population, then the frequency distribution of the means of these samples is approximately normally distributed (provided n is reasonably large).

• µ = the mean over the entire population
• σ = the standard deviation over the entire population
• $\bar{x}$ = the mean of the sampling distribution
• $\sigma_{\bar{x}}$ = the standard deviation of the sampling distribution

19. Two Rules

Rule 1. $\bar{x} = \mu$

Rule 2. $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$

We are assuming in Rule 2 that the size of the entire population is much larger than the sample size n.
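Both rules can be checked approximately by simulation. Examining all possible samples is not practical, so this sketch draws 2,000 random samples instead; the population is hypothetical (10,000 uniform values), and the sample size and seed are arbitrary.

```python
import math
import random
import statistics

rng = random.Random(7)  # arbitrary seed for repeatability

# Hypothetical population: 10,000 values with no particular bell shape.
population = [rng.uniform(0, 100) for _ in range(10_000)]
mu = statistics.mean(population)
sigma = statistics.pstdev(population)

n = 25  # sample size, small relative to the population
sample_means = [statistics.mean(rng.sample(population, n))
                for _ in range(2_000)]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.pstdev(sample_means)

print("Rule 1:", round(mean_of_means, 2), "vs mu =", round(mu, 2))
print("Rule 2:", round(sd_of_means, 2),
      "vs sigma/sqrt(n) =", round(sigma / math.sqrt(n), 2))
```

Because only some of the possible samples are drawn, the two sides of each rule should agree closely but not exactly.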

20. Two Outcome Situations

Situation: Two outcomes (for–against; heads–tails; yes–no)

p = percent in favor

q = percent opposed

Written as decimals, p + q = 1. Why?

21. Example

• 29% of Americans favor Bush’s handling of the War in Iraq,
• while 71% do not.
• p = .29 q = .71
• p + q = .29 + .71 = 1


22. Quantifying the Data

• We count a for (or yes) vote as $X_1 = 1$
• and an against (or no) vote as $X_2 = 0$
• Out of 100 people, we would expect
• $100p$ yes votes and $100q$ no votes

23. To calculate the mean

Outcome (out of 100 cases):

Vote                Frequency    Freq $\times X_i$
$X_1 = 1$ (yes)     $100p$       $100p$
$X_2 = 0$ (no)      $100q$       $0$
Total                            $100p$

So the mean $\mu = \dfrac{100p}{100} = p$

24. Standard Deviation

Out of 100 cases,

Vote         Freq      $(X_i - \mu)^2$    Freq $\times (X_i - \mu)^2$
$X_1 = 1$    $100p$    $(1 - p)^2$        $100p(1 - p)^2$
$X_2 = 0$    $100q$    $(0 - p)^2$        $100q(0 - p)^2$
Total                                     $100p(1 - p)^2 + 100q(0 - p)^2$

25. Calculating standard deviation

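First divide the Total by n = 100 cases:

$$
\begin{aligned}
\frac{\text{Total}}{100} &= p(1 - p)^2 + q(0 - p)^2 \\
&= p(1 - p)^2 + qp^2 \\
&= pq^2 + qp^2 && [1 - p = q] \\
&= pq(q + p) \\
&= pq && [\text{because } p + q = 1]
\end{aligned}
$$

Then to get σ, take the square root:

$$\sigma = \sqrt{pq}$$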

26. The p–q Rule

Suppose a coin has probability p of landing heads and q = 1 − p of landing tails.

(A value other than $p = \frac{1}{2}$ means the coin is not “fair.”)

The variable X that records a head (X = 1) versus a tail (X = 0) has mean µ = p and standard deviation $\sigma = \sqrt{pq}$.

27. Bush Popularity Example

29% think Bush is doing a good job
71% do not

p = .29 and q = .71

µ = p = .29

$\sigma = \sqrt{pq} = \sqrt{(.29)(.71)} = .4538$
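The p–q rule and this example can be checked against a direct calculation on 100 cases, 29 coded as 1 and 71 coded as 0. This is a small sketch; the helper name `pq_rule` is an illustrative choice, and the direct standard deviation divides by n, matching the derivation in these notes.

```python
import math
import statistics

def pq_rule(p):
    """Return (mean, standard deviation) of a 0/1 outcome with P(1) = p."""
    q = 1 - p
    return p, math.sqrt(p * q)

# Direct check on 100 cases: 29 ones (yes) and 71 zeros (no).
votes = [1] * 29 + [0] * 71
direct_mean = statistics.mean(votes)
direct_sd = statistics.pstdev(votes)  # divides by n, as in the derivation

mu, sigma = pq_rule(0.29)
print(mu, round(sigma, 4))
print(direct_mean, round(direct_sd, 4))
```

Both lines should print the same pair of values, since the formula is just a shortcut for the direct calculation.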

28. Fair Coin Toss

Heads = 1, Tails = 0

With a fair coin, we expect the percentage of heads to be 50%:

p = .5 and q = .5

µ = p = .5

$\sigma = \sqrt{pq} = \sqrt{(.5)(.5)} = \sqrt{.25} = .5$


29. Percents versus Actual Numbers

Sometimes our calculations are in terms of percents and sometimes they are given as actual numbers.

For example, suppose we flip a coin 340 times.

We would expect to have roughly 170 heads (and 170 tails).

We expect the percentage of heads to be $\frac{170}{340} = \frac{1}{2}$, or 50%.

p = 0.5 is the number used in our formulas (along with q = .5).

To convert from the percentage to the actual number of expected heads, simply multiply p by n.

In this case, we expect $\frac{1}{2} \times 340 = 170$ heads.

30. Percents versus Actual Numbers Cont’d

The p–q formula computes the standard deviation σ for the population when we are thinking in terms of percent.

The formula $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$ computes the standard error of the mean when we are thinking in terms of percent.

To convert to actual numbers, multiply $\sigma_{\bar{x}}$ by n.

By properties of the square root function,

$$\frac{\sigma}{\sqrt{n}} \cdot n = \sigma \cdot \sqrt{n}$$

31. Percents versus Actual Numbers

Flip a coin 340 times and count the number of heads.

Mean and Standard Deviation for the Entire Population

$\mu = \frac{1}{2} = 0.5$

$\sigma = \sqrt{(.5 \times .5)} = 0.5$

Mean and Standard Deviation for Sample Size of n = 340 tosses

In terms of percents:

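$\bar{x} = \mu = 0.5$

$\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{340}} = .027$

In terms of actual numbers, multiply by n = 340:

mean = 0.5 × 340 = 170

stan. dev. = $\sigma_{\bar{x}} \times 340 = 0.5 \times \sqrt{340} = 9.22$ (rounding $\sigma_{\bar{x}}$ to .027 before multiplying would give 9.18)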

32. Interpretation

Since the sampling distribution is approximately normal with mean 170 and standard deviation of 9.2, the 68–95–99.7 rule tells us:

If you flip a fair coin 340 times, you would expect the number of heads to be

between 161 and 179 about 68% of the time [1 standard deviation]

between 152 and 188 about 95% of the time [2 standard deviations]

between 142 and 198 about 99.7% of the time [3 standard deviations]
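The three intervals can be rebuilt from the mean and standard deviation, and the rule of thumb can be compared with exact binomial probabilities. This is a hedged sketch: the helper `exact_probability` is an illustrative name, and it is written for a fair coin only, since it divides by 2 to the power n.

```python
import math

n, p = 340, 0.5
mean = n * p                      # expected number of heads
sd = math.sqrt(p * (1 - p) * n)   # sigma * sqrt(n), in actual numbers

def exact_probability(lo, hi):
    """P(lo <= heads <= hi) for n fair tosses, from binomial counts."""
    return sum(math.comb(n, k) for k in range(lo, hi + 1)) / 2 ** n

intervals = []
for k in (1, 2, 3):
    lo, hi = round(mean - k * sd), round(mean + k * sd)
    intervals.append((lo, hi))
    print(f"{k} sd: {lo} to {hi}, "
          f"exact probability {exact_probability(lo, hi):.3f}")
```

The exact probabilities should land near, though not exactly on, the rule-of-thumb percentages, because the number of heads is a whole number and the normal curve is only an approximation.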

33. Coin–Toss Model

• Suppose a coin has probability p of landing heads and q = 1 − p of landing tails.

• Suppose we flip the coin n times and record x, the number of heads for each sample.

• The values of x will be approximately normally distributed with mean and standard deviation given as follows:

              Population              Sample Distribution     Sample Distribution
                                      (Percents)              (Actual Numbers)
Mean          $p$                     $p$                     $p \cdot n$
Stan. Dev.    $\sigma = \sqrt{pq}$    $\sigma / \sqrt{n}$     $\sigma \cdot \sqrt{n}$
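The table translates directly into a small function. This is a sketch, with `coin_toss_model` and its dictionary keys chosen here for illustration rather than taken from the notes.

```python
import math

def coin_toss_model(p, n):
    """Return (mean, standard deviation) for each column of the table."""
    q = 1 - p
    sigma = math.sqrt(p * q)
    return {
        "population": (p, sigma),
        "sample_percents": (p, sigma / math.sqrt(n)),
        "sample_actual": (p * n, sigma * math.sqrt(n)),
    }

# Fair coin, for the two sample sizes used in these notes.
for n in (100, 340):
    print(n, coin_toss_model(0.5, n))
```

For a fair coin and n = 100, the "actual numbers" column gives a mean of 50 heads and a standard deviation of 5.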

34. Comparison with Previous Experiment

Toss a coin n = 100 times

              Actual Value    Predicted Value
Mean          50.296          50
Stan. Dev.    5.100           5
