146 04 data_collection

MATH& 146 Lesson 4 Section 1.3 Study Beginnings 1

Transcript of 146 04 data_collection

Page 1: 146 04 data_collection

MATH& 146

Lesson 4

Section 1.3

Study Beginnings


Page 2: 146 04 data_collection

Populations and Samples

The population is the complete collection of

individuals or objects that you wish to learn about.

To study larger populations, we select a sample. The

idea of sampling is to select a portion of the population

and study that portion to gain information about the

population.



Page 3: 146 04 data_collection

Parameters and Statistics

A parameter is a value (usually a proportion or

average) that describes the population.

For every parameter there is a corresponding

sample statistic. The statistic is a numerical value

summarizing the sample data and describing the

sample the same way the parameter describes the

population.



Page 4: 146 04 data_collection

Example 1

Define the population, sample, parameter, and

statistic from the following study:

We want to know the proportion of new students

that were satisfied with New Student Orientation at

YVC. 100 first year students at the college were

randomly sampled, and 72 said they were satisfied.
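As a quick arithmetic check on Example 1, the statistic is the sample proportion of satisfied students. A minimal Python sketch (the variable names are illustrative and not part of the lesson):

```python
# Example 1: 72 of the 100 sampled first-year students said they were satisfied.
sample_size = 100
satisfied = 72

# The sample proportion (the statistic) estimates the unknown
# population proportion (the parameter).
p_hat = satisfied / sample_size
print(p_hat)  # 0.72
```

The parameter itself, the proportion of all new students who were satisfied, stays unknown; 0.72 is only an estimate of it.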


Page 5: 146 04 data_collection

Example 2

Define the population, sample, parameter, and

statistic from the following study:

You want to determine the average number of

glasses of milk college students drink per day.

Suppose yesterday, in your English class, you

asked five of your friends how many glasses of milk

they drank the day before. The answers were 1, 0,

1, 3, and 12 glasses of milk.
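The statistic in Example 2 is the sample mean of the five answers. The sketch below computes it and, as an illustration only, shows how much the single answer of 12 moves the mean in such a small sample (variable names are illustrative):

```python
from statistics import mean

# Answers from the five friends in Example 2.
glasses = [1, 0, 1, 3, 12]

sample_mean = mean(glasses)             # 17 / 5
mean_without_12 = mean([1, 0, 1, 3])    # 5 / 4

print(sample_mean)       # 3.4
print(mean_without_12)   # 1.25
```

With only five cases, one unusual answer nearly triples the estimate.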


Page 6: 146 04 data_collection

Research Questions

The first step in conducting research is to identify

topics or questions that are to be investigated.

A clearly laid out research question is helpful in

identifying what subjects or cases should be

studied and what variables are important.


Page 7: 146 04 data_collection

Research Questions

A research question should refer to a target

population. Often, however, it is too

expensive or difficult to collect data for every case

in a population. Instead, a sample is taken.

A sample represents a subset of the cases and is

often a small fraction (usually less than one-tenth)

of the population. Sample data are then used to

estimate the population parameter and answer the

research question.
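The idea of estimating a parameter from a sample can be sketched with a small simulation. Everything here is an assumption made for illustration: the population of 10,000 cases, the true proportion of 70%, the sample size of 500, and the fixed random seed.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 cases, exactly 70% of which are "satisfied".
population = [True] * 7000 + [False] * 3000
parameter = sum(population) / len(population)  # population proportion

# A simple random sample of 500 cases (one-twentieth of the population).
sample = rng.sample(population, 500)
statistic = sum(sample) / len(sample)  # sample proportion

print(f"parameter = {parameter:.3f}, statistic = {statistic:.3f}")
```

In practice the parameter is unknown; the simulation fixes it only so the estimate can be compared against it.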


Page 8: 146 04 data_collection

Example 3

Consider the following research question: "Over

the last 5 years, what is the average time to

degree for Duke undergraduate students?"

a) What are the target population and parameter?

b) Suppose the researcher met two students who

took more than 7 years to graduate from Duke.

Does that prove it takes longer to graduate at

Duke than at other colleges? Why or why not?


Page 9: 146 04 data_collection

Example 4

Consider the following research question: "Does a

new drug reduce the number of deaths in patients

with severe heart disease?"

a) What are the target population and parameter?

b) Suppose my friend's dad had a heart attack

and died after they gave him the new heart

disease drug. Does that prove that the drug

does not work? Why or why not?


Page 10: 146 04 data_collection

Anecdotal Evidence

Both of the conclusions of the last two examples

were based on some data. However, there were

two problems.

• First, the data only represent one or two cases.

• Second, it is unclear whether these cases are

actually representative of the population.

Data collected in this haphazard fashion are called

anecdotal evidence.


Page 11: 146 04 data_collection

Anecdotal Evidence

When anecdotal

evidence is cited, there

is no reason to expect

the individuals to be

representative of

anyone but themselves.

They can make nice

stories, but lousy



Page 12: 146 04 data_collection


If someone were permitted to pick and choose

exactly which cases were included in a sample, it

is entirely possible that the sample could be

skewed to that person's interests. This introduces

bias into a sample.

A biased sample causes problems because any

statistic computed from that sample has the

potential to be consistently erroneous.
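To see what "consistently erroneous" can look like, the sketch below compares repeated random samples with repeated samples chosen by someone who favors large values. The population, the sample size, and the rule "pick only from the upper half" are all hypothetical choices made for illustration.

```python
import random
from statistics import mean

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical population of 5,000 numerical measurements.
population = [rng.gauss(50, 10) for _ in range(5000)]
true_mean = mean(population)

# A biased chooser who only ever picks cases from the upper half.
upper_half = sorted(population)[len(population) // 2:]

def random_sample_mean(n):
    """Mean of a simple random sample from the whole population."""
    return mean(rng.sample(population, n))

def biased_sample_mean(n):
    """Mean of a sample drawn only from the cases the chooser favors."""
    return mean(rng.sample(upper_half, n))

# Repeat each sampling method 200 times and record the estimation error.
random_errors = [random_sample_mean(50) - true_mean for _ in range(200)]
biased_errors = [biased_sample_mean(50) - true_mean for _ in range(200)]

print(f"average error, random samples: {mean(random_errors):.2f}")
print(f"average error, biased samples: {mean(biased_errors):.2f}")
```

The random samples miss in both directions and their errors roughly cancel, while the biased samples miss in the same direction every time.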


Page 13: 146 04 data_collection

An Example of “Bad Data”

The 1936 presidential election

between Franklin Roosevelt

and Alf Landon is notable for

the Literary Digest poll, which

was based on over two million

returned postcards.

Its October 31 issue predicted

that Landon would easily win with

370 electoral votes and 57% of

the popular vote.


Page 14: 146 04 data_collection

1936 Election Results

Landon's electoral vote total of eight is a tie for the record low for a major-party nominee since the current U.S. two-party system began in the 1850s. The Literary Digest was completely discredited because of the poll and was soon discontinued.

Electoral votes (share of popular vote):

Candidate     Predicted Vote    Actual Vote
FDR           161 (~43%)        523 (60.8%)
Alf Landon    370 (57%)         8 (36.5%)


Page 15: 146 04 data_collection

Why did the Literary Digest fail?

The first major problem with the poll was in the

selection process for the names on the mailing list,

which were taken from telephone directories, club

membership lists, lists of magazine subscribers,


Such a list is guaranteed to be slanted toward

middle- and upper-class voters, and by default to

exclude lower-income voters.


Page 16: 146 04 data_collection

Why did the Literary Digest fail?

The second problem with the Literary Digest poll

was that out of the 10 million people whose names

were on the original mailing list, only about 2.4

million responded to the survey.

Thus, the size of the sample was about one-fourth

of what was originally intended. (In addition,

people who respond to surveys are different from

people who don't.)
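The "about one-fourth" figure follows directly from the two counts on this slide. A minimal check in Python (variable names are illustrative):

```python
# Figures quoted on the slide (both are approximate).
mailed = 10_000_000    # names on the original mailing list
returned = 2_400_000   # responses received

response_rate = returned / mailed
print(f"{response_rate:.0%}")  # 24%
```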


Page 17: 146 04 data_collection


In general, there are three common types of bias that

might occur in a sample:

• Selection bias: The method for selection makes

the sample unrepresentative of the population.

• Nonresponse bias: A sample is chosen, but a

subset cannot or will not respond.

• Response bias: Participants in a survey provide

incorrect information, intentionally or unintentionally.
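Nonresponse bias in particular can be sketched with a small simulation. All numbers below are assumptions made for illustration: a population in which 60% are satisfied, a random sample of 2,000, and response probabilities of 0.2 for satisfied people and 0.6 for dissatisfied people.

```python
import random

rng = random.Random(2)  # fixed seed so the sketch is reproducible

# Hypothetical population: 60% satisfied (True), 40% dissatisfied (False).
population = [True] * 6000 + [False] * 4000

# The sample itself is chosen at random, so selection is not the problem.
sample = rng.sample(population, 2000)

# Assumed response behavior: dissatisfied people are more likely to reply.
RESPONSE_PROB = {True: 0.2, False: 0.6}
respondents = [s for s in sample if rng.random() < RESPONSE_PROB[s]]

full_sample_prop = sum(sample) / len(sample)
respondent_prop = sum(respondents) / len(respondents)

print(f"satisfied in full sample:  {full_sample_prop:.2f}")
print(f"satisfied among responses: {respondent_prop:.2f}")
```

Even though the sample was drawn at random, the people who answer differ from the people who do not, so the proportion among respondents falls well below the proportion in the full sample.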


Page 18: 146 04 data_collection


Bias is the bane of sampling – the one thing above

all to avoid.

Conclusions based on samples drawn with biased

methods are inherently flawed. There is usually no

way to fix bias after the sample is drawn and no

way to salvage useful information from it.


Page 19: 146 04 data_collection

Example 5

Indicate whether the potential bias is a selection

bias, a nonresponse bias, or a response bias.

A survey question asked of unmarried men was

"What is the most important feature you consider

when deciding whether to date somebody?" The

results were found to depend on whether the

interviewer was male or female.


Page 20: 146 04 data_collection

Example 6

For each situation, explain why selection bias could be

introduced, and how it could affect your results.

a) A cage has 1000 rats, and you pick the first 20 you can

catch for your experiment.

b) A public opinion poll is conducted using the

telephone directory.

c) You are conducting a study of a new diabetes drug;

you advertise for participants in the newspaper and


Page 21: 146 04 data_collection

Example 7

You need to conduct a study of longevity for

people who were born in the decade following the

end of World War II in 1945. If you were to visit

graveyards and use only the birth/death dates

listed on tombstones, would you get good results?

Why or why not?

Page 22: 146 04 data_collection

Example 8

"If you had to do it over again, would you have

children?" This is the question that advice columnist

Ann Landers asked her readers back in 1976. It turns

out that nearly 70% of the 10,000 responses she

received were "No." A professional poll by Newsday

found that 91% of randomly chosen respondents

would have children again.

Explain the apparent contradiction between these two

surveys using what you have learned about sampling.


Page 23: 146 04 data_collection

Types of Variables

In many studies more than one variable is

recorded per case or individual.

It is often the purpose of a study to determine if

and/or how one variable (called the explanatory

variable) affects another (called the response

variable).



Page 24: 146 04 data_collection

Types of Variables

Response Variable: The outcome of a study. A

variable you would be interested in predicting or


Explanatory Variable: Any variable that explains

the response variable.


Page 25: 146 04 data_collection

Example 9

Pick out which variable you think should be the

explanatory variable and which variable should be the

response variable.


a) Weights of nuggets of gold (in ounces) and their

market value (in $) over the last few days are

provided, and you wish to use this to estimate the

value of a gold ring that weighs 4 ounces.


Page 26: 146 04 data_collection

Example 9 continued

b) You have data collected on the amount of time

since chlorine was added to the public swimming

pool and the concentration of chlorine still in the

pool. Chlorine was added at 8 AM, and you wish to

know what the concentration is now, at 3 PM.

c) You have data on the circumference of oak trees

(measured 12 inches from the ground) and their

age (in years). An oak tree in the park has a

circumference of 36 inches, and you wish to know

approximately how old it is.


Page 27: 146 04 data_collection

Example 10

Suppose you wanted to conduct a study to predict

a student's success. Using a student's GPA as the

response variable, what are some explanatory

variables that might be worth considering?

Determine the variable type (categorical,

numerical) of each explanatory variable. For each

numerical explanatory variable, guess whether the

association with the response will be positive,

negative, or none.