Lecture -- 5 -- Start - pitt.edusuper7/51011-52001/51431.pdf · TEST • Which of the ... Source: ....

Post on 17-Feb-2018


Lecture -- 5 -- Start

Outline

1. Science, Method & Measurement
2. On Building An Index
3. Correlation & Causality
4. Probability & Statistics
5. Samples & Surveys
6. Experimental & Quasi-experimental Designs
7. Conceptual Models
8. Quantitative Models
9. Complexity & Chaos
10. Recapitulation - Envoi


Quantitative Techniques for Social Science Research

Ismail Serageldin

Alexandria

2012

Lecture # 5: Samples And Surveys

Sample Surveys are among the most studied and written about topics in statistics

So: no Textbooks.. Just follow the presentation

Why Do Sample Surveys

Why do we do sample surveys?

We want to know something about the Population so we study a small sample of the Population

(making sure that the sample is representative)

Source: Statistics, Cliffs Quick Review, Wiley, NY, 2001

So we will discuss how to undertake sampling and how to do surveys

Let’s start with some definitions

Data, Variables, Statistics and Parameters

Variables

• A variable is an attribute that describes a person, place, thing, or idea.

• The value of the variable can "vary" from one entity to another.

• Qualitative Variables are categorical: e.g. the color of a ball is green, red or blue.

• Quantitative Variables are numeric: e.g. the population of a city.

Source: http://stattrek.com/statistics/data-collection-methods.aspx?Tutorial=AP

Quantitative Variables: Continuous and Discrete

• Continuous variables can take any value between the minimum and the maximum of their range: e.g. the weight of the persons in a class.

• Discrete variables must have an integer value: e.g. tossing a coin n times, how many times do we get heads? It can never be 2.7 times; it will have to be 0, 1, 2, 3, …, n.

Source: http://stattrek.com/statistics/data-collection-methods.aspx?Tutorial=AP

TEST

• Which of the following statements are true?
– I. All variables can be classified as quantitative or categorical variables.
– II. Categorical variables can be continuous variables.
– III. Quantitative variables can be discrete variables.

• Answer: I and III are correct

Source: http://stattrek.com/statistics/data-collection-methods.aspx?Tutorial=AP


Two Snapshots, Two “states”:

Discrete variables imply sudden moves from state to state

Continuous variables imply constantly changing transitions between two snapshots

Transitions can be cut up in discrete states

But many transitions are really continuous

Example: Students leaving school and entering the Labor Market

Later we will discuss how this fits in Markov chains and the manpower model

But let’s go back to the issues of Data Collection

Methods Of Data Collection

• There are four main methods of data collection.

• Census. A census is a study that obtains data from every member of a population. In most studies, a census is not practical, because of the cost and/or time required.

• Sample survey. A sample survey is a study that obtains data from a subset of a population, in order to estimate population attributes.

Source: http://stattrek.com/statistics/data-collection-methods.aspx?Tutorial=AP

Methods of Data Collection (Cont’d)

• Experiment. An experiment is a controlled study in which the researcher attempts to understand cause-and-effect relationships.

• Observational study. The researcher is not able to control (1) how subjects are assigned to groups and/or (2) which treatments each group receives.

• (Case Studies are observations of one case.)

• Note: Observational Studies do NOT allow you to draw cause-and-effect conclusions, and their findings generalize only when the subjects were randomly selected.

Source: http://stattrek.com/statistics/data-collection-methods.aspx?Tutorial=AP

Why do Sample Surveys?

• The reason for conducting a sample survey is to estimate the value of some attribute of a population.

• It is much cheaper and easier than doing a whole census.

• When done scientifically, we can define the error term accurately (e.g. ±3%).

Source: http://stattrek.com/statistics/data-collection-methods.aspx?Tutorial=AP

Pros and Cons

• Resources. A well-designed sample survey can provide very precise estimates of population parameters - quicker, cheaper, and with less manpower than a census.

• Generalizability. Applying findings from a study to a larger population. Generalizability requires random selection.

Source: http://stattrek.com/statistics/data-collection-methods.aspx?Tutorial=AP

Pros and Cons (continued)

• Causal inference. Cause-and-effect relationships can be teased out when subjects are randomly assigned to groups.

• Therefore, experiments, which allow the researcher to control assignment of subjects to treatment groups, are the best method for investigating causal relationships.

Source: http://stattrek.com/statistics/data-collection-methods.aspx?Tutorial=AP

We will have a lot more to say on Experimental Designs later.

We must distinguish between the sample statistic and the population parameter

From Population To Sample To Population:

(From Sample Statistic To Population Parameter)

Source: Statistics, Cliffs Quick Review, Wiley, NY, 2001

Population Parameter vs. Sample Statistic

• Population parameter. A population parameter is the true value of a population attribute.

• Sample statistic. A sample statistic is an estimate, based on sample data, of a population parameter.

• The estimate comes with the error term (e.g. ±3%)

Source: http://stattrek.com/statistics/data-collection-methods.aspx?Tutorial=AP

Example Of Population Parameter vs. Sample Statistic

• Example. We want to know the percentage of voters that favor a new tax.
– The actual percentage of all the voters is a population parameter.
– The estimate of that percentage, based on sample data, is a sample statistic.

• The quality of a sample statistic (i.e., accuracy, precision, representativeness) is strongly affected by the way that sample observations are chosen; that is, by the sampling method.

Source: http://stattrek.com/statistics/data-collection-methods.aspx?Tutorial=AP

Bad Surveys make for bad estimates

Estimates of the front runners in the Egyptian Presidential Election 2012

• Before the first Round:

1. Abdel Moneim Aboulfotouh
2. Amr Moussa
3. Mohamed Morsi
4. Hamdein Sabahi
5. Ahmed Shafik

• After the first Round:

1. Mohamed Morsi
2. Ahmed Shafik
3. Hamdein Sabahi
4. Abdel Moneim Aboulfotouh
5. Amr Moussa

The US 1948 Presidential Election: Truman vs. Dewey

Bad (Inaccurate) Polls

What does it mean to say: “the poll says 52% (±3%) at 95% confidence level?”

• The 52% is the finding from the sample survey.

• The error term (±3%) is related to the sampling error: it means that we think the real value is between 49% and 55%.

• The 95% confidence level means that, if the survey were repeated many times, about 95 out of 100 of the resulting ranges would contain the real figure in the population.

• The error term will vary according to the size of the sample.

What is sampling error? (The margin of error, or the ± 3%)

• Sampling Error is the calculated statistical imprecision due to interviewing a random sample instead of the entire population.

• The margin of error provides an estimate of how much the results of the sample may differ, due to chance, from what would have been found if the entire population had been interviewed.

• The confidence level (95% or 95 out of 100) says how confident we are that the real value lies within that ± error term.
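To see where a figure like ±3% can come from, here is a minimal sketch using the usual textbook approximation for a proportion from a simple random sample. The sample size of 1,000 and the multiplier 1.96 (for roughly 95% confidence) are assumptions chosen for illustration; they are not stated on the slides.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate margin of error for a sample proportion p
    from a simple random sample of size n (z = 1.96 for ~95%)."""
    return z * math.sqrt(p * (1 - p) / n)

p_hat = 0.52   # the poll finding from the slide
n = 1000       # assumed sample size, for illustration only
moe = margin_of_error(p_hat, n)
low, high = p_hat - moe, p_hat + moe

print(f"{p_hat:.0%} ± {moe:.1%}, i.e. from {low:.1%} to {high:.1%}")
```

With these assumed inputs the margin comes out at roughly three percentage points, which matches the order of magnitude quoted in the poll example.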

Sampling error

• Sampling error is related to sample size, but it is not the only kind of error possible in a sample survey.

• You can look it up in sampling error tables such as the one shown here.

• This table is produced by Gallup for a sample from a target population of 200 million, with a confidence level of 95%.

Recommended allowance for sampling error of a percentage
In percentage points (at 95 in 100 confidence level)

                        SAMPLE SIZE
                        1,000    750    500    250    100
Percentage near 10        2       2      3      4      6
Percentage near 20        3       3      4      5      9
Percentage near 30        3       4      4      6     10
Percentage near 40        3       4      5      7     10
Percentage near 50        3       4      5      7     11
Percentage near 60        3       4      5      7     10
Percentage near 70        3       4      4      6     10
Percentage near 80        3       3      4      5      9
Percentage near 90        2       2      3      4      6

Table extracted from 'The Gallup Poll Monthly'. Cited at http://www.ropercenter.uconn.edu/education/polling_fundamentals_error.html

An Important Observation: Statistical Error and sample size

• As the sample size increases, there are diminishing returns in percentage error.

• At percentages near 50%, the statistical error drops from 7 to 5% as the sample size is increased from 250 to 500.

• But, if the sample size is increased from 750 to 1,000, the statistical error drops only from 4 to 3%.

• As the sample size rises above 1,000, the diminishing returns are even more noticeable.
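The diminishing returns follow from the square root in the textbook formula: the error shrinks with the square root of the sample size, so the sample must be quadrupled to halve the error. The sketch below assumes a simple random sample and a proportion near 50%; it will not reproduce the Gallup table exactly, since published tables are rounded and may allow for the survey design.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Textbook margin of error for a proportion p, sample size n."""
    return z * math.sqrt(p * (1 - p) / n)

# Illustrative sample sizes, including one well above 1,000
for n in (100, 250, 500, 750, 1000, 4000):
    print(f"n = {n:>5}: about ±{100 * margin_of_error(n):.1f} points")
```

Going from 250 to 1,000 respondents halves the margin; halving it again would take 4,000.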

Among others, Langer Research Associates offers a margin-of-error calculator, the MoE Machine, as a convenient tool for data producers and everyday data users. Access the MoE Machine at http://langerresearch.com/moe.php.

So, let’s learn more about surveys and sampling…

Types of Samples

What is a Survey?

• A survey may refer to many different types or techniques of observation, but it most often involves a questionnaire used to measure the characteristics and/or attitudes of people.

• Since we cannot cover the whole population, we select a sample.

• The different ways of contacting members of a sample once they have been selected are the subject of survey data collection.

What is Survey Sampling?

• In statistics, survey sampling describes the process of selecting a sample of elements from a target population in order to conduct a survey.

• The purpose of sampling is to reduce the cost and/or the amount of work that it would take to survey the entire target population.

• A survey that measures the entire target population is called a census.

Sampling

Two Kinds of Survey Samples

Non-Probability samples and Probability samples

Sampling Methods

• Non-probability samples. We do not know the probability that each population element will be chosen, and/or we cannot be sure that each population element has a non-zero chance of being chosen.

• Probability samples. Each population element has a known (non-zero) chance of being chosen for the sample.

Source: http://stattrek.com/statistics/data-collection-methods.aspx?Tutorial=AP

Non-Probability Sampling

Pros & cons of Non-Probability Sampling

• Advantages: convenience and cost.

• Disadvantage: We cannot estimate the extent to which sample statistics are likely to differ from population parameters.

• Only probability sampling methods permit that kind of analysis.

Source: http://stattrek.com/statistics/data-collection-methods.aspx?Tutorial=AP

Two of the main types of non-probability sampling methods

• Voluntary sample. People who self-select into the survey. Often, these folks have a strong interest in the main topic of the survey. E.g. those who call in to a talk show, or participate in an on-line poll. This would be a volunteer sample.

• Convenience sample. A convenience sample is made up of people who are easy to reach. E.g. interviewing my students or my employees or shoppers at a local mall. If the group or the location was chosen because it was convenient, this would be a convenience sample.

• Note: Neither allows generalization to the population.

Source: http://stattrek.com/statistics/data-collection-methods.aspx?Tutorial=AP

Non-probability Sample Surveys

• Surveys that are not based on probability sampling have no way of measuring their bias or sampling error.

• Surveys based on non-probability samples are not externally valid. You cannot generalize from them to the general population. They can only be said to be representative of the people that have actually completed the survey.

Non-Probability Samples

• The relationship between the target population and the survey sample is immeasurable and potential bias is unknowable.

• Sophisticated users of non-probability survey samples tend to view the survey as an experimental condition, rather than a tool for population measurement.

• Analysts examine the results for internally consistent relationships.

Examples Of Non-Probability Samples

• Judgment Samples: A researcher decides which population members to include in the sample based on his or her judgment. The researcher may provide some alternative justification for the representativeness of the sample.

• Snowball Samples: Often used when a target population is rare, members of the target population recruit other members of the population for the survey.

Examples Of Non-Probability Samples (Cont’d)

• Quota Samples: The sample is designed to include a designated number of people with certain specified characteristics. For example, 100 coffee drinkers. This type of sampling is common in non-probability market research surveys.

• Convenience Samples: The sample is composed of whatever persons can be most easily accessed to fill out the survey.

Probability Sampling

Probability samples are the only ones whose results will be generalizable to the entire population

Random Samples

Ronald Fisher (1890-1962)

Extract from table of random numbers

Main types of probability sampling

• Simple random sampling,
• Stratified sampling,
• Cluster sampling,
• Multistage sampling, and
• Systematic random sampling.

Source: http://stattrek.com/statistics/data-collection-methods.aspx?Tutorial=AP

Probability Samples are representative

• The key benefit of all these probability sampling methods is that, because every element has a known chance of selection, the sample can be treated as representative of the population within a measurable sampling error. This is what makes the statistical conclusions valid.

Source: http://stattrek.com/statistics/data-collection-methods.aspx?Tutorial=AP

Hence the conclusions are generalizable

Simple Random sampling

• The population consists of N objects.

• The sample consists of n objects.

• If all possible samples of n objects are equally likely to occur, the sampling method is called simple random sampling.

• Selection is done by a lottery method or using a table of random numbers or a computerized random number generator.

Source: http://stattrek.com/statistics/data-collection-methods.aspx?Tutorial=AP
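As an illustration of the "computerized random number generator" route, here is a minimal sketch using Python's standard `random` module. The population of 100 numbered members, the sample size of 10, and the seed are all hypothetical.

```python
import random

# Hypothetical population: N = 100 members identified by number
population = list(range(1, 101))

rng = random.Random(42)                 # fixed seed so the draw can be repeated
sample = rng.sample(population, k=10)   # n = 10, drawn without replacement

print(sorted(sample))
```

`sample` draws without replacement, and every subset of 10 members is equally likely to be chosen, which is the defining property of simple random sampling.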

Stratified Sampling

• Stratified sampling. The population is divided into groups, based on some characteristic.

• The groups are called strata.

• Then, within each group, a probability sample (often a simple random sample) is selected.

• As an example, suppose we conduct a national survey. We might divide the population into groups or strata, based on geography - north, east, south, and west. Then, within each stratum, we might randomly select survey respondents.

Source: http://stattrek.com/statistics/data-collection-methods.aspx?Tutorial=AP
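The geographic example can be sketched as follows. The frame of 400 people, the equal stratum sizes, and the choice of 5 respondents per stratum are assumptions made only for the illustration.

```python
import random

rng = random.Random(0)
regions = ["north", "east", "south", "west"]

# Hypothetical sample frame: 400 people, 100 in each region
frame = [(f"person_{i}", regions[i % 4]) for i in range(400)]

def stratified_sample(frame, per_stratum, rng):
    """Group the frame by stratum, then draw a simple random
    sample of `per_stratum` people inside each stratum."""
    strata = {}
    for person, region in frame:
        strata.setdefault(region, []).append(person)
    return {region: rng.sample(members, per_stratum)
            for region, members in strata.items()}

sample = stratified_sample(frame, per_stratum=5, rng=rng)
for region, people in sample.items():
    print(region, people)
```

Every region is certain to appear in the sample, which a single simple random draw from the whole frame would not ensure.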

Cluster sampling

• Cluster sampling. With cluster sampling, every member of the population is assigned to one, and only one, group. Each group is called a cluster.

• A sample of clusters is chosen, using a probability method (often simple random sampling).

• Only individuals within sampled clusters are surveyed.

• E.g. select a sample of BA units, then survey all the staff in those units.

Source: http://stattrek.com/statistics/data-collection-methods.aspx?Tutorial=AP
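A minimal sketch of the same idea, with a made-up organisation of 8 units of 6 staff each (names and sizes are hypothetical): the random draw is made over units, and everyone in a chosen unit is surveyed.

```python
import random

rng = random.Random(1)

# Hypothetical organisation: 8 units (clusters), 6 staff in each
clusters = {f"unit_{u}": [f"staff_{u}_{s}" for s in range(6)]
            for u in range(8)}

# Draw a simple random sample of clusters, not of individuals
chosen_units = rng.sample(sorted(clusters), k=3)

# Survey every member of each chosen cluster
respondents = [person for unit in chosen_units for person in clusters[unit]]

print(chosen_units)
print(len(respondents), "respondents")
```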

Multistage sampling.

• Multistage sampling. With multistage sampling, we select a sample by using combinations of different sampling methods.

• For example, in Stage 1, we might use cluster sampling to choose clusters from a population. Then, in Stage 2, we might use simple random sampling to select a subset of elements from each chosen cluster for the final sample.

Source: http://stattrek.com/statistics/data-collection-methods.aspx?Tutorial=AP
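The two stages described above can be sketched like this. The villages, households, and the numbers drawn at each stage are invented for the example.

```python
import random

rng = random.Random(2)

# Hypothetical population: 10 villages (clusters) of 20 households each
clusters = {f"village_{v}": [f"household_{v}_{h}" for h in range(20)]
            for v in range(10)}

# Stage 1: cluster sampling, choose 4 villages at random
stage1 = rng.sample(sorted(clusters), k=4)

# Stage 2: simple random sampling of 5 households inside each chosen village
stage2 = {village: rng.sample(clusters[village], k=5) for village in stage1}

final_sample = [h for households in stage2.values() for h in households]
print(len(final_sample), "households in the final sample")
```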

Systematic random sampling.

• Systematic random sampling. With systematic random sampling, we create a list of every member of the population. From the list, we randomly select the first sample element from the first k elements on the population list. Thereafter, we select every kth element on the list.

• This method is different from simple random sampling since not every possible sample of n elements is equally likely.

Source: http://stattrek.com/statistics/data-collection-methods.aspx?Tutorial=AP
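A short sketch of the procedure, assuming a hypothetical list of 100 members and k = 10:

```python
import random

def systematic_sample(frame, k, rng):
    """Pick a random start among the first k elements,
    then take every kth element after it."""
    start = rng.randrange(k)   # 0-based position within the first k elements
    return frame[start::k]

rng = random.Random(3)
frame = list(range(1, 101))    # hypothetical population list of 100 members
sample = systematic_sample(frame, k=10, rng=rng)
print(sample)
```

Only 10 different samples can ever be produced here (one per starting position), which is why this is not simple random sampling.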

How To Select A Probability Sample


Probability Sampling

• A probability-based survey sample is created from three components: a list of the target population, called the sample frame; a randomized process for selecting units from the sample frame, called a selection procedure; and a method of contacting selected units and enabling them to complete the survey, called a data collection method or mode.

Probability Sampling: Step 1

• Construct a Sample frame: A probability-based survey sample starts from a list of the target population, called the sample frame.

• For some target populations this process may be easy, for example, sampling the employees of a company by using the payroll list.

• However, in large, disorganized populations simply constructing a suitable sample frame is often a complex and expensive task.

Probability Sampling: Step 2

• Selecting a sample from within the Sample frame: use a randomized process for selecting units from the sample frame, called a selection procedure.

• Common methods of conducting a probability sample of the household population in the United States are Area Probability Sampling, Random Digit Dial telephone sampling, and more recently Address-Based Sampling.

Specialized Techniques Of Probability Sampling

• Within probability sampling there are specialized techniques such as:
– stratified sampling &
– cluster sampling

• These techniques improve the precision or efficiency of the sampling process without altering the fundamental principles of probability sampling.

Probability Sampling: Step 3

• Collecting the Data: There must be a method of contacting selected units and enabling them to complete the survey, called a data collection method or mode.

Sources Of Bias

Major Types of Bias In Surveys

• Non-response bias

• Coverage bias

• Selection bias

• Non-response bias: When individuals or households selected in the survey sample cannot or will not complete the survey there is the potential for bias to result from this non-response. Non-response bias occurs when the observed value deviates from the population parameter due to differences between respondents and non-respondents.

• Coverage bias: Coverage bias can occur when population members do not appear in the sample frame (undercoverage). Coverage bias occurs when the observed value deviates from the population parameter due to differences between covered and non-covered units. Telephone surveys suffer from a well known source of coverage bias because they cannot include households without telephones.


• Selection Bias: Selection bias occurs when some units have a differing probability of selection that is unaccounted for by the researcher. For example, some households have multiple phone numbers making them more likely to be selected in a telephone survey than households with only one phone number. This selection bias would be corrected by applying a survey weight equal to [1/(# of phone numbers)] to each household.
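The weighting rule can be sketched as follows. The six households and their answers are invented purely to show the mechanics of a weight equal to 1/(# of phone numbers).

```python
# Hypothetical responses: (phone numbers in the household, answered "yes"?)
responses = [(1, True), (2, False), (1, True), (3, False), (2, True), (1, False)]

# Survey weight = 1 / (# of phone numbers), as described above
weights = [1 / phones for phones, _ in responses]

weighted_yes = sum(w for w, (_, yes) in zip(weights, responses) if yes)
weighted_share = weighted_yes / sum(weights)
unweighted_share = sum(yes for _, yes in responses) / len(responses)

print(f"unweighted share of yes: {unweighted_share:.3f}")
print(f"weighted share of yes:   {weighted_share:.3f}")
```

Households with several numbers are over-represented in the raw data, so their answers count for less in the weighted share.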

But how you select your sample is only one of the issues in doing survey research

Bias Due to Measurement Error

• In survey research, the measurement process includes the environment in which the survey is conducted, the way that questions are asked, and the state of the survey respondent.

• Response bias refers to the bias that results from problems in the measurement process. Some examples of response bias:

Source: http://stattrek.com/statistics/data-collection-methods.aspx?Tutorial=AP

Examples of Response Bias (Due to error in the Measurement process)

• Leading questions. The wording of the question may be loaded in some way to unduly favor one response over another. For example, a satisfaction survey may ask the respondent to indicate whether she is satisfied, dissatisfied, or very dissatisfied.

• By giving the respondent one response option to express satisfaction and two response options to express dissatisfaction, this survey question is biased toward getting a dissatisfied response.

Source: http://stattrek.com/statistics/data-collection-methods.aspx?Tutorial=AP

Examples of Response Bias – Cont’d (Due to error in the Measurement process)

• Social desirability. Most people like to present themselves in a favorable light, so they will be reluctant to admit to unsavory attitudes or illegal activities in a survey, particularly if survey results are not confidential. Instead, their responses may be biased toward what they believe is socially desirable.

Source: http://stattrek.com/statistics/data-collection-methods.aspx?Tutorial=AP

Sampling Statistic and Sampling Error

• A survey produces a sample statistic, which is used to estimate a population parameter. If you repeated a survey many times, using different samples each time, you might get a different sample statistic with each replication. And each of the different sample statistics would be an estimate for the same population parameter.

• If the statistic is unbiased, the average of all the statistics from all possible samples will equal the true population parameter, even though any individual statistic may differ from the population parameter. The variability among statistics from different samples is called sampling error.

Source: http://stattrek.com/statistics/data-collection-methods.aspx?Tutorial=AP
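Repeating a survey many times is easy to imitate on a computer. The sketch below assumes a made-up population of 10,000 voters in which exactly 52% favor the tax, and draws 2,000 simple random samples of 500 voters each; all of these numbers are illustrative.

```python
import random
import statistics

rng = random.Random(4)

# Hypothetical population: 10,000 voters, exactly 52% in favor (coded 1)
population = [1] * 5200 + [0] * 4800
true_share = sum(population) / len(population)

# Repeat the "survey" many times, each time with a fresh random sample
estimates = []
for _ in range(2000):
    sample = rng.sample(population, 500)
    estimates.append(sum(sample) / 500)

print(f"population parameter:        {true_share:.3f}")
print(f"average of sample statistics: {statistics.mean(estimates):.3f}")
print(f"spread of sample statistics:  {statistics.stdev(estimates):.3f}")
```

Individual estimates scatter around the true value, but their average sits very close to it; the spread is the sampling error in action.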

Increasing The Sample size: Reduces Sampling Error but NOT Survey Bias

• Increasing the sample size tends to reduce the sampling error; that is, it makes the sample statistic less variable. However, increasing sample size does not affect survey bias.

• A large sample size cannot correct for the methodological problems (undercoverage, nonresponse bias, etc.) that produce survey bias.

• Example: The Literary Digest Survey sample size was very large - over 2 million surveys were completed; but the large sample size could not overcome problems with the sample - undercoverage and nonresponse bias.

Source: http://stattrek.com/statistics/data-collection-methods.aspx?Tutorial=AP
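A small simulation makes the point concrete. It assumes an invented population in which the sample frame covers only 6,000 of 10,000 people, and the covered people answer "yes" more often (60%) than those left out (30%); none of these figures come from the Literary Digest case.

```python
import random

rng = random.Random(5)

# Hypothetical population with undercoverage (1 = yes, 0 = no)
covered = [1] * 3600 + [0] * 2400       # in the frame: 60% say yes
not_covered = [1] * 1200 + [0] * 2800   # missing from the frame: 30% say yes
true_share = (sum(covered) + sum(not_covered)) / 10_000

def survey(n):
    """Draw a random sample of size n, but only from the covered part."""
    sample = rng.sample(covered, n)
    return sum(sample) / n

small, large = survey(100), survey(5000)
print(f"true share:            {true_share:.2f}")
print(f"estimate with n=100:   {small:.2f}")
print(f"estimate with n=5000:  {large:.2f}")
```

The large sample is much steadier, but it settles near 0.60 rather than the true 0.48: more respondents reduce the scatter, not the bias.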

The Null Hypothesis & Types of Error

To analyze survey data and arrive at a conclusion, we need to formulate a Null Hypothesis

Null Hypothesis

• It is usually a statement that can be falsified and whose acceptance or rejection yields a useful insight into the problem being studied and for which the data was collected.

• The null hypothesis is a hypothesis which the researcher tries to disprove, reject or nullify.

• It is symbolized by H0

Ronald Fisher (1890-1962)

The first to formalize the notion of the “Null Hypothesis”

How do you state your basic (null) Hypothesis?

Usually: the normal state (don’t worry, no effect, no change)

Or: there is no difference between expected and observed (i.e. difference is due to chance only)


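One-tailed or Two-tailed Tests

[Figure: acceptance and rejection regions for H0. One-tailed: an "Accept H0" region with a single "Reject H0" tail. Two-tailed: an "Accept H0" region in the middle with a "Reject H0" region in each tail.]

Usually:

No directionality: use a two-tailed test

Directionality: use a one-tailed test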

The Null Hypothesis identifies which kind of test is needed: one-tailed or two-tailed

• In classical science, H0 is most typically the statement that there is no effect of a particular treatment; in observations, it is typically that there is no difference between the value of a particular measured variable and that of a prediction, or between two means. In these cases we use a two-tailed test.

• But when there is Directionality, i.e. when we say that it is better than, bigger than or less than, we use a One-Tailed Test.
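The practical difference between the two kinds of test shows up in the p-value. The sketch below assumes a test statistic that follows the standard normal distribution (a z-test) and a hypothetical observed value of z = 1.8; it uses `NormalDist` from Python's standard `statistics` module.

```python
from statistics import NormalDist

def p_values(z):
    """Return (one-tailed, two-tailed) p-values for a z statistic.
    The one-tailed value assumes the alternative is 'greater than'."""
    one_tailed = 1 - NormalDist().cdf(z)
    two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))
    return one_tailed, two_tailed

one, two = p_values(1.8)   # hypothetical observed z statistic
print(f"one-tailed p = {one:.4f}")
print(f"two-tailed p = {two:.4f}")
```

At the 0.05 level this hypothetical result would lead to rejecting H0 in the one-tailed test but not in the two-tailed one, which is why the choice has to follow from how the hypothesis is stated, before looking at the data.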

BUT: In accepting or rejecting the Null Hypothesis we could be making two different types of error

Type I error: (False Positive)

• Test says: This person has cancer
• Reality: This person is healthy

• Test says: This person is guilty
• Reality: This person is not guilty

• Test says: This product is faulty
• Reality: This product is good

Type II error: (False Negative)

• Test says: This person is healthy
• Reality: This person has cancer

• Test says: This person is not guilty
• Reality: This person is guilty

• Test says: This product is good
• Reality: This product is faulty

Type I & Type II Error

Source: Statistics, Cliffs Quick Review, Wiley, NY, 2001

Two other kinds of error:

• In 1948, Frederick Mosteller (1916-2006) proposed a "Type III error": "correctly rejecting the null hypothesis for the wrong reason". (1948, p.61)

Two other kinds of error:

• In 1970, Marascuilo and Levin proposed a "fourth kind of error", a "Type IV error", defined as being the mistake of "the incorrect interpretation of a correctly rejected hypothesis";

• which, they suggested, was the equivalent of "a physician's correct diagnosis of an ailment followed by the prescription of a wrong medicine" (1970, p.398).

Other risks of error:

This is in addition to many other risks:
• Correctly specifying the problem
• Sampling design
• Experimental or quasi-experimental designs
• Correctly understanding the kind of data and its limitations
• Correctly specifying the type of statistical analysis
• Correctly interpreting the results

Calculation & Conclusions

Conclusion of the statistical analysis is to accept/reject the Null Hypothesis


Type I & Type II Errors

Source: http://stattrek.com/statistics/data-collection-methods.aspx?Tutorial=AP

More samples means more accurate estimation of the population parameter

Source: Statistics, Cliffs Quick Review, Wiley, NY, 2001

How to refer to significance level of a test (all these statements are equivalent)

Source: Statistics, Cliffs Quick Review, Wiley, NY, 2001

You should be familiar with these expressions

Tips to Help Avoid Common Mistakes

• Remember to convert between variance and standard deviation.

• Check if hypothesis is one- or two-tailed. For two-tailed, split $\alpha$ into $\alpha/2$.

• Always use n - 1 degrees of freedom for one sample t-test.

• Keep statistics ($\bar{x}$, $s$) distinct from population parameters ($\mu$, $\sigma$).

Choosing the significance level for a test

• Remember: the smaller the significance level (say 0.01 rather than 0.05), the more stringent the test.

• Choose the level based on:
– Sample size
– Estimated size of the effect being tested
– Consequences of making a mistake

• Common Significance levels:
– .05 (1 chance in 20);
– .01 (1 chance in a hundred) or
– .001 (1 chance in a thousand)

Source: Statistics, Cliffs Quick Review, Wiley, NY, 2001


Common Mistakes

Source: Statistics, Cliffs Quick Review, Wiley, NY, 2001

Let’s take a few simple examples of a calculation

Remember: the normal (Gaussian) distribution, the Bell Curve…

It has a mean, and a standard deviation.

The standard deviation defines how “spread out” the distribution is:

Remember: The sample statistic (measured) is only an estimate for the Population parameter (inferred)

Source: Statistics, Cliffs Quick Review, Wiley, NY, 2001

Common Statistical Notation

Source: Statistics, Cliffs Quick Review, Wiley, NY, 2001

Numerical Measures (Formulae)

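Mean: $\bar{x} = \dfrac{\sum x}{n} = \dfrac{x_1 + x_2 + \cdots + x_n}{n}$

Variance: $s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1} = \dfrac{n \sum x^2 - \left(\sum x\right)^2}{n(n - 1)}$

Standard Error of the Mean: $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$

Median: the middle value of ordered values

Nth percentile: the value such that N% of ordered values lie below it

Source: Statistics, Cliffs Quick Review, Wiley, NY, 2001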

Assume that we have the mean of a distribution. We need to find the standard deviation (or its square: the variance)

The Variance is the square of the Standard Deviation

Calculating the Variance and the standard deviation

• The formula for calculating the variance:

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

• The standard deviation is given by:

$$s = \sqrt{s^2}$$

Example: calculating Variance and Standard Deviation

For example, using these six measures 3, 9, 1, 2, 5 and 4:

$$\sum x = 3 + 9 + 1 + 2 + 5 + 4 = 24$$

$$\sum x^2 = 3^2 + 9^2 + 1^2 + 2^2 + 5^2 + 4^2 = 9 + 81 + 1 + 4 + 25 + 16 = 136$$

The quantities are then substituted into the shortcut formula to find $\sum (x - \bar{x})^2$:

$$\sum (x - \bar{x})^2 = \sum x^2 - \frac{\left(\sum x\right)^2}{n} = 136 - \frac{24^2}{6} = 136 - \frac{576}{6} = 136 - 96 = 40$$

The variance and standard deviation are now found as before:

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{40}{5} = 8$$

$$s = \sqrt{s^2} = \sqrt{8} = 2.828$$

Source: Statistics, Cliffs Quick Review, Wiley, NY, 2001
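The same calculation can be checked in a few lines of Python. The variable names are just for this sketch; the last line compares the result with `statistics.variance` from the standard library, which also divides by n - 1.

```python
import statistics

data = [3, 9, 1, 2, 5, 4]
n = len(data)

sum_x = sum(data)                      # sum of the values
sum_x2 = sum(x * x for x in data)      # sum of the squared values

# Shortcut formula for the sum of squared deviations from the mean
sum_sq_dev = sum_x2 - sum_x ** 2 / n

variance = sum_sq_dev / (n - 1)        # sample variance
std_dev = variance ** 0.5              # sample standard deviation

print(sum_x, sum_x2, sum_sq_dev, variance, round(std_dev, 3))
print(statistics.variance(data))       # library check of the variance
```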

We will say more about the standard deviation and the variance in a moment

Understanding What Is Behind A Formula

Clear thinking about statistics: understanding what is behind the formula

• I want you to understand the logic behind a formula. You do not need to memorize any formula. You do that by asking questions…

• For example, let’s look at the formula for computing the sample variance:

$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

• Let’s ask: why this? And why that?

Why do we square the deviations from the mean?


• Because, if we add up all the deviations, we always get zero.

• So, to deal with this problem, we square the deviations.

• Bonus: Notice that squaring also magnifies the deviations; therefore it helps us get a better feel for the spread of the data.
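A quick check of the first point, reusing the six measures from the worked example (3, 9, 1, 2, 5 and 4, whose mean is 4):

```python
data = [3, 9, 1, 2, 5, 4]
mean = sum(data) / len(data)

deviations = [x - mean for x in data]          # positive and negative values
squared = [d * d for d in deviations]          # all non-negative

print("deviations:", deviations, "sum =", sum(deviations))
print("squared:   ", squared, "sum =", sum(squared))
```

The positive and negative deviations cancel out exactly, while the squared deviations add up to a number that grows with the spread of the data.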


Why not raise to the power of four (three will not work)?


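• A fourth power would also remove the signs (a third power would not, because odd powers keep the sign and the terms could still cancel). But squaring does the trick; why should we make life more complicated than it is?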
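A quick way to see why the third power "will not work" is to try it. In this Python sketch the data are a made-up symmetric set (an assumption for illustration); the cubed deviations cancel just like the raw ones, while the second and fourth powers do not.

```python
# Hypothetical data, symmetric around its mean of 3.
data = [1, 2, 3, 4, 5]

mean = sum(data) / len(data)
deviations = [x - mean for x in data]     # -2, -1, 0, 1, 2

sum_squares = sum(d ** 2 for d in deviations)   # signs removed
sum_cubes = sum(d ** 3 for d in deviations)     # signs kept, terms cancel
sum_fourth = sum(d ** 4 for d in deviations)    # signs removed, but bigger numbers

print("power 2:", sum_squares)
print("power 3:", sum_cubes)
print("power 4:", sum_fourth)
```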

Why is there a summation notation in the formula?

• To add up the squared deviations of all the data points, giving the total sum of squared deviations.

Why do we divide the sum of squares by n - 1?

• The measure of spread should not grow just because the sample is larger; so we must bring in the sample size.

• Why? Because, in general, larger samples have a larger sum of squared deviations from the mean.
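The following Python sketch illustrates the point with made-up numbers (an assumption, not lecture data): repeating the same values four times leaves the spread of the data unchanged, yet the sum of squares becomes four times larger. Dividing by n - 1 brings the two results back onto a comparable scale.

```python
def sum_of_squares(data):
    """Total of the squared deviations from the mean."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data)


# Hypothetical data; `large` is the same five values repeated four times.
small = [2, 4, 4, 6, 9]
large = small * 4

ss_small = sum_of_squares(small)
ss_large = sum_of_squares(large)

var_small = ss_small / (len(small) - 1)
var_large = ss_large / (len(large) - 1)

print("sum of squares:", ss_small, "vs", ss_large)
print("divided by n - 1:", round(var_small, 2), "vs", round(var_large, 2))
```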

Why divide by n - 1 and not n?

• When you divide by n - 1, the sample variance is, on average, a closer estimate of the population variance than when you divide by n (dividing by n tends to underestimate it).

• But for larger samples (say over 30), it matters little whether you divide by n or n - 1. The results are almost the same, and either is acceptable.
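One way to convince yourself is a small simulation. The Python sketch below is an illustration under stated assumptions: the population is taken to be normal with standard deviation 2 (so its variance is 4), samples have size 5, and the seed, number of trials and helper name `mean_squared_deviation` are arbitrary choices, none of which come from the lecture.

```python
import random


def mean_squared_deviation(sample, divisor):
    """Sum of squared deviations from the sample mean, divided by `divisor`."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / divisor


rng = random.Random(42)     # fixed seed so the run is repeatable
TRUE_VARIANCE = 4.0         # assumed population: normal, standard deviation 2
n = 5                       # small sample size, where the choice matters most
trials = 20000

total_n = 0.0
total_n_minus_1 = 0.0
for _ in range(trials):
    sample = [rng.gauss(0.0, 2.0) for _ in range(n)]
    total_n += mean_squared_deviation(sample, n)
    total_n_minus_1 += mean_squared_deviation(sample, n - 1)

avg_n = total_n / trials
avg_n_minus_1 = total_n_minus_1 / trials

print("true variance:           ", TRUE_VARIANCE)
print("average, dividing by n:  ", round(avg_n, 3))
print("average, dividing by n-1:", round(avg_n_minus_1, 3))

# The two divisors differ by the factor (n - 1) / n, which approaches 1.
for size in (5, 30, 100):
    print(size, round((size - 1) / size, 3))
```

With these assumptions the average obtained with n falls noticeably below the true variance, while the average obtained with n - 1 lands close to it. The last loop shows why the difference fades for larger samples.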

Does n - 1 Have a Meaning?

• The factor n - 1 is what we call the "degrees of freedom" (but that is another discussion).

• Degrees of freedom is the number of values in the final calculation of a statistic that are free to vary.

Explaining the number of values that are allowed to vary

• For example, if we have two observations, when calculating the mean we have two independent observations;

• however, when calculating the variance, we have only one independent piece of information, since the two observations are equally distant from the mean: once one deviation is known, the other is fixed.
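The two-observation case can be written out in full. This is a short worked derivation, using $x_1$ and $x_2$ as names for the two observations (notation assumed here for illustration):

$$\bar{x} = \frac{x_1 + x_2}{2}, \qquad x_1 - \bar{x} = \frac{x_1 - x_2}{2}, \qquad x_2 - \bar{x} = -\frac{x_1 - x_2}{2}$$

$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2}{2 - 1} = \frac{(x_1 - x_2)^2}{2}$$

The two deviations are equal in size and opposite in sign, so the variance depends on a single quantity, the difference $x_1 - x_2$.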

Degrees of Freedom

• The number of independent pieces of information that go into the estimate of a parameter is called the degrees of freedom (df).

• So for calculating the mean of the sample, we have all n observations in the sample.

• But to calculate the distance from the mean, you have one less. Why?

• If you have two observations, they will both be at the same distance from the mean, so the second distance tells you nothing new.
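The same idea holds for any sample size: because the deviations from the mean must sum to zero, the last one can always be worked out from the others. The Python sketch below uses made-up data (an assumption for illustration) to show that only n - 1 of the deviations are free to vary.

```python
# Hypothetical data, for illustration only.
data = [4, 7, 10, 15]

mean = sum(data) / len(data)              # 36 / 4 = 9.0
deviations = [x - mean for x in data]     # -5, -2, 1, 6

# Pretend the last deviation is hidden from us.
known = deviations[:-1]

# The deviations must sum to zero, so the hidden one is forced.
implied_last = -sum(known)

print("known deviations:", known)
print("implied last deviation:", implied_last)
print("actual last deviation: ", deviations[-1])
```

With n = 4 observations there are only 3 independent deviations, which is why the sum of squares is divided by n - 1.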

This example shows how to question statistical formulas, to help you understand them rather than memorize them.

Then you can use the concepts better.

Clear thinking is always more important than the ability to calculate something.

Clear Thinking

Social surveys

• Framing the Issues

• Identifying the target population

• Sample Frame and Sample design

• Instrument design

• Gathering data

• Analyzing data

• Interpreting Results

That is done within the framework of a research design

Applications

• Market research

• Opinion poll

• Voting expectations

• Educational or Health studies

• Sociological studies

• Medical clinical studies

And so much more…

Examples of US/UK Major surveys

• National Election Studies

• Gallup poll

• General Social Survey

• International Social Survey

• United Kingdom Census

• United States Census

• National Health and Nutrition Examination Survey

• World Values Survey

Again: Clear thinking is always more important than the ability to calculate something.

So, One More Time…

With Clear thinking you will not be a turkey…

You will learn to fly…

Some will even soar like an eagle

Thank You