notes on MR
Uploaded by pravinsurya
IBMRD, Ahmednagar(303-A) Marketing Research.
Important Notes for Revision of Statistics:
Topic 1: Mean, Mode, Median, and Standard Deviation
The Mean and Mode
The sample mean is the average: the sum of all the observed outcomes from the sample divided by the number of observations. We use \( \bar{x} \) as the symbol for the sample mean. In math terms,
\[ \bar{x} = \frac{\sum x}{n} \]
where n is the sample size and the x are the observed values.
Example
Suppose you randomly sampled six acres in the Desolation Wilderness for a non-indigenous weed and came up with the following counts of this weed in this region:
34, 43, 81, 106, 106 and 115
We compute the sample mean by adding and dividing by the number of samples, 6.
\[ \bar{x} = \frac{34 + 43 + 81 + 106 + 106 + 115}{6} = \frac{485}{6} \approx 80.83 \]
We can say that the sample mean of non-indigenous weed is 80.83.
The mode of a set of data is the number with the highest frequency. In the above example 106 is the mode, since it occurs twice and the rest of the outcomes occur only once.
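The weed-count calculation can be checked with a few lines of Python. This is a minimal sketch using the standard `statistics` module; the variable names are our own.

```python
from statistics import mean, mode

# Weed counts from the six sampled acres in the example.
weed_counts = [34, 43, 81, 106, 106, 115]

sample_mean = mean(weed_counts)   # sum of the counts divided by 6
sample_mode = mode(weed_counts)   # the most frequent count

print(f"mean = {sample_mean:.2f}, mode = {sample_mode}")
```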
The population mean is the average of the entire population and is usually impossible to compute. We use the Greek letter \( \mu \) for the population mean.
Median and Trimmed Mean
One problem with using the mean is that it often does not depict the typical outcome. If there is one outcome that is very far from the rest of the data, then the mean will be strongly affected by this outcome. Such an outcome is called an outlier. An alternative measure is the median. The median is the middle score. If we have an even number of observations we take the average of the two middle values. The median is better for describing the typical value. It is often used for income and home prices.
Example
Suppose you randomly selected 10 house prices in the South Lake Tahoe area. You are interested in the typical house price. In units of $100,000 the prices were
2.7, 2.9, 3.1, 3.4, 3.7, 4.1, 4.3, 4.7, 4.7, 40.8
If we computed the mean, we would say that the average house price is $744,000. Although this number is true, it does not reflect the price for available housing in South Lake Tahoe. A closer look at the data shows that the house valued at 40.8 x $100,000 = $4.08 million skews the data. Instead, we use the median. Since there is an even number of outcomes, we take the average of the middle two:
\[ \frac{3.7 + 4.1}{2} = 3.9 \]
The median house price is $390,000. This better reflects what house shoppers should expect to spend.
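To see how strongly a single outlier pulls the mean, the house-price example can be reproduced as follows. This is only a sketch with the standard `statistics` module; the names are illustrative.

```python
from statistics import mean, median

# House prices from the example, in units of $100,000.
prices = [2.7, 2.9, 3.1, 3.4, 3.7, 4.1, 4.3, 4.7, 4.7, 40.8]

mean_price = mean(prices)      # pulled upward by the 40.8 outlier
median_price = median(prices)  # average of the two middle values

print(f"mean   = ${mean_price * 100_000:,.0f}")
print(f"median = ${median_price * 100_000:,.0f}")
```

Dropping the 40.8 entry and rerunning shows the mean falling close to the median, while the median barely moves.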
There is an alternative value that is also resistant to outliers. This is called the trimmed mean, which is the mean computed after discarding the outliers, typically the top 5% and the bottom 5% of the data. We can use the trimmed mean if we are concerned with outliers skewing the data. However, the median is used more often since more people understand it.
Example: At a ski rental shop data was collected on the number of rentals on each of ten consecutive Saturdays:
44, 50, 38, 96, 42, 47, 40, 39, 46, 50.
To find the sample mean, add them and divide by 10:
\[ \bar{x} = \frac{44 + 50 + 38 + 96 + 42 + 47 + 40 + 39 + 46 + 50}{10} = 49.2 \]
Notice that the mean value is not a value of the sample.
To find the median, first sort the data:
38, 39, 40, 42, 44, 46, 47, 50, 50, 96
Notice that there are two middle numbers 44 and 46. To find the median we take the average of the two.
\[ \text{Median} = \frac{44 + 46}{2} = 45 \]
Notice also that the mean is larger than all but three of the data points. The mean is influenced by outliers while the median is robust.
Variance, Standard Deviation and Coefficient of Variation
The mean, mode, median, and trimmed mean do a nice job in telling where the center of the data set is, but often we are interested in more. For example, a pharmaceutical engineer develops a new drug that regulates sugar in the blood. Suppose she finds out that the average sugar content after taking the medication is the optimal level. This does not mean that the drug is effective. There is a possibility that half of the patients have dangerously low sugar content while the other half have dangerously high content. Instead of the drug being an effective regulator, it is a deadly poison. What the engineer needs is a measure of how far the data is spread apart. This is what the variance and standard deviation do. First we show the formulas for these measurements. Then we will go through the steps on how to use the formulas. We define the variance to be
\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \]
and the standard deviation to be
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]
Variance and Standard Deviation: Step by Step
1. Calculate the mean, \( \bar{x} \).
2. Write a table that subtracts the mean from each observed value.
3. Square each of the differences.
4. Add this column.
5. Divide by n - 1, where n is the number of items in the sample. This is the variance.
6. To get the standard deviation we take the square root of the variance.
Example The owner of the Ches Tahoe restaurant is interested in how much people spend at the restaurant. He examines 10 randomly selected receipts for parties of four and writes down the following data.
44, 50, 38, 96, 42, 47, 40, 39, 46, 50
He calculated the mean by adding and dividing by 10 to get
\( \bar{x} = 49.2 \)
Below is the table for getting the standard deviation:

x        x - 49.2     (x - 49.2)^2
44         -5.2          27.04
50          0.8           0.64
38        -11.2         125.44
96         46.8        2190.24
42         -7.2          51.84
47         -2.2           4.84
40         -9.2          84.64
39        -10.2         104.04
46         -3.2          10.24
50          0.8           0.64
Total                  2599.6

Now
\[ s^2 = \frac{2599.6}{10 - 1} \approx 288.8 \]
Hence the variance is approximately 289 and the standard deviation is the square root of 289 = 17.
What this means is that most of the patrons probably spend between $32.20 and $66.20.
The sample standard deviation will be denoted by s and the population standard deviation will be denoted by the Greek letter \( \sigma \).
The sample variance will be denoted by \( s^2 \) and the population variance will be denoted by \( \sigma^2 \).
The variance and standard deviation describe how spread out the data is. If the data all lies close to the mean, then the standard deviation will be small, while if the data is spread out over a large range of values, s will be large. Having outliers will increase the standard deviation.
One of the flaws involved with the standard deviation is that it depends on the units that are used. One way of handling this difficulty is the coefficient of variation, which is the standard deviation divided by the mean times 100%:
\[ CV = \frac{s}{\bar{x}} \cdot 100\% \]
In the above example, it is
\[ \frac{17}{49.2} \cdot 100\% = 34.6\% \]
This tells us that the standard deviation of the restaurant bills is 34.6% of the mean.
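The step-by-step recipe for the variance, standard deviation and coefficient of variation can be sketched in Python for the restaurant receipts. The variable names are ours, and the standard `statistics` module is used only as a cross-check on the hand calculation.

```python
from statistics import mean, stdev, variance

# The ten restaurant receipts from the example.
bills = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]

x_bar = mean(bills)                                   # step 1: the mean
squared_devs = [(x - x_bar) ** 2 for x in bills]      # steps 2 and 3
sum_sq = sum(squared_devs)                            # step 4: add the column
s2 = sum_sq / (len(bills) - 1)                        # step 5: divide by n - 1
s = s2 ** 0.5                                         # step 6: square root
cv = s / x_bar * 100                                  # coefficient of variation, in percent

print(f"sum of squares = {sum_sq:.1f}")
print(f"variance = {s2:.1f}, standard deviation = {s:.1f}, CV = {cv:.1f}%")

# The library functions use the same n - 1 divisor.
assert abs(s2 - variance(bills)) < 1e-9
assert abs(s - stdev(bills)) < 1e-9
```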
The Standard Normal Distribution
Definition of the Standard Normal Distribution
The Standard Normal distribution is the normal distribution that has mean 0 and standard deviation 1.
Notice that the distribution is perfectly symmetric about 0.
If a distribution is normal but not standard, we can convert a value so that it can be looked up in the Standard Normal distribution table by first finding how many standard deviations away from the mean the value is.
The z-score
The number of standard deviations from the mean is called the z-score and can be found by the formula
\[ z = \frac{x - \mu}{\sigma} \]
Example
Find the z-score corresponding to a raw score of 132 from a normal distribution with mean 100 and standard deviation 15.
Solution
We compute
\[ z = \frac{132 - 100}{15} = 2.133 \]
Example A z-score of 1.7 was found from an observation coming from a normal distribution with mean 14 and standard deviation 3. Find the raw score.
Solution
We have
\[ 1.7 = \frac{x - 14}{3} \]
To solve this we just multiply both sides by the denominator 3,
(1.7)(3) = x - 14
5.1 = x - 14
x = 19.1
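Both directions of the z-score conversion fit in two small functions. This is a minimal sketch; the function names `z_score` and `raw_score` are our own.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma


def raw_score(z, mu, sigma):
    """Invert the z-score formula to recover the raw value."""
    return mu + z * sigma


print(round(z_score(132, 100, 15), 3))   # first example: raw score 132
print(round(raw_score(1.7, 14, 3), 1))   # second example: z-score 1.7
```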
Topic 2: Sampling Methods
Sampling and Statistics
Statistics We start the discussion in the natural way. We all have a general feeling about what statistics is. In the course of these lecture notes, we will lay out the detail about what statistics is and how it is used. For now we give a quick definition.
Statistics is the study of how to collect, organize, analyze, and interpret numerical information from data.
Population vs. Sample
We define the population as the total set of individuals that we are interested in, and a sample as a subset of those individuals selected in a prescribed manner for study.
Typically, population data is very hard or even impossible to gather. Statisticians and researchers will instead extract data from a sample. There are several types of data that are of interest.
We can classify data into two types:
1. Numerical or Quantitative data is data where the observations are numbers. For example, age, height, on a scale from one to ten..., distance, number of ..., ...
2. Categorical or Qualitative data is data where the observations are non-numerical. For example, favorite color, choice of politician, ...
There is a more refined way to classify data. Data can be put into one of several categories called levels of measurement
a. The nominal level is synonymous with qualitative data.
b. The ordinal level is data that involves ranking. For example, Williams took second place in the US Open. There are no actual values assigned to each outcome, but we can still compare one with another.
c. The interval level is data such that one outcome can be compared with another outcome by taking differences. For example, one outcome may be 12 degrees warmer than another, or an outcome may have occurred 35 minutes later than another.
d. The ratio level is data for which both differences and ratios can be taken. For example, if the cost of a hamburger is $2 and that of a steak is $12, it makes sense to say either that the steak costs $10 more or that the steak is 6 times as expensive.
e. Boolean data is data that can take one of two values such as true or false, yes or no, on or off, etc. For example, the outcome of a questionnaire asking if you agree with Bush’s policy on the Middle East.
Data is called univariate if it represents one attribute and bivariate if it contains two attributes. Bivariate data is often used to compare and contrast. For example, we may study weight gain and caloric intake.
Numerical data is called discrete if the number of possible values within every bounded range is finite. Examples include: rolling dice, number of times that..., ...
Otherwise, numerical data is called continuous. For example, height, weight, temperature, distance,...
Random Samples When we conduct a survey we always attempt to achieve a random sample. A simple random sample of size n is one in which every possible subset of size n has equal chance of being selected. For example, to choose a random sample of 20 people with phone numbers, we can use a random number generator to randomly select 20 phone numbers.
Caution: A simple random sample is almost always impossible to achieve in the real world. For example, using the phone number generator, we will only be able to collect data from those who have a phone, pick up the phone, and are willing to participate in the phone survey. Because of this, most surveys have inherent flaws. However, a survey with a small flaw is better than no information.
Many surveys are done using convenience sampling. For example a researcher stands outside a supermarket and interviews anyone eager to respond.
One way to overcome the problem of obtaining a random sample is to use stratified sampling. Stratified sampling ensures that members of each stratum (or type) are included in the survey. For example, we may randomly select 50 Caucasians, 25 Hispanics, and 10 Filipinos from the Lake Tahoe community to ensure that the main three ethnic groups are represented.
One problem with sampling is that often the researcher only gets respondents who are eager to be interviewed. One way to combat this is to use cluster sampling. This process involves breaking the population into several groups or clusters. Some of the clusters are randomly selected and the researcher makes sure that every individual in the selected clusters is surveyed. This usually involves paying the respondents to take the survey.
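The three sampling schemes can be contrasted in a short sketch. Everything here is hypothetical: the population of 100 people, the made-up group labels A, B and C, the cluster size, and the function names are assumptions chosen only for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100 people, each tagged with a made-up group label.
population = [{"id": i, "group": "ABC"[i % 3]} for i in range(100)]


def simple_random_sample(pop, n):
    """Every subset of size n is equally likely to be chosen."""
    return rng.sample(pop, n)


def stratified_sample(pop, key, n_per_stratum):
    """Draw a random sample separately from each stratum."""
    strata = {}
    for person in pop:
        strata.setdefault(person[key], []).append(person)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample


def cluster_sample(pop, cluster_size, n_clusters):
    """Pick whole clusters at random and survey everyone in them."""
    clusters = [pop[i:i + cluster_size] for i in range(0, len(pop), cluster_size)]
    chosen = rng.sample(clusters, n_clusters)
    return [person for cluster in chosen for person in cluster]


print(len(simple_random_sample(population, 20)))
print(len(stratified_sample(population, "group", 5)))
print(len(cluster_sample(population, 10, 3)))
```

Note that the stratified sample guarantees five members of each group, while the simple random sample makes no such promise.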
Experimental Design
One of the most fatal mistakes a researcher can make is to have a faulty experimental design, that is, poor planning. Deep thought needs to go into the design of the experiment before any field work can take place. Below are some guidelines for planning a statistical study.
1. Identify the population.
2. Decide what your variables are and how you are going to take measurements. This can involve lawyers and regulatory agencies.
3. Determine the sample size.
4. Collect the data.
5. Organize the data and use either descriptive or inferential statistical methods to interpret and report on the data.
6. Publish, noting how you need more funding to do a more extensive survey.
The coin toss is a type of experimentation. Experimentation is a type of data collection where the researcher creates the data. Usually experimentation answers the question, “If we do this, what happens?” The response variable is the variable being studied by the experimenter. Often the experimenter sets the environment to run the experiment. For example, a psychologist may want to determine mood based on weather conditions. She may study several people's moods under various weather conditions. These conditions are called experimental conditions or factors. Sometimes there is a factor that cannot be distinguished from another factor. For example, red wine drinking has been correlated with a low risk of heart disease. But if people with a low stress level tend to drink red wine, then the two factors are confounded. Low stress is said to be an extraneous factor. One way to handle this is called blocking, which means to create groups that are similar in every way except for the factor you are testing. Another way to handle this is called control, which means to keep all extraneous factors constant.
Topic 3: Hypothesis Testing
Whenever we have a decision to make about a population characteristic, we make a hypothesis. Some examples are:
\( \mu > 3 \) or \( \mu \neq 5 \).
Suppose that we want to test the hypothesis that \( \mu \neq 5 \). Then we can think of our opponent suggesting that \( \mu = 5 \). We call the opponent’s hypothesis the null hypothesis and write:
H0: \( \mu = 5 \)
and our hypothesis the alternative hypothesis and write
H1: \( \mu \neq 5 \)
For the null hypothesis we always use equality, since we are comparing with a previously determined mean.
For the alternative hypothesis, we have the choices: \( < \), \( > \), or \( \neq \).
Procedures in Hypothesis Testing
When we test a hypothesis we proceed as follows:
1. Formulate the null and alternative hypothesis.
2. Choose a level of significance.
3. Determine the sample size. (Same as confidence intervals)
4. Collect data.
5. Calculate z (or t) score.
6. Utilize the table to determine if the z score falls within the acceptance region.
7. Decide to
a. Reject the null hypothesis and therefore accept the alternative hypothesis or
b. Fail to reject the null hypothesis and therefore state that there is not enough evidence to suggest the truth of the alternative hypothesis.
Errors in Hypothesis Tests
We define a type I error as the event of rejecting the null hypothesis when the null hypothesis was true. The probability of a type I error (\( \alpha \)) is called the significance level.
We define a type II error (with probability \( \beta \)) as the event of failing to reject the null hypothesis when the null hypothesis was false.
Type I Error
In a hypothesis test, a type I error occurs when the null hypothesis is rejected when it is in fact true; that is, H0 is wrongly rejected. For example, in a clinical trial of a new drug, the null hypothesis might be that the new drug is no better, on average, than the current drug; that is H0: there is no difference between the two drugs on average. A type I error would occur if we concluded that the two drugs produced different effects when in fact there was no difference between them.
The following table gives a summary of possible results of any hypothesis test:

                   Decision
Truth       Reject H0           Don't reject H0
H0          Type I Error        Right Decision
H1          Right Decision      Type II Error

A type I error is often considered to be more serious, and therefore more important to avoid, than a type II error. The hypothesis test procedure is therefore adjusted so that there is a guaranteed 'low' probability of rejecting the null hypothesis wrongly; this probability is never 0. This probability of a type I error can be precisely computed as
P(type I error) = significance level = \( \alpha \)
The exact probability of a type II error is generally unknown.
If we do not reject the null hypothesis, it may still be false (a type II error) as the sample may not be big enough to identify the falseness of the null hypothesis (especially if the truth is very close to hypothesis).
For any given set of data, type I and type II errors are inversely related; the smaller the risk of one, the higher the risk of the other.
A type I error can also be referred to as an error of the first kind.
Type II Error
In a hypothesis test, a type II error occurs when the null hypothesis H0 is not rejected when it is in fact false. For example, in a clinical trial of a new drug, the null hypothesis might be that the new drug is no better, on average, than the current drug; that is H0: there is no difference between the two drugs on average. A type II error would occur if it was concluded that the two drugs produced the same effect, that is, there is no difference between the two drugs on average, when in fact they produced different ones. A type II error is frequently due to sample sizes being too small.
The probability of a type II error is symbolised by \( \beta \) and written:
P(type II error) = \( \beta \) (but is generally unknown).
A type II error can also be referred to as an error of the second kind.
Example
Suppose that you are a lawyer who is trying to establish that a company has been unfair to minorities with regard to salary increases. Suppose the mean salary increase per year is 8%. You set the null hypothesis to be H0: \( \mu = .08 \) and the alternative hypothesis to be H1: \( \mu < .08 \).
Q. What is a type I error?
A. We put sanctions on the company when it was not being discriminatory.
Q. What is a type II error?
A. We allow the company to go about its discriminatory ways.
Note: Larger \( \alpha \) results in a smaller \( \beta \), and smaller \( \alpha \) results in a larger \( \beta \).
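The trade-off between the two error probabilities can be made concrete with a small sketch. The scenario is an assumption of ours, not part of the notes: a right tailed z test in which the true mean sits a known number of standard errors (`shift`) above the hypothesized mean. The function name `type_ii_error` is hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal distribution: mean 0, sd 1


def type_ii_error(alpha, shift):
    """Probability of failing to reject H0 in a right tailed z test when
    the true mean lies `shift` standard errors above the hypothesized
    mean (an assumed scenario for illustration)."""
    z_crit = std_normal.inv_cdf(1 - alpha)   # boundary of the rejection region
    return std_normal.cdf(z_crit - shift)    # chance that z lands below the boundary


for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha:.2f}  beta = {type_ii_error(alpha, 2):.3f}")
```

As the significance level shrinks, the printed type II error probability grows, which is the inverse relationship described above.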
Hypothesis Testing For a Population Mean
The Idea of Hypothesis Testing
Suppose we want to show that only children have a higher average cholesterol level than the national average. It is known that the mean cholesterol level for all Americans is 190. Construct the relevant hypothesis test:
H0: \( \mu = 190 \)
H1: \( \mu > 190 \)
We test 100 only children and find that \( \bar{x} = 198 \) and s = 15.
Do we have evidence to suggest that only children have a higher average cholesterol level than the national average? We have
\[ z = \frac{\bar{x} - \mu}{s / \sqrt{n}} = \frac{198 - 190}{15 / \sqrt{100}} = 5.33 \]
z is called the test statistic.
Since z is so high, a sample mean this large would be very unlikely if H0 were true, so we decide to reject H0 and accept H1. Therefore, we can conclude that only children have a higher cholesterol level on average than the national average.
Rejection Regions
Suppose that \( \alpha = .05 \). For a two tailed test we can draw the appropriate picture and find the z-scores that cut off an area of .025 in each tail, namely z = -1.96 and z = 1.96. We call the outside regions the rejection regions.
We call these tail areas (shaded blue in the original figure) the rejection region, since if the value of z falls in these regions, the observed data would be very unlikely under the null hypothesis, so we reject the null hypothesis.
Example 50 smokers were questioned about the number of hours they sleep each day. We want to test the hypothesis that the smokers need less sleep than the general public which needs an average of 7.7 hours of sleep. We follow the steps below.
A. Compute a rejection region for a significance level of .05.
B. If the sample mean is 7.5 and the standard deviation is .5, what can you conclude?
Solution
First, we write down the null and alternative hypotheses
H0: \( \mu = 7.7 \)
H1: \( \mu < 7.7 \)
This is a left tailed test. The z-score that cuts off an area of .05 in the left tail is -1.645. The critical region is the area that lies to the left of -1.645. If the z-value is less than -1.645 we will reject the null hypothesis and accept the alternative hypothesis. If it is greater than -1.645, we will fail to reject the null hypothesis and say that the test was not statistically significant.
We have
\[ z = \frac{7.5 - 7.7}{0.5 / \sqrt{50}} = -2.83 \]
Since -2.83 is to the left of -1.645, it is in the critical region. Hence we reject the null hypothesis and accept the alternative hypothesis. We can conclude that smokers need less sleep.
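The rejection-region approach for the smokers example can be sketched with Python's standard `statistics.NormalDist`, which stands in for the z-table. For a left tailed test the critical value is the z-score with area \( \alpha \) to its left; the variable names are ours.

```python
from math import sqrt
from statistics import NormalDist

alpha = 0.05
mu0, x_bar, s, n = 7.7, 7.5, 0.5, 50   # values from the smokers example

# Left tailed test: the critical value has area alpha to its left.
z_crit = NormalDist().inv_cdf(alpha)

# Test statistic for a large sample mean.
z = (x_bar - mu0) / (s / sqrt(n))

reject_h0 = z < z_crit
print(f"critical value = {z_crit:.3f}, z = {z:.2f}, reject H0: {reject_h0}")
```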
P-values (Probability values)
There is another way to interpret the test statistic. In hypothesis testing, we make a yes or no decision without discussing borderline cases. For example, with \( \alpha = .06 \), a two tailed test will indicate rejection of H0 for a test statistic of z = 2 or for z = 6, but z = 6 is much stronger evidence than z = 2. To show this difference we write the p-value, which is the lowest significance level such that we will still reject H0. For a one tailed test, p is the tail area beyond the test statistic found from the table, and for a two tailed test, p is twice that tail area.
Example: Suppose that we want to test the hypothesis with a significance level of .05 that the climate has changed since industrialization. Suppose that the mean temperature throughout history is 50 degrees. During the last 40 years, the mean temperature has been 51 degrees with a standard deviation of 2 degrees. What can we conclude?
We have
H0: \( \mu = 50 \)
H1: \( \mu \neq 50 \)
We compute the z score:
\[ z = \frac{51 - 50}{2 / \sqrt{40}} = 3.16 \]
The table gives us .9992 so that
p = (1 - .9992)(2) = .002
since
.002 < .05
we can conclude that there has been a change in temperature.
Note that small p-values will result in a rejection of H0 and large p-values will result in failing to reject H0.
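The p-value for the temperature example can be computed directly instead of read from a table. This is a sketch under the same large-sample assumptions; the helper name `p_value` is ours, and for a one tailed test it assumes the test statistic points in the direction of the alternative hypothesis.

```python
from math import sqrt
from statistics import NormalDist


def p_value(z, tails=2):
    """Tail area beyond |z| under the standard normal curve,
    doubled for a two tailed test."""
    tail_area = 1 - NormalDist().cdf(abs(z))
    return tails * tail_area


# Temperature example: historical mean 50, sample mean 51, s = 2, n = 40.
z = (51 - 50) / (2 / sqrt(40))
p = p_value(z, tails=2)

print(f"z = {z:.2f}, p = {p:.4f}")
print("reject H0" if p < 0.05 else "fail to reject H0")
```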
Small Sample Hypothesis Tests For a Normal population
When we have a small sample from a normal population, we use the same method as a large sample except we use the t statistic instead of the z-statistic. Hence, we need to find the degrees of freedom (n - 1) and use the t-table in the back of the book.
Example
Is the temperature required to damage a computer on the average less than 110 degrees? Because of the price of testing, twenty computers were tested to see what minimum temperature will damage the computer. The damaging temperature averaged 109 degrees with a standard deviation of 3 degrees. (Use \( \alpha = .05 \).)
We test the hypothesis
H0: \( \mu = 110 \)
H1: \( \mu < 110 \)
We compute the t statistic:
\[ t = \frac{\bar{x} - \mu}{s / \sqrt{n}} = \frac{109 - 110}{3 / \sqrt{20}} = -1.49 \]
This is a one tailed test, so we can go to our t-table with 19 degrees of freedom to find that
tc = 1.73
so the critical region for this left tailed test lies to the left of -1.73. Since
-1.49 > -1.73
we see that the test statistic does not fall in the critical region. We fail to reject the null hypothesis and conclude that there is insufficient evidence to suggest that the temperature required to damage a computer is on the average less than 110 degrees.
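The small-sample calculation can be sketched as follows. Python's standard library has no t distribution, so the critical value is typed in from a t-table (19 degrees of freedom, one tailed) rather than computed; that hard-coded value and the variable names are assumptions of this sketch.

```python
from math import sqrt

x_bar, mu0, s, n = 109, 110, 3, 20   # values from the computer example

t = (x_bar - mu0) / (s / sqrt(n))    # t statistic
df = n - 1                           # degrees of freedom

# Left tailed critical value read from a t-table, not computed here.
t_crit = -1.73

reject_h0 = t < t_crit
print(f"t = {t:.2f} with {df} degrees of freedom, reject H0: {reject_h0}")
```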
Hypothesis Testing for a Population Proportion We have seen how to conduct hypothesis tests for a mean. We now turn to proportions. The process is completely analogous, although we will need to use the standard deviation formula for a proportion.
Example Suppose that you interview 1000 exiting voters about who they voted for governor. Of the 1000 voters, 550 reported that they voted for the democratic candidate. Is there sufficient evidence to suggest that the democratic candidate will win the election at the .01 level?
H0: p =.5 H1: p>.5
Since it is a large sample, we can use the central limit theorem to say that the distribution of proportions is approximately normal. We compute the test statistic:
\[ z = \frac{\hat{p} - p}{\sqrt{pq / n}} = \frac{0.55 - 0.5}{\sqrt{(0.5)(0.5) / 1000}} = 3.16 \]
Notice that in this formula, we have used the hypothesized proportion rather than the sample proportion. This is because if the null hypothesis is correct, then .5 is the true proportion and we are not making any approximations. We compute the rejection region using the z-table. We find that zc = 2.33.
Since 3.16 > 2.33, the test statistic is in the rejection region. Therefore we reject H0 and can conclude, with a p-value of .0008, that the democratic candidate will win.
Example 1500 randomly selected pine trees were tested for traces of the Bark Beetle infestation. It was found that 153 of the trees showed such traces. Test the hypothesis that more than 10% of the Tahoe trees have been infested. (Use a 5% level of significance)
Solution
The hypothesis is
H0: p = .1
H1: p > .1
We have that
\[ \hat{p} = \frac{153}{1500} = 0.102 \]
Next we compute the z-score
\[ z = \frac{0.102 - 0.1}{\sqrt{(0.1)(0.9) / 1500}} = 0.26 \]
Since we are using a 5% level of significance with a one tailed test, we have zc = 1.645. We see that 0.26 does not lie in the rejection region (z > 1.645), hence we fail to reject the null hypothesis. We say that there is insufficient evidence to conclude that the percentage of infested pines is greater than 10%.
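Both proportion examples follow the same recipe, so one right tailed helper covers them. This is a sketch; the function name `proportion_z_test` and its return layout are our own choices.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(successes, n, p0, alpha):
    """Right tailed large-sample test of H0: p = p0 against H1: p > p0.
    Returns the test statistic, critical value, p-value and decision."""
    p_hat = successes / n
    # The standard deviation uses the hypothesized proportion p0.
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    z_crit = NormalDist().inv_cdf(1 - alpha)
    p_value = 1 - NormalDist().cdf(z)
    return z, z_crit, p_value, z > z_crit


voters = proportion_z_test(550, 1000, 0.5, 0.01)   # exit poll example
trees = proportion_z_test(153, 1500, 0.1, 0.05)    # Bark Beetle example

for name, (z, z_crit, p_value, reject) in (("voters", voters), ("trees", trees)):
    print(f"{name}: z = {z:.2f}, zc = {z_crit:.3f}, p = {p_value:.4f}, reject H0: {reject}")
```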
Exercises
A. Suppose that 40% of the nation is registered republican. Does the Tahoe environment reflect the national proportion? Test the hypothesis that Tahoe residents differ from the rest of the nation in their affiliation, if of 200 locals surveyed, 75 are registered republican.
B. If 10% of California residents are vegetarians, test the hypothesis that people who gamble are less likely to be vegetarians. Of the 120 people polled, 10 claimed to be vegetarian.
Hypothesis Testing of the Difference Between Two Means
Do employees perform better at work with music playing? The music was turned on during the working hours of a business with 45 employees. Their productivity level averaged 5.2 with a standard deviation of 2.4. On a different day the music was turned off and there were 40 workers. The workers’ productivity level averaged 4.8 with a standard deviation of 1.2. What can we conclude at the .05 level?
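Solution
We first develop the hypotheses
H0: \( \mu_1 - \mu_2 = 0 \)
H1: \( \mu_1 - \mu_2 > 0 \)
Next we need to find the standard deviation. Recall from before, we had that the mean of the difference is
\[ \mu_{\bar{x}_1 - \bar{x}_2} = \mu_1 - \mu_2 \]
and the standard deviation is
\[ \sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \]
We can substitute the sample means and sample standard deviations for a point estimate of the population means and standard deviations. We have
\[ \bar{x}_1 - \bar{x}_2 = 5.2 - 4.8 = 0.4 \]
and
\[ s_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{2.4^2}{45} + \frac{1.2^2}{40}} = 0.405 \]
Now we can calculate the z-score. We have
\[ z = \frac{0.4}{0.405} = 0.988 \]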
Since this is a one tailed test, the critical value is 1.645 and 0.988 does not lie in the critical region. We fail to reject the null hypothesis and conclude that there is insufficient evidence to conclude that workers perform better at work when the music is on.
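The music example can be recomputed with a small helper for the large-sample difference of means. This is a sketch; `two_sample_z` is a name of our own choosing.

```python
from math import sqrt


def two_sample_z(x1, s1, n1, x2, s2, n2):
    """z statistic for H0: mu1 - mu2 = 0 with large samples, using the
    sample standard deviations as point estimates."""
    se = sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)   # standard deviation of the difference
    return (x1 - x2) / se, se


# Music on: mean 5.2, s = 2.4, n = 45.  Music off: mean 4.8, s = 1.2, n = 40.
z, se = two_sample_z(5.2, 2.4, 45, 4.8, 1.2, 40)

print(f"standard error = {se:.3f}, z = {z:.3f}")
print("reject H0" if z > 1.645 else "fail to reject H0")
```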
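Hypothesis Testing For a Difference Between Means for Small Samples
Recall that for small samples we need to make the following assumptions:
1. Random unbiased sample.
2. Both population distributions are normal.
3. The two standard deviations are equal.
If we know \( \sigma \), then the sampling standard deviation is:
\[ \sigma_{\bar{x}_1 - \bar{x}_2} = \sigma \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \]
If we do not know \( \sigma \) then we use the pooled standard deviation.
\[ s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} \]
Putting this together with hypothesis testing we can find the t-statistic.
\[ t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \]
and use n1 + n2 - 2 degrees of freedom.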
Example:
Nine dogs and ten cats were tested to determine if there is a difference in the average number of days that the animal can survive without food. The dogs averaged 11 days with a standard deviation of 2 days while the cats averaged 12 days with a standard deviation of 3 days. What can be concluded? (Use \( \alpha = .05 \))
Solution We write:
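H0: \( \mu_{dog} - \mu_{cat} = 0 \)
H1: \( \mu_{dog} - \mu_{cat} \neq 0 \)
We have:
n1 = 9, n2 = 10, \( \bar{x}_1 = 11 \), \( \bar{x}_2 = 12 \), s1 = 2, s2 = 3
so that
\[ s_p = \sqrt{\frac{(9 - 1)(2)^2 + (10 - 1)(3)^2}{9 + 10 - 2}} = 2.58 \]
and
\[ t = \frac{11 - 12}{2.58 \sqrt{\frac{1}{9} + \frac{1}{10}}} = -0.84 \]
The t-critical value corresponding to \( \alpha = .05 \) with 10 + 9 - 2 = 17 degrees of freedom is 2.11, which is greater than |t| = 0.84. Hence we fail to reject the null hypothesis and conclude that there is not sufficient evidence to suggest that there is a difference between the mean starvation time for cats and dogs.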
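The pooled calculation for the dogs and cats can be sketched as below. The critical value 2.11 is taken from the t-table as in the text, not computed; the function name `pooled_t` is ours.

```python
from math import sqrt


def pooled_t(x1, s1, n1, x2, s2, n2):
    """t statistic for H0: mu1 - mu2 = 0 using the pooled standard
    deviation (assumes equal population standard deviations)."""
    df = n1 + n2 - 2
    sp = sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df)
    t = (x1 - x2) / (sp * sqrt(1 / n1 + 1 / n2))
    return t, sp, df


# Dogs: mean 11, s = 2, n = 9.  Cats: mean 12, s = 3, n = 10.
t, sp, df = pooled_t(11, 2, 9, 12, 3, 10)

t_crit = 2.11   # two tailed value from a t-table, 17 degrees of freedom
print(f"pooled s = {sp:.2f}, t = {t:.2f}, df = {df}")
print("reject H0" if abs(t) > t_crit else "fail to reject H0")
```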
Hypothesis Testing for a Difference between Proportions
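Inferences on the Difference between Population Proportions
If two samples are counted independently of each other we use the test statistic:
\[ z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{pq \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}} \]
where
\[ p = \frac{r_1 + r_2}{n_1 + n_2} \]
and q = 1 - p. Here r1 and r2 are the numbers of successes in the two samples.
Example
Is the severity of the drug problem in high school the same for boys and girls? 85 boys and 70 girls were questioned and 34 of the boys and 14 of the girls admitted to having tried some sort of drug. What can be concluded at the .05 level?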
Solution
The hypotheses are
H0: \( p_1 - p_2 = 0 \)
H1: \( p_1 - p_2 \neq 0 \)
We have
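p1 = 34/85 = 0.4, p2 = 14/70 = 0.2, p = 48/155 = 0.31, q = 0.69
Now compute the z-score
\[ z = \frac{0.4 - 0.2}{\sqrt{(0.31)(0.69) \left( \frac{1}{85} + \frac{1}{70} \right)}} = 2.68 \]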
Since we are using a significance level of .05 and it is a two tailed test, the critical value is 1.96. Clearly 2.68 is in the critical region, hence we can reject the null hypothesis and accept the alternative hypothesis and conclude that gender does make a difference for drug use.
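The z-score for the drug survey can be reproduced with the pooled proportion computed from the raw counts. This is a sketch; `two_proportion_z` is a name we made up.

```python
from math import sqrt


def two_proportion_z(r1, n1, r2, n2):
    """z statistic for H0: p1 - p2 = 0 from success counts r1, r2
    in independent samples of sizes n1, n2."""
    p1, p2 = r1 / n1, r2 / n2
    p = (r1 + r2) / (n1 + n2)   # pooled proportion
    q = 1 - p
    return (p1 - p2) / sqrt(p * q * (1 / n1 + 1 / n2))


# 34 of 85 boys and 14 of 70 girls admitted to having tried a drug.
z = two_proportion_z(34, 85, 14, 70)

print(f"z = {z:.2f}")
print("reject H0" if abs(z) > 1.96 else "fail to reject H0")
```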
Paired Data: Hypothesis Tests Example
Is success determined by genetics?
The best such survey is one that investigates identical twins who have been reared in two different environments, one that is nurturing and one that is non-nurturing. We could measure the difference in high school GPAs between each pair. This is better than just pooling each group individually. Our hypotheses are
H0: \( \mu_d = 0 \)
H1: \( \mu_d > 0 \)
where \( \mu_d \) is the mean of the differences between the matched pairs. We use the test statistic
\[ t = \frac{\bar{d} - \mu_d}{s_d / \sqrt{n}} \]
where \( \bar{d} \) is the sample mean of the differences and \( s_d \) is the standard deviation of the differences. For a small sample we use n - 1 degrees of freedom, where n is the number of pairs.
Difference between Means
I surveyed 50 people from a poor area of town and 70 people from an affluent area of town about their feelings towards minorities. I counted the number of negative comments made. I was interested in comparing their attitudes. The average number of negative comments in the poor area was 14 and in the affluent area was 12. The standard deviations were 5 and 4 respectively. Let’s determine a 95% confidence interval for the difference in mean negative comments. First, we need some formulas.
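Theorem
The distribution of the difference of means \( \bar{x}_1 - \bar{x}_2 \) has mean
\[ \mu_1 - \mu_2 \]
and standard deviation
\[ \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \]
For our investigation, we use s1 and s2 as point estimates for \( \sigma_1 \) and \( \sigma_2 \). We have
\( \bar{x}_1 = 14 \), \( \bar{x}_2 = 12 \), s1 = 5, s2 = 4, n1 = 50, n2 = 70
Now calculate
\[ \bar{x}_1 - \bar{x}_2 = 14 - 12 = 2 \]
and
\[ s = \sqrt{\frac{5^2}{50} + \frac{4^2}{70}} = 0.85 \]
The margin of error is
E = zc s = (1.96)(0.85) = 1.7
The confidence interval is
2 ± 1.7
or [0.3, 3.7]
We can conclude that the mean difference between the number of negative comments that people in the poor and affluent areas make is between 0.3 and 3.7.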
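The large-sample confidence interval can be sketched in a few lines, with `NormalDist.inv_cdf` standing in for the z-table. The function name `mean_diff_ci` is our own.

```python
from math import sqrt
from statistics import NormalDist


def mean_diff_ci(x1, s1, n1, x2, s2, n2, confidence=0.95):
    """Large-sample confidence interval for mu1 - mu2."""
    z_c = NormalDist().inv_cdf((1 + confidence) / 2)   # about 1.96 for 95%
    margin = z_c * sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    diff = x1 - x2
    return diff - margin, diff + margin


# Poor area: mean 14, s = 5, n = 50.  Affluent area: mean 12, s = 4, n = 70.
low, high = mean_diff_ci(14, 5, 50, 12, 4, 70)
print(f"95% confidence interval: [{low:.1f}, {high:.1f}]")
```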
Small Samples When either sample size is small, we can still run the statistics provided the distributions are approximately normal. If in addition we know that the two standard deviations are approximately equal, then we can pool the data together to produce a pooled standard deviation. We have the following theorem.
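Pooled Estimate of \( \sigma \)
\[ s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} \]
with n1 + n2 - 2 degrees of freedom
You’ve gotta love the beautiful formula!
Note
After finding the pooled estimate we have that a confidence interval is given by
\[ (\bar{x}_1 - \bar{x}_2) \pm t_c \, s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \]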
Example
What is the difference between commuting patterns for students and professors? 11 students and 14 professors took part in a study to find mean commuting distances. The mean number of miles traveled by students was 5.6 and the standard deviation was 2.8. The mean number of miles traveled by professors was 14.3 and the standard deviation was 9.1. Construct a 95% confidence interval for the difference between the means. What assumption have we made?
Solution
We have
x1 = 5.6 x2 = 14.3 s1 = 2.8 s2 = 9.1 n1 = 11 n2 = 14
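The pooled standard deviation is
\[ s_p = \sqrt{\frac{(11 - 1)(2.8)^2 + (14 - 1)(9.1)^2}{11 + 14 - 2}} = 7.09 \]
The point estimate for the difference of the means is
14.3 - 5.6 = 8.7
and
\[ \sqrt{\frac{1}{11} + \frac{1}{14}} = .403 \]
Use the t-table to find tc for a 95% confidence interval with 23 degrees of freedom and find
tc = 2.07
so the interval is
8.7 ± (2.07)(7.09)(.403) = 8.7 ± 5.9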
The range of values is [2.8, 14.6]
The difference in average miles driven by students and professors is between 2.8 and 14.6. We have assumed that the standard deviations are approximately equal and the two distributions are approximately normal.
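The commuting interval can be checked with a short sketch. The value tc = 2.07 is read from a t-table as in the text rather than computed, and the function name `pooled_ci` is our own.

```python
from math import sqrt


def pooled_ci(x1, s1, n1, x2, s2, n2, t_c):
    """Small-sample confidence interval for mu1 - mu2 using the pooled
    standard deviation; t_c must be looked up for n1 + n2 - 2 df."""
    sp = sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    margin = t_c * sp * sqrt(1 / n1 + 1 / n2)
    diff = x1 - x2
    return diff - margin, diff + margin


# Professors first, so the difference is professors minus students.
low, high = pooled_ci(14.3, 9.1, 14, 5.6, 2.8, 11, t_c=2.07)
print(f"95% confidence interval: [{low:.1f}, {high:.1f}]")
```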
Difference between Proportions
So far, we have discussed the difference between two means (both large and small samples). Our next task is to estimate the difference between two proportions. We have the following theoremAnd a confidence interval for the difference of proportions is
Confidence Interval for the difference of Proportions
Note: in order for this to be valid, we need all four of the quantitiesp1n1 p2n2 q1n1 q2n2 to be greater than 5.
17
Example300 men and 400 women we asked how they felt about taxing Internet sales. 75 of the men and 90 of the women agreed with having a tax. Find a confidence interval for the difference in proportions.
SolutionWe have
p1 = 75/300 = .25 q1 = .75 n1 = 300 p2 = 90/400 = .225 q2 = .775 n2 = 400We can calculate
We can conclude that the difference in opinions is between -8.5% and 3.5%.
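A rough Python check of this interval follows. The example does not state a confidence level, so the value z_c = 1.96 (95%) is an assumption, as is taking the difference as women minus men. With those choices the result is close to the interval quoted above, which appears to use a margin rounded to .06.

```python
import math

def diff_proportion_ci(x1, n1, x2, n2, z_crit):
    """Large-sample confidence interval for p2 - p1."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p2 - p1
    return diff - z_crit * se, diff + z_crit * se

# Assumption: z_crit = 1.96 (95% level); the text does not state the level.
low, high = diff_proportion_ci(75, 300, 90, 400, z_crit=1.96)
print(round(low, 3), round(high, 3))
```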
The Central Limit Theorem
A Review of Terminology
We begin our journey into inferential statistics. Most of the time the population mean and population standard deviation are impossible or too expensive to determine exactly. Two of the major tasks of a statistician are to get an approximation to the mean and to analyze how accurate the approximation is. The most common way of accomplishing this task is by using sampling techniques. Out of the entire population the researcher obtains a (hopefully random) sample and uses the sample to make inferences about the population. From the sample the statistician computes several numbers such as the sample size, the sample mean, and the sample standard deviation. The numbers that are computed from the sample are called statistics.
Example
How many cups of coffee do you drink each week?
If we asked this question to two different five person groups, we will probably get two different sample means and two different sample standard deviations. Choosing different samples from the same population will produce different statistics.
The distribution of a statistic, such as the sample mean, over all possible samples of a given size is called the sampling distribution.
The Central Limit Theorem
Let x̄ denote the mean of a random sample of size n from a population having mean μ and standard deviation σ. Let
μ_x̄ = the mean value of x̄ and σ_x̄ = the standard deviation of x̄. Then
A. $\mu_{\bar{x}} = \mu$
B. $\sigma_{\bar{x}} = \sigma / \sqrt{n}$
C. When the population distribution is normal so is the distribution of x̄ for any n.
D. For large n, the distribution of x̄ is approximately normal regardless of the population distribution.
Rule of thumb: n > 30 is large
Example: Suppose that we play a slot machine such that you can either double your bet or lose your bet. If there is a 45% chance of winning then the expected value for a dollar wager is
1(.45) + (-1)(.55) = -.1
We can compute the standard deviation:
x       p(x)    (x - μ)²    p(x)(x - μ)²
1       .45     1.21        .5445
-1      .55     .81         .4455
Total                       .99
The total, .99, is the variance, so the standard deviation for a single play is √.99 ≈ .995. If we throw 100 silver dollars into the slot machine then we expect to average a loss of ten cents with a standard deviation of
$$\sigma_{\bar{x}} = \frac{.995}{\sqrt{100}} = .0995$$
Notice that the standard deviation is very small. This is why the casinos are assured to make money. Now let us find the probability that the gambler does not lose any money, that is the mean is greater than or equal to 0.
We first compute the z-score. We have
$$z = \frac{0 - (-.1)}{.0995} = 1.01$$
Now we go to the table to find the associated probability. We get .8438. Since we want the area to the right, we subtract from 1 to get
P(z > 1.01) = 1 - P(z < 1.01) = 1 - .8438 = .1562
There is about a 16% chance that the gambler will not lose.
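The slot machine calculation can be reproduced with Python's standard `statistics.NormalDist`. This is only a sketch: the dictionary `outcomes` and the variable names are ours, and the result differs slightly from the table-based answer because no rounding of the z-score takes place.

```python
import math
from statistics import NormalDist

outcomes = {1: 0.45, -1: 0.55}   # payoff per dollar wagered -> probability
mu = sum(x * p for x, p in outcomes.items())
variance = sum(p * (x - mu) ** 2 for x, p in outcomes.items())
sigma = math.sqrt(variance)

n = 100
# Central Limit Theorem: the sample mean is roughly normal with SD sigma / sqrt(n)
sampling_dist = NormalDist(mu, sigma / math.sqrt(n))
p_not_losing = 1 - sampling_dist.cdf(0)
print(round(mu, 2), round(sigma, 3), round(p_not_losing, 3))
```

The printed probability should be close to the 16% found above.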
Sampling Distributions for Proportions
The last example was a special case of proportions, that is, Boolean data. From now on, we can use the following theorem.
The Central Limit Theorem for Proportions
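Let p be the probability of success and q be the probability of failure. The sampling distribution of the sample proportion p̂ for samples of size n is approximately normal with mean
$$\mu_{\hat{p}} = p$$
and standard deviation
$$\sigma_{\hat{p}} = \sqrt{\frac{pq}{n}}$$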
Example
The new Endeavor SUV has been recalled because 5% of the cars experience brake failure. The Tahoe dealership has sold 200 of these cars. What is the probability that fewer than 4% of the cars from Tahoe experience brake failure?
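Solution
We have
p = .05, q = .95, n = 200
We have
$$\mu_{\hat{p}} = p = .05 \qquad \sigma_{\hat{p}} = \sqrt{\frac{(.05)(.95)}{200}} = .0154$$
Fewer than 4% of the 200 cars means fewer than 8 cars, so next we want to find
P(x < 8)
Using the continuity correction, we find instead
P(x < 7.5)
This is equivalent to
P(p̂ < 7.5/200) = P(p̂ < .0375)
We find the z-score
$$z = \frac{.0375 - .05}{.0154} = -.81$$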
The table gives a probability of .2090. We can conclude that there is about a 21% chance that fewer than 4% of the cars will experience brake failure.
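As a hedged check of the brake failure example, the following sketch uses `statistics.NormalDist` in place of the printed table. The variable names are our own.

```python
import math
from statistics import NormalDist

p, n = 0.05, 200
sigma_p = math.sqrt(p * (1 - p) / n)   # standard deviation of the sample proportion

# Fewer than 8 cars, with the continuity correction: 7.5 cars out of 200
cutoff = 7.5 / n
prob = NormalDist(p, sigma_p).cdf(cutoff)
print(round(sigma_p, 4), round(prob, 3))
```

The probability comes out near 21%, in line with the table lookup.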
Significance level
This is among the more confusing terms. “Does a 5 percent significance level mean there is only a 5% chance that my results are significant?” The significance level is actually the alpha, or Type I risk. If the null hypothesis is true, there is a 5 percent chance of rejecting it because of random variation (luck).
Point Estimations
Usually, we do not know the population mean and standard deviation. Our goal is to estimate these numbers. The standard way to accomplish this is to use the sample mean and standard deviation as a best guess for the true population mean and standard deviation. We call this “best guess” a point estimate.
A Point Estimate is a statistic that gives a plausible estimate for the value in question.
Example: x̄ is a point estimate for μ, and s is a point estimate for σ.
A point estimate is unbiased if the mean of its sampling distribution equals the value that it is estimating.
Confidence Intervals
We are not only interested in finding the point estimate for the mean, but also determining how accurate the point estimate is. The Central Limit Theorem plays a key role here. We assume that the sample standard deviation is close to the population standard deviation (which will almost always be true for large samples). Then the Central Limit Theorem tells us that the standard deviation of the sampling distribution is
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \approx \frac{s}{\sqrt{n}}$$
We will be interested in finding an interval around x̄ such that there is a large probability that the actual mean falls inside of this interval. This interval is called a confidence interval and the large probability is called the confidence level.
Example Suppose that we check for clarity in 50 locations in Lake Tahoe and discover that the average depth of clarity of the lake is 14 feet with a standard deviation of 2 feet. What can we conclude about the average clarity of the lake with a 95% confidence level?
Solution We can use x̄ to provide a point estimate for μ and s to provide a point estimate for σ. How accurate is x̄ as a point estimate? We construct a 95% confidence interval for μ as follows. We draw the picture and realize that we need to use the table to find the z-score associated with the probability of .025 (there is .025 to the left and .025 to the right).
We arrive at z = -1.96. Now we solve for x:
$$-1.96 = \frac{x - 14}{2/\sqrt{50}} = \frac{x - 14}{0.28}$$
Hence
x - 14 = -0.55
We say that 0.55 is the margin of error. We have that a 95% confidence interval for the mean clarity is
(13.45, 14.55)
In other words, we are 95% confident that the mean clarity is between 13.45 and 14.55 feet.
In general if zc is the z value associated with c% then a c% confidence interval for the mean is
$$\bar{x} \pm z_c \frac{s}{\sqrt{n}}$$
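The general formula can be tried on the Lake Tahoe numbers with a small sketch. The helper `mean_ci` is a name of our own choosing, and `NormalDist().inv_cdf` stands in for the z-table.

```python
import math
from statistics import NormalDist

def mean_ci(sample_mean, s, n, confidence):
    """Large-sample confidence interval for a population mean."""
    z_c = NormalDist().inv_cdf(0.5 + confidence / 2)   # e.g. about 1.96 for 95%
    margin = z_c * s / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

low, high = mean_ci(14, 2, 50, 0.95)
print(round(low, 2), round(high, 2))
```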
Confidence Interval for a Small Sample
When the population is normal the sampling distribution will also be normal, but the use of s to replace σ is not that accurate. The smaller the sample size the worse the approximation will be. Hence we can expect that some adjustment will be made based on the sample size. The adjustment we make is that we do not use the normal curve for this approximation. Instead, we use the Student t distribution that is based on the sample size. We proceed as before, but we change the table that we use. This distribution looks like the normal distribution, but as the sample size decreases it spreads out. For large n it nearly matches the normal curve. We say that the distribution has n - 1 degrees of freedom.
Trees and Counting
Using Trees
We have seen that probability is defined by
$$P(E) = \frac{\text{Number in } E}{\text{Number in the Sample Space}}$$
Although this formula appears simple, counting the number in each can prove to be a challenge. Visual aids will help us immensely.
Example
A native flowering plant has several varieties. The color of the flower can be red, yellow, or white. The stems can be long or short and the leaves can be thorny, smooth, or velvety. Show all varieties.
Solution
We use a tree diagram. A tree diagram is a diagram that branches out and ends in leaves that correspond to the final variety. The picture below shows this.
To read this tree diagram, we begin from the start, then move along the branches collecting words until we get to the end. For example,
Always taking the upper path leads to the selection of a red long thorny plant.
Always taking the lower path leads to a white short velvety plant. We can count the total number of leaves (path endings) and get that there are 18 possible varieties.
Counting the leaves that came from long stems tells us that there are 9 possible long stemmed varieties.
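The leaves of such a tree can also be listed by a program. The sketch below uses `itertools.product`, which walks every branch combination in the same top-to-bottom order as the tree. The list names are ours.

```python
from itertools import product

colors = ["red", "yellow", "white"]
stems = ["long", "short"]
leaves = ["thorny", "smooth", "velvety"]

# Each tuple is one path through the tree: (color, stem, leaf)
varieties = list(product(colors, stems, leaves))
long_stemmed = [v for v in varieties if v[1] == "long"]

print(len(varieties), len(long_stemmed))   # 18 9
print(varieties[0], varieties[-1])         # upper path and lower path
```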
Example
A committee of three republican senators and four democratic senators is selected to investigate corporate securities fraud. Out of this committee two members are to be selected at random for a subcommittee on the energy sector.
A. What is the probability that both members will be republican?
B. What is the probability that both members will be democrat?
C. What is the probability of one of each?
Solution
We write a tree diagram
In this tree diagram, D represents democrat and R represents republican. The probabilities are given in the diagram.
To answer part A, we need to find
P(first is R and second is R)
This corresponds to the bottom leaf. As we travel to the bottom leaf, we pick up the two numbers
P(R and R) = (3/7)(1/3) = 1/7
To answer part B, we need to find
P(first is D and second is D)
This corresponds to the top leaf. We have
P(D and D) = (4/7)(1/2) = 2/7
To answer part C, we add the two middle leaves
P((D and R) or (R and D)) = (4/7)(1/2) + (3/7)(2/3) = 2/7 + 2/7 = 4/7
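As a cross-check that does not rely on the tree, we can list every possible two-person subcommittee and count. This is a sketch with names of our own (`committee`, `probability`); it treats all 21 pairs as equally likely.

```python
from fractions import Fraction
from itertools import combinations

committee = ["R"] * 3 + ["D"] * 4
pairs = list(combinations(range(len(committee)), 2))   # 21 equally likely subcommittees

def probability(parties):
    """Probability that the two chosen members have exactly these parties."""
    hits = sum(
        1 for i, j in pairs
        if sorted((committee[i], committee[j])) == sorted(parties)
    )
    return Fraction(hits, len(pairs))

print(probability("RR"), probability("DD"), probability("DR"))   # 1/7 2/7 4/7
```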
Permutations
Example
Suppose that 40 women try out for the newest play that has an all women cast of seven. You are the director. How many choices do you have?
Solution
The way to work this problem out is to consider the main role first. You have 40 choices for the main role. For the lead supporting actor there are 39 left to select from. For the next role there are 38 to select from. Following this pattern for all seven cast members gives a total number of choices of
40 . 39 . 38 . 37 . 36 . 35 . 34
We could multiply these all out, however there is an easier way. We can write
$$40 \cdot 39 \cdot 38 \cdot 37 \cdot 36 \cdot 35 \cdot 34 \cdot \frac{33 \cdot 32 \cdot 31 \cdots 2 \cdot 1}{33 \cdot 32 \cdot 31 \cdots 2 \cdot 1} = \frac{40!}{(40 - 7)!}$$
This expression has a special notation. We write
$$P_{40,7} = \frac{40!}{(40 - 7)!} = 93{,}963{,}542{,}400$$
We can see that there are plenty of choices.
In general we write
$$P_{n,r} = \frac{n!}{(n - r)!}$$
We call this a permutation.
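In Python 3.8 or later, the standard library function `math.perm` computes this count directly. The short sketch below also checks it against the factorial formula.

```python
import math

choices = math.perm(40, 7)   # ordered selections of 7 from 40 (Python 3.8+)
by_formula = math.factorial(40) // math.factorial(40 - 7)

print(f"{choices:,}")        # 93,963,542,400
print(choices == by_formula)
```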
Combinations
Example
How many 5 card poker hands are there?
Solution
We can solve this in a similar way as the prior question. We are selecting 5 cards out of 52 total. Unfortunately this is not quite a permutation since, for example, the hand
2H 3H 4H 5H 6H
is the same as the hand
3H 5H 2H 6H 4H
where “H” means hearts. That is, the order in which the cards are dealt does not matter. The number of ways of ordering 5 cards is 5! (five choices for the first card, four left for the second, three left for the third, two for the fourth, and one for the fifth). We divide by this number to get our solution
$$\frac{52!}{(52 - 5)! \, 5!}$$
We write this with the notation
$$C_{52,5} = 2{,}598{,}960$$
In general, we have
$$C_{n,k} = \frac{n!}{(n - k)! \, k!}$$
and call this a combination.
Example
The following was taken from the California state lottery web site:
“SuperLottoPlus is your chance to win millions of dollars! The jackpot ranges from $7 million to $50 million or more. The jackpot rolls over and grows whenever there is no winner. All you have to do is pick five numbers from 1 to 47 and one MEGA number from 1 to 27 and match them to the numbers drawn by the Lottery every Wednesday and Saturday.”
What is the probability of winning the lottery?
Solution
There is only one element in the event space: your numbers. For the sample space, first they pick 5 numbers from 47. There are
C47,5 = 1,533,939
ways of doing this.
Next they select a number from 1 to 27. There are 27 ways of doing this. We multiply to get
1,533,939 x 27 = 41,416,353
So your chances are about one in 41 million.
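Both counts in this section can be confirmed with `math.comb` (Python 3.8 or later). The variable names are our own.

```python
import math
from fractions import Fraction

poker_hands = math.comb(52, 5)
lottery_tickets = math.comb(47, 5) * 27   # five numbers from 47, then one MEGA number

print(f"{poker_hands:,}")        # 2,598,960
print(f"{lottery_tickets:,}")    # 41,416,353
print(Fraction(1, lottery_tickets))
```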
Probability Distributions
Random Variables
A variable whose value depends upon a chance experiment is called a random variable.
Suppose that a person is asked who that person is closest to: their mother or their father. The random variable of this experiment is the boolean variable whose possibilities are {Mother, Father}
A continuous random variable is a variable whose possible outcomes are part of a continuous data set.
Examples
The random variable that represents the height of the next person who walks in the room is a continuous random variable, while the random variable that represents the number rolled on a six sided die is not a continuous random variable.
A random variable that is not continuous is called a discrete random variable.
Probability Distributions
Example
Suppose we toss two dice. We will make a table of the probabilities for the sum of the dice. The possibilities are
2,3,4,5,6,7,8,9,10,11,12.
Probability Distribution Table
x       2     3     4     5     6     7     8     9     10    11    12
P(x)    1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
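The table can be generated by listing all 36 equally likely rolls and counting the sums. A minimal sketch, with variable names of our own:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 rolls give each sum
sums = Counter(a + b for a, b in product(range(1, 7), repeat=2))
distribution = {total: Fraction(count, 36) for total, count in sorted(sums.items())}

for total in distribution:
    print(total, f"{sums[total]}/36")
```

The probabilities add up to 1, as they must for any probability distribution.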
Exercise
Suppose that you buy a raffle ticket for $5. If 1,000 tickets are sold and there are 10 third place winners of $25, three second place winners of $100 and 1 grand prize winner of $2,000, construct a probability distribution table. Do not forget that if you have the $25 ticket, you will have won $20.
Expected Value: (Mean)
Example Insurance
When we buy insurance in blackjack, we lose the insurance bet if the dealer does not have blackjack and win twice the bet if the dealer does have blackjack.
Suppose you have $20 wagered and that you have a king and a 9 and the dealer has an ace. Should you buy insurance for $10?
Solution:
We construct a probability distribution table
x       P(x)
-10     34/49
20      15/49
(There are 49 cards that haven’t been seen; 15 of them are tens, jacks, queens, or kings and the other 34 are not.)
We define the
expected value = Σ x P(x)
We calculate:
-10(34/49) + 20(15/49) = -40/49
Hence the expected value is negative so that we should not buy insurance.
What if I am playing with my wife? My cards are a 2 and a 6 and my wife’s are a 7 and a 4. Should I buy insurance? Now 47 cards have not been seen and all 16 ten-valued cards are among them. We have:
x       P(x)
-10     31/47
20      16/47
We calculate:
-10(31/47) + 20(16/47) = 10/47 = 0.21
Hence my expected value is positive so that I should buy insurance.
Standard Deviation
We compute the standard deviation for a probability distribution function the same way that we compute the standard deviation for a sample, except that after squaring x - μ, we multiply by P(x). Also we do not need to divide by n - 1.
Consider the second insurance example:
x       P(x)     x - μ     (x - μ)²
-10     31/47    -10.21    104
20      16/47    19.79     392
Hence the variance is
104(31/47) + 392(16/47) = 202
and the standard deviation is the square root, that is 14.2.
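Both insurance examples can be recomputed with exact fractions. The helper `mean_and_sd` is a name of our own; it applies the definitions above, summing x·P(x) for the mean and P(x)(x - μ)² for the variance.

```python
import math
from fractions import Fraction

def mean_and_sd(distribution):
    """Expected value and standard deviation of a discrete distribution {x: P(x)}."""
    mu = sum(x * p for x, p in distribution.items())
    variance = sum(p * (x - mu) ** 2 for x, p in distribution.items())
    return mu, math.sqrt(variance)

first = {-10: Fraction(34, 49), 20: Fraction(15, 49)}
second = {-10: Fraction(31, 47), 20: Fraction(16, 47)}

mu1, _ = mean_and_sd(first)
mu2, sd2 = mean_and_sd(second)
print(mu1, mu2, round(sd2, 1))   # -40/49 10/47 14.2
```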
Combining Distributions
If we have two distributions with independent random variables x and y and if a and b are constants then if
L = a + bx and W = ax + by
then
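1. $\mu_L = a + b\mu_x$
2. $\sigma_L^2 = b^2\sigma_x^2$
3. $\sigma_L = |b|\sigma_x$
4. $\mu_W = a\mu_x + b\mu_y$
5. $\sigma_W^2 = a^2\sigma_x^2 + b^2\sigma_y^2$
6. $\sigma_W = \sqrt{a^2\sigma_x^2 + b^2\sigma_y^2}$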
Example
Gamblers who played both blackjack and craps were studied and it was found that the average amount of blackjack playing per weekend was 7 hours with a standard deviation of 3 hours. The average amount of craps play was 4 hours with a standard deviation of 2 hours.
A. What is the mean and standard deviation for the total amount of gaming?
Solution
Here a and b are 1 and 1. The mean is just
7 + 4 = 11
and the standard deviation is just
$$\sqrt{3^2 + 2^2} = \sqrt{13} \approx 3.6$$
B. If each player spends about $100 per hour on black jack and $200 per hour on craps, what will be the mean and standard deviation for the amount of money that the casino wins per person?
Solution
Here a and b are 100 and 200. The mean is
100(7) + 200(4) = 1,500
and the standard deviation is
$$\sqrt{100^2 \cdot 3^2 + 200^2 \cdot 2^2} = \sqrt{250{,}000} = 500$$
C. If the players spend $150 on the hotel, find the mean and standard deviation of the total amount of money that the players spend.
Here
L = 150 + x
where x is the result from part B. Hence the mean is
150 + 1500 = 1,650
and the standard deviation is the same as part B since the coefficient is 1.
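Parts A through C can be checked with a few lines of Python. The function `combine` is our own name for rules 4 and 6 of the list above, and it assumes, as the rules do, that the two variables are independent.

```python
import math

def combine(a, mean_x, sd_x, b, mean_y, sd_y):
    """Mean and standard deviation of W = a*x + b*y for independent x and y."""
    mean_w = a * mean_x + b * mean_y
    sd_w = math.sqrt(a**2 * sd_x**2 + b**2 * sd_y**2)
    return mean_w, sd_w

hours = combine(1, 7, 3, 1, 4, 2)            # part A: total hours of gaming
money = combine(100, 7, 3, 200, 4, 2)        # part B: dollars spent gaming
total = (150 + money[0], money[1])           # part C: adding a constant leaves the SD alone

print(hours[0], round(hours[1], 1))
print(money)
print(total)
```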
The Binomial Distribution
There is a type of distribution that occurs so frequently that it has a special name.
We call a distribution a binomial distribution if all of the following are true
1. There are a fixed number of trials, n, which are all independent.
2. The outcomes are Boolean, such as True or False, yes or no, success or failure.
3. The probability of success is the same for each trial.
For a binomial distribution with n trials with the probability of success p and failure q, we have
$$P(r \text{ successes}) = C_{n,r} \, p^r q^{n-r}$$
Example
Suppose that each time you take a free throw shot, you have a 25% chance of making it. If you take 15 shots,
A. What is the probability of making exactly 5 of them.
Solution
We have
n = 15 r = 5 p = .25 q = .75
Compute
$$C_{15,5} \, (.25)^5 (.75)^{10} = 0.165$$
There is a 16.5 percent chance of making exactly 5 shots.
B. What is the probability of making fewer than 3 shots?
Solution
The possible outcomes that will make this happen are 2 shots, 1 shot, and 0 shots. Since these are mutually exclusive, we can add these probabilities.
$$C_{15,2} \, (.25)^2 (.75)^{13} + C_{15,1} \, (.25)^1 (.75)^{14} + C_{15,0} \, (.25)^0 (.75)^{15}$$
= .156 + .067 + .013 = 0.236
There is a 24 percent chance of sinking fewer than 3 shots.
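The binomial formula translates directly into Python. The function name `binomial_pmf` is ours; `math.comb` supplies the combination count.

```python
import math

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    return math.comb(n, r) * p**r * (1 - p) ** (n - r)

exactly_5 = binomial_pmf(5, 15, 0.25)
fewer_than_3 = sum(binomial_pmf(r, 15, 0.25) for r in range(3))   # r = 0, 1, 2

print(round(exactly_5, 3), round(fewer_than_3, 3))   # 0.165 0.236
```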
Area Under the Normal Curve and the Binomial Distribution
The General Normal Distribution and Area
Typically the probability distribution does not follow the standard normal distribution, but does follow a general normal distribution. When this is the case, we compute the z-score first to convert it into a standard normal distribution. Then we can use the table.
Example The Tahoe Natural Coffee Shop morning customer load follows a normal distribution with mean 45 and standard deviation 8. Determine the probability that the number of customers tomorrow will be less than 42.
Solution
We first convert the raw score to a z-score. We have
$$z = \frac{42 - 45}{8} = -0.375$$
Next, we use the table to find the probability. The table gives .3520. (We have rounded the z-score to -0.38.)
We can conclude that
P(x < 42) = .352
That is there is about a 35% chance that there will be fewer than 42 customers tomorrow.
Example A study was done to determine the stress levels that students have while taking exams. The stress level was found to be normally distributed with a mean stress level of 8.2 and a standard deviation of 1.34. What is the probability that at your next exam, you will have a stress level between 9 and 10?
Solution
We want
P(9 < x < 10)
We compute the z-scores for each of these
$$z_9 = \frac{9 - 8.2}{1.34} = 0.60 \qquad z_{10} = \frac{10 - 8.2}{1.34} = 1.34$$
Now we want
P(0.60 < z < 1.34)
This is the “in between” type hence we subtract
P(0.60 < z < 1.34) = P(z < 1.34) - P(z < 0.60)
We use the table to get
P(0.60 < z < 1.34) = .9099 - .7257 = .1842
We conclude that there is about an 18 percent chance that the stress level will be between nine and ten.
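The two examples above can be repeated without a table by using `statistics.NormalDist`, which handles the conversion to z-scores internally. This is a sketch with names of our own; the answers differ slightly from the table values because the z-scores are not rounded.

```python
from statistics import NormalDist

customers = NormalDist(mu=45, sigma=8)
stress = NormalDist(mu=8.2, sigma=1.34)

p_fewer_than_42 = customers.cdf(42)
p_between_9_and_10 = stress.cdf(10) - stress.cdf(9)   # the "in between" type

print(round(p_fewer_than_42, 3), round(p_between_9_and_10, 3))
```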
Example Suppose that your wife is pregnant and due in 100 days. Suppose that the probability distribution for the day of the birth is approximately normal with mean 100 and standard deviation 8. You have a business trip and will return in 85 days and have to go on another business trip in 107 days.
A. What is the probability that the birth will occur before your second trip?
B. What is the probability that the birth will occur after you return from your first business trip?
C. What is the probability that you will be there for the birth?
D. You are able to cancel your second business trip, and your boss tells you that you can return home from your first trip so that there is a 99% chance that you will make it back for the birth. When must you return home?
Solution: A. We want
P(x < 107)
We compute the z-score:
$$z = \frac{107 - 100}{8} = .88$$
We compute
P(z < .88)
The table on the inside front cover gives us
P(z < .88) = .8106
Hence there is about an 81% chance that the baby will be born before the second business trip.
B. We want
P(x > 85)
We compute the z-score:
$$z = \frac{85 - 100}{8} = -1.88$$
We compute
P(z > -1.88)
The table on the inside front cover gives us
P(z < -1.88) = .0301
We want the complement of this area hence
P(z > -1.88) = 1 - .0301 = .9699
Hence there is about a 97% chance that the baby will be born after the first business trip.
C. Now we want P( 85 < x < 107)
We see from the picture that this is the middle region. We have
P(85 < x < 107) = P(x < 107) - P(x < 85)
We have already computed these. We have
P(85 < x < 107) = P(x < 107) - P(x < 85) = 81% - 3% = 78%
There is about a 78% chance that you will make it to the birth.
D. This problem asks us to work out the math backwards. We are given the probability and we want the raw score. First, we realize that if there is a 99% chance that we will make it on time, then there is a 1% chance that we will not. Next, we use the table in reverse. That is, we seek a z-score that gives .01 as the probability.
z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09
-2.4 .0082 .0080 .0078 .0075 .0073 .0071 .0069 .0068 .0066 .0064
-2.3 .0107 .0104 .0102 .0099 .0096 .0094 .0091 .0089 .0087 .0084
-2.2 .0139 .0136 .0132 .0129 .0125 .0122 .0119 .0116 .0113 .0110
We search for the probability value that is closest to .01 and find .0102 and .0099. Since .0099 is the closest to .01, we use this value. The corresponding z-score is -2.33. Now we find the x that produces this z. We have
$$-2.33 = \frac{x - 100}{8}$$
Multiply both sides by 8 to get
-18.64 = x - 100
Add 100 to both sides to get
x = 81.36
We must return from our business trip in 81 days.
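All four parts of this example can be reproduced with `statistics.NormalDist`. Part D uses `inv_cdf`, which plays the role of reading the table in reverse. The variable names are ours, and small differences from the table-based answers come from rounding the z-scores.

```python
from statistics import NormalDist

birth = NormalDist(mu=100, sigma=8)

p_before_second_trip = birth.cdf(107)              # part A
p_after_first_trip = 1 - birth.cdf(85)             # part B
p_present = birth.cdf(107) - birth.cdf(85)         # part C
latest_return = birth.inv_cdf(0.01)                # part D: day with 1% of births before it

print(round(p_before_second_trip, 2), round(p_after_first_trip, 2))
print(round(p_present, 2), round(latest_return, 1))
```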
Using the Normal Distribution to Approximate the Binomial Distribution
The Binomial Distribution is easy to calculate as long as we only need a few values. However, if we need many values, the computation can be extremely tedious.
Example Suppose you throw a die 1000 times. What is the probability of having it roll a 6 fewer than 160 times?
Solution The horrible way of figuring this out is to calculate $C_{1000,r} \, (1/6)^r (5/6)^{1000 - r}$ for every r between 0 and 159 and add the results. We have better things to do with our time than do this. Instead we will approximate the answer. The graph of the distribution is shown on the right. As you may have already guessed, this distribution is very close to being normal. We give the following theorem
Theorem: Normal Approximation to the Binomial Distribution
If a binomial distribution with probability of success p and failure q and n trials is such that
1. np > 5
2. nq > 5
Then the distribution can be approximated by a normal distribution with mean
μ = np
and standard deviation
$$\sigma = \sqrt{npq}$$
Now we can continue with our example. We have
np = (1000)(1/6) = 166.67 > 5 and nq = (1000)(5/6) = 833 > 5
Thus we can use the normal distribution. We have
μ = np = (1000)(1/6) = 166.67
and σ² = npq = (1000)(1/6)(5/6) = 138.89
Taking a square root gives
σ = 11.79
Now we can compute the z-score, since we want P(x < 160). We have
$$z = \frac{160 - 166.67}{11.79} = -0.57$$
Now we use the table to find the probability. We get .2843. Thus there is about a 28% chance that we will roll a six fewer than 160 times.
Continuity Correction We can achieve a slightly more accurate approximation with what is called the continuity correction. We looked at P(x < 160). However this is the same as P(x < 159.5) or any such fraction. When using the normal distribution to approximate the binomial distribution, we correct by this .5 value.
Example Each year a squirrel has a 35% chance of surviving the winter. Suppose in a patch of land there are 200 squirrels. What is the probability that between 65 and 80 of these squirrels will survive the winter?
Solution We first check
np = (200)(.35) = 70 > 5 and nq = (200)(.65) = 130 > 5
Thus we can use the normal curve approximation. We have
μ = np = (200)(.35) = 70
and σ² = npq = (200)(.35)(.65) = 45.5
Taking a square root gives
σ = 6.7
Instead of using P(65 < x < 80), we use the continuity correction and find P(64.5 < x < 80.5). We compute the two z-scores.
$$z = \frac{64.5 - 70}{6.7} = -.82$$
and
$$z = \frac{80.5 - 70}{6.7} = 1.57$$
Now we use the table to find the probabilities. We get .2061 and .9418. Since we want the middle area, we subtract these
.9418 - .2061 = .7357
Thus there is about a 74% chance that there will be between 65 and 80 surviving squirrels.
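With a computer the tedious exact binomial sum is no longer a problem, so we can compare it with the normal approximation. The sketch below does this for the squirrel example; the variable names are our own and `NormalDist` replaces the table.

```python
import math
from statistics import NormalDist

n, p = 200, 0.35
mu, sigma = n * p, math.sqrt(n * p * (1 - p))

# Normal approximation with the continuity correction: P(64.5 < x < 80.5)
normal = NormalDist(mu, sigma)
approx = normal.cdf(80.5) - normal.cdf(64.5)

# Exact binomial sum for comparison: P(65 <= x <= 80)
exact = sum(math.comb(n, r) * p**r * (1 - p) ** (n - r) for r in range(65, 81))

print(round(approx, 3), round(exact, 3))
```

The two values should agree closely, which is the point of the theorem.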
Correlation
Residuals
Suppose that the average lifespan for people who smoke is:
Packs Per Week Life Span
1 72
2 70
3 69
5 68
We can calculate the least squares regression line:
y = 72.3 - .943x
We define the first residual to be the difference between the first lifespan and the first estimated lifespan:
72 - (72.3 - .943(1)) = .643
the second residual as:
70 - (72.3 - .943(2)) = -.414
the third as:
69 - (72.3 - .943(3)) = -.471
and the fourth as
68 - (72.3 - .943(5)) = .415
In general, the residual is
$$y_i - \hat{y}_i = y_i - (a + bx_i)$$
Coefficient of determination: r²
We define the coefficient of determination as an indication of how linear the data is. r² has the following properties:
Properties of the Coefficient of Determination
r² is between 0 and 1.
If r² = 1 then all points lie on a line (perfectly linear).
If r² = 0 then the regression line is a useless indicator for predicting y values.
Construction To compute r², do the following:
a. Compute the sum of the squares of the residuals: SSResid
b. Compute Σy² and (Σy)². We say that
SSTo = Σy² - (Σy)²/n
c. Compute
1 - SSResid/SSTo
This is r².
If we multiply r² by 100%, we arrive at the percent of the observed variation attributable to the linear relationship.
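The construction can be followed step by step in Python for the smoking data. This sketch fits the least squares line with the usual slope and intercept formulas (an assumption on our part, since the notes only quote the fitted line) and then applies steps a through c. Because it keeps the unrounded intercept, its residuals differ slightly from the hand-computed ones above.

```python
packs = [1, 2, 3, 5]
lifespan = [72, 70, 69, 68]
n = len(packs)

mean_x, mean_y = sum(packs) / n, sum(lifespan) / n
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(packs, lifespan))
s_xx = sum((x - mean_x) ** 2 for x in packs)
b = s_xy / s_xx                  # slope of the least squares line
a = mean_y - b * mean_x          # intercept

residuals = [y - (a + b * x) for x, y in zip(packs, lifespan)]
ss_resid = sum(e ** 2 for e in residuals)                         # step a
ss_to = sum(y ** 2 for y in lifespan) - sum(lifespan) ** 2 / n    # step b
r_squared = 1 - ss_resid / ss_to                                  # step c

print(round(a, 1), round(b, 3), round(r_squared, 3))   # 72.3 -0.943 0.889
```

Here about 89% of the observed variation in lifespan is attributable to the linear relationship.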
Correlation: r
If we want to determine not just whether the variables are linearly related, but also whether the relationship is positive or negative (b > 0 or b < 0), and we want the calculation to be unitless, we compute Pearson’s correlation coefficient r.
We have
(r)² = r²
that is, the square of the correlation coefficient is equal to the coefficient of determination.
If r < 0 then they are negatively correlated. If r > 0 then they are positively correlated.
We say that the correlation is
strong if |r| > .8
moderate if .5 < |r| < .8
and weak otherwise.
Correlation does not imply causation. For example there may be a strong correlation between grayness in hair and wrinkles, but having gray hair does not cause one to have wrinkles.
Estimating Differences
Difference Between Means
I surveyed 50 people from a poor area of town and 70 people from an affluent area of town about their feelings towards minorities. I counted the number of negative comments made. I was interested in comparing their attitudes. The average number of negative comments in the poor area was 14 and in the affluent area was 12. The standard deviations were 5 and 4 respectively. Let’s determine a 95% confidence interval for the difference in mean negative comments. First, we need some formulas.
Theorem
The distribution of the difference of means x̄1 - x̄2 has mean
μ1 - μ2
and standard deviation
$$\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
For our investigation, we use s1 and s2 as point estimates for σ1 and σ2. We have
x̄1 = 14, x̄2 = 12, s1 = 5, s2 = 4, n1 = 50, n2 = 70
Now calculate
x̄1 - x̄2 = 14 - 12 = 2
$$s = \sqrt{\frac{5^2}{50} + \frac{4^2}{70}} = 0.85$$
The margin of error is
E = zc s = (1.96)(0.85) = 1.7
The confidence interval is 2 ± 1.7
or [0.3, 3.7]
We can conclude that the mean difference between the number of negative comments that people in the poor and affluent areas make is between 0.3 and 3.7.
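The large-sample interval for a difference of means can be sketched in Python as follows. The function `diff_means_ci` is a name of our own, and `NormalDist().inv_cdf` supplies the critical value that the notes read from a table.

```python
import math
from statistics import NormalDist

def diff_means_ci(mean1, s1, n1, mean2, s2, n2, confidence=0.95):
    """Large-sample confidence interval for mean1 - mean2."""
    z_c = NormalDist().inv_cdf(0.5 + confidence / 2)
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)
    diff = mean1 - mean2
    return diff - z_c * se, diff + z_c * se

low, high = diff_means_ci(14, 5, 50, 12, 4, 70)
print(round(low, 1), round(high, 1))
```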
Confidence Intervals For Proportions and Choosing the Sample Size
A Large Sample Confidence Interval for a Population Proportion
Recall that a confidence interval for a population mean is given by
Confidence Interval for a Population Mean
$$\bar{x} \pm z_c \frac{s}{\sqrt{n}}$$
We can make a similar construction for a confidence interval for a population proportion. Instead of x̄, we can use p̂ and instead of s, we use $\sqrt{\hat{p}(1 - \hat{p})}$; hence, we can write the confidence interval for a large sample proportion as
Confidence Interval Margin of Error for a Population Proportion
$$\hat{p} \pm z_c \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
Example 1000 randomly selected Americans were asked if they believed the minimum wage should be raised. 600 said yes. Construct a 95% confidence interval for the proportion of Americans who believe that the minimum wage should be raised.
Solution: We have
p̂ = 600/1000 = .6, zc = 1.96, and n = 1000
We calculate:
$$.6 \pm 1.96\sqrt{\frac{(.6)(.4)}{1000}} = .6 \pm .03$$
Hence we can conclude that between 57 and 63 percent of all Americans agree with the proposal. In other words, with a margin of error of .03, 60% agree.
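The minimum wage example can be verified with a short sketch. The function `proportion_ci` is a name of our own, and the critical value comes from `NormalDist().inv_cdf` rather than a table.

```python
import math
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    """Large-sample confidence interval for a population proportion."""
    p_hat = successes / n
    z_c = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z_c * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

low, high = proportion_ci(600, 1000)
print(round(low, 2), round(high, 2))
```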
Calculating n for Estimating a Mean
Example Suppose that you were interested in the average number of units that students take at a two year college to get an AA degree. Suppose you wanted to find a 95% confidence interval with a margin of error of .5, knowing that σ = 10. How many people should we ask?
Solution
Solving for n in
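Margin of Error = E = zc σ/√n
we have
√n E = zc σ
√n = zc σ / E
Squaring both sides, we get
$$n = \left(\frac{z_c \sigma}{E}\right)^2$$
We use the formula:
$$n = \left(\frac{(1.96)(10)}{.5}\right)^2 = 1536.64$$
so we should ask at least 1,537 people.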
Example A Subaru dealer wants to find out the age of their customers (for advertising purposes). They want the margin of error to be 3 years old. If they want a 90% confidence interval, how many people do they need to know about?
Solution:
We have
E = 3, zc = 1.65
but there is no way of finding sigma exactly. They use the following reasoning: most car customers are between 16 and 68 years old hence the range is
Range = 68 - 16 = 52
The range covers about four standard deviations hence one standard deviation is about
52/4 = 13
We can now calculate n:
$$n = \left(\frac{(1.65)(13)}{3}\right)^2 = 51.1$$
Hence the dealer should survey at least 52 people.
Finding n to Estimate a Proportion
Example Suppose that you are in charge of determining whether dropping a computer will damage it. You want to find the proportion of computers that break. If you want a 90% confidence interval for this proportion, with a margin of error of 4%, how many computers should you drop?
Solution
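The formula states that
$$E = z_c \sqrt{\frac{p(1 - p)}{n}}$$
Squaring both sides, we get that
$$E^2 = \frac{z_c^2 \, p(1 - p)}{n}$$
Multiplying by n, we get
nE² = zc²[p(1 - p)]
Dividing by E² gives
$$n = \frac{z_c^2 \, p(1 - p)}{E^2}$$
This is the formula for finding n. Since we do not know p, we use .5 (a conservative estimate):
$$n = \frac{(1.65)^2 (.5)(.5)}{(.04)^2} = 425.4$$
We round 425.4 up so that the margin of error is no larger than requested.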
We will need to drop at least 426 computers. This could get expensive.
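Both sample size formulas from this section fit in a few lines of Python. The function names are our own, and the critical values are the table values used in the examples. `math.ceil` does the rounding up.

```python
import math

def n_for_mean(z_c, sigma, margin):
    """Smallest n with z_c * sigma / sqrt(n) <= margin."""
    return math.ceil((z_c * sigma / margin) ** 2)

def n_for_proportion(z_c, margin, p=0.5):
    """Smallest n with z_c * sqrt(p(1 - p)/n) <= margin; p = .5 is the conservative choice."""
    return math.ceil(z_c**2 * p * (1 - p) / margin**2)

print(n_for_mean(1.65, 13, 3))        # 52
print(n_for_proportion(1.65, 0.04))   # 426
```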
Goodness of Fit
Before the Gondola was in operation, Heavenly tracked its skiers and boarders and found the following
Type Percent of Skiers
Beginner 30%
Intermediate 40%
Advanced 20%
Expert 10%
With the new gondola in place the ski resort wants to determine if the distribution has changed. They tracked 2000 skiers and boarders and came up with the following
Type Observed Count
Beginner 590
Intermediate 860
Advanced 400
Expert 150
What can be concluded (use $\alpha = 0.05$)?
Solution
We first determine the null and alternative hypotheses.
H0: The new population of skiers and boarders follows the same distribution as the old distribution of skiers.
H1: The new population of skiers and boarders does not follow the same distribution as the old distribution of skiers.
Next we compute the expected counts by multiplying the sample size 2000 by the expected percent.
Type          Observed Count   Expected Count   (O - E)^2   (O - E)^2 / E
Beginner      590              600              100         0.167
Intermediate  860              800              3600        4.5
Advanced      400              400              0           0
Expert        150              200              2500        12.5
Now add up the total of the last column to get
$\chi^2 = 0.167 + 4.5 + 0 + 12.5 = 17.17$
The number of degrees of freedom is
df = n - 1 = 4 - 1 = 3
where n is the number of rows in the table.
We use the table for the Chi Square distribution. The critical value that corresponds to a level of significance of .05 with 3 degrees of freedom is 7.81.
Since
17.17 > 7.81
we can reject the null hypothesis and accept the alternative hypothesis and conclude that the distribution of skiers and boarders has changed.
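The goodness of fit computation above can be reproduced with a few lines of Python. This is only a sketch: the function name is an illustrative choice, and the critical value 7.81 is copied from the chi-square table lookup in the text rather than computed.

```python
def chi_square_statistic(observed, expected_share):
    """Sum of (O - E)^2 / E, with E = sample size times the expected share."""
    total = sum(observed)
    expected = [total * share for share in expected_share]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [590, 860, 400, 150]          # beginner, intermediate, advanced, expert
expected_share = [0.30, 0.40, 0.20, 0.10]

chi2 = chi_square_statistic(observed, expected_share)
degrees_of_freedom = len(observed) - 1
critical_value = 7.81                    # table value for alpha = 0.05, 3 df (from the text)
reject_null = chi2 > critical_value

print(round(chi2, 2), degrees_of_freedom, reject_null)
```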
Analyzing the Regression Line
Estimating Sigma The correlation provides us with an estimate of how linear the data are. We would also like to know how close the data are to the regression line. We use a measurement $s_e$, which is a point estimate for the standard deviation of the residuals. If $s_e$ is large then the points lie far from the line, and if it is small then the points are close to the line.
We have an empirical rule that says that:
approximately 95% of the points lie within $2s_e$ of the line.
The point estimate for $\alpha$ is a and the point estimate for $\beta$ is b. Some assumptions that we make on the error e from $y = \alpha + \beta x$ are
e has mean value 0.
e has standard deviation $\sigma$, which does not depend on x.
The distribution of e is normal.
The e's for different x's are independent of one another.
A point estimate for $\sigma^2$ is given by
$s_e^2 = \frac{SSResid}{n - 2}$
and the point estimate for $\sigma$ is its square root, $s_e$.
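To make the definition of $s_e$ concrete, the sketch below fits a least-squares line and computes SSResid and $s_e$. The five data points are hypothetical and chosen only so the arithmetic is easy to follow; the function name is also an illustrative choice.

```python
import math

def fit_line_and_residual_sd(xs, ys):
    """Return intercept a, slope b, and s_e = sqrt(SSResid / (n - 2))."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx                      # least-squares slope
    a = y_bar - b * x_bar                # least-squares intercept
    ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, math.sqrt(ss_resid / (n - 2))

# Hypothetical data, not from the notes
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b, s_e = fit_line_and_residual_sd(xs, ys)
print(round(a, 2), round(b, 2), round(s_e, 3))
```

For this toy data the fitted line is y = 2.2 + 0.6x with SSResid = 2.4, so $s_e^2 = 2.4 / 3 = 0.8$.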
Inferences on the Slope Suppose that the equation of the regression line calculated from the data is
y = a + bx.
Can we trust this b? In other words, if the true equation of the regression line is
$y = \alpha + \beta x$,
is b a good point estimate for $\beta$? We can estimate the standard deviation of b by the formula
$s_b = \frac{s_e}{\sqrt{\sum (x - \bar{x})^2}}$
The t statistic is
$t = \frac{b - \beta}{s_b}$ with n - 2 degrees of freedom
We can form a confidence interval for $\beta$ as
$b \pm t_c s_b$
To interpret this confidence interval, for example, we can say that we are 95% confident that the true slope of the regression line is between two and three.
If the slope of the regression line is 0 then the regression line is useless. Hence it is typical to test the hypothesis
$H_0: \beta = 0$   $H_a: \beta \neq 0$
We use the t statistic
$t = \frac{b - 0}{s_b}$
and proceed as usual.
Example
Suppose that we have computed the regression line that corresponds to education (years of college) vs. income as
$\hat{y} = 15{,}000 + 5x$
with 200 data points and have $s_b = 2$.
Use $\alpha = 0.05$.
Then we have $H_0: \beta = 0$   $H_a: \beta \neq 0$
and
t = 5/2 = 2.5
giving a p-value between 0.01 and 0.02. Since $p < \alpha$, we can reject $H_0$ and accept $H_a$ and conclude that the regression line is useful for predicting the income based on college years. We can make a 95% confidence interval for the slope:
$5 \pm 1.96(2)$ or
[1.08, 8.92]
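The slope test and interval in this example can be sketched in Python as follows. Because there are 198 degrees of freedom, the sketch approximates the t distribution with the standard normal distribution; that approximation and the helper name `standard_normal_cdf` are assumptions of this sketch, not something stated in the notes.

```python
import math

def standard_normal_cdf(z):
    """Cumulative probability of the standard normal distribution."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

b, s_b, alpha = 5.0, 2.0, 0.05

# Two-tailed test of H0: beta = 0
t = (b - 0) / s_b
p_value = 2 * (1 - standard_normal_cdf(abs(t)))   # normal approximation for large df
reject_null = p_value < alpha

# 95% confidence interval for the slope, using z_c = 1.96 as in the text
z_c = 1.96
interval = (b - z_c * s_b, b + z_c * s_b)

print(t, round(p_value, 4), reject_null, interval)
```

The approximate p-value falls between 0.01 and 0.02, in agreement with the table lookup described above.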
Testing if There is a Correlation We have talked about the correlation being weak, moderate, or strong; however, with a small sample such a judgment may not be reliable. Next we will set up a hypothesis test on whether there is a correlation between the two variables. If there is no correlation then the population correlation coefficient will be 0. Otherwise it will not be 0. We can also test to see if there is a positive or negative correlation. As you may guess, the difference in the test for a correlation, a positive correlation, or a negative correlation will be whether we use a two tailed test, a right tailed test, or a left tailed test. We will use the Greek letter $\rho$, pronounced “rho”, for the population correlation and r for the sample correlation. The test statistic will be given by
$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$
Notice that this is a “t” statistic. We have
degrees of freedom = n - 2
Notice that the larger the sample size (with the same r), the larger the t value. Also, a larger r will produce a larger t value.
Example A study was done to see if there is a positive correlation between the number of times per month that college students call home and the amount of money that their parents contribute towards their education. 175 students were surveyed and the correlation was found to be 0.18. What can be said at the 0.05 level of significance?
Solution First we write down the null and alternative hypotheses:
$H_0: \rho = 0$
$H_1: \rho > 0$
We compute the test statistic
$t = \frac{0.18\sqrt{175 - 2}}{\sqrt{1 - 0.18^2}} \approx 2.41$
Since the sample size is large, we can use the normal distribution (z-table) to approximate the P-value. Notice also that this is a right tailed test so we need to subtract the table value from 1. We have
P = 1 - 0.9920 = 0.008
Since P is less than 0.05, we can conclude that there is a positive correlation between the number of times per month that students call home and the amount of money that their parents contribute towards their education.
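The calculation in this solution can be checked with a brief Python sketch. It uses the standard t statistic for a correlation, $t = r\sqrt{n - 2} / \sqrt{1 - r^2}$, and the same normal approximation to the P-value that the solution describes; the function names are illustrative.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for testing rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

def standard_normal_cdf(z):
    """Cumulative probability of the standard normal distribution."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

t = correlation_t_statistic(0.18, 175)
p_right_tail = 1 - standard_normal_cdf(t)   # right tailed test, large-sample approximation

print(round(t, 2), round(p_right_tail, 3))
```

Trying the same r with a smaller n (say 30) shows how the t value shrinks, which is the sample-size effect noted earlier.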
Remark: We were able to conclude that there is strong statistical evidence of a positive correlation. On the other hand the correlation of 0.18 is a weak correlation. Try not to confuse strong evidence to show a correlation with a strong correlation. Also we can not conclude that calling parents frequently will induce parents to send more money. We have established correlation, not causation.
Remark: If the correlation is 0, then so is the slope. It turns out that the test statistic for the slope is the same as the test statistic for the correlation. Computers will usually provide the P-value for testing the slope. This is the same as the P-value for testing the correlation.
ANOVA We saw that for paired differences we had the hypothesis that
$H_0: \mu_1 = \mu_2$   $H_1: \mu_1 \neq \mu_2$
What if there are many types and we want to see if it makes a difference which we look at? We could test the appropriate hypothesis for each pair, but if we test enough pairs, then we are bound to find two that appear different even if none really are (eventually we will get unlucky). For example, if there were 12 means that we wanted to test to see if they were all the same, then we would have to test C(12,2) = 66 pairs. If we used a level of significance of 0.05 and the populations actually had the same means, then we could expect that on average we would reject the null hypothesis (66)(0.05) = 3.3 times. We need a test where it is rare to make the mistake of saying that a pair of means differ when they really are the same.
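The counting argument above is easy to verify. The following few lines of Python (variable names are illustrative) compute the number of pairs and the expected number of false rejections when all 12 population means are in fact equal.

```python
import math

num_means = 12
alpha = 0.05

num_pairs = math.comb(num_means, 2)            # C(12, 2) pairwise tests
expected_false_rejections = num_pairs * alpha  # average number of wrong rejections

print(num_pairs, round(expected_false_rejections, 1))
```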
Instead we will do an ANOVA (ANalysis Of VAriance) test for multiple means. We call each population a treatment. The test statistic turns out to be the “F” statistic that we saw when comparing variances. The calculation to arrive at the test statistic is quite complicated, hence we will assume that the reader will either consult a textbook for this or rely on a computer.
In order to use ANOVA, one must make the following (not always reasonable) assumptions.
1. Each of the populations follows an approximately normal distribution.
2. Each sample is randomly selected and independent of every other sample.
3. The standard deviations of all the populations are approximately equal to each other.
Since we are looking at an F statistic, we need the numerator and denominator degrees of freedom. Let N be the total sample size, that is, the sum of each of the sample sizes. Let k be the number of treatments (number of samples). Then
d.f.N = k - 1 (numerator degrees of freedom)
d.f.D = N - k (denominator degrees of freedom)
ANOVA is always a right tailed test, hence the table will give the true P-value (we never need to multiply by 2).
Example
Does it make a difference which type of car we buy in terms of cost of maintenance? We test 10 American, 20 Japanese, 30 Korean, and 44 German cars. Suppose that the F-statistic was computed to be 3.2. What can be concluded at the 0.05 level of significance?
Solution We have
$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$
$H_1: \mu_i \neq \mu_j$ for some $i \neq j$ (at least two are different)
We use two types of degrees of freedom:
Let
N = the total of all the n
and k = the number of samples.
In our case
N = 10 + 20 + 30 + 44 = 104 and
k = 4 then we have
k - 1 = numerator degrees of freedom and N - k = denominator degrees of freedom
In our example we have
3 numerator degrees of freedom and 100 denominator degrees of freedom.
Since F = 3.2, we use the table to find that
0.025 < P < 0.050
In particular, the P-value is smaller than the level of significance, hence we can reject the null hypothesis and accept the alternative hypothesis. Hence, we can conclude that it does make a difference which car one buys in terms of average cost of maintenance.
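The degrees of freedom in this example, and the kind of calculation a computer performs to obtain F, can be sketched in Python. The notes do not show the F computation, so the second function below is the standard textbook ratio of between-group to within-group mean squares, shown on small hypothetical data only; both function names are illustrative.

```python
def anova_degrees_of_freedom(sample_sizes):
    """Return (k - 1, N - k) for k samples with total size N."""
    k = len(sample_sizes)
    total = sum(sample_sizes)
    return k - 1, total - k

def one_way_f_statistic(groups):
    """F = (SS_between / (k - 1)) / (SS_within / (N - k))."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Car maintenance example: 10 American, 20 Japanese, 30 Korean, 44 German
df_num, df_den = anova_degrees_of_freedom([10, 20, 30, 44])

# Hypothetical toy data (not from the notes) to show the F calculation
toy_f = one_way_f_statistic([[1, 2, 3], [2, 3, 4], [3, 4, 5]])

print(df_num, df_den, toy_f)
```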
Two Way ANOVA
One Observation in Each Cell
Factor 1: Major   Factor 2: Class Status
            Freshman  Sophomore  Junior  Senior
Science     2.8       3.1        3.2     2.7
Humanities  3.3       3.5        3.6     3.1
Other       3.0       3.2        2.9     3.0
In the prior discussion, we saw that there is a way of testing to see if the means of several populations are all the same. Often, there are two factors involved and we want to see if the means are different within each factor. We will explain how this works with an example.
Example Suppose we want to look at students' GPAs based on the type of their major (science, humanities, other) and their class status (freshman, sophomore, junior, senior). We will use a level of significance of 0.05. We find the GPA of one randomly selected student for each of the 12 possible combinations. The results are shown in the table above.
Since there are two factors given, major and class status, we will have two separate pairs of hypotheses.
$H_0$: There is no difference in population mean GPA based on major.
$H_1$: At least two majors have a different population mean GPA.
and
$H_0$: There is no difference in population mean GPA based on class status.
$H_1$: At least two class statuses have a different population mean GPA.
For the same reason we used the technique of ANOVA for a one-way table in the previous discussion, we will use ANOVA for this situation. In order to proceed, we need to make the following assumptions:
The measurements in each cell were selected randomly from a normal distribution.
The distributions from the cells all have the same standard deviation.
The values of each cell come from independent samples.
There are the same number of measurements in each cell (in the above example there was only one measurement taken per cell).
The calculation of the F statistic is not that enlightening to the elementary statistics student, so we will assume that a computer will be used for this calculation. The program (StatCrunch) that we have been using does not support 2-Way ANOVA; however there are free applets that do support 2-Way ANOVA. One such applet can be found at http://home.ubalt.edu/ntsbarsh/Business-stat/otherapplets/ANOVATwo.htm. We can think of the first factor as the block the student is in and the second factor as a treatment that the student is given. This is the terminology that is used in the applet.
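For readers without access to an applet, the following Python sketch shows one standard way to compute the two F statistics for a two-way table with one observation per cell. It is not taken from the notes or from the applet: it uses the usual textbook decomposition of the total sum of squares into row, column, and error parts, and the function name is hypothetical. The data are the GPA values from the table in this section.

```python
def two_way_anova_one_per_cell(table):
    """Return (F for rows, F for columns) for a table with one value per cell."""
    rows, cols = len(table), len(table[0])
    grand_mean = sum(sum(row) for row in table) / (rows * cols)
    row_means = [sum(row) / cols for row in table]
    col_means = [sum(table[i][j] for i in range(rows)) / rows for j in range(cols)]

    ss_rows = cols * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = rows * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in table for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_error = ss_error / ((rows - 1) * (cols - 1))
    f_rows = (ss_rows / (rows - 1)) / ms_error
    f_cols = (ss_cols / (cols - 1)) / ms_error
    return f_rows, f_cols

# Rows: Science, Humanities, Other; columns: Freshman, Sophomore, Junior, Senior
gpa = [
    [2.8, 3.1, 3.2, 2.7],
    [3.3, 3.5, 3.6, 3.1],
    [3.0, 3.2, 2.9, 3.0],
]
f_major, f_class = two_way_anova_one_per_cell(gpa)
print(round(f_major, 2), round(f_class, 2))
```

The major factor has 2 and 6 degrees of freedom and the class status factor has 3 and 6. Each F value would then be compared with the corresponding critical value from an F table at the 0.05 level.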
Sample Proportions and Point Estimation:
Sample Proportions Let p be the proportion of successes of a sample from a population whose total proportion of successes is $\pi$, and let $\mu_p$ be the mean of p and $\sigma_p$ be its standard deviation.
Then
The Central Limit Theorem For Proportions
1. $\mu_p = \pi$
2. $\sigma_p = \sqrt{\frac{\pi(1 - \pi)}{n}}$
3. For n large, p is approximately normal.
Example
Consider the next census. Suppose we are interested in the proportion of Americans that are below the poverty level. Instead of attempting to find all Americans, Congress has proposed to perform statistical sampling. We can concentrate on 10,000 randomly selected people from 1000 locations. We can determine the proportion of people below the poverty level in each of these regions. Suppose this proportion is .08. Then the mean for the sampling distribution is
$\mu_p = 0.08$
and the standard deviation is
$\sigma_p = \sqrt{\frac{0.08(1 - 0.08)}{10{,}000}} \approx 0.0027$
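A short Python sketch of this calculation follows. It assumes the sample size is the 10,000 people mentioned in the example and uses the standard formula $\sqrt{\pi(1 - \pi)/n}$ for the standard deviation of a sample proportion; the function name is illustrative.

```python
import math

def sampling_distribution_of_proportion(pi, n):
    """Mean and standard deviation of the sample proportion for sample size n."""
    return pi, math.sqrt(pi * (1 - pi) / n)

# Census example: proportion 0.08, assumed sample size 10,000
mu_p, sigma_p = sampling_distribution_of_proportion(0.08, 10_000)
print(mu_p, round(sigma_p, 4))
```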
Point Estimations A Point Estimate is a statistic that gives a plausible estimate for the value in question.
Example $\bar{x}$ is a point estimate for $\mu$, and s is a point estimate for $\sigma$.
A point estimate is unbiased if its mean equals the value that it is estimating.
Paired Differences
Paired Data: Hypothesis Tests
Example
Is success determined by genetics?
The best such survey is one that investigates identical twins who have been reared in two different environments, one that is nurturing and one that is non-nurturing. We could measure the difference in high school GPAs between each pair. This is better than just pooling each group individually. Our hypotheses are
$H_0: \mu_d = 0$   $H_1: \mu_d > 0$
where $\mu_d$ is the mean of the differences between the matched pairs. We use the test statistic
$t = \frac{\bar{x}_d}{s_d / \sqrt{n}}$
where $s_d$ is the standard deviation of the differences. For a small sample we use n - 1 degrees of freedom, where n is the number of pairs.
Paired Differences: Confidence Intervals To construct a confidence interval for the difference of the means we use:
$\bar{x}_d \pm t_c \frac{s_d}{\sqrt{n}}$
Example: Suppose that ten pairs of identical twins were reared apart and the mean difference between the high school GPA of the twin brought up in wealth and the twin brought up in poverty was 0.07. If the standard deviation of the differences was 0.5, find a 95% confidence interval for the difference.
Solution
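We compute, using $t_c = 2.262$ for 9 degrees of freedom,
$0.07 \pm 2.262 \cdot \frac{0.5}{\sqrt{10}} \approx 0.07 \pm 0.36$
or
[-0.29, 0.43]
We are 95% confident that the mean difference in GPA is between -0.29 and 0.43. Notice that 0 falls in this interval; hence we would fail to reject the null hypothesis at the 0.05 level.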
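The interval can be reproduced with the sketch below. The critical value 2.262 is the standard t-table entry for 95% confidence with 9 degrees of freedom; it is supplied by hand here as an assumption, since the standard library has no t distribution, and the function name is illustrative.

```python
import math

def paired_difference_interval(mean_diff, sd_diff, n_pairs, t_c):
    """Confidence interval mean_diff +/- t_c * sd_diff / sqrt(n_pairs)."""
    margin = t_c * sd_diff / math.sqrt(n_pairs)
    return mean_diff - margin, mean_diff + margin

t_c = 2.262   # assumed t-table value: 95% confidence, 10 - 1 = 9 degrees of freedom
low, high = paired_difference_interval(0.07, 0.5, 10, t_c)
contains_zero = low <= 0 <= high

print(round(low, 2), round(high, 2), contains_zero)
```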