Found StatCrunch Resources – Use StatCrunch to find correlation between two variables – Find a...


Page 1: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.

Found StatCrunch Resources

– Use StatCrunch to find correlation between two variables

• http://screencast.com/t/rAbGVY5We8

– Find a Confidence Interval for a population mean using StatCrunch

• http://www.youtube.com/watch?v=G5nw2B9g19c

– StatCrunch Cheat Sheet

• http://www3.jjc.edu/staff/msullivan/Stats/Technology%20Step%20by%20Step%20StatCrunch.pdf

– You might want to have StatCrunch open

Page 2: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.

Final Project Notes

• 1. Using the MM207 Student Data Set:

• What is the correlation between student cumulative GPA and the number of hours spent on school work each week? Be sure to include the computations or StatCrunch output to support your answer.

– Stat->summary stats->correlation

• What would be the predicted GPA for a student who spends 16 hours per week on school work? Be sure to include the computations or StatCrunch output to support your prediction.

– Stat->regression->simple linear

– Choose the dependent (y) and independent (x) variables

– Each of the “Next” screens has useful options: “Confidence Intervals”, “Predict Y for x=”, “Plot fitted line”

• Hit the Next button to see the graph.

• Highlight the tables with the mouse, press Ctrl-C to copy to the clipboard, then Ctrl-V to paste into your document.
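For anyone who wants to check the StatCrunch output by hand, the correlation and the regression prediction can be sketched in a few lines of Python. This is only an illustration: the `hours` and `gpa` lists are made-up values, not the MM207 Student Data Set, and the function names `pearson_r` and `fit_line` are hypothetical helpers, not StatCrunch features.

```python
from statistics import mean

def pearson_r(x, y):
    """Linear correlation coefficient r for two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

# Made-up values for illustration only (NOT the MM207 data)
hours = [5, 8, 10, 12, 15, 20]
gpa = [2.8, 3.0, 3.1, 3.3, 3.6, 3.8]

r = pearson_r(hours, gpa)
slope, intercept = fit_line(hours, gpa)
predicted_gpa = intercept + slope * 16  # "Predict Y for x=16"

print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
print(f"GPA = {intercept:.3f} + {slope:.3f} * hours")
print(f"Predicted GPA at 16 hours: {predicted_gpa:.2f}")
```

With the real data, replace the two lists with the GPA and hours columns; the numbers should then agree with the StatCrunch correlation and simple linear regression output.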

Page 3: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.

Final Project Notes

• 2. Bar graphs are great:

– Graphics->bar plot->with data

• 3. Jonathan is a 42 year old male student and Mary is a 37 year old female student thinking about taking this class. Based on their relative position, which student would be farther away from the average age of their gender group based on this sample of MM207 students?

– Compute the z-values

• 4. If you were to randomly select a student from the set of students who have completed the survey, what is the probability that you would select a male? Explain your answer.
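A rough sketch of items 3 and 4 in Python follows. The group means and standard deviations here are hypothetical placeholders (the real ones come from the Stat->summary stats output for each gender); the male and total counts are the ones in the gender frequency table shown later in these notes.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

# Hypothetical group summaries, for illustration only (NOT the MM207 values)
male_mean, male_sd = 38.0, 9.0
female_mean, female_sd = 36.0, 10.0

z_jonathan = z_score(42, male_mean, male_sd)
z_mary = z_score(37, female_mean, female_sd)

# The student with the larger |z| is farther from the average of their group
farther = "Jonathan" if abs(z_jonathan) > abs(z_mary) else "Mary"
print(f"z(Jonathan) = {z_jonathan:.2f}, z(Mary) = {z_mary:.2f}, farther: {farther}")

# Item 4: probability of selecting a male = males / all respondents
males, respondents = 41, 233
p_male = males / respondents
print(f"P(male) = {p_male:.4f}")
```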

Page 4: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.

Final Project Notes

• 5. Using the sample of MM207 students: What is the probability of randomly selecting a person who is conservative and then selecting from that group someone who is a nursing major? What is the probability of randomly selecting a liberal or a male?

– Stat->tables->Contingency->with data. Choose q9 and q13

              Business  IT  Legal Studies  Nursing  Other  Psychology  Total
Conservative         4   1              6       17      6          22     56
Liberal              2   3              2       12      8          26     53
Moderate            13   6              2       32     17          49    119
Total               19  10             10       61     31          97    228
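One way to read the first question off this table is as a two-step selection: first a conservative, then a nursing major from within that group. The sketch below uses the counts from the table; the variable names are illustrative only.

```python
from fractions import Fraction

majors = ["Business", "IT", "Legal Studies", "Nursing", "Other", "Psychology"]
counts = {
    "Conservative": [4, 1, 6, 17, 6, 22],
    "Liberal": [2, 3, 2, 12, 8, 26],
    "Moderate": [13, 6, 2, 32, 17, 49],
}

total = sum(sum(row) for row in counts.values())
conservative_total = sum(counts["Conservative"])
nursing_conservative = counts["Conservative"][majors.index("Nursing")]

# Step 1: P(conservative); step 2: P(nursing | conservative)
p_conservative = Fraction(conservative_total, total)
p_nursing_given_conservative = Fraction(nursing_conservative, conservative_total)

# Multiplication rule: P(A and B) = P(A) * P(B | A)
p_both = p_conservative * p_nursing_given_conservative

print(f"P(conservative) = {p_conservative} = {float(p_conservative):.4f}")
print(f"P(nursing | conservative) = {p_nursing_given_conservative}")
print(f"P(conservative, then nursing) = {p_both} = {float(p_both):.4f}")
```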

Page 5: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.

Final Project Notes

              Female  Male  Total
Conservative      41    15     56
Liberal           44     9     53
Moderate         101    17    118
Total            186    41    227

What is the probability of randomly selecting a liberal or a male?
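Because some students are both liberal and male, the "or" question needs the addition rule, which subtracts the overlap so that it is not counted twice. A minimal sketch using the counts in the table above:

```python
from fractions import Fraction

total = 227            # all students in the table
liberal = 53           # Liberal row total
male = 41              # Male column total
liberal_and_male = 9   # Liberal row, Male column

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_liberal_or_male = Fraction(liberal + male - liberal_and_male, total)

print(f"P(liberal or male) = {p_liberal_or_male} = {float(p_liberal_or_male):.4f}")
```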

Page 6: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.

Final Project Notes

• 6. Compute the z-score using the standard deviation from the CLT (σ/√n), with n = 25.

• 7. Select a random sample of 30 student responses to question 6, "How many credit hours are you taking this term?" Using the information from this sample, and assuming that our data set is a random sample of all Kaplan statistics students, estimate the average number of credit hours that all Kaplan statistics students are taking this term using a 95% level of confidence. Be sure to show the data from your sample and the data to support your estimate.

– To get the sample of 30 students, choose Data->Sample columns, select q6, enter a sample size of 30, and hit “Sample Columns”. A popup tells you that a new column, Sample(q6), has been added to the data.

– Compute the 95% CI from this column: Stat->z-statistics->one sample->with data. Choose the new Sample(q6) column, hit Next, select Confidence Interval, and hit Calculate.
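The sampling and interval steps for item 7 can be sketched in Python as below. Two assumptions are worth flagging: the `q6` list is randomly generated placeholder data (not the real survey responses), and the interval uses the sample standard deviation in place of the unknown population σ.

```python
import random
from statistics import NormalDist, mean, stdev

def z_interval(sample, level=0.95):
    """z confidence interval for a mean, using the sample SD in place of sigma."""
    z_star = NormalDist().inv_cdf(0.5 + level / 2)
    centre = mean(sample)
    margin = z_star * stdev(sample) / len(sample) ** 0.5
    return centre - margin, centre + margin

# Placeholder column: made-up credit-hour answers, NOT the real q6 responses
rng = random.Random(207)
q6 = [rng.choice([3, 5, 6, 10, 12, 15]) for _ in range(230)]

# Equivalent of Data->Sample columns with a sample size of 30
sample = rng.sample(q6, 30)

low, high = z_interval(sample, level=0.95)
print(f"sample mean = {mean(sample):.2f}")
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
```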

Page 7: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.

Final Project Notes

• 8. Assume that the MM207 Student Data Set is a random sample of all Kaplan students; estimate the proportion of all Kaplan students who are female using a 90% level of confidence.

– First get the number of females in the sample: Stat->tables->frequency, choose Gender.

Gender   Frequency   Relative Frequency
Female         192           0.82403433
Male            41           0.17596567

– Proportions->one sample->with summary. Successes = 192, observations = 233. Hit Next, choose Confidence Interval, and hit Calculate.

Proportion   Count   Total   Sample Prop.   Std. Err.     L. Limit    U. Limit
p              192     233   0.82403433     0.024946446   0.7751402   0.87292844

Note: the limits in this output equal the sample proportion ± 1.96 standard errors, which is a 95% interval. Set the confidence level to 0.90 to answer the question as asked.
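As a cross-check on the proportion output, the interval can be recomputed from the two counts alone. The sketch below assumes the usual large-sample formula, p̂ ± z·√(p̂(1 − p̂)/n); the function name `proportion_interval` is a hypothetical helper.

```python
from statistics import NormalDist

def proportion_interval(successes, n, level):
    """Large-sample z interval for a proportion."""
    p_hat = successes / n
    std_err = (p_hat * (1 - p_hat) / n) ** 0.5
    z_star = NormalDist().inv_cdf(0.5 + level / 2)
    return p_hat, std_err, p_hat - z_star * std_err, p_hat + z_star * std_err

# 192 females out of 233 respondents
p_hat, std_err, low90, high90 = proportion_interval(192, 233, 0.90)
_, _, low95, high95 = proportion_interval(192, 233, 0.95)

print(f"sample proportion = {p_hat:.8f}, std. err. = {std_err:.9f}")
print(f"90% CI: ({low90:.4f}, {high90:.4f})")
print(f"95% CI: ({low95:.4f}, {high95:.4f})")
```

The 90% interval comes out narrower than the 95% one, roughly 0.783 to 0.865.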

Page 8: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.

Final Project Notes

• 9. Assume you want to estimate the proportion of students who commute less than 5 miles to work to within 2%. What sample size would you need?

– n = 1/E^2

• 10. A professor at Kaplan University claims that the average age of all Kaplan students is 36 years old. Use a 95% confidence interval to test the professor's claim. Is the professor's claim reasonable or not? Explain.

– Stat->z-statistics->one sample->with data. Choose q2, hit Next, select Confidence Interval, and hit Calculate.

Variable               n     Sample Mean   Std. Err.    L. Limit    U. Limit
Q2 How old are you?    237   37.291138     0.67025393   35.977467   38.604813
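Both items reduce to short calculations, sketched here in Python. The sample-size function applies the n = 1/E^2 rule from item 9 (an approximation for about 95% confidence), and the claim check simply asks whether the claimed mean lies inside the interval StatCrunch reported. The function names are illustrative.

```python
import math

def sample_size(margin_of_error):
    """Approximate sample size for a proportion at about 95% confidence: n = 1/E^2."""
    # round() guards against floating-point noise before rounding up
    return math.ceil(round(1 / margin_of_error ** 2, 6))

def claim_is_plausible(claimed_mean, lower, upper):
    """A claimed mean is consistent with the data if it lies inside the CI."""
    return lower <= claimed_mean <= upper

# Item 9: estimate a proportion to within 2% (E = 0.02)
n_needed = sample_size(0.02)
print(f"sample size needed: {n_needed}")

# Item 10: 95% CI limits copied from the StatCrunch output above
lower, upper = 35.977467, 38.604813
plausible = claim_is_plausible(36, lower, upper)
print(f"claim of 36 inside ({lower:.2f}, {upper:.2f})? {plausible}")
```

Since 36 falls just inside the interval, this sample does not contradict the professor's claim.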

Page 9: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.

z-Scores

z-scores determine how far, in terms of standard deviations, a given score is from the mean of the distribution.

z = (value of piece of data - mean) / standard deviation = (x - µ) / σ

Page 209

Page 10: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.

Slide 5.2-10 Copyright © 2009 Pearson Education, Inc.

Figure 5.22 shows the values on the distribution of IQ scores from Example 6.

Figure 5.22 Standard scores for IQ scores of 85, 100, and 125.

Page 11: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.


Graph from: http://www.comfsm.fm/~dleeling/statistics/fx_2001_02.html

Page 12: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.

Percentiles

• Percentiles are normally used with large data sets.

• We divide the number of data values by 100, and that tells us how many data values are in each percent.

• The following example has the weekly grocery bills for 300 families.

• There will be 3 data values in each percent, or 30 values in each 10%.

Page 13: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.

591 215 150 342 265 426 414 33 426 507 269 116 205 153 199 418 177 106 318 473

52 461 328 172 82 451 150 384 480 68 269 580 191 98 477 468 471 398 68 124

222 551 315 134 249 599 272 210 485 183 535 43 55 150 274 94 331 536 317 446

152 65 358 254 196 209 213 317 447 431 593 162 220 239 129 259 102 92 491 469

35 487 273 216 214 428 282 226 149 271 330 452 216 150 574 538 420 488 170 263

218 256 475 372 110 550 425 59 194 138 518 402 594 184 305 309 146 112 416 390

45 262 183 520 306 597 407 309 558 259 348 272 234 276 261 438 246 407 65 118

481 130 391 441 398 399 407 164 486 149 257 271 446 144 130 238 408 83 157 204

591 86 352 498 351 203 182 418 242 587 566 125 241 369 444 372 405 319 523 391

272 255 542 429 241 227 150 563 419 180 352 506 341 372 314 289 512 243 202 58

244 558 506 551 57 391 328 335 533 32 593 122 506 227 401 108 350 342 212 113

596 265 392 414 73 48 525 513 350 465 44 419 549 534 543 137 176 587 401 490

257 250 129 295 507 267 522 41 522 581 302 42 543 132 275 363 365 181 360 232

238 535 263 488 285 433 380 270 69 511 99 574 49 549 106 516 220 185 344 317

136 391 288 389 85 481 500 77 338 331 488 309 400 372 501 506 307 72 556 569

Page 14: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.

32 65 99 130 150 184 215 241 261 273 309 341 372 399 419 447 486 507 536 569

33 65 102 130 152 185 216 241 262 274 314 342 372 400 419 451 487 511 538 574

35 68 106 132 153 191 216 242 263 275 315 342 372 401 420 452 488 512 542 574

41 68 106 134 157 194 218 243 263 276 317 344 372 401 425 461 488 513 543 580

42 69 108 136 162 196 220 244 265 282 317 348 380 402 426 465 488 516 543 581

43 72 110 137 164 199 220 246 265 285 317 350 384 405 426 468 490 518 549 587

44 73 112 138 170 202 222 249 267 288 318 350 389 407 428 469 491 520 549 587

45 77 113 144 172 203 226 250 269 289 319 351 390 407 429 471 498 522 550 591

48 82 116 146 176 204 227 254 269 295 328 352 391 407 431 473 500 522 551 591

49 83 118 149 177 205 227 255 270 302 328 352 391 408 433 475 501 523 551 593

52 85 122 149 180 209 232 256 271 305 330 358 391 414 438 477 506 525 556 593

55 86 124 150 181 210 234 257 271 306 331 360 391 414 441 480 506 533 558 594

57 92 125 150 182 212 238 257 272 307 331 363 392 416 444 481 506 534 558 596

58 94 129 150 183 213 238 259 272 309 335 365 398 418 446 481 506 535 563 597

59 98 129 150 183 214 239 259 272 309 338 369 398 418 446 485 507 535 566 599
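In the sorted table above (read down each column), the 15 smallest bills, 32 through 59, make up the bottom 5% of the 300 values. The same counting can be sketched in Python; the `bills` list here is a stand-in sequence of 300 numbers, not the grocery data, and `percent_below` is a hypothetical helper.

```python
from bisect import bisect_left

def values_per_percent(n):
    """How many data values fall in each 1% slice of a data set of size n."""
    return n / 100

def percent_below(sorted_values, x):
    """Percentage of the data values that are strictly less than x."""
    return 100 * bisect_left(sorted_values, x) / len(sorted_values)

# Stand-in for 300 sorted bills (NOT the grocery data above)
bills = sorted(range(101, 401))

print(values_per_percent(len(bills)))   # values in each percent
print(percent_below(bills, 131))        # percent of bills below 131
```

With 300 values, 30 of the stand-in bills lie below 131, so that value sits at the 10% mark.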

Page 15: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.


The Central Limit Theorem

Suppose we take many random samples of size n for a variable with any distribution (not necessarily a normal distribution) and record the distribution of the means of each sample. Then,

1. The distribution of means will be approximately a normal distribution for large sample sizes.

2. The mean of the distribution of means approaches the population mean, µ, for large sample sizes.

3. The standard deviation of the distribution of means approaches σ/√n for large sample sizes, where σ is the standard deviation of the population.

Page 217
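The three statements of the theorem can be checked with a small simulation. The sketch below makes its own assumptions: it draws from an exponential population (strongly skewed, with mean 1 and standard deviation 1), uses samples of size n = 30, and fixes the random seed so the run is repeatable.

```python
import random
from statistics import mean, pstdev

rng = random.Random(2009)   # fixed seed so the run is repeatable
n = 30                      # size of each sample
num_samples = 5000          # how many samples to draw

# Exponential population with rate 1: skewed, mean 1, standard deviation 1
population_mean, population_sd = 1.0, 1.0

# Record the mean of each sample
sample_means = [
    mean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

mean_of_means = mean(sample_means)
sd_of_means = pstdev(sample_means)
predicted_sd = population_sd / n ** 0.5   # sigma / sqrt(n)

print(f"mean of sample means: {mean_of_means:.3f} (population mean {population_mean})")
print(f"sd of sample means:   {sd_of_means:.3f} (sigma/sqrt(n) = {predicted_sd:.3f})")
```

A histogram of `sample_means` would also look close to a bell curve even though the population itself is far from normal.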

Page 16: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.


Figure 5.26 As the sample size increases (n = 5, 10, 30), the distribution of sample means approaches a normal distribution, regardless of the shape of the original distribution. The larger the sample size, the smaller is the standard deviation of the distribution of sample means.

Page 17: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.

EXAMPLE 1 Predicting Test Scores

You are a middle school principal and your 100 eighth-graders are about to take a national standardized test. The test is designed so that the mean score is µ = 400 with a standard deviation of σ = 70. Assume the scores are normally distributed.

a. What is the likelihood that one of your eighth-graders, selected at random, will score below 375 on the exam?

Solution:

a. In dealing with an individual score, we use the method of standard scores discussed in Section 5.2. Given the mean of 400 and standard deviation of 70, a score of 375 has a standard score of

z = (data value - mean) / standard deviation = (375 - 400) / 70 = -0.36

Page 18: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.


Page 19: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.

EXAMPLE 1 Predicting Test Scores

Solution: (cont.)

According to Table 5.1, a standard score of -0.36 corresponds to about the 36th percentile; that is, 36% of all students can be expected to score below 375. Thus, there is about a 0.36 chance that a randomly selected student will score below 375.

Notice that we need to know that the scores have a normal distribution in order to make this calculation, because the table of standard scores applies only to normal distributions.

Page 20: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.

EXAMPLE 1 Predicting Test Scores

You are a middle school principal and your 100 eighth-graders are about to take a national standardized test. The test is designed so that the mean score is µ = 400 with a standard deviation of σ = 70. Assume the scores are normally distributed.

b. Your performance as a principal depends on how well your entire group of eighth-graders scores on the exam. What is the likelihood that your group of 100 eighth-graders will have a mean score below 375?

Solution:

b. The question about the mean of a group of students must be handled with the Central Limit Theorem. According to this theorem, if we take random samples of size n = 100 students and compute the mean test score of each group, the distribution of means is approximately normal.

Page 21: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.

EXAMPLE 1 Predicting Test Scores

Solution: (cont.)

Moreover, the mean of this distribution is µ = 400 and its standard deviation is σ/√n = 70/√100 = 7. With these values for the mean and standard deviation, the standard score for a mean test score of 375 is

z = (data value - mean) / standard deviation = (375 - 400) / 7 = -3.57

Table 5.1 shows that a standard score of -3.5 corresponds to the 0.02th percentile, and the standard score in this case is even lower.

In other words, fewer than 0.02% of all random samples of 100 students will have a mean score of less than 375.

Page 22: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.


Page 23: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.

EXAMPLE 1 Predicting Test Scores

Solution: (cont.)

Therefore, the chance that a randomly selected group of 100 students will have a mean score below 375 is less than 0.0002, or about 1 in 5,000.

Notice that this calculation regarding the group mean did not depend on the individual scores’ having a normal distribution.

This example has an important lesson. The likelihood of an individual scoring below 375 is more than 1 in 3 (36%), but the likelihood of a group of 100 students having a mean score below 375 is less than 1 in 5,000 (0.02%).

In other words, there is much more variation in the scores of individuals than in the means of groups of individuals.
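Both parts of this example can be reproduced without a table of standard scores. The sketch below uses Python's `statistics.NormalDist` with the values from the example (µ = 400, σ = 70, n = 100); only the variable names are invented here.

```python
from statistics import NormalDist

mu, sigma, n = 400, 70, 100

# Part a: one student, so use the population distribution itself
individual_scores = NormalDist(mu, sigma)
z_individual = (375 - mu) / sigma
p_individual = individual_scores.cdf(375)

# Part b: the mean of 100 students, so the spread shrinks to sigma / sqrt(n)
group_sd = sigma / n ** 0.5
group_means = NormalDist(mu, group_sd)
z_group = (375 - mu) / group_sd
p_group = group_means.cdf(375)

print(f"individual: z = {z_individual:.2f}, P(score < 375) = {p_individual:.4f}")
print(f"group mean: z = {z_group:.2f}, P(mean < 375) = {p_group:.6f}")
```

The individual probability lands near 0.36 and the group probability below 0.0002, in line with the values read from Table 5.1.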

Page 24: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.

Figure 7.3 Types of correlation seen on scatter diagrams.

Types of Correlation Page 289


Page 25: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.

Linear Correlation Coefficient Page 294
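The formula on this slide did not survive in the transcript. For reference, one standard way to write the linear correlation coefficient for n paired values is given below; the textbook page may present an algebraically equivalent form with different notation.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad -1 \le r \le 1
```

Here \bar{x} and \bar{y} are the means of the two variables; values of r near +1 or -1 indicate a strong positive or negative linear correlation, and values near 0 indicate little or no linear correlation.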

Page 26: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.

The line of best fit (also called the regression line or the least-squares line) is the line that fits the data best: it minimizes the sum of the squared vertical distances between the data points and the line.

Page 27: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.

Regression


Page 28: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.

Data Set 1, WT is y and HT is x


Page 29: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.

Cautions in Making Predictions from Best-Fit Lines

1. Don’t expect a best-fit line to give a good prediction unless the correlation is strong and there are many data points. If the sample points lie very close to the best-fit line, the correlation is very strong and the prediction is more likely to be accurate. If the sample points lie away from the best-fit line by substantial amounts, the correlation is weak and predictions tend to be much less accurate.

2. Don’t use a best-fit line to make predictions beyond the bounds of the data points to which the line was fit.

3. A best-fit line based on past data is not necessarily valid now and might not result in valid predictions of the future.

4. Don’t make predictions about a population that is different from the population from which the sample data were drawn.

5. Remember that a best-fit line is meaningless when there is no significant correlation or when the relationship is nonlinear.


Page 30: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.

EXAMPLE 1 Valid Predictions?

State whether the prediction (or implied prediction) should be trusted in each of the following cases, and explain why or why not.

You’ve found a best-fit line for a correlation between the number of hours per day that people exercise and the number of calories they consume each day. You’ve used this correlation to predict that a person who exercises 18 hours per day would consume 15,000 calories per day.

Solution: No one exercises 18 hours per day on an ongoing basis, so this much exercise must be beyond the bounds of any data collected. Therefore, a prediction about someone who exercises 18 hours per day should not be trusted.

Page 31: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.

EXAMPLE 1 Valid Predictions?

State whether the prediction (or implied prediction) should be trusted in each of the following cases, and explain why or why not.

Historical data have shown a strong negative correlation between national birth rates and affluence. That is, countries with greater affluence tend to have lower birth rates. These data predict a high birth rate in Russia.

Solution: We cannot automatically assume that the historical data still apply today. In fact, Russia currently has a very low birth rate, despite also having a low level of affluence.

Page 32: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.

EXAMPLE 1 Valid Predictions?

State whether the prediction (or implied prediction) should be trusted in each of the following cases, and explain why or why not.

A study in China has discovered correlations that are useful in designing museum exhibits that Chinese children enjoy. A curator suggests using this information to design a new museum exhibit for Atlanta-area school children.

Solution: The suggestion to use information from the Chinese study for an Atlanta exhibit assumes that predictions made from correlations in China also apply to Atlanta. However, given the cultural differences between China and Atlanta, the curator’s suggestion should not be considered without more information to back it up.

Page 33: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.

EXAMPLE 1 Valid Predictions?

State whether the prediction (or implied prediction) should be trusted in each of the following cases, and explain why or why not.

Scientific studies have shown a very strong correlation between children’s ingesting of lead and mental retardation. Based on this correlation, paints containing lead were banned.

Solution: Given the strength of the correlation and the severity of the consequences, this prediction and the ban that followed seem quite reasonable. In fact, later studies established lead as an actual cause of mental retardation, making the rationale behind the ban even stronger.

Page 34: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.

EXAMPLE 1 Valid Predictions?

State whether the prediction (or implied prediction) should be trusted in each of the following cases, and explain why or why not.

Based on a large data set, you’ve made a scatter diagram for salsa consumption (per person) versus years of education. The diagram shows no significant correlation, but you’ve drawn a best-fit line anyway. The line predicts that someone who consumes a pint of salsa per week has at least 13 years of education.

Solution: Because there is no significant correlation, the best-fit line and any predictions made from it are meaningless.

Page 35: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.

The square of the correlation coefficient, or r², is the proportion of the variation in a variable that is accounted for by the best-fit line.

The use of multiple regression allows the calculation of a best-fit equation that represents the best fit between one variable (such as price) and a combination of two or more other variables (such as weight and color). The coefficient of determination, R², tells us the proportion of the scatter in the data accounted for by the best-fit equation.

Page 36: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.

EXAMPLE 4 Voter Turnout and Unemployment

Political scientists are interested in knowing what factors affect voter turnout in elections. One such factor is the unemployment rate. Data collected in presidential election years since 1964 show a very weak negative correlation between voter turnout and the unemployment rate, with a correlation coefficient of about r = -0.1. Based on this correlation, should we use the unemployment rate to predict voter turnout in the next presidential election?

Note that there is a scatter diagram of the voter turnout data on page 312.

Solution: The square of the correlation coefficient is r² = (-0.1)² = 0.01, which means that only about 1% of the variation in the data is accounted for by the best-fit line. Nearly all of the variation in the data must therefore be explained by other factors. We conclude that unemployment is not a reliable predictor of voter turnout.

Page 37: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.

The Search for Causality

• A correlation may suggest causality, but by itself a correlation never establishes causality. Much more evidence is required to establish that one factor causes another.

• A correlation between two variables may be the result of (1) coincidence, (2) a common underlying cause, or (3) one variable actually having a direct influence on the other.

• The process of establishing causality is essentially a process of ruling out the first two explanations.


Page 38: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.

Determining Causality

• We can rule out coincidence by repeating the experiment many times or using a large number of subjects in the experiment. Because coincidences occur randomly, they should not occur consistently in many subjects or experiments.

• If the controls rule out confounding variables, any remaining effects must be caused by the variables being studied.

Page 39: Found StatCrunch Resources – Use StatCrunch to find correlation between two variables  – Find a Confidence Interval for.

Guidelines for Establishing Causality

If you suspect that a particular variable (the suspected cause) is causing some effect:

1. Look for situations in which the effect is correlated with the suspected cause even while other factors vary.

2. Among groups that differ only in the presence or absence of the suspected cause, check that the effect is similarly present or absent.

3. Look for evidence that larger amounts of the suspected cause produce larger amounts of the effect.

4. If the effect might be produced by other potential causes (besides your suspected cause), make sure that the effect still remains after accounting for these other potential causes.

5. If possible, test the suspected cause with an experiment. If the experiment cannot be performed with humans for ethical reasons, consider doing the experiment with animals, cell cultures, or computer models.

6. Try to determine the physical mechanism by which the suspected cause produces the effect.
