Scatterplots and Cautions of Correlation
![Page 1: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/1.jpg)
© 2006 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Oleg Janke, 25-June-2012
Tip of the hat to Kevin Kacmarynski
Scatterplots and Cautions of Correlation
![Page 2: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/2.jpg)
Topics for Today’s Discussion
• Scatterplots:
  − Why bother?
  − Creating & Analyzing
  − Good usage
• Correlation and Association
• Potential Missteps
• Summary
• Real Personal Application
![Page 3: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/3.jpg)
Presumption
• Basic graphing skills
• Familiar with charting (for example, in Excel)
• YES NO
![Page 4: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/4.jpg)
Scatterplots: Why bother?
• Scatterplot (Scatter diagram)
  − Converts two columns of numbers (ordered pairs) into a picture
  − Explores the relationship between two quantitative variables
• What value does it have?
  − Determine possible cause-and-effect links (control)
  − Predict results of a variable that is difficult to measure if it is strongly related to another variable that is easier to measure (proxy)
![Page 5: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/5.jpg)
Creating a scatterplot
What is the relationship between height & weight?
![Page 6: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/6.jpg)
Creating a scatterplot
Plot each characteristic of interest on a standard XY plot.
Example points labeled on the plot: Katie, James
![Page 7: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/7.jpg)
Analyzing a scatterplot
Is there a relationship? OR Is it just randomness? (N=40)
![Page 8: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/8.jpg)
Analyzing a scatterplot
Is there a relationship? OR Is it just randomness? (N=40)
How confident are you that there is a linear relationship between height and weight in this data set? (Choose one)
• 100%
• 99-100%
• 95-99%
• 90-95%
• 80-90%
• insufficient data to say
• no relationship
![Page 9: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/9.jpg)
Analyzing a scatterplot
Is there a relationship or is it just randomness? (N=40)
• Add median lines and count quadrant totals+
[Figure: scatterplot divided by Median X and Median Y lines; quadrant totals 14, 6, 6, 14]
+ Olmstead-Tukey 1947
![Page 10: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/10.jpg)
Analyzing a scatterplot
Is there a relationship or is it just randomness? (N=40)
• Add median lines and count quadrant totals+
[Figure: scatterplot divided by Median X and Median Y lines; quadrant totals 14, 6, 6, 14]
+ Olmstead-Tukey 1947
NO relationship
• shotgun effect
• approximately equal number in each quadrant
IS a relationship
• one diagonal will dominate
![Page 11: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/11.jpg)
Analyzing a scatterplot
Is there a relationship or is it just randomness? (N=40)
• Add median lines and count quadrant totals
[Figure: scatterplot divided by Median X and Median Y lines; quadrant totals 14, 6, 6, 14]
• The smaller diagonal total (6 + 6 = 12) does not exceed the 5% limit of 13 for N = 40
• Less than 5% chance data could align this way simply from randomness
• Therefore fairly confident X & Y are related
SIGN TEST TABLE *
N     1%    5%
10    0     1
20    3     5
30    7     9
40    11    13
50    15    17
60    19    21
* Ishikawa “Guide to Quality Control”, 1976
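The median-line check above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's tooling: the function names are hypothetical, the placement of the 14s and 6s in specific quadrants is an assumption (the transcript only preserves the four totals), and the exact binomial calculation is offered as a cross-check on the table's limits.

```python
from math import comb
from statistics import median

# Limits transcribed from the slide's sign test table: N -> (1% limit, 5% limit).
SIGN_TEST_LIMITS = {10: (0, 1), 20: (3, 5), 30: (7, 9),
                    40: (11, 13), 50: (15, 17), 60: (19, 21)}


def quadrant_counts(xs, ys):
    """Count points in the four quadrants formed by the median lines.

    Points lying exactly on a median line are skipped.
    """
    mx, my = median(xs), median(ys)
    counts = {"upper_right": 0, "upper_left": 0, "lower_left": 0, "lower_right": 0}
    for x, y in zip(xs, ys):
        if x == mx or y == my:
            continue
        vertical = "upper" if y > my else "lower"
        horizontal = "right" if x > mx else "left"
        counts[f"{vertical}_{horizontal}"] += 1
    return counts


def smaller_diagonal(counts):
    """Return the smaller of the two diagonal totals."""
    rising = counts["upper_right"] + counts["lower_left"]
    falling = counts["upper_left"] + counts["lower_right"]
    return min(rising, falling)


def two_sided_sign_p(k, n):
    """Chance that the smaller diagonal holds k or fewer of n points
    when each point is equally likely to land on either diagonal."""
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Quadrant totals from the slide; which quadrant holds which total is assumed.
slide_counts = {"upper_right": 14, "upper_left": 6, "lower_left": 14, "lower_right": 6}
k = smaller_diagonal(slide_counts)
limit_1pct, limit_5pct = SIGN_TEST_LIMITS[40]
print("smaller diagonal:", k)
print("within 5% limit:", k <= limit_5pct, "| within 1% limit:", k <= limit_1pct)
print("exact two-sided p-value:", round(two_sided_sign_p(k, 40), 4))
```

With raw data, `quadrant_counts(xs, ys)` produces the totals directly, so the table lookup becomes a one-line comparison.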
![Page 12: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/12.jpg)
Good usage of Scatterplots
• In this plot, we observe a clear relationship between height and weight
• As the height of individuals increases, their weight tends to increase as well
• In the ideal case this relationship is called Body Mass Index (BMI)
![Page 13: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/13.jpg)
Good usage of scatterplots
Scenario 1: We are building parts on one line in one location.
What does the plot tell us about part length and part diameter?
![Page 14: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/14.jpg)
Good usage of scatterplots
Scenario 2: We build the same part on two different Lines.
Now what does the plot tell us about part length & part diameter?
![Page 15: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/15.jpg)
Good usage of scatterplots
Now what does the plot tell us about part length & part diameter?
• There IS a Line effect here
• Relationship between Diam & Length differs by Line
• Diam1 is twice Diam2
  − Tighter process control?
• Length1 < Length2
• Always be alert to possible strata in the data
• Plotting your data is crucial for discovery
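A small sketch of why strata matter: compute the correlation for the pooled data and then separately for each Line. The part measurements below are invented for illustration (they are not the data behind the slide), and `pearson_r` is a hand-rolled helper rather than a call to any statistics package.

```python
import math
from collections import defaultdict


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical parts: (line, length, diameter). Line 1 has shorter parts with
# roughly twice the diameter, echoing the pattern described on the slide.
parts = [
    ("Line 1", 10, 20), ("Line 1", 11, 22), ("Line 1", 12, 24), ("Line 1", 13, 26),
    ("Line 2", 14, 13), ("Line 2", 15, 12), ("Line 2", 16, 11), ("Line 2", 17, 10),
]

# Pooled correlation ignores the Line label entirely.
pooled_r = pearson_r([p[1] for p in parts], [p[2] for p in parts])

# Per-Line correlation: group the rows first, then correlate within each group.
by_line = defaultdict(list)
for line, length, diameter in parts:
    by_line[line].append((length, diameter))
per_line_r = {
    line: pearson_r([l for l, _ in rows], [d for _, d in rows])
    for line, rows in by_line.items()
}

print("pooled r:", round(pooled_r, 3))
for line, r in per_line_r.items():
    print(line, "r:", round(r, 3))
```

In this made-up data the pooled coefficient is negative even though Line 1 on its own shows a rising relationship, which is the kind of surprise a plot colored by Line would reveal at a glance.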
![Page 16: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/16.jpg)
Correlation Coefficient
• Correlation is defined as a measure of the strength of the linear relationship between two quantitative variables
  − Correlation coefficient is a mathematically calculated value
  − Correlation values are always between -1 and +1
    • 0 indicates no correlation (perfect shotgun pattern)
    • -1 and +1 indicate perfect correlation (all points fall on a line)
    • Sign indicates direction
      − Positive: up and to right
      − Negative: down and to left
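The slide's own formula did not survive in this transcript. As a stand-in, here is a minimal sketch of the standard Pearson product-moment calculation, which is presumably what is meant: the sum of products of deviations from the means, divided by the square root of the product of the two sums of squared deviations. The function name is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient.

    r = sum((x - mean_x) * (y - mean_y))
        / sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2))
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# All points on a rising line give +1; all points on a falling line give -1.
print(pearson_r([1, 2, 3], [2, 4, 6]))
print(pearson_r([1, 2, 3], [6, 4, 2]))
```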
![Page 17: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/17.jpg)
Correlation and Association
• Re-visiting our first example, we saw a strong, positive relationship between height and weight
• Supported by correlation coefficient value of 0.709
• Relationship exists
• Does NOT prove causality
Correlation = 0.709
Calculation provided by JMP statistical software
![Page 18: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/18.jpg)
Correlation and Association
Medical Trial
• Dosage 490-510 mg
• Recorded therapeutic response from 20 to 100
Is there a correlation between Dosage and Response?
• Yes
• No
• Insufficient data
• Don’t know
![Page 19: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/19.jpg)
Correlation and Association
• Since calculated correlation value is zero, there is no association between dosage and desired response! Right?
  − No LINEAR relationship
• Correlation coefficient, by itself, does not tell the entire story
• Always look at your graphs to see what the data say
Correlation coefficient = 0
Calculation provided by JMP statistical software
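To see how a strong relationship can still produce a correlation of zero, consider a response that peaks in the middle of the dosage range and falls off symmetrically on both sides. The curve below is an invented illustration chosen only to match the ranges quoted on the slide (dosage 490-510 mg, response 20 to 100); it is not the trial data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical dose-response curve: best response at 500 mg, dropping to 20
# at either end of the range. Response is fully determined by dosage.
dosages = list(range(490, 511))
responses = [100 - 0.8 * (d - 500) ** 2 for d in dosages]

r = pearson_r(dosages, responses)
print("min/max response:", min(responses), max(responses))
print("correlation is effectively zero:", abs(r) < 1e-9)
```

The response here depends entirely on dosage, yet the rising half and the falling half cancel in the linear measure. Only the plot shows the inverted-U shape.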
![Page 20: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/20.jpg)
Questions?
• How to create/analyze Scatterplots
• Correlation and Associations
![Page 21: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/21.jpg)
Missteps with scatterplots & correlation
1. Bimodal distributions
2. Stratified data
3. Lurking variables
4. Extrapolation
5. Too narrow range of X (independent variable)
6. Weak/sloppy measurement
7. Chicken and Egg Syndrome
![Page 22: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/22.jpg)
Misstep #1 with scatterplots & correlation
• What is my house worth?
  − Sale price data & house size were collected on 21 houses in the same town
  − Another house (mine) in same town is 2300 square feet in size, so it should be worth a little over $200K
  − Correlation coefficient = 0.943 (very high)
What is the problem in this analysis and the resulting conclusion?
![Page 23: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/23.jpg)
Misstep #1 with scatterplots & correlation
What is the problem?
• Relationship/correlation dependent solely on one data point
• Why might this one point not be appropriate?
  − Location (school district, suburban)
  − Features (pool, lot, barn, view)
  − Timing (peak of housing bubble)
• Example of Bimodal data
Need both appropriate data & proper analysis techniques
Is there a linear relationship?
• What do median lines say about the relationship?
[Figure: sale price vs. house size scatterplot with median lines; quadrant totals 4, 7, 6, 4]
Sign Test Table
N     1%    5%
10    0     1
20    3     5
30    7     9
40    11    13
50    15    17
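A quick way to test whether one point is carrying the correlation is to compute the coefficient with and without it. The house sizes and prices below are invented for illustration (a tight cluster plus one very large, very expensive house); they are not the 21 houses from the slide.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical cluster of similar houses: (size in square feet, price in $K).
cluster = [(1400, 90), (1500, 90), (1600, 90), (1400, 110), (1500, 110), (1600, 110)]
# One hypothetical house far outside the cluster on both axes.
outlier = (3500, 300)


def r_for(houses):
    sizes = [size for size, _ in houses]
    prices = [price for _, price in houses]
    return pearson_r(sizes, prices)


r_with_outlier = r_for(cluster + [outlier])
r_without_outlier = r_for(cluster)

print("r with the outlier:   ", round(r_with_outlier, 3))
print("r without the outlier:", round(r_without_outlier, 3))
```

A large gap between the two values is a signal to ask whether the extreme point really belongs to the same population as the rest.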
![Page 24: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/24.jpg)
Misstep #2 with scatterplots & correlation
• Based on this data set, with its high Correlation Coefficient (0.780), what’s the relationship between shoe size and knowledge?
• What’s missing?
Correlation = 0.780
Calculation provided by JMP statistical software
![Page 25: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/25.jpg)
Misstep #2 with scatterplots & correlation
• Does this help solve the mystery?
• Be sure to look for hidden variables that might have an impact on the relationship
• Stratified data: subpopulations with different relationships can give erroneous conclusions
For example: CSAT data
• Those who respond to survey
• Those who do NOT respond to survey
Do both groups have similar opinions?
![Page 26: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/26.jpg)
Misstep #3 with scatterplots & correlation
Beware of Lurking Variables
• Related through a common third variable
  − Ice cream sales correlate with water usage (temperature)
  − Height–weight example (age)
  − Call volume at HP Call Center A correlates with call volume at HP Call Center B (business)
• Related through independent growth (decay) rates
  − Population in Indonesia correlates with price of tea in NYC (growth)
  − My car’s value correlates with grams of Cobalt-60 isotope (decay)
    • Both have half-lives of about 5-6 years
• Related through measuring same characteristic differently
  − Weight in pounds is correlated with weight in kilos
  − Attendance at an event is correlated with empty seats at same event
  − Area of a US state is correlated with population of that state
    • Some notable exceptions (AK, MT)
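The decay example can be reproduced with two curves that have nothing to do with each other beyond both shrinking over time. In this sketch the Cobalt-60 half-life of about 5.27 years is the physical constant; the car's 5.5-year value half-life, the starting amounts, and the ten-year window are assumptions made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


COBALT_60_HALF_LIFE_YEARS = 5.27   # physical half-life of Cobalt-60
CAR_VALUE_HALF_LIFE_YEARS = 5.5    # hypothetical depreciation half-life

years = range(0, 11)
# Grams of Cobalt-60 remaining from a hypothetical 100 g sample.
cobalt_grams = [100 * 0.5 ** (t / COBALT_60_HALF_LIFE_YEARS) for t in years]
# Value of a hypothetical car bought for $30,000.
car_value = [30000 * 0.5 ** (t / CAR_VALUE_HALF_LIFE_YEARS) for t in years]

r = pearson_r(cobalt_grams, car_value)
print("correlation between isotope mass and car value:", round(r, 4))
```

Time is the lurking variable here: both series are driven by elapsed years, so they track each other closely without any link between them.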
![Page 27: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/27.jpg)
Does School Spending Educate Students?
States spending more per student have lower SAT* scores!
[Figure: Expenditure/pupil by State vs SAT. X axis: expenditure per pupil for public school K-12 (2002-03), $4,000 to $12,000. Y axis: SAT scores for 1998, 900 to 1200. Annotation: Obvious Negative Correlation]
* The SAT is a standardized test used by many colleges across the US to determine level of student preparedness for college
![Page 28: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/28.jpg)
Does School Spending Educate Students?
States spending more per student have lower SAT scores!
[Figure: Expenditure/pupil by State vs SAT. X axis: expenditure per pupil for public school K-12 (2002-03), $4,000 to $12,000. Y axis: SAT scores for 1998, 900 to 1200. Annotation: Negative Correlation]
Do we have a measurement issue here?
What do SAT scores actually measure?
• Test performance – at a minimum
• Education – Not always correlated with Knowledge
• Knowledge – Our belief that this leads to Life Success
• Life Success -- This is what we would like to be the case
Be careful of proxies that stand in for other measures
![Page 29: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/29.jpg)
Does School Spending Educate Students?
States spending more per student have lower SAT scores!
[Figure: Expenditure/pupil by State vs SAT. X axis: expenditure per pupil for public school K-12 (2002-03), $4,000 to $12,000. Y axis: SAT scores for 1998, 900 to 1200. Annotation: Negative Correlation]
Do we have a measurement issue here?
No! SAT scores are predictive of Life Success (financial)
• Life Success – college grad, good job
• Not with certainty, but on average
• Should we move to states with lower student spending?
![Page 30: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/30.jpg)
Does School Spending Educate Students?
Does Percent of students taking SAT impact SAT scores?
Correlation Coefficient = .92
Beware the Lurking Variable!
[Figure: 1998 SAT by State. X axis: % Taking SAT, 0 to 90. Y axis: Composite SAT, 900 to 1200. Points labeled by state abbreviation, plus a USA point. Fitted curve: y = 1278x^(-0.0575), R² = 0.8461]
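The fitted curve printed on the chart can be evaluated directly to see how strongly the participation rate alone moves the expected score. This sketch uses only the two numbers shown on the slide (the power curve and its R²); the function name and the two example participation rates are hypothetical, and it assumes the quoted correlation of .92 is the square root of that same R².

```python
import math


def predicted_sat(percent_taking):
    """Composite SAT predicted by the chart's fitted curve y = 1278 * x^(-0.0575)."""
    return 1278 * percent_taking ** -0.0575


low_participation = predicted_sat(10)    # state where 10% of students sit the SAT
high_participation = predicted_sat(80)   # state where 80% of students sit the SAT

print("predicted score at 10% participation:", round(low_participation))
print("predicted score at 80% participation:", round(high_participation))
print("gap explained by participation alone:", round(low_participation - high_participation))

# Magnitude of the correlation implied by the reported R-squared.
implied_r = math.sqrt(0.8461)
print("implied |r|:", round(implied_r, 2))
```

A gap of well over a hundred points from participation alone is larger than the whole spread the spending chart was trying to explain.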
![Page 31: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/31.jpg)
Misstep #4 with scatterplots & correlation
Back to Height & Weight data
• How much would a 100-inch (~2.5 meters) person weigh?
[Figure: height vs. weight scatterplot extended to heights of 80-100 inches and weights of 200-300 pounds]
• From the scatterplot he would weigh ~300# (136 kg)
• Can we make this prediction?
![Page 32: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/32.jpg)
Misstep #4 with scatterplots & correlation
• “Predicted” 300 pounds!
• Robert Wadlow was 8 ft 11 in (2.72 m) and weighed 439# (199 kg)
• Interpolating within range of independent variable is acceptable
• Extrapolating beyond range of independent variable is dangerous
  − Relationship may not be stable
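One practical guard is to have the prediction routine report whether the requested X lies inside the range of the data used for the fit. The sketch below fits an ordinary least-squares line to a few invented height/weight points (not the N=40 data set from the slides) and flags any prediction outside the observed heights; the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_with_range_check(x, xs, ys):
    """Return (prediction, is_extrapolation) for a new x value."""
    slope, intercept = fit_line(xs, ys)
    is_extrapolation = not (min(xs) <= x <= max(xs))
    return slope * x + intercept, is_extrapolation


# Hypothetical sample: heights in inches, weights in pounds.
heights = [60, 65, 70, 75]
weights = [120, 145, 170, 195]

inside, inside_flag = predict_with_range_check(68, heights, weights)
outside, outside_flag = predict_with_range_check(100, heights, weights)

print("68 in  ->", inside, "lb, extrapolation:", inside_flag)
print("100 in ->", outside, "lb, extrapolation:", outside_flag)
```

The line happily returns a number for 100 inches, but the flag makes it explicit that nothing in the data supports it. The actual 439# figure for a person near that height shows how far off a straight-line guess can be.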
![Page 33: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/33.jpg)
Misstep #5 with scatterplots & correlation
Back to Height & Weight data
• What would the conclusion be if height ranged only from 1600.0 mm to 1700.0 mm (about 63-67 inches)?
• Easily conclude no relationship between height and weight
• Make sure the range of the independent variable (X) is sufficiently large relative to the variation in the dependent variable (Y)
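The effect of a narrow X range can be shown by computing the correlation twice on the same data: once over the full span of heights and once over a thin slice. The data below are synthetic (a straight-line trend plus a fixed alternating scatter of 20 pounds), chosen so the result is reproducible without random numbers.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Synthetic people: heights 60..76 inches, weight = 5 * height - 180 pounds,
# pushed up or down by 20 pounds in an alternating pattern.
heights = list(range(60, 77))
weights = [5 * h - 180 + (20 if i % 2 == 0 else -20) for i, h in enumerate(heights)]

r_full = pearson_r(heights, weights)

# Keep only people between 66 and 68 inches tall.
narrow = [(h, w) for h, w in zip(heights, weights) if 66 <= h <= 68]
r_narrow = pearson_r([h for h, _ in narrow], [w for _, w in narrow])

print("r over 60-76 inches:", round(r_full, 3))
print("r over 66-68 inches:", round(r_narrow, 3))
```

The underlying trend is identical in both cases. Inside the narrow slice the 5-pounds-per-inch slope is simply swamped by the person-to-person scatter.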
![Page 34: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/34.jpg)
Misstep #6 with scatterplots & correlation
Poor Measurement System
• Inappropriate tool or gage to measure
  − Pixel width with standard yard/meter stick
  − Monitor response time with second hand on watch
• Weak tech repeatability
  − Tech visual determination of damage of NB sent in for repair
  − Typographical errors on a written page
Ensure Measurement System Analysis is performed before data are collected
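A rough sketch of how measurement error can hide a real relationship: start with a quantity that depends perfectly on X, then add a large, fixed pattern of reading errors and recompute the correlation. The values and the error pattern are invented for illustration; the pattern was chosen to be unrelated to X so that only the added noise, not a shifted trend, changes the result.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = list(range(9))                 # hypothetical input settings 0..8
true_ys = [float(x) for x in xs]    # the true response follows x exactly

# Hypothetical reading errors from a poor gage: +/- 5 units, zero at the midpoint.
error_pattern = [1, -1, 1, -1, 0, -1, 1, -1, 1]
measured_ys = [y + 5 * e for y, e in zip(true_ys, error_pattern)]

r_true = pearson_r(xs, true_ys)
r_measured = pearson_r(xs, measured_ys)

print("r with perfect measurement:", round(r_true, 3))
print("r with sloppy measurement: ", round(r_measured, 3))
```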
![Page 35: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/35.jpg)
Misstep #7 with scatterplots & correlation
Chicken and Egg Syndrome
• Which came first?
• What is the cause and what is the effect?
  − Do children from poor families do poorly academically because they are poor OR are they poor because of poor academic performance?
  − Do consumers buy good product out of loyalty OR are consumers loyal because of good product?
• Vicious/Virtuous Cycles: hard to break through
• Relationship is there; Causality is not easily determined
![Page 36: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/36.jpg)
Misstep Summary
Watch out for ….
• Bimodal distribution (example: House Size and Price)
• Stratified data (example: Shoe Size and Knowledge)
• Lurking variable (example: SAT Scores and Participation Rate)
  − Underlying third variable
  − Common but unrelated growth/decay curves
  − Same variable measured differently
• Extrapolation (example: Height and Weight for tallest man)
  − Generalizing from a sampled subset to a broader, larger population
• Narrow range of X (example: Height and Weight)
• Sloppy measurement
  − Can hide a real relationship
  − May create one when none exists
• Chicken and Egg Syndrome
  − Variables are related, but which is cause & which is effect?
![Page 37: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/37.jpg)
Questions?
Seven Missteps
![Page 38: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/38.jpg)
Summary
• Scatterplots
  − Simple, but powerful tool to explore relationships between two quantitative variables
• Be sure data are representative of question
  − “What are we trying to accomplish?”
• Plot data to look for anomalies or associations
• Correlation has special meaning
  − Correlation does not imply causation
  − Nor does lack of correlation deny causation
• Recall Missteps that may impact scatterplot/correlation analysis, including lurking variables
![Page 39: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/39.jpg)
Personal Example
![Page 40: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/40.jpg)
Memory Loss Boosts Risk of Death
![Page 41: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/41.jpg)
Memory Loss Boosts Risk of Death
Cognitive Impairment    Qty     Lifespan (median, months)    Mortality
None                    3157    138                          57%
Mild                    533     106                          68%
Moderate/Severe         267     63                           79%
Were there any missteps in the analysis?
Key Points in Article
• About 4000 men & women
• Aged 60 to 102
• Indianapolis, Indiana, USA
• Started in early 1990’s; ended 2006
• Lower socio-economic background
• 10 questions to assess mental status
• Primary care Dr appt
• No intervening follow-up of mental assessment
![Page 42: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/42.jpg)
Memory Loss Boosts Risk of Death
Potential missteps in analysis
1) Applies equally for men and women? (Stratified data?)
2) Indy only? OR for all USA? OR World Wide?
3) Only applies to those that go to Doctor?
4) Socio-economic Background: What role does it play?
   (Items 2-4: three possible Extrapolations)
5) How repeatable was 10 question assessment? (Measurement system)
6) Why combine Moderate & Severe?
7) Depends on fewer points at extreme (Bimodal?)
8) And the Big Misstep: Lurking variable!!!
![Page 43: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/43.jpg)
Memory Loss Boosts Risk of Death
Potential missteps in analysis
1) Applies equally for men and women? (Stratified data?)
2) Indy only? OR for all USA? OR WW?
3) Only applies to those that go to Doctor?
4) Socio-economic Background: What role does it play?
   (Items 2-4: three possible Extrapolations)
5) How repeatable was 10 question assessment? (Measurement system)
6) Why combine Moderate & Severe?
7) Depends on fewer points at extreme (Bimodal?)
8) And the Biggie …. A Lurking variable!!!
Does Cognitive Impairment hasten death? OR
Does Age Boost the Risk of Death?
![Page 44: Scatterplots and Cautions of Correlation](https://reader035.fdocuments.in/reader035/viewer/2022062900/58e7a7661a28ab847a8b5a07/html5/thumbnails/44.jpg)
Questions?