Unit 4: Summarizing and Exploring Dataleon/stat571/2004SummerPDFs/571Unit4Handout.pdf3 6/9/2004 Unit...
Transcript of Unit 4: Summarizing and Exploring Dataleon/stat571/2004SummerPDFs/571Unit4Handout.pdf3 6/9/2004 Unit...
1
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 1
Unit 4: Summarizing and Exploring Data
Statistics 571: Statistical MethodsRamón V. León
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 2
Variable Types – Scales of Measurement• The numerical scale of a variable determines the
appropriate statistical treatment of the resulting data• Types of variables
– Categorical (qualitative)• Nominal• Ordinal
– Numerical (quantitative)• Continuous• Discrete
2
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 3
Types of Variables
• A categorical (qualitative) variable classifies each study unit into one of several distinct categories– When the categories are simply distinct labels the variable is
nominal– When the categories can be ordered or ranked the variable is
ordinal
• A numerical (quantitative) variable takes values from a set of numbers– A continuous variable takes values from a continuous set of
numbers– A discrete variable takes values from a discrete set of numbers,
typically integers
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 4
Alternative Classification of Numerical Variables
• If two measurements on a variable can be meaningfully compared by their difference, but not their ratio, then it is called an interval variable (e.g. temperature)
• If both the difference and the ratio are meaningful, then the variable is called a ratio variable (e.g. number of crimes, weight)– A ratio variable has a true zero while an interval variable does not.
The zero on a temperature scale depends on the scale. True zeros are independent of scale, e.g. zero weight.
– A 40 Kgs weight is twice a 20 Kgs weight even if these weight are translated into pounds. A 40 F° temperature is not twice a 20 F°temperature if these temperatures are translated into C°.
• The ratio scale is the strongest scale of measurement
3
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 5
Summarizing Categorical Data
A frequency table shows the number of occurrences of each category
The relative frequency is the proportion in each category
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 6
Bar Chart, Pie Chart, and Pareto ChartA bar chart with categories arranged from the highest to thelowest is called a Pareto Chart
Used to distinguish the vital few from the trivial many
4
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 7
Bar Chart, Pie Chart, and Pareto Chart
Homework: Go to the course home page and do the first two JMP tutorials. Turn in the Pareto chart produced by JMP. Then fish around in JMP and do a simple (unordered) bar chart and a pie chart. (See next page.) Turn in these charts in too. You will need to use the Help menu of JMP.
0
100
200
300
400
500
600C
ount
Sand
Cor
ebre
akBr
oken
Dro
pSh
iftM
is-ru
nC
rush
Run
-out
Cor
e-M
issi
ngBl
ow Cut
Cause_of_Rejection
0
25
50
75
100
Cum
Per
cent
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 8
Bar Chart, Pie Chart, and Pareto Chart
0
100
200
300
400
Freq
uenc
y
Sand
Mis
-run
Blow Sh
iftC
utBr
oken
Dro
pC
oreb
reak
Run
-out
Cor
e-M
issi
ngC
rush
Cause_of_Rejection
Cause_of_Rejection Sand Mis-runBlow ShiftCut BrokenDrop CorebreakRun-out Core-MissingCrush
Cause_of_Rejection
Cause_of_Rejection Sand Mis-runBlow ShiftCut BrokenDrop CorebreakRun-out Core-MissingCrush
5
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 9
Summarizing Numerical Data(In Unit 4 measures are used to describe the data, not to infer anything about a population)
Mean:
1 2 1...
n
in i
xx x xX
n n=+ + +
= =∑
2
1( )
n
ii
x m=
−∑is minimizedwhen
X m=
Homework: Import into JMP the mileage data of Table 4.2 available in the data set link of the course home page. Go to the Analyze menu of JMP and select Distribution. Look for the mean and the median in the resulting JMP output. Turn in this output.
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 10
Summarizing Numerical Data
Median: Middle value in the ordered sample
min (1) ( 2 ) ( ) max... nx x x x x= ≤ ≤ ≤ =1( )
2
12 2
12
n
n n
x
xx x
+
+
= +
if n is odd
if n is even
1
n
ii
x m=
−∑ is minimized when x m=
6
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 11
Mean or Median?• Appropriate summary of the center of the data:
– Mean if the data has a symmetric distribution with light tails (i.e. a relative small proportion of the observations lie away from the center of the data
– If the distribution has heavy tails or is asymmetric (skewed) use the median
• Extreme values that are far removed from the main body of the data are called outliers– Large influence on the mean but not on the median
• Right or positive skewness– Median < Mean
• Left of negative skewness– Median > Mean
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 12
Summary Statistics: Measures of Dispersion
• Two data sets may have the same center but quite different dispersions around it
• Two way to summarize variability– Give milestones that divide the data into equal
parts, e.g. quartiles– Compute a single number, e.g., range, variance,
or standard deviation
7
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 13
Summary Statistics: Percentiles and Quartiles
• The 100pth percentile or pth quantile is the value such that a fraction p of the data is below it and 1-p of the data is above it
• Quartiles– Median is the 50th percentile– The 25th, 50th, and 75th percentiles are called quartiles
and divide the data into four equal part– The minimum, maximum, and three quartiles are called
the five number summary of the data
px
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 14
Calculation Percentiles and Quartiles( ( 1))
( ) ( 1) ( )[ ( 1) ]( )p n
pm m m
xx
x p n m x x+
+
= + + − −
if p(n+1) is an integerotherwise
m is the integer part of p(n+1)
Calculate Q1: ( 1) .25(24 1) 6.25p n + = + = is not an integer
1 (6) (7) (6)0.25( ) 24 0.25(25 24) 24.25Q x x x= + − = + − =
Example – First Quartile Calculation:
8
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 15
Range, Interquartile Range, and Standard Deviation
Range = maximum - minimum
Interquartile range (IQR) = Q3 – Q1
Sample variance:
( )22 2 2
1 1
1 1( )1 1
n n
i ii i
s x x x n xn n= =
= − = − − − ∑ ∑
Sample standard deviation: 2s s=Sample mean, variance, and standard deviations are sample analogs of the population mean, variance, and standard deviation 2, ,µ σ σ
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 16
Variance Calculation Example
100154251140031-124-21
Xi iX X− ( )2iX X− 15
53
X =
=
2
2 1
( )
110
5 12.5
n
ii
X Xs
n=
−=
−
=−
=
∑
9
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 17
Sample Variance and Standard Deviation
• s2 and s should only be used to summarize dispersion with symmetric distributions
• For asymmetric distribution a more detailed breakup of the dispersion must be given in terms of quartiles
• For normal data and large samples– 50% of the data values fall between
– 68% of the data values fall between
– 95% of the data values fall between
– 99.7% of the data values fall between
0.67X s±
1X s±
2X s±
3X s±
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 18
Relationship between s and IQR for Normal Data
( 0.67 ) ( 0.67 ) 1.34
1.34
IQR x s x s sIQRs
+ − − =
.50
-.67 .67
Standard Normal Density
z scores
10
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 19
Measures of Dispersion Example
Coefficient of VariationsCVX
= =
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 20
Coefficient of Variation (CV)
Relative measure of dispersion that expresses the sample SD interms of the sample mean:
sCVX
=
Group of children study:
:
138 , 9 , 9 /138 6.5%
Arm Circunference
x mm s mm CV= = = =
:
60%, 4%, 4 / 60 6.7%
Body Water
x s CV= = = =
11
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 21
Box and Whiskers Plot (Box Plot)JMP’s (Default) Outlier Box Plot
•Q1 and Q3 are called the hinges•Center line is the median•Two lines called whiskers extend to the most extreme datavalues that are still inside the (inner) fences•Observation outside the fences are regarded as possible outliersand are denoted by asterisks
Lower (Inner) Fence = Q1 – 1.5 x IQRUpper (Inner) Fence = Q3 + 1.5 x IQR
Lower Outer Fence = Q1 – 3 x IQR Upper Outer Fence = Q3 + 3 x IQR
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 22
Mileage data + 100 Value in Extra JMP Row
0
20
40
60
80
100
120
100.0%99.5%97.5%90.0%75.0%50.0%25.0%10.0%2.5%0.5%0.0%
maximum
quartilemedianquartile
minimum
100.00 100.00 100.00 44.00 36.50 30.00 24.50 18.80 13.00 13.00 13.00
Quantiles
Outlier Box Plot obtained by going to the Analyze menu and selecting the Distributionoption. This is the default box plot.The Quantile Box Plot (innermost) was obtained by going to the red triangle menu on top of the histogram and selecting the option.
Homework:Turn in this JMP output
12
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 23
Explanation of JMP Output
A and B are called the lower and upper fences. The dashed lines (whiskers) are the most extreme observations still within the fences. A point outside the fences is considered a possible outlier and should be examined with care.
Lower (Inner) Fence = Q1 – 1.5 x IQRUpper (Inner) Fence = Q3 + 1.5 x IQR
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 24
Explanation of JMP Output
This explanation was obtained by going to the Tools menu of JMP, selecting the Help (?) tool, and then clicking with it on top of the JMP output for which you want the explanation.
13
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 25
Using Box Plots to Compare
Open circle denotes an extreme outlier, i.e. a data point beyondthe upper outer fence Q3 + 3 x IQR
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 26
Using Box Plots to Compare: JMP Output
Homework: Do the JMP tutorial “Using Box Plots to Compare”
Cos
t
0
5000
10000
15000
20000
25000
30000
35000
Control Treatment
Group
14
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 27
Standardized Z-scores•Often it is useful to express the data on a standardized scale:
, 1, 2, ... ,ii
x Xz i ns−
= =
•How many standard deviations a data value is above or below the mean•Unitless quantities standardized to have mean 0 and SD 1•Used to flag outliers e.g. observations with z > 3•Used to compare two or more batches of data on the same scale that originally were measured (possibly) on different scales•If data is normal z score can be translated into percentiles using Table A.3 of the normal distribution
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 28
Histogram
10
15
20
25
30
35
40
45
50
55
10
15
20
25
30
35
40
45
50
55
The two different histograms were obtained by going to the JMP Tools menu, grabbing the Hand tool, and moving it around on top of the histogram.
Relative frequency Histogram:
15
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 29
Histogram Remarks
• Gives a picture of the distribution of the data
• Area under the (relative frequency) histogram represents sample proportion
• Show if the distribution is– Symmetric or skewed– Unimodal or bimodal
• Gaps in the data may indicate a lack of precision with the measurement process
• Many quality control applications– Are there two processes?– Detection of rework or cheating– Tells if process meets the specifications
LSL USLLSL USL
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 30
Stem and Leaf Diagram
Obtained by going to the red triangle pop-up menu on top of the histogram and selecting Stem and Leaf.
Stem Leaf 5 0 4 4 00 3 5678 3 01112 2 557889 2 0134 1 7 1 3
Count1
2456411
Multiply Stem.Leaf by 10
Stem and Leaf
16
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 31
Normal PlotsIf the data follows a normal distribution, then the percentiles of that normal distribution should plot linearly against the sample percentiles, except for sampling variation
See next page to see how the normal scores were obtained
p - percentiles p ( )1z p−= Φ
Plot pversus z
ordered data = percentiles
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 32
Standard Normal Table
( )1z p−= Φ p
= p
17
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 33
Normal Plot
z
( )
( )
( )
1( )
( )1
i
i
i
xp
x p
xp
µσ
µ σ
µσ
−
−
− Φ =
⇒
= + Φ
⇒−
Φ =
Normal Score = z
1i
n +
ordered data
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 34
Normal Plot: JMP Output
Homework: Do the JMP tutorial “Normal Quantile Plot.” Turn in your JMP output
10
15
20
25
30
35
40
45
50
55.01 .05.10 .25 .50 .75 .90.95 .99
-2 -1 0 1 2 3
Normal Quantile Plot
z = normal score
( )1
iz pn
Φ = =+
ordered data
18
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 35
Easier to Interpret JMP
Output.01
.05
.10
.25
.50
.75
.90
.95
.99
-3
-2
-1
0
1
2
3
Nor
mal
Qua
ntile
Plo
t
10 15 20 25 30 35 40 45 50 55
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 36
Normal Goodness of Fit Test
19
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 37
Normal Goodness of Fit Test
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 38
SamplingVariationand Normal Plots
Use JMP’s Random function in the JMP calculator to obtain similar plots
Do the JMP tutorial: “Generating Standard Normal Random Numbers” to see how.
20
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 39
Generating Random Samples with JMP’s Random Function
1. Open a new data table and add as many rows as you want in the random sample
2. Right click on top of the column and select Formula… . This will bring up the calculator
3. Select the column in the calculator and then from Random menu select Random Normal
Consult JMP’s help underRandom Number Functions Use other random functions to do Problem
4.24, Parts (b) and (c)
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 40
Departuresfrom
Normalityand
Normal Plots
21
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 41
Normalizing Transformations•Data can be nonnormal in a number of ways, e.g., the distributionmay not be bell shaped or may be heavier tailed than the normal distribution or may not be symmetric
•Only the departure from symmetry can be easily corrected by transforming the data
•If the distribution is positively skewed, then the right tail needs tobe shrunk inward. The most common transformation used for thispurpose is the logarithmic transformation, i.e., logx x→
•The square-root transformation, i.e. x x→ providesa weaker shrinking effect; it is frequently used for (Poisson)count data
•For negatively skewed data used the inverse transformations 2orxx e x x→ →
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 42
Hospitalization Cost Before and After a Log Transformation
•The transformation facilitate statistical analysis, but the transformed scale might not have a physical meaning. •Also the same results are not obtained by transforming back to the original scale as by performing the same statistical analyses on the raw data, e.g., the average of the log is not the same as the log of the average
22
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 43
Using the JMP Calculator To Transform Data1. Double click to the right of the “Group” column to get a new column and name it “LogCost”
2. Right click on top of the LogCostcolumn and select Formula… . This takes you to JMP’scalculator
3. Select the Log function from the TranscendentalGroup
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 44
Comparison of the Hospital Cost After Log Transformation
LogC
ost
6
6.5
7
7.5
8
8.5
9
9.5
10
10.5
Control Treatment
Group
Later we will learn a formal statistical procedure, called the t-test, that can be used to test for differences of populations that are normally distributed and have similar variation.
23
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 45
Runs Chart•The graphical techniques described so far are inappropriate fortime series data because they ignore the time sequence
•A run chart graphs the data against time
Elongation Measurements on Fifty Springs
The histogram hides the time order effect and is therefore an inappropriate method for summarizing time-ordered data.
The Time Series platform of JMP can be used to produce a runs chart.
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 46
Summarizing Bivariate Data• When two or more variables are measured on each
sampling unit the result is multivariate data.• If only two variables are measured the result is bivariate
data. One variable is called the x variable and the other the y variable
• We can analyze the x and y variable separately with the methods we have learned so far but these methods would not answer questions about the relationship between x and y– What is the nature of the relationship (if any) between x
and y? – How strong is the relationship?– How well can one variable be predicted from the other?
24
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 47
Summarizing Bivariate Categorical Data:Two-Way Table
•The numbers in the cells are the frequencies of each possiblecombination of categories•Why would you like to transform the frequencies into percentages?
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 48
Table and Row Percentages
What information can you tell from these two tables?
25
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 49
What Percentage to Calculate on the Table?
• When a sample is drawn and then crossclassified according to the row and column variable then overall percentages should be reported. Row and column percentages are interpretable too. This is the case in the income and satisfaction example.
• When the sample is drawn by using predetermined sample sizes for the rows (or columns), the row (or column) percentages should be reported. Other percentages do not make sense. (See example in the next page.)
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 50
Appropriate Percentage Example
Here row percentages are meaningful but column percentages are not. It would be misleading to report that 25/30 = 83% of the people with a respiratory illness are smokers, because the proportion of smokers in the sample does not reflect the true proportion of smokers in the population.
26
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 51
Two-Way Table: JMP
Output
Inco
me
($)
15,000-25,000
6000-15,000
<6000
>25,000
283.11
25.9311.91
818.99
25.3934.47
131.44
20.975.53
11312.5427.4348.09
384.22
35.1913.15
10411.5432.6035.99
222.44
35.487.61
12513.8730.3443.25
242.66
22.2211.65
808.88
25.0838.83
202.22
32.269.71
829.10
19.9039.81
182.00
16.6710.53
545.99
16.9331.58
70.78
11.294.09
9210.2122.3353.80
23526.08
28932.08
20622.86
17118.98
10811.99
31935.41
626.88
41245.73
901
Job SatisfactionCountTotal %Col %Row %
Little Dissatisfied
Moderately Satisfied
Very Dissatisfied
Very Satisfied
Homework: Do the JMP tutorial “Summarizing Bivariate Categorical Data.”
•Percentages can be deleted in JMP•This table needed some cleaning up in PowerPoint after pasting.
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 52
Two-Way Table: JMP
Output
Homework: Do the JMP tutorial “Summarizing Bivariate Categorical Data.”
•Percentages can be deleted in JMP•This table needed some cleaning up in PowerPoint after pasting.
27
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 53
JMP’s Mosaic Plot
Job
Satis
fact
ion
0.00
0.25
0.50
0.75
1.00
<6000 6000-15,000 15,000-25,000 >25,000
Income ($)
Very Dissatisfied
Little Dissatisfied
Moderately Satisfied
Very Satisfied
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 54
Simpson’s Paradox: Collapsing Several Variable into Two Variables
University of California sex-bias study in graduate admissions:
Is there a gender bias?
28
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 55
Simpson’s Paradox: Collapsing Several Variable into Two Variables
University of California sex-bias study in graduate admissions:
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 56
How Can the Admission Rates be Compared
29
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 57
Summarizing Bivariate Numerical Data
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 58
Scatter Plot
30
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 59
Scatter Plot: JMP Output
2
3
4
5
6
7
8
9M
etho
d B
2 3 4 5 6 7 8 9Method A
Homework: Do the JMP tutorial “Scatter Plot.”
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 60
Example: Life Expectancy by Race and Gender
31
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 61
Labeled Scatter Plot
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 62
Labeled Scatter Plot: JMP Output
Homework: Go to the JMP tutorial “Labeled Scatter Plot” and produce the graphs in this page and the next. Turn them in.
45
50
55
60
65
70
75
80
Year
s
1910 1920 1930 1940 1950 1960 1970 1980 1990 2000Year
32
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 63
Labeled Scatter
Plot: JMP Output
45
50
55
60
65
70
75
80
Yea
rs
1910 1920 1930 1940 1950 1960 1970 1980 1990 2000Year
Linear Fit Group=="BF"Linear Fit Group=="BM"Linear Fit Group=="WF"Linear Fit Group=="WM"
What is the message in this plot?
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 64
Sample Correlation CoefficientA single numerical summary statistic which measures the strengthof a linear relationship between x and y
1 1
where1 1( )( )
1 1is the sample covariance
xy
x y
n n
xy i i i ii i
sr
s s
s x x y y x y nx yn n= =
=
= − − = − − − ∑ ∑
1
11
ni i
i x y
x x y yrn s s=
− −= −
∑Note thatAverage of the products of the standardized scores
33
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 65
Properties of the Sample Correlation Coefficient r
• Properties similar to the population correlation coefficient ρ– Unitless quantity– Takes values between –1 and 1– The extreme values are attained if and only if the points
(xi, yi) fall exactly on a straight line (r = -1 for a line with negative slope and r = +1 for a line with positive slope.)
– Takes values close to zero if there is no linear relationship between x and y
Homework: Do the JMP tutorial “Correlation Coefficient”
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 66
Values of the Correlation Coefficient
34
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 67
Scatter Plots with
Zero Correlation
Lesson: Always plot the data first -before interpreting numerical summaries such as the correlation coefficient!
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 68
Military Draft
r = -0.226
Is the plot on the next page better?
There appears to be no linear relationship between the birth date and the selection number in the military draft lottery.
35
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 69
Military Draft
r = -0.866Lesson: If the data is averaged over subsets of the sample, then the correlation coefficient is usually made larger in absolute value.
Another example: The daily closes of the S&P 500 stock index and the S&P bond index exhibit less correlation than the monthly averages of the daily closes of these indexes.
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 70
Correlation and Causation• High correlation is frequently mistaken for a
cause-and-effect relationship. Such a conclusion may not be valid in observational studies, where the variables are not controlled– A lurking variable may be affecting both variables– One can only claim association not causation
• Countries with high fat diets tend to have higher incidences of breast and colon cancer. Can we conclude causation?
• A common lurking variable in many studies is time order– Wealth and health problems go up with age. Does wealth cause
health problems?• Sometimes correlations can be found without any plausible
explanation, e.g., sun spots and economic cycles
36
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 71
Straight Line Regression•It is convenient to summarize the relationship between twovariables, x and y, by an equation to predict y for a given x. The simplest equation is that of a straight line.•Least squares method:
y x
y y x xrs s
− −=
Equation predicts that if the value of x is one standard deviation away for the mean, then the value of y will be r standard deviations (rsy) away from the mean
where andy
x
sy a bx b r a y bx
s= + = = −
Slope-intercept form of the regression line:
Homework: Do the JMP tutorial “Simple Linear Regression.”
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 72
Regression Toward the Mean• “Regression” means falling back• Regression towards the mean (Francis Galton and Karl
Pearson)– Many hereditary traits tend to fall back towards the mean in
successive generations– What simple explanation for this phenomenon is there?
• Galton’s data on father’s heights and son’s heights
• It is natural to think that the regression line is y = x + 1, that is, that on the average a son would be taller than his father by 1 inch.
68, 69, 2.7, 2.7, 0.5x yx y s s r
37
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 73
Regression Toward the Mean
• Actual regression line: y-69=0.5(x-68)– Son’s of 72 inch tall fathers are 71 inches tall on the
average– Son’s of 64 inch tall (short?) fathers are 67 inches tall
(short?) on the average• Same phenomenon occurs in pretest-posttest situation• What happens to the SAT scores of students who retake
the SAT exam after doing particularly bad?
Regression fallacy: Misleading conclusions due to regression towards the mean
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 74
Galton’sPlot of Father-
Son Height Data
Dashed Line:y = x + 1
Solid Line:Regression line
38
6/9/2004 Unit 4 - Stat 571 - Ramón V. León 75
Stephen Senn’s Quote on Regression to the Mean
•“This phenomenon of regression is widespread and general so that, for example, patients which have been selected for a clinical trial on grounds of high blood pressure may, on purely statistical grounds, be expected, other things being equal, to show an improvement in blood pleasure when measured again. This omnipresent, but widely ignored feature is an important reason for the unreliability of uncontrolled studies and it is asad fact that medical researchers are still regular and gulliblevictims of this phenomenon.”•Importance of the counterfactual view of causality.
Statistical Issues in Drug Development by Stephen Senn. Wiley