
Applied Managerial Statistics

Steven L. Scott

Winter 2005-2006


COPYRIGHT © 2002-2005 by Steven L. Scott. All rights reserved. No part of this work may be reproduced, printed, or stored in any form without prior written permission of the author.


Contents

1 Looking at Data 1

1.1 Our First Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Summaries of a Single Variable . . . . . . . . . . . . . . . . . . . . . 2

1.2.1 Categorical Data . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2.2 Continuous Data . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.3 Relationships Between Variables . . . . . . . . . . . . . . . . . . . . 9

1.4 The Rest of the Course . . . . . . . . . . . . . . . . . . . . . . . . . 15

2 Probability Basics 17

2.1 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.2 The Probability of More than One Thing . . . . . . . . . . . . . . . 20

2.2.1 Joint, Conditional, and Marginal Probabilities . . . . . . . . 20

2.2.2 Bayes’ Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.2.3 A “Real World” Probability Model . . . . . . . . . . . . . . . 27

2.3 Expected Value and Variance . . . . . . . . . . . . . . . . . . . . . . 33

2.3.1 Expected Value . . . . . . . . . . . . . . . . . . . . . . . . . . 33

2.3.2 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

2.3.3 Adding Random Variables . . . . . . . . . . . . . . . . . . . . 36

2.4 The Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 37

2.5 The Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . . 42

3 Probability Applications 45

3.1 Market Segmentation and Decision Analysis . . . . . . . . . . . . . . 45

3.1.1 Decision Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 46

3.1.2 Building and Using Market Segmentation Models . . . . . . 48

3.2 Covariance, Correlation, and Portfolio Theory . . . . . . . . . . . . . 50

3.2.1 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

3.2.2 Measuring the Risk Penalty for Non-Diversified Investments . 51

3.2.3 Correlation, Industry Clusters, and Time Series . . . . . . . 52


3.3 Stock Market Volatility . . . . . . . . . . . . . . . . . . . . . . . . . 57

4 Estimation and Testing 61

4.1 Populations and Samples . . . . . . . . . . . . . . . . . . . . . . . . 61

4.2 Sampling Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . 63

4.2.1 Example: log10 CEO Total Compensation . . . . . . . . . . . 64

4.3 Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

4.3.1 Can we just replace σ with s? . . . . . . . . . . . . . . . . . 67

4.3.2 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

4.4 Hypothesis Testing: The General Idea . . . . . . . . . . . . . . . . . 71

4.4.1 P-values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

4.4.2 Hypothesis Testing Example . . . . . . . . . . . . . . . . . . 74

4.4.3 Statistical Significance . . . . . . . . . . . . . . . . . . . . . . 75

4.5 Some Famous Hypothesis Tests . . . . . . . . . . . . . . . . . . . . . 76

4.5.1 The One Sample T Test . . . . . . . . . . . . . . . . . . . . . 76

4.5.2 Methods for Proportions (Categorical Data) . . . . . . . . . . 79

4.5.3 The χ2 Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

5 Simple Linear Regression 87

5.1 The Simple Linear Regression Model . . . . . . . . . . . . . . . . . . 87

5.1.1 Example: The CAPM Model . . . . . . . . . . . . . . . . . . 89

5.2 Three Common Regression Questions . . . . . . . . . . . . . . . . . 91

5.2.1 Is there a relationship? . . . . . . . . . . . . . . . . . . . . . . 91

5.2.2 How strong is the relationship? . . . . . . . . . . . . . . . . . 92

5.2.3 What is my prediction for Y and how good is it? . . . . . . . 93

5.3 Checking Regression Assumptions . . . . . . . . . . . . . . . . . . . 97

5.3.1 Nonlinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

5.3.2 Non-Constant Variance . . . . . . . . . . . . . . . . . . . . . 104

5.3.3 Dependent Observations . . . . . . . . . . . . . . . . . . . . . 106

5.3.4 Non-normal residuals . . . . . . . . . . . . . . . . . . . . . . . 109

5.4 Outliers, Leverage Points and Influential Points . . . . . . . . . . . . 110

5.4.1 Outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

5.4.2 Leverage Points . . . . . . . . . . . . . . . . . . . . . . . . . . 112

5.4.3 Influential Points . . . . . . . . . . . . . . . . . . . . . . . . . 113

5.4.4 Strategies for Dealing with Unusual Points . . . . . . . . . . 114

5.5 Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

6 Multiple Linear Regression 117

6.1 The Basic Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

6.2 Several Regression Questions . . . . . . . . . . . . . . . . . . . . . . 119

6.2.1 Is there any relationship at all? The ANOVA Table and the Whole Model F Test . . . . . . . . . . 120

6.2.2 How Strong is the Relationship? R2 . . . . . . . . . . 122

6.2.3 Is an Individual Variable Important? The T Test . . . . . . . . . . 123

6.2.4 Is a Subset of Variables Important? The Partial F Test . . . . . . . . . . 124

6.2.5 Predictions . . . . . . . . . . 126

6.3 Regression Diagnostics: Detecting Problems . . . . . . . . . . 127

6.3.1 Leverage Plots . . . . . . . . . . 127

6.3.2 Whole Model Diagnostics . . . . . . . . . . 129

6.4 Collinearity . . . . . . . . . . 133

6.4.1 Detecting Collinearity . . . . . . . . . . 134

6.4.2 Ways of Removing Collinearity . . . . . . . . . . 135

6.4.3 General Collinearity Advice . . . . . . . . . . 136

6.5 Regression When X is Categorical . . . . . . . . . . 137

6.5.1 Dummy Variables . . . . . . . . . . 137

6.5.2 Factors with Several Levels . . . . . . . . . . 141

6.5.3 Testing Differences Between Factor Levels . . . . . . . . . . 144

6.6 Interactions Between Variables . . . . . . . . . . 145

6.6.1 Interactions Between Continuous and Categorical Variables . . . . . . . . . . 147

6.6.2 General Advice on Interactions . . . . . . . . . . 150

6.7 Model Selection/Data Mining . . . . . . . . . . 150

6.7.1 Model Selection Strategy . . . . . . . . . . 151

6.7.2 Multiple Comparisons and the Bonferroni Rule . . . . . . . . . . 152

6.7.3 Stepwise Regression . . . . . . . . . . 152

7 Further Topics 157

7.1 Logistic Regression . . . . . . . . . . 157

7.2 Time Series . . . . . . . . . . 160

7.3 More on Probability Distributions . . . . . . . . . . 162

7.3.1 Background . . . . . . . . . . 163

7.3.2 Exponential Waiting Times . . . . . . . . . . 164

7.3.3 Binomial and Poisson Counts . . . . . . . . . . 164

7.3.4 Review . . . . . . . . . . 165

7.4 Planning Studies . . . . . . . . . . 165

7.4.1 Different Types of Studies . . . . . . . . . . 165

7.4.2 Bias, Variance, and Randomization . . . . . . . . . . 166

7.4.3 Surveys . . . . . . . . . . 167

7.4.4 Experiments . . . . . . . . . . 170

7.4.5 Observational Studies . . . . . . . . . . 171

7.4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173

A JMP Cheat Sheet 177

A.1 Get familiar with JMP . . . . . . . . . . 177

A.2 Generally Neat Tricks . . . . . . . . . . 177

A.2.1 Dynamic Graphics . . . . . . . . . . 177

A.2.2 Including and Excluding Points . . . . . . . . . . 177

A.2.3 Taking a Subset of the Data . . . . . . . . . . 178

A.2.4 Marking Points for Further Investigation . . . . . . . . . . 178

A.2.5 Changing Preferences . . . . . . . . . . 178

A.2.6 Shift Clicking and Control Clicking . . . . . . . . . . 178

A.3 The Distribution of Y . . . . . . . . . . 178

A.3.1 Continuous Data . . . . . . . . . . 178

A.3.2 Categorical Data . . . . . . . . . . 179

A.4 Fit Y by X . . . . . . . . . . 179

A.4.1 The Two Sample T-Test (or One Way ANOVA) . . . . . . . . . . 179

A.4.2 Contingency Tables/Mosaic Plots . . . . . . . . . . 179

A.4.3 Simple Regression . . . . . . . . . . 180

A.4.4 Logistic Regression . . . . . . . . . . 181

A.5 Multivariate . . . . . . . . . . 181

A.6 Fit Model (i.e. Multiple Regression) . . . . . . . . . . 181

A.6.1 Running a Regression . . . . . . . . . . 181

A.6.2 Once the Regression is Run . . . . . . . . . . 181

A.6.3 Including Interactions and Quadratic Terms . . . . . . . . . . 182

A.6.4 Contrasts . . . . . . . . . . 182

A.6.5 To Run a Stepwise Regression . . . . . . . . . . 182

A.6.6 Logistic Regression . . . . . . . . . . 183

B Some Useful Excel Commands 185

C The Greek Alphabet 189

D Tables 191

D.1 Normal Table . . . . . . . . . . 192

D.2 Quick and Dirty Normal Table . . . . . . . . . . 193

D.3 Cook's Distance . . . . . . . . . . 194

D.4 Chi-Square Table . . . . . . . . . . 195


Don’t Get Confused

1.1 Standard Deviation vs. Variance . . . . . . . . . . 6
2.1 Understanding Probability Distributions . . . . . . . . . . 19
2.2 The difference between X1 + X2 and 2X . . . . . . . . . . 37
3.1 A general formula for the variance of a linear combination . . . . . . . . . . 54
3.2 Correlation vs. Covariance . . . . . . . . . . 56
4.1 Standard Deviation vs. Standard Error . . . . . . . . . . 66
4.2 Which One is the Null Hypothesis? . . . . . . . . . . 72
4.3 The Standard Error of a Sample Proportion . . . . . . . . . . 80
5.1 R2 vs. the p-value for the slope . . . . . . . . . . 92
6.1 Why call it an “ANOVA table?” . . . . . . . . . . 122

The “Don’t Get Confused” call-out boxes highlight points that often cause new statistics students to stumble.


Not on the Test

4.1 Does the size of the population matter? . . . . . . . . . . 67
4.2 What are “Degrees of Freedom?” . . . . . . . . . . 69
4.3 Rationale behind the χ2 degrees of freedom calculation . . . . . . . . . . 84
5.1 Why sums of squares? . . . . . . . . . . 89
5.2 Box-Cox transformations . . . . . . . . . . 99
5.3 Why leverage is “Leverage” . . . . . . . . . . 114
6.1 How to build a leverage plot . . . . . . . . . . 129
6.2 Making the coefficients sum to zero . . . . . . . . . . 143
6.3 Where does the Bonferroni rule come from? . . . . . . . . . . 153

There is some material that is included because a minority of students are likely to be curious about it. Much of this material has to do with minor technical points or questions of rationale that are not central to the course. The “Not on the Test” call-out boxes explain such material to interested students, while letting others know that they can spend their energy reading elsewhere.


Preface

An MBA statistics course differs from an undergraduate course primarily in terms of the pace at which material is covered. Much of the material in an MBA course can also be found in an undergraduate course, but an MBA course tends to emphasize topics that undergraduates never get to because they spend their time focusing on other things. Unfortunately for MBA students, most statistics textbooks are written with undergraduates in mind. These notes are an attempt to help MBA statistics students navigate through the mountains of material found in typical undergraduate statistics books.

Most undergraduate statistics courses last for a semester and conclude with either one way ANOVA or simple regression. Though it has a similar number of contact hours, our MBA course lasts for eight weeks and covers up to multiple regression. To get to where we need to be at the course’s end we must deviate from the usual undergraduate course and make regression our central theme. In doing so, we condense material that occupies several chapters in an undergraduate textbook into a single chapter on confidence intervals and hypothesis tests. Undergraduate textbooks often present this material in four or more chapters, usually with other material that does not help prepare students to study regression, and in many cases with an undue emphasis on having students do calculations themselves. Our philosophy is that by condensing the non-regression hypothesis testing material into one chapter we can present a more unified view of how hypothesis testing is used in practice. Furthermore, the one-sample problems typically found in these chapters are much less compelling than regression problems, so packing them into one chapter lets us get to the good stuff more quickly. Topics such as the two sample t test and the F test from one way ANOVA are presented as special cases of regression, which reduces the number of paradigms that students must master over an eight week term.

A textbook for a course serves three basic functions. A good text concisely presents the ideas that a student must learn. It illustrates those ideas with examples as the ideas are presented. It is also a source of problems and exercises for students to work on to reinforce the ideas from the reading and from lecture. At some point these notes may evolve into a textbook, but they’re not there yet. The notes are evolving into a good presentation of statistical theory, but the hardest part of writing a textbook is developing sufficient numbers of high quality examples and exercises. At present, we have borrowed and adapted examples and exercises from three primary sources, all of which are required or optional course reading:

• Statistical Thinking for Managers, 4th Edition, by Hildebrand and Ott, published by Duxbury Press.

• Business Analysis Using Regression, by Foster, Stine, and Waterman, published by Springer.

• JMP Start Statistics, by John Sall, published by Duxbury.

Each of these sources provides data sets for their problems and examples, H&O from the diskette included with their book, FSW from the internet, and Sall from the CD containing the JMP-IN program. We will distribute the FSW data sets electronically.


Chapter 1

Looking at Data

Any data analysis should begin with “looking at the data.” This sounds like an obvious thing to do, but most of the time you will have so much data that you can’t look at it all at once. This Chapter provides some tools for creating useful summaries of the data so that you can do a cursory examination of a data set without having to literally look at each and every observation. Other goals of this Chapter are to illustrate the limitations of simply “looking at data” as a form of analysis, even with the tools discussed here, and to motivate the material in later Chapters.

1.1 Our First Data Set

Consider the data set forbes94.jmp (provided by Foster et al., 1998), which lists the 800 highest paid CEO’s of 1994, as ranked by Forbes magazine. When you open the data set in JMP (or any other computer software package) you will notice that the data are organized by rows and columns. This is a very common way to organize data. Each row represents a CEO. Each column represents a certain characteristic of each CEO such as how much they were paid in 1994, the CEO’s age, and the industry in which the CEO’s company operates. In general terms each CEO is an observation and each column of the data table is a variable. Database people sometimes refer to observations as records and variables as fields.

One reason the CEO compensation data set is a good first data set for us to look at is its size. There are 800 observations, which is almost surely too many for you to internalize by simply looking at the individual entries. Some sort of summary measures are needed. A second feature of this data set is that it contains different types of variables. Variables such as the CEO’s age and total compensation are continuous¹


Figure 1.1: The first few rows and columns of the CEO data set.

variables, while variables like the CEO’s industry and MBA status (whether or not each CEO has an MBA) are categorical². The distinction between categorical and continuous variables is important because different summary measures are appropriate for categorical and continuous variables.

Regardless of whether a variable is categorical or continuous, there are numerical and graphical methods that can be used to describe it (albeit different numerical and graphical methods for different types of variables). These are described below.

1.2 Summaries of a Single Variable

1.2.1 Categorical Data

An example of a categorical variable is the CEO’s industry (see Figure 1.2). The different values that a categorical variable can assume are called levels. The most common numerical summary of a categorical variable is a frequency table or contingency table, which simply counts the number of times each level occurred in the data set. It is sometimes easier to interpret the counts as fractions of the total number of observations in the data set, also known as relative frequencies.

¹For our purposes, continuous variables are numerical variables where the numbers mean something, as opposed to being labels for categorical levels (1=yellow, 2=blue, etc.). There are stricter, and more precise, definitions that could be applied.

²There are actually two different types of categorical variables. Nominal variables are categories like red and blue, with no order. Ordinal variables have levels like “strongly disagree, disagree, agree, . . . ” which are ordered, but with no meaningful numerical values. We will treat all categorical variables as nominal.


(a) Histogram and Mosaic Plot (b) Frequency Distribution

Figure 1.2: Graphical and numerical summaries of CEO industries (categorical data). The mosaic plot is very useful because you can put several of them next to each other to compare distributions within different groups (see Figure 1.6).

The choice between viewing the data as frequencies or relative frequencies is largely a matter of personal taste.
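
To make the counting concrete, here is a minimal Python sketch of a frequency table and the corresponding relative frequencies. It is not part of the original notes (which use JMP), and the industry labels below are hypothetical stand-ins rather than the actual Forbes data.

    # A sketch of a frequency table for a categorical variable (placeholder labels).
    from collections import Counter

    industries = ["Finance", "Finance", "Utilities", "Retailing", "Finance", "Construction"]

    counts = Counter(industries)                      # frequency table: level -> count
    n = len(industries)
    rel_freq = {level: c / n for level, c in counts.items()}   # relative frequencies

    for level, c in counts.items():
        print(f"{level:12s} {c:3d} {rel_freq[level]:.2f}")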

If the categorical variable contains many levels then it will be easier to look at a picture of the frequency distribution such as a histogram or a mosaic plot. A histogram is simply a bar-chart depicting a frequency distribution. The bigger the bar, the more frequent the level. Histograms have been around more or less forever, but mosaic plots are relative newcomers in the world of statistical graphics. A mosaic plot works like a pie chart, but it represents relative frequencies as slices of a stick (or a candy bar) instead of slices of a pie. Mosaic plots have two big advantages over pie charts. First, it is easier for people to see linear differences than angular differences (okay, maybe that’s not so big since you’ve been looking at pie charts all your life). The really important advantage of mosaic plots is that you can put several of them next to each other to compare categorical variables for several groups (see Figure 1.6).

The summaries in Figure 1.2 indicate that Finance is by far the most frequent industry in our data set of the 800 most highly paid CEO’s. The construction industry is the least represented, and you can get a sense of the relative numbers of


CEO’s from the other industries.

1.2.2 Continuous Data

An example of a continuous variable is a CEO’s age or salary. It is easier for many people to think of summaries for continuous data because you can imagine graphing them on a number line, which gives the data a sense of location. For example, you have a sense of how far an 80 year old CEO is from a 30 year old CEO, but it is nonsense to ask how far a Capital Goods CEO is from a Utilities CEO.

Summaries of continuous data fall into two broad categories: measures of central tendency (like the mean and the median) and measures of variability (like standard deviation and range). Another way to classify summaries of continuous data is whether they are based on moments or quantiles.

Moments (mean, variance, and standard deviation)

Moments (a term borrowed from physics) are simply averages. The first moment is the sample mean

    \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

You certainly know how to take an average, but it is useful to present the formula for it to get you used to some standard notation that is going to come up repeatedly. In this formula (and in most to follow) n represents the sample size. For the CEO data set n = 800. The subscript i represents each individual observation in the data set (imagine i assuming each value 1, 2, . . . , 800 in turn). For example, if we are considering CEO ages, then x1 = 52, x2 = 62, x3 = 56, and so on (see Figure 1.1). The summation sign simply says to add up all the numbers in the data set. Thus, this formula says to add up all the numbers in the data set and divide by the sample size, which you already knew. FYI: putting a bar across the top of a letter (like x̄, pronounced “x bar”) is standard notation in statistics for “take the average.”

The second moment is the sample variance

    s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2

It is the “second” moment because the thing being averaged is squared.³ The sample variance looks at each observation xi, asks how far it is from the mean (xi − x̄), squares each deviation from the mean (to make it positive), and takes the average. It

³The third moment has something cubed in it, and so forth.


would take you a while to try to remember the formula for s² by rote memorization. However, if you remember that s² is the “average squared deviation from the mean” then the formula will make more sense and it will be easier to remember. There are two technical details that cause people to get hung up on the formula for sample variance. First, why divide by n − 1 instead of n? In any data set with more than a few observations dividing by n − 1 instead of n makes almost no difference. We just do it to make math geeks happy for reasons explained (kind of) in the call-out box on page 69. Second, why square each deviation from the mean instead of doing something like just dropping the minus signs? This one is a little deeper. If you’re really curious you can check out page 89 (though you may want to wait a little bit until we get to Chapter 5).

So the sample variance is the “average squared deviation from the mean.” You use the sample variance to measure how spread out the data are. For example, the variance of CEO ages is 47.81. Wait, 47.81 what? Actually, the variance is hard to interpret because when you square each CEO’s xi − x̄ you get an answer in “years squared.” Nobody pretends to know what that means. In practice, the variance is computed en route to computing the standard deviation, which is simply the square root of the variance

    s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}

The standard deviation of CEO ages is 6.9 years, which says that CEO’s are typically about 7 years above or below the average.

Standard deviations are used in two basic ways. The first is to compare the “reliability” of two or more groups. For example: the standard deviation of CEO ages in the Chemicals industry is 3.18 years, while the SD for CEO’s in the Insurance industry is 8.4 years. That means you can expect to find more very old and very young CEO’s in the Insurance industry, while CEO’s in the Chemicals industry tend to be more tightly clustered about the average CEO age in that industry. The second, and more widespread use of standard deviations is as a standard unit of measurement to help us decide whether two things are close or far. For example, the standard deviation of CEO total compensation is $8.3 million. It so happens that Michael Eisner made over $200 million that year. The average compensation was $2.8 million, so Michael Eisner was 24 standard deviations above the mean. That, we will soon learn, is a lot of standard deviations.
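
To make the arithmetic concrete, here is a small Python sketch (not part of the original notes) of the mean, the sample variance, the standard deviation, and the “how many standard deviations above the mean” calculation. The ages are placeholder values, and the compensation figures are simply the rounded numbers quoted in the text, not the raw Forbes data.

    # A sketch of the moment-based summaries using Python's statistics module.
    import statistics

    ages = [52, 62, 56, 61, 49, 58, 53, 60]    # hypothetical CEO ages for illustration

    mean_age = statistics.mean(ages)            # x-bar: add the values, divide by n
    var_age = statistics.variance(ages)         # sample variance: divides by n - 1
    sd_age = statistics.stdev(ages)             # standard deviation: square root of the variance

    # "How many standard deviations above the mean?" using the rounded values from the text.
    mean_comp = 2.8e6    # average total compensation (dollars)
    sd_comp = 8.3e6      # standard deviation of total compensation (dollars)
    eisner = 202e6       # Michael Eisner's compensation (dollars)
    z = (eisner - mean_comp) / sd_comp
    print(round(z, 1))   # roughly 24 standard deviations above the mean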

Quantiles

Quantiles (a fancy word for percentiles) are another method of summarizing a continuous variable. To compute the p’th quantile of a variable simply sort the variable


Don’t Get Confused! 1.1 Standard Deviation vs. Variance.

Standard deviation and variance both measure how far away from your “best guess” you can expect a typical observation to fall. They measure how spread out a variable is. Variance measures spread on the squared scale. Standard deviation measures spread using the units of the variable.

from smallest to largest and find which number is p% of the way through the data set. If p% of the way through the data set puts you between two numbers, just take the average of those two numbers. The most famous quantiles are the median (50’th percentile), the minimum (0’th percentile), and the maximum (100’th percentile). If you’re given enough well chosen quantiles (say 4 or 5) you can get a pretty good idea of what the variable looks like.

The main reason people use quantiles to summarize data is to minimize the importance of outliers, which are observations far away from the rest of the data. Figure 1.3 shows the histogram of CEO total compensation, where Michael Eisner is an obvious outlier. A big outlier like Eisner can have a big impact on averages like the mean and variance (and standard deviation). With Eisner in the sample the mean compensation is $2.82 million. The mean drops to $2.57 million with him excluded. Eisner has an even larger impact on the standard deviation, which is $8.3 million with him in the sample and $4.3 million without him. The median CEO compensation is $1.3 million with or without Michael Eisner.

Outliers have virtually no impact on the median, but they do impact the maximum and minimum values. (The maximum CEO compensation with Eisner in the data set is $202 million. It drops to $53 million without him.) If you want to use quantiles to measure the spread in the data set it is smart to use something other than the max and min. The first and third quartiles (aka the 25’th and 75’th percentiles) are often used instead. The first and third quartiles are $787,000 and $2.5 million regardless of Eisner’s presence.

Quantiles are useful summaries if you want to limit the influence of outliers, which you may or may not want to do in any given situation. Sometimes outliers are the most interesting points (people certainly seem to find Michael Eisner’s salary very interesting).
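
The following sketch, added here for illustration, shows how the median shrugs off an outlier while the mean and standard deviation do not. The compensation values are invented; they are not the Forbes figures.

    # A sketch of quantiles and their robustness to outliers (values are made up).
    import statistics

    comp = [0.8, 1.1, 1.3, 1.6, 2.5, 3.0, 4.2]     # compensation in $millions
    comp_with_outlier = comp + [202.0]              # add one Eisner-sized outlier

    for data in (comp, comp_with_outlier):
        print("mean =", round(statistics.mean(data), 2),
              " sd =", round(statistics.stdev(data), 2),
              " median =", statistics.median(data))

    # The median changes only a little; the mean and standard deviation change dramatically.
    q1, q2, q3 = statistics.quantiles(comp_with_outlier, n=4)   # quartiles (25%, 50%, 75%)
    print("quartiles:", q1, q2, q3)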

Graphical Summaries

Boxplots and histograms are the best ways to visualize the distribution of a continuous variable. Histograms work by chopping the variable into bins, and counting


Figure 1.3: Histogram of CEO total compensation (left panel) and log10 CEO compensation (right panel). Michael Eisner made so much money that we had to write his salary in “scientific notation.” On the log scale the skewness is greatly reduced and Eisner is no longer an outlier.

frequencies for each bin.⁴ For boxplots the top of the box is the upper quartile i.e. the point 75% of the way through the data. The bottom of the box is the lower quartile i.e. the point which 25% of the data lies below. Thus the box in a boxplot covers the middle half of the data. The line inside the box is the median. The lines (or “whiskers”) extending from the box are supposed to cover “almost all” the rest of the data⁵. Outliers, i.e. extremely large or small values, are represented as single points. Histograms usually provide more information than boxplots, though it is easier to see individual outliers in a boxplot. The main advantage of boxplots, like mosaic plots, is that only one dimension of the boxplot means anything (the height of the boxplot in Figure 1.4 means absolutely nothing). Therefore it is much easier to look at several boxplots than it is to look at several histograms. This makes boxplots very useful for comparing the distribution of a continuous variable across several groups. (See Figure 1.7).

⁴At some point someone came up with a good algorithm for choosing histogram bins, which you shouldn’t waste your time thinking about.

⁵The rules for how long to make the whiskers are arcane and only somewhat standard. You shouldn’t worry about them.


Quantiles
100.0%   maximum    81.000
 99.5%              77.000
 97.5%              69.000
 90.0%              64.000
 75.0%   quartile   61.000
 50.0%   median     57.000
 25.0%   quartile   52.000
 10.0%              48.000
  2.5%              42.000
  0.5%              36.000
  0.0%   minimum    29.000

Figure 1.4: Numerical and graphical summaries of CEO ages (continuous data). The normal curve is superimposed. The mean of the data is 56.325 years. The standard deviation is 6.9 years. How well do the quantiles in the data match the predictions from the normal model?

The Normal Curve

Often we can use the normal curve, or “bell curve,” to model the distribution of a continuous variable. Although many continuous variables don’t fit the normal curve very well, a surprising number do. In Chapter 2 we will learn why the normal curve occurs as often as it does.

If the histogram of a continuous variable looks approximately like a normal curve then all the information about the variable is contained in its mean and standard deviation (a dramatic data reduction: from 800 numbers down to 2). The normal curve tells us what fraction of the data set we can expect to see within a certain number of standard deviations away from the mean. In Chapter 2 we will learn how to use the normal curve to make very precise calculations. For now, some of the most often used normal calculations are summarized by the empirical rule, which says that if the normal curve fits well then (approximately): 68% of the data is within ±1 SD of the mean, 95% within ±2 SD and 99.75% within ±3 SD.

To illustrate the empirical rule, consider Figure 1.4, which lists several observed quantiles for the CEO ages. The data appear approximately normal, so the empirical rule says that about 95% of the data should be within 2 standard deviations of the mean. That means about 2.5% of the data should be more than 2 SD’s above the mean, and a similar amount should be more than 2 SD’s below the mean. The mean is 56.3 years, and the size of an SD is 6.9 years. So 2 SD’s above the mean


is about 70. The 97.5% quantile is actually 69, which is pretty close to what the normal curve predicted.
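
A quick way to repeat this kind of check on any variable is to compare the fraction of observations inside mean ± 2 SD with the 95% the empirical rule predicts. A minimal Python sketch, assuming the ages are available as a plain list (placeholder values are used here, since the data are not reproduced in the notes):

    # A sketch of the empirical-rule check: compare mean +/- 2 SD to the observed data.
    import statistics

    ages = [52, 62, 56, 48, 57, 61, 69, 44, 53, 58]   # placeholder values, not the real data

    m = statistics.mean(ages)
    s = statistics.stdev(ages)
    lo, hi = m - 2 * s, m + 2 * s

    # Fraction of observations inside mean +/- 2 SD; roughly 95% if the data are near normal.
    inside = sum(lo <= x <= hi for x in ages) / len(ages)
    print(f"mean={m:.1f}, sd={s:.1f}, within 2 SD: {inside:.0%}")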

Of course you can’t use the empirical rule if the histogram of your data doesn’t look approximately like a normal curve. This is a subjective call which takes some practice to make. Figure 1.5 shows the four most common ways that the data could be non-normal. The distribution can be skewed with a heavy tail trailing off in one direction or the other. The direction of the skewness is the direction of the tail, so CEO compensation is “right skewed” because the tail trails off to the right. A variable can have fat tails like in Figure 1.5(b). You can think of fat tailed distributions as being skewed in both directions. The most common fat tailed distributions in business applications are the distributions of stock returns (closely related to corporate profits). Figure 1.5(c) shows evidence of discreteness. It shows a variable which is “continuous” according to our working definition, but with relatively few distinct values. Finally, Figure 1.5(d) shows a bimodal variable, i.e. a variable whose distribution shows two well-separated clusters.

A more precise way to check whether the normal curve is a good fit is to use a normal quantile plot (aka. quantile-quantile plot, or Q-Q plot). This plots the data ordered from smallest to largest against the corresponding quantiles from a normal curve. If the data look normal, the points in the quantile plot should follow an approximately straight line. If the dots deviate substantially from a straight line this indicates that the data do not look normal. For more details see the discussion of Q-Q plots on page 41.
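
The construction behind a Q-Q plot is easy to sketch in code: sort the data and pair each observation with a quantile from a normal curve. This is only an illustration of the idea (using the Python standard library rather than JMP), with placeholder data:

    # A sketch of the normal quantile (Q-Q) plot idea: sorted data vs. normal quantiles.
    import statistics

    data = sorted([52, 62, 56, 48, 57, 61, 69, 44, 53, 58])   # placeholder values
    n = len(data)
    norm = statistics.NormalDist(statistics.mean(data), statistics.stdev(data))

    # Theoretical quantile for the i-th ordered observation, using plotting positions (i + 0.5)/n.
    pairs = [(norm.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(data)]

    # If the data are roughly normal, the (theoretical, observed) pairs fall near a straight line.
    for theo, obs in pairs:
        print(f"{theo:6.1f} {obs:6.1f}")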

1.3 Relationships Between Variables

There are two main reasons to look at variables simultaneously:

• To understand the relationship e.g. if one variable increases what happens to the other (if I increase the number of production lines what will happen to profit).

• To use one or more variables to predict another e.g. using Profit, Sales, PE ratio etc to predict the correct value for a stock. We will then purchase the stock if its current value is under what we think it should be.

If we want to use one variable X to predict another variable Y then we call X the predictor and Y the response. The way we analyze the relationship depends on the types of variables X and Y are. There are four possible situations depending on whether X and Y are categorical or continuous (see page 179). Three are described below. The fourth (when Y is categorical and X is continuous) is best described using a model called logistic regression which we won’t see until Chapter 7. If


(a) Skewness: CEO Compensation (top 20 outliers removed)

(b) Heavy Tails: Corporate Profits

(c) Discreteness: CEO’s age upon obtaining undergraduate degree (top 5 outliers excluded)

(d) Bimodal: Birth Rates of Different Countries

Figure 1.5: Some non-normal data.


Figure 1.6: Contingency table and mosaic plot for auto choice data.

X is categorical then it is possible to simply do the analysis you would do for Y separately for each level of X. If X is continuous then this strategy is no longer feasible.

Categorical Y and X

Just as with summarizing a single categorical variable, the main numerical tool for showing the relationship between categorical Y and categorical X is a contingency table. Figure 1.6 shows data collected by an automobile dealership listing the type of car purchased by customers within different age groups. This type of data is often encountered in Marketing applications.

The primary difference between two-way contingency tables (with two categorical variables) and one-way tables (with a single variable) is that there are more ways to turn the counts in the table into proportions. For example there were 22 people in the 29-38 age group who purchased work vehicles. What does that mean to us? These 22 people represent 8.37% (=22/263) of the total data set. This is known as a joint proportion because it treats X and Y symmetrically.

If you really want to think of one variable explaining another, then you want to use conditional proportions instead. The contingency table gives you two groups of conditional proportions because it doesn’t know in advance which variable you want to condition on.

For example, if you want to see how automobile preferences vary by age then you want to compute the distribution of TYPE conditional on AGEGROUP. Restrict your attention to just the one row of the contingency table corresponding to 29-38 year olds. What fraction of them bought work vehicles? There are 133 of them, so


the 22 people represent 16.54% of that particular row. Of that same group, 63% purchased family vehicles, and 20% purchased sporty vehicles. To see how auto preferences vary according to age group, compare these “row percentages” for the young, middle, and older age groups. It looks like sporty cars are less attractive to older customers, family cars are more attractive to older customers, and work vehicles have similar appeal across age groups.
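
The arithmetic behind joint and conditional proportions is simple enough to spell out in a short sketch (added here for illustration). It uses only the counts quoted in the text: 22 work-vehicle purchases by 29-38 year olds, 133 people in that row, 44 work vehicles overall, and 263 customers in total; the rest of the table is not reproduced here.

    # A sketch of joint vs. conditional proportions for one cell of a contingency table.
    cell = 22          # 29-38 year olds who bought work vehicles
    row_total = 133    # all 29-38 year olds
    col_total = 44     # all work vehicles sold
    grand_total = 263  # all customers

    joint = cell / grand_total     # about 8.4%: treats TYPE and AGEGROUP symmetrically
    row_cond = cell / row_total    # about 16.5%: fraction of 29-38 year olds buying work vehicles
    col_cond = cell / col_total    # 50%: fraction of work vehicles bought by 29-38 year olds

    print(f"joint {joint:.1%}, row-conditional {row_cond:.1%}, column-conditional {col_cond:.1%}")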

You could also condition the other way, by restricting your attention to the column for work vehicles. Of the 44 work vehicles purchased, 22 (50%) were purchased by 29-38 year olds. The younger demographic purchased 39% of work vehicles, while the older demographic purchased only 11%. By comparing these distributions across car type you can see that most family and work cars tend to be purchased by 29-38 year olds, while sporty cars tend to be purchased by 18-28 year olds.

Finally, the margins of the contingency table contain information about the individual X and Y variables. Because of this, when you restrict your attention to a single variable by ignoring other variables you are looking at its marginal distribution. The same terminology is used for continuous variables too. Thus the title of Section 1.2 could have been “looking at marginal distributions.” We can see from the margins of the table that the 29-38 age group was the most frequently observed, and that family cars (a favorite of the 29-38 demographic) were the most often purchased.

Far and away the best way to visualize a contingency table is through a side-by-side mosaic plot like the one in Figure 1.6. The individual mosaic plots show you the conditional distribution of Y (in this case TYPE) for each level of X (in this case AGEGROUP). The plot represents the marginal distribution of X by the width of the individual mosaic plots: the 39+ demographic has the thinnest mosaic plot because it has the fewest members. The marginal distribution of the Y variable is a separate mosaic plot serving as the legend to the main plot. Finally, because of the way the marginal distributions are represented, the joint proportions in the contingency table correspond to the area of the individual tiles. Thus you can see from Figure 1.6 that “family cars purchased by 29-38 year olds” is the largest cell of the table.

Side-by-side mosaic plots are a VERY effective way of looking at contingency tables. To see for yourself, open autopref.jmp and construct separate histograms for TYPE within each level of AGEGROUP (use the “by” button in the “Distribution of Y” dialog box).

Continuous Y and Categorical X

If you want to see how the distribution of a continuous variable varies across several groups you can simply list means and standard deviations (or your favorite


Figure 1.7: Side-by-side boxplots comparing log10 compensation for CEO’s in different industries.

quantiles) for each group. Graphically, the best way to do the comparison is with side-by-side boxplots. If there are only a few levels (2 or 3) you could look at a histogram for each level (make sure the axes all have the same scale), but beyond that boxplots are the way to go. Multiple histograms are harder to read than side-by-side boxplots because each histogram has a different set of axes. Consider Figure 1.7, which compares log10 CEO compensation for CEO’s in different industries. As with mosaic plots, the width of the side-by-side boxplots depicts the marginal distribution of X. Thus the finance industry has the widest boxplot because it is the most frequent industry in our data set. Compensation-wise, the finance CEO’s seem fairly typical of other CEO’s on the list. The aerospace-defense CEO’s are rather well paid, while the forest and utilities CEO’s haven’t done as well. To convince yourself of the value of side-by-side boxplots, try doing the same comparison with 19 histograms. Yuck!

Continuous Y and X

The best graphical way to show the relationship between two continuous variables is a scatterplot like the one in Figure 1.8. Each dot represents a CEO. Dots on the right are older CEO’s. Dots near the top are highly paid CEO’s. From the Figure it appears that if there is a relationship between a CEO’s age and compensation it isn’t a very strong one.


Figure 1.8: Scatterplot showing log10 compensation vs. age for the CEO dataset. The best fitting line and quadratic function are also shown.

Of course the Figure is plotted on the log scale, and small changes in log compensation can be large changes in terms of real dollars. Could there be a trend in the data that is just too hard to see in the Figure? We can use regression to compute the straight line that best⁶ fits the trend in the data. The regression line has a positive slope, which indicates that older CEO’s tend to be paid more than younger CEO’s.

Of course, the regression line also raises some questions.

1. The slope of the line isn’t very large. How large does a slope have to be before we conclude that it isn’t worth considering?

2. Why are we only looking at straight lines? We can also use regression to fit the “best” quadratic function to the data. The linear and quadratic models say very different things about CEO compensation. The linear model says that older CEO’s make more than younger CEO’s. The quadratic model says that a CEO’s earning power peaks and then falls off. Which model should we believe?

3. The regression line only describes the trend in the data. Our previous analyses (such as comparing log10 compensation by industry) actually described the data themselves (both center and spread). Is there some way to numerically describe the entire data set and not just the trend?

⁶The regression line is “best” according to a specific criterion known as “least squares” which is discussed in Chapter 5.


4. What if a CEO’s compensation depends on more than one variable?

1.4 The Rest of the Course

The questions listed above are all very important, and we will spend much of the rest of the course understanding the tools that help us answer them. Procedurally, questions 1 and 2 are answered by something called a p-value, which is included in the computer output that you get when you fit a regression. Chapter 4 is largely about helping you understand p-values. To do so you need to know a few basic facts about probability, the subject of Chapters 2 and 3. Section 3.2.3 and Chapter 5 return to the more interesting topic of relationships between variables.

Question 3 will be dealt with in Chapter 5 once we learn a little more about the normal curve in Chapter 2. Question 4 may be the greatest limitation of analyses which consist only of “looking at data.” To measure the impact that several X variables have on Y requires that you build a model, which is the subject of Chapter 6. By the end of Chapter 6 you will have a working knowledge of the multiple regression model, which is one of the most flexible and the most widely used models in all of statistics.


Chapter 2

Probability Basics

This Chapter provides an introduction to some basic ideas in probability. The focus in Chapter 1 was on looking at data. Now we want to start thinking about building models for the process that produced the data.

Throughout your math education you have learned about one mathematical tool, and then learned about its opposite. You learned about addition, then subtraction. Multiplication, then division. Probability and statistics have a similar relationship. Probability is used to define a model for a process that could have produced the data you are interested in. Statistics then takes your data and tries to estimate the parameters of that model.

Probability is a big subject, and it is not the central focus of this course, so we will only sketch some of the main ideas. The central characters in this Chapter are random variables. Every random variable has a probability distribution that describes the values the random variable is likely to take. While some probability distributions are simple, others are complicated. If a probability distribution is too complicated to deal with we may prefer to summarize it with its expected value (also known as its mean) and its variance. One probability distribution that we will be particularly interested in is the normal distribution, which occurs very often. A bit of math known as the central limit theorem (CLT) explains why the normal distribution shows up so much. The CLT says that sums or averages of random variables are normally distributed. The CLT is so important because many of the statistics we care about (such as the sample mean, sample proportion, and regression coefficients) can be viewed as averages.


2.1 Random Variables

Definition

A number whose value is determined by the outcome of a random experiment. In effect, a random variable is a number that hasn’t “happened” yet.

Examples

• The diameter of the next observed crank shaft from an automobile production process.

• The number on a roll of a die.

• Tomorrow’s closing value of the Nasdaq.

Notation

Random variables are usually denoted with capital letters like X and Y. The possible values of these random variables are denoted with lower case letters like x and y. Thus, if X is the number of cars my used car lot will sell tomorrow, and if I am interested in the probability of selling three cars, then I will write P (X = 3). Here 3 is a particular value of lower-case x that I specify.

The Distribution of a Random Variable

By definition it is impossible to know exactly what the numerical value of a random variable will be. However, there is a big difference between not knowing a variable’s value and knowing nothing about it. Every random variable has a probability distribution describing the relative likelihood of its possible values. A probability distribution is a list of all the possible values for the random variable and the corresponding probability of that value happening. Values with high probabilities are more likely than values with small probabilities.

For example, imagine you own a small used-car lot that is just big enough to hold 3 cars (i.e. you can’t sell more than 3 cars in one day). Let X represent the number of cars sold on a particular day. Then you might face the following probability distribution

x 0 1 2 3

P (X = x) 0.1 0.2 0.4 0.3

From the probability distribution you can compute things like the probability that you sell 2 or more cars, which is 70% (= .4 + .3). Pretty straightforward, really.
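
Representing the distribution in code makes such calculations mechanical. A minimal sketch (added for illustration) using the table above:

    # A sketch of the used-car-lot sales distribution and a simple probability calculation.
    dist = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.3}     # P(X = x) for each possible value x

    p_two_or_more = sum(p for x, p in dist.items() if x >= 2)
    print(round(p_two_or_more, 2))               # 0.7, i.e. a 70% chance of selling 2 or more cars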


Don’t Get Confused! 2.1 Understanding Probability Distributions

One place where students often become confused is the distinction between a random variable X and its distribution P (X = x). You can think of a probability distribution as the histogram for a very large data set. Then think of the random variable X as a randomly chosen observation from that data set. It is often convenient to think of several different random variables with the same probability distribution. For example, let X1, . . . , X10 represent the numbers of dots observed during 10 rolls of a fair die. Each of these random variables has the same distribution P (X = x) = 1/6, for x = 1, 2, . . . , 6. But they are different random variables because each one can assume different values (i.e. you don’t get the same roll for each die).

Where Probabilities Come From

Probabilities can come from four sources.

1. “Classical” symmetry arguments

2. Historical observations

3. Subjective judgments

4. Models

Classical symmetry arguments include statements like “all sides of a fair die are equally likely, so the probability of any one side is 1/6.” They are the oldest of the four methods, but are of mainly mathematical interest and not particularly useful in applied work.

Historical observations are the most obvious way of deriving probabilities. One justification of saying that there is a 40% chance of selling two cars today is that you sold two cars on 40% of past days. A bit of finesse is needed if you wish to compute the probability of some event that you haven’t seen in the past. However, most probability distributions used in practice make use of past data in some form or another.

Subjective judgments are used whenever experts are asked to assess the chance that some event will occur. Subjective probabilities can be valuable starting points when historical information is limited, but they are only as reliable as the “expert” who produces them.

The most common sources of probabilities in business applications are probability models. Models are useful when there are too many potential outcomes to


list individually, or when there are too many uncertain quantities to consider simultaneously without some structure. Many of the most common probability models make use of the normal distribution, and its extension the linear regression model. We will discuss these two models at length later in the course.

The categories listed above are not mutually exclusive. For example, probability models usually have parameters which are fit using historical data. Subjective judgment is used when selecting families of models to fit in a given application.

2.2 The Probability of More than One Thing

Things get a bit more complicated if there are several unknown quantities to be modeled. For example, what if there were two car salesmen (Jim and Floyd) working on the lot? Then on any given day you would have two random variables: X, the number of cars that Jim sells, and Y, the number of cars that Floyd sells.

2.2.1 Joint, Conditional, and Marginal Probabilities

The joint distribution of two random variables X and Y is a function of two variables P (x, y) giving the probability that X = x and Y = y. For example, the joint distribution for Jim and Floyd’s sales might be:

              Y (Floyd)
X (Jim)      0      1      2      3
   0        .10    .10    .10    .10
   1        .10    .20    .10    .00
   2        .10    .05    .00    .00
   3        .05    .00    .00    .00

Remember that there are only 3 cars on the lot, so P (x, y) = 0 if x + y > 3. As with the distribution of a single random variable, the joint distribution of two (or more) random variables simply lists all the things that could happen, along with the corresponding probabilities. So in that sense it is no different than the probability distribution of a single random variable, there are just more possible outcomes to consider. Just to be clear, the distribution given above says that the probability of Jim selling two cars on a day that Floyd sells 1 is .05 (i.e. that combination of events will happen about 5% of the time).

Marginal Probabilities

If you were given the joint distribution of two variables, you might decide that one of them was irrelevant for your immediate purpose. For example, Floyd doesn’t care


about how many cars Jim sells, he just wants to know how many cars he (Floyd) will sell. That is, Floyd wants to know the marginal distribution of Y. (Likewise, Jim may only care about the marginal distribution of X.) Marginal probabilities are calculated in the obvious way, you simply sum across any variable you want to ignore. The mathematical formula describing the computation looks worse than it actually is

    P(Y = y) = \sum_x P(X = x, Y = y) \qquad (2.1)

All this says is the following. “The probability that Floyd sells 0 cars is the probability that he sells 0 cars and Jim sells 0, plus the probability that he sells 0 cars and Jim sells 1, plus . . . .” Even more simply, it says to add down the column of numbers in the joint distribution that correspond to Floyd selling 0 cars. In fact, the name marginal suggests that marginal probabilities are often written on the margins of a joint probability distribution. For example:

              Y (Floyd)
X (Jim)      0      1      2      3
   0        .10    .10    .10    .10    .40
   1        .10    .20    .10    .00    .40
   2        .10    .05    .00    .00    .15
   3        .05    .00    .00    .00    .05
            .35    .35    .20    .10   1.00

The marginal probabilities say that Floyd has a 10% chance (and Jim a 5% chance) of selling three cars on any given day. Notice that if you have the joint distribution you can compute the marginal distributions, but you can’t go the other way around. That makes sense, because the two marginal distributions have only 8 numbers (4 each), while the joint distribution has 16 numbers, so there must be some information loss. Also note that the word marginal means something totally different in probability than it does in economics.
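
Summing across the variable you want to ignore is easy to express in code. A minimal sketch (not part of the original notes) using the joint distribution above, with probabilities indexed as joint[x][y], x for Jim and y for Floyd:

    # A sketch of computing marginal probabilities by summing out the other variable.
    # joint[x][y] = P(Jim sells x cars and Floyd sells y cars), from the table above.
    joint = [
        [0.10, 0.10, 0.10, 0.10],
        [0.10, 0.20, 0.10, 0.00],
        [0.10, 0.05, 0.00, 0.00],
        [0.05, 0.00, 0.00, 0.00],
    ]

    # Marginal for Jim: sum each row over Floyd's sales. Gives (.40, .40, .15, .05).
    p_jim = [sum(row) for row in joint]

    # Marginal for Floyd: sum each column over Jim's sales. Gives (.35, .35, .20, .10).
    p_floyd = [sum(joint[x][y] for x in range(4)) for y in range(4)]

    print(p_jim, p_floyd)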

Conditional Probabilities

Each day, Floyd starts out believing that his sales distribution is

Num. Cars (y) 0 1 2 3

Prob .35 .35 .20 .10

What if Floyd somehow knew that today was one of the days that Jim would sell 0 cars? What should he believe about his sales distribution in light of the new information? This situation comes up often enough in probability that there is


standard notation for it. A vertical bar “|” inside a probability statement separates information which is still uncertain (on the left of the bar) from information which has become known (on the right of the bar). In the current example Floyd wants to know P (Y = y|X = 0). This statement is read: “The probability that Y = y given that X = 0.” The updated probability is called a conditional probability because it has been conditioned on the given information.

How should the updated probability be computed? Imagine that the probabilities in the joint distribution we have been discussing came from a data set describing the last 1000 days of sales. The contingency table of sales counts would look something like

              Y (Floyd)
X (Jim)      0      1      2      3
   0        100    100    100    100     400
   1        100    200    100      0     400
   2        100     50      0      0     150
   3         50      0      0      0      50
            350    350    200    100    1000

If Floyd wants to estimate P (Y |X = 0), he can simply consider the 400 days when Jim sold zero cars, ignoring the rest. That is, he can normalize the (X = 0) row of the table by dividing everything in that row by 400 (instead of dividing by 1000, as he would to get the joint distribution). If Floyd didn’t have the original counts he could still do the normalization, he would simply do it using probabilities instead of counts. This thought experiment justifies the definition of conditional

probability

    P(Y = y \mid X = x) = \frac{P(Y = y, X = x)}{P(X = x)} \qquad (2.2)

Notice that the denominator of equation (2.2) does not depend on y. It is simply a normalizing factor. Also, notice that if you summed the numerator over all possible values of y, you would get P (X = x) in the numerator and denominator, so the answer would be 1. The equation simply says to take the appropriate row or column of the joint distribution and normalize it so that it sums to 1. We can easily compute all of the possible conditional distributions that Floyd would face if he were told X = 0, 1, 2, or 3, and the conditional distributions that Jim would face if he were told Floyd’s sales.
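
In code, conditioning really is just row (or column) normalization. A minimal sketch (added for illustration), reusing the joint table defined earlier:

    # A sketch of conditioning as normalization: P(Y = y | X = x) = P(x, y) / P(X = x).
    joint = [
        [0.10, 0.10, 0.10, 0.10],
        [0.10, 0.20, 0.10, 0.00],
        [0.10, 0.05, 0.00, 0.00],
        [0.05, 0.00, 0.00, 0.00],
    ]

    def conditional_given_x(x):
        """Floyd's sales distribution given that Jim sells x cars."""
        row = joint[x]
        total = sum(row)                 # P(X = x), the normalizing factor
        return [p / total for p in row]

    print(conditional_given_x(0))        # [0.25, 0.25, 0.25, 0.25], matching the table below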


Floyd’s conditional probabilities given Jim’s sales, P (Y |X):

              Y (Floyd)
X (Jim)      0      1      2      3
   0        .25    .25    .25    .25   1.00
   1        .25    .50    .25    .00   1.00
   2        .67    .33    .00    .00   1.00
   3       1.00    .00    .00    .00   1.00

Jim’s conditional probabilities given Floyd’s sales, P (X|Y ):

              Y (Floyd)
X (Jim)      0      1      2      3
   0        .29    .29    .50   1.00
   1        .29    .57    .50    .00
   2        .29    .14    .00    .00
   3        .13    .00    .00    .00
           1.00   1.00   1.00   1.00

So what does the information that X = 0 mean to Floyd? If we compare his marginal sales distribution to his conditional distribution given X = 0

No information      .35   .35   .20   .10
Jim sells 0 cars    .25   .25   .25   .25

it appears (unsurprisingly) that Floyd has a better chance of having a big sales day if Jim sells zero cars.

Putting It All Together

Let’s pause to summarize the probability jargon that we’ve introduced in this section. A joint distribution P (X,Y ) summarizes how two random variables vary simultaneously. A marginal distribution describes variation in one random variable, ignoring the other. A conditional distribution describes how one random variable varies if the other is held fixed at some specified value. If you are given a joint distribution you can derive any conditional or marginal distributions of interest. However, to compute the joint distribution you need to have the marginal distribution of one variable, and all conditional distributions of the other. This is a consequence of the definition of conditional probability (equation 2.2) which is sometimes stated as the probability multiplication rule.

P(X, Y) = P(Y|X) P(X) = P(X|Y) P(Y).     (2.3)

Equations (2.2) and (2.3) are the same; just multiply both sides of (2.2) by P(X = x). However, equation (2.3) is more suggestive of how probability models are actually built. It is usually harder to think about how two (or more) things vary simultaneously than it is to think about how one of them would behave if we knew the other. Thus most probability distributions are created by considering the marginal distribution of X, and then considering the conditional distribution of Y given X. We will illustrate this procedure in Section 2.2.3.
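To make these definitions concrete, here is a minimal sketch (in Python, which is not otherwise used in these notes) that recovers the joint, marginal, and conditional distributions from the 1000-day table of sales counts above. The variable names (counts, joint, and so on) are our own.

# Sales counts: rows are Jim (X = 0..3), columns are Floyd (Y = 0..3),
# taken from the 1000-day contingency table above.
counts = [
    [100, 100, 100, 100],
    [100, 200, 100,   0],
    [100,  50,   0,   0],
    [ 50,   0,   0,   0],
]
total = sum(sum(row) for row in counts)                        # 1000 days

# Joint distribution: divide every cell by the grand total.
joint = [[c / total for c in row] for row in counts]

# Marginals: sum rows for Jim, sum columns for Floyd.
p_x = [sum(row) for row in joint]                              # [.40, .40, .15, .05]
p_y = [sum(joint[x][y] for x in range(4)) for y in range(4)]   # [.35, .35, .20, .10]

# Conditional distribution of Y given X = 0: normalize the X = 0 row (equation 2.2).
p_y_given_x0 = [joint[0][y] / p_x[0] for y in range(4)]        # [.25, .25, .25, .25]

print(p_x, p_y, p_y_given_x0)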


2.2.2 Bayes’ Rule

Probability distributions are a way of summarizing our beliefs about uncertain situations. Those beliefs change when we observe relevant evidence. The method for updating our beliefs to reflect the new evidence is called Bayes' rule.

Suppose we are unsure about a proposition U which can be true or false. For example, maybe U represents the event that tomorrow will be an up day on the stock market, and notU means that tomorrow will be a down day. Historically, 53% of days have been up days, and 47% have been down days, so we start off believing that P(U) = .53.¹ But then we find out that the leading firm in the technology sector has filed a very negative earnings report just as the market closed today. Surely that will have an impact on the market tomorrow. Let's call this new evidence E and compute P(U|E) ("the probability of U given E"), our updated belief about the likelihood of an up day tomorrow in light of the new evidence.

Bayes’ rule says that the updated probability is computed using the followingformula:

P(U|E) = P(E|U) P(U) / P(E)
       = P(E|U) P(U) / [P(E|U) P(U) + P(E|notU) P(notU)].     (2.4)

The first line here is just the definition of conditional probability. If you know P(E) and P(U, E) then Bayes' rule is straightforward to apply. The second line is there in case you don't have P(E) already computed. You might recognize it as equation (2.1), which we encountered when discussing marginal probabilities. If not, then you should be able to convince yourself of the relationship P(E) = P(E|U) P(U) + P(E|notU) P(notU) by looking at Figure 2.2.

An Example Calculation Using Bayes' Rule

In order to evaluate Bayes' rule we need to evaluate P(E|U), the probability that we would have seen evidence E if U were true. In our example this is the probability that we would have seen a negative earnings report by the leading technology firm if the next market day were to be an up day. We could obtain this quantity by looking at all the up days in market history and computing the fraction of them that were preceded by negative earnings reports. Suppose that number is P(E|U) = 1% = 0.010. While we're at it, we may as well compute the percentage of down days (notU) preceded by negative earnings reports. Suppose that number is

¹These numbers are based on daily returns from the S&P 500, which are plotted in Figure 3.4 on page 28.


Figure 2.1: The Reverend Thomas Bayes, 1702–1761. He's even older than that Gauss guy in Figure 2.10.

Figure 2.2: Venn diagram illustrating the denominator of Bayes' rule: the probability of E is the probability of "E and U" plus the probability of "E and notU".

P(E|notU) = 1.5% = 0.015. It looks like such an earnings report is really unlikely regardless of whether or not we're in for an up day tomorrow. However, the report is certainly less likely to happen under U than notU. Bayes' rule tells us that the probability of an up day tomorrow, given the negative earnings report today, is

P(U|E) = P(E|U) P(U) / [P(E|U) P(U) + P(E|notU) P(notU)]
       = (.010)(.53) / [(.010)(.53) + (.015)(.47)]
       = 0.429.

Keeping It All Straight

Bayes’ rule is straightforward mathematically, but it can be confusing because thereare several pieces to the formula that are easy to mix up. The formula for Bayes’rule would be a lot simpler if we didn’t have to worry about the denominator. Notice

Page 38: Stats

26 CHAPTER 2. PROBABILITY BASICS

that

P (U |E) =P (E|U)P (U)

P (E|U)P (U) + P (E|notU)P (notU)

and

P (notU |E) =P (E|notU)P (notU)

P (E|U)P (U) + P (E|notU)P (notU)

both have the same denominator. When we’re evaluating Bayes’ rule we need tocompute P (E|U) and P (E|notU) to get the denominator anyway, so what if we justwrote the calculation as

P (U |E) ∝ P (E|U)P (U).

The ∝ sign is read "is proportional to," which just means that there is a constant multiplying factor which is too big a bother to write down. We can recover that factor because the probabilities P(U|E) and P(notU|E) must sum to one. Thus, if the equation for Bayes' rule seems confusing, you can remember it as the following procedure.

1. Write down all possible values for U in a column on a piece of paper.

2. Next to each value write P(U), the probability of U before you learned about the new evidence. P(U) is sometimes called the prior probability.

3. Next to each prior probability write down the probability of the evidence if U had taken that value. This is sometimes called the likelihood of the evidence.

4. Multiply the prior times the likelihood, and sum over all possible values of U. This sum is the normalizing constant P(E) from equation (2.4).

5. Divide by P (E) to get the posterior probability P (U |E) = P (U)P (E|U)/P (E).

This procedure is summarized in the table below.

        prior   likelihood   Pri*Like   posterior
Up      0.53    0.010        0.00530    0.4291498 = 0.00530/0.01235
Down    0.47    0.015        0.00705    0.5708502 = 0.00705/0.01235
                             -------
                             0.01235

Once you have internalized either equation (2.4) or the five step procedure listed above you can remember them as: "The posterior probability is proportional to the prior times the likelihood."
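The five-step procedure translates almost line for line into code. The sketch below (Python; the function name bayes_table is our own) reproduces the up-day/down-day table above and works for any discrete set of hypotheses.

def bayes_table(prior, likelihood):
    """Posterior probabilities from priors and likelihoods keyed by hypothesis
    (steps 1-5 of the procedure above)."""
    pri_like = {h: prior[h] * likelihood[h] for h in prior}   # prior times likelihood
    p_e = sum(pri_like.values())                              # normalizing constant P(E)
    return {h: pl / p_e for h, pl in pri_like.items()}        # divide by P(E)

# Prior from historical up/down frequencies; likelihood is the chance of the
# negative earnings report under each hypothesis.
posterior = bayes_table(prior={"Up": 0.53, "Down": 0.47},
                        likelihood={"Up": 0.010, "Down": 0.015})
print(posterior)   # roughly {'Up': 0.43, 'Down': 0.57}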


Why Bayes’ Rule is Important

The first time you see Bayes' rule it seems like a piece of trivia. After all, it is nothing more than a restatement of the "multiplication rule" in equation (2.3) (which was a restatement of equation (2.2)). However, it turns out that Bayes' rule is the foundation of rational decision making, and may well be the e = mc² of the 21st century.

One example where Bayes' theorem has made a huge impact is "artificial intelligence," which means programming a computer to make intelligent-seeming decisions about complex problems. In order to do that you need to have some way to mathematically express what a computer should "believe" about a complex scenario. The computer also needs to "learn" as new information comes in. The computer's beliefs about the complex scenario are described using a complex probability model. Then Bayes' theorem is used to update the probability model as the computer "learns" about its surroundings. We will see several examples of Bayesian learning in Chapter 3.

2.2.3 A “Real World” Probability Model

The preceding sections have illustrated some of the issues that can arise when two uncertain quantities are considered. In the interest of simplicity we have dealt mainly with "toy" examples, which can mask some of the issues that come up in more realistic settings. Let's work on building a realistic probability model for a familiar process: the daily returns of the S&P 500 stock market index. That sounds a bit daunting, so let's limit the complexity of our task by only considering whether each day's returns are "up" (positive return) or "down" (negative return). We want our model to compute the probability that the next n days will follow some specified sequence (e.g. with n = 4 we want to compute P(up, up, down, up)), and we want it to work with any value of n.

One thing worth noticing is that the terms joint, conditional, and marginal become a bit ambiguous when there are several random variables floating about. For example, suppose stock market returns over the next 4 days are denoted by X1, . . . , X4. Suppose we're told that day 1 will be an "Up" day, and we want to consider what happens on days 2 and 3. Then P(X2, X3|X1) is a joint, marginal, and conditional distribution all at the same time. It is "joint" because it considers more than one random thing (X2 and X3). It is "conditional" because something formerly random (X1) is now known. It is "marginal" because it ignores something random (X4).

The second thing we notice is that the "probability multiplication rule" starts to look scary. When applied to many random variables, the multiplication rule becomes


Figure 2.3: Daily returns for the S&P 500 market index. The vertical axis excludes a few outliers (notably 10/19/1987) that obscure the pattern evident in the remainder of the data.

P(X1, . . . , Xn) = P(X1) × P(X2|X1) × P(X3|X2, X1) × · · · × P(Xn−1|Xn−2, . . . , X1) × P(Xn|Xn−1, . . . , X1).     (2.5)

That is, you can factor the joint distribution P(X1, . . . , Xn) by multiplying the conditional distributions of each Xi given all previous X's. Why is that scary? Remember that each of the random variables can only assume one of two values: Up or Down. We can come up with P(X1) simply enough, just by counting how many up and down days there have been in the past. These probabilities turn out to be

x             Down     Up
P(Xi = x)     0.474    0.526

Finding P(X2|X1) is twice as much work: we have to count how many (UU), (UD), (DU), and (DD) transitions there were. After normalizing the transition counts we get the conditional probabilities

                 Xi = Down      Up
Xi−1 = Down        0.519      0.481    1.00
Xi−1 = Up          0.433      0.567    1.00


Finding P(X3|X2, X1) is twice as much work as P(X2|X1): we need to find the number of times each pattern (DDD), (DDU), (DUD), (DUU), (UDD), (UDU), (UUD), (UUU) was observed.

Xi−2     Xi−1       Xi = Down      Up
Down     Down         0.501      0.499    1.00
Down     Up           0.412      0.588    1.00
Up       Down         0.539      0.461    1.00
Up       Up           0.449      0.551    1.00

Notice how each additional day we wish to consider doubles the amount of work we need to do to derive our model. This quickly becomes an unacceptable burden. For example, if n = 20 we would have to compute over one million conditional probabilities. That is far too many to be practical, especially since there are only 14,000 days in the data set. The obvious solution is to limit the amount of dependence that we are willing to consider. The two most common solutions in practice are to assume independence or Markov dependence.
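The counting described above is easy to automate. Here is a minimal sketch (Python; the short toy sequence and the function name estimate_probs are our own, standing in for the actual S&P 500 data) that estimates the marginal distribution and the first-order transition probabilities from a sequence of 'U'/'D' labels.

from collections import Counter

def estimate_probs(days):
    """Estimate P(X) and P(X_i | X_{i-1}) by counting a sequence of 'U'/'D' labels."""
    n = len(days)
    marginal = Counter(days)
    p_marginal = {s: marginal[s] / n for s in "DU"}

    # Count (yesterday, today) transitions, then normalize each row.
    pairs = Counter(zip(days[:-1], days[1:]))
    p_transition = {}
    for prev in "DU":
        row_total = pairs[(prev, "D")] + pairs[(prev, "U")]
        p_transition[prev] = {s: pairs[(prev, s)] / row_total for s in "DU"}
    return p_marginal, p_transition

days = list("UUDUDDUUUDUDDUUUDDUU")   # toy stand-in for the real return data
print(estimate_probs(days))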

Independence

Two random variables are independent if knowing the numerical value of one does not change the distribution you would use to describe the other. Translated into "probability speak," independence means that P(Y|X) = P(Y). If we were to assume that returns on the S&P 500 were independent, then we could compute the probability that the next three days' returns are (UUD) (two up days followed by a down day) as follows. The general multiplication rule says that

P(X1, X2, X3 = UUD) = P(X1 = U) × P(X2 = U|X1 = U) × P(X3 = D|X2 = U, X1 = U).     (2.6)

If we assume that X1, X2, and X3 are independent, then P(X2|X1) = P(X2) and P(X3|X1, X2) = P(X3), so the probability becomes

P(UUD) = P(X1 = U) × P(X2 = U) × P(X3 = D)
       = (.526)(.526)(.474)
       = 0.131.     (2.7)

The numbers here come from the marginal distribution of X1 on page 28. We have assumed that the marginal distribution does not change over time, which is a common assumption in practice.


Figure 2.4: (a) Diameters of automobile crank shafts. (b) International airline passenger traffic. The crank shaft diameters appear to be independent. The airline passenger series exhibits strong dependence.

Independence is a strong assumption, but it is reasonable in many circumstances. Many of the statistical procedures we will discuss later assume independent observations. You can often plot your data, as we have done in Figure 2.4, to check whether it is reasonable to assume independence. The left panel shows data from a production line which produces crank shafts to go in automobile engines. The crank shafts should ideally be 815 thousandths of an inch in diameter, but there will be some variability from shaft to shaft. Each day five shafts are collected and measured during quality control checks. Some shafts measure greater than 815, and some lower. But it does not seem like one shaft being greater or less than 815 influences whether the next shaft is likely to be greater or less than 815. That's what it means for random variables to be independent.

Contrast the shaft diameter data set with the airline passenger data set shown in the right panel of Figure 2.4. The airline passenger data series exhibits an upward trend over time, and it also shows a strong seasonal pattern. The passenger counts in any particular month are very close to the counts in neighboring months. This is an example of very strong dependence between the observations in this series.

Markov Dependence

Independence makes probability calculations easy, but it is sometimes implausible. If you think that Up days tend to follow Up days on the stock market, and vice versa, then you should feel uncomfortable about assuming the returns to be independent. The simplest way to allow dependence across time is by assuming Markov dependence. Mathematically, Markov dependence can be expressed

P (Xn|Xn−1, . . . ,X1) = P (Xn|Xn−1). (2.8)


Simply put, Markov dependence assumes that today's value depends on the past only through yesterday's value; once yesterday is known, the days before that carry no additional information. A sequence of random variables linked by Markov dependence is known as a Markov chain.

Let’s suppose that the sequence of S&P 500 returns follows a Markov chain andcompute the probability that the next 4 days X1, . . . ,X4 follow the pattern UUDU .The general multiplication rule says that

P(UUDU) = P(X1 = U) × P(X2 = U|X1 = U) × P(X3 = D|X2 = U, X1 = U) × P(X4 = U|X3 = D, X2 = U, X1 = U).     (2.9)

Markov dependence means that P(X3|X1, X2) = P(X3|X2) and P(X4|X3, X2, X1) = P(X4|X3), so the probability becomes

P(UUDU) = P(X1 = U) × P(X2 = U|X1 = U) × P(X3 = D|X2 = U) × P(X4 = U|X3 = D)
        = (.526)(.567)(.433)(.481)
        = 0.062.     (2.10)

Again, the numbers here are based on the distributions on page 28.
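The two calculations differ only in which probabilities get multiplied together. A minimal sketch (Python; the function names are our own, and the hard-coded tables are the estimates quoted above):

P_MARGINAL = {"D": 0.474, "U": 0.526}                  # P(X_i)
P_TRANSITION = {"D": {"D": 0.519, "U": 0.481},         # P(X_i | X_{i-1})
                "U": {"D": 0.433, "U": 0.567}}

def prob_independent(seq):
    """P(sequence) assuming independent days."""
    p = 1.0
    for s in seq:
        p *= P_MARGINAL[s]
    return p

def prob_markov(seq):
    """P(sequence) under Markov dependence: marginal for day 1,
    then one transition probability for each later day."""
    p = P_MARGINAL[seq[0]]
    for prev, cur in zip(seq[:-1], seq[1:]):
        p *= P_TRANSITION[prev][cur]
    return p

print(prob_independent("UUD"))   # about 0.131, matching equation (2.7)
print(prob_markov("UUDU"))       # about 0.062, matching equation (2.10)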

Which Model Fits Best?

Now we have an embarrassment of riches. We have two probability models for the S&P 500 series. Which one fits best? There is a financial/economic theory called the random walk hypothesis that suggests the independence model should be the right answer. The random walk hypothesis asserts that markets are efficient, so if there were day-to-day dependence in returns, arbitrageurs would enter and remove it. Even so, the Markov chain model has considerable intuitive appeal.

How can we tell which model fits best? One way is to use Bayes' rule. The thing we're uncertain about here is which model is the right one. Let's call the model M. The evidence E that we observe is the sequence of up and down days in the S&P 500 data. To use Bayes' rule we need the prior probabilities P(M = Markov) and P(M = Indep) as well as the likelihoods: P(E|M = Markov) and P(E|M = Indep).

Before looking at the data we might have no reason to believe in one model over another, so maybe P(M = Markov) = P(M = Indep) = .50. This is clearly a subjective judgment, and we need to check its impact on our final analysis, but let's go with the 50/50 prior for now.

The likelihoods are easy enough to compute. We just extend the computations earlier in this section to cover the whole data set. We end up with


Model           Likelihood
Markov          e^(−9659)
Independence    e^(−9713)

The e’s show up because we had to compute the likelihood on the log scale fornumerical reasons.2 Don’t be put off by the fact that the likelihoods are such smallnumbers. There are a lot of possible outcomes over the next 14000 days of the stockmarket. The chance that you will correctly predict all of them simultaneously isvery small (like e−9659). What you should observe is that the data are e53 timesmore likely under the Markov model than under the independence model.

If we plug these numbers into Bayes’ rule we get

P(M = Markov|E) = (.5)e^(−9659) / [(.5)e^(−9659) + (.5)e^(−9713)] = 1 / (1 + e^(−53)) ≈ 1.

Or, equivalently

P(M = Indep|E) = (.5)e^(−9713) / [(.5)e^(−9659) + (.5)e^(−9713)] = e^(−53) / (1 + e^(−53)) ≈ 0.

The evidence in favor of the Markov model is overwhelming. When we say that P(M = Markov|E) ≈ 1 there is an implicit assumption that the Markov and Independence models are the only ones to be considered. There are other models that fit these data even better than the Markov chain, but given a choice between the Markov chain and the Independence model, the Markov chain is the clear winner. With such strong evidence in the likelihood, the prior probabilities that we chose make little difference. For example, if we were strong believers in the random walk hypothesis we might have had a prior belief that P(M = Indep) = .999 and P(M = Markov) = .001. In that case we would wind up with

P(M = Markov|E) = (.001)e^(−9659) / [(.001)e^(−9659) + (.999)e^(−9713)] = 1 / (1 + (999)e^(−53)) = 1 / (1 + e^(−46)) ≈ 1.

If there is strong evidence in the data (as there is here) then Bayes' rule forces rational decision makers to converge on the same conclusion even if they begin with very different prior beliefs.

²Computers store numbers using a finite number of 0's and 1's. When the stored numbers get so small the computer tends to give up and call the answer "0." This is an easy problem to get around. Just add log probabilities instead of multiplying raw probabilities.
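The footnote's advice is easy to act on. Here is a minimal sketch of the model comparison carried out entirely on the log scale (Python; the helper name log_posterior is our own, and the log likelihoods are the rounded values quoted above):

import math

def log_posterior(log_prior, log_likelihood):
    """Posterior model probabilities computed on the log scale.
    Subtracting the largest value before exponentiating avoids underflow."""
    log_unnorm = {m: log_prior[m] + log_likelihood[m] for m in log_prior}
    biggest = max(log_unnorm.values())
    unnorm = {m: math.exp(v - biggest) for m, v in log_unnorm.items()}
    total = sum(unnorm.values())
    return {m: u / total for m, u in unnorm.items()}

log_like = {"Markov": -9659.0, "Indep": -9713.0}
print(log_posterior({m: math.log(0.5) for m in log_like}, log_like))
# The Markov model gets essentially all of the posterior probability,
# echoing the calculation above.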


2.3 Expected Value and Variance

Let’s return to the auto sales probability distribution from Section 2.1. It is prettysimple, but Section 2.2 showed that probability distributions can get sufficientlycomplicated that we may wish to summarize them somehow instead of workingwith them directly. We said earlier that you can think of a probability distributionas a histogram of a long series of future data. So it makes sense that we mightwant to summarize a probability distribution using tools similar to those we usedto summarize data sets. The most common summaries of probability distributionsare their expected value (aka their mean), and their variance.

2.3.1 Expected Value

One way we can guess the value that a random variable will assume is to look at its expected value, E(X). The expected value of a random variable is its long run average. If you repeated the experiment a large number of times and took the average of all the observations you got, that average would be about E(X). We can calculate E(X) using the formula

E(X) = ∑_x x P(X = x).

Returning to the used car example, we don't know how many cars we are going to sell tomorrow, but a good guess is

E(X) = (0 × 0.1) + (1 × 0.2) + (2 × 0.4) + (3 × 0.3) = 1.9

Of course X will not be exactly 1.9, so what is "good" about it? Suppose you face the same probability distribution for sales each day, and think about the average number of cars per day you will sell over the next 1000 days. On about 10% of the days you would sell 0 cars, about 20% of the time you would sell 1 car, etc. Add up the total number of cars you expect to sell (roughly 100 0's, 200 1's, 400 2's, and 300 3's), and divide by 1000. You get 1.9, E(X), the long run average value. Note we sometimes write E(X) as µ. It means exactly the same thing.
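A quick sketch of the same calculation (Python; the dictionary of sales probabilities is the distribution from Section 2.1):

# P(X = x) for daily car sales.
sales_dist = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.3}

# Expected value: sum of x * P(X = x).
mean = sum(x * p for x, p in sales_dist.items())
print(mean)   # 1.9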

The E() operator is seductive³ because it takes something that you don't know, the random variable X, and replaces it with a plain old number like 1.9. Thus it is tempting to stick in 1.9 wherever you see X written. Don't! Remember that 1.9 is only the long run average for the number of cars sold per day, while X is specifically the number you will sell tomorrow (which you won't know until tomorrow).

The expected value operator has some nice properties that come in handy when dealing with sums (possibly weighted sums) of random variables.

³It is the Austin Powers of operators.


If a and b are known constants (weights) and X and Y are random variables then the following rules apply.

• E(aX + bY) = aE(X) + bE(Y). Example: E(3X + 4Y) = 3E(X) + 4E(Y).

• E(aX + b) = aE(X) + b. Example: E(3X + 4) = 3E(X) + 4.

We will illustrate these rules a little later in Section 2.3.3.

2.3.2 Variance

Expected value gives us a guess for X. But how good is the guess? If X is always very close to E(X) then it will be a good guess, but it is possible that X is often a long way away from E(X). For example, X may be an extremely large value half the time and a very small value the rest of the time. In this case the expected value will be half way between, but X is always a long way away. We need a measure of how close X typically is to E(X) so we can judge how good our guess is. This is what we use the variance, Var(X), for. Var(X) is often denoted by σ², but they mean the same thing. It is calculated using the formula

Var(X) = σ² = ∑_x (x − µ)² P(X = x) = E[(X − µ)²].

Remember that E() just says "take the average," so the variance of a random variable is the average squared deviation from the mean, just like the variance of a data set. If you like, you can think of x̄ and s² as the mean and variance of past data (i.e. data which has already "happened" and is in your data set), and E(X) and Var(X) as the mean and variance of a long series of future data that you would see if you let your data producing process run on for a long time.

To illustrate the variance calculation, the variance of our auto sales random variable is 0.89, calculated as follows.

x    P(X = x)    (x − µ)    (x − µ)²    P(X = x)(x − µ)²
0      0.1        −1.9        3.61          0.3610
1      0.2        −0.9        0.81          0.1620
2      0.4         0.1        0.01          0.0040
3      0.3         1.1        1.21          0.3630
       1.0                                  0.8900


Variance has a number of nice theoretical properties, but it is not very easy to interpret. What does a variance of .89 mean? It means that the average squared distance of X from its mean is 0.89 "cars squared." Just as in Chapter 1, variance is hard to interpret because it squares the units of the problem.

Standard Deviation

The standard deviation of a random variable is defined as

SD(X) = σ = √Var(X).

Taking the square root of variance restores the natural units of the problem. For example, the standard deviation of the above random variable is

SD(X) = √0.89 = 0.94 cars.

Standard deviations are easy to calculate, assuming you've already calculated Var(X), and they are a lot easier to interpret. A standard deviation of 0.94 means (roughly) that the typical distance of X from its mean is about 0.94. If SD(X) is small then when X "happens" it will be close to E(X), so E(X) is a good guess for X. There is no threshold for SD(X) to be considered small. The smaller it is, the better guesser you can be.
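Extending the expected-value sketch above, the variance and standard deviation of the same distribution follow directly from their definitions (Python; sales_dist is the same dictionary as before):

import math

sales_dist = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.3}

mu = sum(x * p for x, p in sales_dist.items())                    # E(X) = 1.9
variance = sum(p * (x - mu) ** 2 for x, p in sales_dist.items())  # Var(X) = 0.89
sd = math.sqrt(variance)                                          # about 0.94 cars

print(mu, variance, round(sd, 2))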

Rules for Variance

We mentioned that variance obeys some nice math rules. Here's the main one. Note that the following rule applies to Var ONLY if X and Y are independent.⁴ That contrasts with the rules for expected value on page 34, which apply all the time. Remember that X and Y are random variables, and a and b are known constants.

Var(aX + bY) = a²Var(X) + b²Var(Y)     (if X and Y are independent)

Here are some examples:

• Var(3X + 4Y) = 3²Var(X) + 4²Var(Y) = 9Var(X) + 16Var(Y)

• Var(X + Y) = Var(X) + Var(Y)     (a = 1, b = 1)

• Var(X − Y) = Var(X) + Var(Y)     (a = 1, b = −1; the −1 gets squared)

⁴Section 3.2.3 explains what to do if X and Y are dependent. It involves a new wrinkle called covariance that would just be a distraction at this point.


Note that there are no rules for manipulating standard deviation. To determine SD(aX + bY) you must first calculate Var(aX + bY) and then take its square root.

2.3.3 Adding Random Variables

Expected value and variance are very useful tools when you want to build more complicated random variables out of simpler ones. For example, suppose we know the distribution of daily car sales (from Section 2.1) and that car sales are independent from one day to the next. We're interested in the distribution of weekly (5 day) sales, W. In principle we could list out all the possible values that W could assume (in this case from 0 to 15), then think of all the possible ways that daily sales could happen. For example, to compute P(W = 6) we would have to consider all the different ways that daily sales could total up to 6 (3 on the first day and 3 on the last, one each day except for two on Thursday, and many, MANY more). It is a hard thing to do because there are so many cases to consider, and this is just a simple toy problem!

It is much easier to figure out the mean and variance of W using the rules from the previous two sections. The trick is to write W = X1 + X2 + X3 + X4 + X5, where X2 is the number of cars sold on day 2, and similarly for the other X's. Then

E(W) = E(X1 + X2 + X3 + X4 + X5)
     = E(X1) + E(X2) + E(X3) + E(X4) + E(X5)
     = 1.9 + 1.9 + 1.9 + 1.9 + 1.9
     = 9.5

and

Var(W) = Var(X1 + X2 + X3 + X4 + X5)
       = Var(X1) + Var(X2) + Var(X3) + Var(X4) + Var(X5)
       = .89 + .89 + .89 + .89 + .89
       = 4.45,

which means that SD(W) = √4.45 = 2.11. That was MUCH easier than actually figuring out the distribution of W and tabulating E(W) and Var(W) directly. What if we cared about the weekly sales figure expressed as a daily average instead of the weekly total? Then we just consider W/5, with E(W/5) = 9.5/5 = 1.9 and Var(W/5) = 4.45/5² = .178, so SD(W/5) = √.178 = .422.
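These rules are easy to check by simulation. The sketch below (Python; the seed and the 100,000-week sample size are arbitrary choices of ours) draws many weeks of independent daily sales and compares the simulated mean and standard deviation of W with 9.5 and 2.11.

import random
import statistics

random.seed(1)
values, probs = [0, 1, 2, 3], [0.1, 0.2, 0.4, 0.3]

# Each simulated week is the sum of 5 independent daily sales figures.
weeks = [sum(random.choices(values, weights=probs, k=5)) for _ in range(100_000)]

print(statistics.mean(weeks))    # close to E(W) = 9.5
print(statistics.stdev(weeks))   # close to SD(W) = 2.11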

Of course we still don’t know the entire distribution of W . All we have are thesetwo useful summaries. It would be nice if there were some way to approximate the


Don’t Get Confused! 2.2 The difference between X1 +X2 and 2X.

In examples like our weekly auto sales problem many people are tempted to write W = 5X instead of W = X1 + · · · + X5. The temptation comes from the fact that all five daily sales figures come from the same distribution, so they have the same mean and the same variance. However, they are not the same random variable, because you're not going to sell exactly the same number of cars each day. The distinction makes a practical difference in the variance formula (among other places). If Var(X1) = σ², then Var(5X1) = 25σ², but Var(X1 + · · · + X5) = 5σ². Is this sensible? Absolutely. If you were gambling in Las Vegas, then X1 + · · · + X5 might represent your winnings after 5 $1 bets while 5X would represent your winnings after one $5 bet. The one big bet is much riskier (i.e. has a larger variance) than the five smaller ones.

distribution of W based just on its mean and variance. We'll get one in Section 2.5, but we have one more big idea to introduce first.

2.4 The Normal Distribution

Thus far we have restricted our attention to random variables that could assume a denumerable set of values. While that is natural for some situations, we will often wish to model continuously varying phenomena such as fluctuations in the stock market. It is impractical to list out all possible values (and corresponding probabilities) of a continuously varying process. Mathematical probability models are usually used instead. There are many probability models out there,⁵ but the most common is the normal distribution, which we met briefly in Chapter 1. We will use normal probability calculations at various stages throughout the rest of the course. If the distribution of X is normal with mean µ and standard deviation σ then we write

X ∼ N (µ, σ).

This expression is read "X is normally distributed with mean µ and standard deviation σ." Section 7.3 describes some non-normal probability models. If you thought X obeyed one of them you would write E or P (or some other letter) instead of N.

⁵A few are described in Section 7.3.


Calculating probabilities for a normal random variable

The normal table is set up⁶ to answer "what is the probability that X is less than z standard deviations above the mean?" So the first thing that must be done when calculating normal probabilities is to change the units of the problem into "standard deviations above the mean." You do this by subtracting the mean and dividing by the standard deviation. Thus you replace the probability calculation P(X < x) with the equivalent event

P((X − µ)/σ < (x − µ)/σ) = P(Z < z).

Subtracting the mean and dividing by the standard deviation transforms X into a standard normal random variable Z. The phrase "standard normal" just means that Z has a mean of 0 and standard deviation 1 (i.e. Z ∼ N(0, 1)). Subtracting the mean and dividing by the standard deviation changes the units of x from whatever they were (e.g. "cell phone calls") to the units of z, which are "standard deviations above the mean." Subtracting the mean and dividing by the standard deviation is known as z-scoring. Figure 2.5 illustrates the effect that z-scoring has on the normal distribution.

Procedurally, normal probabilities of the form P(X ≤ x) can be calculated using the following two-step process.

1. Calculate the z-score using the formula z = (x − µ)/σ.

Note that z is the number of standard deviations that x is above µ.

2. Calculate the probability by finding z in the normal table.

For example, if X ∼ N(3, 2) (i.e. µ = 3, σ = 2) and we want to calculate P(X ≤ 5) then

z = (5 − 3)/2 = 2/2 = 1.

In other words we want to know the probability that X is less than one standard deviation above its mean. When we look up 1 in the normal table we see that P(X < 5) = P(Z < 1) = 0.8413.⁷

⁶Different books set up normal tables in different ways. You might have seen a normal table organized differently in a previous course, but the basic idea of how the tables are used is standard.

⁷We can afford to be sloppy with < and ≤ here because the probability that a normal random variable is exactly equal to any fixed number is 0.


Figure 2.5: Z-scoring. Suppose X ∼ N(3, 2). The left panel depicts P(X < 5). The right panel depicts P(Z < 1), where 1 is the z-score (5 − 3)/2. The only difference between the figures is a centering and rescaling of the axes.

About 84 in every 100 occurrences of a normal random variable will be less than one standard deviation above its mean. The table tells us P(Z < z), so if we want P(Z > z) we need to rewrite this as 1 − P(Z < z), i.e. P(X > 5) = 1 − P(X < 5) = 1 − 0.8413 = 0.1587. Because the normal distribution is symmetric, P(Z > z) = P(Z < −z). For example (draw a pair of pictures like Figure 2.5):

P (X > 1) = P (Z > −1)

= P (Z < 1) = 0.8413.

Finally P (Z < −z) = P (Z > z) = 1 − P (Z < z). For example:

P (X < 1) = P (Z < −1)

= 1 − P (Z < 1) = 0.1587

Further examples

1. If X ∼ N(5, √3), what is P(3 < X < 6)? This is the same as asking for P(X < 6) − P(X < 3), so we need to calculate two z-scores:

   z1 = (6 − 5)/√3 = 0.58,     z2 = (3 − 5)/√3 = −1.15


Figure 2.6: To compute the probability that a normal random variable is in an interval (a, b), compute P(X < b) and subtract off P(X < a). (a) P(3 < X < 6). (b) P(6 < X < 8).

Therefore (see Figure 2.6(a)),

P (X < 6) − P (X < 3) = P (Z < 0.58) − P (Z < −1.15)

= P (Z < 0.58) − P (Z > 1.15)

= P (Z < 0.58) − (1 − P (Z < 1.15))

= 0.7190 − (1 − 0.8749) = 0.5939.

2. If X ∼ N(5, √3), what is P(6 < X < 8)? This is the same as asking for P(X < 8) − P(X < 6), so we need to calculate two z-scores:

   z1 = (8 − 5)/√3 = 1.73,     z2 = (6 − 5)/√3 = 0.58

Therefore (see Figure 2.6(b)),

P (X < 8) − P (X < 6) = P (Z < 1.73) − P (Z < 0.58)

= 0.9582 − 0.7190

= 0.2392.
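Normal tables can also be replaced by a few lines of code. A minimal sketch (Python, using the standard library's error function; normal_cdf is our own helper name) reproduces the two examples above.

import math

def normal_cdf(x, mu, sigma):
    """P(X < x) for X ~ N(mu, sigma), via Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))."""
    z = (x - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 5, math.sqrt(3)
print(normal_cdf(6, mu, sigma) - normal_cdf(3, mu, sigma))   # about 0.594
print(normal_cdf(8, mu, sigma) - normal_cdf(6, mu, sigma))   # about 0.240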


Figure 2.7: Normal quantile plot for CEO ages.

Checking the Normality Assumption

The normal distribution comes up so often in a statistics class that it is important to remember that not all distributions are "normal." The most effective way to check whether a distribution is approximately normal is to use a normal quantile plot, otherwise known as a quantile-quantile plot or a Q-Q plot. Recall from Chapter 1 the data set listing the 800 highest paid CEO's in 1994. One of the variables that looked approximately normally distributed was CEO ages. Figure 2.7 shows the normal quantile plot for that data.

To form the normal quantile plot, the computer orders the 800 CEO's from smallest to largest. Then it figures out how small you would expect the smallest observation from 800 normal random variables to be. Then it figures out how small you would expect the next smallest observation to be, and so on. The actual data from each CEO is plotted against the data you would expect to see if the variable was normally distributed. If what you do see is about the same as what you would expect to see if the data were normal, then the dots in the normal quantile plot will follow an approximate straight line. (If both axes were on the same scale, then this line would be the 45 degree line.)
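The same construction can be reproduced outside of JMP. Here is a minimal sketch (Python with matplotlib; the simulated ages are a stand-in for the actual CEO data, which we do not reproduce here):

import random
import statistics
import matplotlib.pyplot as plt

random.seed(0)
ages = [random.gauss(56, 7) for _ in range(800)]   # stand-in for the 800 CEO ages

# Expected ordered values under normality: plug the plotting positions
# (i - 0.5)/n into the inverse normal CDF.
n = len(ages)
dist = statistics.NormalDist(statistics.mean(ages), statistics.stdev(ages))
expected = [dist.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

plt.scatter(expected, sorted(ages), s=5)   # dots should hug a straight line
plt.plot(expected, expected)               # reference line
plt.xlabel("Expected under normality")
plt.ylabel("Observed")
plt.show()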

The most interesting things in a normal quantile plot are the dots, although JMP provides some extra assistance in interpreting the plot. The reference line in the plot is the 45 degree line that shows where you would expect the dots to lie. The bowed lines on either side of the reference line are guidelines to help you decide how far the dots can stray from the reference line before you can claim a departure from normality.


Figure 2.8: Examples of normal quantile plots for non-normal data. (a) Skewness: CEO compensation (top 20 outliers removed). (b) Heavy tails: corporate profits.

Don't take one or two observations on the edge of the plot very seriously. If you notice a strong bend in the middle of the plot then that is evidence of non-normality.

It is easier to see departures from normality in normal quantile plots than in histograms or boxplots because histograms and boxplots sometimes mask the behavior of the variable in the tails of the distribution. Examples of non-normal variables from the CEO data set are shown in Figure 2.8. The histograms for these variables appear in Figure 1.5 on page 10.

2.5 The Central Limit Theorem

The central limit theorem (CLT) explains why the normal distribution comes up as often as it does. The CLT, which we won't state formally, says that the sum of several random variables has an approximately normal distribution. The approximation gets better as more variables are included in the sum. How many do you need before the CLT kicks in? The answer depends on how close the individual random variables that you're adding are to being normal themselves. If they're highly skewed (like CEO compensation) then you might need a lot. Otherwise, once you've got around 30 random variables in the sum you can feel pretty comfortable assuming the sum is normally distributed.

Recall our interest in the distribution of weekly car sales from Section 2.3.3.


Figure 2.9: Distribution of weekly car sales, and a normal approximation.

Figure 2.9 shows the actual distribution of weekly car sales (the histogram) along with the normal curve with the mean and standard deviation that we derived in Section 2.3.3. The fit is not perfect (the weekly sales distribution is skewed slightly to the left), but it is pretty close. The weekly sales distribution is the sum of only 5 random variables. The normal approximation would fit even better to the distribution of monthly sales.
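You can reproduce the spirit of Figure 2.9 by simulation. This sketch (Python; the seed and the 100,000-week sample size are arbitrary choices of ours) simulates many weeks of sales and prints the relative frequency of each weekly total next to the height of the approximating normal curve.

import math
import random

random.seed(2)
values, probs = [0, 1, 2, 3], [0.1, 0.2, 0.4, 0.3]
weeks = [sum(random.choices(values, weights=probs, k=5)) for _ in range(100_000)]

mu, sigma = 9.5, math.sqrt(4.45)   # mean and SD of W from Section 2.3.3

def normal_density(w):
    z = (w - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Relative frequency of each weekly total versus the height of the normal curve.
for w in range(16):
    freq = sum(x == w for x in weeks) / len(weeks)
    print(w, round(freq, 3), round(normal_density(w), 3))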

Figure 2.10: The German 10 Mark bank note, showing the normal distribution (faintly, but right in the center) and Carl Friedrich Gauss (1777–1855), who first derived it.

There are many phenomena in life that are the result of several small random components. The CLT explains why you would expect such phenomena to be normally distributed. There are a few caveats to the CLT which help explain why not every random variable is normally distributed. The random variables being added


are supposed to be independent and all come from the same probability distribution. In practice that isn't such a big deal. The CLT works as long as the dependence between the variables isn't too strong (i.e. they can't all be exactly the same number), and the random variables being added are on similar enough scales that one or two of them don't dominate the rest.

The normal distribution and the central limit theorem have had a huge impact on science. So much so that Germany placed a picture of the normal curve and its inventor, a German mathematician named Gauss, on their 10 Mark bank note (before they switched to the Euro, of course).


Chapter 3

Probability Applications

For many students who are learning about probability for the first time, the subject seems abstract and somehow divorced from the "real world." Nothing could be further from the truth. Probability models play a central role in several business disciplines, including finance, marketing, operations, and economics. This Chapter focuses on applying the probability rules learned in Chapter 2 to problems faced in these disciplines.

Obviously we won’t be able to go very deep into each area. Otherwise we won’thave time to learn what probability can tell us about statistics and data analysis,which is a central goal of this course. Our goal in this Chapter is to present a fewfundamental problems from basic business disciplines and to see how these problemscan be addressed using probability models. In the process we will learn more aboutprobability.

3.1 Market Segmentation and Decision Analysis

One of the basic tools in marketing is to identify market segments containing similar groups of potential customers. If you know the defining characteristics of a market segment then (hopefully) you can tailor a marketing strategy to each segment and do better than you could by applying the same strategy to the whole market.

Market segmentation is a good illustration of decision theory, or using probability to help make decisions under uncertain circumstances. The uncertainty comes from the fact that you don't know with absolute certainty the market segment to which each of your potential customers should belong. However, it is possible to assign each potential customer a distribution describing the probability of segment membership. Presumably this distribution depends on observable characteristics such as age, credit rating, etc. Decision theory is about translating each probability


distribution into an action. For example, you may have developed two marketing strategies for the new gadget your firm has developed. Strategy E targets "Early Adopters," and Strategy F targets "Followers." Which approach should you apply to Joe, who is 32 years old, makes between $60K-80K per year, and owns an Ipod?

3.1.1 Decision Analysis

Decision analysis is about making a trade off between the cost of making a bad decision and the probability of making a good one. To continue with the market segmentation example, suppose you've determined that there is a 20% chance that Joe is an Early Adopter, and an 80% chance that he is a Follower. Joe's age, income, and Ipod ownership were presumably used to arrive at these probabilities. Section 3.1.2 explores segment membership probabilities in greater detail. Intuitively it looks like you should treat Joe as a Follower, because that is the segment he is most likely to belong to. However, if Early Adopters are more valuable customers than Followers, it might make sense to treat Joe as belonging to something other than his most likely category.

Actions and Rewards

You can apply one of two strategies to Joe, and Joe can be in one of two categories. (More strategies and categories are possible, but let's stick to two for now.) In order to use decision analysis you need to know how valuable Joe will be to you under all four combinations. This information is often summarized in a reward matrix. A reward matrix explains what will happen if you choose a particular action when the world happens to be in a given state. For example, suppose each Early Adopter you discover is worth $1000, and each Follower is worth $100. Early Adopters are worth more because Followers will eventually copy them. However, to get these rewards you have to treat each group appropriately. If you treat an Early Adopter like a Follower then he may think your product isn't sufficiently "cutting edge" to warrant his attention. If you treat a Follower like an Early Adopter then he may decide your technology is too complicated. Thus you may have the following reward matrix.

               Early Adopter    Follower
Strategy E          1000            10
Strategy F           150           100

These are high stakes in that you lose about 90% of the customer's potential value if you choose the wrong strategy.


Risk Profile

The mechanics of decision theory involve combining the information in the reward matrix with a probability distribution about which state is correct. The result is a risk profile, a set of probability distributions describing the reward that you will experience by taking each action. You have two options with Joe, so the risk profile here involves two probability distributions. Recall that Joe is 80% likely to be a Follower, so if you treat him as an Early Adopter you have an 80% chance of making only $10, but a 20% chance of making $1000. If you treat him as a Follower you have an 80% chance of making $100, but a 20% chance of making $150. Thus, the risk profile you face is as follows.

Reward          $10    $100    $150    $1000
Strategy E       .8      0       0       .2
Strategy F        0     .8      .2        0

Choosing an Action

Once a risk profile is computed, your decision simply boils down to deciding which probability distribution you find most favorable. When decisions involve only a moderate amount of money, the most reasonable way to distinguish among the distributions in your risk profile is by their expected value. The expected return under Strategy E is (.8)(10) + (.2)(1000) = $208. The expected return under Strategy F is (.8)(100) + (.2)(150) = $110. So Joe should be treated as an Early Adopter, even though it is much more likely that he is a Follower.
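The same comparison in code (Python; the reward matrix and segment probabilities are the hypothetical values from the example above):

# reward[strategy][segment], plus Joe's segment membership probabilities.
reward = {"E": {"Early Adopter": 1000, "Follower": 10},
          "F": {"Early Adopter": 150,  "Follower": 100}}
p_segment = {"Early Adopter": 0.2, "Follower": 0.8}

# Expected reward of each strategy; choose the largest.
expected = {s: sum(p_segment[seg] * r for seg, r in rewards.items())
            for s, rewards in reward.items()}
best = max(expected, key=expected.get)

print(expected)   # {'E': 208.0, 'F': 110.0}
print(best)       # 'E' -- treat Joe as an Early Adopter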

Expected value is a good summary here because you are planning to market to a large population of people like Joe, about 20% of whom are Early Adopters. As you repeat the experiment of advertising to customers like Joe over a large population your average reward per customer will settle down close to the long run average, or expected value.

If a decision involves substantially greater sums of money, such as deciding whether your firm should merge with a large competitor, then expected value should not be the only decision criterion. For example, if the stakes above were changed from dollars to billions of dollars, representing the market value of the firm after the action is taken, then many people will find a guaranteed market value of $100-150 billion preferable to the high chance of $10 billion, even if there is a chance of a "home run" turning the company into a trillion dollar venture.

Is this realistic?

As with probability models, decision analysis is as realistic as its inputs. For decision analysis to produce good decisions you need realistic reward information and a


believable probability model describing the states of the unknown variables. The entries in the reward matrix often make decision analysis seem artificial. It is certainly not believable that every Early Adopter will be worth exactly $1,000, for example. However, experts in marketing science have reasonably sophisticated probability models that they can employ to model things like the amount and probability of a customer's purchase under different sets of conditions. Expected values from these types of models can be used to fill out a reward matrix with numbers that have some scientific validity. If actions are chosen based on the highest expected reward then it makes sense for the entries of the reward matrix to be expected values, because it is "legal" to take averages of averages. The details of the models are too complex to discuss in an introductory class, but they can be found in Marketing elective courses.

Decisions can also involve choosing more than one action. In fact, there is an entire discipline devoted to the theory of complex decisions that expands on the basic principles outlined above. Interested students can learn more in Operations Management elective courses.

3.1.2 Building and Using Market Segmentation Models

Although we will say nothing further about the models used to fill out the reward matrix, we can introduce a bit more realism about the probability of customers belonging to a particular market segment. One way to determine segment membership is simply to ask people if they engage in a particular activity. Collect this type of information from a subset of the market you want to learn about, and also collect information that you can actually observe for individuals in the broader market. For example,¹ suppose the producers of the television show Nightline want to market their show to different demographic groups. They have an advertising campaign designed to reinforce the opinions of viewers who already watch Nightline, and another to introduce the show to viewers who do not watch it. The producers take a random sample of 75 viewers and get the results shown below.

Level    Number    Mean       Std Dev
No         50      34.9600     5.8238
Yes        25      57.9200    11.1502

Assuming the broader population looks about like the sample (an assumption we will examine more closely in Chapter 4), we might assume that 66.66% of the population are non-watchers with ages that are distributed approximately N(35, 5.8), and the remaining 33.33% of the population are watchers with ages distributed approximately N(58, 11). That is, we could simply fit normal models to each observed

¹This example modified from Albright et al. (2004).


segment. (We could fit other models too, if we knew any and thought they might fit better.)

Now suppose we have information (i.e. age) about a potential subject in front of whom we could place one of the two ads. Clearly, older viewers tend to watch the program more than younger viewers. The age of the potential viewer in question is 42, which is between the mean ages of watchers and non-watchers. What is the chance that he is a Nightline watcher? This is clearly a job for Bayes' rule. We have a prior probability of .33 and we need to update this probability to reflect the fact that we know this person to be 42 years old. Let W denote the event that the person is a watcher, and let A denote the person's age. Bayes' rule says P(W|A = 42) ∝ P(W)P(A = 42|W) and P(notW|A = 42) ∝ P(notW)P(A = 42|notW). Clearly P(W) = .33 and P(notW) = .66.

To get P(A = 42|W) we have two choices. We could use the techniques described in Section 2.4 to compute P(A < 43) − P(A < 42) for a normal distribution with µ = 58 and σ = 11 (or µ = 35 and σ = 5.8 for notW). We could also approximate this quantity by the height of the normal curve evaluated at age = 42. (See page 185 for an Excel command to do this.) The two methods give very similar answers, but the second is more typical. We get P(A = 42|W) = 0.0126 and P(A = 42|notW) = 0.0332. Now Bayes' rule is straightforward:

segment    prior    likelihood    prior*like    posterior
W           .33      .0126        0.004158        0.16
notW        .66      .0332        0.021912        0.84
                                  ---------
                                   0.02607

Thus there is a 16% chance that the 42-year-old subject is a Nightline watcher. It is easy enough to program a computer to do the preceding calculation for several ages and plot the results, which are shown in Figure 3.1. We see that the probability of viewership increases dramatically as Age moves from 40 to 50.
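Such a program might look like the following sketch (Python; prob_watcher is our own helper name, and the normal-density approximation to P(A = a|W) is the second of the two choices described above):

import math

def normal_density(a, mu, sigma):
    """Height of the N(mu, sigma) curve at a, approximating P(A = a | segment)."""
    z = (a - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def prob_watcher(age, p_w=1/3):
    like_w = normal_density(age, 58, 11)        # watchers:     A ~ N(58, 11)
    like_not = normal_density(age, 35, 5.8)     # non-watchers: A ~ N(35, 5.8)
    num = p_w * like_w
    return num / (num + (1 - p_w) * like_not)   # Bayes' rule

print(round(prob_watcher(42), 2))               # about 0.16
for age in range(40, 51):                       # the steep rise in Figure 3.1(b)
    print(age, round(prob_watcher(age), 2))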

More Complex Settings

The approach outlined above is very flexible. It can obviously be extended to any number of market segments by simply fitting a different model to each segment. It can incorporate several observable characteristics (such as age, income, and geographic region) by developing a joint probability model for the observed characteristics for each market segment. The multivariate normal distribution (which we will not discuss) is a common choice.


Figure 3.1: (a) Age distributions for Nightline watchers and non-watchers. (b) P(W|A): the probability of watching Nightline, given Age.

3.2 Covariance, Correlation, and Portfolio Theory

Investors like to get the most return they can with the least amount of risk. It is generally accepted that a well diversified investment portfolio lowers an investor's risk. However, there is more to portfolio diversification than simply purchasing multiple stocks. For example, a portfolio consisting only of shares from firms in the same industry cannot be considered diversified. This section seeks to calculate the amount of additional risk incurred by investors who own shares of closely related financial instruments.

3.2.1 Covariance

The covariance between two variables X and Y is defined as

Cov(X,Y ) = E((X − E(X))(Y − E(Y ))).

For context, imagine that X is the return on an investment in one stock, and Y the return on another. If the joint distribution of X and Y is unavailable (as is often the case in practice) the covariance can be estimated from a sample of n pairs


(xi, yi) using the formula

Cov(X, Y) = [1/(n − 1)] ∑_{i=1}^{n} (x_i − x̄)(y_i − ȳ).

If Cov(X, Y) is positive we say that X and Y have a positive relationship. This means that when X is above its average then Y tends to be above its average as well. Note that this does not guarantee that any particular Y will be large or small, only that there is a general tendency towards big Y's being associated with big X's. If Cov(X, Y) is negative we say that X and Y have a negative relationship. This means that as X increases Y tends to decrease. If X and Y are independent then Cov(X, Y) = 0. Note, however, that a covariance of zero does not necessarily imply that X and Y are independent. A covariance of zero means that there is no linear relationship between X and Y, but there could be a nonlinear relationship (e.g. a quadratic relationship).

One of the most common uses of covariance occurs when calculating the variance of a sum of random variables. Recall that if X and Y are independent then

Var(aX + bY) = a²Var(X) + b²Var(Y).

If X and Y are not independent then we can still perform the calculation using Cov(X, Y):

Var(aX + bY) = a²Var(X) + b²Var(Y) + 2ab Cov(X, Y).

Notice that if X and Y are independent, then Cov(X, Y) = 0, so the general formula contains the simple formula as a special case. The good news is that you don't have to remember two different formulas. The bad news is that the formula you have to remember is the more complicated of the two. The sidebar on page 54 contains a trick to help you remember the general variance formula, even if there are more than two random variables involved.

The best way to look at covariances for more than two variables at a time is to put them in a covariance matrix like the one in Table 3.1. The diagonal elements in a covariance matrix are the variances of the individual variables (in this case the variances of monthly stock returns). The off-diagonal elements are the covariances between the variables representing each row and column. A covariance matrix is symmetric about its diagonal because Cov(X, Y) = Cov(Y, X).

3.2.2 Measuring the Risk Penalty for Non-Diversified Investments

Suppose you invest in a stock portfolio by placing w1 = 2/3 of your money in Sears stock and w2 = 1/3 of your money in Penney stock. What is the variance of your stock


portfolio? Let S represent the return from Sears and P the return from Penney. The total return on your portfolio is T = w1S + w2P. The variance formula says (using numbers from Table 3.1)

Var(T) = w1²Var(S) + w2²Var(P) + 2w1w2 Cov(S, P)
       = (2/3)²(0.00554) + (1/3)²(0.00510) + 2(2/3)(1/3)(0.00335)
       = 0.00452.

So SD(T) = √0.00452 = 0.067. Notice that the variance of your stock portfolio is less than the variance of either individual stock, but it is more than it would be if Sears and Penney had been uncorrelated. If Cov(S, P) = 0 then you wouldn't have had to add the factor of 2(2/3)(1/3)(0.00335), which you can think of as a "covariance penalty" for investing in two stocks in the same industry. Thus covariance explains why it is better to invest in a diversified stock portfolio. If Sears and Penney had been uncorrelated, then you would have had Var(T) = 0.00303, or SD(T) = 0.055.

What if you wanted to "short sell" Penney in order to buy more shares of Sears? In other words, suppose your portfolio weights were w1 = 1.5 and w2 = −0.5. Then

Var(T) = w1²Var(S) + w2²Var(P) + 2w1w2 Cov(S, P)
       = (1.5)²(0.00554) + (−.5)²(0.00510) + 2(1.5)(−.5)(0.00335)
       = 0.008715.

So SD(T) = 0.093. The formula can be extended to as many shares as you like (see the sidebar on page 54). It can be shown that for a portfolio with a large number of shares the factor that determines risk is not the individual variances but the covariance between shares. If all shares had zero covariance the portfolio would also have a variance that approaches zero.
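Both portfolio calculations above fit in a few lines of code (Python; portfolio_variance is our own helper name, and the variances and covariance are the Sears/Penney entries from Table 3.1):

import math

def portfolio_variance(w1, w2, var1, var2, cov12):
    """Var(w1*X1 + w2*X2) = w1^2 Var(X1) + w2^2 Var(X2) + 2 w1 w2 Cov(X1, X2)."""
    return w1**2 * var1 + w2**2 * var2 + 2 * w1 * w2 * cov12

var_sears, var_penney, cov_sp = 0.00554, 0.00510, 0.00335

v = portfolio_variance(2/3, 1/3, var_sears, var_penney, cov_sp)
print(v, math.sqrt(v))                 # about 0.00452 and 0.067

v_short = portfolio_variance(1.5, -0.5, var_sears, var_penney, cov_sp)
print(v_short, math.sqrt(v_short))     # about 0.0087 and 0.093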

3.2.3 Correlation, Industry Clusters, and Time Series

It is hard to use covariances to measure the strength of the relationship between two variables because covariances depend on the scale on which the variables are measured. If we change the units of the variables we will also change the covariance. This means that we have no idea what a "large covariance" is. Therefore, if we want to measure how strong the relationship between two variables is, we use correlation rather than covariance. The correlation between two variables X and Y is defined as

Corr(X, Y) = Cov(X, Y) / (SD(X) SD(Y)).


Sears K-Mart Penney Exxon Amoco Imp_Oil Delta United

Sears 0.00554 0.00382 0.00335 0.00109 0.00094 0.00120 0.00254 0.00340

K-Mart 0.00382 0.00725 0.00415 0.00077 0.00042 0.00055 0.00302 0.00378

Penney 0.00335 0.00415 0.00510 0.00042 -0.00002 0.00022 0.00252 0.00333

Exxon 0.00109 0.00077 0.00042 0.00230 0.00211 0.00202 0.00083 0.00071

Amoco 0.00094 0.00042 -0.00002 0.00211 0.00444 0.00257 0.00050 0.00028

Imp_Oil 0.00120 0.00055 0.00022 0.00202 0.00257 0.00621 0.00049 0.00083

Delta 0.00254 0.00302 0.00252 0.00083 0.00050 0.00049 0.00762 0.00667

United 0.00340 0.00378 0.00333 0.00071 0.00028 0.00083 0.00667 0.01348

(a) Covariance matrix

Sears K-Mart Penney Exxon Amoco Imp_Oil Delta United

Sears 1.0000 0.6031 0.6307 0.3060 0.1900 0.2042 0.3907 0.3928

K-Mart 0.6031 1.0000 0.6820 0.1882 0.0734 0.0817 0.4062 0.3829

Penney 0.6307 0.6820 1.0000 0.1220 -0.0045 0.0400 0.4043 0.4023

Exxon 0.3060 0.1882 0.1220 1.0000 0.6598 0.5338 0.1974 0.1276

Amoco 0.1900 0.0734 -0.0045 0.6598 1.0000 0.4894 0.0862 0.0358

Imp_Oil 0.2042 0.0817 0.0400 0.5338 0.4894 1.0000 0.0709 0.0906

Delta 0.3907 0.4062 0.4043 0.1974 0.0862 0.0709 1.0000 0.6580

United 0.3928 0.3829 0.4023 0.1276 0.0358 0.0906 0.6580 1.0000

(b) Correlation matrix

Table 3.1: Correlation and covariance matrices for the monthly returns of eight stocks in three different industries.

Correlation is usually denoted with a lower case r or the Greek letter ρ ("rho"). Correlation has a number of very nice properties.

• Correlation is always between −1 and 1.

• A correlation near 1 indicates a strong positive linear relationship.

• A correlation near −1 indicates a strong negative linear relationship.

• Correlation is a "unitless measure." In other words, whatever units (feet, inches, miles) we measure X and Y in, we get the same correlation (but a different covariance).

• Just as for covariance a correlation of zero indicates no linear relationship.

Correlations are often placed in a matrix just like covariances. One of the things you can see from the correlation matrix in Table 3.1(b) is that stocks in the same industry are highly correlated with one another, relative to stocks in different industries.


Don’t Get Confused! 3.1 A general formula for the variance of a linear combination

The formula for the variance of a portfolio which is composed of several securities is as follows:

Var(∑i wiXi) = ∑i ∑j wi wj Cov(Xi, Xj),   where both sums run from 1 to n.

This formula isn’t as confusing as it looks. What it says is to write down all the covariances in a big matrix. Multiply each covariance by the product of the relevant portfolio weights, and add up all the answers.

[Diagram: a 3 × 3 covariance matrix with the variances V1, V2, V3 on the diagonal and the covariances Cij off the diagonal, bordered by the portfolio weights w1, w2, w3 along its rows and columns.]

If you think about the variance formula using this picture it should (among other things) help you remember the formula for Var(X − Y) when X and Y are correlated. If there are more than two securities then you would probably want to use a computer to evaluate this formula.
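For more than two securities the double sum is just the matrix product w'Σw, which is one line on a computer. A sketch (assuming NumPy; the three-stock covariance matrix comes from Table 3.1(a), while the weights are hypothetical):

import numpy as np

# Covariance matrix of Sears, K-Mart, and Penney returns (Table 3.1(a))
cov = np.array([[0.00554, 0.00382, 0.00335],
                [0.00382, 0.00725, 0.00415],
                [0.00335, 0.00415, 0.00510]])

w = np.array([0.5, 0.3, 0.2])   # hypothetical portfolio weights (they sum to 1)

# Var(sum_i w_i X_i) = sum_i sum_j w_i w_j Cov(X_i, X_j) = w' Sigma w
port_var = w @ cov @ w
print(port_var, np.sqrt(port_var))   # portfolio variance and standard deviation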


We’ve established what it means for a correlation to be 1 or −1, but what does a correlation of .6 mean? We’re going to have to put that off until we learn about something called R² in Chapter 5. In the meantime, Figure 3.2 shows the correlations associated with a few scatterplots so that you can get an idea of what a “strong” correlation looks like.

Correlation and Scatterplots

We can get more information from a scatterplot of X and Y than from calculating the correlation. So why bother calculating the correlation? In fact one should always plot the data. However, the correlation provides a quick idea of the relationship.


[Figure 3.2 consists of nine scatterplots of Y against X, with correlations (a) r = 1, (b) r = 1, (c) r = 0, (d) r = 0, (e) r = 0, (f) r = −1, (g) r = .76, (h) r = .54, (i) r = .24.]

Figure 3.2: Some plots and their correlations.


Don’t Get Confused! 3.2 Correlation vs. Covariance

Correlation and covariance both measure the strength of the linear relationship between two variables. The only difference is that covariance depends on the units of the problem, so a covariance can be any real number. Correlation does not depend on the units of the problem. All correlations are between −1 and 1.

This is especially useful when we have a large number of variables and are trying to understand all the pairwise relationships. One way to do this is to produce a scatterplot matrix where all the pairwise scatterplots are produced on one page. However, a plot like this starts to get too complex to absorb easily once you include about 7 variables. On the other hand a table of correlations can be read easily with many more variables. One can rapidly scan through the table to get a feel for the relationships between variables. Because correlations take up less space than scatterplots, you can include a correlation matrix in a report to give an idea of the relationships. By contrast, including a similar number of scatterplots might overwhelm your readers.

Autocorrelation

Correlation also plays an important role in the study of time series. Recall that in Section 2.2.3 we determined that there was “memory” in the S&P 500 data series, meaning that the Markov model was clearly preferred to a model that assumed up and down days occur independently. Correlation allows us to measure the strength of the day-to-day relationship. How? Correlation measures the strength of the relationship between different variables, but the time series of returns occupies only one column in the data set. The answer is to introduce “lag” variables. A lag variable is a time series that has been shifted up one row in the data set, as illustrated in Figure 3.3(b). The lag variable at time t has the same value as the original series at time t − 1, so the lag variable represents what happened one time period ago. Thus the correlation between the original series and the lag variable can be interpreted as the correlation in the series from one time period to the next. The name autocorrelation emphasizes that the correlation is between present and past values of the same time series, rather than between totally distinct variables. Notice that one could just as easily shift the series by any number k of rows to compute the correlation between the current time period and k time periods ago. A graph of autocorrelations at the first several lags is called the autocorrelation function.
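Computationally, the lag-k autocorrelation is just the ordinary correlation between the series and a copy of itself shifted k places. A minimal sketch (assuming NumPy; the series here is simulated noise standing in for a real return series):

import numpy as np

def lag_autocorr(x, k=1):
    # correlation between x_t and x_(t-k), i.e. the lag-k autocorrelation
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x[k:], x[:-k])[0, 1]

rng = np.random.default_rng(0)
returns = rng.normal(0.0, 0.01, size=1000)    # placeholder for actual daily returns
print([round(lag_autocorr(returns, k), 3) for k in range(1, 6)])   # first few lags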

Figure 3.3(a) shows the autocorrelation function for the S&P 500 data series.


The lag 1 autocorrelation is only .08, which is not very large (relative to the correlations in Figure 3.2, anyway). Thus, while we can be sure that there is some memory present in the time series, the autocorrelations say that the memory is weak.


Figure 3.3: (a) Autocorrelations for the S&P 500 data series. October 19, 1987 has been excluded from the calculation. (b) Illustration of the “lag” variable used to compute the autocorrelations.

3.3 Stock Market Volatility

One of the features of financial time series data is that market returns tend to go through periods of high and low volatility. For example, consider Figure 3.4, which plots the daily return for the S&P 500 market index from January 3, 1950 to October 21, 2005.²

Notice that the overall level of the returns is extremely stable, but that there are some time periods when the “wiggles” in the plot are more violent than others. For example, 1995 seems to be a period of low volatility, while 2001 is a period of high volatility. After the fact it is rather clear which time periods belong to “high” and “low” volatility states, but it can be hard to tell whether or not a transition is occurring when the process is observed in real time. Clearly there is a first-mover advantage to be had for analysts that can correctly identify the transition. For example, if an analyst is certain the market is entering a high volatility period then

² Data downloaded from finance.yahoo.com.


Figure 3.4: Daily returns for the S&P 500 market index. The vertical axis excludes a few outliers (notably 10/19/1987) that obscure the pattern evident in the remainder of the data.

the analyst can move his customers into less risky positions (e.g. more bonds and fewer stocks).

If an analyst has “stock” and “bond” strategies designed to be used during low and high volatility periods he could presumably write down the expected returns under each strategy in a reward matrix. Then the analyst just needs to know when the market’s volatility state changes. Notice that the analyst is in a sticky situation. He needs to react quickly to volatility changes to serve his customers well, but if he reacts to every “blip” in the market then his clients will become annoyed with him, perhaps thinking he is making pointless trades with their money to collect commissions.

One way to model data like Figure 3.4 is to assume that the (unobserved) high/low volatility state follows a Markov chain, and that data from high and low volatility states follow different normal distributions. Because we don’t get to see which days belong to which states, estimating the parameters of a model like this is hard, and requires special software. However, suppose the transition probabilities for the Markov chain and parameters for the normal distributions were estimated to be

                 Today
Yesterday      Lo      Hi
Lo            .997    .003
Hi            .005    .995

              Mean     SD
Lo             0.0    .007
Hi             0.0    .020

The transition probabilities suggest that high and low volatility states persist for long periods of time. That is, if today is in a low volatility state then the



Figure 3.5: Distribution of returns under the low and high volatility states. The dotted vertical line shows today’s data.

probability that tomorrow is low volatility is very high (.997), and similarly for high volatility. Each day the analyst can update the probability that the market is in a high volatility state based on that day’s market return.

Denote the volatility state at time t by St. Suppose that yesterday’s (i.e. time t − 1) probability was P(St−1 = Hi) = .001 and that today’s market return was Rt = .02. How do we compute the probability for today, P(St|Rt)? This is a Bayes’ rule problem. Yesterday we had a probability distribution describing what would happen today. We need to update that probability distribution based on new information (today’s return). We can get the marginal distribution for St because we have a marginal distribution for St−1 and conditional distributions P(St|St−1). That means the joint distribution is

St−1 \ St        Lo              Hi
Lo          (.999)(.997)    (.999)(.003)
Hi          (.001)(.005)    (.001)(.995)

which works out to

St−1 \ St        Lo              Hi
Lo            0.996003        0.002997
Hi            0.000005        0.000995

Thus P(St = Lo) = 0.996003 + .000005 = .996 and P(St = Hi) = .004. To update these probabilities based on today’s return Rt = .02 we need to compute P(Rt = .02) using the height of the normal curves at .02 for each volatility state (see Figure 3.5). We use a computer to compute the height of the normal curves,³ and plug them into Bayes’ rule as follows:

State    prior   likelihood   pri*like    post
Low      0.996      0.9         0.896    0.949
High     0.004     12.1         0.048    0.051
                               -------
                                0.944

Notice what happened. The analyst saw data that looked 12 times more likely to have come from the high volatility state than the low volatility state. But because he knew that yesterday was low volatility, and that volatility states are persistent, he regarded the new evidence skeptically and still believes that it is much more likely that today is a low volatility state than a high one.

³ If you feel uncomfortable doing this you can instead compute the probability that Rt is in a small interval around .02, say (.019, .021). You get different likelihoods, but the ratio between them will be approximately 12:1, so you will get about the same answers out of Bayes’ rule.
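The whole daily update (a Markov step for the prior, then Bayes’ rule with the two normal likelihoods) takes only a few lines of code. A sketch of the calculation (assuming NumPy; the transition matrix, state standard deviations, and yesterday’s probabilities are the ones from the example):

import numpy as np

def normal_height(x, mu, sigma):
    # height of the normal density curve at x
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

P = np.array([[0.997, 0.003],        # rows: yesterday Lo, Hi; columns: today Lo, Hi
              [0.005, 0.995]])
sd = np.array([0.007, 0.020])        # return SD in the Lo and Hi volatility states

yesterday = np.array([0.999, 0.001]) # P(S_{t-1} = Lo), P(S_{t-1} = Hi)
prior = yesterday @ P                # today's marginal: about (.996, .004)

r_t = 0.02                           # today's return
like = normal_height(r_t, 0.0, sd)   # about (0.96, 12.1)
posterior = prior * like / np.sum(prior * like)
print(np.round(posterior, 3))        # about (0.952, 0.048); the text's (0.949, 0.051)
                                     # used likelihoods rounded to 0.9 and 12.1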


Chapter 4

Principles of Statistical Inference: Estimation and Testing

This Chapter introduces some of the basic ideas used to infer characteristics of the population or process that produced your data based on the limited information in your dataset. The main objective of estimation is to quantify how confident you can be in your estimate using something called a confidence interval. The main goal of hypothesis testing is to determine whether patterns you see in the data are strong enough so that we can be sure they are not just random chance. The material on probability covered in Chapter 2 plays a very important role in estimation and testing. In particular, the Central Limit Theorem from Section 2.5, which describes how averages behave, is especially important because many of the quantities we wish to estimate and theories we wish to test involve averages.

4.1 Populations and Samples

A population is a large collection of individuals, entities, or objects that we would like to study. For example, we might be interested in some feature describing the population of publicly held corporations, or the population of customers who have purchased your product, or the population of red blood cells in your body. Most of the time it will be impossible, or at least highly impractical, to take the measurements we would like for every member of the population. For example, the population may be effectively infinite, such as in quality control problems where the population is all the future goods your production process will ever produce.

A sample is a subset of a population. Not all samples are created equal. In this Chapter and in most that follow we will assume that the sample is a simple random sample. Think of a simple random sample¹ as if the observations were drawn randomly out of a hat. An optional Section (7.4) in the “Further Topics” Chapter discusses the key ingredients of a good sampling scheme and what can go wrong if you have a bad one.

Figure 4.1: A sample is a subset of a population. Sample statistics are used to estimate population parameters.

In Chapter 1 we learned that even if we had the whole population in front of us we would have to summarize it somehow. We can use the summaries from the sample to estimate the population summaries, but we will need some way to distinguish the two types of summaries in our discussions. Population summaries are known as parameters and are denoted with Greek letters (see Appendix C). It is customary to denote a population mean by µ and a population standard deviation by σ. Sample summaries are known as statistics. We have already seen the established notation x̄ and s for the sample mean and standard deviation. By themselves, sample statistics are of little interest because the sample is just a small fraction of the population. However, the magic of statistics (the field of study) is that statistics (the numbers) can tell us something about the population parameters we really care about.

¹ There is a mathematical definition of simple random sampling which involves something called the hypergeometric distribution. It is mind-numbingly dull, even to statisticians.


4.2 Sampling Distributions (or, “What is Random About My Data?”)

It is hard for some people to see where randomness and probability enter into statistics. They look at their data sets and say “These numbers aren’t random. They’re right there!” In a sense that is true. The numbers in your data set are fixed numbers. They will be the same tomorrow as they are today. However, there was a time before the sample was taken when these concrete numbers were random variables. Your data aren’t random anymore, but they are the result of a random process.

The trick to understanding how sample statistics relate to population parameters is to mentally put yourself back in time to just before the data were collected and think about the process of random sampling that produced your data. Suppose the population you wish to sample from has mean µ and standard deviation σ (calculated as in Chapter 1). If you randomly select one observation from that population, then that one observation is a random variable X1 with expected value µ and standard deviation σ (in the sense of Chapter 2). The probability distribution of X1 is the histogram that you would plot if you could see the entire population. The observation gets promoted from the random variable X1 to the data point x1 once you actually observe it. The same is true for X2, X3, . . . , Xn.

If the data in your data set are the result of a random process, then the sample statistics describing your data must be too. (If you took another sample, you would get a different mean, standard deviation, etc.) A sampling distribution is simply the probability distribution describing a particular sample statistic (like the sample mean) that you get by taking a random sample from the population. Don’t let the name confuse you, it is just like any other probability distribution except that it is for a special random variable: a sample statistic. We need to understand as much as we can about a statistic’s sampling distribution, because we only get to see one observation from that sampling distribution (e.g. each time we take a sample we see only one sample mean). Let’s think about the sampling distribution of X̄. We know three key facts.

1. First, E(X̄) = µ. You can show this is true using the rules for expected values on page 33. What this says is that the sample mean is an unbiased estimate of the population mean. Sometimes the sample mean will be too big, sometimes it will be too small, but “on average” it gives you the right answer.

2. We know that Var(X̄) = σ²/n because of the rules for variance on page 35. This is important, because the smaller Var(X̄) is the better chance X̄ has of being close to µ. Because variance is hard to interpret, we usually look at the


standard deviation of X̄ instead. When we talk about the standard deviation of a statistic we call it the standard error. The standard error of X̄ is²

SE(X̄) = √Var(X̄) = σ/√n.

For example, if the standard error of X̄ is 1 we know that X̄ is typically about 1 unit away from µ. Generally speaking, the smaller the standard error of a statistic is, the more we can trust it. The formula for SE(X̄) tells us X̄ is a better guess for µ when σ is small (the individual observations in the population have a small standard deviation) or n is large (we have a lot of data in our sample).

3. The third key idea is the central limit theorem. It says that the average of several random variables is normal even if the random variables themselves are not normal. Remember that if the data are normally distributed then any individual observation has about a 95% chance of being within 2 standard deviations of µ. If the data are not normally distributed we can’t make that statement.

The central limit theorem is so important because it says that X̄ occurs within 2 standard errors of µ 95% of the time even if the data (the individual observations in the sample or population) are non-normal.

What these three facts tell us is that the number x̄ in our data set is the result of one observation drawn from a normal distribution with mean µ and standard error σ/√n. We still don’t know the numerical values of µ and σ, but Sections 4.3 and 4.5.1 show how to use this fact to “back in” to estimates of µ.

4.2.1 Example: log10 CEO Total Compensation

To make the idea of a sampling distribution concrete, consider the CEO compensation data set from Chapter 1. Imagine the collection of 800 CEO’s is a population from which you wish to draw a sample, and that you can afford to obtain information from a sample of only 20 CEO’s. Figure 4.2 shows the histogram of log10 CEO compensation for all 800 CEO’s in the data set. It is somewhat skewed to the right, so you might not feel comfortable modeling this population using the normal distribution. In this contrived example we can actually compute the population mean, µ = 6.17.

² The “of X̄” is important. We’re focusing on X̄ right now, but in the coming Chapters there are other statistics we will care about, such as the slope of a regression line. These other statistics have standard errors too, which will come from different formulas than SE(X̄).



Figure 4.2: The white histogram bars are log10 CEO compensation, our hypothetical population. The gray histogram bars are the means of 1000 samples of size 20 randomly selected from the population of 800 CEO’s. They represent the sampling distribution of X̄. If you took a random sample of 20 CEO’s and constructed its mean, you would get one observation from the gray histogram.

The Figure also shows a gray histogram which was created by randomly drawing many samples of size 20 from the CEO population. We took the mean of each sample, then plotted a histogram of all those sample means. The gray histogram, which is the sampling distribution of X̄ in this problem, has all the properties advertised above: it is centered on the population mean of 6.17, it has a much smaller standard deviation than the individual observations in the population (by a factor of √20, though you can’t tell by just looking at the Figure), and it is normally distributed even though the population is not.

Remember Figure 4.2 fondly, because it is the last time you’re going to see an entire population or entire sampling distribution. In practice you only get to see one observation from the gray histogram. The trick is to know enough about how sampling distributions behave so that you have some idea about how far that one x̄ from your sample might be away from the population µ you wish you could see.
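You can reproduce the flavor of Figure 4.2 with a short simulation. The sketch below (assuming NumPy is available; the skewed “population” is simulated rather than the actual CEO data) draws 1000 samples of size 20 and compares the spread of the sample means to the spread of the population:

import numpy as np

rng = np.random.default_rng(1)

# A right-skewed, non-normal stand-in for the log10 CEO compensation population
population = 5.0 + rng.gamma(shape=2.0, scale=0.5, size=800)

n = 20
means = np.array([rng.choice(population, size=n, replace=False).mean()
                  for _ in range(1000)])

print(population.mean(), means.mean())   # both sit near the population mean
print(population.std(), means.std())     # the means are tighter by roughly sqrt(20)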

4.3 Confidence Intervals

In the previous Section we saw that we could use x̄ as a guess for µ, and that the standard error of X̄ (SE(X̄) = σ/√n) gave us an idea of how good our guess is.


Don’t Get Confused! 4.1 Standard Deviation vs. Standard Error

• SD measures the spread of the data. If the SD is big then it would be hard for you to guess the value of a single observation drawn at random from the population.

• SE measures the amount of trust you can put in an estimate such as X̄. If the standard error of an estimate is small then you can be confident that it is close to the true population quantity it is estimating (e.g. that x̄ is close to µ).

Confidence intervals build on this idea. A confidence interval gives a range of possible values (e.g. 10 to 20) which we are highly confident µ lies within. For example if we said that a 95% confidence interval for µ was 10 to 20 this would mean that we were 95% sure that µ lies between 10 and 20. The question is, how do we calculate the interval?

A confidence interval for µ with σ known

First assume that the population we are looking at has mean µ and standard deviation σ. Then X̄ has mean µ and standard error σ/√n. The central limit theorem also assures us that X̄ is normally distributed. This is useful because we know that a normal random variable is almost always (95% of the time) within 2 standard deviations of its mean. So X̄ will almost always be within 2σ/√n of µ. That means we can be 95% certain that µ is no more than 2σ/√n away from x̄.³ Therefore, if we take the interval [x̄ − 2σ/√n, x̄ + 2σ/√n] we have a 95% chance of capturing µ. (In fact if we want to be exactly 95% sure of capturing µ we only need to use 1.96 rather than 2 but this is a minor point.) What if we want to be 99% sure or only 90% sure of being correct? If you look in the normal tables you will see that a normal will lie within 2.57 standard deviations of its mean 99% of the time and within 1.645 standard deviations of its mean 90% of the time. Therefore in general we get

[x̄ − zσ/√n, x̄ + zσ/√n]

where z is 1.96 for a 95% interval, 1.645 for a 90% interval and 2.57 for a 99% interval. This formula applies to any other certainty level as well. Just look up z in the normal table.

³ Note the switch. In the previous sentence µ was known and X̄ was not. This sentence marks our move into the “real world” where we see x̄ and are trying to guess µ.
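As a quick sketch of the formula in code (assuming SciPy is available for the exact normal quantile; the numbers plugged in at the bottom are hypothetical):

from scipy.stats import norm

def z_interval(xbar, sigma, n, conf=0.95):
    # xbar +/- z * sigma / sqrt(n), for use when sigma is known
    z = norm.ppf(1 - (1 - conf) / 2)      # 1.96 for 95%, 1.645 for 90%, 2.58 for 99%
    half = z * sigma / n ** 0.5
    return xbar - half, xbar + half

print(z_interval(xbar=100.0, sigma=15.0, n=25))   # hypothetical numbers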


Not on the test 4.1 Does the size of the population matter?

The formula for the standard error of X̄ assumes the individual observations are independent. When taking samples from a finite population, the observations are actually very slightly correlated with one another because each unit in the sample reduces the chances of another unit being in the sample. If you take a simple random sample of size n from a population of size N then it can be shown that

SE(X̄) = (σ/√n) √((N − n)/(N − 1)) ≈ (σ/√n) √(1 − n/N).

The extra factor at the end is called the finite population correction factor (FPC). In practice the FPC makes almost no difference unless your sample size is really big or the population is really small. For example, if your population has a billion people in it and you take a HUGE sample of a million people then the FPC is √(1 − 1 million/1 billion) = 0.9995. Contrast that with the 1/√n factor of 1/√1 million = .001 and it soon becomes apparent that the size of the sample is much more important than the size of the population.

4.3.1 Can we just replace σ with s?

The confidence interval formula for µ uses σ which is the population standard deviation. But if we don’t know µ, we probably don’t know σ either. A natural alternative is to use the sample standard deviation s instead of the population standard deviation σ. We expect s to be “close” to σ, so why not use

[x̄ − z s/√n, x̄ + z s/√n]?

If we simply replace σ with s then we ignore the uncertainty introduced by using an estimate (s) in place of the true quantity σ. Our confidence intervals would tend to be too narrow, so they would not be correct as often as they should be. For example if we use z = 1.96, so that we would expect to get a 95% confidence interval, the interval may in fact only be correct (cover µ) 80% of the time. To fix this problem we need to make the intervals a little wider so they really are correct 95% of the time. To do this we use something called the t distribution. All you really need to know about the t distribution is

1. You use it instead of the normal when you have to guess the standard deviation.



Figure 4.3: The normal distribution and the t distribution with 3, 10, and 30 degrees of freedom. As the sample size (DF) grows, the normal and t distributions become very close.

2. It is very similar to the normal, but with fatter tails.

3. There is a different t distribution for each value of n.

When n is large⁴ the t and normal distributions are almost identical. To make a long story short the confidence interval you should be using is

[x̄ − t s/√n, x̄ + t s/√n].

Note the difference between t and z. Both measure the number of standard errors a random variable is above its mean, but z counts the number of true standard errors, while t counts estimated standard errors.

The best way to find t is to use a computer, because there is a different t distribution for every value of n. There are tables that list some of the more interesting calculations for the t distribution for a variety of degrees of freedom, but we won’t bother with them. Figure 4.3 plots the t distribution for a few different sample sizes next to the normal curve. When the sample size is small (there are few “degrees of freedom”) the t-distribution has much heavier tails than the normal. To capture 95% of the probability you might have to go well beyond 2 standard errors. Thus if

⁴ Greater than 30 is an established rule of thumb.


Not on the test 4.2 What are “Degrees of Freedom?”

Suppose you have a sample with n numbers in it and you want to estimate the variance using s². The first thing you do is calculate the mean, x̄. Then you calculate (x1 − x̄), (x2 − x̄), . . . , (xn − x̄), square them, and take their average. If you didn’t square the deviations from the mean, their average would always be zero! Because the deviations from the mean must sum to zero, they don’t represent n numbers worth of information. There are n numbers there, but they are not all “free” because they must obey a constraint.
The phrase “degrees of freedom” means the number of “free” numbers available in your data set. Generally speaking, each observation in the data set adds one degree of freedom. Each parameter you must estimate before you can calculate the variance takes one away. That is why we divide by n − 1 when calculating s².

you had a very small sample size your formula for a 95% confidence interval might be [x̄ ± 3SE(X̄)] or [x̄ ± 4SE(X̄)]. As the sample size grows, s becomes a better guess for σ and the formula for a 95% confidence interval soon becomes very close to what it would be if σ were actually known.

4.3.2 Example

A human resources director for a company wants to know the average annual life insurance expenditure for the members of a union with which she is about to engage in labor negotiations. She is considering whether it would be cost effective for the company to provide a life insurance benefit to the union members. The HR director obtains a random sample of 61 employees who provide the data in Figure 4.4. Find a 95% confidence interval for the union’s average life insurance expenditure.

In this case finding the interval is easy, because it is included in the computer output: (429.48, 532.55). How did the computer calculate the interval? It is best to think of it in three steps. First, find the point estimate, or the single best guess for the thing you’re trying to estimate. The best point estimate for a population mean is a sample mean, so the point estimate here is x̄ = 481. Second, compute the standard error of the point estimate, which is 201.2195/√61 = 25.7635. Third, compute the 95% confidence interval as x̄ ± 2SE(x̄). If you carry this calculation out by hand you may notice that you don’t get exactly the same answer found in the computer output. That’s because we used “2” for a number t that the computer looked up on its internal t-table with α = .05 and 60 = 61 − 1 degrees of freedom.


Mean             481.0164
Std Dev          201.2195
Std Err Mean      25.7635
upper 95% Mean   532.5511
lower 95% Mean   429.4817
N                 61.0000

Figure 4.4: Annual life insurance expenditures for 61 union employees.

The computer’s answer is more precise than ours, but we’re pretty close.
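The same three steps, with the computer’s t value in place of 2, reproduce the interval in Figure 4.4. A sketch (assuming SciPy is available; the summary numbers are the ones in the output above):

from scipy.stats import t

xbar, s, n = 481.0164, 201.2195, 61
se = s / n ** 0.5                        # 25.7635
t_crit = t.ppf(0.975, df=n - 1)          # about 2.0003 with 60 degrees of freedom
print(xbar - t_crit * se, xbar + t_crit * se)   # about (429.48, 532.55)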

The confidence interval is shown graphically in Figure 4.4 as the diamond in the boxplot. Notice that the diamond covers very little of the histogram. A common misperception that people sometimes have is that 95% confidence intervals are supposed to cover 95% of the data in the population. They’re not. A 95% confidence interval is supposed to have a 95% chance of containing the thing you’re trying to estimate, in this case the mean of the population. After observing the sample of 61 people, the HR director still doesn’t know the overall average life insurance expenditure for the entire union, but a good bet is that it is between $429 and $532.

What if the interval is too wide for the HR director’s purpose? There are only two choices for obtaining a shorter interval, and both of them come at a cost. The first is to obtain more data, which will make SE(X̄) smaller. However, obtaining more data is costly in terms of time, money, or both. The second option is to accept a larger probability of being wrong (i.e. a larger α, the probability that the interval really doesn’t contain the mean of the population), by going out fewer SE’s from the point estimate. If you want, you can have an interval of zero width, but you will have a 100% chance of being wrong!

How much more data does the HR director need to collect? That depends on how narrow an interval is desired. Right now the interval has a width of about $100, or a margin of error of about $50. (The margin of error is the ± term in a confidence interval.) Suppose the desired margin of error E is ±$25. The margin of error for a confidence interval is

E = t s/√n.

The HR director wants a 95% confidence interval, so t ≈ 2, and we can feel pretty good about s ≈ 200 based on the current data set. She can then solve for n ≈ 256.

Of course you can solve the formula for n without substituting for the other letters, which gives:

n = (t s / E)².

Assuming you have done a pilot study or have some other way to make a guess at s, this is an easy “back of the envelope” calculation to tell you how much data you need to get an interval with the desired margin of error E and confidence level (determined by t).
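As a one-line sketch of the calculation (plain Python; t, s, and E are the HR director’s numbers from above):

import math

def sample_size(s, E, t=2.0):
    # n = (t * s / E)^2, rounded up to the next whole observation
    return math.ceil((t * s / E) ** 2)

print(sample_size(s=200, E=25))   # 256, matching the text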

4.4 Hypothesis Testing: The General Idea

Hypothesis testing is about ruling out random chance as a potential cause for patterns in the data set. For example, suppose the human resources manager from Section 4.3.2 needs to be sure that the per-capita insurance premium for union employees is less than $500. The sample of 61 people has an average premium of $481. Is that small enough so that we can be sure the population mean is less than $500? Obviously not, since the 95% confidence interval for the population mean stretches up to $532. It could easily be the case that µ = $500 and we just saw an x̄ = $481 by random chance.

Our example makes it sound like hypothesis tests are simply an application of confidence intervals. In many instances (the so-called t-tests) the two techniques give you the same answer. However, confidence intervals and hypothesis tests are different in two respects. First, a hypothesis test compares a statistic to a pre-specified value, like the HR director’s $500 threshold. Hypothesis tests are often used to compare differences between two statistics, such as two sample means. In such instances the “specified value” is almost always zero. Second, hypothesis tests can handle some problems that confidence intervals cannot. For example, Section 4.5.3 describes a hypothesis test for determining whether two categorical variables are independent. There is no meaningful confidence interval to look at in that context. Here is how hypothesis tests work, step by step.

1. You begin by assuming a null hypothesis, which is that the population quantity you’re testing is actually equal to a specified value, and that any discrepancy you see in your sample statistic is just due to random chance.⁵ The null hypothesis is written H0, such as in “H0 : µ = $500” or “H0 : the variables are independent.” Remember that a hypothesis test uses the sample to test something about the population. Thus it is incorrect to write H0 : x̄ = 500 or H0 : x̄ = 481. You can see x̄. You don’t need to test for it.

⁵ It sounds like the null hypothesis is a pretty boring state of the world. If it were exciting, we wouldn’t give it a hum-drum name like “null.”


Don’t Get Confused! 4.2 Which One is the Null Hypothesis?

There is a bit of art involved in selecting the null hypothesis for a hypothesis test. You will get better at it as you see more tests. Here are some guidelines. First, the null hypothesis has to specify an exact model, because in order to compute a p-value you (or some math nerd you keep around for problems like this) need to be able to figure out what the distribution of your test statistic would be if the null hypothesis were true. For example, µ = 815 or β = 0 are valid null hypotheses, but µ > 815 is not. Consequently, the null hypothesis is almost always simpler than the alternative. For example, if you are testing the hypotheses “two variables are independent” and “two variables are dependent” then “independent” is the natural null hypothesis because it is simpler.
Also, be sure to remember that a hypothesis test is testing a theory about a population or a process, not a sample or a statistic. Thus it is incorrect to write something like H0 : x̄ = 815 when you really mean H0 : µ = 815.

2. The next step is to determine the alternative hypothesis you want to test against. Often the alternative hypothesis is simply the opposite of the null hypothesis. It is written as H1 or Ha. For example, H1 : µ ≠ $500 or Ha : the variables are related to one another. Sometimes you will only care about an alternative hypothesis in a specified direction. The HR director needs to show that µ < $500 before she is authorized to offer the life insurance benefit in negotiations, so her natural alternative hypothesis is Ha : µ < $500.

3. Identify a test statistic that can distinguish between the null and alternative hypotheses. The choice of a test statistic is obvious in many cases. For example, if you are testing a hypothesis about a population mean, a difference between two population means, or the slope of the population regression line, then good test statistics are the sample mean, the difference between two sample means, and the slope of the sample regression line. Some test statistics are a bit more clever. We’ll see an example in Section 4.5.3.

Test statistics are often standardized so that they do not depend on the units of the problem. Instead of x̄, you might see a test statistic represented as

t = (x̄ − µ0) / SE(X̄),

where SE(X̄) = s/√n and µ0 is the value specified in the null hypothesis, like $500. A test statistic which has been standardized by subtracting off its hypothesized mean and dividing by its standard error is known by a special name. It is a t-statistic. You should recognize this standardization as nothing more than the “z-scoring” that we learned about in Section 2.4. A t-statistic tells how many standard errors the first thing in its numerator is above the second. Because it has been standardized, you get the same t-statistic regardless of whether x̄ is measured in dollars or millions of dollars.

4. The final step of a hypothesis test looks at the test statistic to determine whether it is large or small (i.e. close or far from the value specified in H0). Some people skip this step when working with t statistics, because we have some sense of what a big t looks like (i.e. around ±2). However, there are other test statistics out there, with names like χ² and F, where the definition of “big” isn’t so obvious. Even with t statistics, especially with small sample sizes of n < 30, the “magic number” needed to declare a test statistic “significantly large” will be different for different sample sizes.

The p-value is the main tool for measuring the strength of the evidence that a test statistic provides against H0. A p-value is the probability, calculated assuming H0 is true, of observing a test statistic that is as or more extreme than the one in our data set. In our HR director example, if the population mean had been µ = 500 then the probability of seeing a sample mean of 481 or smaller is 0.2321. In other words, it wouldn’t be particularly unusual for us to see sample means like $481 if the true population mean were $500.

4.4.1 P-values

Once you have a p-value, hypothesis testing is easy. The rule with p-values is

Small p-value ⇒ Reject H0.

The smaller the p-value, the stronger the evidence against H0. Here’s why. If the p-value is very small then there are two possibilities.

1. The null hypothesis is true and we just got very strange data by bad luck.

2. The null hypothesis is false and we should conclude that the alternative is in fact correct.

The p-value measures how “unlucky” we would have to be to see the data in our data set if we were in case 1. If the p-value is small enough (say less than 5%) then we conclude that there’s no way we’re that unlucky, so we must be in case 2 and we can “reject the null hypothesis.” On the other hand if the p-value is not too small then we “fail to reject the null hypothesis.” This does not mean that we are sure


the null hypothesis is true, just that we don’t have enough evidence to be certain it is false. Roughly speaking here is the language you can use for different p-values.

p             Evidence against H0
> .1          None
.05 to .1     Weak
.01 to .05    Moderate
< .01         Strong

The p-value is not the probability that H0 is true. If it were, we would begin to prefer Ha as soon as p < .5 rather than .05. Instead, the p-value is a kind of “what if” analysis. It measures how likely it would be to see our data, if H0 had been true. The less likely the data would be if H0 were true, the less comfortable we are with H0. It takes some mental gymnastics to get your mind around the idea, but that’s the trade-off with p-values. Using them is easy, understanding exactly what they say is a little harder.

4.4.2 Hypothesis Testing Example

Hypothesis tests have been developed for all sorts of questions: are two means equal? Are two variances equal? Are two distributions the same? In an ideal world you would know the details about how each of these tests worked. However you often may not have time to get into the nitty gritty details of each test. Instead, you can find a hypothesis test with a null and alternative hypothesis that fit your problem, plug your data into a computer, and locate the p-value.

For example, suppose you want to know if two populations have different standard deviations. This might occur in a quality control application, or you might want to compare two stocks to see if one is more volatile than the other. Figure 4.5 shows computer output from a sample of two potato chip manufacturers. To the eye it appears that brand 2 has a higher standard deviation (i.e. is less reliable) than brand 1. Is this a real discrepancy or could it simply be due to random chance?

Figure 4.5 presents output from four hypothesis tests. Each test compares the null hypothesis of equal variances to an alternative that the variances are unequal using a slightly different test statistic. Each test has a significant (i.e. small) p-value, which indicates that the variances are not the same (small p-value says to reject the null hypothesis that the variances are the same). Thus the difference we see is too large to be the result of random chance. Brand 1 is more reliable (i.e. has a smaller standard deviation) than brand 2.

Notice the advantage of using p-values. We don’t have to know how large a “Brown-Forsythe F ratio” has to be in order to understand the results of the Brown-Forsythe test. We can tell what Brown and Forsythe would say about our problem just by knowing their null and alternative hypothesis and looking at their p-value.


Test             F Ratio   DFNum   DFDen   Prob > F
O'Brien[.5]       9.0356     1      46      0.0043
Brown-Forsythe   10.9797     1      46      0.0018
Levene           15.2582     1      46      0.0003
Bartlett         10.0318     1              0.0015

Figure 4.5: Potato chip output. P-values are located in the column marked “Prob > F.”


4.4.3 Statistical Significance

When you reject the null hypothesis in a hypothesis test you have found a “statistically significant result.” For example, the standard deviations in the potato chip output were found to be significantly different. All that means is that the difference is too large to be strictly due to chance.

You should view statistical significance as a minimum standard. You should ignore small patterns in the data set that fail tests of statistical significance. When you find a statistically significant result, you know the result is “real,” but there is no guarantee that it is important to your decision making process. People sometimes refer to this distinction as “statistical significance vs. practical significance.”

For example, suppose it is very expensive to adjust potato chip filling machines, and there are industry guidelines stating that bags must be filled to within ±1 oz. Then the statistically significant difference between the standard deviations of the two potato chip processes is of little practical importance, because both processes are well within the industry limit. Of course if calibrating the filling machines is cheap, then Figure 4.5 says brand 2 should recalibrate.

4.5 Some Famous Hypothesis Tests

Section 4.4.2 seems to argue that all you need to do a hypothesis test is the null hypothesis and the p-value. To a large extent that is true, but there are a few extremely famous hypothesis tests that you should know how to do “by hand” (with the aid of a calculator). This Section presents three tests which come up frequently enough that it is worth your time to learn them. The three tests are: the one sample t-test for a population mean, the z-test for a proportion, and the χ² test for independence between two categorical variables.

4.5.1 The One Sample T Test

You use the one sample t-test to test whether the population mean is greater than, less than, or simply not equal to some specified value µ0. Thus the null hypothesis is always H0 : µ = µ0, where you specify a number for µ0. We have already seen one example of the one sample t-test performed by our friend the HR director.

All the one sample t-test does is check whether x̄ is close to µ0 or far away. “Close” and “far” are measured in terms of standard errors using the t-statistic⁶

t = (x̄ − µ0) / SE(X̄),

where SE(X̄) = s/√n. The only complication comes from the fact that there are three different possible alternative hypotheses, and thus three different possible p-values that could be computed. You only want one of them. The appropriate Ha depends only on the setup of the problem. It does not depend at all on the data in your data set.

One Tail, or Two? (How to Tell and Why it Matters)

The one sample t-test tests the null hypothesis H0 : µ = µ0 where µ0 is our hypothesized value for the mean of the population. There are 3 possible alternative hypotheses:

(a) Ha : µ ≠ µ0. This is called a two tailed (or two sided) test.

(b) Ha : µ > µ0. This is called a one tailed test.

(c) Ha : µ < µ0. This is called a one tailed test.

⁶ In a hypothesis test you calculate t and use it to compute a p-value. This is the opposite of confidence intervals, which look up t so that it matches a pre-specified probability such as 95%.



(a) p = 0.4642 (b) p = .7679 (c) p = 0.2321

Figure 4.6: The three p-value calculations for the three possible alternative hypotheses in the one sample t-test. Based on the data from Figure 4.4.


The null hypothesis determines where the sampling distribution for X̄ is centered. The alternative hypothesis determines how you calculate your p-value from the sampling distribution. This can be a bit confusing, but it helps to remember that a p-value provides evidence for rejecting the null hypothesis in favor of a specified alternative. For the one sample t-test you can think of the p-value as the probability of seeing an X̄ that supports Ha even more than the x̄ in your sample. Our HR director was trying to show that µ < $500, so the relevant calculation is the probability that she would see an X̄ even smaller than $481 (her sample mean) if µ really was $500 and the random sampling process were repeated again. Her calculation is depicted in Figure 4.6(c).

The other two probability calculations in Figure 4.6 are irrelevant to the HR director, but they illustrate how the p-values would be calculated under the other alternative hypotheses. What type of x̄ would support Ha if it had been µ ≠ 500? An x̄ far away from $500 in either direction. Thus if Ha is µ ≠ µ0 then the p-value calculates the probability that you would see future X̄’s that are even farther away from µ0 than the one in your sample if H0 were true. Likewise, if Ha is µ > µ0 then the p-value is the probability that you would see X̄’s even larger than the x̄ in your sample if H0 were true and the sampling process were repeated. Obviously, the upper and lower tail p-values must sum to 1, and the two tailed p-value is twice the smaller of the one tailed p-values.

Remember that you choose the relevant alternative hypothesis from the context of the problem without regard to whether you saw x̄ > µ0 or x̄ < µ0. Otherwise, people would only ever do one tailed tests. If you are not sure which alternative hypothesis to use, pick the two tailed test. If you’re wrong then all you’ve done is apply a tougher standard than you needed to. That could keep you out of court when you become CEO.⁷

⁷ Mandatory Enron joke of 2003.

Example 1

One of the components used in producing computer chips is a compound designed to make the chips resistant to heat. Industry standards require that a sufficient amount of the compound be used so that the average failing temperature for a chip is 300 degrees (Fahrenheit). The compound is relatively expensive, so manufacturers don’t want to use more than is necessary to meet the industry guidelines. Each day a sample of 30 chips is taken from the day’s production and tested for heat resistance, which destroys the chips. Suppose, on a given day the average failure point for the sample of 30 chips is 305 degrees, with a standard deviation of 8 degrees. Does it appear the appropriate heat resistance target is being met?

Let µ be the average population failure temperature. The null hypothesis is H0 : µ = 300. What is the appropriate alternative? We are interested in deviations from the target temperature on either the positive or negative side, so the best alternative is Ha : µ ≠ 300. The standard error of the mean is 8/√30 = 1.46, so the t-statistic is 3.42. We can use the normal table to construct the p-value because there are at least 30 observations in the data set. What p-value should we calculate? We want to know the probability of seeing sample means at least as far away from 300 degrees as the mean from our sample, so the p-value is P(X̄ ≥ 305) + P(X̄ ≤ 295) ≈ P(Z > 3.42) + P(Z < −3.42). (The approximate equality is because we’re using the normal table instead of the inconvenient t-table.) Looking up 3.42 on the normal table gives us a p-value of .0003 + .0003 = .0006.

If the true heat tolerance in the population of computer chips was 300 degrees, we would only see a sample mean this far from 300 about 6 times out of 10,000. That makes us doubt that the true mean really is 300 degrees. It looks like the average heat tolerance is higher than we need it to be.
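From the summary statistics alone the test takes a few lines of code. A sketch (assuming SciPy is available; it mirrors the normal-table approximation used above):

from scipy.stats import norm

xbar, s, n, mu0 = 305, 8, 30, 300
se = s / n ** 0.5                        # about 1.46
t_stat = (xbar - mu0) / se               # about 3.42
p_two_sided = 2 * norm.sf(abs(t_stat))   # two tailed, using the normal approximation
print(t_stat, p_two_sided)               # about 3.42 and 0.0006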

Example 2

An important use of the one sample t-test is in “before and after” studies where there are two observations for each subject, one before some treatment was applied and one after. You can use the one sample t-test to determine whether there is any benefit to the treatment by computing the difference (after − before) for each person and testing whether the average difference is zero. This type of test is so important it merits its own name: the paired t-test, even though it is just the one sample t-test applied to differences.

For example, suppose engineers at Floogle (a company which designs internet search engines) have developed a new version of their search engine that they would like to market as “our fastest ever.” They test the new version against their previous “fastest ever” search engine by running both engines on a suite of 40 test searches, randomly choosing which program goes first on each search to avoid potential biases. The difference in search times (new − old) is recorded for each search in the test suite. The average difference is −3.775 (the average search for the new engine was 3.775 milliseconds less than under the old engine). The 40 differences have a standard deviation of 21.4 milliseconds. Do the data support the claim that the new engine is “their fastest ever?”

Let µ be the difference in average search times between the two search engines. The appropriate null hypothesis here is µ = 0 (no difference in average speed). What should the alternative be? The engineers want to show that the new engine is “significantly faster” than the old one, so the natural alternative hypothesis is µ < 0. The standard error of the average difference is SE(X̄) = 21.4/√40 = 3.38, so the t-statistic is t = −3.775/3.38 = −1.12. Should we compute an upper tail, lower tail, or two tailed p-value? Our alternative hypothesis is µ < 0, so we should compute a lower-tailed p-value. From the normal table (because n > 30) we find p ≈ 0.1314. The large p-value says that the new search engine didn’t win the race by a sufficient margin to show that it is faster than the old one.
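The same calculation for the paired differences, as a sketch (assuming SciPy is available; the summary numbers come from the example, and the normal approximation is used for the p-value as in the text):

from scipy.stats import norm

dbar, s_d, n = -3.775, 21.4, 40    # mean and SD of the (new - old) differences
se = s_d / n ** 0.5                # about 3.38
t_stat = dbar / se                 # about -1.12
p_lower = norm.cdf(t_stat)         # lower tailed p-value for Ha: mu < 0, about 0.13
print(t_stat, p_lower)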

4.5.2 Methods for Proportions (Categorical Data)

Recall from Chapter 1 that continuous data are often summarized by means, and categorical data are summarized by the proportion of the time they occur. Let p denote the proportion of the population with a specified attribute. The sample proportion is p̂ (pronounced “p-hat”). You compute p̂ by simply dividing the number of “successes” in your sample by the sample size.

For a concrete example, suppose a car stereo manufacturer wishes to estimate the proportion of its customers who, one year after purchasing a car stereo, would recommend the manufacturer to a friend. Suppose the manufacturer obtains a random sample of 100 customers, 73 of whom respond positively. Then our best guess at p is p̂ = .73.

There is good news, and even more good news about proportions. The good news is that proportions are actually a special kind of mean. Imagine that person i is labeled with a number xi, which is 1 if they would recommend our car stereo to a friend, and 0 if they would not. We can calculate the proportion of favorable reviews by averaging all those 0’s and 1’s. Thus, everything we learned about sample means


Don’t Get Confused! 4.3 The Standard Error of a Sample Proportion.

There are three ways to calculate SE(p̂) in practice. All of them recognize SE(p̂) = √(p(1 − p)/n). They differ in what you should plug in for p.

Guess for p   Used in...                        Rationale
p̂             Confidence intervals              Best guess for p.
p0            Hypothesis testing                You're assuming H0 is true. H0 says p = p0.
1/2           Estimating n for a future study   A conservative "worst case" estimate of p. Usable even with no data.

carries over to sample proportions. In particular, the central limit theorem implies p̂ obeys a normal distribution.

The second bit of good news about proportions is that it is even easier to calculate the standard error of p̂ than it is for X̄. Here’s why. Data which only assume the values 0 and 1 (or sometimes −1 and 1) are called indicator variables or dummy variables. Note the following useful fact about dummy variables. Let

Xi = 1 with probability p, and Xi = 0 with probability 1 − p.

Then it is easy to show E(Xi) = p and Var(Xi) = p(1 − p). (Try it and see! Hint: in this one special case Xi² = Xi.) What’s so useful about our useful fact? Because proportions are just means in disguise, and we know that SE(X̄) = SD(X)/√n, our “useful fact” says

SE(p̂) = √(p(1 − p)/n).

So with proportions, if you have a guess at p you also have a guess at SE(p).

Confidence Intervals

To produce a confidence interval for p we use the formula

p̂ ± z SE(p̂)


where SE(p̂) = √(p̂(1 − p̂)/n) and z comes from the normal table.⁸ Suppose, for example, that we sample n = 100 items from a shipment and find that 15 are defective. What is a 95% confidence interval for the proportion of defective items in the whole shipment? The point estimate is p̂ = 15/100 = .15 and its standard error is SE(p̂) = √((.15)(.85)/100) = 0.0357, so the confidence interval is

0.15 ± 1.96 × 0.0357 = [0.08, 0.22].

The shipment contains between 8% and 22% defective items (with 95% confidence). You may notice that, underneath the square root sign, SE(p̂) is a quadratic function of p. The standard error is zero if p = 0 or p = 1. That makes sense, because if the shipment contained either all defectives or no defectives then every sample we took would give us either p̂ = 1 or p̂ = 0 with no uncertainty at all. We get the largest SE(p̂) when p = 1/2. It is useful to assume p = 1/2 when you are planning a future study and want to know how much data you need to achieve a specified margin of error. Recall from Section 4.3.2 that to get a margin of error E you need roughly n = (ts/E)² observations. If you want a 95% confidence interval then t ≈ 2, and if you assume p = 1/2 then s (the standard deviation of an individual observation) is √(p(1 − p)) = √((1/2)(1/2)) = 1/2. Therefore to estimate a proportion to within ±E you need roughly n ≈ 1/E² observations.

To illustrate, suppose you wish to estimate the proportion of people planning to vote for the Democratic candidate in the next election to within ±2%. You would want a sample of n = 1/(.02)² = 1/(.0004) = 2500 voters. To get the margin of error down to ±1% you would need 1/(.0001) = 10,000 voters. Standard errors decrease like 1/√n, so to cut SE(p̂) in half you need to quadruple the sample size.
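A sketch of both calculations, the interval for the defective-items example and the ±E sample-size rule (assuming SciPy is available for the normal quantile):

from scipy.stats import norm

p_hat, n = 15 / 100, 100
se = (p_hat * (1 - p_hat) / n) ** 0.5    # sqrt(p(1 - p)/n) = 0.0357
z = norm.ppf(0.975)                      # 1.96
print(p_hat - z * se, p_hat + z * se)    # about (0.08, 0.22)

# Sample size to estimate a proportion to within +/- E, using the worst case p = 1/2
E = 0.02
print(round(1 / E ** 2))                 # 2500 voters for a +/- 2% margin of error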

Hypothesis Tests

Suppose your plant employs an extremely fault tolerant production process, which allows you to accept a shipment as long as you can be confident that there are fewer than 20% defectives. Should you accept the shipment with 15 defectives in the sample of 100? You could answer with a hypothesis test of H0 : p = .2 versus Ha : p < .2. Then SE(p̂) = √((.2)(.8)/100) = .04 and the test statistic is

z = (0.15 − 0.2)/0.04 = −1.25.

We used .2 instead of .15 in the formula for SE(p) because hypothesis tests compute p-values by assuming H0 is true, and H0 says p = .2. Our test statistic says that p is 1.25 standard errors below .2. For the alternative hypothesis Ha : p < p0 the p-value is

P(Z < −1.25) = 0.1056 from the normal table.

8We use the normal table and not the t table here because one of the t assumptions is that the data are normally distributed, which can't be true when the data are 0's and 1's. Therefore the results for proportions presented in this section are for "large samples" with n > 30.

If H0 were true and p = .2 then we would see p ≤ .15 about 10% of the time, which is not all that unusual. We should reject the shipment, because we can't reject H0.
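A short Python sketch of the same one-sided test (scipy is assumed to be available for the normal distribution):

```python
from scipy.stats import norm

p0, p_hat, n = 0.20, 0.15, 100
se0 = (p0 * (1 - p0) / n) ** 0.5     # SE computed under H0, 0.04
z = (p_hat - p0) / se0               # -1.25 standard errors below .2
print(z, norm.cdf(z))                # one-sided p-value, about 0.1056
```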

4.5.3 The χ² Test for Independence Between Two Categorical Variables

The final hypothesis test in this Chapter is a little different from the others because there is no confidence interval to which it corresponds. The test is called the χ² test. (χ is the Greek letter "chi", which is pronounced with a hard "k" sound and rhymes with "sky.") The χ² test investigates whether there is a relationship between two categorical variables. For example, is there a relationship between gender and type of car purchased? Figure 4.7 describes the type of car purchased by a random sample of 263 men and women. In the sample women buy a slightly higher percentage of family cars than men, and men buy a slightly higher percentage of sporty cars than women. The hypotheses for the χ² test are

H0 : The two variables are independent (i.e. no relationship)

Ha : There is some sort of relationship

To decide which we believe we produce a contingency table for the two variables and calculate the number of people we would expect to fall in each cell if the variables really were independent. If the observed numbers (i.e. the numbers that actually happened) are very different from what we would expect if X and Y were independent then we will conclude there must be a relationship.

The first step is to compute how many observations we would expect to see in each cell of the table if the variables were actually independent. Recall that if two random variables X and Y are independent then P(X = x and Y = y) = P(X = x)P(Y = y). From the table in Figure 4.7 we see that the proportion of men in the sample is .5475 (= 144/263) and the proportion of sporty cars in the sample is .3004 (= 79/263). If TYPE and GENDER were independent we would expect the proportion of "men with sporty cars" in the data set to be (.5475)(.3004), which means we would expect to see 263(.5475)(.3004) = 43.25 observations in that cell of the table. In more general terms,

\[
E_i = n p_x p_y = n_x n_y / n,
\]


Test      ChiSquare  Prob>ChiSq
L. Ratio  1.420      0.4915
Pearson   1.417      0.4924

Figure 4.7: Automobile preferences for men and women.

where Ei is the expected cell count in the i'th cell of the table, and px and py are the marginal proportions for that cell. The first equation is how you should think about Ei being calculated when someone else (i.e. the computer) does the calculation for you. If you have to do the calculation yourself use the second form, which is a shortcut you get by noticing px = nx/n and py = ny/n, where nx and ny are the marginal counts for the cell (e.g. number of men and number of sporty cars).
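For instance, the "men with sporty cars" cell works out as follows (a tiny sketch; the marginal counts are the ones quoted above):

```python
n = 263                      # total number of people in the sample
n_men, n_sporty = 144, 79    # marginal counts from Figure 4.7
print(n_men * n_sporty / n)  # expected count if independent: about 43.25
```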

Once you have determined how many observations you would expect to see in each cell of the table, you need to see whether the table you actually observed is close or far from the table you would expect if the variables were independent. The Pearson χ² test statistic9 fits the bill.

\[
X^2 = \sum_i \frac{(E_i - O_i)^2}{E_i},
\]

where Ei and Oi are the expected and observed counts for each cell in the table. You divide by Ei because you expect cells with bigger counts to also be more variable. Dividing by Ei in each term puts cells with large and small expected counts on a level playing field. Otherwise X² is just a way to measure the distance between your observed table and the table you would expect if the variables were independent.

How large does X² need to be to conclude X and Y are related? That depends on the number of rows and columns in the table. If there are R rows and C columns, then you compare X² to the χ² distribution with (R−1)(C−1) degrees of freedom. (Imagine chopping off one row and one column of the table and counting the cells in the smaller table.) The χ² distribution, like the t distribution, is a probability distribution we think you should know about, but not have to learn tables for. Figure 4.8 shows the χ² distribution with 5 degrees of freedom. The χ² distribution with d degrees of freedom has mean d and standard deviation √(2d). Thus for larger tables you need a larger X² to declare statistical significance, which makes sense because larger tables contribute more terms to the sum.

9There is another chi-square statistic called the "likelihood ratio" statistic, denoted G², that we won't cover. Even though they are calculated differently, G² ≈ X² and both statistics are compared to the same reference distribution.

Not on the test 4.3 Rationale behind the χ² degrees of freedom calculation

The χ² test compares two models. The "degrees of freedom" for the test is the difference in the number of parameters needed to fit each model. The simpler model assumes the two variables are independent. To specify that model you need to estimate R−1 probabilities for the rows and C−1 probabilities for the columns. The "−1" comes from the fact that probabilities have to sum to one. The more complicated model assumes that the two variables are dependent, so each cell in the table gets its own probability. That's RC − 1 "free" probabilities. A little arithmetic shows that the complicated model has (R − 1)(C − 1) more parameters than the simple model.

Large X² values make you want to reject H0 (and say there is a relationship between the two variables), so the p-value for the χ² test must be the probability in the upper tail of Figure 4.8. I.e. the probability that you would see an observed table even farther away from the expected table if the variables were truly independent and a second random sample was taken.

In the auto choice example, X² = 1.417, on (3−1)(2−1) = 2 degrees of freedom, for a p-value of .4924. If TYPE and GENDER really were independent, we would see observed tables this far or farther from the expected tables about half the time. The small differences we see between men and women in the sample could very easily be due to random chance. To see what a relationship looks like, recall Figure 1.6 on page 11, which showed the preferences for automobile type across three different age groups. The data in Figure 1.6 produce X² = 29.070, on 4 = (3−1)(3−1) degrees of freedom, yielding a p-value p < .0001. There is very strong evidence of a relationship between TYPE and AGEGROUP.
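Given a test statistic and its degrees of freedom, the upper-tail p-value is one line in Python (scipy assumed; with the full table of counts, scipy.stats.chi2_contingency would compute X² for you as well):

```python
from scipy.stats import chi2

print(chi2.sf(1.417, df=2))    # gender vs. type: about 0.4924
print(chi2.sf(29.070, df=4))   # age group vs. type: far below .0001
```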

The χ² test is a test for the entire table. In principle you could design a hypothesis test to examine subsets of a contingency table, but we will not discuss them here. A good strategy with contingency tables is to perform a χ² test first, to see if there is any relationship between the variables in the table. If there is, you can use methods similar to Section 4.5.2 to compare individual proportions.


Figure 4.8: The χ² distribution with 5 degrees of freedom. p-values come from the upper tail of the distribution.

Caveat

There is one important caveat to keep in mind for the χ² test. The p-values are generally not trustworthy if the expected number of observations in any cell of the table is less than 5. However, if that is the case then you may form a new table by collapsing one or more of the offending categories into a single category (i.e. replacing "red" and "blue" with "red or blue").


Chapter 5

Simple Linear Regression

In Chapter 6 we will consider relationships between several variables. The two most common reasons for modeling joint relationships are to understand how one variable will be affected if we change another (for example, how will profit change if we increase production) and to actually predict one variable using the value of another (e.g. if we price our product at $X what will be our sales Y). There are any number of ways that two variables could be related to each other. We will start by examining the simplest form of relationship, a linear or straight line relationship, but it will become clear that we can extend the methods to more complicated relationships.

5.1 The Simple Linear Regression Model

The idea behind simple linear regression is very easy. We have two variables X and Y and we think there is an approximate linear relationship between them. If so then we can write the relationship as

Y = β0 + β1X + ε

where β0 is the intercept term (i.e. the value of Y when X = 0), β1 is the slope (i.e. the amount that Y increases by when we increase X by 1) and ε is an error term (see Figure 5.1). The error term comes in because we don't really think that there is an exact linear relationship, only an approximate one. A fundamental regression assumption is that the error terms are all normally distributed with the same standard deviation σ.

An equivalent way of writing the linear regression model is

Y ∼ N (β0 + β1X,σ).


Figure 5.1: An illustration of the notation used in the regression equation.

Recall that Y ∼ N(µ, σ) means "Y is a normally distributed random variable with mean µ and standard deviation σ." Writing the regression model this way helps make the connection to Chapters 2 and 4. All that has changed is that now we're letting µ depend on some background variable that we're calling X.

If we are interested in how Y changes when we change X then we look at the slope β1. If we wish to predict Y for a given value of X we use ŷ = β0 + β1x. In practice there is a problem: β0 and β1 are unknown (just like µ was unknown in Chapter 4) because we can't see all the X's and Y's in the entire population. Instead we get to see a sample of X and Y pairs and need to use these numbers to guess β0 and β1. For any given values of β0 and β1 we can calculate the residuals ei = yi − ŷi, which measure how far the y that we actually saw is from its prediction. The residuals are signed so that a positive residual means the point is above the regression line, and a negative residual means the point is below the line. We choose the line (i.e. choose β0 and β1) that makes the sum of the squared residuals as small as possible (we square the residuals to remove the sign). Said another way, regression tries to minimize

\[
SSE = e_1^2 + e_2^2 + \cdots + e_n^2.
\]

SSE stands for the "sum of squared errors" and the estimates for β0 and β1 are called b0 and b1. Note that the line that we get is just a guess for the true line (just as x̄ is a guess for µ). Therefore b0 and b1 are random variables (just like x̄) because if we took a new sample of X and Y values we would get a different line.

You obtain guesses for β0 and β1 by minimizing SSE. What about a guess for σ?


Not on the test 5.1 Why sums of squares?

There are a number of reasons why the sum of squared errors is the model fitting criterion of choice. The first is that it is a relatively easy calculus problem. You could (but thankfully don't have to) take the derivative of SSE with respect to β0 and β1 (viewing all the xi's and yi's as constants) and solve the system of two equations without too much difficulty.

On a deeper level, sums of squares tell us how far one thing is from another. Do you remember the "distance formula" from high school algebra? It says that the squared distance between two ordered pairs (x1, y1) and (x2, y2) is d² = (x2 − x1)² + (y2 − y1)². The corresponding formula works for ordered triples plotted in 3-space, and for ordered n-tuples in higher dimensions. The sum of squared errors is the squared distance from the observed vector of responses (y1, . . . , yn) to the prediction vector (ŷ1, . . . , ŷn). By minimizing SSE, regression gets those two vectors to be as close as possible.

In earlier Chapters, we said that variance was the "average squared deviation from the mean." The same is true here, but in a regression model the "mean" for each observation depends on X. So instead of averaging (yi − ȳ)², we average (yi − ŷi)². One other difference is that in a regression model we must estimate two parameters (the slope and intercept) to compute ŷi, so we use up one more degree of freedom than we did when our best guess for Y was simply ȳ. Thus the estimate of σ² is

\[
s^2 = \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n-2}.
\]

Now look at the sum in the numerator of s². That's SSE! People sometimes call s² the mean square error, or MSE, because to calculate the variance you take the mean of the squared errors. Those same people would call s = √(s²) the "root" mean square error, or RMSE.

When you estimate a regression model you estimate the three numbers b0, b1, and s. The first two tell you where the regression line goes. The last one gives you a sense of how spread out the Y's are around the line.
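To make b0, b1, and s concrete, here is a minimal Python sketch of the calculation. The closed-form slope below is what minimizing SSE with calculus gives you; nothing here is tied to any particular data set.

```python
import numpy as np

def least_squares(x, y):
    """Return b0, b1, and s for a simple linear regression of y on x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    residuals = y - (b0 + b1 * x)                    # e_i = y_i - yhat_i
    s = np.sqrt(np.sum(residuals ** 2) / (n - 2))    # RMSE: SSE/(n - 2)
    return b0, b1, s
```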

5.1.1 Example: The CAPM Model

Figure 5.2 shows the results obtained from regressing monthly stock returns for Sears against a "value weighted" stock index representing the overall performance of the stock market. The "capital asset pricing model" (CAPM) from finance says that you can expect the returns for an individual stock to be linearly related to the overall stock market.

Summary of Fit
  RSquare                 0.426643
  RSquare Adj             0.424106
  Root Mean Square Error  0.056324
  Mean of Response        0.009779
  Observations            228

Parameter Estimates
  Term        Estimate    Std Error  t Ratio  Prob>|t|
  Intercept   -0.003307   0.003864   -0.86    0.3931
  VW Return    1.123536   0.086639   12.97    <.0001

Figure 5.2: Fitting the CAPM model for Sears stock returns.

The equation of the estimated line describing this relationship can be found in the "Parameter Estimates" portion of the computer output. The first column lists the names of the X variables in the regression. The second column lists their coefficients in the regression equation.1 So the regression equation is

Sears = −0.003307 + 1.123536(VW Return).

You can find s in the “Summary of Fit” table under the heading “Root Mean SquareError.” In this example s = 0.056324.

The CAPM model refers to the estimated equation as the “security market line.”The slope of the security market line is known as the stock’s “beta” (a referenceto the standard regression notation). It provides information about the stock’svolatility relative to the market. If β = 2 then a one percentage point change in themarket return would correspond to a two percentage point change in the return ofthe individual security. Thus if β > 1 the stock is more volatile than the market asa whole, and if β < 1 the stock is less volatile than the market.
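If you keep the two return series in a data table, output like Figure 5.2 is easy to reproduce in Python with statsmodels. The sketch below uses simulated returns purely for illustration (the numbers are made up to resemble the Sears example, not taken from the actual data):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
vw = pd.Series(rng.normal(0.01, 0.045, 228), name="VW Return")   # market returns
stock = -0.003 + 1.12 * vw + rng.normal(0, 0.056, 228)           # a beta ~ 1.12 stock

fit = sm.OLS(stock, sm.add_constant(vw)).fit()
print(fit.params)               # intercept and slope (the stock's "beta")
print(np.sqrt(fit.mse_resid))   # s, the root mean square error
```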

You can find regression-based volatility information in the prospectus for your investments. Figure 5.3 shows the volatility measurements from the prospectus of one of Fidelity's more aggressive mutual funds. It lists β, which we just discussed; R², which will be discussed in Section 5.2.2; and "standard deviation," which is s.

1All statistics programs organize their estimated regression coefficients in exactly this way.


Figure 5.3: Volatility measures for an aggressive Fidelity fund. Source: www.fidelity.com.

5.2 Three Common Regression Questions

There are three common questions often asked of a regression model.

1. Can I be sure that X and Y are actually related?

2. If there is a relationship, how strong is it? An equivalent way to phrase this question is: "What proportion of the variation have I explained?"

3. What Y would I predict for a given X value and how sure am I about my prediction?

5.2.1 Is there a relationship?

This question really comes down to asking whether β1 = 0, because if the slope is zero X disappears out of the regression model and we get Y = β0 + ε. The question is, "Is b1 far enough away from 0 to make us confident that β1 ≠ 0?" This should sound familiar because it is exactly the same idea we used to test a population mean in the one sample t-test. (Is x̄ far enough from µ0 that we can be sure µ ≠ µ0?) We perform the same sort of hypothesis test here. Start with H0 : β1 = 0 versus H1 : β1 ≠ 0. We calculate b1 and its standard error and then take the ratio

\[
t = \frac{b_1}{SE(b_1)}.
\]

As before, the t-statistic counts the number of standard errors that b1 is from zero. If the absolute value of t is large then we will reject the null hypothesis and conclude there must be a relationship between X and Y. As always, we determine whether t is large enough by looking at the p-value it generates. If the p-value is small we will conclude that there is a relationship. We usually hope that the p-value is small, otherwise we might as well just stop because there is no evidence that X and Y are related at all. The small p-value for the slope in Figure 5.2 leaves no doubt that there is a relationship between Sears stock and the stock market.


Don't Get Confused! 5.1 R² vs. the p-value for the slope

R² tells you how strong the relationship is. The p-value for the slope tells you whether R² is sufficiently large for you to believe there is any relationship. Put another way, the p-value answers a yes-or-no question about the relationship between X and Y. The smaller the p-value the more sure you can be that there is at least some relationship between the two variables. R² answers a "how much" question. You shouldn't even look at R² unless the p-value is significant.

The only difference between the t-test for a regression coefficient and the t-test for a mean is that the standard error is computed using a different formula.

\[
SE(b_1) = \frac{s}{s_x \sqrt{n-1}}
\]

This formula says that b1 is easier to estimate (its SE is small) if

1. s is small. s is the standard deviation of Y around the regression line, so if the points are tightly clustered around the regression line it is easier to guess the slope.

2. sx is large. sx is the standard deviation of the X's. If the X's are highly spread out, it is easier to guess the regression slope.

3. n is large. As with most things, the estimate of the slope improves as more data are available.

You could also use SE(b1) to construct a confidence interval for β1. Just as with means, you can construct a 95% confidence interval for β1 as [b1 ± 2SE(b1)]. How meaningful this confidence interval is depends on whether you can give β1 a meaningful interpretation. In Figure 5.2, β1 is the stock's volatility measure, so a confidence interval for β1 represents a range of plausible values for the stock's true volatility. The standard error of b1 in Figure 5.2 is about .086, so a 95% confidence interval for Sears stock is roughly (1.12 ± .17) = (.95, 1.29). The confidence interval contains 1, which says that Sears stock may be no more volatile than the stock market itself.
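The arithmetic from the Figure 5.2 output, as a quick Python check (the text's (.95, 1.29) rounds b1 and SE(b1) before adding, so the endpoints differ slightly):

```python
b1, se_b1 = 1.123536, 0.086639           # slope and its standard error from Figure 5.2
print(b1 / se_b1)                        # t ratio, about 12.97
print(b1 - 2 * se_b1, b1 + 2 * se_b1)    # roughly (0.95, 1.30)
```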

5.2.2 How strong is the relationship?

Once we have decided that there is some sort of relationship between X and Y we want to know how strong the relationship is. We measure the strength of the relationship between X and Y using a quantity called R². To calculate R² we use the formula

\[
R^2 = 1 - \frac{SSE}{SST}
\]

where SST = Σ(yi − ȳ)². Think of SST as the variability you would have if you ignored X and estimated each y with ȳ (the variability of Y about its mean). Think of SSE as the variability left over after fitting the regression (the variability of Y about the regression line). Then R² calculates the proportion of variability that you have explained using the regression. R² is always between zero and one. A number close to one indicates a large proportion of the variability has been explained. A number close to zero indicates the regression did not explain much at all! Note that

\[
R^2 \approx 1 - \frac{s_e^2}{s_y^2}
\]

where se is the standard deviation of the residuals and sy is the standard deviation of the Y's.2 For example, if the sample standard deviation of the y's (about the average ȳ) is 1, and the standard deviation of the residuals (about the regression line) is 0.5 then R² ≈ .75 because 0.5² is about 75% smaller than 1².
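The defining formula is a one-liner; this sketch just restates it in Python for arbitrary observed values and fitted values:

```python
import numpy as np

def r_squared(y, y_hat):
    """Proportion of variation explained: R^2 = 1 - SSE/SST."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    sse = np.sum((y - y_hat) ** 2)       # variation left around the fitted line
    sst = np.sum((y - y.mean()) ** 2)    # variation around the plain mean
    return 1 - sse / sst
```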

You might be wondering why we wouldn't just use the correlation coefficient we learned about in Section 3.2.3 to measure how closely X and Y are related. In fact it can be shown that

R² = r²

so the two quantities are equivalent. An advantage of R² is that it retains its meaning even if there are several X variables, which will be the case in Chapter 6 and in most real life regression applications. Correlations are limited to describing one pair of variables at a time. Of course, correlations tell you the direction of the relationship in addition to the strength of the relationship. That's why we keep them around instead of always using R².

5.2.3 What is my prediction for Y and how good is it?

Once we have an estimate for the line, making a prediction of Y for a given x is easy. We just use

ŷ = b0 + b1x.

However, there are two sorts of uncertainty about this prediction.

2The approximate equality is because se² = SSE/(n − 2), while sy² = SST/(n − 1). If n is large then (n − 1)/(n − 2) ≈ 1.


1. Remember that b0 and b1 are only guesses for β0 and β1 so ŷ = b0 + b1x is a guess for the population regression line. I.e. ŷ is a guess for the "µ" associated with this particular x. We use a confidence interval to determine how good a guess it is.

2. Even if we knew the parameters of the regression model perfectly the individual data points wouldn't lie exactly on the line. A prediction interval provides a range of plausible values for an individual future observation with the specified x value. The interval combines our uncertainty about the location of the population regression line with the variation of the points around the population regression line.

Figure 5.4 illustrates the difference between these two types of predictions. It shows the relationship between a firm's monthly sales and monthly advertising expenses. Suppose you are thinking of setting an advertising policy of spending a certain amount each month for the foreseeable future. You are taking a long term view so you probably care more about the long run monthly average value of sales than you do month to month variation. In that case you want to use a confidence interval. However, if you are considering a one month media blitz and you want to forecast what sales will be if you spend $X on advertising, then the prediction interval is what you want.

Confidence Intervals for the Regression Line

Suppose you are interested in predicting the numerical value of the population regression line at a new point x* of your choosing. (Translation: suppose you want to estimate the long run monthly average sales if you spend $2.5 million per month on advertising.) You would compute your confidence interval by first calculating

ŷ = b0 + b1x*.

Then the interval is ŷ ± 2SE(ŷ), where

\[
SE(\hat{y}) = s\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{(n-1)s_x^2}}.
\]

This is another one of those formulas that isn't as bad as it looks when you break it down piece by piece. First, pretend the second term under the square root sign isn't there. Then SE(ŷ) is just s/√n like it was when we were estimating a mean. The ugly second term increases SE(ŷ) when x* (the x value where you want to make your prediction) moves away from x̄. All that says is that it is harder to guess where the regression line goes when you're far away from x̄. "Far" is defined relative to sx, the standard deviation of the X's. Finally, you have (x* − x̄)² and sx² because the formula for variances is based on squared things. The "ugly second term" is the thing that makes the confidence intervals in Figure 5.4(a) bow away from the regression line far from the average ADVEXP.

Parameter Estimates
  Term       Estimate   Std Error  t Ratio  Prob>|t|
  Intercept  66151.758  11709.47   5.65     <.0001
  ADVEXP     23.32421   6.349464   3.67     0.0008

  RSquare 0.28412    RMSE 14503.29    N 36
  Mean of ADVEXP: 1804.444    SD of ADVEXP: 386.096

Figure 5.4: Regression output describing the relationship between sales and advertising expenses, including (a) confidence and (b) prediction intervals ($1000's). There is a "bow" effect present in both sets of intervals, but it is more pronounced in (a).

The formula for SE(ŷ) is useful because it helps us understand how confidence intervals behave. However, the best way to actually construct a confidence interval is on a computer. Suppose we wanted an interval estimate of long run average monthly sales assuming we spend $2.5 million per month on advertising. The regression equation says that ŷ = 66151 + 23.32(2500) = 124,451, or about $124 million. The 95% confidence interval when x* = 2500 is (114230, 134694), or between $114 and $135 million. You can see these numbers in Figure 5.4(a) by going to x* = 2500 on the X axis and looking straight up.
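The same interval can be rebuilt by hand from the Figure 5.4 summary numbers. A minimal Python sketch (it uses ±2 standard errors, so it only approximates the software's interval, which uses the exact t multiplier):

```python
import math

b0, b1 = 66151.758, 23.32421          # coefficients from Figure 5.4
s, n = 14503.29, 36                   # RMSE and sample size
x_bar, s_x = 1804.444, 386.096        # mean and SD of ADVEXP
x_star = 2500

y_hat = b0 + b1 * x_star              # about 124,462 ($1000's)
se_line = s * math.sqrt(1 / n + (x_star - x_bar) ** 2 / ((n - 1) * s_x ** 2))
print(y_hat - 2 * se_line, y_hat + 2 * se_line)   # roughly (114000, 135000)
```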


Prediction Intervals for Predicting Individual Observations

SE(ŷ) measures how sure you are about the location of the population regression line at a point x* of your choosing. Predicting an individual y value (let's call it y*) for an observation with that x* is more difficult, because you have to combine your uncertainty about the location of the regression line with the variation of the y's around the line. If you knew β0, β1, and σ² then your 95% prediction interval for y* would be (ŷ ± 2σ). As it is, we recognize that y = ŷ + residual, so

\[
Var(y^*) = \underbrace{s^2\left(\frac{1}{n} + \frac{(x^* - \bar{x})^2}{(n-1)s_x^2}\right)}_{Var(\hat{y})} + \underbrace{s^2}_{Var(\text{residual})}.
\]

Consequently, the standard error for a predicted y∗ value is

\[
SE(y^*) = s\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{(n-1)s_x^2} + 1}.
\]

An ugly looking formula, to be sure. But notice that if you had all the data in the population (i.e. if n → ∞) then it would just be s. Also notice that under the square root sign you have the same formula you had for SE(ŷ) plus a "1," which would actually be s² if you distributed the leading factor of s through the radical.

The extra "1" under the square root sign means that prediction intervals are always wider than confidence intervals. Everything else under the square root sign means that prediction intervals are always wider than they would be if you knew the equation of the true population regression line.

The 95% prediction interval when ADVEXP=2500 is (93263, 155662), or from $93 to $156 million.
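Continuing the sketch above, the prediction interval only needs the extra "1" under the square root (again ±2 standard errors rather than the exact t multiplier):

```python
import math

b0, b1, s, n = 66151.758, 23.32421, 14503.29, 36
x_bar, s_x, x_star = 1804.444, 386.096, 2500

y_hat = b0 + b1 * x_star
se_pred = s * math.sqrt(1 / n + (x_star - x_bar) ** 2 / ((n - 1) * s_x ** 2) + 1)
print(y_hat - 2 * se_pred, y_hat + 2 * se_pred)   # roughly (94000, 155000)
```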

Interpolation, Extrapolation, and the Intercept

Prediction and confidence intervals both get very wide when x* is far from x̄. Using your model to make a prediction at an x* that is well beyond the range of the data is known as extrapolation. Predicting within the range of the data is known as interpolation. Wide confidence and prediction intervals are the regression model's way of warning you that extrapolation is dangerous. An additional problem with extrapolation is that there is no guarantee that the relationship will follow the same pattern outside the observed data that it follows inside. For example a straight line may fit well for the observed data but for larger values of X the true relationship may not be linear. If this is the case then an extrapolated prediction will be even worse than the wide prediction intervals indicate.


The notion of extrapolation explains why we rarely interpret the intercept term in a regression model. The intercept is often described as the expected Y when X is zero. Yet we very often encounter data where all the X's are far from zero, so predicting Y when X = 0 is a serious extrapolation. Thus it is better to think of the "intercept" term in a regression model as a parameter that adjusts the height of the regression line than to give it a special interpretation.

5.3 Checking Regression Assumptions

It is important to remember that the regression model is only a model, and the inferences you make from it (i.e. prediction intervals, confidence intervals, and hypothesis tests) are subject to criticism if the model fails to fit the data reasonably well. We can write the regression model as

Y ∼ N(Ŷ, σ)

where Ŷ = β0 + β1X. This simple equation contains four specific assumptions, known as the regression assumptions. In order of importance they are:

1. Linearity (Y depends on X through a linear equation)

2. Constant variance (σ does not depend on X)

3. Independence (The equation for Y does not depend on any other Y ’s)

4. Normality (Observations are normally distributed about the regression line)

This section is about checking for violations of these assumptions, what happens if each one is violated, and possible means of correcting violations. The main tool for checking the regression assumptions is a residual plot, which plots the residuals vs. either X or Ŷ. It doesn't matter which, because Ŷ is just a linear function of X. For simple regressions we typically put X on the X axis of a residual plot. In Chapter 6 there will be several X variables, so we will put Ŷ (a one number summary of all the X's) on the X axis instead.

Why look at residuals? The residuals are what is left over after the regression has done all the explaining that it possibly can. Thus if the model has captured all the patterns in the data, the residual plot will just look like random scatter. If there is a pattern in the residual plot, that means there is a pattern in the data that the model missed, which implies that one of the regression assumptions has been violated. When we detect an assumption violation we want to do something to get the pattern out of the residuals and into the model where we can actually use it to make better predictions. The "something" is usually either changing the scale of the problem by transforming one of the variables, or adding another variable to the model.

Figure 5.5: "Tukey's Bulging Rule." Some suggestions for transformations you can try to correct for nonlinearity (powers such as x², y², √x, √y, log x, log y, 1/x, 1/y). The appropriate transformation depends on how the data are "bulging."

5.3.1 Nonlinearity

Nonlinearity simply means that a straight line does a poor job of describing the trend in the data. The hallmark of nonlinearity is a bend in the residual plot. That is, look for a pattern of increasing and then decreasing residuals (e.g. a "U" shape). You can sometimes see strong nonlinear trends in the original scatterplot of Y vs. X, but the bend is often easier to see using the residual plot.

Trick: If you are not sure whether you see a bend, try mentally dividing the residual plot into two or three sections. If most of the residuals in one section are on one side of zero, and most in another are on the other side, then you have a bend. You can also test for nonlinearity by fitting a quadratic (i.e. both linear and squared terms) and testing whether the squared term is significant.

Fixing the problem: There are two fixes for nonlinearity.

1. If there is a "U" shape in the scatterplot (or an inverted "U") then fit a quadratic instead of a straight line.

2. If the trend in the data bends, but does not completely change direction, you should consider transforming the X and/or Y variables.

The most popular transformations are those that raise a variable to a power like X², X^(1/2) (the square root of X), or X^(−1) = 1/X. There is a sense3 in which X⁰ corresponds to log X, which for some reason is usually a good transformation to start with. The most common way to carry out the transformation is by creating a new column of numbers in your data table containing the transformed variable. Then you can run an ordinary linear regression using the transformed variable.

3See "Not on the Test" 5.2 on page 99.

Not on the test 5.2 Box-Cox transformations

There is a theory for how to transform variables known as Box-Cox transformations. Denote your transformed variable by

\[
w = \frac{x^\alpha - 1}{\alpha}.
\]

If α ≠ 0 then w is "morally" x^α. If you remember "L'Hopital's Rule" from calculus then you can show that as α → 0, w → log_e x.

There is something of an art to choosing a good transformation. There is some trial and error involved, and there could be several choices that fit about equally well. Tukey's bulging rule provides some advice on how to get started. You should think about transformations as stretching or shrinking the axis of the transformed variable. Powers greater than one stretch, and powers less than one shrink. For example, if the trend in the data bends like the upper right quadrant of Figure 5.5 then you can "straighten out" the trend by stretching either the X or Y axes, i.e. by raising them to a power greater than 1. If the trend looks like the lower left quadrant of Figure 5.5 then you want to contract either the X or Y axis. Given a choice, we typically prefer transformations that contract the data because they can also limit the influence of unusual data points, which we will discuss in Section 5.4.
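In practice "creating a new column and rerunning the regression" is a couple of lines. This sketch (statsmodels assumed; the variable names in the comment are placeholders) wraps the idea as a helper you could call with np.log, np.sqrt, or a reciprocal to compare candidate transformations of X:

```python
import numpy as np
import statsmodels.api as sm

def fit_on_scale(x, y, transform=np.log):
    """Regress y on a transformed x and report the fit on that scale."""
    tx = transform(np.asarray(x, float))            # the new "column"
    fit = sm.OLS(np.asarray(y, float), sm.add_constant(tx)).fit()
    return fit.params, fit.rsquared, np.sqrt(fit.mse_resid)

# e.g. fit_on_scale(display_feet, sales, np.log)
#      fit_on_scale(display_feet, sales, lambda v: 1.0 / v)
```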

Nonlinearity Example 1

A chain of liquor stores is test marketing a new product. Some stores in the chain feature the product more prominently than others. Figure 5.6 shows weekly sales figures for several stores plotted against the number of "display feet" of space the product was given on each store's shelves.

There is an obvious bend in the relationship between display feet and sales, which you could characterize as "diminishing returns." The bend does not change direction, so we prefer to capture it using a transformation rather than fitting a quadratic. The bend looks like the upper left quadrant of Figure 5.5, so we could straighten it out by either stretching the Y axis or compressing the X axis. Let's start with a transformation from DisplayFeet to LogDisplayFeet. Figure 5.6(c) shows what the scatterplot looks like when we plot Sales vs. LogDisplayFeet, along with the estimated regression line. The curved line in Figure 5.6(a) shows the same estimated regression line but plotted on the original scale. The log transformation changes the scale of the problem from one where a "straight line" assumption is not valid to another scale where it is.

Figure 5.6: Relationship between amount of display space allocated to a new product and its weekly sales. (a) Original scale showing linear regression and log transform. (b) Residual plot from linear regression. (c) Linear regression on transformed scale.

The log transformation fits the data quite nicely, but there are other transformations that would fit just as well. Figure 5.7 compares computer output for the log and reciprocal (1/X) transformations. Graphically, both transformations seem to fit the data pretty well. Numerically, R² is about the same for both models.4 It is a little higher for the reciprocal model, but not enough to get excited about. So which transformation should we prefer?

We want the model to be as interpretable as possible. That means we want a transformation that fits the data well, but also has some economic or physical justification. The question is, what do we think the relationship between Sales and DisplayFeet will look like as DisplayFeet grows? If we think sales will continue to rise, albeit at a slower and slower rate, then the log model makes more sense. If we think that sales will soon plateau, and won't increase no matter how many feet of display space the product receives, then the reciprocal model is more appropriate. Statistics can tell you that both models fit about equally well. The choice between them is based on your understanding of the context of the problem.5

Let's suppose we prefer the reciprocal model. Now that we have it, what can we do with it?

4You can compare R² for these models because Y has not been transformed. Once you transform Y then R² is measured differently for each model. I.e. it doesn't make sense to compare percent variation explained for log(sales) to the percent variation explained for 1/sales.

5That's good news for you. Your ability to use judgment in situations like this makes you more valuable than a computer. Human judgment, even more than Arnold Schwarzenegger or Keanu Reeves, will keep computers from taking over the earth.


Log model:
  RSquare 0.815349    RMSE 41.3082
  Term           Estimate   Std Error  t Ratio  Prob>|t|
  Intercept      83.560256  14.41344   5.80     <.0001
  Log(DispFeet)  138.62089  9.833914   14.10    <.0001

Reciprocal model:
  RSquare 0.826487    RMSE 40.04298
  Term             Estimate   Std Error  t Ratio  Prob>|t|
  Intercept        376.69522  9.439455   39.91    <.0001
  1/(DisplayFeet)  -329.7042  22.51988   -14.64   <.0001

Figure 5.7: Comparing the log (heavy line) and reciprocal (lighter line) transformations for the display space data.

Suppose that the primary cost of stocking the product is an opportunity cost of $50/foot. That is, the other products in the store collectively generate about $50 in sales per foot of display space. How much of the new product should we stock? If we stock x display feet then our marginal profit will be

\[
\pi(x) = \underbrace{\beta_0 + \beta_1/x}_{\text{extra sales revenue}} - \underbrace{50x}_{\text{opportunity cost}}.
\]

If you know some calculus6 you can figure out the optimal value of x is

\[
x = \sqrt{\frac{-\beta_1}{50}}.
\]

Our estimate for β1 is b1 = −329, so our best guess at the optimal display amount is √(329/50) ≈ 2.5 feet. How sure are we about this guess? Well β1 is plausibly between −329 ± 2(22.5) = (−374, −284), so if we plug these numbers into our formula for the optimal x we find it is between 2.38 and 2.73 feet.

6If not then you can hire one of us to do the calculus for you, for an exorbitant fee!
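Those plug-in numbers are easy to reproduce (a sketch; the slope and its standard error come from the reciprocal-model output in Figure 5.7, and $50/foot is the assumed opportunity cost):

```python
import math

b1, se_b1 = -329.7042, 22.51988      # 1/(DisplayFeet) slope and its standard error
cost_per_foot = 50                   # assumed opportunity cost, $ per display foot

best = math.sqrt(-b1 / cost_per_foot)                  # about 2.57 feet
low = math.sqrt(-(b1 + 2 * se_b1) / cost_per_foot)     # about 2.39 feet
high = math.sqrt(-(b1 - 2 * se_b1) / cost_per_foot)    # about 2.74 feet
print(best, low, high)               # the text rounds these to 2.5, 2.38, and 2.73
```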


For budgeting purposes we want to know the long run average weekly sales we can expect if we give the product its optimal space. Our best guess for weekly sales is 376.69522 − 329.7042/2.5 = $244.81, but that could range between $232.62 and $257.00. These numbers come from our confidence interval for the regression line at x* = 2.5, which we obtained from the computer. Will it be profitable to stock the item over the long run? Stocking the product at 2.5 feet of display space will cost us $125 per week in opportunity cost, so we estimate that the long run average weekly profit is somewhere between $107.62 and $132.00. That means we're pretty sure the product will be profitable over the long haul. Finally, the 95% prediction interval for a given week's sales with 2.5 feet of display space goes from $163 to $326. If a store sells less than $163 of the product in a given week they might receive a visit from the district manager to see what the problem is. If they sell more than $326 then they've done really well.

Nonlinearity Example 2

While we've been learning about correlation and regression, our HR director from Chapter 4 has been off negotiating with her labor union. The negotiating team thought they had a deal, but when it was presented to the union for a vote the deal faced stiff resistance from the oldest union members. In Chapter 4 the HR director determined that a life insurance benefit was too expensive to include in management's initial offer. Now she's wondering if it is time to offer the benefit, despite the cost, in order to disproportionately sway the older union members.

Figure 5.8 shows a scatterplot from a simple random sample of union members that the HR director obtained before the labor negotiations. The plot shows the relationship between a union member's age and the amount they pay in life insurance. The HR director fits a linear regression of LifeCost on Age and notices that the regression line has a positive slope with a significant p-value. That means older employees pay more for life insurance, so the oldest employees ought to support her proposed benefit, right?

It would, if we believed in the linear relationship. Examine the residual plot in Figure 5.8. Most of the residuals in the middle third of the plot are positive, and most of the residuals in the other two thirds are negative. That is evidence of a nonlinear relationship. The quadratic model from Figure 5.8 is a much better fit to these data. Notice that the quadratic term has a significant p-value, which says that it is doing a useful amount of explaining. Also notice that R² is much higher for the quadratic model than for the linear model. In Chapter 6 we will learn that R² ALWAYS increases when you add a variable to the model (as we did with the quadratic term). Thus we couldn't justify the quadratic model if R² were only a little bit higher. The significant p-value for the quadratic term says that R² went up by enough to justify the term's inclusion.

Linear model:
  Term       Estimate   Std Error  t Ratio  Prob>|t|
  Intercept  258.87651  105.6601   2.45     0.0173
  Age        5.0732059  2.3444     2.16     0.0345
  RSquare 0.073527    RMSE 195.3152

Quadratic model:
  Term            Estimate   Std Error  t Ratio  Prob>|t|
  Intercept       360.62865  95.48353   3.78     0.0004
  Age             4.6138293  2.05667    2.24     0.0287
  (Age-43.787)^2  -0.717539  0.16517    -4.34    <.0001
  RSquare 0.300978    RMSE 171.1107

Figure 5.8: Output from linear and quadratic regression models for insurance premium regressed on age. The residual plot is from the linear model.

Figure 5.9 plots the residuals from the quadratic model. The residuals are more evenly scattered throughout the plot, particularly at the edges, which makes us comfortable that we've captured the bend.

The quadratic model differs from the linear model because it says that older members pay more for life insurance up to a point, then their monthly premiums begin to decline. Perhaps the older employees locked into their life insurance premiums long ago. The equation for the quadratic model is

Cost = 360.63 + 4.6(Age) − 0.72(Age − 43.787)²

The computer centers the quadratic term around the average age in the data set (i.e. x̄ = 43.787) to prevent something called collinearity that we will learn about in Chapter 6. A bit of calculus7 shows that if you have a quadratic equation written as a(x − x̄)² + bx + c then the optimal value of x is x̄ − b/(2a). Thus our regression model predicts that 47 year old union employees pay the most for life insurance.
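Plugging the fitted coefficients into that vertex formula confirms the peak age (a two-line check):

```python
a, b, x_bar = -0.717539, 4.6138293, 43.787   # quadratic term, Age term, centering value
print(x_bar - b / (2 * a))                   # about 47 years old
```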

7Or a HUGE consulting fee.


Figure 5.9: Residuals from the quadratic model fit to the insurance data.

The old geezers standing in the way of a deal probably wouldn't be swayed by a life insurance benefit.

5.3.2 Non-Constant Variance

Non-constant variance8 means that points at some values of X are more tightly clustered around the regression line than at other values of X. To check for non-constant variance look for a funnel shape in the residual plot, i.e. residuals close to zero to start with and then further from zero for larger X values. Often as X increases the variance of the errors will also increase.

Trick: Try mentally dividing the plot in half. If the residuals on one half seem tightly clustered, while the residuals on the other half seem more scattered, then you have non-constant variance.

Consequence: There are two consequences of non-constant variance. First, points with low variance should be weighted more heavily than points with high variance, so the least squares coefficients are estimated somewhat inefficiently (they remain unbiased, but they are noisier than they need to be). More seriously, all the formulas for prediction intervals, confidence intervals, and hypothesis tests that we like to use for regression depend on s. If the residuals have non-constant variance then s is meaningless because it doesn't make sense to summarize the variance with a single number.

Fixing the Problem: If Y is always positive, try transforming it to log(Y). There is another solution called weighted least squares which you may read about, but we will not discuss.

8Also known as heteroscedasticity, although the term is currently out of favor with statisticians because we've learned that 8 syllable words make people not like us. At least we're assuming that's the reason.


Figure 5.10: Cleaning data. (a) Regression line fit to raw data and log y transformation. (b) Regression of log y on log x. (c) Regression of √y on √x.

Example

Goldilocks Housekeeping Service is a contractor specializing in cleaning large office buildings. They want to build a model to forecast the number of rooms that can be cleaned by X cleaning crews. They have collected the data shown in Figure 5.10. The data exhibit obvious non-constant variance (the points on the right are far more scattered than those on the left). The company tries to fix the problem by transforming to log(rooms), but that creates a non-linear effect because there was roughly a straight line trend to begin with. Linearity is restored when the company regresses log(rooms) on log(crews), but then it looks like the non-constant variance goes the other way because the points on the left side of Figure 5.10(b) seem more scattered than those on the right. Goldilocks replaces the log transformations with square roots, which don't contract things as much as logs do. Figure 5.10(c) shows the results, which look just right.

When plotted on the data's original scale, the regression line for √rooms vs. √crews looks almost identical to the line for rooms vs. crews. So what has Goldilocks gained by addressing the non-constant variance in the data? Consider Figure 5.11. The left panel shows the prediction intervals from the data modeled on the raw scale. Notice how the prediction intervals are much wider than the spread of the data when the number of crews is small, and narrower than they need to be when the number of crews is large. Figure 5.11(b) shows prediction intervals constructed on the transformed scale where the constant variance assumption is much more reasonable. The prediction intervals in Figure 5.11(b) track the variation in the data much more closely than in panel (a).


Figure 5.11: The effect of non-constant variance on prediction intervals. Panel (a) is fit on the original scale. The model in panel (b) is fit on the transformed scale, then plotted on the original scale.

5.3.3 Dependent Observations

"Dependent observations" usually means that if one residual is large then the next one is too. To find evidence of dependent observations, look for tracking in the residual plot. The best way to describe tracking is that when you see a residual far above zero, it takes many small steps to get to a residual below zero, and vice versa. The technical name for this "tracking" pattern is autocorrelation. Unfortunately, it can sometimes be difficult to distinguish between autocorrelation and non-linearity. The good news is that autocorrelation only occurs with time series data, so if you don't have time series data you don't have to worry about this one.

Trick: The X variable in your plot must represent time in order to see tracking. This becomes even more important when we deal with several X variables in multiple regression.

Consequence: If you see tracking in the residual plot then that means today's residual could be used to predict tomorrow's residual. This means you could be doing a better job than you are by including only the long run trend (i.e. a regression where your X variable is time). The obvious thing to do here is to put "today's" residual in the model somehow. We will learn how to put both trend and autocorrelation in the model when we learn about multiple regression in Chapter 6. For now, the best way to deal with autocorrelation may be to simply regress yt on the lag variable yt−1. You can think of a lag variable as "yesterday's y." Regressing a variable on its lag is sometimes called an autoregression.

Figure 5.12: Cell phone subscriber data (a) on raw scale with log and square root transformations, (b) after y^(1/4) transformation, (c) residuals from panel (b).
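Building the lag variable is one line with pandas; a small sketch of the autoregression idea (statsmodels assumed):

```python
import pandas as pd
import statsmodels.api as sm

def autoregression(y):
    """Regress a time series on its own lag ("yesterday's y")."""
    y = pd.Series(y, dtype=float)
    data = pd.DataFrame({"y": y, "lag_y": y.shift(1)}).dropna()  # first period has no lag
    return sm.OLS(data["y"], sm.add_constant(data["lag_y"])).fit()
```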

Example

Figure 5.12 plots the number of cell phone subscribers, industry wide, every six months from December 1984 to December 1995. Nonlinearity is the most serious problem with fitting a linear regression to these data. Figure 5.12(a) shows the results of fitting a linear regression to log y, which bends too severely, and √y, which doesn't bend quite enough. Figure 5.12(b) shows the scatterplot obtained by transforming Subscribers to the 1/4 power. That's not a particularly interpretable transformation, but it definitely fixes our problem with nonlinearity. Figure 5.12(c) shows the residual plot obtained after fitting a linear model to Figure 5.12(b). There is definite tracking in the residual plot, which is evidence of autocorrelation, where tomorrow's residual is correlated with today's residual.

If you wanted to forecast the number of cell phone subscribers in June 1996 (the next time period, period 24) using these data then it looks like a regression based on Figure 5.12(b) would do a very good job. After all, it has R² = .997, the highest we've seen so far. However you could do even better if you could also incorporate the pattern in Figure 5.12(c), which says that you should probably increase your forecast for period 24. That's because the residuals for the last few time periods have been positive, so the next residual is likely to be positive as well.

Consider Figure 5.13, which shows output from the regression of subscribers to the 1/4 power on a lag variable. Both models have a very high R². The model based on the long run trend over time has R² = .997, while the autoregression has R² = .999. That doesn't seem like a big change in R², but notice that s in the autoregression is only half as large as it is in the "long run trend" model. Perhaps the best reason to prefer the lag model is that when you plot its residuals vs. time, as in Figure 5.13(b), there is a much weaker pattern than in the residual plot in Figure 5.12(c).

Lag Variable:
  RSquare 0.999063    RMSE 0.51842    N 22

Trend:
  RSquare 0.996987    RMSE 0.972197    N 23

Figure 5.13: Output for the lag model compared to the trend model in Figure 5.12. (a) Subscribers to the 1/4 power vs. its lag. (b) Plot of residuals vs. time. (c) Summary of fit for the lag model. (d) Summary of fit for the trend model.

Table 5.1 gives the predictions for each model, obtained by calculating point estimates and prediction intervals on the y^(1/4) scale, then raising them to the fourth power. The small differences on the fourth root scale translate into predictions that differ by several million subscribers. The autoregression predicts more subscribers than the trend model, which is probably accurate given that the last few residuals in Figure 5.12(c) are above the regression line. The prediction intervals for the autoregression are tighter than the trend model. It looks like the autoregression can predict to within about ±2 million subscribers, whereas the trend model can predict to within about ±4 million. The period 24 point estimate for the trend model is only very slightly larger than the actual observed value for period 23.

  Model   Point Estimate   Lower 95% Prediction   Upper 95% Prediction
  Lag     39.347           37.023                 41.778
  Trend   34.276           30.498                 38.394

Table 5.1: Point and interval forecasts (in millions) for the number of cell phone subscribers in June 1996 based on the trend model and the autoregression.


Figure 5.14: Number of seats in a car as a function of its weight (left panel). Normal quantile plot of residuals (right panel).

5.3.4 Non-normal residuals

Normality of the residuals is the least important of the four regression assumptions. That's because there is a "central limit theorem" for regression coefficients just like the one for means. Therefore your intervals and tests for b1 and the confidence interval for the regression line are all okay, even if the residuals are not normally distributed.

Non-normal residuals are only a problem when you wish to infer something about individual observations (e.g. prediction intervals). If the residuals are non-normal then you can't assume that individual observations will be within ±2σ of the population regression line 95% of the time.

You check the normality of the residuals using a normal quantile plot, just like you would any other variable. Unfortunately, JMP doesn't make a normal quantile plot of the residuals by default. You have to save them as a new variable and make the plot yourself.

Example

One place where you can expect to find non-normal residuals is when modeling a discrete response. For example, Figure 5.14 shows that the number of seats in a car tends to be larger for heavier cars. The slope of the line is positive, and it has a significant p-value. The p-value is trustworthy, but it would be difficult to use this model to predict the number of seats in a 3000 pound car. You could estimate the average number of seats per car in a population of 3000 pound cars, but for an individual car the number of seats is certainly not normally distributed, so our technique for constructing a prediction interval wouldn't make sense.

Figure 5.15: Outliers (two in panel (a)) are points with big residuals. Leverage points (one in panel (b)) have unusual X values. Both plots show regression lines with and without the unusual points. None of the points influence the fitted regression lines.

5.4 Outliers, Leverage Points and Influential Points

In addition to violations of formal regression assumptions, the subject of Section 5.3, you should also check to see if your analysis is dominated by one or two unusual points. There are two types of unusual data points that you should be aware of.

1. An outlier is a point with an unusual Y value for its X value. That is, outliers are points with big residuals.

2. A high leverage point is an observation with an X that is far away from the other X's.

Outliers and high leverage points affect your regression model differently. It is possible to have a point that is both a high leverage point and an outlier. Such points are guaranteed to have substantial influence over the regression line. An influential point is an observation whose removal from the data set would substantially change the fitted line.

Regardless of whether an outlier or high leverage point influences the fitted line, these observations can make a serious impact on the standard errors we use in constructing confidence intervals, prediction intervals, and hypothesis tests. We have discussed three types of standard errors in this Chapter: for the slope of the line, for the ŷ value of the regression line at a point x*, and for an individual observation y*.

\[
SE(b_1) = \frac{s}{s_x\sqrt{n-1}} \qquad
SE(\hat{y}) = s\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{s_x^2(n-1)}} \qquad
SE(y^*) = s\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{s_x^2(n-1)} + 1}
\]

Note that all three standard errors depend on s, the residual standard deviation, and sx, the standard deviation of the x's. Outliers inflate s, which increases standard errors and makes us less sure about things. High leverage points inflate sx, which is in the denominator of all three standard error formulas. That means high leverage points make us more sure about our inferences. That extra certainty comes at a cost: we can't check the assumptions of the model, particularly linearity, between the high leverage point and the rest of the data.

5.4.1 Outliers

Figure 5.15(a) shows the relationship between average house price ($thousands) and per-capita monthly income ($hundreds) for 50 ZIP codes obtained from the 1990 Census. The data were collected by an intern who was unable to locate the average house price for two of the ZIP codes. The intern entered "0" for the observations where the house price was unavailable.

Figure 5.15(a) shows regression lines fit with and without the outliers generated by the intern recording 0 for the house prices. The outliers have very little effect on the regression line. They decrease the intercept somewhat, but the slope of the line is nearly unchanged. Outliers typically do not affect the fitted line very much unless they are also high leverage points. The two outliers in Figure 5.15(a) are observations with typical income levels, so they are not high leverage points.

Figure 5.16 shows the computer output for the two regression lines in Figure 5.15(a). Indeed the estimated regression equations are very similar. However, RMSE is roughly a factor of 4.5 larger if the outliers are included in the data set. The effects of the increased RMSE can be seen in R² (.660 vs. .069) and the standard error for the slope, which is directly proportional to s. Because SE(b1) is increased, the t-statistic and the p-value for the slope are insignificant when the outliers are included, even though the slope is highly significant without the outliers.

How big does a residual have to be before we call it an outlier? There is no hard and fast rule, but you should measure the residuals relative to s. Keep in mind that you expect about 5% of the residuals to be more than ±2 residual standard deviations from zero, and only 0.3% of the points to be more than ±3s from zero.


Full Data Set:

Term       Estimate    Std Error  t Ratio  Prob>|t|
Intercept  139.91994   40.26643   3.47     0.0011
INCOME     5.3208392   2.821008   1.89     0.0653
RSquare 0.069002    RMSE 46.51617    N 50

Outliers Excluded:

Term       Estimate    Std Error  t Ratio  Prob>|t|
Intercept  137.53823   9.256721   14.86    <.0001
INCOME     6.1341704   0.649089   9.45     <.0001
RSquare 0.660041    RMSE 10.66548    N 48

Figure 5.16: Regression output for the house price data.

The two 0’s in Figure 5.15(a) are between 4 and 5 residual standard deviations from zero, so they are clearly outliers.

Generally speaking, outliers don’t change the fitted line by very much unless they are also high leverage points, but they can affect the certainty of our inferences a great deal.

5.4.2 Leverage Points

There is one month in Figure 5.15 where the value weighted stock market index lost roughly 20% of its value. That is the infamous 1987 crash where the Dow fell over 500 points in a single day. That would be a large one day decline today (when the Dow is at about 9000). It was catastrophic in 1987, when the Dow was trading at just over 2000, and people were marveling that it was that high. The point is a high leverage point.

October 1987 may have been disastrous for the stock market, but it is pretty innocuous as a data point. Figure 5.17 shows the computer output for the models fit with and without October 1987. RMSE is virtually unchanged, though R2 is a little higher. The slope and intercept of the line barely move when the point is added or deleted. The high leverage point has the expected effect on SE(b1), but the effect is minor.

High leverage points get their name from their ability to exert undue “leverage” on the regression line. Imagine each data point is attached to the line by a spring. The farther a point is from the regression line the more force its spring is exerting. When an observation has an extreme X value, all the other observations collectively act like the fulcrum of a lever centered at x̄. The farther a high leverage point is from x̄, the less work its spring has to do to pull the line towards it, just like it was pulling on a very long lever. October 1987 happens to have a Y value which is right where the regression line predicts it would be, so its spring isn’t pulling the line very hard.


Full Data:

Term       Estimate    Std Error  t Ratio  Prob>|t|
Intercept  -0.003307   0.003864   -0.86    0.3931
VW Return  1.123536    0.086639   12.97    <.0001
RSquare 0.426643    RMSE 0.056324    N 228

Leverage Point Excluded:

Term       Estimate    Std Error  t Ratio  Prob>|t|
Intercept  -0.002627   0.003917   -0.67    0.5031
VW Return  1.0890647   0.092626   11.76    <.0001
RSquare 0.380581    RMSE 0.056311    N 227

Summary of leverages:  Mean 0.0087719    N 228

Figure 5.17: Computer output for the CAPM model fit with and without the high leverage point.

The next Section shows an example of a high leverage point that moves the line a lot.

There is an actual number, denoted hi, that can be calculated to determine the leverage of each point.9 You don’t need to know the formula for hi (but see page 114 if you’re interested), but you should know that it depends entirely on xi. The farther xi is from x̄, the more leverage for the point. Each hi is always between 1/n and 1, and the hi’s for all the points sum to 2. Some computer programs (but not JMP) warn you if a data point has hi greater than three times the average leverage (i.e. 2/n). Figure 5.17 plots the leverages for the CAPM data set. Sure enough, October 1987 stands out as a high leverage point.

5.4.3 Influential Points

This last example of unusual points shows an instance of a high leverage point that does move the line a lot.

9The letter h is used for leverages because they are computed from something called the “hat matrix” which is beyond our scope, even in a “Not on the test” box.


Not on the test 5.3 Why leverage is “Leverage”

For simple regression the formula for leverage is

hi = 1/n + (xi − x̄)² / ((n − 1)s²x).

You may recognize this formula as the thing under the square root sign in the formula for SE(ŷ). The intuition behind leverage is that the fitted regression line is very strongly drawn to high leverage points. That means high leverage points tend to have smaller residuals than typical points, so the residual for a high leverage point ought to have smaller variance than the other points. It turns out that Var(ei) = s²(1 − hi), where ei is the i’th residual. If an observation had the maximum leverage of 1 then Var(ei) = 0 because its pull on the fitted line is so strong that the line is forced to go directly through the leverage point. Most observations have hi close to 0, so that Var(ei) ≈ s². For multiple regression the formula for hi becomes sufficiently complicated that we can’t write it down without “matrix algebra,” which is beyond our scope.
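A tiny numpy sketch of the leverage formula, using made-up x values (one of them far from x̄):

    import numpy as np

    x = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 12.0])   # 12.0 is far from the mean
    n = len(x)
    h = 1/n + (x - x.mean())**2 / ((n - 1) * x.var(ddof=1))
    print(h)          # each hi lies between 1/n and 1; the last point is near 0.94
    print(h.sum())    # the leverages sum to 2 in simple regression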

A construction company that builds beach cottages typically builds them between 500 and 1000 square feet in size. Recently an order was placed for a 3500 square foot “cottage,” which yielded a healthy profit. The company wants to explore whether they should be building more large cottages. Data from the last 18 cottages built by the company (roughly a year’s work) are shown in Figure 5.18.

5.4.4 Strategies for Dealing with Unusual Points

When a point appreciably changes your conclusions you can

• perform your analyses with and without the point and report both results (a small sketch of this appears after the list), or

• use transformations to work on a scale where the point is not as influential. For example, if a point has a large x value and we transform x by taking log(X), then log(X) will not be nearly as large.

It is okay to delete unusual points if

• the point was recorded in error, or

• the point has a big impact on the model and you only want to use the model to predict “typical” future observations.

It is not okay to delete unusual points

• just because they don’t fit your model. Fit the model to the data, not vice versa.

• when you want to predict future observations like the unusual point (e.g. large cottages).
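As a concrete version of the “report both” strategy, here is a hypothetical sketch (fabricated numbers, simple regression only) that fits the line with and without a suspect point:

    import numpy as np

    # Nine ordinary points plus one unusual point tacked on at the end.
    x = np.array([5.0, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 35.0])
    y = np.array([52., 60., 63., 70., 74., 79., 85., 88., 95., 340.])

    def fit(x, y):
        b1 = np.cov(x, y)[0, 1] / np.var(x, ddof=1)   # least squares slope
        b0 = y.mean() - b1 * x.mean()                 # least squares intercept
        return b0, b1

    print("with the point:   ", fit(x, y))
    print("without the point:", fit(x[:-1], y[:-1]))  # report both sets of results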


Point Included:

Term       Estimate    Std Error  t Ratio  Prob>|t|
Intercept  -416.8594   1437.015   -0.29    0.7755
SqFeet     9.7505469   1.295848   7.52     <.0001
RSquare 0.779667    RMSE 3570.379    N 18

Point Excluded:

Term       Estimate    Std Error  t Ratio  Prob>|t|
Intercept  2245.4005   4237.249   0.53     0.6039
SqFeet     6.1370246   5.556627   1.10     0.2868
RSquare 0.075205    RMSE 3633.591    N 17

Figure 5.18: Computer output for cottages data with and without the high leverage point. The first two panels plot confidence intervals for the regression line fit (a) with and (b) without the high leverage point. Panel (c) plots hi for the regression: max hi = .94680.

5.5 Review

Correlation and covariance measure the strength of the linear relationship between two variables. Regression models the actual relationship. It is hard to test whether a correlation is zero, but easy to test whether a population line has zero slope (which amounts to the same thing).

The slope is the most interesting part of the regression.


Chapter 6

Multiple Linear Regression

Multiple regression is the workhorse that handles most statistical analyses in the “real world.” On one level the multiple regression model is very simple. It is just the simple regression model from Chapter 5 with a few extra terms added to ŷ. However multiple regression can be more complicated than simple regression because you can’t plot the data to “eyeball” relationships as easily as you could with only one Y and one X.

The most common uses of multiple regression fall into three broad categories

1. Predicting new observations based on specific characteristics.

2. Identifying which of several factors are important determinants of Y .

3. Determining whether the relationship between Y and a specific X persists after controlling for other background variables.

These tasks are obviously related. Often all three are part of the same analysis.

One thing you may notice about this Chapter is that there are many fewer formulas. Multiple regression formulas are most often written in terms of “matrix algebra,” which is beyond the scope of these notes. The good news is that the computer understands all the matrix algebra needed to produce standard errors and prediction intervals.

6.1 The Basic Model

Multiple linear regression is very similar to simple linear regression. The key difference is that instead of trying to make a prediction for the response Y based on only one predictor X we use many predictors which we call X1, X2, . . . , Xp. In most real life situations Y does not depend on just one predictor, so a multiple regression can considerably improve the accuracy of our prediction. Recall the simple linear regression model is

Yi = β0 + β1Xi + εi.

Multiple regression is similar except that it incorporates all the X variables,

Yi = β0 + β1Xi,1 + · · · + βpXi,p + εi

where Xi,1 indicates the ith observation from variable 1, etc. We have all the same assumptions as for simple linear regression, i.e. εi ∼ N(0, σ) and independent. The multiple regression model can also be stated as

Yi ∼ N (β0 + β1Xi1 + · · · + βpXip, σ),

which highlights how the multiple regression model fits in with Chapters 2, 4, and 5. The regression coefficients should be interpreted as the expected increase in Y if we increase the corresponding predictor by 1 and hold all other variables constant.

To illustrate, examine the regression output from Figure 6.1(a), in which a car’s fuel consumption (number of gallons required to go 1000 miles) is regressed on the car’s horsepower, weight (in pounds), engine displacement, and number of cylinders. Figure 6.1(a) estimates a car’s fuel consumption using the following equation

Fuel = 11.49 + 0.0089 Weight + 0.080 HP + 0.17 Cylinders + 0.0014 Disp.

You can think of each coefficient as the marginal cost of each variable in terms of fuel consumption. For example, each additional unit of horsepower costs 0.080 extra gallons per 1000 miles. Figure 6.1(b) provides output from a simple regression where Horsepower is the only explanatory variable. In the simple regression it looks like each additional unit of horsepower costs .18 gallons per 1000 miles, over twice as much as in the multiple regression! The two numbers are different because they are measuring different things. The Horsepower coefficient in the multiple regression is asking “how much extra fuel is needed if we add one extra horsepower without changing the car’s weight, engine displacement, or number of cylinders?” The simple regression doesn’t look at the other variables. If the only thing you know about a car is its horsepower, then it looks like each horsepower costs .18 gallons of gas. Some of that .18 gallons is directly attributable to horsepower, but some of it is attributable to the fact that cars with greater horsepower also tend to be heavier cars. Horsepower acts as a proxy for weight (and possibly other variables) in the simple regression.
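The proxy effect is easy to reproduce on simulated data. The sketch below (Python/numpy, with invented coefficients, not the course data set) fits both regressions; the HP coefficient is much larger when Weight is left out because HP stands in for it:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    weight = rng.normal(3000, 500, n)              # heavier cars...
    hp = 0.04 * weight + rng.normal(0, 20, n)      # ...tend to have more horsepower
    fuel = 10 + 0.009 * weight + 0.08 * hp + rng.normal(0, 3, n)

    X_multi = np.column_stack([np.ones(n), weight, hp])
    b_multi, *_ = np.linalg.lstsq(X_multi, fuel, rcond=None)

    X_simple = np.column_stack([np.ones(n), hp])
    b_simple, *_ = np.linalg.lstsq(X_simple, fuel, rcond=None)

    print("HP coefficient, Weight held fixed:", b_multi[2])   # close to 0.08
    print("HP coefficient, simple regression:", b_simple[1])  # noticeably larger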


Summary of Fit:
RSquare  0.852963
RMSE     3.387067
N        111

ANOVA Table:
Source  DF   Sum of Sq.  Mean Square  F Ratio   Prob > F
Model   4    7054.3564   1763.59      153.7269  <.0001
Error   106  1216.0559   11.47
Total   110  8270.4123

Parameter Estimates:
Term          Estimate   Std Error  t Ratio  Prob>|t|
Intercept     11.49468   2.099403   5.48     <.0001
Weight(lb)    0.0089106  0.001066   8.36     <.0001
Horsepower    0.0804557  0.013982   5.75     <.0001
Cylinders     0.1745649  0.553151   0.32     0.7529
Displacement  0.0014316  0.013183   0.11     0.9137

(a)

Term        Estimate   Std Error  t Ratio  Prob>|t|
Intercept   25.324465  1.494657   16.94    <.0001
Horsepower  0.1808646  0.011501   15.73    <.0001
RSquare 0.690198    RMSE 4.877156    N 113

(b)

Figure 6.1: (a) Multiple regression output for a car’s fuel consumption (gal/1000mi) regressed on the car’s weight (lbs), horsepower, engine displacement and number of cylinders. (b) Simple regression output including only HP.

6.2 Several Regression Questions

Just as with simple linear regression there are several important questions we are going to be interested in answering.

1. Does the entire collection of X’s explain anything at all? That is, can we be sure that at least one of the predictors is useful?

2. How good a job does the regression do overall?

3. Does a particular X variable explain something that the others don’t?

4. Does a specific subset of the X’s do a useful amount of explaining?

5. What Y value would you predict for an observation with a given set of X values and how accurate is the prediction?


6.2.1 Is there any relationship at all? The ANOVA Table and the Whole Model F Test

This question is answered by a hypothesis test known as the “whole model F test.” An F statistic compares a regression model with several X’s to a simpler model with fewer X’s (i.e. with the coefficients of some of the X’s set to zero). The “simpler model” for the whole model F test has all the slopes set to 0. That is, the null hypothesis is H0 : β1 = β2 = · · · = βp = 0 versus the alternative Ha: at least one β not equal to zero. In practical terms the whole model F test asks whether any of the X’s help explain Y. The null hypothesis is “no.” The alternative hypothesis is “at least one X is helpful.”

So how does an F statistic compare the two models? Remember that the job of a regression line is to minimize the sum of squared errors (SSE). The F statistic checks whether including the X’s in the regression reduces SSE by enough to justify their inclusion. In the special case of a regression with no X’s in it at all the sum of squared errors is called SST (which stands for “total sum of squares”) instead of SSE. That is,

SST = Σi (yi − ȳ)².

SST measures Y’s variance about the average Y, ignoring X. Thus SST/(n − 1) is the sample variance of Y from Chapter 1. Despite its name, we still think of SST as a sum of squared errors because if all the slopes in a regression are zero then the intercept is just ȳ. By contrast,

SSE = Σi (yi − ŷi)²

measures the amount of variation left after we use the X’s to make a prediction for Y. In other words, SSE/(n − p − 1) is the variance of Y around the regression line (aka the variance of the residuals). These are the same quantities that we saw in Chapter 5 except that now we are using several X’s to calculate ŷ. If SST is the variation that we started out with and SSE is the variation that we ended up with then the difference SST − SSE must represent the amount explained by the regression. We call this quantity the model sum of squares, SSM. It turns out that

SSM = Σi (ŷi − ȳ)².

If SSM is large then the regression has explained a lot, so we should say that at least one of the X’s helps explain Y. How large does it need to be? Clearly we need to standardize SSM somehow. For example if Y measures heights in feet we can make SSM 144 times as large simply by measuring in inches instead (think about why). To get around this problem we divide SSM by our estimate for σ², i.e.

s² = MSE = SSE / (n − p − 1)

where MSE stands for “mean square error.”1 We also expect SSM to be larger if there are more X’s in the model, even if they have no relationship to Y. To get around this problem we also divide SSM by the number of predictors,

MSM = SSM / p

where MSM stands for the “Mean Square explained by the Model.” (Okay, that’s a dumb name, but it helps remind you of MSE, to which MSM is compared. Plus, we didn’t make it up.) You can think about MSM as the amount of explaining that the model achieves per degree of freedom (aka per X variable in the model). If we combine these two ideas together, i.e. dividing SSM by the MSE and also by p, we get the F statistic

F = (SSM/p) / MSE = MSM / MSE.

The F statistic is independent of units. When F is large we know that at least one of the X variables helps predict Y. The computer calculates a p-value to help us determine whether F is large enough.2 To compute the p-value the computer needs to know how many degrees of freedom were used in the numerator of F (i.e. for computing MSM) and how many were in the denominator (for computing MSE). The F-statistic in Figure 6.1 has 4 DF in its numerator and 106 in its denominator.

All these sums of squares are summarized in something called an ANOVA (Analysis of Variance) table.

Source  df         SS    MS                   F        p
Model   p          SSM   MSM = SSM/p          MSM/MSE  p-value
Error   n − p − 1  SSE   MSE = SSE/(n−p−1)
Total   n − 1      SST

1It turns out that we’ve been using this rule all along. In Chapter 5 we had p = 1, so we divided by n − 2. Chapters 1 and 4 had p = 0 so we divided by n − 1. Now we have p predictors in the model so we divide SSE by n − p − 1.

2The p-value here is the probability that we would see an F statistic even larger than the one we saw, if H0 were true and we collected another data set.


Don’t Get Confused! 6.1 Why call it an “ANOVA table?”

The name “ANOVA table” misleads some people into thinking that the table says something about variances. Actually, the object of an ANOVA table is to say something about means, or in this case a regression model, which is a model for the mean of each Y given each observation’s X’s. It is called an ANOVA table because it uses variances to see whether our model for means is doing a good job.

For example, the ANOVA table in Figure 6.1 tells us that a model with 4 variables had been fit, the number of observations was 111 (hence the total DF of 111 − 1 = 110), our estimate for σ² is 11.47 (MSE), and MSM is 1763.59. Furthermore, MSM was significantly larger than MSE (the F ratio is 153.7269) and the probability of this happening if none of the 4 variables had any relationship to Y was less than 0.0001. The small p-value says that at least one of the variables is helping to explain Y.

One way to think about the ANOVA table is as the “balance sheet” for the regression model. Think of each degree of freedom as money. You “spend” degrees of freedom by putting additional X variables into the model. The “sum of squares” column explains what you got for your money. If you don’t use any degrees of freedom, then you will have SSE = SST. In the cars example, we had 110 degrees of freedom to spend, and we spent 4 of them to move 7054 of our variability from the “unexplained” box (SSE) to the “explained” box (SSM) of the table. We are left with 1216 which remains unexplained. The “mean square” column tries to decide if we got a good deal for our money. The model mean square (MSM) answers the question, “How much explaining, on average, did you get per degree of freedom that you spent?” If MSM is large then our degrees of freedom were well spent. As usual, the definition of “large” depends on the scale of the problem, so we have to standardize it somehow. It turns out that the right way to do this is to divide by the variance of the residuals (MSE), which gives us the F-ratio.
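If you want to audit the balance sheet yourself, the sketch below builds the ANOVA quantities from scratch (Python with numpy and scipy, on simulated data; it is not JMP output):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n, p = 111, 4
    X = rng.normal(size=(n, p))
    y = 5 + X @ np.array([2.0, 0.5, 0.0, 0.0]) + rng.normal(0, 3, n)

    Xd = np.column_stack([np.ones(n), X])            # add the intercept column
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    yhat = Xd @ b

    SST = np.sum((y - y.mean())**2)                  # total variation, df = n - 1
    SSE = np.sum((y - yhat)**2)                      # unexplained, df = n - p - 1
    SSM = SST - SSE                                  # explained by the model, df = p
    MSM, MSE = SSM / p, SSE / (n - p - 1)
    F = MSM / MSE
    print(F, stats.f.sf(F, p, n - p - 1))            # F ratio and its p-value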

6.2.2 How Strong is the Relationship? R2

The ANOVA table and whole model F test try to determine whether any variability has been explained by the regression. R2 estimates how much variability has been explained. Explaining variability works in exactly the same way as for simple regression. We still look at

R2 = 1 − SSE/SST = SSM/SST.


If this number is close to 1 then our predictors have explained a large proportion of the variability in Y. If R2 is close to zero then the predictors do not help us much. Notice that correlation is not a meaningful way of describing the collective relationship between Y and all the X’s, because correlation only deals with pairs of variables. This is one of the reasons we use R2: it gives an overall measure of the relationship.

One fact about R2 that should be kept in mind is that even if you add a variable that has no relationship to Y the new R2 will be higher! In fact if you add as many predictors as data points you will get R2 = 1. This may sound good but in fact it usually means that any future predictions that you make are terrible. We’ll show an example in Section 6.7 where we can add enough garbage variables to “predict the stock market” extremely well (high R2) for data in our data set but do a lousy job predicting future data.
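You can watch R2 creep upward as junk is added with a few lines of simulation (hypothetical data; the “junk” columns are pure noise with no relationship to y):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 60
    x = rng.normal(size=n)
    y = 1 + 2 * x + rng.normal(size=n)
    junk = rng.normal(size=(n, 40))          # noise predictors

    def r2(X, y):
        Xd = np.column_stack([np.ones(len(y)), X])
        b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
        resid = y - Xd @ b
        return 1 - resid @ resid / np.sum((y - y.mean())**2)

    for k in [0, 10, 20, 30, 40]:            # add the junk columns in batches
        X = np.hstack([x.reshape(-1, 1), junk[:, :k]])
        print(k, "junk variables: R2 =", round(r2(X, y), 3))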

6.2.3 Is an Individual Variable Important? The T Test

One of the first steps in a regression analysis is to perform the whole model F test described in Section 6.2.1. If its p-value is not small then we might as well stop because we have no evidence that regression is helping. However, if the p-value is small, so that we can conclude that at least one of the predictors is helping, then the question becomes “which ones?”

The main tool for determining whether an individual variable is important is the t-test for testing H0 : βi = 0. This is the same old t-test from Chapters 4 and 5 that computes how many SE’s each coefficient is away from zero. All variables with small p-values are probably useful and should be kept. However, because of issues like collinearity (discussed in Section 6.4) some of the variables with large p-values might be useful as well.3 That is, an apparently insignificant variable might become significant if another insignificant variable is dropped. This leads into model selection and data mining ideas which we discuss in Section 6.7. For now let’s just say that the right way to drop insignificant variables from the model is to do it one at a time. That way you can be sure that you don’t accidentally throw away something valuable.

When you test an individual coefficient in a multiple regression you’re asking whether that coefficient’s variable explains something that the other variables don’t. For example, in Figure 6.1 the coefficient of Cylinders appears insignificant. Does that mean that the number of cylinders in a car’s engine has nothing to do with its fuel consumption? Of course not. We all know that 8 cylinder cars use more gas than 4 cylinder cars. The large p-value for Cylinders is saying that if you already know a car’s weight, horsepower, and displacement then you don’t need to know Cylinders too. (Actually you don’t need Displacement either, but you don’t know that until you drop Cylinders first.) The order that you enter the X variables into the regression has no effect on the p-values of the individual coefficients. Each p-value calculation is done conditional on all the other variables in the model, regardless of the order in which they were entered into the computer.

3There is another issue, called multiple comparisons, that we will ignore for now but pick up again in Section 6.7.2.

It is desirable to get insignificant variables out of the regression model. If a variable is not statistically significant then you don’t have enough information in the data set to accurately estimate its slope. Therefore if you include such a variable in the model all you’re doing is adding noise to your predictions. To illustrate this point we re-ran the regression from Figure 6.1 after setting aside 25 randomly chosen observations (i.e. we excluded them from the regression). We used the fitted regression model to predict fuel consumption for the observations we set aside, and we computed the residuals from these predictions. Then we dropped the insignificant variables Cylinders and Displacement (dropping them one at a time, to make sure Cylinders did not become significant when Displacement was dropped) and made the same calculations. The two regressions produced the following results for the 25 “holdout” observations:

variables in model    SD(residuals)
all four              3.57
only significant      3.48

The model with more variables does worse than the model based only on significant variables. Even experienced regression users sometimes forget that keeping insignificant variables in the model does nothing but add noise.
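The same experiment is easy to mimic on simulated data (again a sketch, not the course’s car data): hold out some observations, fit with and without the noise variables, and compare the holdout residuals.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 111
    weight = rng.normal(3000, 500, n)
    hp = 0.04 * weight + rng.normal(0, 20, n)
    noise1, noise2 = rng.normal(size=n), rng.normal(size=n)   # insignificant predictors
    fuel = 10 + 0.009 * weight + 0.08 * hp + rng.normal(0, 3, n)

    holdout = rng.choice(n, size=25, replace=False)           # 25 set-aside observations
    train = np.setdiff1d(np.arange(n), holdout)

    def holdout_sd(cols):
        X = np.column_stack([np.ones(n)] + cols)
        b, *_ = np.linalg.lstsq(X[train], fuel[train], rcond=None)
        return np.std(fuel[holdout] - X[holdout] @ b, ddof=1)

    print("all four predictors:  ", holdout_sd([weight, hp, noise1, noise2]))
    print("significant ones only:", holdout_sd([weight, hp]))   # usually a bit smaller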

6.2.4 Is a Subset of Variables Important? The Partial F Test

Sometimes you want to examine whether a group of X’s, taken together, does a useful amount of explaining. The test for doing this is called the partial F test. The partial F test works almost exactly the same way as the whole model F test. The difference is that the whole model F test compares a big regression model (with many X’s) to the mean of Y. The partial F test compares a big regression model to a smaller regression model (with fewer X’s). To calculate the partial F statistic you will need the ANOVA tables from the full (big) model and the null (small) model. The formula for the partial F statistic is

F = (∆SSE / ∆DF) / MSEfull.


(a) Full Model

Source  DF   Sum of Squares  Mean Square  F Ratio   Prob > F
Model   3    0.46388379      0.154628     38.4811   <.0001
Error   115  0.46210223      0.004018
Total   118  0.92598602

(b) Null Model

Source  DF   Sum of Squares  Mean Square  F Ratio   Prob > F
Model   1    0.44897622      0.448976     110.1240  <.0001
Error   117  0.47700979      0.004077
Total   118  0.92598602

Table 6.1: ANOVA tables for the partial F test. The two error sums of squares (and their DF) are used in the numerator of the test. The full model’s MSE is used in the denominator.

Note the similarity between the partial F statistic and the whole model F statistic. In the whole model F we called ∆SSE = SSM and ∆DF = p. Also notice that the whole model F test and the partial F test use the MSE from the big regression model, not the small one.

For example, Table 6.1(a) contains the ANOVA table for the regression of Sears stock returns on the returns from IBM stock, the VW stock index, and the S&P 500. Table 6.1(b) is the ANOVA table for the regression with just the VW stock index as a predictor; IBM and S&P500 have been dropped. The question is whether both IBM and the S&P500 can be safely dropped from the regression. The partial F statistic is

F = ((0.47700979 − 0.46210223)/(3 − 1)) / 0.004018 = 1.855097.

To find the p-value for this partial F statistic you need to know how many degrees of freedom were used to calculate the numerator and the denominator. The numerator DF is simply the difference in the number of parameters for the two models. Here the numerator DF is 3 − 1 = 2. The denominator DF is the number of degrees of freedom used to calculate MSE. In our example the denominator DF is 115. By plugging the F statistic and its relevant degrees of freedom into a computer which knows about the F distribution, we find that the p-value for our F statistic is p = 0.1610887. This is pretty large as p-values go. In particular it is larger than .05, so we can’t reject the null hypothesis that the variables being tested have zero coefficients. The big p-value says it is safe to drop both IBM and S&P500 from the model.
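The arithmetic above is short enough to script. A sketch using the numbers from Table 6.1 (scipy supplies the F distribution):

    from scipy import stats

    SSE_full, df_full = 0.46210223, 115      # error SS and error DF, full model
    SSE_null = 0.47700979                    # error SS, model with IBM and S&P500 dropped
    delta_df = 3 - 1                         # difference in the number of predictors

    MSE_full = SSE_full / df_full            # about 0.004018
    F = (SSE_null - SSE_full) / delta_df / MSE_full
    print(F, stats.f.sf(F, delta_df, df_full))   # about 1.855 and 0.161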

The partial F test is the most general test we have for regression coefficients. It can test any subset of coefficients you like. If the “subset” is just a single coefficient then the partial F test is equivalent to the t test from Section 6.2.3.4 If the “subset” is everything then the partial F test is the same as the whole model F test.

The most common use of the partial F test is when one of your X variables is a categorical variable with several levels, which we will see in Section 6.5. The partial F test is particularly relevant for categorical X’s because each categorical variable gets split into a collection of simpler “dummy variables” which we will either exclude or enter into the model as a group.

6.2.5 Predictions

Finally to make a prediction we just use

Ŷ = b0 + b1X1 + · · · + bpXp

where b0, . . . , bp are the estimated coefficients. As with simple linear regression, we know that this guess at Y is not exactly correct, so we want an interval that lets us know how sure we can be about the estimate. If we are trying to guess the average value of Y (i.e. just trying to guess the population regression line) then we calculate a confidence interval. If we want to predict a single point (which has extra variance because even if we know the true regression line the point will not be exactly on it) we use a prediction interval. The prediction interval is always wider than the confidence interval.

The standard error formulas for obtaining prediction and confidence intervals in multiple regression are sufficiently complicated that we won’t show them. The intervals are easy enough to obtain using a computer (see page 181 for JMP instructions). The regression output omits information that you would need to create the intervals without a computer’s help.5

Here are prediction and confidence intervals for the car example in Figure 6.1. We dropped the insignificant variables from the model, so now the only factors are Weight and HP. Suppose we want to estimate the fuel consumption for a 4000 pound car with 200 HP.

Interval     Gal/1000mi       ⇒  mi/gal
Confidence   (63.45, 66.62)      (15.01, 15.76)
Prediction   (57.92, 72.15)      (13.86, 17.27)

4In this special case you get F = t², where t is the t-statistic for the one variable you’re testing.

5The estimated regression coefficients are correlated with one another. To compute the intervals yourself you would need the covariance matrix describing their relationship. Then you could use the general variance formula on page 54.


The confidence and prediction intervals were constructed on the scale of the Y variable in the multiple regression model (chosen to avoid violating regression assumptions). Then we transformed them back to the familiar MPG units in which we are used to measuring fuel economy by simply transforming the endpoints of each interval. It looks like our car is going to be a gas guzzler.
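Footnote 5 says you would need the covariance matrix of the coefficients to build these intervals by hand. For the curious, here is a sketch of that calculation on simulated data (numpy only; the 4000 lb / 200 HP car and all coefficients are hypothetical):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    n = 111
    weight = rng.normal(3000, 500, n)
    hp = 0.04 * weight + rng.normal(0, 20, n)
    fuel = 10 + 0.009 * weight + 0.08 * hp + rng.normal(0, 3, n)

    X = np.column_stack([np.ones(n), weight, hp])
    b, *_ = np.linalg.lstsq(X, fuel, rcond=None)
    e = fuel - X @ b
    df = n - X.shape[1]                       # n - p - 1
    s2 = e @ e / df                           # MSE
    XtX_inv = np.linalg.inv(X.T @ X)          # carries the coefficient covariances

    x0 = np.array([1.0, 4000.0, 200.0])       # intercept, weight, HP for the new car
    fit = x0 @ b
    se_mean = np.sqrt(s2 * x0 @ XtX_inv @ x0)        # for the regression surface
    se_new = np.sqrt(s2 * (1 + x0 @ XtX_inv @ x0))   # for one new observation
    t = stats.t.ppf(0.975, df)
    print("confidence:", (fit - t * se_mean, fit + t * se_mean))
    print("prediction:", (fit - t * se_new, fit + t * se_new))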

6.3 Regression Diagnostics: Detecting Problems

Just as in simple regression, you should check your multiple regression model for assumption violations and for the possible impact of outliers and high leverage points. Multiple regression uses the same assumptions as simple regression.

1. Linearity.

2. Constant variance of error terms.

3. Independence of error terms.

4. Normality of error terms.

Multiple regression uses the same types of tools as simple regression to check for assumption violations, i.e. look at the residuals. As with simple regression the main tool to solve any violations is to transform the data.

Outliers also have the same definition in multiple regression as in simple regression. An outlier is a point with a big residual, so the best place to look for outliers is in a residual plot. The definition of a high leverage point changes a bit. In multiple regression a high leverage point is an observation with an unusual combination of X values. For example, it is possible for a car to have rather high (but not extreme) horsepower and rather low (but not extreme) weight. Neither variable is extreme by itself, but the car can still be a high leverage point because it is an unusual HP-weight combination.

There are two different families of tools for detecting assumption violation and unusual point issues in multiple regression models. You can either use tools that work with the entire model at once, or you can use tools that look at one variable at a time. Section 6.3.1 describes the best “one variable at a time” tool, known as a leverage plot, or added variable plot. Section 6.3.2 describes a variety of “whole model” tools.

6.3.1 Leverage Plots

Leverage plots show you the relationship between Y and X after the other X’s in the model have been “accounted for.” See the call-out box on page 129 for a more technical definition. Figure 6.2 compares the leverage plot showing the relationship between fuel consumption and the number of cylinders in a car’s engine, and the corresponding scatterplot. Points on the right of the leverage plot are cars with more cylinders than we would have expected given the other variables in the model. Points near the top of the leverage plot use more gas than we would have expected given the other variables in the model (weight, HP, displacement). The leverage plot and scatterplot reinforce what we said about Cylinders in Section 6.2.3. In a simple regression there is a relatively strong relationship between fuel consumption and Cylinders, but the relationship is better described by other variables in the multiple regression.

Figure 6.2: (a) Leverage plot and (b) scatterplot showing the relationship between a car’s fuel consumption and number of cylinders. The leverage plot is from a model that also includes Horsepower, Weight, and Displacement.

When you look at a leverage plot, look at the points on the plot, not the lines. The lines that JMP produces are there to help you visualize the hypothesis test for whether each β = 0. You can get the same information from the table of parameter estimates, so the lines are not particularly helpful. However, the true value of a plot is that the points in the plot let you see more than you could in a simple numerical summary.

Leverage plots are great places to look for bends and funnels which may indicate regression assumption violations. They are also great places to look for high leverage points.6 Leverage plots can be read just like scatterplots, but they are better than scatterplots because they control for the impact of the other variables in the regression.

6The word “leverage” means different things in “leverage plot” and “high leverage point.”


Not on the test 6.1 How to build a leverage plot

Here is how you would build a leverage plot if the computer didn’t do it for you. For concreteness imagine we are creating the leverage plot for Cylinders shown in Figure 6.2.

1. Regress Y on all the X variables in the model except for the X in the current leverage plot (e.g. everything but cylinders). The residuals from this regression are the information about Y that is not explained by the other X’s in the model.

2. Now regress the current X on all the other X’s in the model (e.g. cylinders on weight, HP, and displacement). The residuals from this regression are the information in the current X that couldn’t be predicted by the other X’s already in the model.

By plotting the first set of residuals against the second, the leverage plot shows just the portion of the relationship between Y and X that isn’t already explained by the other X’s in the model.

A minor detail: because a leverage plot is plotting one set of residuals against another, you might expect both axes to have a mean of zero. When JMP creates leverage plots it adds the mean of X and Y back into the axes, so that the plot is created on the scale of the unadjusted variables. That’s a nice touch, but if I had to create a leverage plot myself I probably wouldn’t bother adding the means back in.
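The recipe in the box is only a few lines of code. A sketch (Python with numpy and matplotlib, on simulated car-like data, and without adding the means back in):

    import numpy as np
    import matplotlib.pyplot as plt

    def resid_on(X, y):
        # Residuals from regressing y on X (X already contains an intercept column).
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        return y - X @ b

    rng = np.random.default_rng(5)
    n = 111
    weight = rng.normal(3000, 500, n)
    hp = 0.04 * weight + rng.normal(0, 20, n)
    cyl = np.round(4 + weight / 1500 + rng.normal(0, 0.5, n))
    fuel = 10 + 0.009 * weight + 0.08 * hp + rng.normal(0, 3, n)

    others = np.column_stack([np.ones(n), weight, hp])
    ry = resid_on(others, fuel)   # step 1: what the other X's can't explain about Y
    rx = resid_on(others, cyl)    # step 2: what the other X's can't predict about Cylinders

    plt.scatter(rx, ry)           # the leverage (added variable) plot for Cylinders
    plt.xlabel("Cylinders residual")
    plt.ylabel("Fuel residual")
    plt.show()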

Figure 6.2(a) shows no evidence of any assumption violations or unusual points. It simply shows that Cylinders is an unimportant variable in our model.

6.3.2 Whole Model Diagnostics

The only real drawback to leverage plots is that if you have many variables in your regression model then there will be many leverage plots for you to examine. This Section describes some “whole model” tools that you can use instead of or in conjunction with leverage plots. Some people prefer to look at these tools first to suggest things they should look for in leverage plots. Leverage plots and “whole model” diagnostics complement one another. Which ones you look at first is largely a matter of personal preference.


Figure 6.3: (a) Residual plot and (b) whole model leverage plot for the regression of fuel consumption on Weight, Horsepower, Cylinders, and Displacement.

The Residual Plot: Residuals vs. Fitted Values

When someone says “the” residual plot in multiple regression they are talking about a plot of the residuals versus Ŷ, the fitted Y values. Why Ŷ? In Chapter 5 the residual plot was a plot of residuals versus X. In multiple regression there are several X’s, and Ŷ is a convenient one-number summary of them.

You use the residual plot in multiple regression in exactly the same way as in simple regression. A bend is a sign of nonlinearity and a funnel shape is a sign of non-constant variance. If you see evidence of nonlinearity then look at the leverage plots for each variable to see which one is responsible. If the bend is present in several leverage plots then try transforming Y. Otherwise try transforming the X variable responsible for the bend, or add a quadratic term in that variable (see page 182). The residual plot for the regression in Figure 6.1 is shown in Figure 6.3(a). The residuals look fine.

The (other) Residual Plot: Residuals vs. Time

If your regression uses time series data you should also plot the residuals vs. time. Examine the plot for “tracking” (aka autocorrelation) just like in Chapter 5.


Normal Quantile Plot of Residuals

The last residual plot worth looking at in a multiple regression is a normal quantile plot of the residuals. Remember that the normality of the residuals is ONLY important if you want to use the regression to predict individual Y values (in which case it is very important). Otherwise a central limit theorem effect shows up to save the day. You interpret a normal quantile plot of the residuals in exactly the same way you interpret a normal quantile plot of any other variable: if the points are close to a straight line on the plot then the residuals are close to normally distributed.

An Unhelpful Plot: The Whole Model Leverage Plot

This plot (see Figure 6.3(b)) isn’t really useful for diagnosing problems with the model, which is too bad because JMP puts it front and center in the multiple regression output. The whole model leverage plot is a plot of the actual Y value vs. the predicted Y. You can think of it as a picture of R2, because if R2 = 1 then all the points would lie on a straight line. The confidence bands that JMP puts around the line are a picture of the whole model F test.

You can get the same information from the whole model leverage plot and the residual plot, but the whole model leverage plot makes regression assumption violations harder to detect because the “baseline” is a 45 degree line instead of a horizontal line. Our advice is to ignore this plot.

Leverage (for finding high leverage points)

High leverage points were easy to spot in simple linear regression. You just look for a point with an unusual X value. However, when there are many predictors it is possible to have a point that is not unusual for any particular X but is still in an unusual spot overall. Consider Figure 6.4(a), which plots Weight vs. HP for the car data set. The general trend is that heavier cars have more horsepower; however, notice the two cars marked with a × (Ford Mustang) and a + (Chevy Corvette). These cars have a lot of horsepower, but not that much more than several other cars. However, they are lighter than other cars with similarly large horsepower. When we compute the leverage of each point in the regression model7 the Mustang and Corvette show up as the points with the most leverage because they have the most unusual combination of weight and HP. Notice that the same two cars show up on the right edge of the HP leverage plot, because they have more HP than one would expect for cars of their same weight.

7See page 181 for JMP tips.


Figure 6.4: (a) Scatterplot of Weight vs. HP, (b) histogram of leverages, and (c) leverage plot for HP for the car data set. The maximum leverage in panel (b) is .138. The average is .027, out of 111 observations.

Recall that we denote the leverage of the ith data point by hi. A larger hi means a point with more leverage. As with simple regression we have 1/n ≤ hi ≤ 1. In multiple regression we have Σi hi = p + 1. Thus if we added all the numbers in Figure 6.4(b) we would get 3 (two X variables plus the intercept). Some computer programs (though not JMP) warn you about points where hi is greater than 3× the average leverage (i.e. hi > 3(p + 1)/n). Rather than use such a hard rule, we suggest plotting a histogram of leverages like Figure 6.4(b) and taking a closer look at points on the right that appear extreme.
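Computing the leverages yourself takes one line of matrix algebra (the “hat matrix” mentioned earlier). A sketch on simulated data:

    import numpy as np

    rng = np.random.default_rng(6)
    n = 111
    weight = rng.normal(3000, 500, n)
    hp = 0.04 * weight + rng.normal(0, 20, n)

    X = np.column_stack([np.ones(n), weight, hp])   # intercept plus two predictors
    H = X @ np.linalg.inv(X.T @ X) @ X.T            # the hat matrix
    h = np.diag(H)                                  # hi, the leverage of each point

    print(h.sum())                                  # p + 1 = 3
    flag = h > 3 * X.shape[1] / n                   # the rough 3(p+1)/n rule of thumb
    print(np.where(flag)[0])                        # candidates for a closer look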

Cook’s Distance: measuring a point’s influence

Each point in the data set has an associated Cook’s distance, which measures how “far” (in a funky statistical sense) the regression line would move if the point were deleted. Luckily, the computer does not actually have to delete each point and fit a new regression to calculate the Cook’s distance. It can use a formula instead. The formula is complicated,8 but it only depends on two things: the point’s leverage and the size of its residual. High leverage points with big residuals have large Cook’s distances. A large Cook’s distance means the point is influential for the fitted regression equation.

There is no formal hypothesis test to say when a Cook’s distance is “too large.” Some guidance may be obtained from the appropriate F table, which will depend on the number of variables in the model and the sample size. A table of suggested cutoff values is available in Appendix D.3. Don’t take the cutoff values in Appendix D.3 too seriously. They are only there to give you guidance about what a “large” Cook’s distance looks like.

8If you must know the formula it is di = hi ei² / (p s²(1 − hi)²). Definitely not on the test.


Figure 6.5: Cook’s distances for the car data. The largest Cook’s distance is well below the lowest threshold in Appendix D.3.

In practice, Cook’s distances are used in much the same way as leverages (i.e. you plot a histogram and look for extreme values). If you identify a point with a large Cook’s distance you should try to determine what makes it unusual. Attempt to find the point in the leverage plots for the regression so that you can determine the impact it is having on your analysis. If you think the point was recorded in error or represents a phenomenon you do not wish to model then you may consider deleting it. If the point legitimately belongs in the data set then you might consider transforming the scale of the model so that the point is no longer influential.
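Because Cook’s distance depends only on the leverage and the residual, it takes just a couple of lines to compute. A sketch (simulated data; note that textbooks disagree on whether the divisor p counts the intercept, so treat that constant as an assumption):

    import numpy as np

    rng = np.random.default_rng(7)
    n = 111
    weight = rng.normal(3000, 500, n)
    hp = 0.04 * weight + rng.normal(0, 20, n)
    fuel = 10 + 0.009 * weight + 0.08 * hp + rng.normal(0, 3, n)

    X = np.column_stack([np.ones(n), weight, hp])
    b, *_ = np.linalg.lstsq(X, fuel, rcond=None)
    e = fuel - X @ b                                 # residuals
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)    # leverages
    k = X.shape[1]                                   # number of coefficients (here 3)
    s2 = e @ e / (n - k)                             # MSE

    cooks = h * e**2 / (k * s2 * (1 - h)**2)         # footnote 8's formula
    print(np.sort(cooks)[-5:])                       # the five largest distances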

Figure 6.5 shows the Cook’s distances from the car data set. None of the Cook’s distances look particularly large when compared to the table in Appendix D.3 with 3 model parameters (two slopes and the intercept) and about 100 observations. Figure 6.5 highlights the Mustang and Corvette that attracted our attention as high leverage points in Figure 6.4. High leverage points have the opportunity to influence the fitted regression a great deal. However, Figure 6.5 says that these two points are not influential.

6.4 Collinearity

Another issue that comes up in multiple regression is collinearity,9 where two or more predictor variables are closely related to one another. The classic example is regressing Y on someone’s height in feet, and on their height in inches. Even if “height” is an important predictor, the model can’t be sure how much of the credit for explaining Y belongs to “height in feet” and how much to “height in inches,” so the standard errors for both coefficients become inflated.

9Some people call it “multicollinearity,” which we don’t like because it is an eight syllable word. See the footnote on page 104.


Term       Estimate   Std Error  t Ratio  Prob>|t|  VIF
Intercept  0.0134394  0.007179   1.87     0.0638    .
VW         2.5729753  1.028958   2.50     0.0138    76.618188
SP500      -1.56224   1.066683   -1.46    0.1458    78.399527
IBM        0.2070931  0.132193   1.57     0.1200    1.7847938
PACGE      0.0798655  0.130059   0.61     0.5404    1.1560317

(a)

Term       Estimate   Std Error  t Ratio  Prob>|t|
Intercept  0.0195894  0.006075   3.22     0.0016
VW         1.2392068  0.118087   10.49    <.0001

(b)

Figure 6.6: Regression output for Walmart stock regressed on (a) several stocks and stock indices, (b) only the value weighted stock index.

If all the X’s in your model are highly collinear, you could even have a very high R2 and significant F, but none of the individual variables shows up as significant.

Physically, you can understand collinearity by imagining that you are trying to rest a sheet of plywood on a number of cinder-blocks (the plywood represents the regression surface and the blocks the individual data points). If the blocks are fairly evenly distributed under the plywood it will be very stable, i.e. you can stand on it anywhere and it will not move. However, if you place all the blocks in a straight line along one of the diagonals of the plywood and stand on one of the opposite corners the plywood will move (and you will fall, don’t try this at home). The second situation is an example of collinearity where two X variables are very closely related so they lie close to a straight line and you are trying to rest the regression plane on these points.

6.4.1 Detecting Collinearity

The best way to detect collinearity is to look at the variance inflation factor (VIF).10 We can calculate the VIF for each variable using the formula

VIF(bi) = 1 / (1 − R²(Xi|X−i))

where R²(Xi|X−i) is the R² we would get if we regressed Xi on all the other X’s. If R²(Xi|X−i) is close to one, then the other X’s can accurately predict Xi. In that case, the computer can’t tell if Y is being explained by Xi or by some combination of the other X’s that is virtually the same thing as Xi.

10See page 181 for JMP tips on calculating VIF’s.


The VIF is the factor by which the variance of a coefficient is increased relative to a regression where there was no collinearity. A VIF of 4 means that the standard error for that coefficient is twice as large as it would be if there were no collinearity. If the VIF is 9 then the standard error is 3 times larger than it would have otherwise been. As a rough rule, if the VIF is around 4 you should pay attention (i.e. do something about it if you can, and it is not too much trouble). If it is around 100 you should start to worry (i.e. don’t even think about this as your final regression model). In Figure 6.6(a) the standard error of VW is about 1, with a VIF of around 76. In Figure 6.6(b), where there is no collinearity because there is only one X, the standard error of VW is about .118. Thus, the standard error of VW in the multiple regression is about √76 ≈ 8.7 times as large as it is in the simple regression (with no collinearity). If a variable can be perfectly predicted by other X’s, like “height in feet” and “height in inches,” then R²(Xi|X−i) = 1, so VIF(bi) = ∞ and you will get an error message when the computer divides by zero calculating SE(bi).

Because collinearity is a strong relationship among the X variables, you might think that the correlation matrix would be a good place to look for collinearity problems. It isn’t bad, but VIF’s are better because it is possible for collinearity to exist between three or more variables where no individual pair of variables is highly correlated. The correlation matrix only shows relationships between one pair of variables at a time, so it can fail to alert you to the problem.
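The VIF definition translates directly into code: regress each X on the others and record 1/(1 − R²). A sketch with two nearly redundant “index” columns (all returns simulated, not real market data):

    import numpy as np

    def vifs(X):
        # X: columns are the predictor variables, with no intercept column.
        n, p = X.shape
        out = []
        for j in range(p):
            others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
            b, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
            resid = X[:, j] - others @ b
            r2 = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean())**2)
            out.append(1 / (1 - r2))
        return np.array(out)

    rng = np.random.default_rng(8)
    vw = rng.normal(0, 0.05, 240)
    sp = vw + rng.normal(0, 0.005, 240)           # nearly a copy of vw
    ibm = rng.normal(0, 0.07, 240)
    print(vifs(np.column_stack([vw, sp, ibm])))   # big VIFs for vw and sp, near 1 for ibm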

6.4.2 Ways of Removing Collinearity

Collinearity happens when you have redundant predictors. Some common ways of removing collinearity are:

• Drop one of the redundant predictors. If two or more X’s provide the same information, why bother keeping all of them? In most cases it won’t matter which of the collinear variables you drop: that’s what redundant means! In Figure 6.6 just pick one of the two stock market indices. They provide virtually identical information, so it doesn’t matter which one you pick.

• Combine them into a single predictor in some interpretable way. For example if you are trying to use CEO compensation to predict performance of a company you may have both “base salary” and “other benefits” in the model. These two variables may be highly correlated so they could be combined into a single variable “total compensation” by adding them together.

• Transform the collinear variables. If one of your variables can be interpreted as “size” then you can use it in a ratio to put other variables on equal footing. For example, a chain of supermarkets might have a data set with each store’s total sales and number of customers. You would expect these to be collinear because more customers generally translates into larger sales. Consider replacing total sales with sales/customer (the average sale amount for each customer). In the car example HP and Weight are correlated with r = .8183. Using Weight and HP/pound reduces the correlation to .2546.

Taking the difference between two collinear variables can also help, but it isn’t very interpretable unless the two variables have the same units. For example, VW − S&P500 makes sense as the amount by which VW outperformed the S&P500 in a given month, but TotalSales − Customers wouldn’t make much sense.

A third example of a transformation that reduces collinearity is centering polynomials. Recall the quadratic regression of the size of an employee’s life insurance premium on the employee’s age from Chapter 5 (page 103). The same output is reproduced in Figure 6.7 with and without centering. The centered version produces exactly the same fitted equation (notice that RMSE is exactly the same), but with much less collinearity. A small sketch of the centering idea appears after Figure 6.7.

Term           Estimate   Std Error  t Ratio  Prob>|t|  VIF
Intercept      -1015.103  307.5207   -3.30    0.0017    .
Age            67.451425  14.50506   4.65     <.0001    49.872
Age*Age        -0.717539  0.165171   -4.34    <.0001    49.872
RSq 0.300978    RMSE 171.1107    N 61

(a)

Term           Estimate   Std Error  t Ratio  Prob>|t|  VIF
Intercept      360.62865  95.48353   3.78     0.0004    .
Age            4.6138293  2.056672   2.24     0.0287    1.0027
(Age-43.78)^2  -0.717539  0.165171   -4.34    <.0001    1.0027
RSq 0.300978    RMSE 171.1107    N 61

(b)

Figure 6.7: Output for a quadratic regression (a) without centering (b) with the squared term centered.
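Here is the promised sketch of the centering idea (hypothetical ages; the point is only that the correlation between Age and its square collapses once the square is centered):

    import numpy as np

    rng = np.random.default_rng(9)
    age = rng.uniform(25, 62, 61)                    # 61 hypothetical employee ages

    raw_sq = age**2
    centered_sq = (age - age.mean())**2

    print(np.corrcoef(age, raw_sq)[0, 1])            # nearly 1: severe collinearity
    print(np.corrcoef(age, centered_sq)[0, 1])       # much closer to 0 after centering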

6.4.3 General Collinearity Advice

Collinearity is undesirable, but not disastrous. It is a less important problem than regression assumption violations or extreme outliers. You will rarely want to omit a significant variable simply because it has a high VIF. Such a variable is significant despite the high VIF. If you can remove the collinearity using transformations then the significant variable will be even more significant.

VW    S&P   Lower PI   Upper PI
0.0   0.0   -0.1119    0.1421    (Interpolation)
0.1   0.1   0.0061     0.2642    (Interpolation)
0.1   0.0   0.0296     0.4923    (Extrapolation)

Table 6.2: Prediction intervals for the return on Walmart stock using a regression on VW and the S&P 500.

Finally, keep in mind that it is easier to accidentally engage in extrapolation when using a model fit by collinear X’s. In the context of multiple regression, extrapolation means predicting Y using an unusual combination of X variables. For example, consider the prediction interval for the monthly return on Walmart stock based on a model with both VW and the S&P500 shown in Table 6.2. The RMSE from the regression is about .06, so if the regression equation were well estimated we would expect the margin of error for a prediction interval to be about ±.12. If we predict Walmart’s return with VW = S&P = 0.1, a typical combination of values, then we get a prediction interval not much wider than ±.12. The same is true when we predict with VW = S&P = 0.0. However, when we try to predict Walmart’s return when VW = 0.1 and S&P = 0.0 the prediction interval is almost twice as wide. Even though .1 and 0 are typical values for both variables, the (0.1, 0.0) combination is unusual.

6.5 Regression When X is Categorical

The two previous Sections dealt with problems you can encounter with multiple regression. Now we return to ways of applying the multiple regression model to real problems. We often encounter data sets containing categorical variables. A categorical variable in a regression is often called a factor. The possible values of a factor are called its levels. For example, Sex is a two-level factor with Male and Female as its levels. Section 6.5.1 describes how to include two-level factors in a regression model using dummy variables. Section 6.5.2 explains how to extend the dummy variable idea to multi-level factors.

6.5.1 Dummy Variables

A factor with only two levels can be incorporated into a regression model by creating a dummy variable, which is simply a way of creating numerical values out of categorical data.


Figure 6.8: JMP will make dummy variables for you. You don’t have to make them by hand as shown here with “SexCodes”.

For example, suppose we wanted to compare the salaries of male and female managers. Then we would create dummy variables like

Sex[Female] = 1 if subject i is Female, and 0 otherwise;
Sex[Male] = 1 if subject i is Male, and 0 otherwise.

A regression model using both these variables would look like

Yi = β0 + β1 Sex[Male] + β2 Sex[Female], which equals β0 + β1 if subject i is Male and β0 + β2 if subject i is Female.

Estimating this model is problematic because the two dummy variables are perfectly collinear.11 In order to fit the model we must constrain the dummy variables’ coefficients somehow. The two most popular methods are setting one of the coefficients to zero (i.e. removing one of the dummy variables from the model) or forcing the coefficients to sum to zero. When you include a categorical variable in a regression, such as Sex in Figure 6.8, JMP automatically creates Sex[Female] and Sex[Male] and includes both variables using the “sum to zero” convention. To use the “leave one out” convention you must create your own dummy variable, such as SexCodes in Figure 6.8, and use it in the regression instead.12
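The two conventions are easy to compare on a toy data set (the salaries below are invented): both codings reproduce the same two group means, they just label the pieces differently.

    import numpy as np

    sex = np.array(["Male", "Female", "Female", "Male", "Female", "Male"])
    salary = np.array([144.0, 139.0, 141.0, 145.0, 138.0, 143.0])   # made-up, in $thousands

    # Leave-one-out coding: a single 0/1 column, with Female as the baseline level.
    male = (sex == "Male").astype(float)
    X_loo = np.column_stack([np.ones(len(sex)), male])

    # Sum-to-zero coding: +1 for Male, -1 for Female.
    X_sz = np.column_stack([np.ones(len(sex)), np.where(sex == "Male", 1.0, -1.0)])

    b_loo, *_ = np.linalg.lstsq(X_loo, salary, rcond=None)
    b_sz, *_ = np.linalg.lstsq(X_sz, salary, rcond=None)
    print(b_loo)   # [Female mean, Male mean minus Female mean]
    print(b_sz)    # [midpoint of the two means, half the Male-Female difference]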

11Sex[Female] = 1 − Sex[Male].

12We doubt you will want to do this. Creating your own dummy variables is a lot of work when you can let the computer do it for you.


Term         Estimate   Std Error  t Ratio  Prob>|t|
Intercept    142.28851  0.883877   160.98   <.0001
Sex[female]  -1.821839  0.883877   -2.06    0.0405
“Sex[male]   1.821839   0.883877   2.06     0.0405”

(a)

Term       Estimate   Std Error  t Ratio  Prob>|t|
Intercept  140.46667  1.43514    97.88    <.0001
SexCodes   3.6436782  1.767753   2.06     0.0405

(b)

            Diff.    t-Test  DF   Prob > |t|
Estimate    -3.6437  -2.061  218  0.0405
Std Error   1.7678
Lower 95%   -7.1278
Upper 95%   -0.1596

(c)

Figure 6.9: Regression output comparing male and female salaries based on (a) the sum-to-zero constraint, (b) the leave-one-out constraint. Panel (c) compares men’s and women’s salaries using the two-sample t-test.

Figure 6.9 shows regression output illustrating the two conventions. In panel (a) the intercept term is a baseline. The average women’s salary is 1.82 (thousand dollars) below the baseline. The average men’s salary is 1.82 above the baseline. That means the difference between the average salaries of men and women is 1.82 × 2 = 3.64. The difference is statistically significant (p = .0405), but just barely. The line for Sex[Male] is in quotes because it is not always reported by the computer,13 although the coefficient is easy to figure out if the computer doesn’t report it. The coefficients for Sex[Male] and Sex[Female] must sum to zero, so one coefficient is just −1 times the other.

Figure 6.9(b) shows the same regression using the “leave one out” convention. In panel (b) the intercept represents the average salary for women (where SexCodes = 0), and the coefficient of SexCodes represents the average difference between men’s and women’s salaries. The difference is significant, with exactly the same p-value as in panel (a).

Both regressions in Figure 6.9 describe exactly the same relationship. They just parameterize it differently. Both models say that the average salary for women is $140,466 and that the average salary for men is $3,643 higher. Furthermore, all the p-values, t-statistics, and other important statistical summaries of the relationship are the same for the two models. In that sense, it does not matter which dummy variables convention you use, as long as you know which one you are using.

13To see it you have to ask for “expanded estimates.” See page 182.


Term         Estimate   Std Error  t Ratio  Prob>|t|
Intercept    135.01757  1.881391   71.76    <.0001
Sex[female]  0.0146588  0.949791   0.02     0.9877
YearsExper   0.7496472  0.173053   4.33     <.0001


Figure 6.10: Regression output comparing men’s and women’s salaries after controlling for years of experience. Adding dummy variables to a regression gives women (solid line) and men (dashed line) different intercepts. The lines are parallel because the dummy variables do not affect the slope.

The "leave one out" convention is a little easier if you have to create your own dummy variables by hand. The "sum to zero" convention is nice because it treats all the levels of a categorical variable symmetrically. Henceforth we will use JMP's "sum to zero" convention.

The last panel in Figure 6.9 shows that you get the same answer when you compare the sample means using regression or using the two-sample t-test. The two-sample t-test turns out to be a special case of regression, which is why we left it as an exercise in Chapter 4. The advantage of comparing men's and women's salaries using regression is that we can control for other background variables. For example, the subjects in Figure 6.8 come from a properly randomized survey, so the statistically significant difference between their salaries generalizes to the population from which the survey was drawn. However, the subjects were not randomly assigned to be men or women,14 so we can't say that the difference between men's and women's salaries is because of gender differences. There are other factors that could explain the salary difference, such as men and women having different amounts of experience. Figure 6.10 shows output for a regression of Salary on Sex and YearsExper. Note the change from Figure 6.9. Now the baseline for comparison is a regression line Y = 135 + .75 YearsExper. A woman adds .014 to the regression line to find her expected salary. A man with the same number of years of experience subtracts .014 from the line to find his expected salary (remember that the coefficients for Sex must sum to zero). That is only a $28 difference. The large p-value for Sex in Figure 6.10 says that, once we control for YearsExper, Sex does not have a significant effect on Salary. Said another way, multiple regression estimates the impact of changing one variable while holding another variable constant. Therefore the coefficient of Sex can be viewed as comparing the salaries of men and women with the same number of years of experience. Because the p-value for Sex is insignificant, the model says that men and women with the same experience are being paid about the same.

14 Obviously very difficult to do.

You can think about the contributions from the Sex dummy variable as adding or subtracting something from the intercept term, which means that the regression lines for men and women are parallel. Section 6.6 shows how to expand the model if you want to consider non-parallel lines.
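If you want to convince yourself that the two conventions really do describe the same relationship, a minimal sketch in Python is below. The simulated salary data and the variable names are our own illustration (not the survey data from Figure 6.8); the point is only that the leave-one-out and sum-to-zero codings give identical fitted values.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 220
    male = rng.integers(0, 2, size=n)                  # 1 = male, 0 = female (made-up data)
    salary = 140.5 + 3.64 * male + rng.normal(0, 9, n)

    # Leave-one-out coding: intercept plus a single 0/1 dummy (like SexCodes).
    X_loo = np.column_stack([np.ones(n), male])

    # Sum-to-zero coding: intercept plus a +1/-1 dummy (the JMP convention, where
    # the male and female effects are forced to be plus and minus the same number).
    X_stz = np.column_stack([np.ones(n), np.where(male == 1, 1.0, -1.0)])

    b_loo, *_ = np.linalg.lstsq(X_loo, salary, rcond=None)
    b_stz, *_ = np.linalg.lstsq(X_stz, salary, rcond=None)

    print(b_loo)   # intercept = women's mean, slope = men minus women
    print(b_stz)   # intercept = midpoint of the two means, slope = half the gap
    print(np.allclose(X_loo @ b_loo, X_stz @ b_stz))   # True: identical fitted values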

6.5.2 Factors with Several Levels

A factor with L levels can be included in a multiple regression by splitting it into L dummy variables. For example, suppose each observation in the data represents a pair of pants ordered from a clothing catalog. The color of each pair can be "natural," "dusty sage," "light beige," "bone," or "heather." To include Color in the regression, simply make a dummy variable for each color, like

X[DS] = 1 if the color is dusty sage, 0 otherwise.

Just as with the male/female dummy variables in Section 6.5.1, we must constrain the coefficients of the L dummy variables in order to avoid a perfect, regression-killing collinearity. The same constraints from Section 6.5.1 apply here as well. We can either leave one dummy variable out of the model or constrain the coefficients to sum to zero. The "sum-to-zero" constraint is more appealing when dealing with multi-level factors.

The output from a regression on a multi-level factor can appear daunting because each factor level introduces a coefficient. For example, Figure 6.11 gives output from a regression of log10 CEO compensation on the CEO's age, industry, and his or her company's profits. Industry is a categorical variable with 19 different levels. Three more coefficients are added by the intercept, Age, and Profits, so there are 22 regression coefficients in total. It helps to remember that all but one of the dummy variables will be zero for any given observation. Therefore, only 4 of the 22 coefficients are needed for any given CEO. To illustrate, let's estimate the log10 compensation for a 60 year old CEO in the entertainment industry whose company made 72 million dollars in profit.

************ Expanded Estimates **************

Nominal factors expanded to all levels

Term Estimate Std Error t Ratio Prob>|t|

Intercept 5.7275756 0.118957 48.15 <.0001

Ind[AeroDef] 0.1476387 0.088284 1.67 0.0949

Ind[Business] 0.0156033 0.072287 0.22 0.8292

Ind[Cap. gds] -0.078135 0.085355 -0.92 0.3603

Ind[Chem] -0.020486 0.075052 -0.27 0.7850

Ind[CompComm] 0.0704833 0.049318 1.43 0.1534

Ind[Constr] 0.1317273 0.116558 1.13 0.2588

Ind[Consumer] 0.0601128 0.053002 1.13 0.2571

Ind[Energy] -0.057568 0.059622 -0.97 0.3346

Ind[Entmnt] 0.1106206 0.07352 1.50 0.1328

Ind[Finance] -0.116069 0.03313 -3.50 0.0005

Ind[Food] -0.04783 0.049415 -0.97 0.3334

Ind[Forest] -0.123455 0.085613 -1.44 0.1497

Ind[Health] 0.0093132 0.056013 0.17 0.8680

Ind[Insurance] 0.0116971 0.052934 0.22 0.8252

Ind[Metals] -0.089493 0.087778 -1.02 0.3083

Ind[Retailing] 0.0509543 0.058207 0.88 0.3816

Ind[Transport] 0.0759199 0.095629 0.79 0.4275

Ind[Travel] 0.1955346 0.09896 1.98 0.0485

Ind[Utility] -0.346568 0.049066 -7.06 <.0001

Age 0.0080405 0.002075 3.87 0.0001

Profits 0.0001485 0.000024 6.25 <.0001

RSquare 0.15152

RMSE 0.38498

N 786

ANOVA

Source   DF    Sum Sq.     Mean Sq.   F Ratio   Prob > F
Model    20    20.24733    1.01237    6.8306    <.0001
Error    765   113.38040   0.14821
Total    785   133.62774

Effect Tests

Source Nparm DF Sum Sq. F Ratio Prob>F

Ind 18 18 11.63339 4.3607 <.0001

Age 1 1 2.224885 15.0117 <.0001

Profits 1 1 5.794551 39.0970 <.0001

Figure 6.11: Output for the regression of log10 CEO compensation on the CEO's age, industry, and corporate profits. The Ind[Utility] variable is left out of the usual parameter estimates table. We can either find it in the "expanded estimates" table (easy) or compute it from the other Ind[x] coefficients (hard).

Y = 5.73 + .11 + .00804 × 60 + .000149 × 72 = 6.333.

The same CEO in the chemicals industry could expect a log10 compensation of

Y = 5.73 − .02 + .00804 × 60 + .000149 × 72 = 6.203.

That doesn't look like a big difference, but remember we're on the log scale. The first CEO expects to make 10^6.333 = $2,152,782, the second 10^6.203 = $1,595,879, about half a million dollars less. Each coefficient of Ind[X] in Figure 6.11 represents the amount to be added to the regression equation for a CEO in industry X. If you like, you can interpret Figure 6.11 as 19 different regression lines with the same slopes for Age and Profit but different intercepts for each industry.
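The back-of-the-envelope prediction above is easy to script. The sketch below simply redoes the arithmetic with the rounded coefficients quoted in the text; it is not a re-fit of the model, and the dollar figures differ slightly from those in the text because the text rounds the log10 prediction to three decimals before exponentiating.

    intercept, b_age, b_profit = 5.73, 0.00804, 0.000149
    industry_effect = {"Entertainment": 0.11, "Chemicals": -0.02}   # Ind[x] rows of Figure 6.11

    def log10_comp(industry, age, profit_millions):
        # predicted log10 compensation: intercept + industry effect + age and profit terms
        return intercept + industry_effect[industry] + b_age * age + b_profit * profit_millions

    for ind in ("Entertainment", "Chemicals"):
        y = log10_comp(ind, age=60, profit_millions=72)
        print(ind, round(y, 3), round(10 ** y))   # about 6.333 and 6.203 on the log scale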

Testing a Multilevel Factor

To test the significance of a multi-level factor, you test whether the collection of dummy variables that represent that factor, taken as a whole, does a useful amount of explaining.

15 Profit in the data table is recorded in millions of dollars.

Not on the test 6.2 Making the coefficients sum to zero

The computer uses a trick to force the coefficients of the dummy variables to sum to zero. The trick is to leave one of the dummy variables out, but to define all the others a little differently. For example, suppose the Utility industry in Figure 6.11 is the level left out. The other dummy variables are

Ind[AeroDef] =  1 if subject i is in the aerospace-defense industry,
               −1 if subject i is in the utilities industry,
                0 otherwise,

Ind[Finance] =  1 if subject i is in the finance industry,
               −1 if subject i is in the utilities industry,
                0 otherwise,

and so forth, always assigning −1 to the level that doesn't get its own dummy variable. If b1, b2, . . . , b18 are the coefficients for the other 18 industries, then a CEO in the utilities industry adds b19 = −(b1 + b2 + · · · + b18) to the regression equation, because he gets a −1 on every dummy variable in the model. Clearly, b1 + · · · + b19 = 0. By using this trick, the computer only estimates the first 18 b's, and derives b19. That explains why you don't see Ind[Utility] in the regular parameter estimates box. The computer didn't actually estimate a parameter for it. What makes this such a neat trick is that all "19" dummy variables behave as desired (as if one of them was 1 and the rest were 0) in Figure 6.11.

The partial F test from Section 6.2.4 answers that question. The partial F test for each variable in the model is shown in the "Effect Tests" table in Figure 6.11. The partial F tests for Age and Profits are equivalent to the t-tests for those variables, because each variable only introduces one coefficient into the model. The partial F test for the industry variable Ind examines whether the 18 "free" dummy variables reduce SSE by a sufficient amount to justify their inclusion. These 18 dummy variables reduce SSE by 11.63, compared to the model with just Age and Profit. To calculate the F statistic, the computer uses the formula from Section 6.2.4,

F = (11.63/18) / 0.14821 = 4.3607,

where 0.14821 is the MSE from the ANOVA table. The p-value generated by this partial F statistic, with 18 and 765 degrees of freedom, is p < .0001, which says that at least some of the dummy variables are helpful.

Once you see that a categorical factor is helpful, there is nothing that says you have to keep all the individual dummy variables in the model. However, if you want to drop insignificant dummy variables from the model you will need to create all the individual dummy variables by hand. That is a big hassle that doesn't seem to have a big payoff. We make an exception to the "no garbage variables" rule when it comes to individual dummy variables that are part of a factor which is significant overall.
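The partial F statistic and its p-value are easy to verify from the "Effect Tests" and ANOVA tables in Figure 6.11. A minimal check in Python, using scipy's F distribution for the right-tail probability:

    from scipy.stats import f

    ss_ind, df_ind = 11.63339, 18     # SSE reduction and degrees of freedom for the Ind dummies
    mse, df_error = 0.14821, 765      # MSE and error degrees of freedom from the ANOVA table

    F = (ss_ind / df_ind) / mse
    print(F)                          # about 4.36
    print(f.sf(F, df_ind, df_error))  # right-tail p-value, far below .0001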

6.5.3 Testing Differences Between Factor Levels

Once you determine that there is a difference between a subset of the factor levels, you will often want to run subsequent tests to see where the differences are. The t-statistics for the individual dummy variables are set up to compare the dummy variable to the baseline regression, not to compare two dummy variables to one another. The tool you use to compare dummy variables is called a contrast.

A contrast is a weighted average of dummy variable coefficients where some of the weights are positive and some are negative. The weights can be anything you like as long as the positive and negative weights each sum to 1. Often the positive components of the contrast are equally weighted, and likewise for the negative components. For example, suppose you wanted to compare average CEO compensation in the Finance and Insurance industries with average compensation in the Entertainment, Retailing, and Consumer Goods industries, after controlling for Age and Profits.16 Number these five industries 1 through 5. The contrast you want to compute is17

.5β1 + .5β2 − .333β3 − .333β4 − .333β5

where βi is the coefficient of the dummy variable for the ith industry. The results of this contrast are shown in the box below. The estimate is negative, which says that the average compensation in the Entertainment/Retail/Consumer industries (which have negative weights) is larger than the average in the Finance/Insurance industries, after controlling for Age and Profits. The small p-value says that the difference is significant.

Estimate    -0.126     Sum of Squares   1.0469633847
Std Error    0.0474    Numerator DF     1
t Ratio     -2.658     Denominator DF   765
Prob>|t|     0.008     F Ratio          7.0640690154
SS           1.047     Prob > F         0.0080286945

16 We don't know why someone might want to do this, but it illustrates the point. A more meaningful example appears in Section 6.6.1.
17 See page 182 for computer tips.


[Figure 6.12 contains four schematic panels relating Y, X1, and X2: (a) typical regression, (b) collinearity, (c) interaction, (d) autocorrelation.]

Figure 6.12: Collinearity is a relationship between two or more X's. Interaction is when the strength of the relationship between X1 and Y depends on X2, and vice-versa. Autocorrelation is when Y depends on previous Y's.

6.6 Interactions Between Variables

Now that we can incorporate either categorical or continuous X's, multiple regression looks like a very flexible tool. It gets even more flexible when we introduce the idea of interaction, where the relationship between Y and X depends on another X. Interaction can be a hard idea to internalize because there are three variables involved. Mathematically, interaction means that the slope of X1 depends on X2. The practical implications of interactions become much easier to understand if you think of the regression slopes as meaningful "real world" quantities instead of the generic "one unit change in X leads to a . . . ."

For example, the CEO compensation data set has a variable containing the percentage of a company's stock owned by the CEO. Suppose we wish to examine the effect of stock ownership on CEO compensation. From Section 6.5.2 we know that log10 compensation also depends on profits.18 The standard way of including Profit and Stock ownership in a regression model is

Y = β0 + β1Stock + β2Profit.

This model says that increasing a CEO's ownership of his company's Stock by 1% increases his expected compensation by β1, regardless of the company's profit level. That seems wrong to us. Instead it seems like CEO's with more stock should do even better if they run profitable companies than if they don't. Said another way, we think the coefficient of Stock should be larger if the CEO's company is profitable, and smaller if it is not. Consider the regression model where

Y = β0 + β1Stock + β2Profit + β3(Stock)(Profit).

The term (Stock)(Profit) is called the interaction between Stock and Profit. It is simply the product of the two variables. The individual variables Stock and Profit are called main effects or first order effects of the interaction.

18 and on Age, Industry, and presumably other variables as well. Ignore them for the moment.


Term                                 Estimate    Std Error   t Ratio   Prob>|t|   VIF
Intercept                            5.7103165   0.118212    48.31     <.0001     .
Ind[aerosp]                          0.1351741   0.087518    1.54      0.1229     4.130047
... (other industry dummy variables omitted)
Ind[Travel]                          0.2064925   0.098087    2.11      0.0356     4.8915569
Age                                  0.0087606   0.002067    4.24      <.0001     1.0939624
Profits                              0.0001199   0.000029    4.15      <.0001     1.5551463
Stock%                               -0.009774   0.002402    -4.07     <.0001     1.1822928
(Stock%-2.17974)*(Profit-244.202)    -0.000013   0.000009    -1.40     0.1630     1.6220983

Figure 6.13: Regression of log10 CEO compensation on Age, Industry, Profit, the percent of a company's stock owned by the CEO, and the Profit*Stock interaction.

To determine the interaction's effect on the "slope" of Stock, ignore all terms in the model that don't have Stock in them, then factor Stock out of all that do.19 The slope of Stock is

“βStock” = β1 + β3Profit.

If β3 is positive then the "slope" of Stock is larger when Profit is high, and smaller when Profit is low. The key to understanding interactions is to come up with a "real world" meaning for the slopes of the variables involved in the interaction. In the current example, βStock is the impact that stock ownership has on a CEO's overall compensation. If β3 > 0 then stock ownership has a more positive impact on compensation for CEO's of more profitable companies, and a more negative impact on compensation for CEO's of money losing companies. The interaction can be interpreted the other way as well.

“βProfit” = β2 + β3Stock

so if β3 > 0 then the model says that the profits of a CEO's company have a larger impact on a CEO's compensation if the CEO owns a lot of stock.

Figure 6.13 tests our theory about the relationship between Stock, Profit, and log10 compensation by incorporating the Stock*Profit interaction into the regression from Figure 6.11. The first thing we notice is that the interaction term is insignificant (p = .1630). That means that the effect of stock ownership on CEO compensation is about the same for CEO's of highly profitable companies and CEO's of companies with poor profits. So much for our theory! We also notice that the coefficient of Stock is negative. Perhaps CEO's who own large amounts of stock choose to forgo huge compensation packages in the hopes that their existing shares will increase in value. The leverage plot for Stock seems to fit with that explanation. It seems we had a flaw in our logic. CEO's that own a lot of stock obviously do well when their companies are profitable, but an increase in the value of their existing shares does not count as compensation, so it won't show up in our data.

19 If you know about partial derivatives, we're just taking the partial derivative with respect to Stock, treating all other variables as constants.

Just for practice, let's try to interpret what the interaction term in Figure 6.13 says, even though it is insignificant and should be dropped from the model. Before creating the interaction term, the computer centers Stock and Profit around their average values to avoid introducing excessive collinearity into the model.20

Figure 6.13 estimates the slope of Stock as

bStock = −0.009774 − 0.000013(Profit − 244.202).

Notice that the mean of Stock (2.17 . . . ) multiplies Profit, not Stock, when you multiply out the interaction term, so it doesn't affect bStock. The estimate says that each million dollars of profit over the average of $244 million decreases the slope of Stock by 0.000013. The estimated slope of Profit is

bProfit = 0.0001199 − 0.000013(Stock − 2.17974),

which says that each percent of the company's stock owned by the CEO, over the average of 2.18%, decreases the slope of profit by 0.000013.
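The two "slopes" implied by Figure 6.13 can be evaluated at whatever Stock or Profit values interest you. A small sketch using the estimates quoted above (the particular profit and stock values plugged in below are arbitrary illustrations):

    b_stock, b_profit, b_inter = -0.009774, 0.0001199, -0.000013
    mean_stock, mean_profit = 2.17974, 244.202      # centering values used by the computer

    def stock_slope(profit):
        # impact of one more percent of stock ownership, at a given profit level
        return b_stock + b_inter * (profit - mean_profit)

    def profit_slope(stock):
        # impact of one more million dollars of profit, at a given level of stock ownership
        return b_profit + b_inter * (stock - mean_stock)

    print(stock_slope(244.202), stock_slope(1000))   # at average profit vs. a very profitable firm
    print(profit_slope(2.17974), profit_slope(10))   # at average ownership vs. a 10% owner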

6.6.1 Interactions Between Continuous and Categorical Variables

Interactions say that the slope for one variable depends on the value of another variable. When one variable is continuous and the other is categorical, an interaction means that there is a different slope for each level of the categorical variable.

For example, suppose a factory supervisor oversees three managers, conveniently named a, b, and c. The supervisor decides to base each manager's annual performance review on the typical amount of time (in man-hours) required for them to complete a production run. The supervisor obtains a random sample of 20 production runs from each manager. To make the comparison fair, the size (number of items produced) of each production run is also recorded. The supervisor regresses RunTime on RunSize and Manager to obtain the regression output in Figure 6.14(a). The regression estimates the fixed cost of starting up a production run at 176 man-hours, and the marginal cost of each item produced is about .25 man-hours (i.e. it takes each person about 15 minutes to produce an item, once the process is up and running). The fixed cost for manager a is 38.4 man-hours above the baseline. Manager b is 14.7 man-hours below the baseline, and manager c is 23.8 man-hours below the baseline.

20 See page 136.


Expanded Estimates

Nominal factors expanded to all levels

Term Estimate Std Error t Ratio Prob>|t|

Intercept 176.70882 5.658644 31.23 <.0001

Manager[a] 38.409663 3.005923 12.78 <.0001

Manager[b] -14.65115 3.031379 -4.83 <.0001

Manager[c] -23.75851 2.995898 -7.93 <.0001

Run Size 0.243369 0.025076 9.71 <.0001

Effect Tests

Source Nparm DF Sum of Sq. F Ratio Prob > F

Manager 2 2 44773.996 83.4768 <.0001

Run Size 1 1 25260.250 94.1906 <.0001

(a) (b)

Expanded Estimates

Term Estimate Std Err. t Ratio Prob>|t|

Intercept 179.59191 5.619643 31.96 <.0001

Manager[a] 38.188168 2.900342 13.17 <.0001

Manager[b] -13.5381 2.936288 -4.61 <.0001

Manager[c] -24.65007 2.887839 -8.54 <.0001

Run Size 0.2344284 0.024708 9.49 <.0001

Mgr[a](RS-209.317) 0.072836 0.035263 2.07 0.0437

Mgr[b](RS-209.317) -0.09765 0.037178 -2.63 0.0112

Mgr[c]*RS 0.0248147 0.032207 0.77 0.4444

Effect Test

Source Nparm DF Sum Sq. F Ratio Prob > F

Manager 2 2 43981.452 89.6934 <.0001

Run Size 1 1 22070.614 90.0192 <.0001

Manager*RS 2 2 1778.661 3.6273 0.0333

(c) (d)

Figure 6.14: Evaluating three managers based on run time: (a-b) without interactions, (c-d) with the interaction between manager and run size. Manager a (broken line), Manager b (+, light line), Manager c (×, heavy line).

The effects test shows that the Manager variable is significant, so the differences between the managers are too large to be just random chance. Upon determining that there actually is a difference between the three managers, the supervisor computed two contrasts (shown in Figure 6.15), indicating that the setup time for manager a is significantly above the average times of managers b and c on runs of similar size, and that the difference between managers b and c is not statistically significant.

The supervisor decides to put manager a on probation, and give managers b and c their usual bonus. While meeting with the managers to discuss the annual review, the supervisor learned that manager b used a different production method than the other two managers. Manager b's method was optimized to reduce the marginal cost of each unit produced. The supervisor began to worry that he hadn't judged manager b fairly.


                Contrast 1    Contrast 2
a                    0             1
b                    1          -0.5
c                   -1          -0.5
Estimate         9.1074        57.614
Std Error        5.2243        4.5089
t Ratio          1.7433        12.778
Prob>|t|         0.0868        <.0001
SS               814.99         43788

Figure 6.15: Contrasts based on the model in Figure 6.14(a).

His analysis had assumed that all three managers had the same marginal cost of .243 man-hours per item. If manager b's method was truly effective (which the supervisor isn't sure about and would like to test), manager b would have an advantage on large jobs that the supervisor's analysis ignored. Unhappy with his current model, the supervisor calls you to help.

The supervisor's question suggests an interaction between the Manager variable and the RunSize variable. Interactions say that the slope of one variable (in this case RunSize) depends on another (in this case Manager). So in the current example an interaction simply means each manager has his own slope, or even more specific to this example: his own marginal cost of production.

After hiring you to explain all this (and paying you a huge consulting fee) the supervisor decides to run a new regression that includes the interaction between Manager and RunSize, shown in Figure 6.14(c). The interaction term enters into the model as a product of RunSize with all the Manager[x] dummy variables, after centering the continuous variable RunSize to limit collinearity.21 Because the interaction enters into the model in the form of several variables at once, the right way to test its significance is through the partial F test shown in the "Effect Test" table. The interaction is significant with p = .0333, which says that at least one of the managers has a different slope (marginal cost) than the other two. The regression equations for the three managers are:

RunTime = 179.59 + 38.19 + .234 RunSize + .073 (RS − 209.3)   if Manager a,
RunTime = 179.59 − 13.54 + .234 RunSize − .098 (RS − 209.3)   if Manager b,
RunTime = 179.59 − 24.65 + .234 RunSize + .025 (RS − 209.3)   if Manager c.

The baseline marginal cost per unit of production is .2344 man-hours per unit. Manager a adds .073 man-hours per unit, manager b reduces the baseline marginal cost by .098 man-hours per unit, and manager c adds .025 man-hours per unit. Notice that the coefficients of the dummy variables in the interaction terms sum to zero just like the coefficients of the Manager main effects. By adding like terms we find that manager a's slope is 0.307, manager b's is 0.136, and manager c's is 0.259. JMP doesn't provide a method equivalent to a contrast for testing interaction terms, but clearly manager b has a smaller slope (i.e. smaller marginal cost per item produced) than the other two.

21 The centering term is not present in the Mgr[c] line of the computer output, but that is a typo on the part of the computer programmers. The output should read Mgr[c](RS-209.317).
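The three managers' regression lines are easy to reconstruct from the expanded estimates in Figure 6.14(c). The sketch below recomputes each manager's marginal cost and predicts the run time for a hypothetical 300-item run (the 300 is our own example, not a run from the data):

    intercept, slope = 179.59191, 0.2344284
    mgr_effect = {"a": 38.188168, "b": -13.5381, "c": -24.65007}      # intercept shifts
    mgr_slope_adj = {"a": 0.072836, "b": -0.09765, "c": 0.0248147}    # slope adjustments
    center = 209.317                              # RunSize is centered before the interaction

    def run_time(manager, run_size):
        return (intercept + mgr_effect[manager]
                + slope * run_size
                + mgr_slope_adj[manager] * (run_size - center))

    for m in "abc":
        # marginal costs: about 0.307, 0.137, and 0.259 (the text reports 0.136 for manager b)
        print(m, round(slope + mgr_slope_adj[m], 4))
    print(round(run_time("b", 300), 1))           # predicted man-hours for manager b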

6.6.2 General Advice on Interactions

Students often ask how you know if you need to include an interaction term. The simple answer is that you try it in the model and look at its p-value, just like any other variable. The trick is having some idea which interactions to check, which requires some contextual knowledge about the problem. You should think to try an interaction term whenever you think that changing a variable (either categorical or continuous) changes the dynamics of the other variables. To develop that intuition you should think about the economic interpretation of the coefficients as marginal costs, marginal revenues, etc. There is no plot or statistic you look at to say "I need an interaction here." The variables involved in an interaction may or may not be related to one another.

Interactions are relatively rare in regression analysis. Many people don't bother to check for them because of the intuition required for the search. However, when you find a strong one it is a real bonus because you've learned something interesting about the process you're studying.

Finally, we interpret interactions as the effect that one variable has on the slope of another. That interpretation becomes more difficult if the slope of either variable is excluded from the model. Therefore, when fitting models with interactions remember the hierarchical principle, which says that if you have an interaction in the model, you should also include the main effects of each variable, even if they are not significant. The exception to the hierarchical principle is when the variable you create by multiplying two other variables is interpretable on its own merits. In that case you have the option of keeping or dropping the main effects as conditions warrant.

6.7 Model Selection/Data Mining

Model selection is the process of trying to decide which are the important variables to include in your model. This is probably the most difficult part of a typical regression problem. Model selection is one of the key components in the field of "data mining," which you have probably heard about before. Data mining has arisen because of the huge amounts of data that are now becoming available as a result of new technologies such as bar code scanners. Every time you go to the supermarket and your products are scanned at the checkout, that information is sent to a central location. There are literally terabytes of information recorded every day. This information can be used to plan marketing campaigns, promotions, etc. However, there is so much data that it is virtually impossible to manually decide which of the thousands of recorded variables are important for any given response and which are not. The field of data mining has sprung up to try to answer this problem (and related ones).

However, the problem existed in statistics long before data mining arrived. There are some established procedures for deciding on the best variables, which we discuss below. In the interests of fairness we should point out that this is such a hard problem that answering it is somewhat more of an art than a science!

6.7.1 Model Selection Strategy

When deciding on the "best" regression model you have several things to keep your eye on. You should check for (1) regression assumption violations, (2) unusual points, and (3) insignificant variables in the model. You should also keep your eye on the VIF's of the variables in the model to be aware of any collinearity issues.

When faced with a mountain of potential X variables to sort through, a good strategy is to start off with what significant predictors you can find by trial and error. Each time you fit a model, glance at the diagnostics in the computer output to see if you can spot any glaring assumption violations or unusual points. Try to economically interpret the coefficients of the models you fit ("β3 is the cost of raw materials"). Being able to interpret your coefficients this way will help you understand the model better, make it easier for you to explain the model to other people, and perhaps give you some insight on interactions that you might want to test for. This stage of the analysis generally requires quite a bit of work and a lot of practice. However, there is really no good substitute for this kind of approach.

Once you've gone through a few iterations and are more or less happy with the situation regarding regression assumptions and unusual points, you eventually get to the stage where a computer can help you figure out which variables will provide significant p-values. Be careful that you don't jump to this step too soon, though. Computers are really good at churning through variables looking for significant p-values, but they are really bad at identifying and fixing assumption violations, suggesting interpretable transformations to remove collinearity problems, and making judgments about what to do with unusual points.


6.7.2 Multiple Comparisons and the Bonferroni Rule

At several points throughout this Chapter we have urged caution when looking at p-values for individual variables. The primary reason is the problem of multiple comparisons. In previous Chapters each problem had only one hypothesis test for us to do. Our .05 rule for p-values meant that if an X was a garbage variable, we had only a 5% chance of mistaking it for a significant one. Regression analysis involves many different hypothesis tests and a lot more trial and error, which means there is a much greater opportunity for garbage variables to sneak into the regression model by chance. Consequently, our .05 rule for p-values no longer protects us from spurious significant results as well as it did before.

One obvious solution is to enact a tougher standard for declaring significance based on the p-values of individual variables in a multiple regression. The question is "how much tougher?" Unfortunately there is no way of knowing, but there is a conservative procedure known as the Bonferroni adjustment which we know to be stricter than we need to guarantee a particular level of significance. The Bonferroni adjustment says that if you are going to select from p potential X variables, then replace the .05 threshold for significance with .05/p. Thus if there were 10 potential X variables we should use .005 as the significance threshold instead of .05.

The Bonferroni rule is a rough guide. It is important to remember that it is tougher than it needs to be, simply because we have no way of knowing precisely how much we could relax it and still maintain a "true" .05 significance level. At some point (a practical limit is .0001, the smallest p-value that the computer will print) the Bonferroni-adjusted threshold gets low enough that we don't lower it any further.
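A quick calculation shows both the adjusted threshold and why the rule is conservative. The "exact" number below assumes the p tests are independent, which real regression t-tests are not exactly, so treat it as an illustration (the "Not on the test 6.3" box in the next section gives the probability argument behind the bound):

    alpha, p = 0.05, 10

    print(alpha / p)                  # Bonferroni-adjusted threshold: 0.005
    print(p * (alpha / p))            # bound on P(at least one false positive): 0.05
    print(1 - (1 - alpha / p) ** p)   # exact value if the tests were independent: about 0.0489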

6.7.3 Stepwise Regression

In regression problems with only a few X variables it is best to select the variables in the regression model by hand. However, if you have a mountain of X variables staring you in the face you're going to need some help. Some sort of automated approach is required to choose a manageable subset of the variables to consider in more detail. There are three common approaches, collectively called stepwise regression.

• Forward Selection. With this procedure we start with no variables in the model and add to the model the variable with the lowest p-value. We then add the variable with the next lowest p-value (conditional on the first variable being in the model). This approach continues until the p-value for the next variable to add is above some threshold (e.g. 5%, adjusted by the Bonferroni rule), at which point we stop. The threshold must be chosen by the user. (A minimal sketch of this procedure appears after the list.)


Not on the test 6.3 Where does the Bonferroni rule come from?

Suppose X1, . . . , Xp are all variables with no relationship to Y. Now suppose we run a regression and pick out only the X's that have significant individual p-values. Each Xi has a probability α of spuriously making it into the model, where our usual rule is α = .05. From very basic probability we know that if A and B are two events, then P(A or B) = P(A) + P(B) − P(A and B) ≤ P(A) + P(B). Therefore, the probability that at least one of the Xi's makes it into the model is

P(X1 or X2 or · · · or Xp) ≤ P(X1) + · · · + P(Xp) = pα.

Thus if we replace the .05 rule with a .005 rule, then we can look at 10 t-statistics and still maintain only a 5% chance of allowing at least one garbage variable into the regression.

• Backward Selection. With backward selection we start with all the possible variables in the model. We then remove the variable with the largest p-value. Then we remove the variable with the next largest p-value (given the new model with the first variable removed). This procedure continues until all remaining variables have a p-value below some threshold, again chosen by the user.

• Mixed Selection. This is a combination of forward and backward selection. We start with no variables in the model and, as with forward selection, add the variable with the lowest p-value. The procedure continues in this way except that if at any point any of the variables in the model have a p-value above a certain threshold they are removed (the backward part). These forward and backward steps continue until all variables in the model have a low enough p-value and all variables outside the model would have a large p-value if added. Notice that this procedure requires selecting two thresholds. If the two thresholds are close to one another the procedure can enter a cycle: including a new variable makes a previously included variable insignificant, dropping that variable makes the new variable insignificant, and dropping it makes the first variable significant again. If you encounter such a cycle then simply choose a tougher threshold for including variables in the model.
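Here is the minimal forward-selection sketch promised above. It is a bare-bones illustration of the idea rather than JMP's implementation: it adds whichever candidate has the smallest ordinary least squares p-value, and stops when that p-value exceeds a single user-chosen threshold. Run on pure noise with a tough threshold, the empty model is the likely (and correct) answer.

    import numpy as np
    from scipy import stats

    def forward_select(X, y, names, threshold=0.005):
        """Greedy forward selection on a matrix of candidate columns X."""
        n = len(y)
        chosen = []
        while True:
            best = None
            for j in range(X.shape[1]):
                if j in chosen:
                    continue
                D = np.column_stack([np.ones(n)] + [X[:, k] for k in chosen] + [X[:, j]])
                b, *_ = np.linalg.lstsq(D, y, rcond=None)
                resid = y - D @ b
                df = n - D.shape[1]
                s2 = resid @ resid / df
                se = np.sqrt(s2 * np.linalg.inv(D.T @ D)[-1, -1])
                pval = 2 * stats.t.sf(abs(b[-1] / se), df)   # p-value of the newly added column
                if best is None or pval < best[1]:
                    best = (j, pval)
            if best is None or best[1] > threshold:
                return [names[j] for j in chosen]
            chosen.append(best[0])

    rng = np.random.default_rng(1)
    X = rng.normal(size=(48, 10))      # ten columns of pure noise
    y = rng.normal(size=48)
    print(forward_select(X, y, names=[f"X{i + 1}" for i in range(10)]))   # very likely []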

JMP (or most other statistical packages) will automatically perform these procedures for you.22 The only inputs you need to provide are the data and the thresholds. Note that the default thresholds that JMP provides are TERRIBLE. They are much larger than .05, when they should be much smaller. To illustrate, we simulated 10 normally distributed random variables, totally independently, for use in a regression to predict the stock market (60 monthly observations of the VW index from the early 1990's). The last 12 months were set aside, and the remaining 48 were used to fit the model. Obviously our X variables are complete gibberish, as is reflected in the ANOVA table for the regression of VW on X1, . . . , X10 shown in Figure 6.16. The second ANOVA table in Figure 6.16 was obtained using the Mixed stepwise procedure, with JMP's default thresholds, on all possible interaction and quadratic terms for the 10 X variables. Notice R2 = .92! The whole model F test is highly significant. Figure 6.16(a) shows a time series plot of the predictions from this regression model on the same plot as the actual data. For the 48 months that the model was fit on, the predicted and actual series track nearly perfectly. For the last 12 months, which were not used to fit the model, the two series are completely unrelated to one another. Figure 6.16(b) makes the same point using a different view of the data. It plots the residuals versus the predicted returns for all 60 observations. At least half of the 12 points not used in fitting the model could be considered outliers.

22 See page 182 for computer tips.

The point of this exercise is that stepwise regression is a "greedy" algorithm for choosing X's to be in the model. Give it a chance and it will include all sorts of spurious variables that happen to fit just by chance. When using the stepwise procedure remember to protect yourself by setting tough significance thresholds. When we tried our same experiment using .001 as a threshold (the lowest threshold that JMP allows), we ended up with an "empty" model, which in this case is the right answer.


ANOVA Table for ordinary regression on 10 "Random Noise" variables

Source   DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model    10   0.00409569       0.000410      0.2906    0.9791
Error    36   0.05074203       0.001410
Total    46   0.05483773

Mixed Stepwise with JMP's Defaults (all possible interactions considered)

Source   DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model    26   0.05058174       0.001945      9.1422    <.0001
Error    20   0.00425599       0.000213
Total    46   0.05483773

RSquare 0.922389    RMSE 0.014588    N 47

[Figure 6.16 panels (a) and (b): a time series plot of predicted versus actual returns, and a plot of residuals versus predicted returns.]

Figure 6.16: Regression output for the value weighted stock market index regressed on 10 variables simulated from "random noise."


Chapter 7

Further Topics

7.1 Logistic Regression

Often we wish to understand the relationship between X and Y where X is continuous but Y is categorical. For example a credit card company may wish to understand what factors (X variables) affect whether a customer will default or whether a customer will accept an offer of a new card. Both "default" and "accept new card" are categorical responses. Suppose we let

Yi = 1 if the ith customer defaults, 0 if not.

Why not just fit the regular regression model,

Yi = β0 + β1Xi + εi

and predict Y as

Ŷ = b0 + b1X?

There are two problems with this prediction. First, if Ŷ = 0.5, for example, we know this is an incorrect prediction because Y is either 0 or 1. Second, depending on the value of X, Ŷ can range anywhere from negative to positive infinity. Clearly this makes no sense because Y can only take on the values 0 or 1.

A New Model

The first problem can be overcome by treating Ŷ as a guess, not for Y itself, but for the probability that Y equals 1, i.e. P(Y = 1). So, for example, if Ŷ = 0.5 this would indicate that the probability of a person with this value of X defaulting is 50%. Thus any prediction between 0 and 1 now makes sense. However, this does not solve the second problem because a probability less than zero or greater than one still has no sensible interpretation. The problem is that a straight line relationship between X and P(Y = 1) is not correct. The true relationship will always be between 0 and 1. We need to use a different function, i.e. not linear. There are many possible functions but the one that is used most often is the following

p = e^(β0+β1X) / (1 + e^(β0+β1X))

where p = P(Y = 1). This curve has an S shape. Notice that as β1X gets close to infinity p gets close to one (but never goes past one) and when β1X gets close to negative infinity p gets close to zero (but never goes below zero). Hence, no matter what X is we will get a sensible prediction. If we rearrange this equation we get

p / (1 − p) = e^(β0+β1X)

p/(1 − p) is called the odds. It can go anywhere from 0 to infinity. A number close to zero indicates that the probability of a 1 (default) is close to zero, while a number close to infinity indicates a high probability of default. Finally, by taking logs of both sides we get

log( p / (1 − p) ) = β0 + β1X

The left hand side is called the log odds or "logit." Notice that there is a linear relationship between the log odds and X.

Fitting the Model

Just as with regular regression, β0 and β1 are unknown. Therefore, to understand the relationship between X and Y and make future predictions for Y we need to estimate them using b0 and b1. These estimates are produced using a method called "Maximum Likelihood," which is a little different from "Least Squares." However, the details are not important because most of the ideas are very similar. We still have the same questions and problems as with standard regression. For example, b1 is only a guess for β1, so it has a standard error which we can use to construct a confidence interval. We are still interested in testing the null hypothesis H0 : β1 = 0 because this corresponds to "no relationship between X and Y." We now use a χ2 (chi-square) statistic rather than t, but you still look at the p-value to see whether you should reject H0 and conclude that there is a relationship.

The interpretation of β1 is a little more difficult than with standard regression. If β1 is positive then increasing X will increase p = P(Y = 1), and vice versa for β1 negative. However, the effect on the probability of increasing X by one is less clear. Increasing X by one changes the log odds by β1. Equivalently, it multiplies the odds by e^β1. However, the effect this will have on p depends on what the probability is to start with.

Making Predictions

Suppose we fit the model using average debt as a predictor and get b0 = −3 and b1 = .001. Then for a person with 0 average debt we would predict that the probability they defaulted on the loan would be

p = e^(−3 + 0.001×0) / (1 + e^(−3 + 0.001×0)) = e^(−3) / (1 + e^(−3)) = 0.047.

On the other hand, a person with $2,000 in average debt would have a probability of default of

p = e^(−3 + 0.001×2,000) / (1 + e^(−3 + 0.001×2,000)) = e^(−1) / (1 + e^(−1)) = 0.269.

Multiple Logistic Regression

The logistic regression model can easily be extended to as many X variables as we like. Instead of using

p = e^(β0+β1X) / (1 + e^(β0+β1X))

we use

p = e^(β0+β1X1+···+βpXp) / (1 + e^(β0+β1X1+···+βpXp))

or equivalently

log( p / (1 − p) ) = β0 + β1X1 + · · · + βpXp.

The same questions from multiple regression reappear. Do the variables overall help to predict Y? (Look at the p-value for the "whole model test.") Which of the individual variables help? (Look at the individual p-values.) What effect does each X have on Y? (Look at the signs on the coefficients.)

In the credit card example it looked like average balance and income had an effect on the probability of default. Higher average balances caused a higher probability (b1 was positive) and higher incomes caused a lower probability (b2 was negative). Therefore we are especially concerned about people with high balances and low incomes.


Making predictions with several variables

Suppose we fit the model using average debt and income as predictors and get b0 = 1, b1 = .001, and b2 = −.0001. Then for a person with 0 average debt and income of $50,000 we would predict that the probability they defaulted on the loan would be

p = e^(1 + 0.001×0 − 0.0001×50,000) / (1 + e^(1 + 0.001×0 − 0.0001×50,000)) = e^(−4) / (1 + e^(−4)) = 0.018.

On the other hand, a person with $2,000 in average debt and income of $30,000 would have a probability of default of

p = e^(1 + 0.001×2,000 − 0.0001×30,000) / (1 + e^(1 + 0.001×2,000 − 0.0001×30,000)) = e^0 / (1 + e^0) = 0.5.
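The same recipe extends directly to several predictors. Again using the hypothetical coefficients above:

    import math

    def default_prob(debt, income, b0=1.0, b1=0.001, b2=-0.0001):
        z = b0 + b1 * debt + b2 * income
        return math.exp(z) / (1 + math.exp(z))

    print(default_prob(debt=0, income=50_000))       # about 0.018
    print(default_prob(debt=2_000, income=30_000))   # exactly 0.5 (the exponent is zero)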

7.2 Time Series

The Difference Between Time Series and Cross Sectional Data

• Cross Sectional Data

– Car’s weight predicts/explains fuel consumption.

– Two different variables, no concept of time.

• Time Series

– Past number of deliveries explains future number of deliveries.

– Same variable, different times.

Autocorrelation

When we have data measured over time we often denote the Y variable as Yt, where t indicates the time that Y was measured at. The "lag variable" is then just Yt−1, i.e. all the Y's shifted back by one. A standard assumption in regression is that the Y's are all independent of each other. In time series data this assumption is often violated. If there is a correlation between "yesterday's" Y, i.e. Yt−1, and "today's" Y, i.e. Yt, this is called "autocorrelation." Autocorrelation means today's residual is correlated with yesterday's residual. An easy way to spot it is to plot the residuals against time. Tracking, i.e. a pattern where the residuals follow each other, is evidence of autocorrelation. In cross sectional data there is no time component, which means no autocorrelation. On the other hand, time series data almost always has some autocorrelation.


Impact of Autocorrelation

• On predictions. The past contains additional information about the future not captured by the current regression model. Today's residual can help predict tomorrow's residual. This means that we should incorporate the previous (lagged) Y's in the regression model to produce a better estimate of today's Y.

• On parameter estimates. Yesterday's Y can be used to predict today's Y. As a consequence, today's Y provides less new information than an independent observation. Another way to say this is that the "Equivalent Sample Size" is smaller. For example, 100 Y's that have autocorrelation may only provide as much information as 80 independent Y's. Since we have less information, the parameter estimates are less certain than in an independent sample.

Testing for Autocorrelation

One test for autocorrelation is to use the "Durbin-Watson Statistic." It is calculated using the following formula

DW = [ Σ from t=2 to n of (et − et−1)² ] / [ Σ from t=1 to n of et² ].

It compares the variation of the residuals about the lagged residuals (the numerator) to the variation in the residuals (the denominator). The Durbin-Watson statistic assumes values between 0 and 4. A value close to 0 indicates strong positive autocorrelation. A value close to 2 indicates no autocorrelation, and a value close to 4 indicates strong negative autocorrelation. Basically, look for values a long way from 2. It can be shown that

DW ≈ 2 − 2r

where r is the autocorrelation between the residuals. Therefore, an alternative to the Durbin-Watson statistic is to simply calculate the correlation between the residuals and the lagged residuals. Notice that if the correlation is zero the DW statistic should be close to 2.
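The Durbin-Watson statistic is a one-line calculation once you have the residuals. A small sketch on two simulated residual series (the independent series should give a value near 2, and the random-walk series, which "tracks", a value near 0):

    import numpy as np

    def durbin_watson(residuals):
        e = np.asarray(residuals)
        return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

    rng = np.random.default_rng(0)
    independent = rng.normal(size=200)            # no autocorrelation
    tracking = np.cumsum(rng.normal(size=200))    # strong positive autocorrelation

    print(durbin_watson(independent), durbin_watson(tracking))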

What to do when you Detect Autocorrelation

The easiest way to deal with autocorrelation is to incorporate the lagged residuals in the regression as a new X variable. In other words, fit the model with whatever X variables you want to use, save the residuals, lag them, and refit the model including the lagged residuals. This will generally improve the accuracy of the predictions as well as remove the autocorrelation.


Short Memory versus Long Memory

Autocorrelation is called a "short memory" phenomenon because it implies that the Y's are affected by, or remember, what happened in the recent past. "Long memory" can also be important in making future predictions. Some examples of long memory are:

• Trend. This is a long run change in the average value of Y. For example, industry growth.

• Seasonal. This is a predictable pattern with a fixed period. For example, if you are selling outdoor furniture you would expect your summer sales to be higher than winter sales irrespective of any long term trend.

• Cyclical. These are long run patterns with no fixed period. An example is the "boom and bust" business cycle, where the entire economy goes through a boom period where everything is expanding and then a bust where everything is contracting. Predicting the length of such periods is very difficult, and unless you have a lot of data it can be hard to differentiate between cycles and long run trends. Since we have so little time on time series data we won't worry about cyclical variation in this class.

To model a long term trend in the data we treat time as another predictor. Just as with any other predictor, we need to check what sort of relationship there is between it and Y (i.e. none, linear, non-linear) and transform as appropriate. To model seasonal variation we should incorporate a categorical variable indicating the season. For example, if we felt that sales may depend on the quarter, i.e. Winter, Spring, Summer or Autumn, we would add a categorical variable indicating which of these 4 time periods each of the data points corresponds to. On the other hand, we may feel that each month has a different value, in which case we would add a categorical variable with 12 levels, one for each month. When we fit the model JMP will automatically create the correct dummy variables to code for each of the seasons. As with any other predictors, we should look at the appropriate p-values and plots to check whether the variables are necessary and whether any of the regression assumptions have been violated. By incorporating time, seasons, and lagged residuals in the regression model we can deal with a large range of possible time series problems.

7.3 More on Probability Distributions

This section explains more about some standard probability distributions other than the normal distribution. You may encounter some of these distributions in your operations class.


[Figure 7.1 shows the normal PDF (left) and CDF (right) plotted against z.]

Figure 7.1: PDF (left) and CDF (right) for the normal distribution.

7.3.1 Background

There are two basic types of random variables: discrete and continuous. Discrete random variables are typically integer valued. That is, they can be 0, 1, 2, . . . . Continuous random variables are real valued, like 3.14159. All random variables can be characterized by their cumulative distribution function (or CDF)

F(x) = Pr(X ≤ x).

CDF's have three basic properties: they start at 0, they stop at 1, and they never decrease as you move from left to right. We are already familiar with one CDF: the normal table! If a random variable is continuous you can take the derivative of its CDF to get its probability density function (or PDF). The normal PDF and CDF are plotted in Figure 7.1. If you want to understand a random variable it is easier to think about its PDF, because the PDF looks like a histogram. However, the CDF is a handy thing to have around, because you can calculate the probability that your random variable X is in any interval (a, b] by F(b) − F(a), just like we did with the normal table in Section 2.4.

If a random variable is discrete you can calculate its probability function P(X = x).


[Figure 7.2 shows the exponential PDF (left) and CDF (right) plotted against x.]

Figure 7.2: The exponential distribution (a) PDF (b) CDF.

7.3.2 Exponential Waiting Times

The shorthand notation to say that X is an exponential random variable with rate λ is X ∼ E(λ). The exponential distribution has the density function

f(x) = λe^(−λx)

and CDF

F(x) = 1 − e^(−λx)

for x > 0. These are plotted in Figure 7.2.

One famous property of the exponential distribution is that it is memoryless. That is, if X is an exponential random variable with rate λ, the amount of time you have already spent waiting tells you nothing about how much longer you will have to wait: the distribution of the remaining waiting time is the same E(λ) distribution you started with.
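A small numerical sketch of the exponential CDF and the memoryless property (the rate λ = 2 arrivals per day is an arbitrary illustration):

    import math

    lam = 2.0    # hypothetical rate: the mean waiting time is 1/lam = 0.5 days

    def exp_cdf(x, lam):
        return 1 - math.exp(-lam * x)

    print(1 - exp_cdf(1.0, lam))      # P(wait more than a day) = e^(-2), about 0.135

    # Memorylessness: having already waited a day, the chance of waiting at least
    # another half day equals the unconditional chance of waiting at least half a day.
    print((1 - exp_cdf(1.5, lam)) / (1 - exp_cdf(1.0, lam)))   # about 0.368
    print(1 - exp_cdf(0.5, lam))                               # also about 0.368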

7.3.3 Binomial and Poisson Counts

The binomial probability function is

P(X = x) = (n choose x) p^x (1 − p)^(n−x),   x = 0, 1, . . . , n,

where (n choose x) = n!/(x!(n − x)!).
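Binomial and Poisson probabilities are usually read from tables or software. A small sketch using scipy (the n, p, and λ values are arbitrary illustrations; λ = 3 matches Figure 7.3):

    from scipy.stats import binom, poisson

    n, p = 20, 0.1
    print(binom.pmf(2, n, p))                  # P(exactly 2 successes in 20 trials)
    print(binom.cdf(2, n, p))                  # P(at most 2 successes)
    print(binom.mean(n, p), binom.var(n, p))   # np and np(1 - p)

    lam = 3
    print(poisson.pmf(2, lam))                 # P(exactly 2 counts) = lam^2 e^(-lam) / 2!
    print(poisson.cdf(2, lam))                 # P(at most 2 counts)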


[Figure 7.3 shows the Poisson probability function (left) and CDF (right) for λ = 3.]

Figure 7.3: Poisson (a) probability function (b) CDF with λ = 3.

7.3.4 Review

Distribution          Normal                            Exponential      Poisson              Binomial
Notation              N(µ, σ)                           E(λ)             Po(λ)                B(n, p)
f(x)                  (1/(√(2π)σ)) e^(−(x−µ)²/(2σ²))    λe^(−λx)         (λ^x/x!) e^(−λ)      (n choose x) p^x (1−p)^(n−x)
CDF                   tables                            1 − e^(−λx)      tables               tables
E(X)                  µ                                 1/λ              λ                    np
Var(X)                σ²                                1/λ²             λ                    np(1 − p)
Discrete/Continuous   continuous                        continuous       discrete             discrete
Model for             lots of stuff                     waiting times    counts               counts
                                                                         (no known max)       (known max n)

Table 7.1: Summary of random variables

7.4 Planning Studies

7.4.1 Different Types of Studies

In this Chapter we talk about three different types of studies, which are primarily distinguished by their goal. The concept of randomization plays an important role in study design. Different types of randomization are required to achieve the different goals.

Experiment The goal of an experiment is to determine whether one action (called a treatment) causes another (the effect) to occur. For example, does receiving a coupon in the mail cause a customer to change the type of beer he buys in the grocery store? In an experiment, sample units (the basic unit of study: people, cars, supermarket transactions) are randomly assigned to one or more treatment levels (e.g. bottles/cans) and an outcome measure is recorded. Experiments can be designed to simultaneously test more than one type of treatment (bottles/cans and 12-pack/6-pack).

Survey The goal of a survey is to describe a population by collecting a small sample from the population. Surveys are more interested in painting an accurate picture of the population than determining causal relationships between variables.

The key issue in a survey is how to decide which units from the population are to be included in the sample. The best surveys randomly select units from the overall population. Note that there are many different strategies that may be employed to randomly select units for inclusion in a survey, and the survey data should be analyzed with the randomization strategy in mind. A survey in which the entire population is included is called a census. Conducting a census is expensive, time consuming, and it may even be impossible for practical or ethical reasons. There can also be some subtle problems with a census, such as what to do if some people refuse to respond. A carefully conducted survey can sometimes produce more accurate results than a census!

Observational Study Like an experiment, the goal of an observational study is to draw causal conclusions. However, the sample units in an observational study are not under the control of the study designer. The conclusions drawn from an observational study are always subject to criticism, but observational studies are the least expensive type of study to perform. As a result they are the most common.

7.4.2 Bias, Variance, and Randomization

Whenever a study is performed to estimate some quantity there are two possible problems, bias and variance. To say that a study is biased means that there is something systematically wrong with the way the study is conducted. Mathematically speaking, bias is the difference between the expected value of the sample statistic the study is designed to produce and the population parameter that the statistic estimates. For example, in a survey obtained by simple random sampling, we know that E(X̄) = µ, so the bias is zero. If the bias is zero we call the estimator unbiased. If an unbiased study were performed many times, sometimes its estimates would be too large and sometimes they would be too small, but on average they would be correct.

In the context of study design, "variance" simply means that if you performed the study again you would get a slightly different answer. As long as a study is unbiased, we have very effective tools (confidence intervals and hypothesis tests) at our disposal for measuring variance. If these tools indicate that we have not measured a phenomenon accurately enough, then we can reduce the variance by simply gathering a larger sample. If an estimator is biased it will get the wrong answer even if we have no variance. In large studies (where the variance is small because n is so large) bias is usually a much worse problem than variance. Ideally we want to produce an estimate that has both low bias and low variance.

Unfortunately bias is often hard (and sometimes impossible) to detect. That is why randomization is so important in designing studies. We can't control bias, but we can control variance by collecting more data. Randomization is a tool for turning bias (which we can't control) into variance (which we can).

7.4.3 Surveys

Randomization in Surveys

The whole idea of a survey is to generalize the results from your sample to a larger population. You can only do so if you are sure that your sample is representative of the population. Randomly selecting units from the population is vital to ensuring the representativeness of your sample. This is counter-intuitive to some people. You may think that if you carefully decide which units should be included in the sample then you can be sure it is representative of the population. However, bias can sneak into a deterministic sampling scheme in some very subtle ways. A famous example is the Dewey/Truman U.S. presidential election, where the Gallup organization used "quota sampling" in its polling. The survey was conducted through personal interviews, where each interviewer was given a quota describing the characteristics of who they should interview (12 white males over 40, 7 Asian females under 30). However, there is a limit to the precision that can be placed on quotas. The interviewers followed the quotas, but still managed to show a bias towards interviewing Republicans who planned to vote for Dewey. In fact Truman, the Democratic candidate, won the election.


Figure 7.4: The famous picture of Harry S. Truman after he defeated Thomas E. Dewey in the 1948 U.S. Presidential election.

Steps in a Survey

Define “population” Clearly defining the population you want to study is important because it defines the “sampling unit,” the basic unit of analysis in the study. If you are not clear about the population you want to study, then it may not be obvious how you should sample from the population. For example, do you want to randomly sample transactions or accounts (which might contain several transactions)? If you sample the wrong thing you can end up with “size bias” where larger units (e.g. transactions occurring in busy accounts) have a higher probability of being selected than smaller units (transactions occurring in small accounts).

Construct sampling frame The sampling frame is a list of “almost all” units in the population. For example, the U.S. Census is often used as a sampling frame. The sampling frame may have some information about the units in the population, but it does not contain the information needed for your study. “Frame coverage bias” occurs when some units in the population do not appear on the sampling frame. Sometimes the sampling frame is an explicit list. Sometimes it is implicit, such as “the people watching CNN right now.” Explicit sampling frames give your results greater credibility.

Select sample There are several possible strategies that can be employed to randomly sample units from the sampling frame. The simplest is a “simple random sample.” Simple random sampling is equivalent to drawing units out of a hat. There are other sampling methods, such as stratified random sampling, cluster sampling, two stage sampling, and many others. The thing that determines whether a survey is “scientific” is whether the sampling probabilities are known (i.e. the probability of each individual in the population appearing in the sample). A small sketch of drawing and analyzing a simple random sample appears after these steps.

If the sampling probabilities are unknown, then the survey very likely suffers from “selection bias.” A common form of selection bias (called self-selection) occurs when people (on an implicit sampling frame) are encouraged to phone in or log on to the internet and voice their opinions. Typically, only the strongest opinions are heard.

Another form of selection bias occurs with a “convenience sample” composed of the units that are easy for the study designer to observe. We experience selection bias from convenience samples every day. Have you ever wondered how “Boy Band X” can have the number one hit record in the nation, but you don't know anyone who owns a copy? Your circle of friends is a convenience sample. Convenience samples occur very often when the first n records are extracted from a large database. Those records might be ordered in some important way that you haven't thought of. For example, employment records might have the most senior employees listed first.

Collect data Particularly when you are surveying people, just because you have selected a unit to appear in your survey does not mean they will agree to do so. “Non-response bias” occurs when people that don't answer your questions are systematically different from those people that do. The fraction of people who respond to your survey questions is called the “response rate.” The response rate is typically highest for surveys conducted in person. It is much lower for phone surveys, and lower still for surveys conducted by mail. Many survey agencies actually offer financial incentives for people to participate in the survey in order to increase the response rate.

Analyze data The data analysis must be done with the sampling scheme in mind. The techniques we have learned (and will continue to learn) are appropriate for a simple random sample. If applied to data from another sampling scheme (stratified sampling, cluster sampling, etc.) they can produce biased results. The methods we know about can be modified to handle other sampling schemes, typically by assigning each observation a weight which can be computed using knowledge of the sampling strategy. We will not discuss these weighting schemes.
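To tie the last two steps together, here is a minimal sketch (Python; the frame of 1000 hypothetical account IDs and the made-up measurements are purely illustrative). It draws a simple random sample and then forms a weighted estimate, weighting each observation by the inverse of its sampling probability; under simple random sampling the weights are all equal, so the weighted mean is just the ordinary sample mean.

import random

frame = list(range(1, 1001))          # a hypothetical sampling frame of 1000 account IDs
n = 50
sample = random.sample(frame, n)      # simple random sample: every unit has probability n/N

# Analysis with weights: under simple random sampling every unit gets the same
# weight N/n, so the weighted mean reduces to the ordinary sample mean.
N = len(frame)
values = {unit: random.gauss(500, 100) for unit in frame}   # made-up measurements
weights = {unit: N / n for unit in sample}                  # 1 / inclusion probability
estimate = sum(weights[u] * values[u] for u in sample) / sum(weights.values())
print(round(estimate, 1))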


Other Types of Random Sampling

Other randomized designs include cluster sampling, stratified sampling, and two stage sampling.

7.4.4 Experiments

Unlike surveys, the theory of experiments does not concern itself with generalizing the results of the experiment to a larger population. The goal of an experiment is to infer a causal relationship between two variables. The first variable (X) is called a treatment and is set by the experimenter. The second variable (Y) is called the response and is measured by the experimenter. Experiments can be conducted with several different treatment and response variables simultaneously. The study of the right way to organize and analyze complicated experiments falls under a sub-field of statistics called “experimental design.”

Because the goal of an experiment is different from that of a survey, a different type of randomization is required. Surveys randomly select which units are to be included in the study to ensure that the survey is representative of the larger population. Experiments randomly assign units to treatment levels to make sure there are no systematic differences between the different groups the experiment is designed to compare.

For example, in testing a new drug, experimental subjects are randomly assigned to one of two groups. The treatment group is given the drug, while the control group is given a placebo. We then compare the two groups to see if the treatment seems to be helping.
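A minimal sketch of random assignment (Python; the list of twenty hypothetical subjects is made up):

import random

subjects = ["subj%02d" % i for i in range(1, 21)]   # 20 hypothetical subjects
random.shuffle(subjects)                            # random order removes systematic differences
treatment = subjects[:10]                           # first half gets the drug
control = subjects[10:]                             # second half gets the placebo
print(treatment)
print(control)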

Issues in Experimental Design

Confounding This occurs when, for example, all women are given the drug and all men are given the placebo. If we then find that the treatment group did better than the control group we have a problem, because we can't tell whether the difference is due to the drug or to gender.

Lurking Variables You would obviously never conduct an experiment by assigning all the men to one treatment level and all the women to another. But what if there were some other systematic difference between the treatment and control groups that was not so obvious to you? Then any difference you observe between the groups might be because of that other variable. For example, the fact that smokers develop cancer more often than non-smokers does not prove that smoking causes cancer. It is possible that there is an unobserved variable (e.g. a defective gene) that causes people to both smoke and develop cancer. The variable is unobserved (lurking) so it is difficult to tell for sure. The best way to overcome this problem is to randomly assign people to groups. Random assignment ensures that any systematic differences among individuals, even those that you might not know to account for, are evenly spread between the two groups. If units are randomly assigned to treatment levels, then the only systematic difference between the groups being compared is the treatment assignment. In that case you can be confident that a statistically significant difference between the groups was caused by the treatment.

Placebo Effect People in the treatment group may do better simply because they think they should! This problem can be reduced by not telling subjects which group they are in; when the observers measuring the response also do not know which group they are measuring, the study is called a double blind trial.

7.4.5 Observational Studies

An observational study is a study in which randomization is impossible or undesirable. For example, it is not ethical to randomly make some people smoke and make others not smoke. Studies involving smoking and cancer rates are observational studies because the subjects decide whether or not they will smoke. Observational studies are always subject to criticism due to possible lurking variables. Another way to say this is that observational studies suffer from “omitted variable bias.”

Omitted variable bias is illustrated by Simpson's paradox, which simply says that conditioning an analysis on a lurking variable can change the conclusion. Here is an example. We are comparing the batting averages for two baseball players. (A batting average is the fraction of at-bats in which a player gets a hit, multiplied by 1000. So someone who bats 250 gets a hit 25% of the time.) First we look at the overall batting average for each player for the entire season.

Player   Whole Season
A             251
B             286

It appears that player B is the better batter. However, if we look at the averages for the two halves of the season we get a completely different conclusion.

Player   First Half   Second Half
A            300           250
B            290           200

How is it possible that player A can have a higher average in both halves of the season but a lower average overall? Upon closer examination of the numbers we see that

             First Half           Second Half
Player    Hits    at-Bats      Hits    at-Bats
A            3         10       100        400
B           58        200         2         10
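The overall figures can be recovered directly from these counts; here is a quick check (Python, using only the numbers in the table above):

hits = {"A": (3, 100), "B": (58, 2)}        # (first half, second half) hits
at_bats = {"A": (10, 400), "B": (200, 10)}  # (first half, second half) at-bats

for player in ("A", "B"):
    h1, h2 = hits[player]
    ab1, ab2 = at_bats[player]
    halves = (round(1000 * h1 / ab1), round(1000 * h2 / ab2))
    season = round(1000 * (h1 + h2) / (ab1 + ab2))
    print(player, halves, season)   # A: (300, 250) 251   B: (290, 200) 286

The season average is a weighted average of the two half-season averages, with weights proportional to at-bats, which is why the ordering can flip.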

The batting average for both players is lower during the second half of the season (maybe because of better pitching, or worse weather). Most of player A's attempts came during the second half, and vice-versa for player B. Therefore if we condition on which half of the season it is we get a quite different conclusion. Here is another example based on a study of admissions to Berkeley graduate programs.

                      Admission
             Yes      No    Totals   % Admitted
Gender
  Men       3738    4704      8442        44.3
  Women     1494    2827      4321        34.6
  Totals    5232    7531     12763

44.3% of men were admitted while only 34.6% of women were. On the surface there appears to be strong evidence of gender bias in the admission process. However, look at the numbers if we do a comparison on a department by department basis.

                     Men                          Women
Program   No. Applicants   % admitted   No. Applicants   % admitted
A                    825           62              108           82
B                    560           63               25           68
C                    325           37              593           34
D                    417           33              375           35
E                    191           28              393           24
F                    373            6              341            7
Total               2691           45             1835           30

The percentage of women admitted is generally higher in each department. Only departments C and E have slightly lower rates for women. Department A seems if anything to be biased in favor of women. How is this possible? Notice that men tend to apply in greater numbers to the “easy” programs (A and B) while women go for the harder ones (C, D, E and F). Hence the lower overall percentage of women admitted is simply a result of women applying to the harder programs. Unfortunately, statistics cannot determine whether this means men are smart or lazy.
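The Total row of the department table is just an applicant-weighted average of the department admission rates, and that weighting is what drives the paradox. A quick check (Python, using the applicant counts and percentages from the table above; the admit counts are reconstructed from the rounded percentages, so the results are approximate):

# (applicants, percent admitted) for men and women in each program, from the table
men   = {"A": (825, 62), "B": (560, 63), "C": (325, 37),
         "D": (417, 33), "E": (191, 28), "F": (373, 6)}
women = {"A": (108, 82), "B": (25, 68),  "C": (593, 34),
         "D": (375, 35), "E": (393, 24), "F": (341, 7)}

def overall(rates):
    applicants = sum(n for n, _ in rates.values())
    admitted = sum(n * pct / 100 for n, pct in rates.values())
    return round(100 * admitted / applicants)

print(overall(men), overall(women))   # roughly 45 and 30, matching the Total row

Men's applications are concentrated in programs A and B, where admission rates are high for everyone, so their weighted average comes out higher even though the within-program rates mostly favor women.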

Observational studies and multiple regression are closely linked: when randomization is impossible, the best we can do is measure potential lurking variables and condition on them in a regression, and conversely, regression coefficients estimated from observational data are always vulnerable to omitted variable bias.

Using linear regression has several advantages over the two sample t-test. The first is that linear regression allows you to incorporate other variables (once we learn about multiple regression). Recall that possible lurking variables often make it hard to determine whether an apparent difference between, say, males and females is caused by gender or by some other lurking variable. If a difference between genders is still apparent after including possible lurking variables then this suggests (though does not prove) that the difference is really caused by gender. Using linear regression we can incorporate any other possible lurking variables; the two sample t-test does not allow this. For example, in the compensation example males were paid significantly more than females. However, when we incorporated the level of responsibility we got

Yi ≈ 112.8 + 6.06 Position + { +1.86 if the ith person is Female
                               −1.86 if the ith person is Male

   = 6.06 Position + { 114.7 if the ith person is Female
                       110.9 if the ith person is Male

So the conclusion was reversed. The appearance that women are being discriminated against turns out to be driven simply by the fact that more women are at a lower level of responsibility. This of course leaves open the question as to why women are at a lower level of responsibility. The second reason for using linear regression is that it facilitates a comparison of more than two groups, whereas the two sample t-test can only handle two groups. However, the two sample t-test is commonly used in practice so it is important that you understand how it works.
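The fitted equation above comes from the course's compensation data, which are not reproduced here, but the mechanics are easy to sketch. The following Python code uses made-up salary, position, and gender values (and numpy's least squares routine rather than JMP) to fit a regression with a gender dummy coded +1 for females and −1 for males, which is the coding that produces the ±1.86 form of the equation above.

import numpy as np

# Hypothetical data: salary (in $000), position (level of responsibility),
# and gender coded +1 = Female, -1 = Male.
salary   = np.array([120, 135, 128, 150, 160, 142, 118, 155])
position = np.array([  1,   3,   2,   5,   7,   4,   1,   6])
gender   = np.array([ +1,  +1,  -1,  +1,  -1,  -1,  +1,  -1])

X = np.column_stack([np.ones_like(position), position, gender])
coef, *_ = np.linalg.lstsq(X, salary, rcond=None)
intercept, b_position, b_gender = coef
print(intercept, b_position, b_gender)   # intercept, slope for position, +/- shift for gender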

7.4.6 Summary

• Surveys: Random selection ensures the survey is representative. Randomized surveys can generalize their results to the population.

• Experiments: Random treatment assignment prevents lurking variables from interfering with causal conclusions. Randomized experiments allow you to conclude that differences in the outcome variable are caused by different treatments.

• Observational Studies: No randomization is possible. Control for as many things as you can to silence your critics. If a relationship persists, perhaps it's real. Do an experiment (if possible/ethical) to verify.


Congratulations!!!

You've completed statistics (unless you skipped to the back to see how it all ends, in which case: get back to work)!

You may be wondering: what happens next? No, we mean after the drinking binge. Statistics classes tend to evoke one of two reactions: either you're praying that you never see this stuff again, or you're intrigued and would like to learn more.

If you're in the first camp I've got some bad news for you. Be prepared to see regression analysis used in several of your remaining Core classes and second year electives. The good news is that all your hard work learning the material here means that these other classes won't look nearly as frightening. For those of you who found this course interesting, you should know that it was a REAL statistics course. It qualifies you to go compete with the Whartons and Chicagos of the world for quantitative jobs and summer internships.

If you want to see more, here are some courses you should consider.

Data Mining: Arif Ansari Data mining is the process of automating information discovery. The course is focused on developing a thorough understanding of how business data can be efficiently stored and analyzed to generate valuable business information. Business applications are emphasized.

The amount of data collected is growing at a phenomenal rate. The users of the data are expecting more sophisticated information from them. A marketing manager is no longer satisfied with a simple listing of marketing contacts, but wants detailed information about customers' past purchases as well as predictions of future purchases. Simple structured query language (SQL) queries are not adequate to support these increased demands for information. Data mining steps in to solve these needs.

In this course you will learn the various techniques used in data mining, such as decision trees, neural networks, CART, and association rules. This course gives you hands-on experience on how to apply data mining techniques to real world business problems.


Data Mining is especially useful to a marketing organization, because it allows you to profile customers to a level not possible before. Distributors of mass mailers today generally all use data mining tools. In a few years data mining will be a requirement of marketing organizations.

IOM 522 - Time Series Analysis for Forecasting Professor Delores Conway, winner of the 1998 University Associates award for excellence in teaching.

Forecasts of consumer demand, corporate revenues, earnings, capital expenditures, and other items are essential for marketing, finance, accounting and operations. This course emphasizes the usefulness of regression, smoothing and Box-Jenkins forecasting procedures for analyzing time series data and developing forecasts. Topics include the concept of stationarity, autoregressive and moving average models, identification and estimation of models, prediction and assessment of model forecasts, seasonal models, and intervention analysis. Students obtain practical experience using ForecastX (a state-of-the-art, Excel based package) to analyze data and develop actual forecasts. The analytical skills learned from the class are sophisticated and marketable, with wide application.


Appendix A

JMP Cheat Sheet

This guide will tell you the JMP commands for implementing the techniques discussed in class. This document is not intended to explain statistical concepts or give detailed descriptions of JMP output. Therefore, don't worry if you come across an unfamiliar term. If you don't recognize a term like “variance inflation factor” then we probably just haven't gotten that far in class.

A.1 Get familiar with JMP.

You should familiarize yourself with the basic features of JMP by reading the first three chapters of the JMP manual. You don't have to get every detail, especially on all the Formula Editor functions, but you should get the main idea about how stuff works. Some of the things you should be able to do are:

• Open a JMP data table.

• Understand the basic structure of data tables. Rows are observations, or sample units. Columns are variables, or pieces of information about each observation.

• Identify the modeling type (continuous, nominal, ordinal) of each variable. Be able to change the modeling type if needed.

• Make a new variable in the data table using JMP's formula editor. For example, take the log of a variable.

• Use JMP's tools to copy a table or graph and paste it into Word (or your favorite word processor).

• Use JMP's online help system to answer questions for you.

A.2 Generally Neat Tricks

A.2.1 Dynamic Graphics

JMP graphics are dynamic, which means you can select an item or set of items in one graph and they will be selected in all other graphs and in the data table.

A.2.2 Including and Excluding Points

Sometimes you will want to determine the impact of a small number of points (maybe just a single point) on your analysis. You can exclude a point from an analysis by selecting the point in any graph or in the data table and choosing Rows ⇒ Exclude/Unexclude. Note that excluding a point will remove it from all future numerical calculations, but the point may still appear in graphs. To eliminate the selected point from graphs, select Rows ⇒ Hide/Unhide. You can re-admit excluded and/or hidden points by selecting them and choosing Rows ⇒ Exclude/Unexclude a second time. An easy way to select all excluded points is by double clicking the “Excluded” line in the lower left portion of the data table. You can also choose Rows ⇒ Row Selection ⇒ Select Excluded from the menu.

A.2.3 Taking a Subset of the Data

The easiest way to take a subset of the data is by selecting the observations you want to include in the subset and choosing Tables ⇒ Subset from the menu. For example, suppose you are working with a data set of CEO salaries and you only want to investigate CEOs in the finance industry. Make a histogram of the industry variable (a categorical variable in the data set), and click on the “Finance” histogram bar. Then choose Tables ⇒ Subset from the menu and a new data table will be created with just the finance CEOs.

A.2.4 Marking Points for Further Investigation

Sometimes you notice an unusual point in one graph and you want to see if the same point is unusual in other graphs as well. An easy way to do this is to select the point in the graph, and then right click on it. Choose “Markers” and select the plotting character you want for the point. The point will appear with the same plotting character in all other graphs.

A.2.5 Changing Preferences

There are many ways you can customize JMP. You can select different toolbars, set different defaults for the analysis platforms (distribution of Y, fit Y by X, etc.), choose a default location to search for data files, and several other things. Choose File ⇒ Preferences from the menu to explore all the things you can change.

A.2.6 Shift Clicking and Control Clicking

Sometimes you may want to select several variables from a list or select several points on a graph. You can accomplish this by holding down the shift or control keys as you make your selections. Note that shift and control clicking in JMP works just like it does in other Windows applications.

A.3 The Distribution of Y

All the following instructions assume you have launched the “Distribution of Y” analysis platform.

A.3.1 Continuous Data

By default you will see a histogram, boxplot, a list of quantiles, and a list of moments (mean, standard deviation, etc.) including a 95% confidence interval for the mean.

1. One Sample T Test. Click the little red triangle on the gray bar over the variable whose mean you want to do the T-test for. Select “Test Mean.” A dialog box pops up. Enter the mean you wish to use in the null hypothesis in the first field. Leave other fields blank. Click OK.


2. Normal Quantile Plot (or Q-Q Plot). Click the little red triangle on the gray bar over the variable you want the Q-Q plot for. Select “Normal Quantile Plot.”

A.3.2 Categorical Data

Sometimes categorical data are presented in terms of counts, instead of a big data set containing categorical information for each observation. To enter this type of data into JMP you will need to make two columns. The first is a categorical variable that lists the levels appearing in the data set. For example, the variable Race may contain the levels “Black, White, Asian, Hispanic, etc.” The second is a numerical list revealing how many times each level appeared in the data set. Suppose you name this variable counts. When you launch the Distribution of Y analysis platform, select Race as the Y variable and enter counts in the “Frequency” field.

A.4 Fit Y by X

All the following instructions assume you have launched the “Fit Y by X” analysis platform. The appropriate analysis will be determined for you by the modeling type (continuous, ordinal, or nominal) of the variables you select as Y and X.

A.4.1 The Two Sample T-Test (or One Way ANOVA).

In the Fit Y by X dialog box, select the variable whose means you wish to test as “Y.” Select the categorical variable identifying group membership as “X.” For example, to test whether a significant salary difference exists between men and women, select “Salary” as the Y variable and “Sex” as the X variable. A dotplot of the data will appear.

1. How to do a T-Test. Click on the little red triangle on the gray bar over the dotplot. Select “Means/ANOVA/T-test.”

2. Manipulating the Data. Sometimes the data in the data table must be manipulated into the form that JMP expects in order to do the two sample T test. The “Tables” menu contains two options that are sometimes useful. “Stack” will take two or more columns of data and stack them into a single column. “Split” will take a single column of data and split it into several columns.

3. Display Options. The “Display Options” sub-menu under the little red triangle controls the features of the dotplot. Use this menu to add side-by-side boxplots, means diamonds, to connect the means of each subgroup, or to limit over-plotting by adding random jitter to each point's X value.

A.4.2 Contingency Tables/Mosaic Plots

In the Fit Y by X dialog box select one categorical variable as Y and another as X. As far as the contingency table is concerned, it doesn't matter which is which. The mosaic plot will put the X variable along the horizontal axis and the Y variable on the vertical axis (as you would expect).


1. Entering Tables Directly Into JMP. Sometimes categorical data are presented in terms of counts, instead of a big data set containing categorical information for each observation. To enter this type of data into JMP you will need to make three columns. The first is a categorical variable that lists the levels appearing in the X variable. For example, the variable Race may contain the levels “Black, White, Asian, Hispanic, etc.” The second is a list of the levels in the Y variable. For example, the variable Sex may contain the levels “Male, Female.” Each level of the X variable must be paired with each level of the Y variable. For example, the first row in the data table might be “Black, Female.” The second row might be “Black, Male.” The final column is a numerical list revealing how many times each combination of levels appeared in the data set. Suppose you name this variable counts. When you launch the Fit Y by X analysis platform, select Race as the X variable, Sex as the Y variable, and enter counts in the “Frequency” field.

2. Display Options for a Contingency Table. By default, contingency tables show counts, total %, row %, and column %. You can add or remove listings from cells of the contingency table by using the little red triangle in the gray bar above the table.

A.4.3 Simple Regression

In the Fit Y by X dialog box select the continuous variable you want to explain as Y and the continuous variable you want to use to do the explaining as X. A scatterplot will appear. The commands listed below all begin by choosing an option from the little red triangle on the gray bar above the scatterplot.

1. Fitting Regression Lines. Select “Fit Line” from the little red triangle.

2. Fitting Non-Linear Regressions

(a) Transformations. Choose “Fit Special” from the little red triangle. A dialog box appears. Select the transformation you want to use for Y, and for X. Click Okay. If the transformation you want to use does not appear on the menu you will have to do it “by hand” using JMP's formula editor. Simply create a new column in the data set and fill the new column with the transformed data. Then use this new column as the appropriate X or Y in a linear regression.

(b) Polynomials. Choose “Fit Polynomial” from the little red triangle. You should only use degree greater than two if you have a strong theoretical reason to do so.

(c) Splines. Choose “Fit Spline” from the little red triangle. You will have to specify how wiggly a spline you want to see. Trial and error is the best way to do this.

3. The Regression Manipulation Menu. You may want to ask JMP for more details about your regression after it has been fit. Each regression you fit causes an additional little red triangle to appear below the scatterplot. Use this little red triangle to:

• Save residuals and predicted values

• Plot confidence and prediction curves

• Plot residuals


A.4.4 Logistic Regression

In the Fit Y by X dialog box choose a nominal variable as Y and a continuous variable as X. There are no special options for you to manipulate. You have more options when you fit a logistic regression using the Fit Model platform.

A.5 Multivariate

Launch the multivariate platform and select the continuous variables you want to examine.

1. Correlation Matrix. You should see this by default.

2. Covariance Matrix. You have to ask for this using the little red triangle.

3. Scatterplot Matrix. You may or may not see this by default. You can add or remove the scatterplot matrix using the little red triangle. You read each plot in the scatterplot matrix as you would any other scatterplot. To determine the axes of the scatterplot matrix you must examine the diagonal of the matrix. The column the plot is in determines the X axis, while the plot's row in the matrix determines the Y axis. The ellipses in each plot would contain about 95% of the data if both X and Y were normally distributed. Skinny, tilted ellipses are a graphical depiction of a strong correlation. Ellipses that are almost circles are a graphical depiction of weak correlation.

A.6 Fit Model (i.e. Multiple Regression)

The Fit Model platform is what you use to run sophisticated regression models with several X variables.

A.6.1 Running a Regression

Choose the Y variable and the X variables you want to consider in the Fit Model dialog box. Then click “Run Model.”

A.6.2 Once the Regression is Run

1. Variance Inflation Factors. Right click on the Parameter Estimates box and choose Columns ⇒ VIF in the menu that appears.

2. Confidence Intervals for Individual Coefficients. Right click on the Parameter Estimates box and choose Columns ⇒ Lower 95% in the menu that appears. Repeat to get the Upper 95%, completing the interval.

3. Save Columns. This menu lives under the big gray bar governing the whole regression. Options under this menu will save a new column to your data table. Use this menu to obtain:

• Cook’s Distance

• Leverage (or “Hats”)

• Saving Residuals

• Saving Predicted (or “Fitted”) Values


• Confidence and Prediction Intervals

This menu is especially useful for making new predictions from your regression. Before you fit your model, add a row of data including the X variables you want to use in your prediction. Leave the Y variable blank. When you save predicted values and intervals JMP should save them to the row of “fake data” as well.

4. Row Diagnostics. This menu lives under the big gray bar governing the whole regression. Options under this menu will add new tables or graphs to your regression output. Use this menu to obtain:

• Residual Plot (Residuals by Predicted Values)

• The Durbin Watson Statistic

• Plotting Residuals by Row

5. Expanded Estimates. The expanded estimates box reveals the same information as the parameter estimates box, but categorical variables are expanded to include the default categories. To obtain the expanded estimates box select Estimates ⇒ Expanded Estimates from the little red triangle on the big gray bar governing the whole regression.

A.6.3 Including Interactions and Quadratic Terms

To include a variable as a quadratic, go to the Fit Model dialog box. Select the variable you want to include as a quadratic. Then click on the “Macros” menu and select “Polynomial to Degree.” The “Degree” field under the Macros button controls the degree of the polynomial. The degree field shows “2” by default, so the Polynomial to Degree macro will create a quadratic.

There are two basic ways to include an interaction term. The first is to select the two variables you want to include as an interaction (perhaps by holding down the Shift or Control key) and hitting the “Cross” button. The second is to select two or more variables that you want to use in an interaction (using the Shift or Control key) and select Macros ⇒ Factorial to Degree.

A.6.4 Contrasts

To test a contrast in a multiple regression, go to the leverage plot for the categorical variable whose levels you want to test. Click “LS Means Contrast.” In the table that pops up click the +/− signs next to the levels you want to test until you get the weights you want. Then click “Done” to compute the results of the test.

A.6.5 To Run a Stepwise Regression

In the Fit Model dialog box, change the “Personality” to “Stepwise.” Enter all the X variables you wish to consider (including any interactions; see the instructions on interactions given above). Then click “Run Model.”

The Stepwise Regression dialog box appears. Change “Direction” to “Mixed” and make the probability to enter and the probability to leave small numbers. The probability to enter should be less than or equal to the probability to leave. Then click “Go.” When JMP settles on a model, click “Make Model” to get to the familiar regression dialog box. You may want to change “Personality” to “Effect Leverage” (if necessary) so that you will get the leverage plots.


One of the odd things about JMP's stepwise regression procedure is that it creates an unusual coding scheme for dummy variables. Suppose you have a categorical variable called color, with levels Red, Blue, Green, and Yellow. JMP's stepwise procedure may create a variable named something like color[Red&Yellow-Blue&Green]. This variable assumes the value 1 if the color is red or yellow. It assumes the value -1 if the color is blue or green. This type of dummy variable compares colors that are either red or yellow to colors that are either blue or green.

A.6.6 Logistic Regression

Logistic regression with several X variables works just like regression with several X variables. Just choose a binary (categorical) variable as Y in the Fit Model dialog box. Check the little red triangle on the big gray bar to see the options you have for logistic regression. You can save the following for each item in your data set: the probability an observation with the observed X would fall in each level of Y, the value of the linear predictor, and the most likely level of Y for an observation with those X values.


Appendix B

Some Useful Excel Commands

Excel contains functions for calculating probabilities from the normal, T, χ2 and F distributions. Each distribution also has an inverse function. You use the regular function when you have a potential value for the random variable and you want to compute a probability. You use the inverse function when you have a probability and you want to know the value to which it corresponds.

Normal Distribution

Normdist(x, mean, sd, cumulative) If cumulative is TRUE this function returns the probability that a normal random variable with the given mean and standard deviation is less than x.

If cumulative is FALSE, this function gives the height of the normal curve evaluated at x.

• Example: The salaries of workers in a factory are normally distributed with mean $40,000 and standard deviation $7,500. What is the probability that a randomly chosen worker from the factory makes less than $53,000?
  Normdist(53000, 40000, 7500, TRUE).

Norminv(p, mean, sd) Returns the p'th quantile of the specified normal distribution. That is, it returns a value x such that a normal random variable has probability p of being less than x.

• Example: In the factory described above find the 25th and 75th salary percentiles.
  25th: Norminv(.25, 40000, 7500). 75th: Norminv(.75, 40000, 7500).
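If you prefer Python to Excel, the same two calculations can be done with scipy's normal distribution functions (a sketch; scipy is not part of the course software):

from scipy.stats import norm

print(norm.cdf(53000, loc=40000, scale=7500))   # same as Normdist(53000, 40000, 7500, TRUE)
print(norm.ppf(0.25, loc=40000, scale=7500))    # same as Norminv(.25, 40000, 7500)
print(norm.ppf(0.75, loc=40000, scale=7500))    # same as Norminv(.75, 40000, 7500)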


T-Distribution

Tdist(t, df, tails) The tails argument is either 1 or 2. If tails=1 then this function returns the probability that a T random variable with df degrees of freedom is greater than t. For reasons known only to Bill Gates, the t argument cannot be negative. Note that you have to standardize the t statistic yourself before using this function, as there are no mean and sd arguments like there are in Normdist.

• Example: For a test of H0 : µ = 3 vs. Ha : µ ≠ 3 we get a t statistic of 1.76. There are 73 observations in the data set. What is the p-value?
  Tdist(1.76, 72, 2)   (df = 73 − 1 and tails = 2 because it is a two tailed test.)

• Example: For a test of H0 : µ = 3 vs. Ha : µ ≠ 3 we get a t statistic of -1.76. There are 73 observations in the data set. What is the p-value?
  Tdist(1.76, 72, 2)   (The T distribution is symmetric so ignoring the negative sign makes no difference.)

• Example: For a test of H0 : µ = 3 vs. Ha : µ > 3 we get a t statistic of 1.76. There are 73 observations in the data set. What is the p-value?
  Tdist(1.76, 72, 1)   (The p-value here is the probability above 1.76.)

• Example: For a test of H0 : µ = 3 vs. Ha : µ > 3 we get a t statistic of -1.76. There are 73 observations in the data set. What is the p-value?
  =1-Tdist(1.76, 72, 1)   (The p-value here is the probability below 1.76.)

Tinv(p, df) Returns the value of t that you would need to see in a two tailed test to get a p-value of p.

• Example: We have 73 observations. How large a t statistic would we have to see to reject a two tailed test at the .13 level?
  Tinv(.13, 72)

• Example: With 73 observations, what t would give us a p-value of .05 for the test H0 : µ = 17 vs. Ha : µ > 17?
  Because of the alternative hypothesis, the p-value is the area to the right of t. Thus the answer here is the value of t that would give a p-value of .10 on a two tailed test: Tinv(.10, 72).
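The corresponding scipy calls (again a sketch, assuming scipy is available) make the one-tailed versus two-tailed bookkeeping explicit:

from scipy.stats import t

df = 72
print(2 * t.sf(1.76, df))      # two tailed p-value, like Tdist(1.76, 72, 2)
print(t.sf(1.76, df))          # one tailed p-value, like Tdist(1.76, 72, 1)
print(t.sf(-1.76, df))         # one tailed p-value for t = -1.76, like 1 - Tdist(1.76, 72, 1)
print(t.ppf(1 - 0.05, df))     # t needed for a one tailed p-value of .05, like Tinv(.10, 72)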

χ2 (chi-square) distribution

Chidist(x, df) Returns the probability to the right of x on the chi-square distribution with the specified degrees of freedom.

• Example: A χ2 test statistic turns out to be 12.7 on 9 degrees of freedom. What is the p-value?
  Chidist(12.7, 9)

ChiInv(p, df) p is the probability in the right tail of the χ2 distribution. This function returns the value of the corresponding χ2 statistic.


• Example: In a χ2 test on 9 degrees of freedom, how large must the test statistic be in order to get a p-value of .02?
  ChiInv(.02, 9)

F Distribution

Fdist(F, NumDF, DenomDF) Returns the p-value from the F distribution with NumDF degrees of freedom in the numerator and DenomDF in the denominator.

• Example: If F = 12.7, the numerator df = 3 and the denominator df = 102, what is the p-value?
  Fdist(12.7, 3, 102)

Finv(p, NumDF, DenomDF) Returns the value of the F statistic needed to achieve a p-value of p.

• Example: If there were 3 numerator and 102 denominator degrees of freedom, how large an F statistic would be needed to get a p-value of .05?
  Finv(.05, 3, 102)


Appendix C

The Greek Alphabet

lower case   upper case   letter
α            A            alpha
β            B            beta
γ            Γ            gamma
δ            ∆            delta
ε            E            epsilon
ζ            Z            zeta
η            H            eta
θ            Θ            theta
ι            I            iota
κ            K            kappa
λ            Λ            lambda
µ            M            mu
ν            N            nu
ξ            Ξ            xi
o            O            omicron
π            Π            pi
ρ            P            rho
σ            Σ            sigma
τ            T            tau
υ            Υ            upsilon
φ            Φ            phi
χ            X            chi
ψ            Ψ            psi
ω            Ω            omega


Appendix D

Tables


D.1 Normal Table

0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09

0.00 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359

0.10 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753

0.20 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141

0.30 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517

0.40 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879

0.50 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224

0.60 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549

0.70 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852

0.80 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133

0.90 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389

1.00 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621

1.10 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830

1.20 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015

1.30 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177

1.40 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319

1.50 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441

1.60 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545

1.70 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633

1.80 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706

1.90 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767

2.00 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817

2.10 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857

2.20 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890

2.30 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916

2.40 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936

2.50 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952

2.60 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964

2.70 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974

2.80 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981

2.90 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986

3.00 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990

3.10 0.9990 0.9991 0.9991 0.9991 0.9992 0.9992 0.9992 0.9992 0.9993 0.9993

3.20 0.9993 0.9993 0.9994 0.9994 0.9994 0.9994 0.9994 0.9995 0.9995 0.9995

3.30 0.9995 0.9995 0.9995 0.9996 0.9996 0.9996 0.9996 0.9996 0.9996 0.9997

3.40 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9998

3.50 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998

The body of the table contains the probability that a N(0, 1) random variable is less than z. The left margin of the table contains the first two digits of z. The top row of the table contains the third digit of z.
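For example, to find Pr(Z < 1.23), go to the row labeled 1.20 and the column labeled 0.03; the entry there is 0.8907, so Pr(Z < 1.23) = 0.8907.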

[Figure: the standard normal density curve, plotted for Z between −4 and 4.]


D.2 Quick and Dirty Normal Table

z Pr(Z<z) z Pr(Z<z) z Pr(Z<z) z Pr(Z<z)

-4.00 0.0000 | -2.00 0.0228 | 0.00 0.5000 | 2.00 0.9772

-3.95 0.0000 | -1.95 0.0256 | 0.05 0.5199 | 2.05 0.9798

-3.90 0.0000 | -1.90 0.0287 | 0.10 0.5398 | 2.10 0.9821

-3.85 0.0001 | -1.85 0.0322 | 0.15 0.5596 | 2.15 0.9842

-3.80 0.0001 | -1.80 0.0359 | 0.20 0.5793 | 2.20 0.9861

-3.75 0.0001 | -1.75 0.0401 | 0.25 0.5987 | 2.25 0.9878

-3.70 0.0001 | -1.70 0.0446 | 0.30 0.6179 | 2.30 0.9893

-3.65 0.0001 | -1.65 0.0495 | 0.35 0.6368 | 2.35 0.9906

-3.60 0.0002 | -1.60 0.0548 | 0.40 0.6554 | 2.40 0.9918

-3.55 0.0002 | -1.55 0.0606 | 0.45 0.6736 | 2.45 0.9929

-3.50 0.0002 | -1.50 0.0668 | 0.50 0.6915 | 2.50 0.9938

-3.45 0.0003 | -1.45 0.0735 | 0.55 0.7088 | 2.55 0.9946

-3.40 0.0003 | -1.40 0.0808 | 0.60 0.7257 | 2.60 0.9953

-3.35 0.0004 | -1.35 0.0885 | 0.65 0.7422 | 2.65 0.9960

-3.30 0.0005 | -1.30 0.0968 | 0.70 0.7580 | 2.70 0.9965

-3.25 0.0006 | -1.25 0.1056 | 0.75 0.7734 | 2.75 0.9970

-3.20 0.0007 | -1.20 0.1151 | 0.80 0.7881 | 2.80 0.9974

-3.15 0.0008 | -1.15 0.1251 | 0.85 0.8023 | 2.85 0.9978

-3.10 0.0010 | -1.10 0.1357 | 0.90 0.8159 | 2.90 0.9981

-3.05 0.0011 | -1.05 0.1469 | 0.95 0.8289 | 2.95 0.9984

-3.00 0.0013 | -1.00 0.1587 | 1.00 0.8413 | 3.00 0.9987

-2.95 0.0016 | -0.95 0.1711 | 1.05 0.8531 | 3.05 0.9989

-2.90 0.0019 | -0.90 0.1841 | 1.10 0.8643 | 3.10 0.9990

-2.85 0.0022 | -0.85 0.1977 | 1.15 0.8749 | 3.15 0.9992

-2.80 0.0026 | -0.80 0.2119 | 1.20 0.8849 | 3.20 0.9993

-2.75 0.0030 | -0.75 0.2266 | 1.25 0.8944 | 3.25 0.9994

-2.70 0.0035 | -0.70 0.2420 | 1.30 0.9032 | 3.30 0.9995

-2.65 0.0040 | -0.65 0.2578 | 1.35 0.9115 | 3.35 0.9996

-2.60 0.0047 | -0.60 0.2743 | 1.40 0.9192 | 3.40 0.9997

-2.55 0.0054 | -0.55 0.2912 | 1.45 0.9265 | 3.45 0.9997

-2.50 0.0062 | -0.50 0.3085 | 1.50 0.9332 | 3.50 0.9998

-2.45 0.0071 | -0.45 0.3264 | 1.55 0.9394 | 3.55 0.9998

-2.40 0.0082 | -0.40 0.3446 | 1.60 0.9452 | 3.60 0.9998

-2.35 0.0094 | -0.35 0.3632 | 1.65 0.9505 | 3.65 0.9999

-2.30 0.0107 | -0.30 0.3821 | 1.70 0.9554 | 3.70 0.9999

-2.25 0.0122 | -0.25 0.4013 | 1.75 0.9599 | 3.75 0.9999

-2.20 0.0139 | -0.20 0.4207 | 1.80 0.9641 | 3.80 0.9999

-2.15 0.0158 | -0.15 0.4404 | 1.85 0.9678 | 3.85 0.9999

-2.10 0.0179 | -0.10 0.4602 | 1.90 0.9713 | 3.90 1.0000

-2.05 0.0202 | -0.05 0.4801 | 1.95 0.9744 | 3.95 1.0000

The table gives the probability that a N(0, 1) random variable is less than z. It is less precise than Table D.1 because z increments by .05 rather than .01, but it may be easier to use.


D.3 Cook’s Distance

                                     Sample Size
Number of            10                         100                        1000
Params      Odd  Surprising Shocking    Odd  Surprising Shocking    Odd  Surprising Shocking

1 0.0166 0.0677 0.4897 0.0159 0.0645 0.4583 0.0158 0.0642 0.4553

2 0.1065 0.2282 0.7435 0.1055 0.2236 0.698 0.1054 0.2232 0.6936

3 0.1912 0.3357 0.8451 0.1944 0.3351 0.7941 0.1948 0.3351 0.7892

4 0.2551 0.4066 0.8988 0.2647 0.4115 0.8449 0.2658 0.4121 0.8397

5 0.3033 0.4563 0.9319 0.3199 0.467 0.8762 0.3218 0.4684 0.8709

6 0.3405 0.4931 0.9544 0.3641 0.5094 0.8974 0.367 0.5114 0.892

7 0.37 0.5215 0.9705 0.4005 0.5429 0.9127 0.4043 0.5457 0.9072

8 0.394 0.544 0.9828 0.4309 0.5703 0.9242 0.4356 0.5738 0.9186

9 0.4139 0.5623 0.9923 0.4568 0.5931 0.9332 0.4625 0.5973 0.9276

10 0.4306 0.5775 1 0.4792 0.6125 0.9405 0.4858 0.6173 0.9348

11 0.4448 0.5903 1.0063 0.4987 0.6292 0.9464 0.5062 0.6347 0.9407

12 0.4571 0.6013 1.0116 0.516 0.6438 0.9514 0.5244 0.6499 0.9457

13 0.4678 0.6108 1.016 0.5314 0.6567 0.9556 0.5406 0.6634 0.9498

14 0.4772 0.6191 1.0199 0.5453 0.6681 0.9592 0.5552 0.6754 0.9534

15 0.4856 0.6264 1.0232 0.5578 0.6784 0.9624 0.5685 0.6862 0.9566

16 0.4931 0.6329 1.0261 0.5691 0.6876 0.9651 0.5807 0.696 0.9593

17 0.4998 0.6387 1.0287 0.5795 0.6961 0.9675 0.5918 0.705 0.9617

18 0.5058 0.644 1.031 0.5891 0.7037 0.9697 0.6021 0.7132 0.9639

19 0.5113 0.6487 1.0331 0.5979 0.7108 0.9716 0.6116 0.7207 0.9658

20 0.5163 0.653 1.0349 0.606 0.7173 0.9734 0.6204 0.7277 0.9675

21 0.5209 0.6569 1.0366 0.6136 0.7233 0.9749 0.6287 0.7342 0.9691

22 0.5251 0.6606 1.0381 0.6206 0.7289 0.9764 0.6364 0.7402 0.9705

23 0.529 0.6639 1.0395 0.6272 0.7341 0.9777 0.6436 0.7458 0.9718

24 0.5326 0.6669 1.0408 0.6334 0.7389 0.9789 0.6504 0.7511 0.973

25 0.536 0.6698 1.042 0.6392 0.7434 0.98 0.6568 0.7561 0.9741

26 0.5391 0.6724 1.0431 0.6446 0.7477 0.981 0.6629 0.7607 0.9751

27 0.542 0.6748 1.0441 0.6498 0.7517 0.982 0.6686 0.7651 0.9761

28 0.5447 0.6771 1.045 0.6546 0.7555 0.9828 0.674 0.7693 0.9769

29 0.5472 0.6793 1.0459 0.6592 0.759 0.9837 0.6792 0.7733 0.9778

30 0.5496 0.6813 1.0467 0.6636 0.7624 0.9844 0.6841 0.777 0.9785

The numbers in the table are approximate cutoff values for Cook's distances with the specified number of model parameters (intercept + number of slopes), and nearest sample size.

An “odd” point is one about which you are mildly curious. A “shocking” point clearly influences the fitted regression.


D.4 Chi-Square Table

              Probability
DF      0.9     0.95     0.99     0.999     0.9999

1 2.71 3.84 6.63 10.83 15.13

2 4.61 5.99 9.21 13.82 18.42

3 6.25 7.81 11.34 16.27 21.10

4 7.78 9.49 13.28 18.47 23.51

5 9.24 11.07 15.09 20.51 25.75

6 10.64 12.59 16.81 22.46 27.85

7 12.02 14.07 18.48 24.32 29.88

8 13.36 15.51 20.09 26.12 31.83

9 14.68 16.92 21.67 27.88 33.72

10 15.99 18.31 23.21 29.59 35.56

11 17.28 19.68 24.73 31.26 37.36

12 18.55 21.03 26.22 32.91 39.13

13 19.81 22.36 27.69 34.53 40.87

14 21.06 23.68 29.14 36.12 42.58

15 22.31 25.00 30.58 37.70 44.26

16 23.54 26.30 32.00 39.25 45.93

17 24.77 27.59 33.41 40.79 47.56

18 25.99 28.87 34.81 42.31 49.19

19 27.20 30.14 36.19 43.82 50.79

20 28.41 31.41 37.57 45.31 52.38

21 29.62 32.67 38.93 46.80 53.96

22 30.81 33.92 40.29 48.27 55.52

23 32.01 35.17 41.64 49.73 57.07

24 33.20 36.42 42.98 51.18 58.61

25 34.38 37.65 44.31 52.62 60.14

26 35.56 38.89 45.64 54.05 61.67

27 36.74 40.11 46.96 55.48 63.17

28 37.92 41.34 48.28 56.89 64.66

29 39.09 42.56 49.59 58.30 66.15

30 40.26 43.77 50.89 59.70 67.62

40 51.81 55.76 63.69 73.40 82.06

50 63.17 67.50 76.15 86.66 95.97

60 74.40 79.08 88.38 99.61 109.50

70 85.53 90.53 100.43 112.32 122.74

80 96.58 101.88 112.33 124.84 135.77

90 107.57 113.15 124.12 137.21 148.62

100 118.50 124.34 135.81 149.45 161.33

The table shows the value that a chi-square random variable must attain so that the specified amount of probability lies to its left.
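For example, with 9 degrees of freedom the entry in the 0.95 column is 16.92, so a chi-square statistic must exceed 16.92 before its p-value drops below .05; the statistic of 12.7 from the Chidist example in Appendix B falls short of even the 0.9 cutoff (14.68), so its p-value is larger than .10.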

[Figure: a chi-square density curve; the horizontal axis is labeled “Chi Square (Number in Table Body)” and the vertical axis “Probability.”]


Bibliography

Albright, S. C., Winston, W. L., and Zappe, C. (2004). Data Analysis for Managers with Microsoft Excel, 2nd Edition. Brooks/Cole–Thomson Learning.

Foster, D. P., Stine, R. A., and Waterman, R. P. (1998). Basic Business Statistics. Springer.


Index

added variable plot, 127
alternative hypothesis, 72
ANOVA table, 121
autocorrelation, 56, 106, 107
autocorrelation function, 56
autoregression, 107

bimodal, 9
Bonferroni adjustment, 152
Box-Cox transformations, 99
boxplot, 7

categorical, 2
central limit theorem, 64, 66, 80
chi square test, 82
collinearity, 123, 133–135
conditional probability, 22
conditional proportions, 11
confidence interval, 126, 158
  for a mean, 65, 67
  for a proportion, 80
  for the regression line, 94, 126
  for the regression slope, 92
contingency table, 2
continuous, 2
contrast, 144
Cook's distance, 132
correlation, 52
covariance, 50
covariance matrix, 51

decision theory, 45
discreteness, 9
dummy variable, 138
dummy variables, 80

empirical rule, 8
expected value, 33
extrapolation, 96

factor, 137
fat tails, 9
fields, 1
first order effects, 146
frequency table, 2

heteroscedasticity, 104
hierarchical principle, 150
high leverage point, 110
histogram, 3, 7, 13

independence, 29
indicator variables, 80
influential point, 110
interaction, 145
interpolation, 96

joint distribution, 20
joint proportion, 11

lag variable, 107
levels, 2, 137
leverage plot, 127
linear regression model, 20

main effects, 146
margin of error, 70


marginal distribution, 12, 21

market segments, 45

Markov chain, 31

Markov dependence, 29

model sum of squares, 120

moments, 4

mosaic plots, 3

multiple comparisons, 152

nominal variable, 2

normal distribution, 8, 20, 37

normal quantile plot, 9, 41

null hypothesis, 71

observation, 1

ordinal variable, 2

outlier, 110

outliers, 6

p-value, 73

parameters, 62

point estimate, 69

population, 61

prediction interval, 94, 126

Q-Q plot, see normal quantile plot

quantile-quantile plot, see normal quantile plot

quantiles, 4

quartiles, 6

random variable, 18, 35, 38, 39, 51, 63, 64, 66, 88

randomization, 165, 167, 171

in experiments, 170, 173

in surveys, 166, 167, 173

records, 1

regression assumptions, 97

relative frequencies, 2

residual plot, 97, 98, 104, 106, 130

residuals, 93, 96, 98, 104, 109, 122, 127, 130, 160–162

reward matrix, 46
risk profile, 47

sample, 61
sampling distribution, 63
scatterplot, 13
simple random sample, 62
skewed, 9
slope, 87, 88, 91
SSE, 120
SSM, 120
SST, 120
standard error, 91
standard deviation, 5, 35, 37–39, 64, 66, 67, 93
standard error, 64–66, 96, 158
statistics, 62
stepwise regression, 152

t distribution, 67
t-statistic, 73
t-test
  for a regression coefficient, 91, 123
  one sample, 76
  paired, 79
  two-sample, 140
test statistic, 72
trend
  in a time series, 106
Tukey's bulging rule, 99

variable, 1
variance, 34–36, 51, 52, 54, 63, 64, 97, 122, 126, 127, 166, 167
  non-constant, 104
  of a random variable, 34
  residual, 120
variance inflation factor, 134
VIF, see variance inflation factor


z-score, 38