L08 Over Fitting


Transcript of L08 Over Fitting

Page 1: L08 Over Fitting

Fall 2009

Copyright Robert A Stine Revised 10/8/09

Statistics 622 Module 8 Avoiding Over-Confidence

OVERVIEW ..... 2
FROM PREVIOUS CLASSES ..... 3
OVER-FITTING ..... 4
AN EXAMPLE OF OVER-FITTING (NYSE_2003.JMP) ..... 5
VISUALIZATION ..... 9
COMMON SENSE TEST ..... 10
EXAMPLE OF PREDICTING STOCKS ..... 11
WHAT ARE THOSE OTHER PREDICTORS? ..... 12
PROTECTION FROM OVER-FITTING ..... 13
BONFERRONI = RIGHT ANSWER + ADDED BONUS ..... 15
OTHER APPLICATIONS OF THE BONFERRONI RULE ..... 16
DETECTING OVER-FITTING WITH A VALIDATION SAMPLE ..... 17
CONTROLLING STEPWISE WITH A VALIDATION SAMPLE (BLOCK.JMP) ..... 19
BACK TO BUSINESS ..... 23
APPENDIX: BONFERRONI METHOD ..... 25
THE BONFERRONI INEQUALITY ..... 25
USE IN MODEL SELECTION ..... 25
BONFERRONI RULE FOR P-VALUES ..... 26
IT'S REALLY PRETTY GOOD ..... 26

Page 2: L08 Over Fitting


Overview Stepwise models

Select most predictive features from a list that you provide of candidate features, incrementally improving the fit of the model by as much as possible at each step. When automated, the search continues so long as the feature improves the model enough as gauged by its p-value.
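The notes drive JMP's stepwise platform; purely as an illustration of the mechanics described above (not the course's tool), here is a minimal forward-stepwise sketch in Python. The function name, the statsmodels-based p-values, and the default threshold are my own choices.

```python
# Minimal sketch of forward stepwise selection by p-value. This is only an
# illustration of the mechanics; the course itself uses JMP's stepwise platform.
import pandas as pd
import statsmodels.api as sm

def forward_stepwise(X: pd.DataFrame, y, prob_to_enter: float = 0.05):
    """Greedily add the candidate with the smallest p-value until no
    remaining candidate has p-value < prob_to_enter."""
    selected, remaining = [], list(X.columns)
    while remaining:
        pvalues = {}
        for col in remaining:
            design = sm.add_constant(X[selected + [col]])
            pvalues[col] = sm.OLS(y, design).fit().pvalues[col]
        best = min(pvalues, key=pvalues.get)
        if pvalues[best] >= prob_to_enter:
            break                      # nothing left enters "easily" enough
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a generous prob_to_enter this greedy loop keeps adding predictors; the rest of the module is about how to choose that threshold.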

Over-fitting1 If the search is allowed to choose predictors too “easily”, stepwise selection will identify predictors that ought not be in the model, producing an artificially good fit when in fact the model has been getting worse and worse.

Bonferroni rule The Bonferroni rule lets us halt the search without having to set aside a validation sample, allowing us to use all the data for finding a predictive model rather than a subset. Though automatic, you should still use your knowledge of the context to offer more informed choices of features to consider for the modeling.

1 For another example of over-fitting when modeling stock returns, see BAUR pages 220-227.

Page 3: L08 Over Fitting


From previous classes… Cost of uncertainty

An accurate estimate of mean demand improves profits. This suggests that we should use more predictors in models, including more combinations of features that capture synergies among the features (interactions).

Stepwise regression Automates the tedious process of working through the various interactions and other candidate features.

Problem: Over-confidence The combination of

Desire for more accurate predictions
+ Automated searches that maximize fitted R2

creates the possibility that our predictions are not as accurate as we think. Over-fitting results when the modeling process leads us to build a model that captures random patterns in the data that will not be present when predicting new cases. The fit of the model looks better on paper than it does in reality.

Other situations with over-confidence Subjective confidence intervals; the winner's curse in auctions.

Two methods for recognizing and avoiding over-fitting
Bonferroni p-values, which do not require the use of a validation sample in order to test the model.
Cross-validation, which requires setting aside data to test the fit of a model.

Page 4: L08 Over Fitting


Over-fitting False optimism

Is your model as good as it claims? Or, has your hard work to improve its fit to the data exaggerated its accuracy? When we use the same data to both fit and evaluate a model, we get an “optimistic” impression of how well the model predicts. This process that leads to an exaggerated sense of accuracy is known as over-fitting. When a model has been over-fit, predictors that appear significant from the output do not in fact improve the model’s ability to predict the new cases. Perhaps many of the predictors that are in a model have arrived by chance alone because we have considered so many possible models.

Over-fitting Adding features to a model that improve its fit to the observed data, but that degrade the ability of the model to predict new cases. Iterative refinement of a model (either manually or by an automated algorithm) in order to improve the usual summaries (e.g., R2 and p-values) typically produces a better fit to the observed data used to pick the predictors than will be obtained when predicting new data. No good deed goes unpunished!

It’s the process, not the model Over-fitting does not happen if we pick a large group of predictors and simply fit one big model, without iteratively trying to improve its fit.

“Optimization capitalizes on chance”

Page 5: L08 Over Fitting


An Example of Over-fitting (nyse_2003.jmp) Stock market analysis

Over-fitting is common in domains in which there is a lot of pressure to obtain accurate predictions, as in the case of predicting the direction of the stock market.
Data: daily returns on the NYSE composite index in October and November 2003.
Objective: Build a model to predict what will happen in December 2003, using a battery of 12 trading rules (labeled X1 to X12). These are a few very basic technical trading rules.

Model selection criteria Many numerical criteria have been proposed as alternatives to maximizing R2 for judging the quality of a model. This table lists several well-known criteria. To use these in forward stepwise, control the forward search by using these "Prob-to-enter" values.

Name          Prob-to-Enter   Approximate t-stat for inclusion   Idea
Adjusted R2   0.33            |t| > 1                            Decrease RMSE
AIC, Cp       0.16            |t| > √2                           Unbiased estimate of prediction accuracy
BIC           Depends on n    |t| > √(log n)                     Bayesian probability
Bonferroni    1/m             |t| > √(2 log m)                   Minimize worst case, family-wide error rate
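As a quick check of the correspondence between "Prob to enter" values and |t| cutoffs in this table, here is a small sketch (my own, using a normal approximation to the t distribution; the Bonferroni row matches only roughly):

```python
# Sketch: translate each criterion's approximate |t| cutoff into the two-sided
# "Prob to enter" it implies, using a normal approximation to the t distribution.
from math import log, sqrt
from scipy.stats import norm

n, m = 42, 90                        # cases and candidate features in the NYSE example
cutoffs = {
    "Adjusted R2": 1.0,              # |t| > 1
    "AIC, Cp":     sqrt(2.0),        # |t| > sqrt(2)
    "BIC":         sqrt(log(n)),     # |t| > sqrt(log n)
    "Bonferroni":  sqrt(2 * log(m)), # |t| > sqrt(2 log m)
}
for name, t_cut in cutoffs.items():
    print(f"{name:12s} |t| > {t_cut:4.2f}  ->  Prob to enter ~ {2 * norm.sf(t_cut):.4f}")
# Prints roughly 0.32, 0.16, 0.05, 0.003; the Bonferroni cutoff sqrt(2 log m)
# is a large-m approximation to using a threshold of 1/m (or 0.05/m).
```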

Page 6: L08 Over Fitting


Search domain for the example Consider interactions among 12 exogenous features. The total number of features available to stepwise is then

m = 12 + 12 + 12 × 11/2 = 24 + 66 = 90

Wide data set

There are 42 trading days in October and November. With interactions, we have more features than cases to use.

m = 90 > n = 42 Hence we cannot fit the saturated model with all features.2
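As an aside (my own illustration, not in the notes), a tiny simulation shows why a wide data set is so dangerous: with an intercept plus 41 pure-noise columns and n = 42 cases, least squares reproduces the estimation data exactly.

```python
# Sketch: with an intercept plus 41 noise features and n = 42 cases, least squares
# reproduces the estimation sample exactly -- R2 = 1 with zero predictive value.
import numpy as np

rng = np.random.default_rng(0)
n = 42
X = rng.normal(size=(n, 41))                 # 41 pure-noise "predictors"
y = rng.normal(size=n)                       # unrelated noise response

design = np.column_stack([np.ones(n), X])    # 42 coefficients for 42 cases
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
resid = y - design @ beta
r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
print(round(r2, 6))                          # 1.0
```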

AIC criterion for forward search Set “Prob to Enter” = 0.16 and run the search forward.

The stepwise search never stops!

A greedy search becomes gluttonous when offered so many choices relative to the number of cases that are available.

2 You can show that often the best model is the so-called “saturated” model that has every feature included as a predictor in the fit. But, you can only do this when you have more cases than features, typically at least 3 per predictor (a crude rule of thumb for the ratio n/m).

Page 7: L08 Over Fitting


To avoid the cascade, make it harder to add a predictor; reducing the “Prob to enter” to 0.10 gives this result:

The search stops after adding 20 predictors. Optionally, following a common convention, we can “clean up” the fit and make it appear more impressive by stepping backward to remove collinear predictors that are redundant.

The backward elimination removes 3 predictors.

Page 8: L08 Over Fitting


Make the model and obtain the usual summary. This “Summary of Fit” suggests a great model. Any diagnostic procedure that ignores how we chose the features to include in this model finds no problem. All conclude that this is a great-fitting model, one that is highly statistically significant. Look at all of the predictors whose p-value < 0.0001. These easily meet the Bonferroni threshold, when applied after the fact.

Summary of Fit
RSquare                        0.949
Root Mean Square Error         0.191
Mean of Response               0.177
Observations (or Sum Wgts)     42

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio
Model      17       16.214437       0.953790   26.1361
Error      24        0.875838       0.036493   Prob > F
C. Total   41       17.090274                  <.0001

Parameter Estimates
Term                                             Est      Std Err   t Ratio   Prob>|t|
Intercept                                        -0.090   0.058     -1.56     0.1317
Exogenous 6                                       0.093   0.036      2.60     0.0156
Exogenous 9                                       0.256   0.046      5.59     <.0001
Exogenous 10                                      0.326   0.058      5.62     <.0001
(Exogenous 2-0.19088)*(Exogenous 3+0.07326)       0.192   0.035      5.52     <.0001
(Exogenous 2-0.19088)*(Exogenous 5-0.11786)       0.181   0.043      4.19     0.0003
(Exogenous 3+0.07326)*(Exogenous 5-0.11786)      -0.209   0.038     -5.45     <.0001
(Exogenous 5-0.11786)*(Exogenous 6-0.07955)       0.178   0.030      5.88     <.0001
(Exogenous 8+0.13772)*(Exogenous 8+0.13772)       0.087   0.031      2.78     0.0105
(Exogenous 1+0.21142)*(Exogenous 9-0.32728)      -0.412   0.048     -8.66     <.0001
(Exogenous 2-0.19088)*(Exogenous 9-0.32728)       0.198   0.044      4.51     0.0001
(Exogenous 5-0.11786)*(Exogenous 9-0.32728)       0.384   0.062      6.18     <.0001
(Exogenous 6-0.07955)*(Exogenous 10+0.03726)      0.183   0.036      5.05     <.0001
(Exogenous 7-0.23689)*(Exogenous 10+0.03726)      0.252   0.057      4.45     0.0002
(Exogenous 10+0.03726)*(Exogenous 10+0.03726)     0.202   0.027      7.38     <.0001
(Exogenous 2-0.19088)*(Exogenous 11+0.04288)     -0.115   0.047     -2.46     0.0215
(Exogenous 6-0.07955)*(Exogenous 11+0.04288)      0.132   0.057      2.30     0.0304
(Exogenous 10+0.03726)*(Exogenous 12+0.18472)     0.263   0.046      5.69     <.0001

Page 9: L08 Over Fitting


Visualization The surface contour shows that there’s a lot of curvature in the fit of the model, but unlike the curvature seen in several prior examples, the data do not seem to show visual evidence of the curvature.

No pair of predictors appears particularly predictive, although the overall model is.

This plot shows the curvature of the prediction formula using predictors 8 and 10 along the bottom.3

3 Save the prediction formula from your regression model. Then select Graphics > Surface Plot and fill the dialog for the variables with the prediction formula as well as the column that holds the response data. To produce such a plot, you need a recent version of JMP.

Page 10: L08 Over Fitting


Common Sense Test: Hold-back some data Question

Is this fit an example of the ability of multiple regression to find "hidden effects" that simpler models miss? There's no real substance to rely upon to find an explanation for the model. We have more explanatory variables than we can sensibly interpret.

Simple idea (cross-validation) Reserve some data in order to test the model, such as the next month of returns. Fit model to a training/estimation sample, then predict cases in test/validation sample.

Catch-22 How much to reserve, or set aside, for checking the model? There is no clear-cut answer.

Save a little. This choice leaves too much variation in your measure of how well the model has done. A model might look good simply by chance. If we were to reserve only, say, 5 cases to test the model, then it might "get lucky" and predict these 5 well, simply by chance.

Save a lot. This choice leaves too few cases available to find good predictors. We end up with a good estimate of the performance of a poor model. When trying to improve a model or find complex effects, we'll do better with more data to identify the effects.

Page 11: L08 Over Fitting


Example of Predicting Stocks What happens in December?

The model that looks so good on paper flops miserably when put to this simple test. The fitted equation predicts the estimation cases remarkably well, but produces large prediction errors when extended out-of-sample to the next month.

Plot of the prediction errors. Left: in-sample errors, residuals from the fitted model. Right: out-of-sample errors in the forecast period.4 The residuals are small during the estimation period (October – November), in contrast to the size of the errors when the model is used to predict the returns on the NYSE during December.

This model has been over-fit, producing poor forecasts for December. The usual summary statistics conceal the selection process that was used to identify the model.

4 The horizontal gaps between the dots are the weekends or holidays.

[Plot: Prediction Error versus Cal_Date, October 2003 through December 2003.]

Page 12: L08 Over Fitting


What are those other predictors? Random noise!

The 12 basic features X1, X2, … X12 that were called “technical trading rules” are in fact columns of simulated samples from normal distributions.5 Any model that uses these as predictors over-fits the data.

But the final model looks so good! True, but the out-of-sample predictions show how poor it is. A better prediction would be to use the average of the historical data instead. In this example, we know (because the “exogenous rules” are simulated random noise) that the true coefficients for these variables are all zero.

Why doesn’t the final overall F-ratio find the problem? The standard test statistics work “once”, as if you postulated one model before you saw the data. Stepwise tries hundreds of variables before choosing these. Finding a p-value less than 0.05 is not unusual if you look at, say, 100 possible features. Among these, you’d expect to find 5 whose p-value < 0.05 by chance alone.
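A quick simulation (my own sketch, not part of the notes) makes the "5 out of 100" arithmetic concrete:

```python
# Sketch: screen 100 pure-noise features against a pure-noise response;
# about 5 of them will have p < 0.05 just by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, m = 42, 100
y = rng.normal(size=n)

count = 0
for _ in range(m):
    x = rng.normal(size=n)
    _, p = stats.pearsonr(x, y)     # same p-value as the slope in a simple regression of y on x
    count += p < 0.05
print(count, "of", m, "look 'significant'")    # typically around 5
```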

Cannot let stepwise procedure add such variables In this example, the first step picks the worst variable: one that actually adds nothing but claims to do a lot. The effect of adding this spurious predictor is to bias the estimate of error variation. That is, the RMSE is now smaller than it should be. The bias inflates the t-statistics for every other feature.

5 Thereby giving away my opinion of many technical trading rules.

Page 13: L08 Over Fitting


Source of the cascade Suppose stepwise selection incorrectly picks a predictor that it should not have, one for which β = 0. The reason that it picks the wrong predictor is that, by chance, this predictor explains a lot of variation (has a large correlation with the response, here stock returns). The predictor is useless out-of-sample but looks good within the estimation sample. As a result, the model looks better while at the same time actually performing worse.

The result is a biased estimate of the amount of unexplained variation. RMSE gets smaller when in fact the model fits worse; it should be larger, not smaller, after adding this feature.

The biased RMSE, being too small, makes all of the other features look better; t-statistics of features that are not in the model suddenly get larger than they should be. These inflated t-stats make it easier to add other useless features to the model, forming a cascade as more spurious predictors join the model. The EverReady bunny.

Protection from Over-fitting Many have been “burned” by using a method like stepwise regression and over-fitting. A frequently-heard complaint:

“The model looked fine when we built it, but when we rolled it out in the field it failed completely. Statistics is useless. Lies, damn lies, statistics.”

Protections from over-fitting include the following: (a) Avoid automatic methods Sure, and why not use an abacus, slide rule, and normal table while you’re at it? It’s not the computer per se, but

Page 14: L08 Over Fitting


rather the shoddy way that we have used the automatic search. The same concerns apply to tedious manual searches as well.

(b) Arrogant: Stick to substantively-motivated predictors Are you so confident that you know all there is to know about which factors affect the response? Particularly troubling when it comes to interactions. Even so, you can use stepwise selection after picking a model as a diagnostic. That is, use stepwise to learn whether a substantively motivated model has missed structure. Start with a non-trivial substantively motivated model. It should include the predictors that your knowledge of the domain tells you belong. Then run stepwise to see whether it finds other things that might be relevant.

(c) Cautious: Use a more stringent threshold Add a feature only when the results are convincing that the feature has a real effect, not a coincidence. We can do this by using the Bonferroni rule. If you have a list of m candidate features, then set “Prob to enter” = 0.05/m.

Page 15: L08 Over Fitting


Bonferroni = Right Answer + Added Bonus What happens in the stock example?

Set the Prob-to-enter threshold to 0.05 divided by m, the number of features being considered. In this example, the number of considered features is

m = 12 "raw" + 12 "squares" + 12×11/2 "interactions" = 90

so "Prob to enter" = 0.05/90 = 0.00056. Remove all of the predictors from the stepwise dialog, change the "Prob to enter" field to 0.00056, and click Go.6 The search finds the right answer: it adds nothing! No predictor enters the model, and we're left with a regression with just an intercept.
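For illustration only, the same experiment can be mimicked with the hypothetical forward_stepwise sketch from the Overview page (this is not the JMP workflow, and it assumes that earlier sketch has been run first):

```python
# Sketch (reuses the illustrative forward_stepwise() function defined earlier;
# not the JMP workflow): build 90 noise candidates and apply the Bonferroni
# "Prob to enter".
import numpy as np
import pandas as pd
from itertools import combinations

rng = np.random.default_rng(2)
n = 42
raw = pd.DataFrame(rng.normal(size=(n, 12)), columns=[f"X{j}" for j in range(1, 13)])
y = pd.Series(rng.normal(size=n), name="return")

candidates = raw.copy()
for j in raw.columns:
    candidates[f"{j}^2"] = raw[j] ** 2                 # 12 squares
for a, b in combinations(raw.columns, 2):
    candidates[f"{a}*{b}"] = raw[a] * raw[b]           # 66 pairwise interactions

m = candidates.shape[1]                                # 12 + 12 + 66 = 90
print(forward_stepwise(candidates, y, prob_to_enter=0.05 / m))
# Almost always prints []: with the Bonferroni threshold there is only about a
# 5% chance that any noise feature sneaks in.
```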

None should be in the model; the “null model” is the truth.7 The “technical trading rules” used as predictors are random noise, totally unrelated to the response.

Added bonus The use of the Bonferroni rule for guiding the selection process avoids the need to reserve a validation sample in order to test your model and avoid over-fitting. Just set the appropriate “Prob to enter” and use all of the data to fit the model. A larger sample allows the modeling to identify more subtle features that would otherwise be missed.

6 JMP rounds the value input for p-to-enter that is shown in the box in the stepwise dialog, even though the underlying code will use the value that you have entered.
7 Some of the predictors in the stepwise model claim to have p-values that pass the Bonferroni rule. Once stepwise introduces noise into the regression, it can add more and more, and these look fine. You need to use Bonferroni before adding the variables, not after.

Page 16: L08 Over Fitting


Other Applications of the Bonferroni Rule You can (and generally should) use the Bonferroni rule in other situations in regression as well.

Any time that you look at a collection of p-values to judge statistical significance, consider using a Bonferroni adjustment to the p-values.

Testing in multiple regression Suppose you fit a multiple regression with 5 predictors. No selection or stepwise, just fit the model with these predictors. How should you judge the statistical results?

Two-stage process (1) Check the overall F-ratio, shown in the Anova summary of the model. This tests whether the R2 of the model is large given the number of predictors in the fitted model and the number of observations.

(2) If the overall F-ratio is statistically significant, then consider the individual t-statistics for the coefficients using a Bonferroni rule for these.

Suppose the model as a whole is significant, and you have moved to the individual slopes. If you are looking at p-values of a model with 5 predictors, then compare them to 0.05/5 = 0.01 before you get excited about finding a statistically significant effect.
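A sketch of this two-stage check in code (my own illustration using statsmodels and simulated data; the 0.5 coefficient and the column names are arbitrary):

```python
# Sketch: two-stage testing -- overall F first, then Bonferroni-adjusted t-tests.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
n, k = 100, 5
X = pd.DataFrame(rng.normal(size=(n, k)), columns=[f"x{j}" for j in range(1, k + 1)])
y = 0.5 * X["x1"] + rng.normal(size=n)          # only x1 truly matters here

fit = sm.OLS(y, sm.add_constant(X)).fit()
if fit.f_pvalue < 0.05:                         # stage 1: overall F-ratio
    cutoff = 0.05 / k                           # stage 2: Bonferroni cutoff, 0.05/5 = 0.01
    for name, p in fit.pvalues.drop("const").items():
        print(name, round(p, 4), "significant" if p < cutoff else "not significant")
```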

Tukey comparisons The use of Tukey-Kramer comparisons among several means is an alternative way to avoid claiming artificial statistical significance in the specific case of comparing many averages.

Page 17: L08 Over Fitting


Detecting Over-fitting with a Validation Sample Bonferroni is not always possible.

Some methods do not allow this type of control on over-fitting because they do not offer p-values.

Reserve a validation sample It is common in time series modeling to set aside future data to check the predictions from your model. We did it with the stocks without giving it much thought.8 Divide the data set into two batches, one for fitting the model and the second for evaluating the model. The validation sample should be “locked away” excluded from the modeling process, and certainly not “shown” to the search procedure.

Software issues JMP’s “Column Shuffle” command makes this separation into two batches easy to do. For example:

This formula defines a column that labels a random sample of 50 cases (rows) as validation cases, with the rest labeled as estimation cases.9 Then use the “Exclude” & “Hide” commands from the rows menu to set aside and conceal the validation cases.
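The JMP formula itself is not reproduced here; as a rough equivalent, this sketch (mine, with a made-up column name "Role") labels a random 50 rows as validation cases:

```python
# Sketch: add a column that marks a random 50 rows as the validation sample,
# playing the role of the JMP formula column described above.
import numpy as np
import pandas as pd

def add_validation_column(df: pd.DataFrame, n_validation: int = 50, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    hold_out = rng.choice(df.index.to_numpy(), size=n_validation, replace=False)
    out = df.copy()
    out["Role"] = "Estimation"
    out.loc[hold_out, "Role"] = "Validation"   # these rows get "locked away"
    return out

# Fit only to out[out["Role"] == "Estimation"]; score the model on the rest.
```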

8 At some point with time series models, you won't be able to set aside data. If you're trying to predict tomorrow, do you really want to use a model built on data that is a month old?
9 Only 47 cases appear in the validation sample in the next example because it so happened that 3 excluded outliers fall among the validation cases.

Page 18: L08 Over Fitting


Questions when using a validation sample
1. How many observations should I put into the validation sample?
2. How can I use the validation sample to identify over-fitting?
In the blocks example introduced in Module 7, we have n = 200 runs to build a model.10 That produces the following paradox:

If we set aside, say, half for validation, then we'll have a hard time finding good predictors. On the other hand, if we only set aside, say, 10 cases for validation, these may be insufficient to give a valid impression of how well the model has done. A fit might do well on these 10 by chance.

Multi-fold cross-validation A better alternative, if we had the software needed to automate the process, repeats the validation process over and over. 5-fold cross-validation:

Divide the data into 5 subsets, each with 20% of the cases.
Fit your model on 4 subsets, then predict the other.
Do this 5 times, each time omitting a different subset.
Accumulate the prediction errors. Repeat! (See the sketch below.)
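Here is a sketch of that 5-fold scheme (my own illustration using scikit-learn for the fold bookkeeping; the linear model is a placeholder for whatever model is being validated):

```python
# Sketch: 5-fold cross-validation. Fit on four folds, predict the held-out fold,
# and accumulate the out-of-sample prediction errors.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def five_fold_errors(X: np.ndarray, y: np.ndarray, seed: int = 0) -> np.ndarray:
    """Return one out-of-sample prediction error per case."""
    errors = np.empty_like(y, dtype=float)
    folds = KFold(n_splits=5, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])   # placeholder model
        errors[test_idx] = y[test_idx] - model.predict(X[test_idx])
    return errors   # the RMS of these errors estimates out-of-sample prediction error
```

Repeating with different shuffles ("Repeat!") and averaging reduces the luck of any single split.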

10 So, why not go back to the client and say "I need more data." Getting data is expensive unless it's already been captured in the system. Often, as in this example, the features for each run have to be found by manually searching back through records.

Page 19: L08 Over Fitting


Controlling Stepwise with a Validation Sample (block.jmp) Prior version of the cost-accounting model had 15 predictors with an R2 of 69% and RMSE of $5.80. Using the Bonferroni rule to control the stepwise search gives the model shown on the next page…

It is hard to count how many predictors JMP can choose from because categorical terms get turned into several dummy variables. We can estimate m by counting the number of “screens” needed to show the candidate features. With m ≈ 385 features to consider, the Bonferroni threshold for the “Prob to enter” criterion is

0.05/385 = 0.00013

The resulting model appears on the next page. It is more parsimonious and does not claim the precision produced by the prior search.

The model has 4 predictors, with R2 = 0.47 and RMSE = $6.80. It also avoids weird variables like the type of music!

Page 20: L08 Over Fitting


Actual by Predicted Plot 

Summary of Fit
RSquare                        0.465
RSquare Adj                    0.454
Root Mean Square Error         6.834
Mean of Response               39.694
Observations (or Sum Wgts)     197

Analysis of Variance
Source     DF    Sum of Squares   Mean Square   F Ratio
Model        4        7800.251       1950.06    41.7500
Error      192        8967.959         46.71    Prob > F
C. Total   196       16768.210                  <.0001

Parameter Estimates
Term                                              Estimate   Std Error   t Ratio   Prob>|t|
Intercept                                            20.22        1.84     10.97     <.0001
Labor_hrs                                            38.68        4.17      9.27     <.0001
(Abstemp-4.6)*(Abstemp-4.6)                           0.07        0.01      6.09     <.0001
(Cost_Kg-1.8)*(Materialcost-2.3)                      0.86        0.15      5.69     <.0001
(Manager{J-R&L}+0.22)*(Brkdown/units-0.00634)      -372.50       89.07     -4.18     <.0001

[Actual by Predicted plot: Ave_Cost Actual versus Ave_Cost Predicted; P<.0001, RSq=0.47, RMSE=6.8343.]

Page 21: L08 Over Fitting


Leverage plots suggest that the model has found some additional highly leveraged points that were not identified previously.

What should we do about these? What can we learn from these?

[Leverage plots: Ave_Cost leverage residuals versus the Abstemp*Abstemp term (P<.0001) and versus the Cost_Kg*Materialcost term (P<.0001).]

Page 22: L08 Over Fitting


Visualization reveals some of the structure of the model.11 These plots are more interesting if you color-code the points for old and new plants. Do you see the two groups of points?

11 JMP will produce a surface plot only for models produced by Fit Model.

Page 23: L08 Over Fitting


Back to Business Allure of fancy tools

It is easy to become so enamored with fancy tools that you lose sight of the problem that you're trying to solve. The client wants a model that predicts the cost of a production run. We've now learned enough to be able to return to the client with questions of our own. We're doing much better than the naïve initial model (5 predictors with R2 = 0.30, versus the improved model's 4 predictors with R2 = 0.47).

What questions should you ask the client in order to understand what’s been found by the model?

What are those leveraged outliers? What's up with temperature controls? Do these have the same effect in both plants? (You'll have to do some data analysis to answer this one.) What do you make of the categorical factor?

In other words… Stepwise methods leave ample opportunity to exploit what you know about the context… You can design more sensible features to consider by using what you “know” about the problem. Ideally, by simplifying the search for additional predictors, stepwise methods (or other search technologies) allow you to have more time to think about the modeling problem. Here are a few substantively motivated comments:

Page 24: L08 Over Fitting


The features 1/Units and Breakdown/Units make more sense (and are more interpretable) as ways of tracking fixed costs. Similarly, why use Cost/Kg when you can figure out the material cost as the product (cost per kg) × weight? Finally, make note of the so-called nesting of managers within the different plants. Consider the following table:

Plant By Manager
Count    JEAN   LEE   PAT   RANDY   TERRY   Total
NEW        40     0     0       0      30      70
OLD         0    44    42      41       0     127
Total      40    44    42      41      30     197

Jean and Terry work in the new plant, with the others working in the old plant. Can you compare Jean to Lee, for example? Or does that amount to comparing the two plants? These two features, Manager and Plant, are confounded and cannot be separated by this analysis. (We can, however, compare Jean to Terry since they do work in the same plant.)

Page 25: L08 Over Fitting


Appendix: Bonferroni Method The Bonferroni Inequality

The Bonferroni inequality (a.k.a., Boole’s inequality) gives a simple upper bound for the probability of a union of events. If you simply ignore the double counting, then it follows that

P(E1 or E2 or … or Em) ≤ P(E1) + P(E2) + … + P(Em)

In the special case that all of the events have the same probability p = P(Ej), this bound becomes

P(E1 or E2 or … or Em) ≤ m p

Use in Model Selection

In model selection for stepwise regression, we start with a list of m possible features of the data that we consider for use in the model. Often, this list will include interactions that we want to have considered in the model, but are not really very sure about. If the list of possible predictors is large, then we need to avoid “false positives”, adding a variable to the model that is not actually helpful. Once the modeling begins to add unneeded predictors, it tends to “cascade” by adding more and more. We’ll avoid this by trying to never add a predictor that’s not helpful.

Page 26: L08 Over Fitting


Bonferroni Rule for p-values Let the events E1 through Em denote errors in the modeling, adding the jth variable when it actually does not affect the response. The chance for making any error when we consider all m of these is then

P(some false positive) = P(E1 or E2 or … or Em) ≤ m p

If we add a feature as a predictor in the model only if its p-value is smaller than 0.05/m, say, then the chance of incorrectly including a predictor is less than

P(some false positive) ≤ m × (0.05/m) = 0.05

There’s only a 5% chance of making any mistake.
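A small simulation (mine, not from the notes) of that 5% claim, for m = 90 independent null tests:

```python
# Sketch: simulate m = 90 independent null tests; with the 0.05/m rule the
# chance of any false positive stays just under 5%.
import numpy as np

rng = np.random.default_rng(4)
m, reps = 90, 20000
p = rng.uniform(size=(reps, m))                    # null p-values are Uniform(0, 1)
family_error = (p.min(axis=1) < 0.05 / m).mean()
print(round(family_error, 3))                      # about 0.049
```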

It's really pretty good Some would say that using this so-called "Bonferroni rule" is too conservative: it makes it too hard to find useful predictors. It's actually not so bad. (1) For example, suppose that we have m = 1000 possible features to sort through. Then the Bonferroni rule says to add a feature only if its p-value is smaller than 0.05/1000 = 0.00005. That seems really small at first, but convert it to a t-ratio. How large (in absolute size) does the t-ratio need to be in order for the p-value to be smaller than 0.00005? The answer is about 4.6.

In other words, once the t-ratio is larger than around 5, a model selection procedure will add the variable. A t-ratio of 5 does not seem so unattainable. Sure, it requires a large

Page 27: L08 Over Fitting


effect, but with so many possibilities, we need to be careful.

(2) Another way to see that Bonferroni is pretty good is to put a lower bound on the probability of a false positive. If all of the events are independent, then

P(some false positive) = 1 − P(none)
  = 1 − P(E1^c and E2^c and … and Em^c)
  = 1 − P(E1^c) × P(E2^c) × … × P(Em^c)
  = 1 − (1 − p)^m
  = 1 − e^(m log(1 − p))
  ≥ 1 − e^(−m p)

and the last step follows because log(1+x) ≤ x. Combined with the Bonferroni inequality, we have (for independent tests)

1 − e^(−m p) ≤ P(some false positive) ≤ m p

This table summarizes the implications. It shows that as p gets smaller (so that m p is small), the bounds from these inequalities are really very tight.

   m        p        m p     Bounds
  50      0.01      0.50     0.39 – 0.50
  50      0.005     0.25     0.22 – 0.25
 100      0.0001    0.01     0.0099 – 0.0100
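The rows of this table are easy to reproduce (a small sketch):

```python
# Sketch: reproduce the lower and upper bounds in the table above.
from math import exp

for m, p in [(50, 0.01), (50, 0.005), (100, 0.0001)]:
    print(f"m={m}, p={p}: {1 - exp(-m * p):.4g} <= P(some false positive) <= {m * p:.4g}")
# 0.3935 <= ... <= 0.5,   0.2212 <= ... <= 0.25,   0.00995 <= ... <= 0.01
```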