Best Subset Reg 2


7/31/2019 Best Subset Reg 2

http://slidepdf.com/reader/full/best-subset-reg-2 1/9

Model Selection in Minitab

Variable Selection

The basic procedure is obtained by Stat > Regression > Best Subsets …

Necessary input:

The response variable is specified in the Responses: box. The pool of explanatory variables of which subsets are to be evaluated is entered in the Free predictors: box. No other specifications are necessary, though some may be desired, as follows.

Commonly useful options:

If some explanatory variables are to be kept in all models, enter them in the Predictors in all models: box in the main Best Subsets Regression window, and not in the Free predictors: box.

The default output lists the two best models of each size. To change this number (I prefer to see at least 4), click the Options button and enter the desired value in the Models of each size to print: box.

To restrict the minimum or maximum size of the models to be evaluated, click the Options button and enter the desired value(s) in the appropriate box under Free Predictor(s) in Each Model.
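For readers who want to see what the procedure computes, the following is a minimal Python sketch of a best-subsets search. It is not Minitab's algorithm, only an exhaustive enumeration under stated assumptions: the function names (fit_sse, best_subsets), the argument names (forced, n_best, min_free, max_free) and the synthetic data are all hypothetical stand-ins for the dialog-box settings described above.

```python
from itertools import combinations
import numpy as np

def fit_sse(X, y):
    """Least-squares fit with an intercept; returns the error sum of squares."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def best_subsets(X, y, names, forced=(), n_best=2, min_free=1, max_free=None):
    """Evaluate every subset of the free predictors; keep the n_best of each size."""
    n = len(y)
    ssto = float(((y - y.mean()) ** 2).sum())
    keep = [j for j in range(X.shape[1]) if names[j] in forced]      # in all models
    free = [j for j in range(X.shape[1]) if names[j] not in forced]  # free predictors
    mse_full = fit_sse(X, y) / (n - X.shape[1] - 1)  # MSE of the model with all predictors
    max_free = len(free) if max_free is None else max_free
    rows = []
    for k in range(min_free, max_free + 1):
        fits = []
        for combo in combinations(free, k):
            cols = keep + list(combo)
            p = len(cols) + 1                        # parameters, including the intercept
            sse = fit_sse(X[:, cols], y)
            fits.append({
                "vars": k,                           # counts free predictors only
                "r2": 100 * (1 - sse / ssto),
                "r2_adj": 100 * (1 - (sse / (n - p)) / (ssto / (n - 1))),
                "cp": sse / mse_full - (n - 2 * p),  # Mallows C_p
                "s": (sse / (n - p)) ** 0.5,         # square root of MSE
                "predictors": [names[j] for j in cols],
            })
        # Within one size, ranking by R-Sq gives the same order as the other criteria.
        fits.sort(key=lambda f: -f["r2"])
        rows.extend(fits[:n_best])
    return rows

# Synthetic illustration (not the handout's data): y depends on x1 and x3 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=60)
names = ["x1", "x2", "x3", "x4"]
table = best_subsets(X, y, names, n_best=2)
for row in table:
    print(row["vars"], round(row["r2"], 1), round(row["r2_adj"], 1),
          round(row["cp"], 1), round(row["s"], 5), row["predictors"])
```

Passing forced=("x2",) mimics moving a variable to the Predictors in all models: box; the "vars" entry then counts only the free predictors.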


Output

On the next page is an excerpt of the output from the analysis specified in the preceding screen shots. Each line in the table represents a particular model; as requested, four models of each size are reported. The variables in a particular model are indicated by the Xs in the columns under the variable names (which read downwards).

For example, the first line is for the best one-variable model, with only TotPersInc. This model has R² = 44.4, adjusted R² = 44.3, C_p = 357.5, and s (= √MSE) = 0.37095.

As another example, the best three-variable model is shown in the row starting 3 62.1 … This model contains the variables TotPop, PctPoverty, and PerCapInc.

By both the C_p and the adjusted R² (equivalently, s) criteria, the best model in this example is the first 8-variable model, with all the explanatory variables except LandArea and PctBach. The best two 9-variable models, which add one or the other of the variables not in the preceding model, are nearly as good by both criteria.

Using the C_p criterion can be facilitated by plotting the C_p values against the number of variables in the model. This can be done by cutting and pasting the output table into another Minitab worksheet (or Excel, etc.). The reference line can be added by creating two new columns, one with the values 0 and the maximum number of variables, and the other with the corresponding p (i.e., 1 and the maximum number of variables plus 1). After making the scatterplot of C_p, these columns can be used to add a Calculated line… to the scatterplot. The other criteria can be plotted similarly. Examples are on the next page.
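As a small worked illustration of the reference line, the Python sketch below builds the two endpoint columns described above and compares the best C_p of each size with p. The C_p values are copied from the best 7- to 10-variable rows of the output excerpt; the variable names (best_cp, ref_x, ref_y) are hypothetical.

```python
# Best C_p for each model size, taken from the output excerpt (sizes 7 to 10 only).
best_cp = {7: 12.7, 8: 7.6, 9: 9.2, 10: 11.0}

max_vars = 10
ref_x = [0, max_vars]             # number of variables at the two ends of the line
ref_y = [x + 1 for x in ref_x]    # corresponding p = number of variables + 1

# Models whose C_p lies on or below the reference line C_p = p.
on_or_below = [k for k, cp in best_cp.items() if cp <= k + 1]
smallest = min(best_cp, key=best_cp.get)

print("reference line endpoints:", list(zip(ref_x, ref_y)))
print("sizes on or below the line:", on_or_below)
print("size with the smallest C_p:", smallest)
```

The pairs in ref_x and ref_y are the same two points that the Calculated line… option would connect on the scatterplot.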


Best Subsets Regression: logPhys versus LandArea, TotPop, ...

Response is logPhys

Predictor columns, left to right: LandArea, TotPop, Pct18to34, PctOver65, PctHSGrad, PctBach, PctPoverty, PctUnemp, PerCapInc, TotPersInc

Vars  R-Sq  R-Sq(adj)  Mallows C-p        S  X per included predictor
   1  44.4       44.3        357.5  0.37095  X
   1  42.0       41.8        392.1  0.37896  X
   1  24.9       24.8        635.2  0.43101  X
   1  20.3       20.1        701.7  0.44418  X
   2  54.8       54.6        210.3  0.33465  X X
   2  54.7       54.5        212.6  0.33526  X X
   2  54.0       53.8        222.5  0.33781  X X
   2  52.5       52.3        243.3  0.34311  X X
   3  62.1       61.8        108.6  0.30690  X X X
   3  60.3       60.1        134.0  0.31403  X X X
   …
   7  69.4       68.9         12.7  0.27709  X X X X X X X
   7  69.0       68.5         18.4  0.27890  X X X X X X X
   7  68.7       68.2         22.3  0.28013  X X X X X X X
   7  68.4       67.9         27.3  0.28169  X X X X X X X
   8  69.9       69.3          7.6  0.27517  X X X X X X X X
   8  69.7       69.1         10.8  0.27618  X X X X X X X X
   8  69.5       68.9         13.9  0.27716  X X X X X X X X
   8  69.1       68.6         18.3  0.27857  X X X X X X X X
   9  69.9       69.3          9.2  0.27536  X X X X X X X X X
   9  69.9       69.3          9.4  0.27540  X X X X X X X X X
   9  69.7       69.1         12.4  0.27636  X X X X X X X X X
   9  69.1       68.5         20.3  0.27889  X X X X X X X X X
  10  69.9       69.2         11.0  0.27560  X X X X X X X X X X


When one or more variables are forced to be in all models, the Vars column in the best-subsets output specifies how many of the free predictors are included; it does not count the predictors forced into the models. In such a case the number of parameters, p, will be the sum of the number of free predictors included (as given in the Vars column), the number of variables forced in, plus 1 for the intercept. On the next page is an excerpt of the output resulting from moving TotPop from the Free predictors: box to the Predictors in all models: box (bolding added to show these points).

[Figures: three scatterplots of the best-subsets criteria against the number of variables (= p-1): C_p, adjusted R², and s (= square root of MSE).]


Best Subsets Regression: logPhys versus LandArea, Pct18to34, ...

Response is logPhys
The following variables are included in all models: TotPop

Free predictor columns, left to right: LandArea, Pct18to34, PctOver65, PctHSGrad, PctBach, PctPoverty, PctUnemp, PerCapInc, TotPersInc

Vars  R-Sq  R-Sq(adj)  Mallows C-p        S  X per included free predictor
   1  54.8       54.6        210.3  0.33465  X
   1  54.7       54.5        212.6  0.33526  X
   2  62.1       61.8        108.6  0.30690  X X
   2  58.2       57.9        164.8  0.32246  X X

Other Criteria: PRESS

The only other model-selection criterion available in Minitab is PRESS_p. It can be obtained for only one model at a time, using the usual regression procedure:

Stat > Regression > Regression …

Click on the Options button, then check the box for PRESS and predicted R-square under the Display part of the options window. (Predicted R² is the fraction of the variation in the response variable “explained” by the leave-one-out predictions used to calculate PRESS. These two statistics are essentially redundant, but the predicted R² can be compared directly to the regular R² for the model, to judge whether the latter accurately reflects the predictive value of the model or is inflated by over-fitting.)

Output

The PRESS and predicted R² statistics are printed just below the regular R². In this example the PRESS statistic is reasonably close to SSE, and the predicted R² is reasonably close to the regular R². These findings indicate that the model is at least not substantially overfit.

Regression Analysis: logPhys versus TotPop, Pct18to34, ...

The regression equation is
logPhys = - 0.988 + 0.000001 TotPop + 0.0234 Pct18to34 + 0.0270 PctOver65
          + 0.00833 PctHSGrad + 0.0483 PctPoverty - 0.0255 PctUnemp
          + 0.000086 PerCapInc - 0.000046 TotPersInc

Predictor          Coef     SE Coef      T      P
…
TotPersInc  -0.00004566  0.00000922  -4.95  0.000

S = 0.275165   R-Sq = 69.9%   R-Sq(adj) = 69.3%

PRESS = 44.2863   R-Sq(pred) = 59.14%

Analysis of Variance

Source           DF        SS      MS       F      P
Regression        8   75.7403  9.4675  125.04  0.000
Residual Error  431   32.6336  0.0757
Total           439  108.3739
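To make the PRESS and predicted R² figures in this output concrete, here is a hedged Python sketch. It assumes the standard leave-one-out identity (deleted residual = ordinary residual divided by 1 minus the leverage) and checks it against an explicit refit-without-each-observation loop on synthetic data. The function names press_stats and press_brute are hypothetical; only the last lines use numbers from the handout's output.

```python
import numpy as np

def press_stats(X, y):
    """PRESS via leverages, plus SSE, R-Sq and predicted R-Sq (as fractions)."""
    A = np.column_stack([np.ones(len(y)), X])
    H = A @ np.linalg.pinv(A)                  # hat matrix
    resid = y - H @ y                          # ordinary residuals
    deleted = resid / (1 - np.diag(H))         # leave-one-out prediction errors
    press = float(deleted @ deleted)
    sse = float(resid @ resid)
    ssto = float(((y - y.mean()) ** 2).sum())
    return {"press": press, "sse": sse,
            "r2": 1 - sse / ssto, "r2_pred": 1 - press / ssto}

def press_brute(X, y):
    """Same quantity, computed by refitting without each observation in turn."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    total = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(A[mask], y[mask], rcond=None)
        total += float(y[i] - A[i] @ beta) ** 2
    return total

# Synthetic data, for illustration only.
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))
y = 2.0 + X @ np.array([1.0, 0.0, -0.5]) + rng.normal(scale=0.4, size=50)
stats = press_stats(X, y)
print(stats, press_brute(X, y))

# Arithmetic check against the printed output: PRESS = 44.2863, Total SS = 108.3739.
r2_pred_pct = 100 * (1 - 44.2863 / 108.3739)
print(round(r2_pred_pct, 2))
```

The final line reproduces the R-Sq(pred) value from PRESS and the Total sum of squares in the Analysis of Variance table.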


Validation

Internal validation, using PRESS and predicted R², is done as explained above.

MSPR with New Data

The predicted values for the observations in the validation data set, based on the selected model fit to the model-building data set, can be calculated in two ways. First, the fitted model equation can be entered into the Calculator, creating a column of predicted values in a worksheet containing only the validation data set.

Alternatively, the validation data can be added (as new columns) to the model-building data set, and Prediction intervals for new observations: can be requested in the Options window of the regression procedure, with the regression being done on the original data set. For example, if nTotPop is the Total Population variable in the new data set, and so on for the other variables, the following would be used:

The predicted values will be stored in a new column, named PFIT1.

Once the column of predicted values is created, the MSPR can be calculated in the Calculator, using an expression like

SSQ( nLgPhys - PFIT1 ) / N( PFIT1 ),

where nLgPhys is the column of observed values of the response variable in the new data set, PFIT1 is the column of predicted values, SSQ is a function computing the sum of squared values for an entire column (or in this case, of squared differences between two columns), and N is a function returning the number of non-missing values in a column. The result will be a column with only one entry, which is the MSPR.
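The same calculation can be sketched outside Minitab. The Python function below mirrors the SSQ(...) / N(...) expression; the name mspr and the toy numbers are hypothetical. One assumption differs slightly from the Calculator expression: the sketch counts only rows where both the observed and the predicted value are present, which coincides with N( PFIT1 ) whenever no observed values are missing.

```python
import numpy as np

def mspr(observed, predicted):
    """Mean squared prediction error over rows where both values are present."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ok = ~np.isnan(observed) & ~np.isnan(predicted)
    squared_errors = (observed[ok] - predicted[ok]) ** 2   # role of SSQ(...)
    return float(squared_errors.sum() / ok.sum())          # role of ... / N(...)

# Toy values, for illustration only; the last observed value is missing.
observed = [2.0, 3.0, 5.0, np.nan]
predicted = [2.5, 2.0, 5.0, 4.0]
result = mspr(observed, predicted)
print(result)
```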


Data Splitting

Various methods can be used to split a data set into model-building and validation subsets. A relatively easy method is to first create a column distinguishing which observations are to go into which subset, then use

Data > Split Worksheet …

to divide the data set based on the column just created.

For example, to separate odd and even observations, the calculation shown in the window on the next page will create a column (c21) containing 1s for all observations with odd IDNum and 0s for all even observations. (The function MOD(x,y) returns the remainder after dividing x by y, and the entire expression is a logical comparison, evaluating to 1 (true) when the equality is true, and 0 (false) otherwise.) This column can then be used to split the data, as in the following window (invoked by Data > Split Worksheet …).

To simply select a random subset of observations, an easier method uses Calc > Random Data > Sample From Columns…

Using either of the preceding methods to create separate subsets, the MSPR can be calculated as described under MSPR with New Data above, either using the Calculator to compute predicted values for the validation data set, or copying the validation data (as new columns) into the model-building subset and using the Prediction intervals for new observations: method.
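The two splitting methods can be sketched in Python as follows. This is an illustration under assumptions, not the contents of the Minitab windows: the comparison id_num % 2 == 1 is assumed to match the MOD-based logical expression described above, and the ten ID numbers and the sample size of 5 are made up.

```python
import numpy as np

id_num = np.arange(1, 11)                      # hypothetical IDNum column, 1..10

# Odd/even split: 1 for odd IDNum, 0 for even, as in the 1/0 column c21.
flag = (id_num % 2 == 1).astype(int)
build = id_num[flag == 1]                      # model-building observations
valid = id_num[flag == 0]                      # validation observations

# Random split: draw a subset without replacement; the rest form the other subset.
rng = np.random.default_rng(1)
sample = rng.choice(id_num, size=5, replace=False)
rest = np.setdiff1d(id_num, sample)

print(flag.tolist(), build.tolist(), valid.tolist())
print(sorted(sample.tolist()), rest.tolist())
```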

Yet another approach to data-splitting is to not actually divide the data set, but to separate the response variable into different columns for the model-building and validation subsets. For instance, creating a new column as logPhys / c21 (with c21 the 1/0 column created above) will give missing values for all even-numbered observations (for which c21 = 0). This new column, of observed values for the model-building data subset and missing values for the validation data subset, can then be used as the response variable in the regression procedure, and “fits” can be stored (click the Storage button in the main regression window, and check the box for Fits). “Fits” will be stored for all observations, including those with missing values of the response variable. The “fits” for the validation observations can then be separated from those for the model-building observations by copying the “fits” column to another column but selecting only observations in which the subsetting column (c21 here) equals 0. MSPR then can be calculated using this new column of the predicted values for the validation observations.
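A minimal Python sketch of this masked-response approach follows, on synthetic data. Note one deliberate substitution: in NumPy, dividing by zero yields infinity rather than a missing value, so the sketch uses np.where to set the validation responses to NaN, playing the role of logPhys / c21. All variable names and the data are hypothetical.

```python
import numpy as np

# Synthetic data, for illustration only.
rng = np.random.default_rng(2)
n = 40
x = rng.normal(size=(n, 2))
y = 0.5 + x @ np.array([1.0, -2.0]) + rng.normal(scale=0.3, size=n)

# 1/0 column: 1 for odd-numbered observations (model-building), 0 for even (validation).
flag = (np.arange(1, n + 1) % 2 == 1).astype(int)

# Response with missing values for the validation observations.
y_build = np.where(flag == 1, y, np.nan)

# Fit using only rows with a non-missing response, but store fits for every row.
A = np.column_stack([np.ones(n), x])
used = ~np.isnan(y_build)
beta, *_ = np.linalg.lstsq(A[used], y_build[used], rcond=None)
fits = A @ beta

# Keep the fits for the validation rows (flag equals 0) and compute the MSPR.
valid = flag == 0
mspr = float(((y[valid] - fits[valid]) ** 2).mean())
print(round(mspr, 4))
```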