Introduction to MARS (2009)
2 Salford Systems © Copyright 2009
MARS is a highly-automated tool for regression
Developed by Jerome H. Friedman of Stanford University
Introduced in a dense 65-page article in the Annals of Statistics, 1991
Takes some inspiration from its ancestor CART®
Produces smooth curves and surfaces, not the step-functions of CART
Appropriate target variables are continuous
End result of a MARS run is a regression model
MARS automatically chooses which variables to use
variables are optimally transformed
interactions are detected
model is self-tested to protect against over-fitting
Can also perform well on binary dependent variables
and on censored survival models (waiting-time models, as in churn)
Introduction
Harrison, D. and D. Rubinfeld. Hedonic Housing Prices and the Demand for Clean Air. Journal of Environmental Economics and Management, v5, 81-102, 1978
506 census tracts in City of Boston for the year 1970
Goal: study relationship between quality of life variables and property values
MV median value of owner-occupied homes in tract (‘000s)
CRIM per capita crime rates
NOX concentration of nitrogen oxides (pphm)
AGE percent built before 1940
DIS weighted distance to centers of employment
RM average number of rooms per house
LSTAT percent neighborhood ‘lower socio-economic status’
RAD accessibility to radial highways
CHAS borders Charles River (0/1)
INDUS percent non-retail business
TAX tax rate
PT pupil teacher ratio
Boston Housing Dataset
The dataset poses significant challenges to conventional regression modeling
Clear departures from normality, non-linear relationships, and skewed distributions
Multicollinearity, mutual dependency, and outlying observations
Scatter Matrix
A typical MARS solution (univariate for simplicity) is shown above
Essentially a piece-wise linear regression model with a continuity requirement at the transition points, called knots
The locations and number of knots were determined automatically to ensure the best possible model fit
The solution can be analytically expressed as conventional regression equations
[Figure: piecewise-linear MARS fit of MV (vertical axis) against LSTAT (horizontal axis)]
MARS Model
Finding the one best knot in a simple regression is a straightforward search problem
try a large number of potential knots and choose one with best R-squared
computation can be implemented efficiently using update algorithms; entire regression does not have to be rerun for every possible knot (just update X’X matrices)
Finding K knots simultaneously would require on the order of N^K computations for N observations
To preserve linear problem complexity, multiple knot placement is implemented in a step-wise manner:
Need a forward/backward procedure
The forward procedure adds knots sequentially one at a time
The resulting model will have many knots and overfit the training data
The backward procedure removes least contributing knots one at a time
This produces a list of models of varying complexity
Using appropriate evaluation criterion, identify the optimal model
Resulting model will have approximately correct knot locations
Challenge: Searching for Multiple Knots
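The single-knot search described above can be sketched as a brute-force scan over candidate knots (omitting the X'X update trick the slides mention). The function name and the synthetic data below are illustrative, not Salford's implementation:

```python
import numpy as np

def best_knot(x, y):
    """Scan candidate knots; return the one whose piecewise-linear fit
    y ~ a + b*x + c*max(x - knot, 0) gives the best R-squared."""
    best = (None, -np.inf)
    for knot in np.unique(x)[1:-1]:          # interior candidate knots only
        bf = np.maximum(x - knot, 0.0)       # direct basis function at this knot
        X = np.column_stack([np.ones_like(x), x, bf])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
        r2 = 1.0 - resid.var() / y.var()
        if r2 > best[1]:
            best = (knot, r2)
    return best

# Toy data with a true slope change at x = 5
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = np.where(x < 5, 2.0, 2.0 + 1.5 * (x - 5)) + rng.normal(0, 0.1, x.size)
knot, r2 = best_knot(x, y)   # knot lands near 5
```

Each candidate requires a full least-squares fit here; the update algorithms the slide mentions avoid exactly that cost.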
True conditional mean has two knots at X=30 and X=60; observed data includes additional random error
Best single knot will be at X=45, subsequent best locations are true knots around 30 and 60
The backward elimination step is needed to remove the redundant node at X=45
[Figure: two panels plotting Y against X for the flat-top function example, with knots near X = 30 and X = 60]
Example: Flat Top Function
Thinking in terms of knot selection works very well to illustrate splines in one dimension but becomes unwieldy when working with a large number of variables simultaneously
Need a concise notation easy to program and extend in multiple dimensions
Need to support interactions, categorical variables, and missing values
Basis Functions (BF) provide analytical machinery to express the knot placement strategy
A basis function is a continuous univariate transform that restricts a predictor's influence to a smaller range of values, controlled by a parameter c (c = 20 in the example below)
Direct BF: max(X-c, 0) – the original range is cut below c
Mirror BF: max(c-X, 0) – the original range is cut above c
Basis Functions
[Figure: the mirror basis function max(20 - X, 0) and the direct basis function max(X - 20, 0) plotted against X]
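The direct and mirror transforms are one-line operations; a minimal sketch with the slide's knot value c = 20:

```python
import numpy as np

c = 20  # knot parameter from the slide's example

x = np.array([0, 10, 20, 30, 40], dtype=float)
direct = np.maximum(x - c, 0)   # zero below the knot, rises linearly above it
mirror = np.maximum(c - x, 0)   # zero above the knot, rises linearly below it
# direct -> [0, 0, 0, 10, 20]; mirror -> [20, 10, 0, 0, 0]
```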
MARS constructs basis functions for each unique value present in a continuous variable
Each new BF results in a different number of zeroes in the transformed variable – hence the set of all BFs is linearly independent
The resulting collection is naturally resistant to multicollinearity issues
This is further reinforced by introducing a minimum-number-of-observations requirement between two consecutive knots
The Set of All Basis Functions
Step-Wise Model Development using BFs
Define a basis function BF1 on the variable INDUS: BF1 = max(0, INDUS - 4)
Use this function instead of INDUS in a regression: y = constant + β1*BF1 + error
This fits a model in which the effect of INDUS on the dependent variable is 0 for all values below 4 and changes at the rate β1 for values above 4
Suppose we added a second basis function BF2 to the model: BF2 = max(0, INDUS - 8)
Then our regression function would be: y = constant + β1*BF1 + β2*BF2 + error
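The two-basis-function regression can be sketched on synthetic data. The variable names mirror the slides, but the data and recovered coefficients here are simulated, not the Boston Housing values:

```python
import numpy as np

rng = np.random.default_rng(1)
indus = rng.uniform(0, 30, 300)   # stand-in predictor (synthetic, not Boston data)
# Simulate a response whose slope changes at INDUS = 4 and again at INDUS = 8
mv = (30
      - 2.4 * np.maximum(indus - 4, 0)
      + 2.2 * np.maximum(indus - 8, 0)
      + rng.normal(0, 0.5, indus.size))

bf1 = np.maximum(indus - 4, 0)    # BF1 = max(0, INDUS - 4)
bf2 = np.maximum(indus - 8, 0)    # BF2 = max(0, INDUS - 8)
X = np.column_stack([np.ones_like(indus), bf1, bf2])
coef, *_ = np.linalg.lstsq(X, mv, rcond=None)
# coef recovers roughly [30, -2.4, 2.2]: constant, then slope changes at 4 and 8
```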
Solution with 1 Basis Function
MV = 27.395 - 0.659*(INDUS - 4)+
[Figure: one-basis-function fit of MV against INDUS]
Solution with 2 Basis Functions
MV = 30.290 - 2.439*(INDUS - 4)+ + 2.215*(INDUS - 8)+
Slope starts at 0 and then becomes -2.439 after INDUS=4
Slope on third portion (after INDUS=8) is (- 2.439 + 2.215) = -0.224
[Figure: two-basis-function fit of MV against INDUS]
The following model represents a 3-knot univariate solution for the Boston Housing dataset using two direct and one mirror basis functions
BF1 = max(0, 4-INDUS) BF2 = max(0, INDUS-4) BF3=max(0, INDUS-8)
MV= 29.433 + 0.925*(4-INDUS)+ - 2.180*(INDUS-4)+ + 1.939*(INDUS-8)+
All three line segments have negative slope even though two of the three coefficients are positive
[Figure: three-basis-function fit of MV against INDUS]
Example: Solution with 3 Basis Functions
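The quoted equation can be checked numerically. The helper function below is hypothetical, but the coefficients and knots are taken from the slide; finite differences confirm that every segment slopes downward:

```python
def mv_hat(indus):
    """Three-basis-function solution quoted on the slide."""
    return (29.433
            + 0.925 * max(4 - indus, 0)    # mirror BF at knot 4
            - 2.180 * max(indus - 4, 0)    # direct BF at knot 4
            + 1.939 * max(indus - 8, 0))   # direct BF at knot 8

# Slope of each linear segment via finite differences
slope = lambda a, b: (mv_hat(b) - mv_hat(a)) / (b - a)
seg1 = slope(0, 4)    # -0.925: the mirror term falls as INDUS grows
seg2 = slope(4, 8)    # -2.180
seg3 = slope(8, 30)   # -2.180 + 1.939 = -0.241
```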
MARS Creates Basis Functions in Pairs
To fully emulate the geometric concept of a knot, MARS creates basis functions in pairs
thus up to twice as many basis functions as there are distinct data values
reminiscent of CART (left and right sides of a split)
the mirror image is needed to ultimately find the right model
not all pairs are linearly independent, but they increase the flexibility of the model
For a given set of knots only a subset of mirror image basis functions will be linearly independent of the standard basis functions – MARS is clever enough to identify such cases and discard redundant pieces
However, using the mirror image INSTEAD of the standard basis function at any knot will change the model and is important for interaction detection
MARS core technology:
Forward step: add basis function pairs one at a time in conventional step-wise forward manner until the largest model size (specified by the user) is reached
The pairs are needed to fully implement the geometric sense of a knot
Possible collinearity due to redundancy in pairs must be detected and eliminated
For categorical predictors define basis functions as indicator variables for all possible subsets of levels
To support interactions, allow cross products between a new candidate pair and basis functions already present in the model
Backward step: remove basis functions one at a time in conventional step-wise backward manner to obtain a sequence of candidate models
Use test sample or cross-validation to identify the optimal model size
Missing values are treated by constructing missing value indicator (MVI) variables and nesting the basis functions within the corresponding MVIs
Fast update formulae and smart computational shortcuts exist to make the MARS process as fast and efficient as possible
MARS Process
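The steps above can be illustrated with a toy rendition of the forward step only — one predictor, no interactions, no backward pass, and a plain least-squares refit at each candidate instead of the fast update formulae. Everything here is a sketch, not Salford's algorithm:

```python
import numpy as np

def forward_step(x, y, max_terms=4):
    """Greedy forward pass: repeatedly add the direct/mirror pair
    max(x-c,0), max(c-x,0) whose knot c most reduces squared error."""
    X = np.ones((x.size, 1))                 # start from intercept only
    knots = []
    while X.shape[1] < max_terms:
        best = None
        for c in np.unique(x)[1:-1]:         # interior candidate knots
            pair = np.column_stack([np.maximum(x - c, 0),
                                    np.maximum(c - x, 0)])
            Xc = np.hstack([X, pair])
            coef, *_ = np.linalg.lstsq(Xc, y, rcond=None)
            sse = ((y - Xc @ coef) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, c, Xc)
        knots.append(best[1])                # keep the winning pair
        X = best[2]
    return knots

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 150)
y = np.abs(x - 5) + rng.normal(0, 0.05, x.size)  # V-shape: one true knot at 5
knots = forward_step(x, y, max_terms=3)          # first knot lands near 5
```

A real run would continue adding pairs past the point of overfitting and then rely on the backward pass to prune.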
Example of Categorical Predictors
Where RAD is declared categorical, MARS reports in classic output:
Basis Functions found:
BF1 = max(0, INDUS - 8.140);
BF3 = ( RAD = 1 OR RAD = 4 OR RAD = 6 OR RAD = 24);
BF13 = max(0, INDUS - 3.970);
BF3 is essentially a dummy indicator for the {1, 4, 6, 24} subset of RAD levels
MARS looks at all 2^(K-1) - 1 possible groupings of levels and ultimately chooses the one showing the greatest error reduction
A different grouping can enter the model at subsequent iterations
This machinery mimics CART and is vastly more powerful than the conventional regression approach of replacing categorical variables by a set of dummies
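The 2^(K-1) - 1 count can be verified by enumeration. The helper below is illustrative: each {subset, complement} split is represented once by the subset containing a fixed anchor level:

```python
from itertools import combinations

def candidate_groupings(levels):
    """All 2**(K-1) - 1 non-trivial ways to split K levels into
    {subset, complement}; each subset defines one 0/1 indicator."""
    levels = sorted(levels)
    anchor, rest = levels[0], levels[1:]   # fixing one level avoids
    subsets = []                           # counting complements twice
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            subset = {anchor, *combo}
            if len(subset) < len(levels):  # exclude the trivial full set
                subsets.append(subset)
    return subsets

rad_levels = [1, 2, 3, 4, 5, 6, 7, 8, 24]   # RAD has K = 9 levels
groups = candidate_groupings(rad_levels)
# len(groups) == 2**8 - 1 == 255; the slide's {1, 4, 6, 24} is among them
```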
Missing Value Handling
In one of the choice models we encountered the following MARS code:
BF10 = ( INCGT5 > .);
BF11 = ( INCGT5 = .);
BF12 = max(0, INCGT5 + .110320E-07) * BF10;
BF13 = max(0, ENVIRON - 12.000) * BF11;
BF14 = max(0, 12.000 - ENVIRON ) * BF11;
This uses income when it is available and uses the ENVIRON variable when INCGT5 is missing
Effectively this creates a surrogate variable for INCGT5
No guarantee that MARS will find a surrogate; however, MARS will search all possible surrogates in the basis-function generation stage
Unlike CART, this machinery is turned on only when missing values are present in the LEARN sample – hence, care must be exercised when scoring new data
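The MVI nesting pattern above can be sketched as follows — a simplified illustration of how a basis function gets multiplied by the non-missing indicator, not MARS's internal code:

```python
import numpy as np

def nested_basis(x, knot):
    """Build a missing-value indicator (MVI) and nest a direct basis
    function inside it, mimicking the BF10/BF12 pattern above."""
    present = (~np.isnan(x)).astype(float)   # 1 where x is observed
    missing = np.isnan(x).astype(float)      # 1 where x is missing
    bf = np.maximum(np.nan_to_num(x) - knot, 0) * present
    return present, missing, bf

x = np.array([3.0, np.nan, 7.0, np.nan, 12.0])
present, missing, bf = nested_basis(x, 5.0)
# bf -> [0, 0, 2, 0, 7]: zero wherever x is missing or below the knot
```

A second predictor's basis functions would be multiplied by `missing` instead, exactly as BF13/BF14 use ENVIRON only when INCGT5 is absent.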
Interaction Support in MARS
MARS builds up its interactions by combining a SINGLE previously-entered basis function with a PAIR of new basis functions
The “new pair” of basis functions (a standard and a mirror image) could coincide with a previously entered pair or could be a new pair in an already specified variable or a new pair in a new variable
Interactions are thus built by accretion
first one of the members of the interaction must appear as a main effect
then an interaction can be created involving this term
the second member of the interaction does NOT need to enter as a main effect (modeler might wish to require otherwise via ex post modification of model)
Generally a MARS interaction will be region-specific and look like
(PT - 18.6)+ * (RM - 6.431)+
This is not the familiar interaction PT*RM, because the interaction is active only in the data region where PT > 18.6 and RM > 6.431
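Evaluating the interaction term directly shows the region-specific behavior; the data values below are made up for illustration:

```python
import numpy as np

pt = np.array([15.0, 19.0, 20.0, 17.0])
rm = np.array([6.0, 7.0, 6.2, 7.5])

# Region-specific interaction from the slide: nonzero only where
# BOTH PT exceeds 18.6 AND RM exceeds 6.431
interact = np.maximum(pt - 18.6, 0) * np.maximum(rm - 6.431, 0)
# only the second observation (pt=19.0, rm=7.0) activates the term
```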
Boston Housing – Conventional Regression
We compare the results of classical linear regression and MARS
Top three significant predictors are shown for each model
Linear regression provides global insights
MARS regression provides local insights and has superior accuracy
All cut points were automatically discovered by MARS
MARS model can be presented as a linear regression model in the BF space
OLS Regression (R-squared 73%)
MARS Regression (R-squared 87%)
Further Reading
To the best of our knowledge, as of 2008 the Salford Systems MARS tutorial and the documentation for the MARS™ software constitute the sum total of any extended discussion of MARS. MARS is referenced in several hundred scientific publications appearing since 1994, but the reader is assumed to have read Friedman's articles.
Friedman, J. H. (1991a). Multivariate adaptive regression splines (with discussion). Annals of Statistics, 19, 1-141 (March).
Friedman, J. H. (1991b). Estimating functions of mixed ordinal and categorical variables using adaptive splines. Department of Statistics, Stanford University, Tech. Report LCS108.
Friedman, J. H. and Silverman, B. W. (1989). Flexible parsimonious smoothing and additive modeling (with discussion). Technometrics, 31, 3-39 (February).
De Veaux, R. D., Psichogios, D. C., and Ungar, L. H. (1993). A comparison of two nonparametric estimation schemes: MARS and neural networks. Computers & Chemical Engineering, 17(8).