Transcript of Introduction to MARS (2009)

Page 1: Introduction to MARS

Introduction to MARS

Dan Steinberg, Mykhaylo Golovnya

[email protected]

August, 2009

Page 2: Introduction to MARS

Salford Systems © Copyright 2009

MARS is a highly-automated tool for regression

Developed by Jerome H. Friedman of Stanford University

Annals of Statistics, 1991: a dense 65-page article

Takes some inspiration from its ancestor CART®

Produces smooth curves and surfaces, not the step-functions of CART

Appropriate target variables are continuous

End result of a MARS run is a regression model

MARS automatically chooses which variables to use

variables are optimally transformed

interactions are detected

model is self-tested to protect against over-fitting

Can also perform well on binary dependent variables

and on censored survival models (waiting-time models, as in churn analysis)

Introduction

Page 3: Introduction to MARS

Harrison, D. and D. Rubinfeld. Hedonic Housing Prices and the Demand for Clean Air. Journal of Environmental Economics and Management, v5, 81-102, 1978

506 census tracts in City of Boston for the year 1970

Goal: study relationship between quality of life variables and property values

MV median value of owner-occupied homes in tract (‘000s)

CRIM per capita crime rates

NOX concentration of nitrogen oxides (pphm)

AGE percent built before 1940

DIS weighted distance to centers of employment

RM average number of rooms per house

LSTAT percent neighborhood ‘lower socio-economic status’

RAD accessibility to radial highways

CHAS borders Charles River (0/1)

INDUS percent non-retail business

TAX tax rate

PT pupil teacher ratio

Boston Housing Dataset

Page 4: Introduction to MARS

The dataset poses significant challenges to conventional regression modeling

Clear departures from normality, non-linear relationships, and skewed distributions

Multicollinearity, mutual dependency, and outlying observations


Scatter Matrix

Page 5: Introduction to MARS

A typical MARS solution (univariate for simplicity) is shown above

Essentially a piecewise-linear regression model with continuity enforced at the transition points, called knots

The locations and number of knots were determined automatically to ensure the best possible model fit

The solution can be analytically expressed as conventional regression equations

[Figure: MARS model fit of MV (vertical axis, 0-60) against LSTAT (horizontal axis, 0-40)]

MARS Model

Page 6: Introduction to MARS

Finding the one best knot in a simple regression is a straightforward search problem

try a large number of potential knots and choose one with best R-squared

computation can be implemented efficiently using update algorithms; entire regression does not have to be rerun for every possible knot (just update X’X matrices)

Finding K knots simultaneously would require on the order of N^K computations for N observations

To preserve linear problem complexity, multiple knot placement is implemented in a step-wise manner:

Need a forward/backward procedure

The forward procedure adds knots sequentially one at a time

The resulting model will have many knots and overfit the training data

The backward procedure removes least contributing knots one at a time

This produces a list of models of varying complexity

Using an appropriate evaluation criterion, identify the optimal model

Resulting model will have approximately correct knot locations

Challenge: Searching for Multiple Knots
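The single-best-knot search described above is easy to sketch. The code below is an illustrative brute-force version (hypothetical helper name, not Salford's implementation, which uses X'X update formulae rather than refitting): for each candidate knot c it fits y = a + b*max(x - c, 0) by simple least squares and keeps the knot with the smallest squared error.

```python
def best_single_knot(x, y):
    """Brute-force single-knot search: for each candidate knot c, regress
    y on the direct basis function max(x - c, 0) plus an intercept and
    keep the knot with the smallest sum of squared errors."""
    best_c, best_sse = None, None
    for c in sorted(set(x))[:-1]:            # candidate knots at observed values
        bf = [max(xi - c, 0.0) for xi in x]  # direct basis function at c
        n = len(x)
        mb, my = sum(bf) / n, sum(y) / n
        var_b = sum((b - mb) ** 2 for b in bf)
        if var_b == 0.0:
            continue                         # degenerate knot: BF is constant
        slope = sum((b - mb) * (yi - my) for b, yi in zip(bf, y)) / var_b
        intercept = my - slope * mb
        sse = sum((yi - intercept - slope * b) ** 2 for yi, b in zip(y, bf))
        if best_sse is None or sse < best_sse:
            best_c, best_sse = c, sse
    return best_c

# Noise-free data with a single true knot at x = 30:
xs = list(range(61))
ys = [2.0 * max(xi - 30, 0) for xi in xs]
print(best_single_knot(xs, ys))  # 30
```

With noise-free data the search recovers the true knot exactly; with noisy data it lands near it, which is why the backward step is still needed.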

Page 7: Introduction to MARS

True conditional mean has two knots, at X=30 and X=60; observed data includes additional random error

Best single knot will be at X=45, subsequent best locations are true knots around 30 and 60

The backward elimination step is needed to remove the redundant knot at X=45

[Figure: two panels plotting Y (0-80) against X (0-90) for the flat top function, before and after backward elimination]

Example: Flat Top Function

Page 8: Introduction to MARS

Thinking in terms of knot selection illustrates splines well in one dimension but becomes unwieldy when working with a large number of variables simultaneously

Need a concise notation easy to program and extend in multiple dimensions

Need to support interactions, categorical variables, and missing values

Basis Functions (BF) provide analytical machinery to express the knot placement strategy

A basis function is a continuous univariate transform that restricts a predictor's influence to part of its range, controlled by a knot parameter c (20 in the example below)

Direct BF: max(X-c, 0) – the original range is cut below c

Mirror BF: max(c-X, 0) – the original range is cut above c

Basis Functions

[Figure: mirror basis function max(c-X, 0) and direct basis function max(X-c, 0) plotted against X over 0-40, knot at c = 20]
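The two transforms take only a few lines of Python (an illustrative sketch, not Salford code); with the knot at c = 20 they reproduce the values plotted above.

```python
def direct_bf(x, c):
    """Direct basis function max(x - c, 0): zero below the knot c."""
    return max(float(x) - c, 0.0)

def mirror_bf(x, c):
    """Mirror basis function max(c - x, 0): zero above the knot c."""
    return max(c - float(x), 0.0)

# Knot at c = 20, evaluated over the plotted range 0..40:
print([direct_bf(x, 20) for x in (0, 10, 20, 30, 40)])  # [0.0, 0.0, 0.0, 10.0, 20.0]
print([mirror_bf(x, 20) for x in (0, 10, 20, 30, 40)])  # [20.0, 10.0, 0.0, 0.0, 0.0]
```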

Page 9: Introduction to MARS

MARS constructs basis functions for each unique value present in a continuous variable

Each new BF results in a different number of zeroes in the transformed variable – hence the set of all BFs is linearly independent

The resulting collection is naturally resistant to multicollinearity issues

This is further reinforced by introducing a minimum-number-of-observations requirement between two consecutive knots

The Set of All Basis Functions

Page 10: Introduction to MARS

Step-Wise Model Development using BFs

Define a basis function BF1 on the variable INDUS: BF1 = max(0, INDUS - 4)

Use this function instead of INDUS in a regression: y = constant + β1*BF1 + error

This fits a model in which the effect of INDUS on the dependent variable is 0 for all values below 4 and β1 per unit of INDUS above 4

Suppose we added a second basis function BF2 to the model: BF2 = max(0, INDUS - 8)

Then our regression function would be: y = constant + β1*BF1 + β2*BF2 + error

Page 11: Introduction to MARS

Solution with 1 Basis Function

MV = 27.395 - 0.659*(INDUS - 4)+

[Figure: MV (0-40) against INDUS (0-30) with the one-basis-function fit]

Page 12: Introduction to MARS

Solution with 2 Basis Functions

MV = 30.290 - 2.439*(INDUS - 4)+ + 2.215*(INDUS - 8)+

Slope starts at 0 and then becomes -2.439 after INDUS=4

Slope on third portion (after INDUS=8) is (- 2.439 + 2.215) = -0.224
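The fitted equation can be checked directly; the finite differences below recover the three segment slopes quoted above (illustrative sketch, coefficients copied from the slide).

```python
def mv_two_bf(indus):
    """Two-basis-function model from the slide:
    MV = 30.290 - 2.439*(INDUS - 4)+ + 2.215*(INDUS - 8)+"""
    return 30.290 - 2.439 * max(indus - 4, 0) + 2.215 * max(indus - 8, 0)

# Per-unit slopes on the three segments, as finite differences:
print(round(mv_two_bf(3) - mv_two_bf(2), 3))   # 0.0    (flat below INDUS = 4)
print(round(mv_two_bf(6) - mv_two_bf(5), 3))   # -2.439 (between 4 and 8)
print(round(mv_two_bf(10) - mv_two_bf(9), 3))  # -0.224 (= -2.439 + 2.215)
```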

[Figure: MV (0-40) against INDUS (0-30) with the two-basis-function fit]

Page 13: Introduction to MARS

The following model represents a 3-knot univariate solution for the Boston Housing dataset using two direct and one mirror basis functions

BF1 = max(0, 4-INDUS) BF2 = max(0, INDUS-4) BF3=max(0, INDUS-8)

MV = 29.433 + 0.925*(4 - INDUS)+ - 2.180*(INDUS - 4)+ + 1.939*(INDUS - 8)+

All three line segments have negative slope even though two of the coefficients are positive
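The claim about the slopes is easy to verify numerically (illustrative sketch, coefficients copied from the slide): the mirror term +0.925*(4 - INDUS)+ shrinks as INDUS grows, so its segment slope is -0.925.

```python
def mv_three_bf(indus):
    """Three-basis-function model from the slide:
    MV = 29.433 + 0.925*(4 - INDUS)+ - 2.180*(INDUS - 4)+ + 1.939*(INDUS - 8)+"""
    return (29.433 + 0.925 * max(4 - indus, 0)
                   - 2.180 * max(indus - 4, 0)
                   + 1.939 * max(indus - 8, 0))

# All three segments slope downward despite two positive coefficients:
print(round(mv_three_bf(3) - mv_three_bf(2), 3))   # -0.925 (mirror BF shrinks as INDUS grows)
print(round(mv_three_bf(6) - mv_three_bf(5), 3))   # -2.18
print(round(mv_three_bf(10) - mv_three_bf(9), 3))  # -0.241 (= -2.180 + 1.939)
```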

[Figure: MV (0-40) against INDUS (0-30) with the three-basis-function fit]

Example: Solution with 3 Basis Functions

Page 14: Introduction to MARS

MARS Creates Basis Functions in Pairs

To fully emulate the geometric concept of a knot, MARS creates basis functions in pairs

thus twice as many basis functions possible as there are distinct data values

reminiscent of CART (left and right sides of a split)

the mirror image is needed to ultimately find the right model

not all linearly independent but increases flexibility of model

For a given set of knots only a subset of mirror image basis functions will be linearly independent of the standard basis functions – MARS is clever enough to identify such cases and discard redundant pieces

However, using the mirror image INSTEAD of the standard basis function at any knot will change the model and is important for interaction detection

Page 15: Introduction to MARS

MARS core technology:

Forward step: add basis function pairs one at a time in conventional step-wise forward manner until the largest model size (specified by the user) is reached

The pairs are needed to fully implement the geometric sense of a knot

Possible collinearity due to redundancy in pairs must be detected and eliminated

For categorical predictors define basis functions as indicator variables for all possible subsets of levels

To support interactions, allow cross products between a new candidate pair and basis functions already present in the model

Backward step: remove basis functions one at a time in conventional step-wise backward manner to obtain a sequence of candidate models

Use test sample or cross-validation to identify the optimal model size

Missing values are treated by constructing missing value indicator (MVI) variables and nesting the basis functions within the corresponding MVIs

Fast update formulae and smart computational shortcuts exist to make the MARS process as fast and efficient as possible

MARS Process
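A heavily simplified sketch of the forward step follows. It is stagewise on residuals with single basis functions, rather than the exact stepwise OLS refit over BF pairs that MARS performs, and all names are hypothetical; it only illustrates the greedy search over (variable, knot, direct/mirror) candidates.

```python
def forward_stage(x_cols, y, n_terms):
    """Greedy forward pass, simplified: at each step, among all
    (variable, knot, direct/mirror) candidates, add the one basis
    function that best fits the current residuals, then update them."""
    resid = list(y)
    model = []  # entries: (variable index, knot, sign, slope, intercept)
    for _ in range(n_terms):
        best = None
        for j, col in enumerate(x_cols):
            for c in sorted(set(col))[1:-1]:     # interior candidate knots
                for sign in (1, -1):             # +1 direct, -1 mirror
                    bf = [max(sign * (v - c), 0.0) for v in col]
                    n = len(bf)
                    mb, mr = sum(bf) / n, sum(resid) / n
                    var_b = sum((b - mb) ** 2 for b in bf)
                    if var_b == 0.0:
                        continue
                    slope = sum((b - mb) * (r - mr)
                                for b, r in zip(bf, resid)) / var_b
                    sse = sum((r - mr - slope * (b - mb)) ** 2
                              for r, b in zip(resid, bf))
                    if best is None or sse < best[0]:
                        best = (sse, j, c, sign, slope, mr - slope * mb, bf)
        _, j, c, sign, slope, icept, bf = best
        model.append((j, c, sign, slope, icept))
        resid = [r - icept - slope * b for r, b in zip(resid, bf)]
    return model
```

Applied to the flat top example, the first term added lands at the dominant knot; a backward pass (not shown) would then prune redundant terms using a test sample or cross-validation, as described above.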

Page 16: Introduction to MARS

Example of Categorical Predictors

Where RAD is declared categorical, MARS reports in classic output:

Basis Functions found:

BF1 = max(0, INDUS - 8.140);

BF3 = ( RAD = 1 OR RAD = 4 OR RAD = 6 OR RAD = 24);

BF13 = max(0, INDUS - 3.970);

BF3 is essentially a dummy indicator for the {1, 4, 6, 24} subset of RAD levels

MARS looks at all 2^(K-1) - 1 possible groupings of the K levels and ultimately chooses the one showing the greatest error reduction

A different grouping can enter the model at subsequent iterations

This machinery mimics CART and is vastly more powerful than the conventional regression approach of replacing categorical variables by a set of dummies
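The 2^(K-1) - 1 count is easy to verify by enumeration (illustrative code; `level_groupings` is a hypothetical helper, and the nine RAD values listed are those observed in the Boston data):

```python
from itertools import combinations

def level_groupings(levels):
    """Enumerate every split of K categorical levels into an indicator
    subset and its complement. Fixing one anchor level avoids counting
    a subset and its mirror twice, giving 2**(K-1) - 1 distinct splits."""
    levels = sorted(levels)
    anchor, rest = levels[0], levels[1:]
    splits = []
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            subset = sorted({anchor, *combo})
            if len(subset) < len(levels):   # the full set is not a split
                splits.append(subset)
    return splits

rad_levels = [1, 2, 3, 4, 5, 6, 7, 8, 24]    # nine observed RAD values
print(len(level_groupings(rad_levels)))      # 255 = 2**(9-1) - 1
```

The {1, 4, 6, 24} subset chosen as BF3 is one of these 255 candidates.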

Page 17: Introduction to MARS

Missing Value Handling

In one of the choice models we encountered the following MARS code:

BF10 = ( INCGT5 > .);

BF11 = ( INCGT5 = .);

BF12 = max(0, INCGT5 + .110320E-07) * BF10;

BF13 = max(0, ENVIRON - 12.000) * BF11;

BF14 = max(0, 12.000 - ENVIRON ) * BF11;

This uses income when it is available and uses the ENVIRON variable when INCGT5 is missing

Effectively this creates a surrogate variable for INCGT5

No guarantee that MARS will find a surrogate; however, MARS will search all possible surrogates in the basis-function generation stage

Unlike CART, this machinery is turned on only when missing values are present in the LEARN sample – hence, care must be exercised when scoring new data
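The nesting logic of BF10-BF14 can be paraphrased in Python (a sketch; `None` stands in for the classic output's missing-value code `.`):

```python
def nested_bfs(incgt5, environ):
    """Mimics the slide's basis functions: when INCGT5 is present its own
    basis function is active (BF12); when it is missing, the ENVIRON
    basis functions (BF13, BF14) take over as a surrogate."""
    present = incgt5 is not None        # BF10 = (INCGT5 > .)
    missing = incgt5 is None            # BF11 = (INCGT5 = .)
    bf12 = max(incgt5 + 1.10320e-08, 0.0) if present else 0.0  # * BF10
    bf13 = max(environ - 12.0, 0.0) if missing else 0.0        # * BF11
    bf14 = max(12.0 - environ, 0.0) if missing else 0.0        # * BF11
    return bf12, bf13, bf14

print(nested_bfs(2.0, 15.0))   # income available: only BF12 is non-zero
print(nested_bfs(None, 15.0))  # income missing: ENVIRON terms activate
```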

Page 18: Introduction to MARS

Interaction Support in MARS

MARS builds up its interactions by combining a SINGLE previously-entered basis function with a PAIR of new basis functions

The “new pair” of basis functions (a standard and a mirror image) could coincide with a previously entered pair or could be a new pair in an already specified variable or a new pair in a new variable

Interactions are thus built by accretion

first one of the members of the interaction must appear as a main effect

then an interaction can be created involving this term

the second member of the interaction does NOT need to enter as a main effect (modeler might wish to require otherwise via ex post modification of model)

Generally a MARS interaction will be region-specific and look like

(PT - 18.6)+ * (RM - 6.431)+

This is not the familiar PT*RM interaction because the effect is confined to the data region where PT > 18.6 and RM > 6.431; elsewhere at least one factor is zero
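A quick numerical check of this region-specific behaviour (illustrative sketch; knot values copied from the term above):

```python
def interaction_term(pt, rm):
    """(PT - 18.6)+ * (RM - 6.431)+ : non-zero only when BOTH factors
    are past their knots; elsewhere the interaction contributes nothing."""
    return max(pt - 18.6, 0.0) * max(rm - 6.431, 0.0)

print(interaction_term(20.0, 7.0))   # about 1.4 * 0.569 (both factors active)
print(interaction_term(17.0, 7.0))   # 0.0 (PT below its knot)
print(interaction_term(20.0, 6.0))   # 0.0 (RM below its knot)
```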

Page 19: Introduction to MARS

Boston Housing – Conventional Regression

We compare the results of classical linear regression and MARS

Top three significant predictors are shown for each model

Linear regression provides global insights

MARS regression provides local insights and has superior accuracy

All cut points were automatically discovered by MARS

MARS model can be presented as a linear regression model in the BF space

OLS Regression (R-squared 73%)

MARS Regression (R-squared 87%)

Page 20: Introduction to MARS

Further Reading

To the best of our knowledge, as of 2008 the Salford Systems MARS tutorial and the documentation for the MARS™ software constitute the only extended discussions of MARS. MARS is referenced in several hundred scientific publications appearing since 1994, but the reader is assumed to have read Friedman's articles.

Friedman, J. H. (1991a). Multivariate adaptive regression splines (with discussion). Annals of Statistics, 19, 1-141 (March).

Friedman, J. H. (1991b). Estimating functions of mixed ordinal and categorical variables using adaptive splines. Department of Statistics, Stanford University, Tech. Report LCS108.

Friedman, J. H. and Silverman, B. W. (1989). Flexible parsimonious smoothing and additive modeling (with discussion). Technometrics, 31, 3-39 (February).

De Veaux, R. D., Psichogios, D. C., and Ungar, L. H. (1993). A Comparison of Two Nonparametric Estimation Schemes: MARS and Neural Networks. Computers & Chemical Engineering, Vol. 17, No. 8.