Transcript of Comp 540 Chapter 9: Additive Models, Trees, and Related Methods, by Ryan King.

Page 1

Comp 540 Chapter 9: Additive Models, Trees, and Related Methods

Ryan King

Page 2

Overview

9.1 Generalized Additive Models
9.2 Tree-Based Methods (CART)
9.3 PRIM: Bump Hunting
9.4 MARS
9.5 Hierarchical Mixtures of Experts (HME)
9.6 Missing Data
9.7 Computational Considerations

Page 3

Generalized Additive Models

Generally have the form:

\[ \mathrm{E}[Y \mid X_1, X_2, \ldots, X_p] = \alpha + f_1(X_1) + \cdots + f_p(X_p) \]

Example: logistic regression

\[ \log\frac{\mu(X)}{1 - \mu(X)} = \alpha + \beta_1 X_1 + \cdots + \beta_p X_p \]

becomes additive logistic regression:

\[ \log\frac{\mu(X)}{1 - \mu(X)} = \alpha + f_1(X_1) + \cdots + f_p(X_p) \]

Page 4

Link Functions

• The conditional mean \( \mu(X) \) is related to an additive function of the predictors via a link function g:

\[ g[\mu(X)] = \alpha + f_1(X_1) + \cdots + f_p(X_p) \]

• Identity: \( g(\mu) = \mu \) (Gaussian)

• Logit: \( g(\mu) = \log\big(\mu / (1 - \mu)\big) \) (binomial)

• Log: \( g(\mu) = \log(\mu) \) (Poisson)
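As a concrete illustration of the three links above, here is a minimal Python sketch; the `links` dictionary is purely illustrative and not part of any particular library.

```python
import numpy as np

# The three canonical links from this slide, written as plain functions.
# Each g maps the conditional mean mu onto the scale of the additive predictor.
links = {
    "identity": lambda mu: mu,                     # Gaussian
    "logit":    lambda mu: np.log(mu / (1 - mu)),  # binomial
    "log":      lambda mu: np.log(mu),             # Poisson
}

mu = np.array([0.1, 0.5, 0.9])
print(links["logit"](mu))   # approximately [-2.197  0.  2.197]
```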

Page 5

9.1.1 Fitting Additive Models

• Ex: additive cubic splines, for the model

\[ Y = \alpha + \sum_{j=1}^{p} f_j(X_j) + \varepsilon \]

• Penalized residual sum of squares (PRSS) criterion:

\[ \mathrm{PRSS}(\alpha, f_1, \ldots, f_p) = \sum_{i=1}^{N} \Big( y_i - \alpha - \sum_{j=1}^{p} f_j(x_{ij}) \Big)^2 + \sum_{j=1}^{p} \lambda_j \int f_j''(t_j)^2 \, dt_j \]

Page 6

9.1 Backfitting Algorithm

1. Initialize:

\[ \hat\alpha = \frac{1}{N} \sum_{i=1}^{N} y_i, \qquad \hat f_j \equiv 0, \;\; \forall i, j \]

2. Cycle over \( j = 1, 2, \ldots, p, \; 1, 2, \ldots, p, \; \ldots \):

\[ \hat f_j \leftarrow S_j\Big[ \big\{ y_i - \hat\alpha - \textstyle\sum_{k \neq j} \hat f_k(x_{ik}) \big\}_{1}^{N} \Big] \]

\[ \hat f_j \leftarrow \hat f_j - \frac{1}{N} \sum_{i=1}^{N} \hat f_j(x_{ij}) \]

Until the functions \( \hat f_j \) change less than a prespecified threshold (a sketch of this loop follows).
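A minimal Python sketch of the backfitting loop, assuming a plug-in one-dimensional smoother; `backfit` and `running_mean` are illustrative names, and the crude running-mean smoother merely stands in for whichever scatterplot smoother \( S_j \) is actually used.

```python
import numpy as np

def backfit(X, y, smoother, tol=1e-6, max_cycles=100):
    """Backfitting for the additive model y ~ alpha + sum_j f_j(X[:, j]).

    smoother(x, r) fits a one-dimensional smoother of residuals r on
    feature x and returns its fitted values at the data points.
    """
    N, p = X.shape
    alpha = y.mean()                       # 1. initialize alpha-hat
    f = np.zeros((N, p))                   # f[:, j] holds f_j-hat at the data
    for _ in range(max_cycles):
        f_old = f.copy()
        for j in range(p):                 # 2. cycle over predictors
            partial = y - alpha - f.sum(axis=1) + f[:, j]   # partial residuals
            f[:, j] = smoother(X[:, j], partial)
            f[:, j] -= f[:, j].mean()      # re-center so each f_j averages to zero
        if np.max(np.abs(f - f_old)) < tol:                 # stop at threshold
            break
    return alpha, f

def running_mean(x, r, k=7):
    """A crude stand-in smoother: k-point running mean in sorted-x order."""
    order = np.argsort(x)
    out = np.empty_like(r, dtype=float)
    out[order] = np.convolve(r[order], np.ones(k) / k, mode="same")
    return out
```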

Page 7

9.1.3 Summary

• Additive models extend linear models

• Flexible, but still interpretable

• Simple, modular, backfitting procedure

• Limitations for large data-mining applications

Page 8

9.2 Tree-Based Methods

• Partition the feature space

• Fit a simple model (constant) in each partition

• Simple, but powerful

• CART: Classification and Regression Trees (Breiman et al., 1984)

Page 9

9.2 Binary Recursive Partitions

[Figure: binary recursive partitioning of the (x1, x2) feature space into rectangular regions labeled a-f, alongside the corresponding binary tree whose leaves are those regions.]

Page 10

9.2 Regression Trees

• CART is a top down (divisive) greedy procedure

• Partitioning is a local decision for each node

• A partition on variable j at value s creates regions:

\[ R_1(j, s) = \{ X \mid X_j \leq s \} \quad \text{and} \quad R_2(j, s) = \{ X \mid X_j > s \} \]

Page 11

9.2 Regression Trees

• Each node chooses (j, s) to solve:

\[ \min_{j, s} \Big[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \Big] \]

• For any choice of (j, s), the inner minimizations are solved by:

\[ \hat c_m = \mathrm{ave}\big( y_i \mid x_i \in R_m(j, s) \big), \quad m = 1, 2 \]

• Easy to scan through all choices of (j, s) to find the optimal split (sketched below)

• After the split, recur on \( R_1 \) and \( R_2 \)
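A small Python sketch of the exhaustive split scan described above; `best_split` is an illustrative name, not a reference to any particular CART implementation.

```python
import numpy as np

def best_split(X, y):
    """Scan all (j, s) pairs; return the split minimizing two-region squared error."""
    N, p = X.shape
    best = (None, None, np.inf)                  # (j, s, sum of squared errors)
    for j in range(p):
        for s in np.unique(X[:, j])[:-1]:        # candidate split points on X_j
            left = X[:, j] <= s                  # region R_1(j, s)
            right = ~left                        # region R_2(j, s)
            # inner minimization: each region predicts its own mean
            sse = ((y[left] - y[left].mean()) ** 2).sum() \
                + ((y[right] - y[right].mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, s, sse)
    return best
```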

Page 12

9.2 Cost-Complexity Pruning

• How large do we grow the tree? Which nodes should we keep?

• Grow the tree out to a fixed depth, then prune it back based on a cost-complexity criterion.

Page 13

9.2 Terminology

• A subtree \( T \subset T_0 \) implies T is a pruned version of \( T_0 \)

• Tree T has M leaf nodes, each indexed by m

• Leaf node m maps to region \( R_m \)

• \( |T| \) denotes the number of leaf nodes of T

• \( N_m \) is the number of data points in region \( R_m \)

Page 14

9.2 Cost-Complexity Pruning

• We define the cost-complexity criterion (computed in the sketch below):

\[ \hat c_m = \frac{1}{N_m} \sum_{x_i \in R_m} y_i, \qquad Q_m(T) = \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat c_m)^2 \]

\[ C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T| \]

• For \( \alpha \geq 0 \), find \( T_\alpha \subseteq T_0 \) to minimize \( C_\alpha(T) \)

• Choose \( \alpha \) by cross-validation
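A short Python sketch of the criterion, under the assumption that the tree is summarized by a vector giving each observation's leaf index; the helper name `cost_complexity` is illustrative.

```python
import numpy as np

def cost_complexity(y, leaf, alpha):
    """C_alpha(T) = sum_m N_m * Q_m(T) + alpha * |T| for a regression tree.

    leaf[i] is the index of the leaf (region R_m) containing observation i.
    """
    total = 0.0
    leaves = np.unique(leaf)
    for m in leaves:
        y_m = y[leaf == m]
        total += ((y_m - y_m.mean()) ** 2).sum()   # equals N_m * Q_m(T)
    return total + alpha * len(leaves)             # penalty alpha * |T|
```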

Page 15

9.2 Classification Trees

• We define the same cost-complexity criterion:

\[ C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T| \]

• But choose a different measure of node impurity \( Q_m(T) \), based on the class proportions in each leaf:

\[ \hat p_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k), \qquad k(m) = \arg\max_k \hat p_{mk} \]

Page 16

9.2 Impurity Measures

1. Misclassification error: \( \frac{1}{N_m} \sum_{x_i \in R_m} I\big(y_i \neq k(m)\big) = 1 - \hat p_{m, k(m)} \)

2. Gini index: \( \sum_{k \neq k'} \hat p_{mk} \hat p_{mk'} = \sum_{k=1}^{K} \hat p_{mk} (1 - \hat p_{mk}) \)

3. Cross-entropy: \( -\sum_{k=1}^{K} \hat p_{mk} \log \hat p_{mk} \)

All three are computed in the sketch below.
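The three measures in a few lines of Python, taking as input the vector of class proportions \( \hat p_{mk} \) for a single node; the function name is illustrative.

```python
import numpy as np

def impurities(p_hat):
    """Return (misclassification error, Gini index, cross-entropy) for one node."""
    p_hat = np.asarray(p_hat, dtype=float)
    misclass = 1.0 - p_hat.max()                       # 1 - p_hat_{m,k(m)}
    gini = (p_hat * (1.0 - p_hat)).sum()               # sum_k p(1 - p)
    entropy = -(p_hat * np.log(p_hat + 1e-12)).sum()   # -sum_k p log p (guarded at 0)
    return misclass, gini, entropy

print(impurities([0.5, 0.5]))   # (0.5, 0.5, ~0.693): the two-class worst case
```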

Page 17

9.2 Categorical Predictors

1. How do we handle categorical variables, e.g. \( X_1 \in \{a, b, c, d\} \)?

2. In general, there are \( 2^{q-1} - 1 \) possible partitions of q values into two groups

3. Trick for the 0-1 case: sort the predictor classes by the proportion falling in outcome class 1, then partition as if the predictor were ordered (sketched below)
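A quick Python sketch of the 0-1 trick from item 3; `order_categories` is an illustrative helper, and the returned ordering is what a numeric split scan would then operate on.

```python
import numpy as np

def order_categories(x_cat, y01):
    """Sort category levels by the proportion of observations with y = 1."""
    levels = np.unique(x_cat)
    props = [y01[x_cat == lv].mean() for lv in levels]
    return levels[np.argsort(props)]   # scan binary splits along this ordering

x = np.array(list("abacdbdc"))
y = np.array([1, 0, 1, 0, 1, 0, 1, 1])
print(order_categories(x, y))          # ['b' 'c' 'a' 'd']
```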

Page 18

9.2 CART Example

1. Examples…

Page 19

9.3 PRIM-Bump Hunting

• Partition based, but not tree-based

• Seeks boxes where the response average is high

• Top-down algorithm

Page 20

Patient Rule Induction Method

1. Start with all of the data and a maximal box

2. Shrink the box by compressing one face, to peel off a fraction α of the observations. Choose the peeling that produces the highest response mean (sketched after this list).

3. Repeat step 2 until some minimal number of observations remain

4. Expand the box along any face, as long as the resulting box mean increases

5. Steps 1-4 give a sequence of boxes; use cross-validation to choose a member of the sequence, and call that box B1

6. Remove B1 from the dataset, then repeat the process to find another box, as desired
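A minimal sketch of one peeling step (step 2), assuming each box is stored as a (p, 2) array of [low, high] bounds per predictor; `peel` is an illustrative name, and the quantile-based peel is one simple way to remove roughly a fraction alpha of the observations.

```python
import numpy as np

def peel(X, y, box, alpha=0.10):
    """Try compressing each face of `box`; keep the peel with highest mean response."""
    inside = np.all((X >= box[:, 0]) & (X <= box[:, 1]), axis=1)
    best_mean, best_box = -np.inf, None
    for j in range(X.shape[1]):
        for side, q in ((0, alpha), (1, 1.0 - alpha)):   # lower face, upper face
            cand = box.copy()
            cand[j, side] = np.quantile(X[inside, j], q) # compress one face
            keep = inside & (X[:, j] >= cand[j, 0]) & (X[:, j] <= cand[j, 1])
            if keep.any() and y[keep].mean() > best_mean:
                best_mean, best_box = y[keep].mean(), cand
    return best_box, best_mean
```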

Page 21

9.3 PRIM Summary

• Can handle categorical predictors, as CART does

• Designed for regression; two-class classification can be handled by coding the response as 0-1

• Non-trivial to deal with k>2 classes

• More patient than CART

Page 22

9.4 Multivariate Adaptive Regression Splines (MARS)

• Generalization of stepwise linear regression

• Modification of CART to improve regression performance

• Able to capture additive structure

• Not tree-based

Page 23

9.4 MARS Continued

• Additive model with an adaptive set of basis functions

• Basis built up from simple piecewise linear functions: the reflected pair

\[ (x - t)_+ \quad \text{and} \quad (t - x)_+ \]

• The set C is the candidate set of linear splines, with knots at each observed data value \( x_{ij} \); models are built from elements of C or their products (sketched below):

\[ C = \big\{ (X_j - t)_+, \; (t - X_j)_+ \;:\; t \in \{x_{1j}, x_{2j}, \ldots, x_{Nj}\}, \; j = 1, 2, \ldots, p \big\} \]
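The reflected pair and the candidate set C are simple to write down in Python; `hinge` and `candidate_knots` are illustrative names.

```python
import numpy as np

def hinge(x, t):
    """The reflected pair of piecewise linear basis functions with knot t."""
    return np.maximum(x - t, 0.0), np.maximum(t - x, 0.0)

def candidate_knots(X):
    """Knot candidates for C: every observed value x_ij of every predictor j."""
    return {j: np.unique(X[:, j]) for j in range(X.shape[1])}
```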

Page 24

9.4 MARS Procedure

Model has the form:

\[ f(X) = \beta_0 + \sum_{m=1}^{M} \beta_m h_m(X) \]

1. Given a choice for the \( h_m \), the coefficients \( \beta_m \) are chosen by standard linear regression.

2. Start with \( h_0(X) = 1 \). All functions in C are candidate functions.

3. At each stage, consider as a new basis-function pair all products of a function \( h_l \) in the model set \( \mathcal{M} \) with one of the reflected pairs in C:

\[ \hat\beta_{M+1} \, h_l(X) \, (X_j - t)_+ + \hat\beta_{M+2} \, h_l(X) \, (t - X_j)_+, \qquad h_l \in \mathcal{M} \]

4. We add to the model the pair that most reduces the residual squared error, i.e. terms of the form (a sketch of this forward pass follows):

\[ \hat h_m(X) \, (X_j - t)_+ \quad \text{and} \quad \hat h_m(X) \, (t - X_j)_+ \]
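A compact, deliberately brute-force Python sketch of this forward pass, under simplifying assumptions: it fits every candidate pair by least squares and keeps the pair with the lowest residual sum of squares, and it does not enforce the restriction that each input appear at most once per product. All names are illustrative.

```python
import numpy as np

def forward_pass(X, y, max_terms=11):
    """Greedy MARS-style forward step: repeatedly add the best reflected pair."""
    N, p = X.shape
    B = [np.ones(N)]                            # start with h_0(X) = 1
    while len(B) < max_terms:
        best_rss, best_pair = np.inf, None
        for h in B:                             # existing term h_l in the model
            for j in range(p):
                for t in np.unique(X[:, j]):    # knots at observed values
                    pair = [h * np.maximum(X[:, j] - t, 0.0),
                            h * np.maximum(t - X[:, j], 0.0)]
                    Bc = np.column_stack(B + pair)
                    beta, *_ = np.linalg.lstsq(Bc, y, rcond=None)
                    rss = ((y - Bc @ beta) ** 2).sum()
                    if rss < best_rss:
                        best_rss, best_pair = rss, pair
        B += best_pair
    return np.column_stack(B)                   # basis matrix for the final fit
```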

Page 25

9.4 Choosing Number of Terms

• Large models can overfit.

• Backward deletion procedure: delete terms which cause the smallest increase in residual squared error, to give sequence of models.

• Pick the model using generalized cross-validation (GCV):

\[ \mathrm{GCV}(\lambda) = \frac{\sum_{i=1}^{N} \big( y_i - \hat f_\lambda(x_i) \big)^2}{\big( 1 - M(\lambda)/N \big)^2} \]

• \( M(\lambda) \) is the effective number of parameters in the model: \( M(\lambda) = r + cK \), where r is the number of linearly independent basis functions, K is the number of knots, and c = 3

• Choose the model that minimizes \( \mathrm{GCV}(\lambda) \) (computed below)
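The GCV criterion is a one-liner once the fitted values and the effective-parameter count are in hand; this Python sketch uses the slide's \( M(\lambda) = r + cK \) with c = 3 (the function name is illustrative).

```python
import numpy as np

def gcv(y, y_hat, r, K, c=3.0):
    """GCV = RSS / (1 - M(lambda)/N)^2 with M(lambda) = r + c*K."""
    N = len(y)
    M = r + c * K                        # effective number of parameters
    rss = ((y - y_hat) ** 2).sum()
    return rss / (1.0 - M / N) ** 2
```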

Page 26

9.4 MARS Summary

• Basis functions operate locally

• Forward modeling is hierarchical: multiway products are built up only from terms already in the model

• Each input appears only once in each product

• A useful option is to set a limit on the order of interaction: a limit of two allows only pairwise products, and a limit of one results in an additive model

Page 27

9.5 Hierarchical Mixture of Experts (HME)

• Variant of tree based methods

• Soft splits, not hard decisions

• At each node, an observation goes left or right with probability depending on its input values

• Smooth parameter optimization, instead of discrete split point search

Page 28

9.5 HMEs continued

• Linear (or logistic) regression model fit at each leaf node (Expert)

• Splits can be multi-way, instead of binary

• Splits are probabilistic functions of linear combinations of inputs (a gating network), rather than functions of single inputs (sketched below)

• Formally a mixture model
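A tiny Python sketch of one soft (gating) split, assuming a softmax over K linear scores of the input; `gate` and the parameter matrix V are illustrative, not the book's notation.

```python
import numpy as np

def gate(x, V):
    """Probability of routing input x down each of K branches at one node.

    V is a (K, p) matrix: one linear score per branch (the gating network).
    """
    scores = V @ x
    e = np.exp(scores - scores.max())    # numerically stable softmax
    return e / e.sum()                   # branch probabilities summing to 1
```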


Page 30

9.6 Missing Data

• Quite common to have data with missing values for one or more input features

• Missing values may or may not distort the data, depending on the mechanism that produced them

• For response vector y, let \( X_{obs} \) denote the observed entries of X, \( Z = (X, y) \), and \( Z_{obs} = (X_{obs}, y) \); R is an indicator matrix for missing values

Page 31

9.6 Missing Data

• Missing at random (MAR):

\[ P(R \mid Z, \theta) = P(R \mid Z_{obs}, \theta) \]

• Missing completely at random (MCAR):

\[ P(R \mid Z, \theta) = P(R \mid \theta) \]

• MCAR is a stronger assumption

Page 32

9.6 Dealing with Missing Data

Three approaches for handling MCAR data:

1. Discard observations with missing features

2. Rely on the learning algorithm to deal with missing values in its training phase

3. Impute all the missing values before training

Page 33

9.6 Dealing…MCAR

• If few values are missing, (1) may work

• For (2), CART can work well with missing values via surrogate splits. Additive models can assume average values.

• (3) is necessary for most algorithms. The simplest tactic is to impute the mean or median of the non-missing values for that feature (sketched below).

• If features are correlated, can build predictive models for missing features in terms of known features
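A minimal Python sketch of the simplest imputation tactic, assuming missing entries are coded as NaN; `impute_mean` is an illustrative helper.

```python
import numpy as np

def impute_mean(X):
    """Tactic (3), simplest form: replace each NaN with its column's observed mean."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)      # per-feature mean over observed values
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X
```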

Page 34

9.7 Computational Considerations

• For N observations and p predictors:

• Additive models: \( O(pN \log N + mpN) \)

• Trees: \( O(pN \log N) \)

• MARS: \( O(NM^3 + pM^2 N) \) for a model with M basis functions

• HME, at each step: \( O(Np^2 K^2) \)