Machine Learning in Real World: CART

Page 1

Machine Learning in Real World: CART

Page 2

Outline

CART Overview and Gymtutor Tutorial Example

Splitting Criteria

Handling Missing Values

Pruning

Finding the Optimal Tree

Page 3

CART – Classification And Regression Trees

Developed 1974–1984 by four statistics professors:
Leo Breiman (Berkeley), Jerry Friedman (Stanford), Charles Stone (Berkeley), Richard Olshen (Stanford)

Focused on accurate assessment when data is noisy

Currently distributed by Salford Systems

Page 4

CART Tutorial Data: Gymtutor (CART HELP, Sec 3 in CARTManual.pdf)

ANYRAQT   Racquet ball usage (binary indicator coded 0, 1)
ONAER     Number of on-peak aerobics classes attended
NSUPPS    Number of supplements purchased
OFFAER    Number of off-peak aerobics classes attended
NFAMMEM   Number of family members
TANNING   Number of visits to tanning salon
ANYPOOL   Pool usage (binary indicator coded 0, 1)
SMALLBUS  Small business discount (binary indicator coded 0, 1)
FIT       Fitness score
HOME      Home ownership (binary indicator coded 0, 1)
PERSTRN   Personal trainer (binary indicator coded 0, 1)
CLASSES   Number of classes taken
SEGMENT   Member's market segment (1, 2, 3) – target

Page 5

View data

CART Menu: View -> Data Info …

Page 6

CART Example: Gymtutor

Page 7

CART Model Setup

Target -- required

Predictors (default – all)

Categorical ANYRAQT, ANYPOOL, SMALLBUS, HOME

Categorical: if field name ends in “$”, or from values

Testing: default is 10-fold cross-validation

Page 8

Sample Tree

Page 9

Color-coding using class

Page 10

Decision Tree: Splitters

Page 11

Tree Details

Page 12

Tree Summary Reports

Page 13

Pruning the tree

Page 14

Keeping only important variables

Page 15

Revised Tree

Page 16

Automating CART: Command Log

Page 17

Automated field selection: handles any number of fields; automatically selects relevant fields

No data preprocessing needed: does not require any kind of variable transforms

Impervious to outliers

Missing value tolerant: moderate loss of accuracy due to missing values

Key CART features

Page 18

Tree growing

Splitting rules to generate tree

Stopping criteria: how far to grow?

Missing values: using surrogates

Tree pruning

Trimming off parts of the tree that don’t work

Ordering the nodes of a large tree by contribution to tree accuracy … which nodes come off first?

Optimal tree selection

Deciding on the best tree after growing and pruning

Balancing simplicity against accuracy

CART: Key Parts of Tree Structured Data Analysis

Page 19

Data is split into two partitions

Q: Does C4.5 always have binary partitions?

Partitions can also be split into sub-partitions

hence procedure is recursive

CART tree is generated by repeated partitioning of data set

parent gets two children

each child produces two grandchildren

four grandchildren produce eight great-grandchildren

CART is a form of Binary Recursive Partitioning
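The recursion above can be sketched in a few lines of Python. This is a minimal illustration, not the actual CART splitter: it splits at an assumed median threshold at every level, so one parent yields two children, then four grandchildren, and so on.

```python
# A minimal sketch of binary recursive partitioning (illustrative only; the
# median threshold is an assumption, not how CART chooses its splits).

def grow(xs, depth=0, max_depth=2):
    """Recursively partition the values `xs` into a binary tree of dicts."""
    if depth >= max_depth or len(xs) < 2:
        return {"leaf": sorted(xs)}               # terminal node keeps its cases
    s = sorted(xs)
    thr = s[len(s) // 2]                          # median as an illustrative threshold
    left = [x for x in xs if x < thr]
    right = [x for x in xs if x >= thr]
    if not left or not right:                     # split separated nothing; stop
        return {"leaf": sorted(xs)}
    return {"split": thr,
            "left": grow(left, depth + 1, max_depth),
            "right": grow(right, depth + 1, max_depth)}

tree = grow(list(range(8)))   # one parent, two children, four grandchildren
```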

Page 20

Is continuous variable X ≤ c?

Does categorical variable D take on levels i, j, or k? is GENDER M or F ?

Standard split: if the answer to the question is YES, a case goes left; otherwise it goes right

this is the form of all primary splits

example: Is AGE ≤ 62.5?

More complex conditions are possible:

Boolean combinations: AGE <= 62 OR BP <= 91

Linear combinations: .66*AGE - .75*BP < -40

Splits always determined by questions with YES/NO answers

Page 21

For any node CART will examine ALL possible splits.

CART allows search over a random sample if desired

Look at first variable in our data set AGE with minimum value 40

Test split: Is AGE ≤ 40?

Will separate out the youngest persons to the left

Could be many cases if many people have the same AGE

Next increase the AGE threshold to the next youngest person: Is AGE ≤ 43?

This will direct additional cases to the left

Continue increasing the splitting threshold value by value

each value is tested for how good the split is . . . how effective it is in separating the classes from each other

Q: Do we need to consider splits between values of the same class?

Searching all Possible Splits
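The exhaustive threshold scan described above can be sketched as follows. This is a minimal illustration with a hypothetical AGE column and made-up class labels; it scores each candidate split by the weighted Gini impurity of the two sides.

```python
# Sketch of CART's exhaustive split search on one continuous variable:
# try "Is X <= t?" at every distinct value t and keep the best-scoring split.

def impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum of p_j squared."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best 'X <= t' split."""
    n = len(values)
    best_t, best_score = None, float("inf")
    for t in sorted(set(values)):                 # one candidate per distinct value
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        if not left or not right:                 # sends every case one way; skip
            continue
        score = len(left) / n * impurity(left) + len(right) / n * impurity(right)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Toy AGE example in the spirit of the slides (hypothetical labels)
t, s = best_split([40, 43, 45, 48], ["SURVIVE", "SURVIVE", "DEAD", "DEAD"])
```

Here "AGE ≤ 43" separates the classes perfectly, so the weighted impurity of the best split is 0.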

Page 22

Sorted by Age:

AGE  BP   SINUST  SURVIVE
40   91   0       SURVIVE
40   110  0       SURVIVE
40   83   1       DEAD
43   99   0       SURVIVE
43   78   1       DEAD
43   135  0       SURVIVE
45   120  0       SURVIVE
48   119  1       DEAD
48   122  0       SURVIVE
49   150  0       DEAD
49   110  1       SURVIVE

Sorted by Blood Pressure:

AGE  BP   SINUST  SURVIVE
43   78   1       DEAD
40   83   1       DEAD
40   91   0       SURVIVE
43   99   0       SURVIVE
40   110  0       SURVIVE
49   110  1       SURVIVE
48   119  1       DEAD
45   120  0       SURVIVE
48   122  0       SURVIVE
43   135  0       SURVIVE
49   150  0       DEAD

Split Tables

Q: Where do splits need to be evaluated?

Page 23

If a data set T contains examples from n classes, the gini index gini(T) is defined as

gini(T) = 1 - sum_j p_j^2,  summing over classes j = 1, ..., n

where p_j is the relative frequency of class j in T.

gini(T) is minimized if the classes in T are skewed toward a single class.

Advanced: CART also has other splitting criteria; Twoing is recommended for multi-class problems.

CART Splitting Criteria: Gini Index
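The formula can be checked directly in Python (a small illustration with made-up labels):

```python
# The Gini index from the slide, written out: gini(T) = 1 - sum_j p_j**2,
# where p_j is the relative frequency of class j in T.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

pure = gini(["a", "a", "a", "a"])     # skewed to one class: minimum impurity, 0.0
mixed = gini(["a", "a", "b", "b"])    # evenly mixed two classes: maximum, 0.5
```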

Page 24

CHAID treats missing as a distinct categorical value, e.g., AGE is 25–44, 45–64, 65–95, or missing

this method was also adopted by C4.5

If missing is a distinct value, then all cases with missing values go the same way in the tree

Assumption: whatever the unknown value is, it is the same for all cases with a missing value

Problem: there can be more than one reason for a database field to be missing. E.g., Income as a splitter wants to separate high from low

Which levels are most likely to be missing? High Income AND Low Income!

We don't want to send both groups to the same side of the tree

Missing as a distinct splitter value

Page 25

CART Treatment of Missing Primary Splitters: Surrogates

CART uses a more refined method: a surrogate is used as a stand-in for a missing primary field

a surrogate should be a valid replacement for the primary

Consider our example of INCOME

Other variables like Education or Occupation might work as good surrogates

Higher education people usually have higher incomes

People in high income occupations will usually (though not always) have higher incomes

Using surrogates means that cases missing the primary are not all treated the same way

whether a case goes left or right depends on its surrogate value

thus it is record-specific . . . some cases go left, others go right

Page 26

A primary splitter is the best splitter of a node

A surrogate is a splitter that splits in a fashion similar to the primary

Surrogate — variable with near equivalent information

Why useful? If the primary is expensive or difficult to gather and the surrogate is not, then consider using the surrogate instead

the loss in predictive accuracy might be slight

If the primary splitter is MISSING, then CART will use a surrogate

if the top surrogate is missing, CART uses the 2nd-best surrogate, etc.

If all surrogates are also missing, CART uses the majority rule

Surrogates Mimicking Alternatives to Primary Splitters
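The fallback chain described above can be sketched as follows. The field names, thresholds, and majority direction are all hypothetical; the point is that routing tries the primary, then each surrogate in order, and only falls back to the majority rule when every splitter is missing.

```python
# Sketch of CART-style routing with surrogates (hypothetical splitters).
# Each splitter is (field, threshold, left_if_le): "Is field <= threshold?"
# with left_if_le saying whether a YES answer sends the case left.

def route(case, primary, surrogates, majority="left"):
    """Return 'left' or 'right' for one record (a dict of field -> value)."""
    for field, threshold, left_if_le in [primary] + surrogates:
        value = case.get(field)                 # absent key models a missing value
        if value is not None:
            goes_left = (value <= threshold) == left_if_le
            return "left" if goes_left else "right"
    return majority                             # every splitter was missing

primary = ("INCOME", 50000, True)               # Is INCOME <= 50000? yes -> left
surrogates = [("EDUCATION", 12, True), ("OCCUPATION_RANK", 3, True)]

a = route({"INCOME": 30000}, primary, surrogates)   # primary available
b = route({"EDUCATION": 16}, primary, surrogates)   # 1st surrogate used
c = route({}, primary, surrogates)                  # all missing: majority rule
```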

Page 27

You will never know when to stop . . . so don’t!

Instead . . . grow trees that are obviously too big

Largest tree grown is called “maximal” tree

A maximal tree could have hundreds or thousands of nodes, so we usually instruct CART to grow only moderately too big

rule of thumb: grow trees about twice the size of the truly best tree

This becomes the first stage in finding the best tree

Next we will have to get rid of the parts of the overgrown tree that don't work (are not supported by test data)

CART Pruning Method: Grow Full Tree, Then Prune

Page 28

Maximal Tree Example

Page 29

Take a very large tree (“maximal” tree)

Tree may be radically over-fit

Tracks all the idiosyncrasies of THIS data set

Tracks patterns that may not be found in other data sets

At the bottom of the tree, splits are based on very few cases

Analogous to a regression with very large number of variables

PRUNE away branches from this large tree

But which branch to cut first?

CART determines a pruning sequence:

the exact order in which each node should be removed

pruning sequence determined for EVERY node

sequence determined all the way back to root node

Tree Pruning

Page 30

Pruning: Which nodes come off next?

Page 31

Prune away the "weakest link": the nodes that add least to the overall accuracy of the tree

the contribution to the overall tree is a function of both the increase in accuracy and the size of the node

accuracy gain is weighted by share of sample

small nodes tend to get removed before large ones

If several nodes have the same contribution, they are all pruned away simultaneously

Hence more than two terminal nodes could be cut off in one pruning

Sequence determined all the way back to root node

need to allow for possibility that entire tree is bad

if target variable is unpredictable we will want to prune back to root . . . the no model solution

Order of Pruning: Weakest Link Goes First
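Weakest-link pruning can be illustrated with made-up numbers. For each internal node we compute the error increase from collapsing its subtree, divided by the number of terminal nodes removed; the node with the smallest ratio contributes least per leaf and is pruned first. The node names and statistics below are hypothetical.

```python
# Weakest-link (cost-complexity) pruning in miniature. For internal node t:
#     alpha(t) = (R(t) - R(T_t)) / (|leaves(T_t)| - 1)
# where R(t) is the error if the subtree is collapsed to a leaf, R(T_t) the
# error of the subtree, and |leaves(T_t)| its number of terminal nodes.

def weakest_link(nodes):
    """nodes: dict name -> (R_if_collapsed, R_of_subtree, n_leaves)."""
    def alpha(stats):
        r_node, r_subtree, leaves = stats
        return (r_node - r_subtree) / (leaves - 1)
    return min(nodes, key=lambda name: alpha(nodes[name]))

nodes = {
    "A": (0.30, 0.10, 5),   # alpha = 0.05
    "B": (0.12, 0.10, 2),   # alpha = 0.02  <- smallest: the weakest link
    "C": (0.25, 0.05, 3),   # alpha = 0.10
}
first_pruned = weakest_link(nodes)
```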

Page 32

Pruning Sequence Example

24 Terminal Nodes → 21 Terminal Nodes → 20 Terminal Nodes → 18 Terminal Nodes

Page 33

Now we test every tree in the pruning sequence

Take a test data set and drop it down the largest tree in the sequence, measuring its predictive accuracy

how many cases right and how many wrong

measure accuracy overall and by class

Do same for 2nd largest tree, 3rd largest tree, etc

Performance of every tree in sequence is measured

Results reported in table and graph formats

Note that this critical stage is impossible to complete without test data

CART procedure requires test data to guide tree evaluation

Page 34

Compare error rates measured by

learn data

large test set

Learn R(T) always decreases as tree grows (Q: Why?)

Test R(T) first declines then increases (Q: Why?)

Overfitting is the result of too much reliance on the learn R(T)

Can lead to disasters when applied to new data

No. Terminal Nodes   R(T)   Rts(T)
71                   .00    .42
63                   .00    .40
58                   .03    .39
40                   .10    .32
34                   .12    .32
19                   .20    .31
**10                 .29    .30
9                    .32    .34
7                    .41    .47
6                    .46    .54
5                    .53    .61
2                    .75    .82
1                    .86    .91

Training Data Vs. Test Data Error Rates

Page 35

First, it provides a rough guide of how you are doing; the truth will typically be WORSE than the training data measure

If the tree performs poorly even on the training data, we may not want to pursue it further

Training data error rate more accurate for smaller trees

So reasonable guide for smaller trees

Poor guide for larger trees

At the optimal tree, training and test error rates should be similar

if not, something is wrong

it is useful to compare not just the overall error rate but also within-node performance between training and test data

Why look at training data error rates (or cost) at all?

Page 36

Within a single CART run which tree is best?

Process of pruning the maximal tree can yield many sub-trees

Test data set or cross-validation measures the error rate of each tree

Current wisdom — select the tree with smallest error rate

Only drawback — minimum may not be precisely estimated

Typical error rate as a function of tree size has flat region

Minimum could be anywhere in this region

The Best Pruned Subtree: An Estimation Problem

[Figure: error R(Tk) vs. tree size Tk (0 to 50 terminal nodes, error between 0 and 1); the curve declines, is flat near its minimum, then rises]

CART: Optimal Tree

Page 37

Original monograph recommends NOT choosing minimum error tree because of possible instability of results from run to run

Instead suggest SMALLEST TREE within 1 SE of the minimum error tree

Tends to provide very stable results from run to run

Is possibly as accurate as minimum cost tree yet simpler

Current learning: the one SE rule is good for small data sets

For large data sets one should pick the most accurate tree

known as the zero SE rule

One SE Rule -- One Standard Error Rule
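The rule can be sketched with illustrative numbers (the tree sizes, errors, and standard errors below are made up, not taken from the slides): among the pruned subtrees, take the smallest tree whose cross-validated error is within one standard error of the minimum.

```python
# Sketch of the one-SE rule on a pruning sequence (illustrative numbers).

def one_se_tree(trees):
    """trees: list of (n_terminal_nodes, cv_error, se_of_error) tuples."""
    best_err, best_se = min((err, se) for _, err, se in trees)
    cutoff = best_err + best_se                   # one-SE band above the minimum
    eligible = [t for t in trees if t[1] <= cutoff]
    return min(eligible, key=lambda t: t[0])      # smallest eligible tree

trees = [(34, 0.120, 0.015),    # minimum-error tree
         (19, 0.125, 0.015),    # within one SE of the minimum, and smaller
         (10, 0.140, 0.015),
         (9,  0.160, 0.020)]
chosen = one_se_tree(trees)     # the 19-node tree under the one-SE rule
```

The zero SE rule would instead return the 34-node tree, the outright minimum.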

Page 38

Optimal tree has lowest or near lowest cost as determined by a test procedure

Tree should exhibit very similar accuracy when applied to new data

BUT Best Tree is NOT necessarily the one that happens to be most accurate on a single test database

trees somewhat larger or smaller than “optimal” may be preferred

Room for user judgment: the judgment is not about split variables or values

judgment as to how much of tree to keep

determined by story tree is telling

willingness to sacrifice a small amount of accuracy for simplicity

In what sense is the optimal tree “best”?

Page 39

CART Summary

CART Key Features binary splits

gini index as the splitting criterion

grow, then prune

surrogates for missing values

optimal tree – 1 SE rule

lots of nice graphics

Page 40

Decision Tree Summary

Decision Trees

splits: binary, multi-way

split criteria: entropy, gini, …

missing value treatment

pruning

rule extraction from trees

Both C4.5 and CART are robust tools

No method is always superior – experiment!

witten & eibe