Introduction to CART® Salford Systems ©2007
Data Mining with Decision Trees: An Introduction to CART®
Dan Steinberg
Mikhail Golovnya (golomi@salford-systems.com)
http://www.salford-systems.com
In The Beginning…
In 1984, Berkeley and Stanford statisticians announced a remarkable new classification tool which:
- Could separate relevant from irrelevant predictors
- Did not require any kind of variable transforms (logs, square roots)
- Was impervious to outliers and missing values
- Could yield relatively simple and easy-to-comprehend models
- Required little to no supervision by the analyst
- Was frequently more accurate than traditional logistic regression, discriminant analysis, and other parametric tools
The Years of Struggle
CART was slow to gain widespread recognition in the late 1980s and early 1990s, for several reasons:
- The monograph introducing CART is challenging to read: a brilliant book overflowing with insights into tree-growing methodology, but fairly technical and brief
- The method was not expounded in any textbooks
- It was originally taught only in advanced graduate statistics classes at a handful of leading universities
- The original standalone software came with slender documentation, and its output was not self-explanatory
- The method was such a radical departure from conventional statistics
The Final Triumph
- Rising interest in data mining: availability of huge data sets requiring analysis; the need to automate or accelerate and improve the analysis process
- Comparative performance studies
- Advantages of CART over other tree methods: handling of missing values; assistance in interpreting results (surrogates); performance advantages in speed and accuracy
- New software and documentation made the techniques accessible to end users
- Word of mouth generated by early adopters
- Easy-to-use professional software available from Salford Systems
So What is CART?
Best illustrated with a famous example: the UCSD Heart Disease study.
- Given a diagnosis of heart attack based on chest pain, indicative EKGs, and elevation of enzymes typically released by damaged heart muscle
- Predict who is at risk of a 2nd heart attack and early death within 30 days
- The prediction will determine the treatment program (intensive care or not)
- For each patient about 100 variables were available, including demographics, medical history, and lab results; 19 noninvasive variables were used in the analysis
Typical CART Solution
An example of a CLASSIFICATION tree: the dependent variable is categorical (SURVIVE, DIE), and we want to predict class membership.

- Root: PATIENTS = 215 (SURVIVE 178, 82.8%; DEAD 37, 17.2%). Split: Is BP <= 91?
- Terminal Node A (BP <= 91): SURVIVE 6 (30.0%), DEAD 14 (70.0%). NODE = DEAD
- BP > 91: PATIENTS = 195 (SURVIVE 172, 88.2%; DEAD 23, 11.8%). Split: Is AGE <= 62.5?
- Terminal Node B (AGE <= 62.5): SURVIVE 102 (98.1%), DEAD 2 (1.9%). NODE = SURVIVE
- AGE > 62.5: PATIENTS = 91 (SURVIVE 70, 76.9%; DEAD 21, 23.1%). Split: Is SINUS <= .5?
- Terminal Node C (SINUS > .5): SURVIVE 14 (50.0%), DEAD 14 (50.0%). NODE = DEAD
- Terminal Node D (SINUS <= .5): SURVIVE 56 (88.9%), DEAD 7 (11.1%). NODE = SURVIVE
How to Read It
- The entire tree represents a complete analysis or model and has the form of a decision tree
- The root of the inverted tree contains all the data
- The root gives rise to child nodes, which can in turn give rise to their own children
- At some point a given path ends in a terminal node, which classifies the object
- The path through the tree is governed by the answers to QUESTIONS or RULES
Decision Questions
- A question has the form: is the statement TRUE?
- Is continuous variable X <= c?
- Does categorical variable D take on level i, j, or k? (e.g., Is geographic region 1, 2, 4, or 7?)
- Standard split: if the answer to the question is YES, a case goes left; otherwise it goes right. This is the form of all primary splits
- The question is formulated so that only two answers are possible: this is called binary partitioning
- In CART the YES answer always goes left
The Tree is a Classifier
- Classification is determined by following a case's path down the tree
- Terminal nodes are associated with a single class; any case arriving at a terminal node is assigned to that class
- The implied classification threshold is constant across all nodes and depends on the currently set priors, costs, and observation weights (discussed later)
- Other algorithms often use only the majority rule, or a limited set of fixed rules, for node class assignment and are thus less flexible
- In standard classification, tree assignment is not probabilistic
- Another type of CART tree, the class probability tree, does report distributions of class membership in nodes (discussed later)
- With large data sets we can take the empirical distribution in a terminal node to represent the distribution of classes
The Importance of Binary Splits
- Some researchers have suggested using 3-way, 4-way, etc. splits at each node (univariate)
- There are strong theoretical and practical reasons to argue against such approaches:
- Exhaustive search for a binary split has linear algorithmic complexity; the alternatives have much higher complexity
- Choosing the best binary split is "less ambiguous" than the suggested alternatives
- Most importantly, binary splits result in the smallest rate of data fragmentation, allowing a larger set of predictors to enter the model
- Any multi-way split can be replaced by a sequence of binary splits
- There are numerous practical illustrations of the validity of these claims
Accuracy of a Tree
- For classification trees, all objects reaching a terminal node are classified the same way: e.g., all heart attack victims with BP > 91 and AGE less than 62.5 are classified as SURVIVORS
- The classification is the same regardless of other medical history and lab results
- The performance of any classifier can be summarized in the Prediction Success Table (a.k.a. Confusion Matrix)
- The key to comparing the performance of different models is how one "wraps" the entire prediction success matrix into a single number: overall accuracy; average within-class accuracy; or, most generally, the Expected Cost
Prediction Success Table

TRUE class                Classified "Survivors"   Classified "Early Deaths"   Total   % Correct
Class 1 "Survivors"       158                      20                          178     88%
Class 2 "Early Deaths"    9                        28                          37      75%
Total                     167                      48                          215     86.51%

A large number of different terms exist for the various simple ratios based on this table: sensitivity, specificity, hit rate, recall, overall accuracy, false/true positive, false/true negative, predictive value, etc.
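The slide's summary ratios can be reproduced with a short sketch (plain Python, not CART output); the counts are taken from the table above.

```python
# Prediction success table: rows are true classes, columns predicted classes.
table = {
    "Survivors":    {"Survivors": 158, "Early Deaths": 20},
    "Early Deaths": {"Survivors": 9,   "Early Deaths": 28},
}

def class_accuracy(cls):
    """Within-class accuracy: correct predictions over the row total."""
    row = table[cls]
    return row[cls] / sum(row.values())

def overall_accuracy():
    """Overall accuracy: diagonal total over the grand total."""
    correct = sum(table[c][c] for c in table)
    total = sum(sum(row.values()) for row in table.values())
    return correct / total

print(round(100 * overall_accuracy(), 2))   # 86.51, matching the table
```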
Tree Interpretation and Use
- Practical decision tool: trees like this are used for real-world decision making (rules for physicians; a decision tool for a nationwide team of salesmen needing to classify potential customers)
- Selection of the most important prognostic variables: a variable screen for parametric model building (the example data set had 100 variables; use the results to choose variables for a logistic regression)
- Insight: in our example, AGE is not relevant if BP is not high, which suggests which interactions will be important
Binary Recursive Partitioning
- The data is split into two partitions, hence "binary" partitioning
- Partitions can in turn be split into sub-partitions, hence the procedure is recursive
- A CART tree is generated by repeated dynamic partitioning of the data set: it automatically determines which variables to use and which regions to focus on; the whole process is entirely data driven
Three Major Stages
- Tree growing: consider all possible splits of a node; find the best split according to the given splitting rule (best improvement); continue sequentially until the largest tree is grown
- Tree pruning: create a sequence of nested trees (the pruning sequence) by systematically removing the weakest branches
- Optimal tree selection: use a test sample or cross-validation to find the best tree in the pruning sequence
General Workflow
[Diagram: Historical Data is divided into Learn, Test, and Validate samples. Stage 1 builds a sequence of nested trees on the Learn sample; Stage 2 monitors performance on the Test sample to pick the Best tree; Stage 3 confirms the findings on the Validate sample.]
Large Trees and Nearest Neighbor Classifier
Fear of large trees is unwarranted!
- To reach a terminal node of a large tree, many conditions must be met
- Any new case meeting all these conditions will be very similar to the training case(s) in that terminal node; hence the new case should be similar on the target variable as well
- Key result: the largest possible tree has a misclassification rate R asymptotically no greater than twice the Bayes rate R*
- The Bayes rate is the error rate of the most accurate classifier possible given the data
- It is thus useful to check the largest tree's error rate: for example, if the largest tree has a relative error rate of .58, the theoretically best possible rate is .29
The exact analysis follows.
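A small sketch of the arithmetic behind this check, assuming the coarse bound R <= 2R* for the slide's quick estimate (the helper names here are ours, not CART's):

```python
from math import sqrt

def bayes_floor_coarse(tree_error):
    """Smallest Bayes rate consistent with the coarse bound R <= 2*R*."""
    return tree_error / 2

def bayes_floor_sharp(tree_error):
    """Smallest R* with 2*R*(1 - R*) >= tree_error, inverting the sharper
    bound; valid for absolute error rates tree_error <= 0.5."""
    return (1 - sqrt(1 - 2 * tree_error)) / 2

# The slide's quick estimate for a largest-tree relative error of .58:
print(bayes_floor_coarse(0.58))   # 0.29
```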
Derivation
- Assume a binary outcome and let p(x) be the conditional probability of the focus class given x
- The conditional Bayes risk at point x is then: r*(x) = min[ p(x), 1 − p(x) ]
- Now suppose the actual prediction at x is governed by the class membership of the nearest neighbor x_n; the conditional NN risk is then: r(x) = p(x)[1 − p(x_n)] + [1 − p(x)] p(x_n), which tends to 2 p(x)[1 − p(x)]
- Hence r(x) = 2 r*(x)[1 − r*(x)] in large samples, or in the global sense: R* ≤ R ≤ 2 R* (1 − R*)
Illustration
The error of the largest tree is asymptotically squeezed between the curves R* (below) and 2R*(1 − R*) (above); 2R* is a coarser upper bound. [Chart: R*, 2R*(1 − R*), and 2R* plotted for R* from 0 to 0.5.]
Impact on the Relative Error
RE<x> stands for the asymptotically largest possible relative error of the largest tree, assuming an initial proportion <x> for the smallest of the target classes (the smallest of the priors). More unbalanced data can substantially inflate the relative errors of the largest trees. [Chart: curves RE0.5, RE0.4, RE0.3, RE0.2, RE0.1 plotted against R* from 0 to 0.5.]
Searching All Possible Splits
- For any node, CART will examine ALL possible splits: computationally intensive, but there are only a finite number of splits
- Consider first the variable AGE, which in our data set has minimum value 40
- The rule "Is AGE <= 40?" will separate out these cases to the left: the youngest persons
- Next, increase the AGE threshold to the next youngest person: "Is AGE <= 43?" will direct six cases to the left
- Continue increasing the splitting threshold, value by value
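This enumeration is easy to sketch in plain Python (illustrative, not Salford's implementation); the ages below are taken from the split table on the next slide:

```python
# One "is AGE <= c?" question per unique observed value except the largest,
# giving (number of unique values - 1) candidate splits.
ages = [40, 40, 40, 43, 43, 43, 45, 48, 48, 49, 49]

def candidate_questions(values):
    uniq = sorted(set(values))
    return [f"Is AGE <= {v}?" for v in uniq[:-1]]  # largest value splits nothing

print(candidate_questions(ages))
print(sum(a <= 43 for a in ages))   # 6 cases go left under "Is AGE <= 43?"
```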
Split Tables
Split tables are used to locate continuous splits. There will be as many splits as there are unique values, minus one.

Sorted by Age:
AGE  BP   SINUST  SURVIVE
40   91   0       SURVIVE
40   110  0       SURVIVE
40   83   1       DEAD
43   99   0       SURVIVE
43   78   1       DEAD
43   135  0       SURVIVE
45   120  0       SURVIVE
48   119  1       DEAD
48   122  0       SURVIVE
49   150  0       DEAD
49   110  1       SURVIVE

Sorted by Blood Pressure:
AGE  BP   SINUST  SURVIVE
43   78   1       DEAD
40   83   1       DEAD
40   91   0       SURVIVE
43   99   0       SURVIVE
40   110  0       SURVIVE
49   110  1       SURVIVE
48   119  1       DEAD
45   120  0       SURVIVE
48   122  0       SURVIVE
43   135  0       SURVIVE
49   150  0       DEAD
Categorical Predictors
- Categorical predictors spawn many more splits: just four levels A, B, C, D result in 7 splits
- In general, there will be 2^(K-1) − 1 splits for K levels
- CART considers all possible splits based on a categorical predictor unless the number of categories exceeds 15

     Left    Right
1    A       B, C, D
2    B       A, C, D
3    C       A, B, D
4    D       A, B, C
5    A, B    C, D
6    A, C    B, D
7    A, D    B, C
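The enumeration can be sketched as follows (illustrative Python, not CART's internals); fixing one level on the left avoids double-counting mirror-image partitions, which is where the 2^(K-1) − 1 count comes from:

```python
from itertools import combinations

def categorical_splits(levels):
    """All distinct left/right partitions of a set of categorical levels."""
    levels = list(levels)
    first, rest = levels[0], levels[1:]
    splits = []
    # Fix the first level on the left so mirror images are not counted twice.
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {first, *combo}
            right = set(levels) - left
            if right:  # skip the trivial "everything goes left" partition
                splits.append((left, right))
    return splits

print(len(categorical_splits("ABCD")))   # 7, matching the table above
```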
Binary Target Shortcut
- When the target is binary, special algorithms reduce compute time to linear in the number of levels (K − 1 splits to examine instead of 2^(K-1) − 1): 30 levels takes twice as long as 15, not 10,000+ times as long
- When you have high-level categorical variables in a multi-class problem:
- Create a binary classification problem
- Try different definitions of the DPV (which groups to combine)
- Explore the predictor groupings produced by CART
- From a study of all the results, decide which are most informative
- Create a new grouped predictor variable for the multi-class problem
GYM Model Example
- The original data comes from a financial institution and was disguised as a health club
- Problem: we need to understand a market-research clustering scheme
- Clusters were created externally, using 18 variables and conventional clustering software
- We need to find simple rules describing cluster membership
- A CART tree provides a nice way to arrive at an intuitive story
Variable Definitions
ID        Identification # of member (not used)
CLUSTER   Cluster assigned from clustering scheme (10-level categorical coded 1-10) (target)
ANYRAQT   Racquet ball usage (binary indicator coded 0, 1)
ONRCT     Number of on-peak racquet ball uses
ANYPOOL   Pool usage (binary indicator coded 0, 1)
ONPOOL    Number of on-peak pool uses
PLRQTPCT  Percent of pool and racquet ball usage
TPLRCT    Total number of pool and racquet ball uses
ONAER     Number of on-peak aerobics classes attended
OFFAER    Number of off-peak aerobics classes attended
SAERDIF   Difference between number of on- and off-peak aerobics visits
TANNING   Number of visits to tanning salon
PERSTRN   Personal trainer (binary indicator coded 0, 1)
CLASSES   Number of classes taken
NSUPPS    Number of supplements/vitamins/frozen dinners purchased
SMALLBUS  Small business discount (binary indicator coded 0, 1)
OFFER     Terms of offer
IPAKPRIC  Index variable for package price
MONFEE    Monthly fee paid
FIT       Fitness score
NFAMMEN   Number of family members
HOME      Home ownership (binary indicator coded 0, 1)
All variables other than ID and CLUSTER serve as predictors.
Initial Runs
- Dataset: gymsmall.syd
- Set the target and predictors
- Set a 50% test sample
- Navigate the largest, smallest, and optimal trees
- Check [Splitters] and [Tree Details]
- Observe the logic of pruning
- Check [Summary Reports]
- Understand competitors and surrogates
- Understand variable importance
Pruning Multiple Nodes
- Note that Terminal Node 5 has class assignment 1
- All the remaining nodes have class assignment 2
- Pruning merges Terminal Nodes 5 and 6, creating a "chain reaction" of redundant split removal
Pruning Weaker Split First
- Eliminating nodes 3 and 4 results in "lesser" damage: losing one correct dominant-class assignment
- Eliminating nodes 1 and 2, or 5 and 6, would result in losing one correct minority-class assignment
Understanding Variable Importance
- Both FIT and CLASSES are reported as equally important
- Yet the tree only includes CLASSES
- ONPOOL is one of the main splitters, yet it has very low importance
- We want to understand why
Competitors and Surrogates
Node details for the root-node split are shown above.
- Improvement measures the class separation between the two sides
- Competitors are the next best splits in the given node, ordered by improvement value
- Surrogates are the splits most closely resembling (case by case) the main split
- Association measures the degree of resemblance
- Looking for competitors is free; finding surrogates is expensive
- It is possible to have an empty list of surrogates
- FIT is a perfect surrogate (and competitor) for CLASSES
The Utility of Surrogates
- Suppose the main splitter is INCOME, taken from a survey form
- People with very low or very high incomes are likely not to report them: informative missingness in both tails of the income distribution
- The splitter itself wants to separate low income on the left from high income on the right; hence it is unreasonable to treat missing INCOME as a separate unique value (the usual way to handle such a case)
- Using a surrogate helps resolve the ambiguity and redistributes the missing-income records between the two sides
- For example, all low-education subjects will join the low-income side while all high-education subjects join the high-income side
Competitors and Surrogates Are Different
- By symmetry, both splits shown must yield identical improvements
- Yet one split cannot be substituted for the other
- Hence Split X and Split Y are perfect competitors (same improvement) but very poor surrogates (low association)
- The reverse does not hold: a perfect surrogate is always a perfect competitor
[Setup: three nominal classes A, B, C, equally present, with equal priors, unit costs, etc. Split X sends {A, B} left and {C} right; Split Y sends {A, C} left and {B} right.]
Calculating Association
- Ten observations in a node
- The main splitter X and Split Y mismatch on 2 observations
- The default rule sends all cases to the dominant side of the main splitter, mismatching 3 observations
- Association = (Default − Split Y) / Default = (3 − 2)/3 = 1/3
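The association computation above can be sketched directly (illustrative Python; the 'L'/'R' vectors are a hypothetical encoding of case directions, not CART output):

```python
def association(main, surrogate):
    """Association of a candidate surrogate with the main splitter."""
    mismatches = sum(m != s for m, s in zip(main, surrogate))
    # The default rule sends every case to the main split's dominant side.
    dominant = max(main.count('L'), main.count('R'))
    default_mismatches = len(main) - dominant
    return (default_mismatches - mismatches) / default_mismatches

main_x  = list('LLLLLLLRRR')   # ten cases: 7 go left, 3 go right
split_y = list('LLLLLRRRRR')   # disagrees with X on 2 cases
print(association(main_x, split_y))   # (3 - 2) / 3
```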
Notes on Association
- The measure is local to a node and non-parametric in nature
- Association is negative for splits with very poor resemblance and is not bounded below by −1
- Any positive association is useful (a better strategy than the default rule); hence surrogates are defined as the splits with the highest positive association
- When calculating case mismatches, check the possibility of split reversal (if the rule is met, send cases to the right); CART marks straight splits with "s" and reverse splits with "r"
Calculating Variable Importance
- List all main splitters and all surrogates, along with the resulting improvements
- Aggregate this list by variable, accumulating improvements
- Sort the result in descending order
- Divide each entry by the largest value and scale to 100%
- The resulting importance list is based exclusively on the given tree; if a different tree is grown, the importance list will most likely change
- Variable importance reveals a deeper structure, in stark contrast to a casual look at the main splitters
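A minimal sketch of this aggregation (the (variable, improvement) records below are illustrative, not from an actual CART run):

```python
from collections import defaultdict

# Hypothetical (variable, improvement) records from splitters and surrogates.
records = [
    ("CLASSES", 0.50), ("FIT", 0.50),      # root: main splitter and surrogate
    ("ONPOOL", 0.05), ("TANNING", 0.02),   # deeper nodes, smaller improvements
]

def variable_importance(records):
    totals = defaultdict(float)
    for var, imp in records:
        totals[var] += imp                 # accumulate improvements by variable
    top = max(totals.values())
    # Scale so the most important variable scores 100, sorted descending.
    return sorted(((v, 100 * t / top) for v, t in totals.items()),
                  key=lambda kv: -kv[1])

for var, score in variable_importance(records):
    print(f"{var:8s} {score:6.1f}")
```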
Mystery Solved
- In our example, improvements in the root node dominate the whole tree
- Consequently, the variable importance list reflects the surrogates table of the root node
Penalizing Individual Variables
- We use the variable penalty control to penalize CLASSES a little bit
- This is enough to hand FIT the main splitter position
- The improvement associated with CLASSES is now downgraded by 1 minus the penalty factor
Penalizing Missing and Categorical Predictors
- Since splits on variables with missing values are evaluated on a reduced set of observations, such splits may have an unfair advantage
- Introducing a systematic penalty, set to the proportion of the variable that is missing, effectively removes the possible bias
- The same holds for categorical variables with a large number of distinct levels
- We recommend that the user always set both controls to 1
Splitting Rules - Introduction
- The splitting rule controls the tree-growing stage
- At each node, a finite pool of all possible splits exists; the pool is independent of the target variable and is based solely on the observations (and corresponding predictor values) in the node
- The task of the splitting rule is to rank all splits in the pool by their improvements: the winner becomes the main splitter, and the top runners-up become competitors
- Different splitting rules can be constructed, emphasizing various approaches to the definition of the "best split"
Priors Adjusted Probabilities
- Assume K target classes
- Introduce prior probabilities π_i, i = 1, …, K
- Let N_i, i = 1, …, K be the number of LEARN observations within each class
- Consider node t containing n_i, i = 1, …, K LEARN records of each target class
- The priors-adjusted probability that a case lands in node t (the node probability) is: p_t = Σ_i (n_i / N_i) π_i
- The conditional probability of class i, given that a case reached node t, is: p_i(t) = (n_i / N_i) π_i / p_t
- This conditional node probability of a class is called the within-node probability and is used as an input to the splitting rule
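These definitions are easy to verify numerically (plain Python sketch; the counts reuse the heart-attack example's Terminal Node A, and PRIORS EQUAL is assumed):

```python
# Learn-sample class totals N_i, node counts n_i, and priors pi_i.
N     = {"SURVIVE": 178, "DEAD": 37}
n     = {"SURVIVE": 6,   "DEAD": 14}     # e.g. Terminal Node A counts
prior = {"SURVIVE": 0.5, "DEAD": 0.5}    # PRIORS EQUAL

# Node probability: p_t = sum_i (n_i / N_i) * pi_i
p_t = sum(n[i] / N[i] * prior[i] for i in N)

# Within-node probabilities: p_i(t) = (n_i / N_i) * pi_i / p_t
p_in_node = {i: (n[i] / N[i]) * prior[i] / p_t for i in N}

print(p_in_node)   # the two within-node probabilities sum to 1
```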
GINI Splitting Rule
- Introduce the node t impurity as: I_t = 1 − Σ_i p_i²(t) = Σ_{i≠j} p_i(t) p_j(t)
- Observe that a pure node has zero impurity
- The impurity function is convex and is maximized when all within-node probabilities are equal
- Consider any split, and let p_left be the probability that a case goes left given that we are in the parent node, and p_right the probability that a case goes right: p_left = p_t(left) / p_t(parent) and p_right = p_t(right) / p_t(parent)
- The GINI split improvement is then defined as: ΔI = I_parent − p_left I_left − p_right I_right
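A minimal sketch of the impurity and improvement formulas above (plain Python; the probability vectors are illustrative):

```python
def gini(p):
    """GINI impurity of a within-node probability vector."""
    return 1 - sum(pi * pi for pi in p)

def gini_improvement(p_parent, p_left, p_right, f_left):
    """Split improvement; f_left is the probability of going left
    given the parent node (f_right = 1 - f_left)."""
    return gini(p_parent) - f_left * gini(p_left) - (1 - f_left) * gini(p_right)

print(gini([1.0, 0.0]))   # 0.0: a pure node has zero impurity
print(gini([0.5, 0.5]))   # 0.5: maximal for an even two-class mix
# A perfect split of an even mix recovers the full parent impurity:
print(gini_improvement([0.5, 0.5], [1.0, 0.0], [0.0, 1.0], 0.5))   # 0.5
```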
Illustration
[Surface plot: the GINI impurity function for a 3-class target over the probability simplex; impurity is zero at the vertices and reaches its maximum at the even mix (0.33, 0.33).]
GINI Properties
- The impurity reduction is weighted according to the size of the node upon which it operates
- The GINI splitting rule sequentially minimizes the overall tree impurity (defined as the sum, over all terminal nodes, of node impurity times node probability)
- The splitting rule favors end-cut splits (in which one class is separated from the rest), which follows from its mathematical expression
Example
- GYM cluster example, using the 10-level target
- The GINI splitting rule separates classes 3 and 8 from the rest in the root node
ENTROPY Splitting Rule
- Similar in spirit to GINI
- The only difference is the definition of the impurity function: I_t = −Σ_i p_i(t) log p_i(t)
- The function is convex, zero for pure nodes, and maximized for an equal mix of within-node probabilities
- Because of the different shape, a tree may evolve differently
[Chart: impurity on a binary target; normalized GINI and ENTROPY curves plotted against the class probability from 0 to 1, both peaking at 0.5.]
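The impurity function can be sketched as follows (base-2 logs assumed here; the base only rescales the curve):

```python
from math import log2

def entropy_impurity(p):
    """Entropy impurity of a within-node probability vector; zero-probability
    classes contribute nothing (the p*log p term vanishes in the limit)."""
    return sum(pi * log2(1 / pi) for pi in p if pi > 0)

print(entropy_impurity([1.0, 0.0]))   # 0.0: pure node
print(entropy_impurity([0.5, 0.5]))   # 1.0: maximal at an even mix
```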
Example
- GYM cluster example, using the 10-level target
- The ENTROPY splitting rule separates six classes on the left from four classes on the right in the root node
TWOING Splitting Rule
- Group the K target classes into two super-classes in the parent node according to the rule: C1 = {i : p_i(t_left) ≥ p_i(t_right)}, C2 = {i : p_i(t_left) < p_i(t_right)}
- Compute the GINI improvement as if we had a binary target defined by these super-classes
- This usually results in more balanced splits
- Target classes of a "similar" nature tend to group together: all minivans are separated from the pickup trucks, ignoring the fine differences within each group
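The super-class construction can be sketched directly (illustrative Python; the within-node probability vectors are made up for the example):

```python
def twoing_superclasses(p_left, p_right):
    """Group classes into super-classes C1/C2 by comparing their
    within-node probabilities on the two sides of a candidate split."""
    c1 = {i for i in p_left if p_left[i] >= p_right[i]}
    c2 = set(p_left) - c1
    return c1, c2

# Hypothetical within-node probabilities on the left and right children:
p_left  = {"A": 0.4, "B": 0.1, "C": 0.3, "D": 0.2}
p_right = {"A": 0.1, "B": 0.4, "C": 0.2, "D": 0.3}
print(twoing_superclasses(p_left, p_right))   # {A, C} versus {B, D}
```

GINI improvement is then computed on the induced two-class problem, as the slide describes.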
Example
- GYM cluster example, using the 10-level target
- The TWOING splitting rule separates five classes on the left from five on the right in the root node, producing a very balanced split
Best Possible Split
- Suppose we have a K-class multinomial problem, with p_i the within-node probabilities of the parent node and p_left the probability of the left direction
- Further assume that the largest possible pool of binary splits is available (say, using a categorical splitter with all unique values)
- Question: what is the resulting best split under each of the splitting rules?
- GINI: separate the major class j from the rest, where p_j = max_i(p_i): the end-cut approach!
- TWOING and ENTROPY: group classes together so that |p_left − 0.5| is minimized: the balanced approach!
Example
GINI trees usually grow narrow and deep, while TWOING trees usually grow shallow and wide.
- Parent node: A 40, B 30, C 20, D 10
- GINI best split: {A 40} versus {B 30, C 20, D 10}
- TWOING best split: {A 40, D 10} versus {B 30, C 20}
Symmetric GINI
- One possible way to bring the cost matrix actively into the splitting rule: I_t = Σ_{i≠j} C_ij p_i(t) p_j(t), where C_ij stands for the cost of misclassifying class i as class j
- Note that the costs are automatically symmetrized, which renders the rule less useful for nominal targets with asymmetric costs
- The rule is most useful when handling an ordinal outcome with costs proportional to the rank distance between levels
- Another "esoteric" use is to cancel asymmetric costs during the tree-growing stage for some reason, while allowing asymmetric costs during pruning and model selection
Ordered TWOING
- Assume an ordered categorical target
- Original twoing ignores the order of the classes during the creation of super-classes (a strictly nominal approach)
- Ordered twoing restricts super-class creation to ordered sets of levels; the rest is identical to twoing
- The rule is most useful when an ordered separation of the target classes is desirable
- But recall that the splitting rule has no control over the pool of available splits; it only evaluates and suggests the best possible split, so the "magic" of ordered twoing is somewhat limited
- Using symmetric GINI with properly chosen costs usually provides more flexibility in this case
Class Probability Rule
- During the pruning stage, any pair of adjacent terminal nodes with identical class assignments is automatically eliminated
- This is because by default CART builds a classifier, and keeping such a pair of nodes would contribute nothing to the model's classification accuracy
- In reality we might want to see such splits: for example, a "very likely" responder node next to a "likely" responder node could be informative
- The class probability splitting rule does exactly that
- However, it comes at a price: for convoluted theoretical reasons, optimal class probability trees become over-pruned
- Therefore, only experienced CART users should try this
Example
Favor Even Splits Control
- Splitting-rule behavior can be refined further with the "Favor Even Splits" parameter (a.k.a. Power)
- This control is most sensitive with the twoing splitting rule and was actually "inspired" by it
- We recommend that the user always consider at least the following minimal set of modeling runs: GINI with Power=0; ENTROPY with Power=0; TWOING with Power=1
- BATTERY RULES can be used to accomplish this automatically
Fundamental Tree Building Mechanisms
- Splitting rules have a somewhat limited impact on the overall tree structure, especially for binary targets
- We now introduce the powerful PRIORS and COSTS controls, which have profound implications for all stages of tree development and evaluation
- These controls lie at the core of successful model-building techniques
Key Players
Using historical data, build a model for appropriately chosen sets of priors and costs. [Diagram: the Population, the Dataset, the Analyst, the priors π_i, and the costs C_ij feed the DM Engine, which produces the Model.]
Fraud Example
- There are 1909 observations, with 42 outcomes of interest
- Such an extreme class distribution poses many modeling challenges
- The effect of key control settings is usually magnified on such data
- We will use this dataset to illustrate the workings of priors and costs below
Fraud Example – PRIORS EQUAL (0.5, 0.5)
- High accuracy, at 93%
- Richest node has 37%
- Small relative error: 0.145
- Low class assignment threshold, at 2.2%
Fraud Example – PRIORS SPECIFY (0.9, 0.1)
- Reasonable accuracy, at 64%
- Richest node has 57%
- Larger relative error: 0.516
- Class assignment threshold around 15%
Fraud Example – PRIORS DATA (0.98, 0.02)
- Poor accuracy, at 26%
- Richest node has 91%
- High relative error: 0.857
- Class assignment threshold at 50%
Internal Class Assignment Rule
- Recall the definition of the within-node probability: p_i(t) = (n_i / N_i) π_i / p_t
- The internal class assignment rule is always the same: assign the node to the class with the highest p_i(t)
- For PRIORS DATA this reduces to the simple node-majority rule: p_i(t) ∝ n_i
- For PRIORS EQUAL this reduces to assigning the class with the highest lift relative to its root-node presence: p_i(t) ∝ n_i / N_i
Putting It All Together
- As the prior on a given class increases: the class assignment threshold decreases; node richness goes down; but class accuracy goes up
- PRIORS EQUAL uses the root-node class ratio as the class assignment threshold: the most favorable conditions for building a tree
- PRIORS DATA uses the majority rule as the class assignment threshold: difficult modeling conditions on unbalanced classes
The Impact of Priors
- We have a mixture of two overlapping classes
- The vertical lines show the root-node splits for three sets of priors: (r=0.1, b=0.9), (r=0.5, b=0.5), and (r=0.9, b=0.1); the left child is classified as red, the right child as blue
- Varying the priors provides effective control over the tradeoff between class purity and class accuracy
Model Evaluation
Knowing the model's Prediction Success (PS) table on a given dataset, calculate the Expected Cost for the given cost matrix, assuming the prior distribution of classes in the population. [Diagram: the model's PS table (cells n_ij), the priors π_i, the costs C_ij, and the Dataset feed the Evaluator, which reports the Expected Cost to the Analyst.]
Estimating Expected Cost
The following estimator is used:
    R = Σ_{i≠j} C_ij (n_ij / N_i) π_i
- For PRIORS DATA and unit costs, π_i = N_i / N, so R = [Σ_{i≠j} n_ij] / N, the overall misclassification rate
- For PRIORS EQUAL and unit costs, π_i = 1/K, so R = Σ_i [(Σ_{j≠i} n_ij) / N_i] / K, the average within-class misclassification rate
- The relative cost is computed by dividing by the expected cost of the trivial classifier (all cases assigned to the dominant cost-adjusted class)
PRIORS DATA – Relative Cost
- Expected cost: R = (31 + 5)/1909 = 36/1909 = 1.9%
- Trivial model: R0 = 42/1909 = 2.2%, already a very small error value
- Relative cost: Rel = R / R0 = 36/42 = 0.857
- The relative cost drops below 1.0 only if the total number of misclassified cases falls below 42: a very difficult requirement
PRIORS EQUAL – Relative Cost
- Expected cost: R = (138/1867 + 3/42) / 2 = 7.3%
- Trivial model: R0 = (0/1867 + 42/42) / 2 = 50%, a very large error value independent of the initial class distribution
- Relative cost: Rel = R / R0 = 138/1867 + 3/42 = 0.145
- The relative cost will be below 1.0 for a wide variety of modeling outcomes; the minimum usually tends to equalize the individual class accuracies
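The cost arithmetic of the last two slides can be reproduced in a short sketch (plain Python; the misclassification counts are those quoted above):

```python
def expected_cost(miss, N, priors):
    """R = sum_i prior_i * (misclassified_i / N_i), assuming unit costs."""
    return sum(priors[i] * miss[i] / N[i] for i in range(len(N)))

N = [1867, 42]   # learn-sample class totals (1909 cases in all)

# PRIORS DATA: prior_i = N_i / N_total -> plain misclassification rate.
data_priors = [Ni / sum(N) for Ni in N]
R_data  = expected_cost([31, 5], N, data_priors)     # 36/1909
R0_data = expected_cost([0, 42], N, data_priors)     # trivial: all class 0
print(round(R_data / R0_data, 3))                    # 0.857

# PRIORS EQUAL: prior_i = 1/2 -> average within-class error rate.
R_eq  = expected_cost([138, 3], N, [0.5, 0.5])
R0_eq = expected_cost([0, 42], N, [0.5, 0.5])        # 0.5
print(round(R_eq / R0_eq, 3))                        # 0.145
```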
Using Equivalent Costs
- Setting the ratio of costs to mimic the ratio of the DATA priors reproduces the PRIORS EQUAL run, as seen in both the tree structure and the relative error value
- On binary problems, both mechanisms produce an identical range of models
Introduction to CART® Salford Systems ©2007 71
Automated CART Runs
Starting with CART 6.0 we have added a new feature called batteries
A battery is a collection of runs obtained via a predefined procedure
The simplest batteries vary one of the key CART parameters:
RULES – run every major splitting rule
ATOM – vary the minimum parent node size
MINCHILD – vary the minimum child node size
DEPTH – vary the tree depth
NODES – vary the maximum number of nodes allowed
FLIP – flip the learn and test samples (two runs only)
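A DEPTH-style battery can be sketched with scikit-learn's DecisionTreeClassifier standing in for CART (an assumption; Salford's implementation differs) on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_learn, X_test, y_learn, y_test = train_test_split(X, y, random_state=0)

battery = {}
for depth in (1, 2, 3, 5, 8):          # DEPTH: vary the maximum tree depth
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_learn, y_learn)
    battery[depth] = tree.score(X_test, y_test)   # test-sample accuracy

best = max(battery, key=battery.get)
print(best, round(battery[best], 3))
```

The same loop, with a different parameter substituted, yields ATOM-, MINCHILD-, or NODES-style batteries.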
More Batteries
DRAW – each run uses a randomly drawn learn sample from the original data
CV – vary the number of folds used in cross-validation
CVR – repeated cross-validation, each time with a different random number seed
KEEP – select a given number of predictors at random from the initial larger set; supports a core set of predictors that must always be present
ONEOFF – one-predictor model for each predictor in the given list
The remaining batteries are more advanced and will be discussed individually
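A CVR-style battery can be sketched with scikit-learn as a stand-in for CART: repeat cross-validation with a different shuffle seed each time and inspect the spread of the scores:

```python
import statistics
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

means = []
for seed in range(5):                  # CVR: a new shuffle seed per repeat
    cv = KFold(n_splits=10, shuffle=True, random_state=seed)
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    means.append(cross_val_score(tree, X, y, cv=cv).mean())

# The standard deviation across repeats shows how sensitive the
# cross-validated estimate is to the random fold assignment.
print(round(statistics.mean(means), 3), round(statistics.stdev(means), 4))
```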
Battery MCT
The primary purpose is to test for possible model overfitting by running a Monte Carlo simulation:
Randomly permute the target variable
Build a CART tree
Repeat a number of times
The resulting family of profiles measures the intrinsic overfitting due to the learning process
Compare the actual run profile to the family of random profiles to check the significance of the results
This battery is especially useful when the relative error of the original run exceeds 90%
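The MCT procedure can be sketched as a permutation test, again with scikit-learn's tree as a stand-in for CART and synthetic data:

```python
import random
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_learn, X_test, y_learn, y_test = train_test_split(X, y, random_state=0)

def fit_score(target):
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(X_learn, target)
    return tree.score(X_test, y_test)

actual = fit_score(y_learn)            # profile of the real model

rng = random.Random(0)
null_scores = []
for _ in range(20):                    # MCT: permute the target, regrow, repeat
    shuffled = list(y_learn)
    rng.shuffle(shuffled)
    null_scores.append(fit_score(shuffled))

# Fraction of random-target runs that match or beat the real model;
# a small value suggests the real model is not an artifact of overfitting.
p_like = sum(s >= actual for s in null_scores) / len(null_scores)
print(round(actual, 3), p_like)
```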
Battery MVI
Designed to check the impact of missing value indicators (MVIs) on model performance:
Build a standard model; missing values are handled via the mechanism of surrogates
Build a model using MVIs only – measures the amount of predictive information contained in the missing value indicators alone
Combine the two
Same as above, with missing value penalties turned on
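Constructing the missing value indicators themselves is straightforward; here is a minimal sketch with made-up records (the field names are illustrative):

```python
rows = [
    {"age": 34,   "income": 52000},
    {"age": None, "income": 41000},
    {"age": 58,   "income": None},
]

def add_mvis(rows, fields):
    """Append one 0/1 indicator column per field: 1 means the value was missing."""
    out = []
    for r in rows:
        e = dict(r)
        for f in fields:
            e[f + "_mvi"] = 1 if r[f] is None else 0
        out.append(e)
    return out

with_mvis = add_mvis(rows, ["age", "income"])
print(with_mvis[1]["age_mvi"], with_mvis[1]["income_mvi"])   # 1 0
```

An MVIs-only model is then simply a tree grown on the `*_mvi` columns alone.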
Battery PRIOR
Vary the prior for the specified class within a given range
For example, vary the prior on class 1 from 0.2 to 0.8 in increments of 0.02
Results in a family of models with drastically different performance in terms of class richness and accuracy (see the earlier discussion)
Can be used for hot spot detection
Can also be used to construct a very efficient voting ensemble of trees
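A PRIOR-style sweep can be approximated in scikit-learn by varying class weights, which play the role of priors here (an assumption; coarser 0.1 steps are used for brevity):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Imbalanced synthetic data: roughly 90% class 0, 10% class 1.
X, y = make_classification(n_samples=400, weights=[0.9], random_state=0)
X_learn, X_test, y_learn, y_test = train_test_split(X, y, random_state=0)

richness = {}
for p in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):   # sweep the class-1 "prior"
    tree = DecisionTreeClassifier(class_weight={0: 1 - p, 1: p},
                                  max_depth=4, random_state=0)
    tree.fit(X_learn, y_learn)
    pred = tree.predict(X_test)
    n1 = max(int(np.sum(y_test == 1)), 1)
    richness[p] = int(np.sum((pred == 1) & (y_test == 1))) / n1  # class-1 recall

print(richness)
```

Raising the class-1 weight generally trades class-0 accuracy for class-1 recall, which is the richness/accuracy trade-off the slide refers to.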
Battery SHAVING
Automates the variable selection process
Shaving from the top – at each step the most important variable is eliminated
Capable of detecting and eliminating “model hijackers” – variables that appear to be important on the learn sample but in general may hurt ultimate model performance (for example, an ID variable)
Shaving from the bottom – at each step the least important variable is eliminated
May drastically reduce the number of variables used by the model
SHAVING ERROR – at each iteration, all current variables are tried for elimination one at a time to determine which predictor contributes the least
May result in a very long sequence of runs (quadratic complexity)
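Shaving from the bottom can be sketched as backward elimination driven by feature importances (scikit-learn stand-in, synthetic data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)

active = list(range(X.shape[1]))       # indices of predictors still in play
history = []
while len(active) > 1:                 # shave from the bottom
    tree = DecisionTreeClassifier(max_depth=4, random_state=0)
    tree.fit(X[:, active], y)
    score = cross_val_score(tree, X[:, active], y, cv=5).mean()
    history.append((len(active), round(score, 3)))
    weakest = active[int(np.argmin(tree.feature_importances_))]
    active.remove(weakest)             # drop the least important predictor

print(history)   # (n_predictors, cv_score) pairs as the model shrinks
```

Shaving from the top is the same loop with `argmax` in place of `argmin`; the SHAVING ERROR variant would instead refit once per candidate removal at every step, hence the quadratic cost.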
Battery SAMPLE
Build a series of models in which the learn sample is systematically reduced
Used to determine the impact of the learn sample size on the model performance
Can be combined with BATTERY DRAW to get additional spread
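A SAMPLE-style battery can be sketched by refitting on progressively smaller slices of the learn sample (scikit-learn stand-in, synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, random_state=0)
X_learn, X_test, y_learn, y_test = train_test_split(X, y, random_state=0)

curve = {}
for n in (50, 100, 200, 400):          # systematically reduce the learn sample
    tree = DecisionTreeClassifier(max_depth=4, random_state=0)
    tree.fit(X_learn[:n], y_learn[:n])
    curve[n] = tree.score(X_test, y_test)

print(curve)   # learning curve: learn-sample size -> test accuracy
```

Redrawing each slice at random (the DRAW idea) would turn each point of this curve into a distribution rather than a single value.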
Battery TARGET
Attempt to build a model for each variable in the given list as a target using the remaining variables as predictors
Very effective in capturing inter-variable dependency patterns
Can be used as an input to variable clustering and multi-dimensional scaling approaches
When run on MVIs only, can reveal missing value patterns in the data
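A TARGET-style battery can be sketched on synthetic data where one column is constructed to depend on another (scikit-learn regression trees as a stand-in):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
data = rng.normal(size=(300, 4))
data[:, 3] = data[:, 0] + 0.1 * rng.normal(size=300)   # col 3 depends on col 0

scores = {}
for t in range(data.shape[1]):         # each column takes a turn as the target
    X = np.delete(data, t, axis=1)     # remaining columns act as predictors
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    scores[t] = cross_val_score(tree, X, data[:, t], cv=5).mean()  # R^2

# Strong scores for columns 0 and 3 reveal their mutual dependency;
# the independent noise columns score near zero.
print({t: round(s, 2) for t, s in scores.items()})
```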
Recommended Reading
Breiman, L., J. Friedman, R. Olshen and C. Stone (1984), Classification and Regression Trees, Pacific Grove: Wadsworth
Breiman, L (1996), Bagging Predictors, Machine Learning, 24, 123-140
Breiman, L. (1996), Stacked Regressions, Machine Learning, 24, 49-64
Friedman, J. H., T. Hastie and R. Tibshirani (Aug. 1998), Additive Logistic Regression: a Statistical View of Boosting