Breeding Decision Trees Using Evolutionary Techniques
Papagelis Athanasios, Kalles Dimitrios
Computer Technology Institute & AHEAD RM
Introduction
We use GAs to evolve simple and accurate binary decision trees
• Simple genetic operators over tree structures
• Experiments with UCI datasets: very good size, competitive accuracy results
• Experiments with synthetic datasets: superior accuracy results
Current tree induction algorithms…
… use greedy heuristics
• to guide the search during tree building
• to prune the resulting trees
Fast implementations
Accurate results on widely used benchmark datasets (like the UCI datasets)
Optimal results? No
Good for real-world problems? There are not many real-world datasets available for research.
More on greedy heuristics
They can quickly guide us to desired solutions
On the other hand, they can substantially deviate from the optimal
WHY? They are very strict, which means they are very good for only a limited problem space
Why should GAs work?
GAs are not…
• Hill climbers: blind on complex search spaces
• Exhaustive searchers: extremely expensive
They are… beam searchers
• They balance between the time needed and the space searched
• Applicable to a bigger problem space
• Good results for many more problems
• No need to tune or derive new algorithms
Another way to see it…
Biases
• Preference bias: characteristics of the output
  We should choose this, e.g. small trees
• Procedural bias: how we will search
  We should not have to choose this; unfortunately, we have to:
  Greedy heuristics make strong hypotheses about the search space
  GAs make weak hypotheses about the search space
The real-world question… Are there datasets where hill-climbing techniques are really inadequate (e.g. unnecessarily big or misguiding output)?
Yes, there are…
• Conditionally dependent attributes (e.g. XOR)
• Irrelevant attributes
Many solutions use GAs as a preprocessor to select adequate attributes
Direct genetic search can prove more efficient on those datasets
The proposed solution
• Select the desired decision tree characteristics (e.g. small size)
• Adopt a decision tree representation with appropriate genetic operators
• Create an appropriate fitness function
• Produce a representative initial population
• Evolve for as long as you wish!
Initialization procedure
Population of minimal decision trees: simple and fast
• Choose a random value as the test value
• Choose two random classes as leaves
[Figure: a minimal tree with root test A=2 and two leaves, Class=1 and Class=2]
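A minimal sketch of this initialization in Python; the Node/Leaf classes and sampling helpers are illustrative assumptions, not the authors' implementation:

```python
import random

class Leaf:
    """A leaf predicts a single class."""
    def __init__(self, label):
        self.label = label

class Node:
    """An internal node tests instance[attribute] == value."""
    def __init__(self, attribute, value, left, right):
        self.attribute = attribute
        self.value = value
        self.left = left    # branch taken when the test holds
        self.right = right  # branch taken otherwise

def random_minimal_tree(values, classes):
    """One random test at the root, two random classes as leaves,
    e.g. the A=2 tree with leaves Class=1 / Class=2 shown above."""
    attribute = random.randrange(len(values))
    return Node(attribute, random.choice(values[attribute]),
                Leaf(random.choice(classes)), Leaf(random.choice(classes)))

def initial_population(n, values, classes):
    # values: one list of possible test values per attribute
    return [random_minimal_tree(values, classes) for _ in range(n)]
```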
Genetic operators
Mutation: a mutated node gets a new random test value; a mutated leaf gets a new random class.
[Figure 1: Mutation examples]
Crossover: a node is chosen in each parent and the corresponding subtrees are swapped.
[Figure 2: Crossover examples]
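A sketch of both operators over this representation, continuing the hypothetical Node/Leaf classes above (the root case in crossover is simplified to swapping whole trees):

```python
import copy
import random

def nodes_with_parents(tree, parent=None, side=None):
    """List (node, parent, side) triples so a chosen subtree can be re-attached."""
    found = [(tree, parent, side)]
    if isinstance(tree, Node):
        found += nodes_with_parents(tree.left, tree, "left")
        found += nodes_with_parents(tree.right, tree, "right")
    return found

def mutate(tree, values, classes):
    """Figure 1: a mutated node gets a new test value, a mutated leaf a new class."""
    node, _, _ = random.choice(nodes_with_parents(tree))
    if isinstance(node, Node):
        node.value = random.choice(values[node.attribute])
    else:
        node.label = random.choice(classes)

def crossover(parent_a, parent_b):
    """Figure 2: swap the subtrees rooted at a chosen node of each parent."""
    a, b = copy.deepcopy(parent_a), copy.deepcopy(parent_b)
    node_a, par_a, side_a = random.choice(nodes_with_parents(a))
    node_b, par_b, side_b = random.choice(nodes_with_parents(b))
    if par_a is None or par_b is None:  # a root was chosen: swap whole trees
        return b, a
    setattr(par_a, side_a, node_b)
    setattr(par_b, side_b, node_a)
    return a, b
```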
Payoff function
Balance between accuracy and size
payoff(tree_i) = CorrectClassified_i^2 * x / (size_i^2 + x)

Set x depending on the desired output characteristics:
• Small trees? x near 1
• Emphasis on accuracy? x grows big
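With the sketched representation above, the payoff is straightforward to compute; `classify` and `size` are assumed helpers, not part of the original slides:

```python
def classify(tree, instance):
    """Route an instance down the tree to a leaf label."""
    while isinstance(tree, Node):
        tree = tree.left if instance[tree.attribute] == tree.value else tree.right
    return tree.label

def size(tree):
    """Total number of nodes (internal nodes plus leaves)."""
    return 1 if isinstance(tree, Leaf) else 1 + size(tree.left) + size(tree.right)

def payoff(tree, dataset, x):
    """CorrectClassified^2 * x / (size^2 + x): x near 1 favours small trees;
    a large x makes the size penalty vanish and stresses accuracy."""
    correct = sum(classify(tree, inst) == label for inst, label in dataset)
    return correct ** 2 * x / (size(tree) ** 2 + x)
```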
Advanced System Characteristics
Scaled payoff function (Goldberg, 1989)
Alternative crossovers: evolution towards fit subtrees
• Accurate subtrees had less chance to be used for crossover or mutation
Limited Error Fitness (LEF) (Gathercole & Ross, 1997): significant CPU-time savings with insignificant accuracy losses
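One way to sketch the LEF shortcut (Gathercole & Ross schedule the error limit adaptively; this shows only the core early exit): stop paying for a full evaluation pass once an individual has made too many errors.

```python
def lef_payoff(tree, dataset, x, error_limit):
    """Abort the evaluation pass once error_limit is exceeded; the
    skipped instances are simply never credited as correct."""
    correct = errors = 0
    for inst, label in dataset:
        if classify(tree, inst) == label:
            correct += 1
        else:
            errors += 1
            if errors > error_limit:
                break  # the CPU saving comes from this early exit
    return correct ** 2 * x / (size(tree) ** 2 + x)
```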
Second-layer GA
Tests the effectiveness of all those components
Coded information about the mutation/crossover rates and different heuristics, as well as a number of other optimizing parameters
Most recurring results: mutation rate 0.005, crossover rate 0.93, use of a crowding-avoidance technique
Alternative crossover/mutation techniques did not produce better results than the basic crossover/mutation
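A second-layer chromosome can simply be a vector of first-layer settings; a toy encoding (the gene names and the `run_first_layer_ga` callable are hypothetical stand-ins):

```python
import random

def random_meta_individual():
    """Each gene is one first-layer GA setting to be optimized."""
    return {
        "mutation_rate":  random.uniform(0.001, 0.05),
        "crossover_rate": random.uniform(0.5, 1.0),
        "crowding":       random.choice([True, False]),
    }

def meta_fitness(genes, run_first_layer_ga):
    """run_first_layer_ga stands in for a full first-layer run that
    returns the payoff of the best tree evolved under these settings."""
    return run_first_layer_ga(**genes)
```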
Search space / Induction costs
10 leaves, 6 values, 2 classes: search space > 50,173,704,142,848 (HUGE!)
Greedy feature selection: O(ak), a = attributes, k = instances (Quinlan, 1986)
O(a^2 k^2) with one level of lookahead (Murthy and Salzberg, 1995)
O(a^d k^d) for d-1 levels of lookahead
Proposed heuristic: O(gen * k^2 * a)
Extended heuristic: O(gen * k * a)
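The search-space figure can be checked with Catalan numbers: a binary tree with 10 leaves has 9 internal nodes and Catalan(9) = 4862 shapes; each internal node may carry any of the 6 test values and each leaf either of the 2 classes. This decomposition is my reading of the count, but it reproduces the slide's number exactly:

```python
from math import comb

def catalan(n):
    """Number of binary tree shapes with n internal nodes."""
    return comb(2 * n, n) // (n + 1)

leaves, values, classes = 10, 6, 2
internal = leaves - 1
print(catalan(internal) * values ** internal * classes ** leaves)
# -> 50173704142848
```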
How it works? An example (a)
An artificial dataset with eight rules (26 possible values, three classes). The first two activation rules:
(15.0%) c1: A=(a or b or t) & B=(a or h or q or x)
(14.0%) c1: B=(f or l or s or w) & C=(c or e or f or k)
Huge search space!
How it works? An example (b)
[Figure: best fitness, mean fitness and accuracy (left axis, 0.1 to 1) together with tree size (right axis, 0 to 300), plotted over generations 10 to 790]
Illustration of the greedy-heuristics problem
An example dataset (XOR over A1 & A2):

A1 A2 A3 Class
T  F  T  T
T  F  F  T
F  T  F  T
F  T  T  T
F  F  F  F
F  F  F  F
T  T  T  F
T  T  F  T
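To see why a greedy splitter is misled here: under pure XOR no single attribute reduces class entropy, so a gain-based ranking cannot separate the relevant A1, A2 from the irrelevant A3. A quick check with plain information gain (a stand-in for C4.5's gain ratio) on the noise-free XOR truth table:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr):
    labels = [row[-1] for row in rows]
    remainder = 0.0
    for v in set(row[attr] for row in rows):
        subset = [row[-1] for row in rows if row[attr] == v]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

# (A1, A2, A3, class) with class = A1 xor A2 and A3 irrelevant
rows = [(a1, a2, a3, a1 ^ a2)
        for a1 in (0, 1) for a2 in (0, 1) for a3 in (0, 1)]
for attr in range(3):
    print(f"gain(A{attr + 1}) = {info_gain(rows, attr)}")  # 0.0 for all three
```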
More experiments in this direction
Name | Attrib. | Class Function | Noise | Instanc. | Random Attributes
Xor1 | 10 | (A1 xor A2) or (A3 xor A4) | No | 100 | 6
Xor2 | 10 | (A1 xor A2) xor (A3 xor A4) | No | 100 | 6
Xor3 | 10 | (A1 xor A2) or (A3 and A4) or (A5 and A6) | 10% class error | 100 | 4
Par1 | 10 | Three-attribute parity problem | No | 100 | 7
Par2 | 10 | Four-attribute parity problem | No | 100 | 6
Results for artificial datasets
C4.5 GATree
Xor1 67±12.04 100±0
Xor2 53±18.57 90±17.32
Xor3 79±6.52 78±8.37
Par1 70±24.49 100±0
Par2 63±6.71 85±7.91
Results for UCI datasets
Table 1: Classification accuracy
C4.5 OneR GATree
Colic 83.84±3.41 81.37±5.36 85.01±4.55
Heart-Statlog 74.44±3.56 76.3±3.04 77.48±3.07
Diabetes 66.27±3.71 63.27±2.59 63.97±3.71
Credit 83.77±2.93 86.81±4.45 86.81±4
Hepatitis 77.42±6.84 84.52±6.2 80.46±5.39
Iris 92±2.98 94.67±3.8 93.8±4.02
Labor 85.26±7.98 72.73±14.37 87.27±7.24
Lymph 65.52±14.63 74.14±7.18 75.24±10.69
Breast-Cancer 71.93±5.11 68.17±7.93 71.03±8.34
Zoo 90±7.91 43.8±10.47 85.4±4.02
Vote 96.09±3.86 95.63±4.33 95.63±4.33
Glass 55.24±7.49 43.19±4.33 53.48±4.33
Balance-Scale 78.24±4.4 59.68±4.4 71.15±6.47
AVERAGES 78.46 72.64 78.98
Table 2: Average tree sizes
C4.5 GATree
Colic 27.4 5.84
Heart-Statlog 39.4 8.28
Diabetes 140.6 6.6
Credit 57.8 3
Hepatitis 19.8 5.56
Iris 9.6 7.48
Labor 8.6 8.72
Lymph 28.2 7.96
Breast-Cancer 35.4 6.68
Zoo 17 10.12
Vote 11 3
Glass 60.2 8.98
Balance-Scale 106.6 8.92
AVERAGES 43.2 7.01
C4.5 / OneR deficiencies
Similar preference biases: accurate, small decision trees. This is acceptable.
Procedural biases not optimized:
• Emphasis on accuracy (C4.5): the tree's size is not optimized
• Emphasis on size (OneR): trivial search policy
Pruning as a greedy heuristic has similar disadvantages
Future work: minimize evolution time
Crossover/mutation operators change the tree from a node downwards, so we can re-classify only the instances that belong to the changed node's subtree.
But we need to maintain more node statistics (see the sketch after the figure below).
[Figure: percentage of instances that must be re-classified vs. tree level (levels 1 to 49), for a complete binary decision tree and a linear binary decision tree, with the average needed re-classification]
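A sketch of the saving (the assumed bookkeeping, a per-node cache of the training instances that reach it, is exactly the "more node statistics" mentioned above):

```python
def reclassify(changed_subtree, reaching_instances):
    """After an operator edits a subtree, only the instances cached at its
    root can change class; the rest of the tree's classifications stand."""
    return sum(classify(changed_subtree, inst) == label
               for inst, label in reaching_instances)
```

As the figure suggests, the deeper the changed node, the fewer instances reach it, so the expected re-classification cost stays well below a full pass over the training set.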