Forest Trees for On-line Data
Joao Gama, Pedro Medas, Ricardo Rocha
Proc. ACM Symposium on Applied Computing (SAC'04)
2005/5/27 Presenter: 董原賓
Outline
The VFDT System
Ultra Fast Forest Trees (UFFT)
Experiment
The VFDT System
The Very Fast Decision Tree (VFDT) system:
A decision tree is learned by recursively replacing leaves with decision nodes.
The system maintains a set of sufficient statistics at each leaf and installs a split-test at that node when there is enough statistical evidence in favor of the split-test.
The VFDT System
Uses information gain to find the best attribute as the splitting criterion
The main innovation of the VFDT system is the use of Hoeffding bounds to decide when to install a split-test at a given leaf node
VFDT: Information Gain
Information gain using attribute A_j:
H(A_j) = info(examples) − info(A_j)
info(A_j) = Σ_i P_i ( −Σ_k P_ik log(P_ik) )
where the probabilities are estimated from the counts n_ijk = number of examples that reached the leaf with the i-th value of attribute j and class k.
The VFDT System
Example:

Name   Gender   Height   Class
John   Male     1.85 m   Tall
Mary   Female   1.68 m   Tall
Kent   Male     1.55 m   Short
Sue    Female   1.71 m   Tall
Paul   Male     1.79 m   Tall
VFDT: Information Gain (logarithms are base 10)
Info(examples) = 4/5 log(5/4) + 1/5 log(5/1) = 0.217
Info(Gender) = 3/5 (2/3 log(3/2) + 1/3 log(3/1)) + 2/5 (2/2 log(2/2)) = 0.166 + 0 = 0.166
H(Gender) = Info(examples) − Info(Gender) = 0.217 − 0.166 = 0.051
John(T), Mary(T), Kent(S), Sue(T), Paul(T)
Split on Gender:
  male   → John(T), Kent(S), Paul(T)
  female → Mary(T), Sue(T)
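As a quick arithmetic check on the Gender example above, here is a minimal Python sketch (the `info` helper is mine, not from the paper; the slides' numbers come out only with base-10 logarithms):

```python
from math import log10

def info(probs):
    """Impurity of a class distribution, using base-10 logs as the slides do."""
    return sum(p * log10(1 / p) for p in probs if p > 0)

# All 5 examples: 4 Tall, 1 Short.
info_examples = info([4/5, 1/5])                          # 0.217

# Split on Gender: male -> 2 Tall, 1 Short; female -> 2 Tall, 0 Short.
info_gender = 3/5 * info([2/3, 1/3]) + 2/5 * info([1.0])  # 0.166

print(round(info_examples - info_gender, 3))              # H(Gender) = 0.051
```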
VFDT: Hoeffding Bound
Suppose we have made n independent observations of a random variable r whose range is R.
The Hoeffding bound states that, with probability 1 − δ, the true average of r is at least r̄ − ε, where
ε = √( R² ln(1/δ) / (2n) )
VFDT: Hoeffding Bound
Let x_a be the attribute with the highest H(), x_b the attribute with the second-highest H(), and ΔH = H(x_a) − H(x_b).
If ΔH > ε, then the Hoeffding bound states with probability 1 − δ that x_a is really the attribute with the highest value of the evaluation function, and the leaf node must be transformed into a decision node that splits on x_a.
Hoeffding Bound
Using the Height example above: variable r = Height with range R = 1.85 − 1.55 = 0.3, n = 5 and δ = 0.2:
ε = √( 0.3² × ln(1/0.2) / (2 × 5) ) = 0.12
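A one-function sketch of the bound, reproducing the ε = 0.12 above (the function name is mine):

```python
from math import log, sqrt

def hoeffding_epsilon(R, n, delta):
    """eps = sqrt(R^2 * ln(1/delta) / (2n)) from the Hoeffding bound."""
    return sqrt(R ** 2 * log(1 / delta) / (2 * n))

# Height example: R = 1.85 - 1.55 = 0.3, n = 5 observations, delta = 0.2.
print(round(hoeffding_epsilon(R=0.3, n=5, delta=0.2), 2))  # 0.12
```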
The VFDT System
It is not efficient to compute H() every time an example arrives, because evaluating the function for each example could be very expensive.
VFDT only computes the attribute evaluation function H() when a minimum number of examples has been seen since the last evaluation. This minimum number of examples n_min is a user-defined parameter.
The VFDT System
When two or more attributes continuously have very similar values of H(), even given a large number of examples, the Hoeffding bound will not decide between them.
To solve this problem, VFDT uses a constant τ introduced by the user: if ΔH < ε < τ, then the leaf node is transformed into a decision node.
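Putting the two rules together, a minimal sketch of the split decision as stated on these slides (all names are mine):

```python
def should_split(h_best, h_second, eps, tau):
    """Split when the best attribute is a Hoeffding-confident winner
    (delta_H > eps), or when the bound is already tight enough that a
    near-tie can be broken by the user constant tau (delta_H < eps < tau)."""
    delta_h = h_best - h_second
    return delta_h > eps or (delta_h < eps < tau)
```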
Ultra Fast Forest Trees (UFFT)
Uses the Hoeffding bound to decide when to install a splitting test in a leaf, turning it into a decision node.
Uses analytical techniques to choose the splitting criteria, and information gain to estimate the merit of each possible splitting test.
The algorithm builds a binary tree for each possible pair of classes
Ultra Fast Forest Trees (UFFT)
During the training phase the algorithm maintains a short term memory: a fixed number of the most recent examples are kept in a data structure.
The sufficient statistics of new leaves are initialized with the examples in the short term memory that fall at those leaves.
The Ultra Fast Forest Tree system (UFFT) is an incremental algorithm
The Splitting Criteria
Uses the analytical method for split point selection to choose, for all numerical attributes, the most promising value d_j
This analysis assumes that the distribution of the values of an attribute follows a normal distribution for both classes
The Splitting Criteria
Let x̄ and σ² be the sample mean and variance of a class:
σ² = Σ_i P(x_i)(x_i − x̄)²
p(−) denotes the estimated probability that an example belongs to class −.
d₁ and d₂ are the two possible roots of the equation that sets the prior-weighted normal densities of the two classes equal: p(+) N(x̄_+, σ_+) = p(−) N(x̄_−, σ_−)
The Splitting Criteria
Let d be the root closer to the sample means of both classes. The splitting-test candidate for each numeric attribute i will be Att_i ≤ d_i.
To compute the information gain we need to construct a contingency table with the distribution per class of the number of examples less than d and greater than d
The Splitting Criteria
The information kept on the tree is not sufficient to compute the exact number of examples for each entry in the contingency table
With the assumption of normality, we can compute the probability of observing a value less than or greater than d_i
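A sketch of both steps, assuming the split candidates are the points where the prior-weighted normal densities of the two classes intersect (the quadratic below comes from taking logs of that equality; all names are mine, and the normal CDF is used to fill the contingency table entries):

```python
from math import sqrt, log, erf

def normal_cdf(x, mu, sd):
    """P(X <= x) for X ~ N(mu, sd^2); gives the contingency-table fractions."""
    return 0.5 * (1 + erf((x - mu) / (sd * sqrt(2))))

def split_point(mu_a, sd_a, p_a, mu_b, sd_b, p_b):
    """Roots d1, d2 of p_a*N(mu_a, sd_a)(x) = p_b*N(mu_b, sd_b)(x);
    returns the root closer to the sample means, as the slides prescribe."""
    a = 1 / (2 * sd_b ** 2) - 1 / (2 * sd_a ** 2)
    b = mu_a / sd_a ** 2 - mu_b / sd_b ** 2
    c = (mu_b ** 2 / (2 * sd_b ** 2) - mu_a ** 2 / (2 * sd_a ** 2)
         + log((p_a * sd_b) / (p_b * sd_a)))
    if abs(a) < 1e-12:            # equal variances: the equation is linear
        return -c / b
    disc = sqrt(b ** 2 - 4 * a * c)
    d1, d2 = (-b + disc) / (2 * a), (-b - disc) / (2 * a)
    centre = (mu_a + mu_b) / 2
    return min((d1, d2), key=lambda d: abs(d - centre))
```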
The Splitting Criteria
Contingency table for the split d_j = 1.7 m:

               Att_Height ≤ 1.7    Att_Height > 1.7
Class Tall     P1_Tall             P2_Tall
Class Short    P1_Short            P2_Short

[Figure: the two class-conditional distributions (Class Tall, Class Short) with the split point d_j = 1.7 m.]
The Splitting Criteria
The information gain of an attribute:
H(Att_i) = info(P_Tall, P_Short) + Σ_{j=1..2} P_j × info(Pj_Tall, Pj_Short)
info(P_Tall, P_Short) = P_Tall × log(1/P_Tall) + P_Short × log(1/P_Short)
P_Tall = P1_Tall + P2_Tall
P_Short = P1_Short + P2_Short
This method only requires maintaining the mean and standard deviation for each class per attribute
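Since only per-class, per-attribute means and standard deviations are needed, they can be kept incrementally; a sketch using Welford's online algorithm (my choice of method, not specified in the paper):

```python
class RunningStats:
    """Online mean/variance (Welford), one instance per (attribute, class)."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        return (self.m2 / self.n) ** 0.5 if self.n > 0 else 0.0
```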
The Splitting Criteria
Name   Income   Height   Class
John   20000    1.85 m   Tall
Mary   35000    1.68 m   Tall
Kent   25000    1.71 m   Short
Sue    55000    1.59 m   Short
Paul   18000    1.79 m   Tall
               Att_Height ≤ 1.7    Att_Height > 1.7
Class Tall     P1_Tall = 1/5       P2_Tall = 2/5
Class Short    P1_Short = 1/5      P2_Short = 1/5
P_Tall = 1/5 + 2/5 = 3/5, P_Short = 1/5 + 1/5 = 2/5
H(Att_Height) = (3/5 log(5/3) + 2/5 log(5/2)) + (1/5 log 5 + 1/5 log 5) + (2/5 log(5/2) + 1/5 log 5) = 0.292 + 0.28 + 0.299 = 0.871
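The same arithmetic in a few lines of Python (note that, as on the slide, the three info terms are summed and the partition terms use the joint probabilities straight from the table; the helper name is mine):

```python
from math import log10

def info(probs):
    return sum(p * log10(1 / p) for p in probs if p > 0)

p1_tall, p1_short = 1/5, 1/5   # Height <= 1.7
p2_tall, p2_short = 2/5, 1/5   # Height >  1.7

h = (info([p1_tall + p2_tall, p1_short + p2_short])   # info(P_Tall, P_Short)
     + info([p1_tall, p1_short])                      # left partition
     + info([p2_tall, p2_short]))                     # right partition
print(round(h, 3))                                    # 0.871
```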
From Leaf to Decision Node
To expand the tree, the test Att_i ≤ d_i is installed in the leaf, and the leaf becomes a decision node with two new descendant leaves.
To expand a leaf, two conditions must be satisfied:
1. The information gain of the selected splitting test must be positive.
2. There must be statistical support in favor of the best splitting test, which is asserted using the Hoeffding bound as in VFDT.
In VFDT, n_min is a user-defined constant; in UFFT, the number is computed from the Hoeffding bound.
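A sketch of one plausible computation, assuming UFFT obtains n_min by inverting the Hoeffding bound at the current gain difference ΔH (this inversion is my assumption; the slide's exact formula does not survive in this transcript):

```python
from math import ceil, log

def n_min(R, delta, delta_h):
    """Assumed inversion of eps = sqrt(R^2 ln(1/delta) / (2n)): the smallest
    n for which eps could drop below the observed delta_H."""
    return ceil(R ** 2 * log(1 / delta) / (2 * delta_h ** 2))
```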
Short Term Memory
UFFT uses a short term memory that keeps a limited number of the most recent examples.
The short term memory is used to update the statistics at new leaves when they are created.
The examples in the short term memory traverse the tree, and the ones that reach the new leaves update the sufficient statistics of those leaves.
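A sketch of that replay step (the buffer capacity and the tree-traversal API, `tree.sort` and `leaf.update_statistics`, are hypothetical names of mine):

```python
from collections import deque

class ShortTermMemory:
    def __init__(self, capacity=1000):            # capacity is hypothetical
        self.buffer = deque(maxlen=capacity)      # most recent examples only

    def add(self, example):
        self.buffer.append(example)

    def replay_into(self, tree, new_leaves):
        # Buffered examples traverse the tree; those reaching a new leaf
        # seed its sufficient statistics.
        for example in self.buffer:
            leaf = tree.sort(example)              # hypothetical traversal
            if leaf in new_leaves:
                leaf.update_statistics(example)    # hypothetical update
```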
Classification Strategies at Leaves
UFFT implements two different classification strategies:
1. Majority class.
2. Naive Bayes classifier: at each leaf we maintain sufficient statistics to compute the information gain, and these are also the statistics needed to compute the conditional probabilities P(x_i | Class).
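A sketch of the naive Bayes option, using the per-class Gaussian statistics the leaf already keeps (the data layout and names are my assumptions):

```python
from math import exp, log, pi, sqrt

def gaussian_pdf(x, mu, sd):
    return exp(-(x - mu) ** 2 / (2 * sd ** 2)) / (sd * sqrt(2 * pi))

def naive_bayes_at_leaf(example, stats, priors):
    """stats[cls][i] = (mean, sd) of attribute i for class cls (assumed layout).
    Scores log P(cls) + sum_i log P(x_i | cls) and returns the best class."""
    def score(cls):
        return log(priors[cls]) + sum(
            log(gaussian_pdf(x, *stats[cls][i]))
            for i, x in enumerate(example))
    return max(priors, key=score)
```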
Forest of Trees
The UFFT algorithm builds a binary tree for each possible pair of classes. In the general case of n classes, the algorithm grows a forest of n(n − 1)/2 binary trees.
When a new example is received during the tree-growing phase, each tree receives the example if the class attached to it is one of the two classes in the tree's label.
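A sketch of the forest bookkeeping (the `make_tree` factory and the per-tree `update` method are hypothetical):

```python
from itertools import combinations

def build_forest(classes, make_tree):
    """One binary tree per unordered pair: n(n-1)/2 trees for n classes."""
    return {pair: make_tree(pair) for pair in combinations(sorted(classes), 2)}

def train(forest, example, label):
    # A tree sees the example only if the label is one of its two classes.
    for pair, tree in forest.items():
        if label in pair:
            tree.update(example, label)   # hypothetical incremental update
```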
Experiment
Hardware: Athlon XP 2000+ 1.66 GHz, 512 MB RAM, Linux RedHat 7.3
Three artificial datasets:
1. Waveform, with 21 and 40 attributes and 3 classes
2. LED, with 24 binary attributes (17 are irrelevant) and 10 classes
3. Balance, with 4 attributes and 3 classes
Parameter values: δ = 0.05, τ = 0.001 and n_min = 300
Experiment Result