Transcript of Forest Trees for On-line Data

Page 1:

Forest Trees for On-line Data

Joao Gama, Pedro Medas, Ricardo Rocha

Proc. ACM Symposium on Applied Computing (SAC'04)

2005/5/27 Presenter: 董原賓

Page 2:

Outline

The VFDT System
Ultra Fast Forest Trees (UFFT)
Experiment

Page 3:

The VFDT System

The Very Fast Decision Tree (VFDT) system:

A decision tree is learned by recursively replacing leaves with decision nodes

It maintains a set of sufficient statistics at each leaf and installs a split-test at that leaf, turning it into a decision node, when there is enough statistical evidence in favor of the split

Page 4:

The VFDT System

Information gain is used as the splitting criterion to find the best attribute

The main innovation of the VFDT system is the use of Hoeffding bounds to decide when to install a split-test at a given leaf node

Page 5:

VFDT: Information Gain

Information gain of attribute Aj:

  H(Aj) = info(examples) − info(Aj)
  info(Aj) = Σi Pi · ( Σk −Pik·log(Pik) )

nijk = number of examples that reached the leaf with value i of attribute j and class k
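As a concrete reading of the formula, here is a minimal Python sketch; the function name, the nested-dict layout of the counts, and the base-10 logarithm (which matches the worked example on the following pages) are assumptions of mine:

import math

def info_gain(counts, log=math.log10):
    # counts[value][cls] = n_ijk: examples reaching the leaf with this
    # attribute value and this class
    total = sum(n for per_value in counts.values() for n in per_value.values())
    class_totals = {}
    for per_value in counts.values():
        for cls, n in per_value.items():
            class_totals[cls] = class_totals.get(cls, 0) + n
    # info(examples): entropy of the class distribution before the split
    info_examples = sum((n / total) * log(total / n)
                        for n in class_totals.values() if n)
    # info(Aj): weighted entropy of the class distribution within each value
    info_attr = 0.0
    for per_value in counts.values():
        n_i = sum(per_value.values())
        info_attr += (n_i / total) * sum((n / n_i) * log(n_i / n)
                                         for n in per_value.values() if n)
    return info_examples - info_attr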

Page 6:

The VFDT System

Example:

Name Gender Height Class

John Male 1.85m Tall

Mary Female 1.68m Tall

Kent Male 1.55m Short

Sue Female 1.71m Tall

Paul Male 1.79m Tall

Page 7:

VFDT: Information Gain

Info(examples) = 4/5·log(5/4) + 1/5·log(5/1) = 0.217
Info(Gender) = 3/5·( 2/3·log(3/2) + 1/3·log(3/1) ) + 2/5·( 2/2·log(2/2) ) = 0.166 + 0 = 0.166
H(Att_gender) = Info(examples) − Info(Gender) = 0.217 − 0.166 = 0.051

(logarithms are base 10)

Split on Gender:

  all:    John(T), Mary(T), Kent(S), Sue(T), Paul(T)
  male:   John(T), Kent(S), Paul(T)
  female: Mary(T), Sue(T)
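Plugging the counts from this split into the info_gain sketch above reproduces the slide's number:

counts = {"male":   {"Tall": 2, "Short": 1},
          "female": {"Tall": 2, "Short": 0}}
print(round(info_gain(counts), 3))   # 0.051, matching H(Att_gender) above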

Page 8:

VFDT: Hoeffding Bound

Suppose we have made n independent observations of a random variable r whose range is R.

The Hoeffding bound states that, with probability 1 − δ, the true mean of r is at least r̄ − ε, where r̄ is the observed mean and

  ε = √( R²·ln(1/δ) / (2n) )
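A direct transcription of the bound into Python (the function name is an assumption):

import math

def hoeffding_epsilon(value_range, n, delta):
    # epsilon = sqrt( R^2 * ln(1/delta) / (2n) )
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))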

Page 9:

VFDT: Hoeffding Bound

Let xa be the attribute with the highest H(), xb the attribute with the second-highest H(), and ΔH = H(xa) − H(xb).

If ΔH > ε, the Hoeffding bound states with probability 1 − δ that xa really is the attribute with the highest value of the evaluation function, and the leaf node must be transformed into a decision node that splits on xa.

Page 10:

Hoeffding Bound

Using the example from page 6: variable r = Height, with range R = 1.85 − 1.55 = 0.3, n = 5 and δ = 0.2:

  ε = ( 0.3² · ln(5) / (2·5) )^(1/2) = 0.12
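The same number falls out of the hoeffding_epsilon sketch above:

print(round(hoeffding_epsilon(0.3, 5, 0.2), 2))   # 0.12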

Page 11:

The VFDT System

It is not efficient to compute H() every time an example arrives, because evaluating the function for each example could be very expensive.

VFDT therefore computes the attribute evaluation function H() only when a minimum number of examples has been seen since the last evaluation. This minimum number of examples, nmin, is a user-defined parameter.

Page 12:

The VFDT System

When two or more attributes continuously have very similar values of H(), even given a large number of examples, the Hoeffding bound will not decide between them.

To solve this problem VFDT uses a constant τ introduced by the user: if ΔH < ε < τ, the leaf node is transformed into a decision node.
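Putting the two rules together, a minimal sketch of the leaf-splitting decision (the function and argument names are assumptions):

def should_split(h_best, h_second, epsilon, tau):
    # Turn the leaf into a decision node on the best attribute?
    delta_h = h_best - h_second
    if delta_h > epsilon:            # clear winner with probability 1 - delta
        return True
    return delta_h < epsilon < tau   # near-tie broken by the user constant tau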

Page 13:

Ultra Fast Forest Trees (UFFT)

Uses the Hoeffding bound to decide when to install a splitting test at a leaf, turning it into a decision node.

Uses analytical techniques to choose the splitting criteria, and the information gain to estimate the merit of each possible splitting test.

The algorithm builds a binary tree for each possible pair of classes

Page 14:

Ultra Fast Forest Trees (UFFT)

During the training phase the algorithm maintains a short term memory: a fixed number of the most recent examples are kept in a data structure.

The sufficient statistics of new leaves are initialized with the examples in the short term memory that fall into those leaves.

The Ultra Fast Forest Tree system (UFFT) is an incremental algorithm

Page 15:

The Splitting Criteria

Uses an analytical method for split-point selection to choose, for each numerical attribute, the most promising value dj.

This analysis assumes that the distribution of the values of an attribute follows a normal distribution for both classes.

Page 16:

The Splitting Criteria

Let x̄ and σ² be the sample mean and variance of the class:

  σ² = Σi P(xi)·(xi − x̄)²

p(−) denotes the estimated probability that an example belongs to class −.

d1 and d2 are the possible roots of the equation that equates the two weighted class-conditional normal densities,

  p(+)·N(x̄₊, σ₊)(d) = p(−)·N(x̄₋, σ₋)(d),

which becomes a quadratic in d after taking logarithms.

Page 17:

The Splitting Criteria

Let d be the root closer to the sample means of both classes. The splitting-test candidate for each numeric attribute i will be Atti ≦ di.

To compute the information gain we need to construct a contingency table with the per-class distribution of the number of examples less than and greater than d.

Page 18:

The Splitting Criteria

The information kept on the tree is not sufficient to compute the exact number of examples for each entry in the contingency table

With the assumption of normality, we can instead compute the probability of observing a value less than or greater than di
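A sketch of this step under the stated normality assumption: solve for the candidate split point, then fill the contingency table with probabilities rather than exact counts. All names are assumptions, and math.erf stands in for a statistics library:

import math

def normal_cdf(x, mean, std):
    # P(X <= x) for X ~ N(mean, std^2)
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def split_point(p_pos, mu_pos, sd_pos, p_neg, mu_neg, sd_neg):
    # Equating p(+)*N(mu+, sd+)(d) = p(-)*N(mu-, sd-)(d) and taking logs
    # gives a*d^2 + b*d + c = 0; return the root closer to the class means.
    a = 1.0 / sd_pos ** 2 - 1.0 / sd_neg ** 2
    b = 2.0 * (mu_neg / sd_neg ** 2 - mu_pos / sd_pos ** 2)
    c = (mu_pos ** 2 / sd_pos ** 2 - mu_neg ** 2 / sd_neg ** 2
         - 2.0 * math.log((p_pos * sd_neg) / (p_neg * sd_pos)))
    if abs(a) < 1e-12:                          # equal variances: linear case
        return -c / b
    disc = math.sqrt(b ** 2 - 4.0 * a * c)
    roots = ((-b - disc) / (2.0 * a), (-b + disc) / (2.0 * a))
    midpoint = (mu_pos + mu_neg) / 2.0
    return min(roots, key=lambda d: abs(d - midpoint))

# A contingency entry such as P1,Tall is then approximated as
# p_tall * normal_cdf(d, mu_tall, sd_tall) instead of an exact count.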

Page 19:

The Splitting Criteria

Split point dj = 1.7 m:

                      Class Tall   Class Short
  Att_Height ≦ 1.7    P1,Tall      P1,Short
  Att_Height > 1.7    P2,Tall      P2,Short

(Figure: the two class-conditional normal curves for Tall and Short, with dj = 1.7 m marking the split; the areas on each side of dj correspond to P1,Tall, P1,Short, P2,Tall and P2,Short.)

Page 20:

The Splitting Criteria

The information gain of an attribute:

  H(Atti) = info(PTall, PShort) + Σ(j=1..2) Pj · info(Pj,Tall, Pj,Short)

  info(PTall, PShort) = PTall·log(1/PTall) + PShort·log(1/PShort)

  PTall = P1,Tall + P2,Tall
  PShort = P1,Short + P2,Short

This method requires maintaining only the mean and standard deviation for each class per attribute.
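Since only per-class means and standard deviations are needed, a leaf can maintain them incrementally as examples stream in; a minimal sketch using Welford's online algorithm (the class name and usage are assumptions):

import math

class RunningStats:
    # Online mean/std for one (attribute, class) pair -- Welford's algorithm
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
    def std(self):
        return math.sqrt(self.m2 / self.n) if self.n else 0.0

stats = RunningStats()
for x in (1.85, 1.68, 1.79):     # heights of the Tall examples below
    stats.update(x)
print(round(stats.mean, 3), round(stats.std(), 3))   # 1.773 0.07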

Page 21:

The Splitting Criteria

  Name   Income   Height   Class
  John   20000    1.85m    Tall
  Mary   35000    1.68m    Tall
  Kent   25000    1.71m    Short
  Sue    55000    1.59m    Short
  Paul   18000    1.79m    Tall

                      Class Tall       Class Short
  Att_Height ≦ 1.7    P1,Tall = 1/5    P1,Short = 1/5
  Att_Height > 1.7    P2,Tall = 2/5    P2,Short = 1/5

  PTall = 1/5 + 2/5 = 3/5
  PShort = 1/5 + 1/5 = 2/5

  H(Att_Height) = ( 3/5·log(5/3) + 2/5·log(5/2) )
                + ( 1/5·log 5 + 1/5·log 5 )
                + ( 2/5·log(5/2) + 1/5·log 5 )
                = 0.292 + 0.28 + 0.299 = 0.871
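Reproducing the slide's arithmetic in Python (base-10 logarithms; note the slide adds the partition terms to info(PTall, PShort)):

from math import log10

before = 3/5*log10(5/3) + 2/5*log10(5/2)    # info(PTall, PShort)
part1  = 1/5*log10(5)   + 1/5*log10(5)      # info(P1,Tall, P1,Short)
part2  = 2/5*log10(5/2) + 1/5*log10(5)      # info(P2,Tall, P2,Short)
print(round(before + part1 + part2, 3))     # 0.871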

Page 22:

From Leaf to Decision Node

To expand the tree, the test Atti ≦ di is installed at the leaf, and the leaf becomes a decision node with two new descendant leaves.

To expand a leaf, two conditions must be satisfied:
  1. The information gain of the selected splitting test must be positive
  2. There must be statistical support in favor of the best splitting test, which is asserted using the Hoeffding bound as in VFDT

In VFDT, nmin is a user-defined constant; in UFFT, the number is computed by the algorithm itself.

Page 23:

Short Term Memory

Uses a short term memory that keeps a limited number of the most recent examples in memory.

The short term memory is used to update the statistics at new leaves when they are created: the stored examples traverse the tree, and the ones that reach the new leaves update the sufficient statistics of the tree.
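A minimal sketch of such a buffer, assuming a hypothetical tree API (a route function returning the leaf an example reaches, and an update_statistics method on leaves) and an arbitrary buffer size:

from collections import deque

short_term_memory = deque(maxlen=1000)   # keeps only the most recent examples

def replay_into_new_leaves(new_leaves, route):
    # Route each buffered example down the tree; the ones landing in a
    # freshly created leaf update that leaf's sufficient statistics.
    for example in short_term_memory:
        leaf = route(example)            # hypothetical traversal
        if leaf in new_leaves:
            leaf.update_statistics(example)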

Page 24:

Classification strategies at leaves

Two different classification strategies are implemented:

  1. Majority class
  2. Naive Bayes classifier: at each leaf we maintain sufficient statistics to compute the information gain, and these are exactly the statistics needed to compute the conditional probabilities P(xi | Class)
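Under the same normality assumption, the per-class means and standard deviations kept at a leaf double as a Gaussian naive Bayes model; a sketch (all names are assumptions, and zero standard deviations are not handled):

import math

def normal_pdf(x, mean, std):
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

def naive_bayes_predict(example, priors, stats):
    # stats[cls][attr] holds (mean, std) for each numeric attribute
    best_cls, best_score = None, float("-inf")
    for cls, prior in priors.items():
        score = math.log(prior)
        for attr, x in example.items():
            mean, std = stats[cls][attr]
            score += math.log(normal_pdf(x, mean, std))   # log P(xi | Class)
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls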

Page 25:

Forest of Trees

The UFFT algorithm builds a binary tree for each possible pair of classes. In the general case of n classes, the algorithm grows a forest of n(n−1)/2 binary trees.

When a new example is received during the tree-growing phase, each tree receives the example only if the class attached to it is one of the two classes in that tree's label.
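A sketch of this pairwise routing; the lists stand in for real UFFT trees, whose construction and update are elided:

from itertools import combinations

classes = ["A", "B", "C"]                    # n = 3 -> n(n-1)/2 = 3 binary trees
forest = {pair: [] for pair in combinations(classes, 2)}

def train_on(example, label):
    # Each binary tree sees the example only if its label is one of the
    # tree's two classes.
    for pair, tree in forest.items():
        if label in pair:
            tree.append((example, label))    # a real UFFT tree would update here

train_on({"Height": 1.8}, "A")
print([pair for pair in forest if forest[pair]])   # [('A', 'B'), ('A', 'C')]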

Page 26:

Experiment

Hardware: Athlon XP2000+ 1.66GHz, 512MB RAM, Linux RedHat 7.3

Three artificial datasets:
  1. Waveform, with 21 and 40 attributes and 3 classes
  2. LED, with 24 binary attributes (17 are irrelevant) and 10 classes
  3. Balance, with 4 attributes and 3 classes

Parameter values: δ = 0.05, τ = 0.001 and nmin = 300

Page 27:

Experiment Results