Chapter 10. Sampling Strategy for Building Decision Trees from Very Large Databases Comprising Many Continuous Attributes
Jean-Hugues Chauchat and Ricco Rakotomalala
Laboratory ERIC – University Lumière Lyon
Summarized by Seong-Bae Park
Introduction
A fast and efficient sampling strategy for building decision trees (DTs) from a very large database.
The proposed strategy uses successive samples, one at each tree node.
Framework
The "Play Tennis" table (illustrative dataset).
Handling Continuous Attributes in DT
Discretization
Global Discretization
Each continuous attribute is converted to a discrete one:
1. Each continuous variable is sorted.
2-1. Several cut points are tested to find the subdivision that is best with respect to the class attribute.
• Use a splitting measure (entropy gain, chi-square, purity measure).
2-2. The number of intervals and their boundaries are searched for.
Local Discretization
It is not necessary to determine how many intervals should be created, as each split creates two intervals (see the sketch below).
Interactions among attributes are accounted for.
Initially requires sorting the values: O(n log n).
Sampling is needed to reduce n.
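To make binary local discretization concrete, here is a minimal Python sketch, assuming an entropy-gain criterion; the function names are illustrative and, for readability, the partitions are rescanned at each cut point rather than updating class counts incrementally as an efficient implementation would:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_binary_cut(values, labels):
    """Sort once (the O(n log n) step), then scan candidate cut points
    between distinct consecutive values, keeping the cut that maximizes
    the entropy gain with respect to the class attribute."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy([y for _, y in pairs])
    best_gain, best_cut = 0.0, None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no valid cut between equal attribute values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = base - (i / n) * entropy(left) - ((n - i) / n) * entropy(right)
        if gain > best_gain:
            best_gain, best_cut = gain, (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_cut, best_gain
```

For example, best_binary_cut([1.2, 3.4, 0.5, 2.2], ["yes", "no", "yes", "no"]) returns the midpoint threshold with the highest gain.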
Local Sampling Strategy
During construction, at each leaf a sample is drawn from the part of the database that corresponds to the path associated with that leaf.
Process (sketched in code below):
1. First, a complete list of the individuals in the database is drawn;
2. The first sample is selected while the database is being read;
3. This sample is used to identify the best segmentation attribute; if none exists, the stopping rule has played its role and the node becomes a terminal leaf;
4. If a segmentation is possible, the list from step 1 is broken up into sub-lists corresponding to the various leaves just obtained;
5. Step 4 requires a pass through the DB to update each example's leaf; this pass is an opportunity to select the samples that will be used in later computations.
Steps 3 to 5 are iterated until all nodes have become terminal leaves.
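The loop below is a schematic sketch of steps 1-5, assuming the database fits in a Python list and using hypothetical helpers: find_best_split (returns None when the stopping rule fires) and split.branch (maps a record to a child leaf). It illustrates the strategy, not the authors' implementation:

```python
import random

def grow_tree(db, sample_size, find_best_split):
    root = {"records": list(range(len(db)))}     # step 1: full list of individuals
    frontier = [root]
    while frontier:                              # iterate steps 3-5
        node = frontier.pop()
        ids = node["records"]
        sample = random.sample(ids, min(sample_size, len(ids)))  # steps 2/5
        split = find_best_split(db, sample)      # step 3: computed on the sample only
        if split is None:                        # stopping rule: terminal leaf
            node["leaf"] = True
            continue
        node["split"] = split
        buckets = {}                             # step 4: break list into sub-lists
        for i in ids:                            # step 5: one pass over this node's
            buckets.setdefault(split.branch(db[i]), []).append(i)  # part of the DB
        node["children"] = [{"records": b} for b in buckets.values()]
        frontier.extend(node["children"])
    return root
```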
Determining the Sample Size
The size of the sample must be such that:
1) the split is recognized as such, i.e., the power of the test is sufficient;
2) the discretization point is estimated as precisely as possible;
3) if many splitting attributes are possible at the given node in the database, the criterion for the optimal attribute remains maximal in the sample.
Testing Statistical Significance of a Link
For each node, statistical testing concepts are used: the probabilities of type I and type II errors (α and β). We look for the attribute that provides the best split according to the criterion T. The split is made if two conditions are met:
1) this split is the best;
2) this split is possible, i.e., T(Sample Data) is unlikely when H0 is true.
• Null Hypothesis H0:
“There is no link between the class attribute and the predictive attribute we are testing.”
• p-value: the probability that T is greater than or equal to T(Sample Data).
• H0 is rejected, so the split is possible, if the p-value is less than a predetermined significance level α (see the example below).
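As a toy illustration of this decision rule (the table values are hypothetical, not from the chapter), SciPy's chi-square test of independence on a 2x2 class-by-interval tabulation:

```python
from scipy.stats import chi2_contingency

# Hypothetical tabulation: rows = class values, columns = discretized intervals
table = [[30, 10],
         [15, 45]]

stat, p_value, dof, expected = chi2_contingency(table)
alpha = 0.01
if p_value < alpha:
    print(f"T = {stat:.2f}, p = {p_value:.4f} < {alpha}: reject H0, split is possible")
else:
    print(f"p = {p_value:.4f} >= {alpha}: no significant link, node becomes a leaf")
```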
Testing Statistical Significance of a Link (cont.)
The true significance level α' is larger than α (multiple hypotheses are tested). The probability α' of observing at least one of the q candidate attributes with a p-value smaller than α is:

\alpha' = P\left[\min_{1 \le j \le q}(\text{p-value}_j) \le \alpha\right] = 1 - \prod_{j=1}^{q} P\left[\text{p-value}_j > \alpha\right] = 1 - (1 - \alpha)^q

(assuming the q tests are independent). One must therefore use a very small value for α.
The significance level α limits the type I error probability.
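A quick numerical check of this formula (values computed from it, not taken from the chapter) shows how fast the effective level α' inflates with the number q of candidate attributes:

```python
# Effective significance level alpha' = 1 - (1 - alpha)^q under independence
alpha = 0.05
for q in (1, 5, 20, 100):
    alpha_prime = 1 - (1 - alpha) ** q
    print(f"q = {q:3d}  alpha' = {alpha_prime:.3f}")
# q =   1  alpha' = 0.050
# q =   5  alpha' = 0.226
# q =  20  alpha' = 0.642
# q = 100  alpha' = 0.994
```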
Notations
Y: the class attribute; X: a predictor attribute.
π_ij: the proportion of (Y = Y_i and X = X_j) in the sub-population corresponding to the working node.
π_i+ and π_+j: the marginal proportions.
π_ij^0 = π_i+ · π_+j: the product of the marginal proportions.
n_ij: the count of cell (i, j) in the sample tabulation; E(n_ij) = n·π_ij is the expected value of n_ij.
Probability Distribution of the Criterion
The link is measured by the χ² statistic or the information gain. When H0 is true and the sample size is large, both have an approximate chi-square distribution with (p − 1)(q − 1) degrees of freedom.
When H0 is false, the distribution is approximately a non-central chi-square with noncentrality parameter λ.
Central chi-square distribution: when H0 is true, λ = 0. The further the truth is from H0, the larger λ.
Noncentral chi-square distribution: no closed analytic formulation; asymptotically normal for large values of λ. λ is a function of the sample size n and the frequencies π_ij in the whole database.
Probability Distribution of the Criterion (cont.)
The value of λ:
For the information gain:

\lambda_I = 2n \sum_{ij} \pi_{ij} \log \frac{\pi_{ij}}{\pi_{ij}^{0}}

For the χ² statistic:

\lambda_K = n \sum_{ij} \frac{(\pi_{ij} - \pi_{ij}^{0})^{2}}{\pi_{ij}^{0}}
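A small sketch evaluating both noncentrality parameters for a hypothetical 2x2 table of node proportions; the natural logarithm is assumed in λ_I so the statistic is on the chi-square scale:

```python
import math

# Hypothetical joint proportions pi_ij at the working node (rows = classes,
# columns = intervals); they must sum to 1.
pi = [[0.30, 0.20],
      [0.10, 0.40]]
n = 200  # sample size drawn at the node

row = [sum(r) for r in pi]            # marginal proportions pi_{i+}
col = [sum(c) for c in zip(*pi)]      # marginal proportions pi_{+j}
lam_I = 2 * n * sum(pi[i][j] * math.log(pi[i][j] / (row[i] * col[j]))
                    for i in range(2) for j in range(2))
lam_K = n * sum((pi[i][j] - row[i] * col[j]) ** 2 / (row[i] * col[j])
                for i in range(2) for j in range(2))
print(f"lambda_I = {lam_I:.1f}, lambda_K = {lam_K:.1f}")
```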
Equalizing the Normal Risk Probabilities
Find the minimal sample size needed to reach a power of (1 − β). T_{1−α} denotes the critical value of the test.
Under H0:

P_{H_0}(T > T_{1-\alpha}) = P_{H_0}[\text{p-value} < \alpha] = \alpha

Under H1, using the normal approximation of the noncentral chi-square distribution (mean v + λ, variance 2(v + 2λ)):

P_{H_1}(T > T_{1-\alpha}) \approx P\left(Z > \frac{T_{1-\alpha} - (\lambda + v)}{\sqrt{2(v + 2\lambda)}}\right)

If p = q = 2, then v = 1 and λ = nR², so:

P_{H_1}(T > T_{1-\alpha}) \approx P\left(Z > \frac{T_{1-\alpha} - nR^{2} - 1}{\sqrt{2(1 + 2nR^{2})}}\right)
Equalizing the Normal Risk Probabilities (cont.)
The weaker the link (R²) is in the database, the larger the sample must be to bring it to evidence.
n increases as the significance level α decreases: if one wants to reduce the risk probabilities, a larger sample is needed. A numerical sketch follows.
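A hedged sketch of the resulting sample-size computation for p = q = 2 (v = 1, λ = nR²); the function name and the simple linear search over n are illustrative:

```python
from scipy.stats import chi2, norm

def minimal_n(R2, alpha=0.01, beta=0.05, v=1):
    """Smallest n whose power reaches 1 - beta for a link of strength R^2,
    using the normal approximation of the noncentral chi-square."""
    t_crit = chi2.ppf(1 - alpha, v)   # critical value T_{1-alpha}
    z_beta = norm.ppf(beta)           # lower normal quantile (negative)
    n = 1
    while (t_crit - n * R2 - v) / (2 * (v + 2 * n * R2)) ** 0.5 > z_beta:
        n += 1
    return n

for R2 in (0.1, 0.01, 0.001):
    print(f"R^2 = {R2}: n >= {minimal_n(R2)}")
```

With these defaults the required n grows sharply as R² shrinks, matching the statement above.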
Sampling Methods
Algorithm S
Sequentially processes the DB records and determines whether each record is selected (a sketch follows).
The first record is selected with probability n/N.
If m records have been selected from among the first t records, the (t+1)-st record is selected with probability (n − m)/(N − t).
When n records have been selected, the algorithm stops.
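A sketch of Algorithm S as described above; this one-pass selection yields a simple random sample of size n from N records:

```python
import random

def algorithm_s(records, n):
    """Select n records from a sequence of known length N in one pass."""
    N = len(records)
    sample, m = [], 0                   # m = number of records already selected
    for t, rec in enumerate(records):   # t records have been seen so far
        if random.random() < (n - m) / (N - t):
            sample.append(rec)
            m += 1
            if m == n:                  # stop as soon as n records are chosen
                break
    return sample
```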
Algorithm D
Draws random jump lengths between selected records, so that not every record needs a selection test.
Experiments
Objective of the Experiments:
To show that a tree built with local sampling has a generalization error rate comparable to that of a tree built with the complete database.
To show that sampling reduces computing time.
Artificial Database
Artificial problem: Breiman et al.'s "waves".
Two files are generated 100 times: one of 500,000 records for training, the other of 50,000 records for validation.
Binary discretization; ChAID decision tree algorithm.
Experiments (cont.)
As the sample size grows, the marginal profit becomes weak.
With Real Benchmark DBs
Five DBs from UCI, each containing more than 12,900 individuals.
The following operations are repeated 10 times:
Randomly subdivide the DB into a training set and a test set.
Build and test the trees.
With Real Benchmark DBs (cont.)
The influence of n: the sample size must not be too small.
Sampling drastically reduces computing time.
"Letter" DB: data fragmentation (the many classes fragment the data into small nodes).
Conclusions
Working on samples is useful.
The "step by step" character of decision tree induction allows us to propose a strategy using successive samples.
There is theoretical and empirical evidence for the approach.
Open Problems:
Optimal sampling methods.
Learning imbalanced classes: local equal-size sampling.