Chapter 10. Sampling Strategy for Building Decision Trees from Very Large Databases Comprising Many Continuous Attributes
Jean-Hugues Chauchat and Ricco Rakotomalala
Laboratory ERIC – University Lumière Lyon
Summarized by Seong-Bae Park
Introduction
A fast and efficient sampling strategy for building decision trees (DTs) from a very large database.
The proposed strategy uses successive samples, one at each tree node.
Framework
The "Play Tennis" table (illustrative dataset).
Handling Continuous Attributes in DT
Discretization
Global Discretization
Each continuous attribute is converted to a discrete one:
1. Each continuous variable is sorted.
2-1. Several cut points are tested to find the subdivision that is best with respect to the class attribute.
• Use a splitting measure (entropy gain, chi-square, purity measure).
2-2. The number of intervals and their boundaries are searched for.
Local Discretization
It is not necessary to determine how many intervals should be created, as each split creates two intervals (see the sketch below).
Interactions among attributes are accounted for.
Initially requires sorting the values: O(n log n).
Sampling is needed to reduce n.
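To make binary local discretization concrete, here is a minimal Python sketch, assuming an entropy-gain criterion; the function names are illustrative and, for readability, the partitions are rescanned at each cut point rather than updating class counts incrementally as an efficient implementation would:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_binary_cut(values, labels):
    """Sort once (the O(n log n) step), then scan candidate cut points
    between distinct consecutive values, keeping the cut that maximizes
    the entropy gain with respect to the class attribute."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy([y for _, y in pairs])
    best_gain, best_cut = 0.0, None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no valid cut between equal attribute values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = base - (i / n) * entropy(left) - ((n - i) / n) * entropy(right)
        if gain > best_gain:
            best_gain, best_cut = gain, (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_cut, best_gain
```

For example, best_binary_cut([1.2, 3.4, 0.5, 2.2], ["yes", "no", "yes", "no"]) returns the midpoint threshold with the highest gain.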
Local Sampling Strategy
During construction, at each leaf a sample is drawn from the part of the database that corresponds to the path associated with that leaf.
Process (sketched in code below):
1. First, a complete list of the individuals in the database is drawn;
2. The first sample is selected while the database is being read;
3. This sample is used to identify the best segmentation attribute; if none exists, the stopping rule has played its role and the node becomes a terminal leaf;
4. If a segmentation is possible, the list from step 1 is broken up into sub-lists corresponding to the various leaves just obtained;
5. Step 4 requires a pass through the DB to update each example's leaf; this pass is an opportunity to select the samples that will be used in later computations.
Steps 3 to 5 are iterated until all nodes have become terminal leaves.
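The loop below is a schematic sketch of steps 1-5, assuming the database fits in a Python list and using hypothetical helpers: find_best_split (returns None when the stopping rule fires) and split.branch (maps a record to a child leaf). It illustrates the strategy, not the authors' implementation:

```python
import random

def grow_tree(db, sample_size, find_best_split):
    root = {"records": list(range(len(db)))}     # step 1: full list of individuals
    frontier = [root]
    while frontier:                              # iterate steps 3-5
        node = frontier.pop()
        ids = node["records"]
        sample = random.sample(ids, min(sample_size, len(ids)))  # steps 2/5
        split = find_best_split(db, sample)      # step 3: computed on the sample only
        if split is None:                        # stopping rule: terminal leaf
            node["leaf"] = True
            continue
        node["split"] = split
        buckets = {}                             # step 4: break list into sub-lists
        for i in ids:                            # step 5: one pass over this node's
            buckets.setdefault(split.branch(db[i]), []).append(i)  # part of the DB
        node["children"] = [{"records": b} for b in buckets.values()]
        frontier.extend(node["children"])
    return root
```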
Determining the Sample Size
The size of the sample must be such that:
1) the split is recognized as such, i.e., the power of the test is sufficient;
2) the discretization point is estimated as precisely as possible;
3) if many splitting attributes are possible at the given node in the database, the criterion for the optimal attribute remains maximal in the sample.
Testing Statistical Significance of a Link
For each node, statistical testing concepts are used: the probabilities of type I and type II errors (α and β). We look for the attribute that provides the best split according to the criterion T. The split is made if two conditions are met:
1) this split is the best;
2) this split is possible, i.e., T(Sample Data) is unlikely when H0 is true.
• Null Hypothesis H0:
“There is no link between the class attribute and the predictive attribute we are testing.”
• p-value: the probability that T is greater than or equal to T(Sample Data).
• H0 is rejected, so the split is possible, if the p-value is less than a predetermined significance level α (see the example below).
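As a toy illustration of this decision rule (the table values are hypothetical, not from the chapter), SciPy's chi-square test of independence on a 2x2 class-by-interval tabulation:

```python
from scipy.stats import chi2_contingency

# Hypothetical tabulation: rows = class values, columns = discretized intervals
table = [[30, 10],
         [15, 45]]

stat, p_value, dof, expected = chi2_contingency(table)
alpha = 0.01
if p_value < alpha:
    print(f"T = {stat:.2f}, p = {p_value:.4f} < {alpha}: reject H0, split is possible")
else:
    print(f"p = {p_value:.4f} >= {alpha}: no significant link, node becomes a leaf")
```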
Testing Statistical Significance of a Link (cont.)
The true significance level α' is larger than α (multiple hypotheses are tested). The probability α' of observing at least one of the q candidate attributes with a p-value smaller than α is:

\alpha' = P\left[\min_{1 \le j \le q}(\text{p-value}_j) \le \alpha\right] = 1 - \prod_{j=1}^{q} P\left[\text{p-value}_j > \alpha\right] = 1 - (1 - \alpha)^q

(assuming the q tests are independent). One must therefore use a very small value for α.
The significance level α limits the type I error probability.
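A quick numerical check of this formula (values computed from it, not taken from the chapter) shows how fast the effective level α' inflates with the number q of candidate attributes:

```python
# Effective significance level alpha' = 1 - (1 - alpha)^q under independence
alpha = 0.05
for q in (1, 5, 20, 100):
    alpha_prime = 1 - (1 - alpha) ** q
    print(f"q = {q:3d}  alpha' = {alpha_prime:.3f}")
# q =   1  alpha' = 0.050
# q =   5  alpha' = 0.226
# q =  20  alpha' = 0.642
# q = 100  alpha' = 0.994
```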
Notations
Y: the class attribute; X: a predictor attribute.
π_ij: the proportion of (Y = Y_i and X = X_j) in the sub-population corresponding to the working node.
π_i+ and π_+j: the marginal proportions.
π_ij^0 = π_i+ · π_+j: the product of the marginal proportions.
n_ij: the count of cell (i, j) in the sample tabulation; E(n_ij) = n·π_ij is the expected value of n_ij.
Probability Distribution of the Criterion
The link is measured by the χ² statistic or the information gain. When H0 is true and the sample size is large, both have an approximate chi-square distribution with (p − 1)(q − 1) degrees of freedom.
When H0 is false, the distribution is approximately a non-central chi-square with noncentrality parameter λ.
Central chi-square distribution: when H0 is true, λ = 0. The further the truth is from H0, the larger λ.
Noncentral chi-square distribution: no closed analytic formulation; asymptotically normal for large values of λ. λ is a function of the sample size n and the frequencies π_ij in the whole database.
Probability Distribution of the Criterion (cont.)
The value of λ:
For the information gain:

\lambda_I = 2n \sum_{ij} \pi_{ij} \log \frac{\pi_{ij}}{\pi_{ij}^{0}}

For the χ² statistic:

\lambda_K = n \sum_{ij} \frac{(\pi_{ij} - \pi_{ij}^{0})^{2}}{\pi_{ij}^{0}}
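A small sketch evaluating both noncentrality parameters for a hypothetical 2x2 table of node proportions; the natural logarithm is assumed in λ_I so the statistic is on the chi-square scale:

```python
import math

# Hypothetical joint proportions pi_ij at the working node (rows = classes,
# columns = intervals); they must sum to 1.
pi = [[0.30, 0.20],
      [0.10, 0.40]]
n = 200  # sample size drawn at the node

row = [sum(r) for r in pi]            # marginal proportions pi_{i+}
col = [sum(c) for c in zip(*pi)]      # marginal proportions pi_{+j}
lam_I = 2 * n * sum(pi[i][j] * math.log(pi[i][j] / (row[i] * col[j]))
                    for i in range(2) for j in range(2))
lam_K = n * sum((pi[i][j] - row[i] * col[j]) ** 2 / (row[i] * col[j])
                for i in range(2) for j in range(2))
print(f"lambda_I = {lam_I:.1f}, lambda_K = {lam_K:.1f}")
```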
Equalizing the Normal Risk Probabilities
Find the minimal sample size needed to reach a power of (1 − β). T_{1−α} denotes the critical value of the test.
Under H0:

P_{H_0}(T > T_{1-\alpha}) = P_{H_0}[\text{p-value} < \alpha] = \alpha

Under H1, using the normal approximation of the noncentral chi-square distribution (mean v + λ, variance 2(v + 2λ)):

P_{H_1}(T > T_{1-\alpha}) \approx P\left(Z > \frac{T_{1-\alpha} - (\lambda + v)}{\sqrt{2(v + 2\lambda)}}\right)

If p = q = 2, then v = 1 and λ = nR², so:

P_{H_1}(T > T_{1-\alpha}) \approx P\left(Z > \frac{T_{1-\alpha} - nR^{2} - 1}{\sqrt{2(1 + 2nR^{2})}}\right)
Equalizing the Normal Risk Probabilities (cont.)
The weaker the link (R²) is in the database, the larger the sample must be to bring it to evidence.
n increases as the significance level α decreases: if one wants to reduce the risk probabilities, a larger sample is needed. A numerical sketch follows.
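A hedged sketch of the resulting sample-size computation for p = q = 2 (v = 1, λ = nR²); the function name and the simple linear search over n are illustrative:

```python
from scipy.stats import chi2, norm

def minimal_n(R2, alpha=0.01, beta=0.05, v=1):
    """Smallest n whose power reaches 1 - beta for a link of strength R^2,
    using the normal approximation of the noncentral chi-square."""
    t_crit = chi2.ppf(1 - alpha, v)   # critical value T_{1-alpha}
    z_beta = norm.ppf(beta)           # lower normal quantile (negative)
    n = 1
    while (t_crit - n * R2 - v) / (2 * (v + 2 * n * R2)) ** 0.5 > z_beta:
        n += 1
    return n

for R2 in (0.1, 0.01, 0.001):
    print(f"R^2 = {R2}: n >= {minimal_n(R2)}")
```

With these defaults the required n grows sharply as R² shrinks, matching the statement above.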
Sampling Methods
Algorithm S
Sequentially processes the DB records and determines whether each record is selected (a sketch follows).
The first record is selected with probability n/N.
If m records have been selected from among the first t records, the (t+1)-st record is selected with probability (n − m)/(N − t).
When n records have been selected, the algorithm stops.
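A sketch of Algorithm S as described above; this one-pass selection yields a simple random sample of size n from N records:

```python
import random

def algorithm_s(records, n):
    """Select n records from a sequence of known length N in one pass."""
    N = len(records)
    sample, m = [], 0                   # m = number of records already selected
    for t, rec in enumerate(records):   # t records have been seen so far
        if random.random() < (n - m) / (N - t):
            sample.append(rec)
            m += 1
            if m == n:                  # stop as soon as n records are chosen
                break
    return sample
```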
Algorithm D
Draws random jump lengths between selected records, so that not every record needs a selection test.
Experiments
Objective of the Experiments:
To show that a tree built with local sampling has a generalization error rate comparable to that of a tree built with the complete database.
To show that sampling reduces computing time.
Artificial Database
Artificial problem: Breiman et al.'s "waves".
Two files are generated 100 times: one of 500,000 records for training, the other of 50,000 records for validation.
Binary discretization; ChAID decision tree algorithm.
Experiments (cont.)
As the sample size grows, the marginal profit becomes weak.
With Real Benchmark DBs
Five DBs from UCI, each containing more than 12,900 individuals.
The following operations are repeated 10 times:
Randomly subdivide the DB into a training set and a test set.
Build and test the trees.
With Real Benchmark DBs (cont.)
The influence of n: the sample size must not be too small.
Sampling drastically reduces computing time.
"Letter" DB: data fragmentation (the many classes fragment the data into small nodes).
Conclusions
Working on samples is useful.
The "step by step" character of decision tree induction allows us to propose a strategy using successive samples.
There is theoretical and empirical evidence for the approach.
Open Problems:
Optimal sampling methods.
Learning imbalanced classes: local equal-size sampling.