BOAT - Optimistic Decision Tree Construction
Gehrke, J., Ganti, V., Ramakrishnan, R., Loh, W.
Problem
• Efficient construction of decision trees
• As few passes over the database as possible
• Use a sample of the dataset to gain insight into the full database
Motivation
• Standard decision tree construction is not sufficient
  – For a tree of height h, need h passes through the entire database
  – To include new data, must rebuild the tree
  – For large databases, this is not feasible
• Need a fast, scalable method
Intuition
• Begin with a sample of the data
  – Build a decision tree on the sample
  – For numeric data, use a confidence interval for the split
• Make limited passes over the full data to both verify the sampled tree and construct the full tree
  – Only data that falls in the confidence interval needs to be rescanned to determine how to propagate it
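The routing idea on this slide can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the `CoarseNode` class, its field names, and the `route` function are all hypothetical, and the split is assumed to have the form `value <= split_point` with the split point somewhere in `[low, high]`.

```python
from dataclasses import dataclass, field


@dataclass
class CoarseNode:
    """Hypothetical node whose numeric split is known only up to [low, high]."""
    attribute: str
    low: float
    high: float
    left_count: int = 0
    right_count: int = 0
    held: list = field(default_factory=list)  # records inside the interval


def route(node, record):
    """Route one record during the scan of the full data.

    Records strictly below the interval must go left and records strictly
    above it must go right, whatever the final split point turns out to be.
    Records inside the interval are held back until the split is fixed.
    """
    value = record[node.attribute]
    if value < node.low:
        node.left_count += 1
        return "left"
    if value > node.high:
        node.right_count += 1
        return "right"
    node.held.append(record)
    return "held"
```

Only the `held` records need a second look once the exact split point is known, which is why a narrow interval keeps the rescanned data small.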
Criteria selection
• Use impurity functions to select the attribute to split on
  – Entropy, Gini index, index of correlation
  – Calculated on the sample, so could be wrong for the full dataset
• Minimize the impurity function when selecting the attribute and the confidence interval
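As an illustration of impurity-based split selection, the sketch below uses the Gini index to pick a split point on one numeric attribute. The function names are made up for this example, and the quadratic-time loop is chosen for readability rather than speed.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels (0.0 for an empty or pure list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def weighted_gini(left, right):
    """Impurity of a binary split: size-weighted average of both sides."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n


def best_numeric_split(values, labels):
    """Return (split_point, impurity) minimizing weighted Gini.

    A split sends records with value <= split_point to the left.
    Returns (None, inf) if all values are identical.
    """
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = weighted_gini(left, right)
        if score < best[1]:
            best = (pairs[i - 1][0], score)
    return best
```

Running the same search on the sample and on the full data can give different split points, which is the reason the sampled result is treated as a prediction to be verified.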
Confidence Interval
• Construct T trees from bootstrap resamples of the sample
• If at node n, the splitting attribute is not the same in all trees, discard n and its subtree in all trees
• For categorical data, if the splitting subset is not identical in all trees, remove node n and its subtree in all trees
Confidence Interval
• Confidence interval on numeric attributes determined by the range of split points on the T trees
• Exact split point is likely to be between the min and max of the values of the split points on the T trees.
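The two Confidence Interval slides can be summarized as one recursive merge over the T trees. The sketch below assumes a hypothetical tree representation (nested dicts, with `None` for a leaf); the key names `attr`, `split`, `subset`, `left`, `right`, and `interval` are inventions for this example.

```python
def combine(trees):
    """Merge T trees into one coarse tree; None marks a leaf or a discarded node.

    Assumed node shapes (hypothetical):
      numeric:     {"attr": str, "split": float, "left": ..., "right": ...}
      categorical: {"attr": str, "subset": frozenset, "left": ..., "right": ...}
    """
    if any(tree is None for tree in trees):
        return None  # at least one tree stops here, so there is no agreed split
    first = trees[0]
    if any(tree["attr"] != first["attr"] for tree in trees):
        return None  # splitting attribute differs: discard node and subtree
    node = {"attr": first["attr"]}
    if "subset" in first:
        if any(tree.get("subset") != first["subset"] for tree in trees):
            return None  # categorical subsets differ: discard node and subtree
        node["subset"] = first["subset"]
    else:
        if any("split" not in tree for tree in trees):
            return None  # mixed node types on the same attribute: treat as disagreement
        points = [tree["split"] for tree in trees]
        node["interval"] = (min(points), max(points))  # confidence interval
    node["left"] = combine([tree["left"] for tree in trees])
    node["right"] = combine([tree["right"] for tree in trees])
    return node
```

For example, two trees that both split on `age` at 30.0 and 34.0 produce a coarse root with interval `(30.0, 34.0)`; if their right children split on different attributes, that child is dropped from the coarse tree.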
Verification
• Verifying predictions
  – Use a lower bound on the impurity function to determine whether the confidence interval and splitting attribute are correct
  – Discard the node and its subtree completely if incorrect
• Rerun the algorithm on any set of data related to a discarded node
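One way such a lower bound can be obtained is sketched below; it is an illustration under stated assumptions and may differ from the authors' exact procedure. The assumptions: the impurity is the weighted Gini index, values outside the confidence interval are grouped into buckets, and for each bucket we know the per-class counts of records to the left of its lower and upper boundary. Because the weighted Gini index is a concave function of the left-side class counts, its minimum over the box spanned by those two count vectors is reached at a corner. All function names are hypothetical.

```python
from itertools import product


def weighted_gini_counts(left, total):
    """Weighted Gini of a split, from per-class counts on the left and overall."""
    n = sum(total)
    right = [t - l for l, t in zip(left, total)]
    score = 0.0
    for side in (left, right):
        size = sum(side)
        if size > 0:
            score += (size / n) * (1.0 - sum((c / size) ** 2 for c in side))
    return score


def bucket_lower_bound(counts_lo, counts_hi, total):
    """Lower bound on the impurity of any split point inside one bucket.

    counts_lo / counts_hi: per-class counts left of the bucket's lower / upper
    boundary. Every split inside the bucket has left-side counts between the
    two, so checking the corners of that box is enough for a concave impurity.
    """
    corners = product(*zip(counts_lo, counts_hi))
    return min(weighted_gini_counts(corner, total) for corner in corners)


def split_is_confirmed(best_in_interval, buckets, total):
    """True if no bucket outside the interval can beat the best split inside it."""
    return all(
        bucket_lower_bound(lo, hi, total) >= best_in_interval
        for lo, hi in buckets
    )
```

If `split_is_confirmed` returns `False`, the bound alone cannot rule out a better split elsewhere, which corresponds to the case on the slide where the node and its subtree are discarded and rebuilt.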
Invalidated Predictions
• Discarded top nodes would result in resampling of the entire database
  – No savings on full scans
  – Doesn’t usually happen
  – Basic probability distribution is likely captured by the sample
  – Errors occur at the detail (low) level
Dynamic Environments
• No need to frequently rebuild the decision trees
• Store the confidence intervals
• The tree only needs to be rebuilt if the underlying probability distribution changes
Experimental Results
• Used a sample size of 200,000 tuples
• Grew 20 trees, each on 50,000 tuples drawn from the pool
• Datasets of 1.5 million tuples
• Outperforms brute-force method by a factor of 2 or 3
Experimental Results
• Robust to noise
  – Noise affects the detail-level probability distribution
  – Affected the lower levels, requiring rescans of small amounts of data
• Dynamically updated data
  – BOAT is much faster than brute force
Weak Points
• May not be as useful on complex probability distributions
  – Failure at a high level of the tree means that most of the tree is discarded
• Hypotheses generated are only as simple as those of regular decision trees
  – Simply a way to speed up generation
Suggested Improvements
• Clustering to give a better sample to draw from
  – Groups of data points with a measure of frequency of occurrence
  – Would give better samples of the data and its underlying probability distribution
Suggested Improvements
• Extremely large datasets
  – For TB+ datasets, even two or three passes over the DB may be too many
  – Use MCMC to draw many different samples
  – Estimate the probability density function by resampling
• Would not guarantee tree accuracy
Conclusion
• Effective way to build scalable decision trees
• Much faster than the standard method
• Useful for large datasets