Consistency of Random Forests
Hoang N.V. ([email protected])
Department of Computer Science, FITA - Viet Nam Institute of Agriculture
Seminar IT R&D, HANU - Ha Noi, December 2015
Machine Learning: what is it?
"true"
Parametric
Non-parametric
Supervised problems: not too difficult
Unsupervised problems: very difficult
Find a parameter which minimizes the loss function
Supervised Learning
Learning set ℒ_n = {(x_1, y_1), ..., (x_n, y_n)}
L is a loss function
Classification: zero-one loss function
Regression: L1, L2 loss functions
Bias-variance tradeoff
If the model is too simple, the solution is biased and does not fit the data.
If the model is too complex then it is very sensitive to small changes in the data.
[Hastie et al., 2005]
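Not part of the original slides: a minimal numpy sketch of this tradeoff, assuming a noisy sine curve as the data-generating process and using polynomial degree as the knob for model complexity.

```python
# Hedged toy example (not from the slides): under- vs over-fitting with
# polynomial regression on a noisy sine curve.
import numpy as np

rng = np.random.default_rng(0)
n = 30
x_train = rng.uniform(0, 1, n)
x_test = rng.uniform(0, 1, 1000)
true_f = lambda x: np.sin(2 * np.pi * x)          # assumed data-generating curve
y_train = true_f(x_train) + rng.normal(0, 0.3, n)
y_test = true_f(x_test) + rng.normal(0, 0.3, 1000)

for degree in (1, 4, 15):                          # too simple / reasonable / too complex
    coefs = np.polyfit(x_train, y_train, degree)
    mse_train = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    mse_test = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {mse_train:.3f}, test MSE {mse_test:.3f}")
# Typically degree 1 underfits (high bias), degree 15 overfits (high variance),
# and the intermediate degree does best on the held-out data.
```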
Ensemble Methods
Bagging [Random Forest]
Tree Predictor
Given a learning set ℒ_n, grow a tree h(·; ℒ_n):
Pick an internal node A to split
Pick the best split in A
Split A into two child nodes (A_L and A_R)
Replace A by A_L and A_R in the list of nodes and repeat
A splitting scheme induces a partition Π of the feature space into non-overlapping rectangles R_1, ..., R_ℓ. (A minimal sketch of this scheme follows below.)
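Not from the slides: a minimal sketch of such a recursive splitting scheme for regression, assuming split quality is measured by the decrease in within-node squared error (the slides do not fix a particular criterion at this point).

```python
# Hedged sketch (not the slides' own code) of greedy recursive partitioning
# for regression, scoring splits by the decrease in within-node squared error.
import numpy as np

def best_split(X, y):
    """Return (feature j, threshold s, gain) of the best axis-aligned split of this node."""
    n, p = X.shape
    parent_sse = np.sum((y - y.mean()) ** 2)
    best = (None, None, 0.0)
    for j in range(p):
        for s in np.unique(X[:, j])[:-1]:                # candidate split positions
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            if parent_sse - sse > best[2]:
                best = (j, s, parent_sse - sse)
    return best

def grow_tree(X, y, min_leaf=5):
    """Split recursively; internal nodes are dicts, leaves store the cell average."""
    if len(y) < 2 * min_leaf:
        return float(y.mean())
    j, s, gain = best_split(X, y)
    if j is None or gain <= 0:
        return float(y.mean())
    left, right = X[:, j] <= s, X[:, j] > s
    return {"j": j, "s": s,
            "L": grow_tree(X[left], y[left], min_leaf),
            "R": grow_tree(X[right], y[right], min_leaf)}

def predict(node, x):
    """Follow the splits of the induced partition until a leaf (a float) is reached."""
    while isinstance(node, dict):
        node = node["L"] if x[node["j"]] <= node["s"] else node["R"]
    return node
```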
Predicting Rule
Let Π be the partition induced from ℒ_n, and let R(x) be the cell of Π containing x.
Classification: predict the majority class among the y_i with x_i ∈ R(x).
Regression: predict the average of the y_i with x_i ∈ R(x).
Training Methods
ID3 (Iterative Dichotomiser 3)
C4.5
CART (Classification and Regression Tree)
CHAID
MARS
Conditional Inference Tree
Forest = Aggregation of trees
Aggregating Rule
Grow m trees h_1(·; ℒ_n), ..., h_m(·; ℒ_n) from (randomized versions of) ℒ_n.
Regression: H_m(x; ℒ_n) = (1/m) Σ_{i=1}^m h_i(x; ℒ_n)
Classification: H_m(x; ℒ_n) = majority vote among h_1(x; ℒ_n), ..., h_m(x; ℒ_n)
Grow different trees from the same learning set ℒ_n
Sampling with replacement [Breiman, 1994]
Random subspace sampling [Ho, 1995 & 1998]
Random output sampling [Breiman, 1998]
Randomized C4.5 [Dietterich, 1998]
Purely random forest [Breiman, 2000]
Extremely randomized trees [Geurts, 2006]
Grow different trees from the same learning set ℒ_n
Sampling with replacement - random subspace [Breiman, 2001]
Sampling with replacement - weighted subspace [Amaratunga, 2008; Xu, 2008; Wu, 2012]
Sampling with replacement - random subspace and regularized [Deng, 2012]
Sampling with replacement - random subspace and guided-regularized [Deng, 2013]
Sampling with replacement - random subspace and random split position selection [Saïp Ciss, 2014]
Some RF extensions
Quantile estimation [Meinshausen, 2006]
Survival analysis [Ishwaran et al., 2008]
Ranking [Clémençon et al., 2013]
Online learning [Denil et al., 2013; Lakshminarayanan et al., 2014]
Genome-wide association (GWA) problems [Yang et al., 2013; Botta et al., 2014]
What is a good learner? [Which one is friendly with my data?]
What is good in high-dimensional settings?
[Breiman, 2001]
[Wu et al., 2012]
[Deng, 2012]
[Deng, 2013]
[Saïp Ciss, 2014]
Simulation Experiment
[Figure: simulated example with four response curves y1-y4 and class labels A, B]
Random Forest [Breiman, 2001]
WSRF [Wu, 2012]
Random Uniform Forest [Saรฏp Ciss, 2014]
RRF [Deng, 2012]
GRRF [Deng, 2013]
Simulation Experiment
Random Forest [Breiman, 2001]
WSRF [Wu, 2012]
Random Uniform Forest [Saรฏp Ciss, 2014]
RRF [Deng, 2012]
GRRF [Deng, 2013]
GRRF with AUC [Deng, 2013]
GRRF with ER [Deng, 2013]
Simulation Experiment
Multiple Class Tree
[Figure: example multi-class tree with leaf classes A-E]
Random Forest [Breiman, 2001]
WSRF [Wu, 2012]
Random Uniform Forest [Saรฏp Ciss, 2014]
RRF [Deng, 2012]
GRRF [Deng, 2013]
GRRF with ER [Deng, 2013]
What is a good learner? [Nothing you do will convince me; I need rigorous theoretical guarantees]
Asymptotic statistics and learning theory [going beyond experimental results]
Machine Learning: what is it?
Parametric
Non-parametric
Supervised problems: not too difficult
Unsupervised problems: very difficult
Find a parameter which minimizes the loss function
"right"
The pattern learnt from ℒ_n is "true", isn't it? How much should I believe it?
Is this procedure friendly with my data?
What is the best possible procedure for my problem?
What if our assumptions are wrong?
"efficient"
How many observations do I need in order to obtain a "believable" pattern?
How many computations do I need?
Assumption: there are some patterns
Learning Theory [Vapnik, 1999]: asymptotic theory, necessary and sufficient conditions, the best possible procedures
Supervised Learning
[Diagram: a Generator produces inputs x, the Supervisor returns outputs y, and the Learning Machine returns predictions y′ approximating y]
Machine Learning
Two different goals
imitate (prediction accuracy)
identify (interpretability)
What is the best predictor?
In theory, when P(X, Y) is known, the best predictor is the Bayes model φ_B; its error Err(φ_B) is the residual error.
For a model φ_ℒ built from any learning set ℒ: Err(φ_B) ≤ Err(φ_ℒ).
If L is the zero-one loss function, the Bayes model is φ_B(x) = argmax_{c ∈ {c_1, ..., c_J}} P(Y = c | X = x): in classification, the best possible classifier consists in systematically predicting the most likely class y ∈ {c_1, ..., c_J} given X = x.
If L is the squared error loss function, the Bayes model is φ_B(x) = E[Y | X = x]: in regression, the best possible regressor consists in systematically predicting the average value of Y given X = x.
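A quick check of the regression case, not on the original slide: for any predictor g,
E[(Y − g(X))²] = E[(Y − E[Y|X])²] + E[(E[Y|X] − g(X))²] ≥ E[(Y − E[Y|X])²],
because the cross term vanishes after conditioning on X; equality holds if and only if g(X) = E[Y|X] almost surely, which is exactly the Bayes regressor above.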
Given a learning algorithm 𝒜 and a loss function L:
φ_n = 𝒜(ℒ_n) is the model built from the learning set ℒ_n = {(x_1, y_1), ..., (x_n, y_n)}.
The learning algorithm 𝒜 is consistent in L if and only if
Err(φ_n) → Err(φ_B) in probability (or in expectation) as n → ∞.
Random Forests are consistent, aren't they?
Each tree is grown with an independent random parameter Θ_i.
Θ is used to sample the training set or to select the candidate directions or positions for splitting.
Θ is independent of the dataset and thus unrelated to the particular problem.
In some new variants of RF, Θ depends on the dataset.
Generalized Random Forest
Bagging Procedure
Draw i.i.d. random parameters Θ_1, ..., Θ_m.
For each Θ_i, grow a tree h(·; Θ_i, ℒ_n) from ℒ_n.
Aggregate: H_m(x; ℒ_n) = (1/m) Σ_{i=1}^m h(x; Θ_i, ℒ_n). (A sketch of this procedure follows below.)
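Not from the slides: a minimal Python sketch of this generic procedure, assuming scikit-learn's DecisionTreeRegressor as the base tree learner. Here each Θ_i bundles a bootstrap sample and a per-tree random feature subset; this is a simplification, since Breiman's forests re-draw the feature subset at every node.

```python
# Hedged sketch (not the slides' own code): generic bagging of randomized trees.
import numpy as np
from sklearn.tree import DecisionTreeRegressor  # assumed base learner

def fit_forest(X, y, m=100, mtry=None, seed=0):
    """Grow m trees; Theta_i = (bootstrap rows, random feature subset)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    mtry = mtry or max(1, p // 3)
    forest = []
    for _ in range(m):
        rows = rng.integers(0, n, n)                       # sampling with replacement
        feats = rng.choice(p, size=mtry, replace=False)    # random subspace (per tree)
        tree = DecisionTreeRegressor().fit(X[np.ix_(rows, feats)], y[rows])
        forest.append((feats, tree))
    return forest

def predict_forest(forest, X):
    """Aggregating rule: H_m(x) = (1/m) * sum_i h(x; Theta_i, L_n)."""
    preds = [tree.predict(X[:, feats]) for feats, tree in forest]
    return np.mean(preds, axis=0)

# Toy usage on synthetic data
rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 5))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=200)
print(predict_forest(fit_forest(X, y, m=50), X[:3]))
```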
Consistency of Random Forests
Goal: lim_{n→∞} Err(H_m(·; ℒ_n)) = Err(φ_B)
Problem:
- m is finite, so the predictor depends on the particular trees that form the forest
- the structure of a tree depends on Θ_i and on the learning set, so a finite forest is a subtle combination of randomness and data-dependent structures, and finite-forest predictions can be difficult to interpret (random prediction or not)
- non-asymptotic rate of convergence
Challenges
H_∞(x; ℒ_n) = E_Θ[h(x; Θ, ℒ_n)]
lim_{m→∞} H_m(x; ℒ_n) = H_∞(x; ℒ_n)
Problem:
- an infinite forest is better than a finite forest, isn't it?
- what is a good m? (rate of convergence)
Challenge
Consistency of Random Forests
Review some recent results
Strength, Correlation and Err [Breiman, 2001]
Trees h(·; Θ_i, ℒ_n) are grown with i.i.d. random parameters Θ_i; the forest aggregates h(·; Θ, ℒ_n) over Θ.
Theorem 2.3 An upper bound for the generalization error is given by
PE* ≤ ρ̄ (1 − s²) / s²
where ρ̄ is the mean value of the correlation and s is the strength of the set of classifiers.
RF and Adaptive Nearest Neighbors [Lin et al., 2006]
Each tree prediction is a weighted average of the training responses:
h(x; Θ_i, ℒ_n) = Σ_{j=1}^n w_j(x; Θ_i) y_j, with w_j(x; Θ_i) = 1{x_j ∈ L(x; Θ_i)} / #{v : x_v ∈ L(x; Θ_i)},
where L(x; Θ_i) is the leaf of the i-th tree containing x.
Aggregating over trees: H_∞(x; ℒ_n) = Σ_{j=1}^n w_j(x) y_j with w_j(x) = E_Θ[w_j(x; Θ)],
so a random forest acts as an adaptive weighted (potential) nearest-neighbour method.
Non-adaptive if the weights w_j do not depend on the y_j's of the learning set.
The terminal node size k should be made to increase with the sample size n. Therefore, growing large trees (k being a small constant) does not always give the best performance.
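Not part of the slides: a small numerical check of this weighted-neighbour representation, using scikit-learn's RandomForestRegressor with bootstrapping disabled so that the identity holds exactly; leaf memberships are read off with apply().

```python
# Hedged check (not from the slides) of the weighted-neighbour identity.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 4))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=300)

# bootstrap=False so every tree sees all points and the identity is exact
rf = RandomForestRegressor(n_estimators=100, bootstrap=False, random_state=0).fit(X, y)

x0 = X[:1]                                  # query point
leaf_x0 = rf.apply(x0)[0]                   # leaf of x0 in each tree, shape (n_trees,)
leaf_train = rf.apply(X)                    # leaf of each training point in each tree

# w_j(x0) = average over trees of 1{x_j in leaf(x0)} / #{points in leaf(x0)}
weights = np.zeros(len(y))
for t in range(rf.n_estimators):
    in_leaf = leaf_train[:, t] == leaf_x0[t]
    weights[in_leaf] += 1.0 / in_leaf.sum() / rf.n_estimators

print("weights sum to   :", weights.sum())             # ~1 by construction
print("weighted average :", float(weights @ y))
print("rf.predict       :", float(rf.predict(x0)[0]))  # matches the weighted average
```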
Biau et al., 2008
Given a learning set ℒ_n = {(x_1, y_1), ..., (x_n, y_n)} of ℝ^p × {0, 1}.
A binary classifier g_n trained from ℒ_n is a map ℝ^p → {0, 1}, with error Err(g_n) = P(g_n(X) ≠ Y).
The Bayes classifier is g_B(x) = 1{P(Y = 1 | X = x) > 1/2}, with error Err(g_B).
A sequence {g_n} of classifiers is consistent for a certain distribution of (X, Y) if Err(g_n) → Err(g_B) in probability.
Assume that the sequence {g_n} of randomized classifiers is consistent for a certain distribution of (X, Y). Then the voting classifier H_m (for any value of m) and the averaged classifier H_∞ are also consistent.
Biau et al, 2008
Growing Trees
A node A is randomly selected
The split feature j is selected uniformly at random from {1, ..., p}
Finally, the selected node is split along the randomly chosen feature at a random location
* Recursive node splits do not depend on the labels y_1, ..., y_n
Theorem 2 Assume that the distribution of X is supported on [0, 1]^p. Then the purely random forest classifier H_∞ is consistent whenever k → ∞ and k/n → 0 as n → ∞. (A simplified sketch of this construction follows below.)
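Not from the slides: a simplified Python sketch of such a purely random tree. For brevity every node is split down to a fixed depth (rather than splitting one randomly chosen node per step), and a small forest of these trees classifies by majority vote.

```python
# Hedged sketch (not the authors' code) of a purely random tree classifier:
# splits never look at the labels y_1, ..., y_n.
import numpy as np

def grow_purely_random(cell, depth, rng, p):
    """cell = list of (low, high) bounds per feature; split every node to a fixed depth."""
    if depth == 0:
        return None                                      # leaf
    j = rng.integers(p)                                  # random split feature
    lo, hi = cell[j]
    s = rng.uniform(lo, hi)                              # random split location
    left, right = list(cell), list(cell)
    left[j], right[j] = (lo, s), (s, hi)
    return {"j": j, "s": s,
            "L": grow_purely_random(left, depth - 1, rng, p),
            "R": grow_purely_random(right, depth - 1, rng, p)}

def tree_vote(tree, X, y, x0):
    """Majority vote of the labels whose x_i fall in the same cell as x0."""
    mask = np.ones(len(y), dtype=bool)
    node = tree
    while node is not None:
        go_left = x0[node["j"]] <= node["s"]
        mask &= (X[:, node["j"]] <= node["s"]) == go_left
        node = node["L"] if go_left else node["R"]
    return int(y[mask].mean() > 0.5) if mask.any() else 0

rng = np.random.default_rng(0)
p, n = 2, 500
X = rng.uniform(size=(n, p))
y = (X[:, 0] > 0.5).astype(int)
trees = [grow_purely_random([(0.0, 1.0)] * p, depth=4, rng=rng, p=p) for _ in range(50)]
votes = [tree_vote(t, X, y, np.array([0.8, 0.3])) for t in trees]
print("forest vote:", int(np.mean(votes) > 0.5))         # expect class 1
```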
Biau et al, 2008
Growing Trees
Each base classifier is trained on a random subsample of ℒ_n governed by a parameter q_n.
Theorem 6 Let {g_n} be a sequence of classifiers that is consistent for the distribution of (X, Y). Consider the bagging classifiers H_m and H_∞ using parameter q_n. If n·q_n → ∞ as n → ∞ then both classifiers are consistent.
Biau et al, 2012
Growing Trees
At each node, a coordinate is selected to split on; p_nj ∈ (0, 1) is the probability that the j-th feature is selected.
The split is at the midpoint of the chosen side.
Theorem 1 Assume that the distribution of X has support on [0, 1]^p. Then the random forests estimate H_∞(X; ℒ_n) is consistent whenever p_nj log k_n → ∞ for all j = 1, ..., p and k_n/n → 0 as n → ∞.
Biau et al, 2012
In sparse settings: assume that X is uniformly distributed on [0,1]^p, only the S features in 𝒮 are informative, and p_nj = (1/S)(1 + ξ_nj) for j ∈ 𝒮.
Estimation Error (variance):
E{[H_∞(X; ℒ_n) − H̃_∞(X; ℒ_n)]²} ≤ C σ² S²/(S − 1) · (1 + ξ_n) k_n / (n (log k_n)^{S/(2p)})
If a < ξ_nj < b for some constants a, b ∈ (0, 1), then 1 + ξ_n stays bounded (between 1 + a and 1 + b), so the variance is of order k_n / (n (log k_n)^{S/(2p)}).
Biau et al, 2012
Assume that X is uniformly distributed on [0,1]^p and that φ_B, as a function of the strong features in 𝒮, is L-Lipschitz on [0,1]^p.
In sparse settings: p_nj = (1/S)(1 + ξ_nj) for j ∈ 𝒮.
Approximation Error (bias²):
E[H̃_∞(X; ℒ_n) − φ_B(X)]² ≤ 2 S L² k_n^{−0.75/(S log 2)·(1+γ_n)} + [sup_{x∈[0,1]^p} φ_B²(x)] e^{−n/(2 k_n)}
where γ_n = min_{j∈𝒮} ξ_nj tends to 0 as n tends to infinity.
Finite and infinite RFs [Scornet, 2014]
Finite forest: H_m(x; ℒ_n) = (1/m) Σ_{i=1}^m h(x; Θ_i, ℒ_n)
Infinite forest: H_∞(x; ℒ_n) = E_Θ[h(x; Θ, ℒ_n)]
Theorem 3.1 Conditionally on ℒ_n, almost surely, for all x ∈ [0, 1]^p, we have H_m(x; ℒ_n) → H_∞(x; ℒ_n) as m → ∞. (A numerical illustration follows below.)
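Not on the slides: a quick numerical illustration of this convergence with scikit-learn forests. Two independent draws of the m tree parameters (different random_state values) are compared at a fixed query point, with the learning set held fixed.

```python
# Hedged illustration (not from the slides) that H_m stabilises as m grows.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 5))
y = np.sin(2 * np.pi * X[:, 0]) + 0.2 * rng.normal(size=300)
x0 = np.full((1, 5), 0.5)                     # fixed query point

for m in (1, 10, 100, 1000):
    # two independent draws of the m tree parameters, same learning set L_n
    p1 = RandomForestRegressor(n_estimators=m, random_state=1).fit(X, y).predict(x0)[0]
    p2 = RandomForestRegressor(n_estimators=m, random_state=2).fit(X, y).predict(x0)[0]
    print(f"m={m:4d}  H_m(x0) = {p1:.4f}  (independent replicate: {p2:.4f})")
# The two replicates get closer as m grows: H_m(x0) approaches the
# infinite-forest value H_inf(x0), which is deterministic given L_n.
```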
Finite and infinite RFs [Scornet, 2014]
Assumption (H): one has Y = f(X) + ε, where ε is a centered Gaussian noise with finite variance σ², independent of X, and ‖f‖_∞ = sup_{x∈[0,1]^p} |f(x)| < ∞.
Theorem 3.3 Assume (H) is satisfied. Then, for all m, n ∈ ℕ*,
Err(H_m(·; ℒ_n)) = Err(H_∞(·; ℒ_n)) + (1/m) E_{X,ℒ_n}[V_Θ[h(X; Θ, ℒ_n)]].
In particular, for all ε > 0, if m ≥ (8/ε)(‖f‖_∞² + σ²) + (32/ε) σ² log n, then Err(H_m) − Err(H_∞) ≤ ε; moreover
0 ≤ Err(H_m) − Err(H_∞) ≤ (8/m)(‖f‖_∞² + σ²(1 + 4 log n)).
RF and Additive regression model [Scornet et al., 2015]
Growing Trees (subsampling without replacement)
Assume that A is the selected node and that A contains more than one observation
Select uniformly, without replacement, a subset ℳ_try ⊂ {1, ..., p} with |ℳ_try| = m_try
Select the best split in A by optimizing the CART-split criterion along the coordinates in ℳ_try (see the sketch below)
Cut the cell A according to the best split; call A_L and A_R the two resulting cells
Replace A by A_L and A_R in the collection of cells and repeat
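Not from the slides: a minimal Python sketch of one such CART split for regression, where the criterion maximised is the empirical variance reduction and only a random subset of m_try coordinates is tried.

```python
# Hedged sketch (not the authors' code) of one CART split over a random
# coordinate subset M_try, maximising the empirical variance reduction.
import numpy as np

def cart_split(X_cell, y_cell, mtry, rng):
    """Return (j, z) maximising Var(y) - [n_L*Var(y_L) + n_R*Var(y_R)]/n over j in M_try."""
    n, p = X_cell.shape
    m_try = rng.choice(p, size=mtry, replace=False)      # coordinates tried at this node
    best_j, best_z, best_gain = None, None, -np.inf
    for j in m_try:
        for z in np.unique(X_cell[:, j])[:-1]:           # candidate cut positions
            left = X_cell[:, j] <= z
            n_l = left.sum()
            gain = y_cell.var() - (n_l * y_cell[left].var()
                                   + (n - n_l) * y_cell[~left].var()) / n
            if gain > best_gain:
                best_j, best_z, best_gain = j, z, gain
    return best_j, best_z

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 10))
y = 2.0 * (X[:, 3] > 0.5) + 0.1 * rng.normal(size=200)
print(cart_split(X, y, mtry=10, rng=rng))                # with mtry = p it should pick j = 3
```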
RF and Additive regression model [Scornet et al., 2015]
Assumption (H1): Y = Σ_{j=1}^p f_j(X^(j)) + ε (additive regression model).
Theorem 3.1 Assume that (H1) is satisfied. Then, provided a_n → ∞ and t_n (log a_n)^9 / a_n → 0, random forests are consistent (a_n is the subsample size and t_n the number of terminal nodes).
Theorem 3.2 Assume that (H1) and (H2) are satisfied and let t_n = a_n. Then, provided a_n → ∞, t_n → ∞ and a_n log n / n → 0, random forests are consistent.
RF and Additive regression model [Scornet et al., 2015]
Denote by ĵ_{1,n}(X), ..., ĵ_{k,n}(X) the first k cut directions used to construct the cell containing X, with ĵ_{q,n}(X) = ∞ if the cell has been cut strictly less than q times.
Theorem 3.2 Assume that (H1) is satisfied. Let k ∈ ℕ* and ξ > 0. Assume that there is no interval [a, b] and no j ∈ {1, ..., S} such that f_j is constant on [a, b] (S denoting the number of informative features in the sparse setting). Then, with probability 1 − ξ, for all n large enough, we have, for all 1 ≤ q ≤ k, ĵ_{q,n}(X) ∈ {1, ..., S}.
[Wager, 2015]
A partition Λ is (α, k)-valid if it can be generated by a recursive partitioning scheme in which each child node contains at least a fraction α of the data points in its parent node for some 0 < α < 0.5, and each terminal node contains at least k training examples for some k ∈ ℕ.
Given a dataset 𝒟, let 𝒱_{α,k}(𝒟) denote the set of (α, k)-valid partitions.
Valid tree: f_Λ : [0,1]^p → ℝ, f_Λ(x) = (1/|{i : x_i ∈ L(x)}|) Σ_{i : x_i ∈ L(x)} y_i.
Partition-optimal tree: f*_Λ : [0,1]^p → ℝ, f*_Λ(x) = E[Y | X ∈ L(x)].
Question: can we treat f_Λ as a good approximation to f*_Λ, the best predictor supported on the partition Λ?
Setting: a learning set ℒ_n on [0,1]^p × ℝ with X ~ U([0,1]^p).
[Wager, 2015]
Theorem 1 Given parameters n, p, k such that lim_{n→∞} (log n log p)/k = 0 and p = Ω(n), then
lim_{n,p,k→∞} P[ sup_{x ∈ [0,1]^p, Λ ∈ 𝒱_{α,k}} |f_Λ(x) − f*_Λ(x)| ≤ 6M √( log n log p / (k log((1 − α)^{-1})) ) ] = 1,
where M bounds the scale of the responses.
[Wager, 2015]
Growing Trees (guess-and-check)
Select a currently un-split node A containing at least 2k training examples
Pick a candidate splitting variable j ∈ {1, ..., p} uniformly at random
Pick the splitting point ĉ minimizing the squared error, with improvement Δ(ĉ)
If either there has already been a successful split along variable j for some other node, or
Δ(ĉ) ≥ 36 M² log n log p / (k log((1 − α)^{-1})),
then the split succeeds and we cut the node A at ĉ along the j-th variable; if not, we do not split the node A this time. (A sketch of this rule follows below.)
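Not from the slides: a rough Python sketch of this acceptance rule, taking Δ to be the raw within-node decrease in squared error (the slide does not spell out the normalisation) and copying the threshold constants from above.

```python
# Hedged sketch (not the paper's code) of the guess-and-check splitting rule.
import numpy as np

def guess_and_check_split(X_node, y_node, p_total, n_total, k, alpha, M, rng,
                          already_split=frozenset()):
    j = rng.integers(p_total)                            # guess: one random candidate feature
    best_c, best_gain = None, -np.inf
    sse_parent = np.sum((y_node - y_node.mean()) ** 2)
    for c in np.unique(X_node[:, j])[:-1]:               # check: best cut along feature j
        left = X_node[:, j] <= c
        sse = (np.sum((y_node[left] - y_node[left].mean()) ** 2)
               + np.sum((y_node[~left] - y_node[~left].mean()) ** 2))
        if sse_parent - sse > best_gain:
            best_c, best_gain = c, sse_parent - sse
    threshold = 36 * M**2 * np.log(n_total) * np.log(p_total) / (k * np.log(1 / (1 - alpha)))
    if j in already_split or best_gain >= threshold:
        return j, best_c                                 # the split succeeds
    return None                                          # otherwise, do not split this time

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 20))
y = (X[:, 7] > 0.5).astype(float) + 0.1 * rng.normal(size=500)
print(guess_and_check_split(X, y, p_total=20, n_total=500, k=25, alpha=0.2, M=1.0, rng=rng))
```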
[Wager, 2015] In sparse settings
Assumption H1: there exist a set 𝒮 of strong features and sign variables σ_j ∈ {±1} such that, for all j ∈ 𝒮 and all x ∈ [0,1]^p,
σ_j ( E[Y | X^(−j) = x^(−j), X^(j) > 1/2] − E[Y | X^(−j) = x^(−j), X^(j) ≤ 1/2] ) ≥ β > 0.
Assumption H2: E[Y | X = x] is Lipschitz-continuous in x.
Theorem 2 Under the conditions of Theorem 1, suppose that the assumptions of the sparse setting hold; then the guess-and-check forest is consistent.
[Wager, 2015]
Valid forest: H_{{Λ}_1^B} : [0,1]^p → ℝ, H_{{Λ}_1^B}(x) = (1/B) Σ_{b=1}^B f_{Λ_b}(x).
Partition-optimal forest: H*_{{Λ}_1^B} : [0,1]^p → ℝ, H*_{{Λ}_1^B}(x) = (1/B) Σ_{b=1}^B f*_{Λ_b}(x).
Theorem 4
lim_{n,p,k→∞} P[ sup_{H ∈ ℋ_{α,k}} | (1/n) Σ_{i=1}^n (y_i − H(x_i))² − E[(Y − H(X))²] | ≤ 11 M² √( log n log p / (k log((1 − α)^{-1})) ) ] = 1,
where ℋ_{α,k} denotes the set of valid forests.
References
B. Efron. Estimation and accuracy after model selection. Journal of the American Statistical Association, 2013.
B. Lakshminarayanan et al. Mondrian forests: Efficient online random forests. arXiv:1406.2673, 2014.
B. Xu et al. Classifying very high-dimensional data with random forests built from small subspaces. International Journal of Data Warehousing and Mining (IJDWM) 8(2), 37:44โ63, 2012.
D. Amaratunga et al. Enriched random forests. Bioinformatics (Oxford, England), 24(18): 2010โ2014, 2008 doi:10.1093/bioinformatics/btn356
E. Scornet. On the asymptotics of random forests. arXiv:1409.2090, 2014
E. Scornet et al. Consistency of random forests. The Annals of Statistics. 43 (2015), no. 4, 1716--1741. doi:10.1214/15-AOS1321. http://projecteuclid.org/euclid.aos/1434546220.
G. Biau et al. Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research, 9:2015โ2033, 2008.
G. Biau. Analysis of a random forests model. Journal of Machine Learning Research, 13:1063โ1095, 2012.
H. Deng et al. Feature Selection via Regularized Trees, The 2012 International Joint Conference on Neural Networks (IJCNN), IEEE, 2012.
H. Deng et al. Gene Selection with Guided Regularized Random Forest , Pattern Recognition, 46.12 (2013): 3483-3489
H. Ishwaran et al. Random survival forest. The Annals of Applied Statistics, 2:841โ860, 2008.
L. Breiman. Bagging predictors. Technical Report No. 421, Statistics Department, UC Berkeley, 1994.
L. Breiman. Randomizing outputs to increase prediction accuracy. Technical Report 518, Statistics Department, UC Berkeley, 1998.
L. Breiman. Some infinite theory for predictor ensembles. Technical Report 577, Statistics Department, UC Berkeley, 2000.
L. Breiman. Random forests. Machine Learning, 45:5โ32, 2001.
M. Denil et al. Consistency of online random forests. International Conference on Machine Learning (ICML) 2013.
N. Meinshausen. Quantile regression forests. Journal of Machine Learning Research, 7:983โ999, 2006.
P. Geurts et al. Extremely randomized trees. Machine Learning, 63(1):3-42, 2006.
Q. Wu et al. Snp selection and classification of genome-wide snp data using stratified sampling random forests. NanoBioscience, IEEE Transactions on 11(3): 216โ227, 2012.
T. Dietterich. An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting and Randomization, Machine Learning 1-22, 1998
T.K. Ho. Random decision forests, Document Analysis and Recognition, 1995., Proceedings of the Third International Conference on , 1:278-282, 14-16 Aug 1995, doi: 10.1109/ICDAR.1995.598994
T.K. Ho. The random subspace method for constructing decision forests, IEEE Trans. on Pattern Analysis and Machine Intelligence, 20(8):832-844, 1998
Saïp Ciss. Random Uniform Forests. 2015. <hal-01104340v2>
S. Clémençon et al. Ranking forests. Journal of Machine Learning Research, 14:39–73, 2013.
S. Wager. Asymptotic theory for random forests. arXiv:1405.0352, 2014.
S. Wager. Uniform convergence of random forests via adaptive concentration. arXiv:1503.06388, 2015
Y. Lin and Y. Jeon. Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101:578โ590, 2006.
V. N. Vapnik. An overview of Statistical Learning Theory. IEEE Trans. on Neural Networks, 10(5):988-999, 1999.
RF and Additive regression model [Scornet et al., 2015]
The tree weights involve the indicator that X′ falls in the same cell as X in the random tree designed with ℒ_n and the random parameter Θ, where X′ is an independent copy of X, independent of ℒ_n and of Θ_1, ..., Θ_m.