Consistency of Random Forests

Transcript
Page 1: Consistency of Random Forests

Consistency of Random Forests
Hoang N.V.

[email protected]
Department of Computer Science

FITA – Viet Nam Institute of Agriculture

Seminar IT R&D, HANU, Ha Noi, December 2015

Page 2: Consistency of Random Forests

Machine Learning, what is it?

"true"

Parametric

Non-parametric

Supervised problems: not too difficult

Unsupervised problems: very difficult

Find a parameter which minimizes the loss function

Page 3: Consistency of Random Forests

Supervised Learning

Given a learning set ℒ_n = {(𝒙_1, y_1), …, (𝒙_n, y_n)}

L is a loss function

Classification: zero-one loss function

Regression: 𝕃1 or 𝕃2 loss
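As a concrete (illustrative) reading of these losses, the empirical risk of a predictor on ℒ_n is the average loss over the n examples. A minimal Python sketch, with the loss names chosen here only for illustration:

import numpy as np

def empirical_risk(y_true, y_pred, loss="zero_one"):
    # Average loss of a predictor over a learning set L_n (illustrative sketch).
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    if loss == "zero_one":                      # classification
        return np.mean(y_true != y_pred)
    if loss == "l1":                            # regression, absolute (L1) loss
        return np.mean(np.abs(y_true - y_pred))
    if loss == "l2":                            # regression, squared (L2) loss
        return np.mean((y_true - y_pred) ** 2)
    raise ValueError(loss)

# empirical_risk([0, 1, 1], [0, 0, 1]) -> 0.333...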

Page 4: Consistency of Random Forests

Bias-variance tradeoff

If the model is too simple, the solution is biased and does not fit the data.

If the model is too complex, it is very sensitive to small changes in the data.
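Under squared-error loss this tradeoff can be stated exactly; the following decomposition of the expected error at a point 𝒙 is the standard one, added here for completeness rather than taken from the slides:

\mathbb{E}\big[(Y - \varphi_n(\boldsymbol{x}))^2\big]
  = \underbrace{\big(\mathbb{E}[\varphi_n(\boldsymbol{x})] - \varphi_B(\boldsymbol{x})\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{V}[\varphi_n(\boldsymbol{x})]}_{\text{variance}}
  + \underbrace{\mathbb{E}\big[(Y - \varphi_B(\boldsymbol{x}))^2 \mid X = \boldsymbol{x}\big]}_{\text{residual (Bayes) error}}

where the expectation and variance of φ_n(𝒙) are taken over the learning set ℒ_n, and φ_B(𝒙) = 𝔼[Y | X = 𝒙] is the Bayes regressor.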

Page 5: Consistency of Random Forests

[Hastie et al., 2005]

Page 6: Consistency of Random Forests

Ensemble Methods

Page 7: Consistency of Random Forests

Bagging [Random Forest]

Page 8: Consistency of Random Forests

Tree Predictor

Given a learning set ℒ_n, grow a tree by recursively splitting cells:

Pick a node A to split

Pick the best split in A

Split A into two child nodes (A_L and A_R)

Set A_L and A_R as new nodes and repeat

A splitting scheme induces a partition Λ of the feature space into non-overlapping rectangles P_1, …, P_ℓ.
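A schematic sketch of this growing loop (illustrative Python, not from the slides); the splitting rule below is a deliberate placeholder, and any criterion such as the CART criterion shown later can be plugged in:

import numpy as np

rng = np.random.default_rng(0)

def split_cell(idx, X):
    # Placeholder splitting rule: cut a random coordinate at the median of the cell.
    # A real tree would pick the *best* split according to its criterion.
    j = rng.integers(X.shape[1])
    t = np.median(X[idx, j])
    return idx[X[idx, j] <= t], idx[X[idx, j] > t]

def grow_partition(X, min_leaf=5):
    # Recursively split cells until every cell holds fewer than 2*min_leaf points;
    # returns the list of index sets P_1, ..., P_l forming the partition Lambda.
    leaves, to_split = [], [np.arange(len(X))]
    while to_split:
        idx = to_split.pop()
        if len(idx) < 2 * min_leaf:
            leaves.append(idx)
            continue
        left, right = split_cell(idx, X)
        if len(left) == 0 or len(right) == 0:   # degenerate cut: stop splitting this cell
            leaves.append(idx)
        else:
            to_split += [left, right]
    return leaves

# leaves = grow_partition(rng.uniform(size=(200, 3)))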

Page 9: Consistency of Random Forests

Tree Predictor

(Tree construction from ℒ_n as on page 8.)

Predicting Rule

Given the partition Λ built from ℒ_n, the tree predicts at 𝒙 using the cell of Λ that contains 𝒙: the average of the responses y_i falling in that cell (regression), or the majority class (classification).

Page 10: Consistency of Random Forests

Tree Predictor

(Tree construction and predicting rule as on pages 8 and 9.)

Training Methods

ID3 (Iterative Dichotomiser 3)

C4.5

CART (Classification and Regression Trees)

CHAID

MARS

Conditional Inference Trees

Page 11: Consistency of Random Forests

Forest = Aggregation of trees

Aggregating Rule

Grow m trees T_1(·; ℒ_n), …, T_m(·; ℒ_n) from ℒ_n and aggregate them:

H_m(𝒙; ℒ_n) = (1/m) Σ_{i=1}^{m} T_i(𝒙; ℒ_n) (regression: average), or the majority vote over the trees (classification).

Page 12: Consistency of Random Forests
Page 13: Consistency of Random Forests

Grow different trees from the same learning set ℒ_n

Sampling with replacement [Breiman, 1994]

Random subspace sampling [Ho, 1995 & 1998]

Random output sampling [Breiman, 1998]

Randomized C4.5 [Dietterich, 1998]

Purely random forest [Breiman, 2000]

Extremely randomized trees [Geurts, 2006]

Page 14: Consistency of Random Forests

Grow different trees from the same learning set ℒ_n

Sampling with replacement - random subspace [Breiman, 2001]

Sampling with replacement - weighted subspace [Amaratunga, 2008; Xu, 2008; Wu, 2012]

Sampling with replacement - random subspace and regularized [Deng, 2012]

Sampling with replacement - random subspace and guided-regularized [Deng, 2013]

Sampling with replacement - random subspace and random split position selection [Saïp Ciss, 2014]

Page 15: Consistency of Random Forests

Some RF extensions

Quantile estimation [Meinshausen, 2006]

Survival analysis [Ishwaran et al., 2008]

Ranking [Clémençon et al., 2013]

Online learning [Denil et al., 2013; Lakshminarayanan et al., 2014]

GWA (genome-wide association) problems [Yang et al., 2013; Botta et al., 2014]

Page 16: Consistency of Random Forests

What is a good learner? [What is friendly with my data?]

Page 17: Consistency of Random Forests

What is good in high-dimensional settings?

Breiman, 2001

Wu et al., 2012

Deng, 2012

Deng, 2013

Saïp Ciss, 2014

Page 18: Consistency of Random Forests

Simulation Experiment

Page 19: Consistency of Random Forests

[Figure: simulated responses y1–y4 plotted over x ∈ [0, 1], with regions labelled A and B.]

Page 20: Consistency of Random Forests

Random Forest [Breiman, 2001]

Page 21: Consistency of Random Forests

WSRF [Wu, 2012]

Page 22: Consistency of Random Forests

Random Uniform Forest [Saïp Ciss, 2014]

Page 23: Consistency of Random Forests

RRF [Deng, 2012]

Page 24: Consistency of Random Forests

GRRF [Deng, 2013]

Page 25: Consistency of Random Forests
Page 26: Consistency of Random Forests

Simulation Experiment

Page 27: Consistency of Random Forests

Random Forest [Breiman, 2001]

Page 28: Consistency of Random Forests

WSRF [Wu, 2012]

Page 29: Consistency of Random Forests

Random Uniform Forest [Saïp Ciss, 2014]

Page 30: Consistency of Random Forests

RRF [Deng, 2012]

Page 31: Consistency of Random Forests

GRRF [Deng, 2013]

Page 32: Consistency of Random Forests

GRRF with AUC [Deng, 2013]

Page 33: Consistency of Random Forests

GRRF with ER [Deng, 2013]

Page 34: Consistency of Random Forests

Simulation Experiment

Multiple Class Tree

[Figure: tree diagram with class labels A–E at the nodes.]

Page 35: Consistency of Random Forests

Random Forest [Breiman, 2001]

Page 36: Consistency of Random Forests

WSRF [Wu, 2012]

Page 37: Consistency of Random Forests

Random Uniform Forest [Saïp Ciss, 2014]

Page 38: Consistency of Random Forests

RRF [Deng, 2012]

Page 39: Consistency of Random Forests

GRRF [Deng, 2013]

Page 40: Consistency of Random Forests

GRRF with ER [Deng, 2013]

Page 41: Consistency of Random Forests

What is a good learner? [Nothing you do will convince me]

[I need rigorous theoretical guarantees]

Page 42: Consistency of Random Forests

Asymptotic statistics and learning theory [going beyond experimental results]

Page 43: Consistency of Random Forests

Machine Learning, what is it?

Parametric

Non-parametric

Supervised problems: not too difficult

Unsupervised problems: very difficult

Find a parameter which minimizes the loss function

Page 44: Consistency of Random Forests

"right"

Is the pattern learnt from ℒ_n "true"? How much should I believe it?

Is this procedure friendly with my data?

What is the best possible procedure for my problem?

What if our assumptions are wrong?

"efficient"

How many observations do I need in order to achieve a "believable" pattern?

How many computations do I need?

Assumption: there are some patterns

Page 45: Consistency of Random Forests

Learning Theory [Vapnik, 1999]

asymptotic theory

necessary and sufficient conditions

the best possible

Page 46: Consistency of Random Forests

Supervised Learning

Given a learning set ℒ_n = {(𝒙_1, y_1), …, (𝒙_n, y_n)}

L is a loss function

Classification: zero-one loss function

Regression: 𝕃1 or 𝕃2 loss

Page 47: Consistency of Random Forests

Supervised Learning

[Diagram (Vapnik): a Generator draws 𝒙, a Supervisor returns the response 𝑦 for each 𝒙, and the Learning Machine outputs its own prediction 𝑦′.]

Two different goals

imitate (prediction accuracy)

identify (interpretability)

Page 48: Consistency of Random Forests

What is the best predictor?

Page 49: Consistency of Random Forests

What is the best predictor?

Bayes model

residual error: Err(φ_B)

For a model φ_ℒ built from any learning set ℒ, Err(φ_B) ≤ Err(φ_ℒ)

In theory, φ_B is available when 𝑃(𝑋, 𝑌) is known

Page 50: Consistency of Random Forests

What is the best predictor?

If L is the zero-one loss function, the Bayes model is φ_B(𝒙) = argmax_{y ∈ {c_1, …, c_J}} 𝑃(𝑌 = y | 𝑋 = 𝒙)

In classification, the best possible classifier consists in systematically predicting the most likely class y ∈ {c_1, …, c_J} given 𝑋 = 𝒙

Page 51: Consistency of Random Forests

What is the best predictor?

If L is the squared error loss function, the Bayes model is φ_B(𝒙) = 𝔼[𝑌 | 𝑋 = 𝒙]

In regression, the best possible regressor consists in systematically predicting the average value of 𝑌 given 𝑋 = 𝒙
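When 𝑃(𝑋, 𝑌) is known, both Bayes predictors can be computed directly; a small sketch on a made-up discrete joint distribution (all numbers hypothetical):

import numpy as np

# Hypothetical joint distribution P(X, Y) with X in {0, 1, 2} and Y in {0, 1}.
P = np.array([[0.10, 0.20],     # row x: [P(X=x, Y=0), P(X=x, Y=1)]
              [0.25, 0.05],
              [0.15, 0.25]])

def bayes_classifier(x):
    # zero-one loss: predict the most likely class given X = x
    return int(np.argmax(P[x]))

def bayes_regressor(x):
    # squared loss: predict E[Y | X = x]
    return P[x, 1] / P[x].sum()

for x in range(3):
    print(x, bayes_classifier(x), round(bayes_regressor(x), 3))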

Page 52: Consistency of Random Forests

Given a learning algorithm 𝒜 and a loss function L

φ_n = 𝒜(ℒ_n) is the predictor that 𝒜 learns from ℒ_n

Learning algorithm 𝒜 is consistent in L if and only if

Err(φ_n) ⟶ Err(φ_B) in probability as n ⟶ ∞

Page 53: Consistency of Random Forests

Random Forests are consistent, aren't they?

Page 54: Consistency of Random Forests

A randomized learning algorithm 𝒜 takes an extra random parameter Θ: the tree is T(·; Θ, ℒ_n) = 𝒜(Θ, ℒ_n)

Θ is used to sample the training set or to select the candidate directions or positions for splitting

Θ is independent of the dataset and thus unrelated to the particular problem

In some new variants of RF, Θ depends on the dataset

Generalized Random Forest

Page 55: Consistency of Random Forests

Bagging Procedure

Draw i.i.d. copies Θ_1, …, Θ_m of Θ

For each Θ_i, grow the randomized tree T(·; Θ_i, ℒ_n) = 𝒜(Θ_i, ℒ_n)

Aggregate: H_m(𝒙; ℒ_n) combines T(𝒙; Θ_1, ℒ_n), …, T(𝒙; Θ_m, ℒ_n) by averaging (regression) or voting (classification)

Generalized Random Forest

Page 56: Consistency of Random Forests

Consistency of Random Forests

Goal: lim_{n→∞} Err(H_m(·; ℒ_n)) = Err(φ_B)

Problem

- m is finite ⇒ the predictor depends on the particular trees that form the forest

- the structure of a tree depends on Θ_i and on the learning set ⟹ a finite forest is a subtle combination of randomness and data-dependent structure ⟹ finite-forest predictions can be difficult to interpret (random prediction or not)

- non-asymptotic rate of convergence

Challenges

Page 57: Consistency of Random Forests

Consistency of Random Forests

H_∞(𝒙; ℒ_n) = 𝔼_Θ[T(𝒙; Θ, ℒ_n)]

lim_{m→∞} H_m(𝒙; ℒ_n) = H_∞(𝒙; ℒ_n)

Problem

- is the infinite forest better than a finite forest?
- what is a good m? (rate of convergence)

Challenge

Page 58: Consistency of Random Forests

Review some recent results

Page 59: Consistency of Random Forests

Strength, Correlation and Err [Breiman, 2001]

For trees T(·; Θ_i, ℒ_n) grown with i.i.d. randomizations Θ_i, let s denote the strength of the set of classifiers and ρ̄ the mean correlation between them.

Theorem 2.3 An upper bound for the generalization error is given by: PE* ≤ ρ̄(1 − s²)/s²,

where ρ̄ is the mean value of the correlation and s is the strength of the set of classifiers.

Page 60: Consistency of Random Forests

RF and Adaptive Nearest Neighbors [Lin et al., 2006]

Each tree prediction can be written as a weighted average of the responses:

T(𝒙; Θ_i, ℒ_n) = Σ_{j=1}^{n} w_j(𝒙, Θ_i) y_j, with w_j(𝒙, Θ_i) = 𝕀{𝒙_j ∈ L(Θ_i, 𝒙)} / #{l: 𝒙_l ∈ L(Θ_i, 𝒙)},

where L(Θ_i, 𝒙) is the leaf of the i-th tree containing 𝒙.

The forest is therefore also a weighted average: H_∞(𝒙; ℒ_n) = Σ_{j=1}^{n} w_j(𝒙) y_j with w_j(𝒙) = 𝔼_Θ[w_j(𝒙, Θ)].

Non-adaptive if the weights w_j do not depend on the responses y_i of the learning set.
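A small sketch of this weight view (illustrative code; leaf_ids and leaf_of_x are assumed inputs giving, for each tree, the leaf index of every training point and of the query point 𝒙):

import numpy as np

def tree_weights(leaf_of_train, leaf_of_x):
    # w_j(x, Theta_i): 1/|leaf| for training points sharing the leaf of x, 0 otherwise
    in_leaf = (leaf_of_train == leaf_of_x)
    return in_leaf / in_leaf.sum()

def forest_prediction(leaf_ids, leaf_of_x, y):
    # leaf_ids: (m, n) leaf index of each training point in each tree
    # leaf_of_x: (m,) leaf index of the query point in each tree
    w = np.mean([tree_weights(leaf_ids[i], leaf_of_x[i])
                 for i in range(len(leaf_ids))], axis=0)
    return w @ y        # H(x) = sum_j w_j(x) y_j: an adaptive nearest-neighbour average

# Example with 2 trees and 4 training points:
# forest_prediction(np.array([[0, 0, 1, 1], [0, 1, 1, 0]]),
#                   np.array([0, 1]), np.array([1.0, 2.0, 3.0, 4.0]))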

Page 61: Consistency of Random Forests

RF and Adaptive Nearest Neighbors [Lin et al., 2006]

(Weighted-average view of the forest as on page 60.)

The terminal node size k should be made to increase with the sample size n. Therefore, growing large trees (k being a small constant) does not always give the best performance.

Page 62: Consistency of Random Forests

Biau et al., 2008

Given a learning set ℒ_n = {(𝒙_1, y_1), …, (𝒙_n, y_n)} of ℝ^d × {0, 1}

A binary classifier φ_n: ℝ^d ⟶ {0, 1} is trained from ℒ_n, with error Err(φ_n) = ℙ(φ_n(𝑋) ≠ 𝑌)

The Bayes classifier is φ_B(𝒙) = 𝕀{ℙ(𝑌 = 1 | 𝑋 = 𝒙) > 1/2}, with error Err(φ_B)

A sequence {φ_n} of classifiers is consistent for a certain distribution of (𝑋, 𝑌) if Err(φ_n) ⟶ Err(φ_B) in probability

Assume that the sequence {T_n} of randomized classifiers is consistent for a certain distribution of (𝑋, 𝑌). Then the voting classifier H_m (for any value of m) and the averaged classifier H_∞ are also consistent.

Page 63: Consistency of Random Forests

Biau et al., 2008

Growing Trees

A node A is randomly selected

The split feature j is selected uniformly at random from {1, …, d}

Finally, the selected node is split along the randomly chosen feature at a random location

* Recursive node splits do not depend on the labels y_1, …, y_n

Theorem 2 Assume that the distribution of 𝑋 is supported on [0, 1]^d. Then the purely random forest classifier H_∞ is consistent whenever k ⟶ ∞ and k/n ⟶ 0 as n ⟶ ∞.
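A minimal sketch of this purely random splitting scheme (my own illustrative code, not from the paper): splits never look at the labels, and the partition is grown to a prescribed number of leaves.

import numpy as np

rng = np.random.default_rng(0)

def purely_random_partition(n_leaves, d):
    # Recursively split [0,1]^d: pick a leaf, a feature and a split position uniformly at random.
    cells = [np.array([[0.0, 1.0]] * d)]             # each cell: d x 2 array of [low, high] bounds
    while len(cells) < n_leaves:
        cell = cells.pop(rng.integers(len(cells)))   # node selected at random
        j = rng.integers(d)                          # split feature, uniform on {0, ..., d-1}
        low, high = cell[j]
        t = rng.uniform(low, high)                   # split position, uniform inside the cell
        left, right = cell.copy(), cell.copy()
        left[j, 1], right[j, 0] = t, t
        cells += [left, right]
    return cells

def leaf_of(cells, x):
    # Index of the (first) cell containing x; labels play no role in building the partition.
    for i, c in enumerate(cells):
        if np.all((c[:, 0] <= x) & (x <= c[:, 1])):
            return i

# cells = purely_random_partition(n_leaves=8, d=2); leaf_of(cells, np.array([0.3, 0.7]))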

Page 64: Consistency of Random Forests

Biau et al., 2008

Growing Trees

Each base classifier T_Λ is built by 𝒜 on a random subsample of ℒ_n governed by the parameter q_n

Theorem 6 Let {T_Λ} be a sequence of classifiers that is consistent for the distribution of (𝑋, 𝑌). Consider the bagging classifiers H_m and H_∞, using parameter q_n. If nq_n ⟶ ∞ as n ⟶ ∞, then both classifiers are consistent.

Page 65: Consistency of Random Forests

Biau et al., 2012

Growing Trees

At each node, a coordinate is selected, with p_nj ∈ (0, 1) the probability that the j-th feature is chosen

The split is at the midpoint of the chosen side

Theorem 1 Assume that the distribution of 𝑋 has support on [0, 1]^d. Then the random forests estimate H_∞(𝒙; ℒ_n) is consistent whenever p_nj log k_n ⟶ ∞ for all j = 1, …, p and k_n/n ⟶ 0 as n ⟶ ∞.

Page 66: Consistency of Random Forests

Biau et al., 2012

In sparse settings

Assume that 𝑋 is uniformly distributed on [0,1]^p, with p_nj = (1/S)(1 + ξ_nj) for j ∈ 𝒮 (only the S features in 𝒮 are informative)

Estimation Error (variance)

𝔼{[H_∞(𝒙; ℒ_n) − H̃_∞(𝒙; ℒ_n)]²} ≤ C σ² [S²/(S − 1)]^{S/(2p)} · (1 + ξ_n) k_n / (n (log k_n)^{S/(2p)})

where H̃_∞ denotes the same forest computed on the noiseless responses φ_B(𝒙_i).

If a < p_nj < b for some constants a, b ∈ (0, 1), then the factor 1 + ξ_n is bounded in terms of S, a, b and p alone.

Page 67: Consistency of Random Forests

Biau et al., 2012

Assume that 𝑋 is uniformly distributed on [0,1]^p and that φ_B(𝒙_𝒮) is L-Lipschitz on [0,1]^S

p_nj = (1/S)(1 + ξ_nj) for j ∈ 𝒮

In sparse settings

Approximation Error (bias²)

𝔼[H̃_∞(𝒙; ℒ_n) − φ_B(𝒙)]² ≤ 2SL² k_n^{−(0.75/(S log 2))(1+γ_n)} + [sup_{𝒙∈[0,1]^p} φ_B²(𝒙)] e^{−n/2k_n}

where γ_n = min_{j∈𝒮} ξ_nj tends to 0 as n tends to infinity.

Page 68: Consistency of Random Forests

Finite and infinite RFs [Scornet, 2014]

Finite forest: H_m(𝒙; ℒ_n), built from the m trees T(𝒙; Θ_i, ℒ_n)

Infinite forest: H_∞(𝒙; ℒ_n) = 𝔼_Θ[T(𝒙; Θ, ℒ_n)]

Theorem 3.1 Conditionally on ℒ_n, almost surely, for all 𝒙 ∈ [0, 1]^p, we have: H_m(𝒙; ℒ_n) ⟶_{m→∞} H_∞(𝒙; ℒ_n).

Page 69: Consistency of Random Forests

Finite and infinite RFs [Scornet, 2014]

Assumption H One has 𝑌 = m(𝑋) + ε, where ε is a centered Gaussian noise with finite variance σ², independent of 𝑋, and ‖m‖_∞ = sup_{𝒙∈[0,1]^p} |m(𝒙)| < ∞.

Theorem 3.3 Assume H is satisfied. Then, for all m, n ∈ ℕ*,

Err(H_m(·; ℒ_n)) = Err(H_∞(·; ℒ_n)) + (1/m) 𝔼_{𝑋,ℒ_n}[𝕍_Θ[T(𝑋; Θ, ℒ_n)]]

⇒ if m ≥ 8(‖m‖_∞² + σ²)/ε + 32σ² log n / ε, then Err(H_m) − Err(H_∞) ≤ ε

0 ≤ Err(H_m) − Err(H_∞) ≤ (8/m)(‖m‖_∞² + σ²(1 + 4 log n))
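A small numerical illustration of the last bound, with made-up values for ‖m‖_∞, σ² and n (so the resulting number of trees is purely hypothetical):

import math

def min_trees(eps, m_sup, sigma2, n):
    # Smallest m with (8/m) * (m_sup**2 + sigma2 * (1 + 4*log n)) <= eps
    return math.ceil(8 * (m_sup**2 + sigma2 * (1 + 4 * math.log(n))) / eps)

print(min_trees(eps=0.01, m_sup=1.0, sigma2=0.25, n=10_000))  # trees needed so Err(H_m) - Err(H_inf) <= 0.01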

Page 70: Consistency of Random Forests

RF and Additive regression model [Scornet et al., 2015]

Growing Trees (each tree is built on a subsample of the data drawn without replacement)

Assume that A is the selected node and that A contains more than one observation

Select uniformly, without replacement, a subset ℳ_try ⊂ {1, …, p}, |ℳ_try| = m_try

Select the best split in A by optimizing the CART-split criterion along the coordinates in ℳ_try

Cut the cell A according to the best split. Call A_L and A_R the two resulting cells

Set A_L and A_R as new nodes and repeat

Page 71: Consistency of Random Forests

RF and Additive regression model [Scornet et al., 2015]

Assumption H1 𝑌 = Σ_{j=1}^{p} m_j(𝑋^{(j)}) + ε

Theorem 3.1 Assume that (H1) is satisfied. Then, provided a_n ⟶ ∞ and t_n(log a_n)⁹/a_n ⟶ 0, the random forest estimate is consistent.

Theorem 3.2 Assume that (H1) and (H2) are satisfied and let t_n = a_n. Then, provided a_n ⟶ ∞, t_n ⟶ ∞ and a_n log n/n ⟶ 0, the random forest estimate is consistent.

Page 72: Consistency of Random Forests

RF and Additive regression model [Scornet et al., 2015]

Denote by j_{1,n}(𝑋), …, j_{k,n}(𝑋) the first k cut directions used to construct the cell containing 𝑋, with j_{q,n}(𝑋) = ∞ if the cell has been cut strictly less than q times.

Theorem 3.2 Assume that (H1) is satisfied. Let k ∈ ℕ* and ξ > 0. Assume that there is no interval [a, b] and no j ∈ {1, …, S} such that m_j is constant on [a, b]. Then, with probability 1 − ξ, for all n large enough, we have, for all 1 ≤ q ≤ k, j_{q,n}(𝑋) ∈ {1, …, S}.

Page 73: Consistency of Random Forests

[Wager, 2015]

A partition Λ is (α, k)-valid if it can be generated by a recursive partitioning scheme in which each child node contains at least a fraction α of the data points in its parent node, for some 0 < α < 0.5, and each terminal node contains at least k training examples, for some k ∈ ℕ.

Given a dataset 𝑋, let 𝒱_{α,k}(𝑋) denote the set of (α, k)-valid partitions.

T_Λ: [0,1]^p → ℝ, T_Λ(𝒙) = (1/|{𝒙_i: 𝒙_i ∈ L(𝒙)}|) Σ_{i: 𝒙_i ∈ L(𝒙)} y_i (called a valid tree)

T*_Λ: [0,1]^p → ℝ, T*_Λ(𝒙) = 𝔼[𝑌 | 𝑋 ∈ L(𝒙)] (called the partition-optimal tree)

Question: can we treat T_Λ as a good approximation to T*_Λ supported on the partition Λ?

Page 74: Consistency of Random Forests

[Wager, 2015]

Given a learning set ℒ_n of [0,1]^p × [−M/2, M/2] with 𝑋 ~ U([0,1]^p)

Theorem 1 Given parameters n, p, k such that

lim_{n→∞} (log n · log p)/k = 0 and p = Ω(n),

then

lim_{n,p,k→∞} ℙ[ sup_{𝒙∈[0,1]^p, Λ∈𝒱_{α,k}} |T_Λ(𝒙) − T*_Λ(𝒙)| ≤ 6M √( (log n · log p) / (k log((1 − α)^{−1})) ) ] = 1

Page 75: Consistency of Random Forests

[Wager, 2015]

Growing Trees (guess-and-check)

Select a currently un-split node A containing at least 2k training examples

Pick a candidate splitting variable j ∈ {1, …, p} uniformly at random

Pick the splitting point θ̂ that minimizes the squared error, with criterion value ℓ(θ̂)

If either there has already been a successful split along variable j for some other node, or

ℓ(θ̂) ≥ 36M² log n · log p / (k log((1 − α)^{−1})),

then the split succeeds and we cut the node A at θ̂ along the j-th variable; if not, we do not split the node A this time.
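A sketch of this acceptance test (my reading of the rule; gain is assumed to hold the criterion value ℓ(θ̂) for the best cut, and all parameter values in the example are hypothetical):

import math

def split_succeeds(gain, j, used_variables, n, p, k, M, alpha=0.25):
    # Guess-and-check rule: accept the cut if variable j already succeeded at another node,
    # or if the criterion clears the adaptive-concentration threshold.
    threshold = 36 * M**2 * math.log(n) * math.log(p) / (k * math.log(1 / (1 - alpha)))
    return (j in used_variables) or (gain >= threshold)

# split_succeeds(gain=0.8, j=3, used_variables={1, 5}, n=10_000, p=200, k=400, M=1.0)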

Page 76: Consistency of Random Forests

[Wager, 2015] In sparse settings

Assumption H1 There exist a set 𝒮 of signal features, a constant β > 0, and a set of sign variables σ_j ∈ {±1} such that, for all j ∈ 𝒮 and all 𝒙 ∈ [0,1]^p,

𝔼[𝑌 | 𝑋_{−j} = 𝒙_{−j}, 𝑋_j > 1/2] − 𝔼[𝑌 | 𝑋_{−j} = 𝒙_{−j}, 𝑋_j ≤ 1/2] ≥ β σ_j

Assumption H2 The regression function is Lipschitz-continuous.

Page 77: Consistency of Random Forests

[Wager, 2015] In sparse settings

(Assumptions H1 and H2 as on page 76.)

Theorem 2 Under the conditions of Theorem 1, suppose that the assumptions in the sparse setting hold; then the guess-and-check forest is consistent.

Page 78: Consistency of Random Forests

[Wager, 2015]

H_{Λ_1^B}: [0,1]^p → ℝ, H_{Λ_1^B}(𝒙) = (1/B) Σ_{b=1}^{B} T_{Λ_b}(𝒙) (called a valid forest)

H*_{Λ_1^B}: [0,1]^p → ℝ, H*_{Λ_1^B}(𝒙) = (1/B) Σ_{b=1}^{B} T*_{Λ_b}(𝒙) (called the partition-optimal forest)

Theorem 4

lim_{n,p,k→∞} ℙ[ sup_{H∈ℋ_{α,k}} | (1/n) Σ_{i=1}^{n} (y_i − H(𝒙_i))² − 𝔼[(𝑌 − H(𝑋))²] | ≤ 11M² √( (log n · log p) / (k log((1 − α)^{−1})) ) ] = 1

Page 79: Consistency of Random Forests

References

B. Efron. Estimation and accuracy after model selection. Journal of the American Statistical Association, 2013.

B. Lakshminarayanan et al. Mondrian forests: Efficient online random forests. arXiv:1406.2673, 2014.

B. Xu et al. Classifying very high-dimensional data with random forests built from small subspaces. International Journal of Data Warehousing and Mining (IJDWM), 8(2):44–63, 2012.

D. Amaratunga et al. Enriched random forests. Bioinformatics, 24(18):2010–2014, 2008. doi:10.1093/bioinformatics/btn356

E. Scornet. On the asymptotics of random forests. arXiv:1409.2090, 2014.

E. Scornet et al. Consistency of random forests. The Annals of Statistics, 43(4):1716–1741, 2015. doi:10.1214/15-AOS1321. http://projecteuclid.org/euclid.aos/1434546220

G. Biau et al. Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research, 9:2015–2033, 2008.

G. Biau. Analysis of a random forests model. Journal of Machine Learning Research, 13:1063–1095, 2012.

H. Deng et al. Feature selection via regularized trees. The 2012 International Joint Conference on Neural Networks (IJCNN), IEEE, 2012.

H. Deng et al. Gene selection with guided regularized random forest. Pattern Recognition, 46(12):3483–3489, 2013.

H. Ishwaran et al. Random survival forests. The Annals of Applied Statistics, 2:841–860, 2008.

L. Breiman. Bagging predictors. Technical Report No. 421, Statistics Department, UC Berkeley, 1994.

Page 80: Consistency of Random Forests

References

L. Breiman. Randomizing outputs to increase prediction accuracy. Technical Report 518, Statistics Department, UC Berkeley, 1998.

L. Breiman. Some infinite theory for predictor ensembles. Technical Report 577, Statistics Department, UC Berkeley, 2000.

L. Breiman. Random forests. Machine Learning, 45:5–32, 2001.

M. Denil et al. Consistency of online random forests. International Conference on Machine Learning (ICML), 2013.

N. Meinshausen. Quantile regression forests. Journal of Machine Learning Research, 7:983–999, 2006.

P. Geurts et al. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006.

Q. Wu et al. SNP selection and classification of genome-wide SNP data using stratified sampling random forests. IEEE Transactions on NanoBioscience, 11(3):216–227, 2012.

T. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting and randomization. Machine Learning, 1–22, 1998.

T.K. Ho. Random decision forests. Proceedings of the Third International Conference on Document Analysis and Recognition, 1:278–282, 14–16 Aug 1995. doi:10.1109/ICDAR.1995.598994

T.K. Ho. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8):832–844, 1998.

Saïp Ciss. Random Uniform Forests. 2015. <hal-01104340v2>

S. Clémençon et al. Ranking forests. Journal of Machine Learning Research, 14:39–73, 2013.

Page 81: Consistency of Random Forests

References

S. Wager. Asymptotic theory for random forests. arXiv:1405.0352, 2014.

S. Wager. Uniform convergence of random forests via adaptive concentration. arXiv:1503.06388, 2015.

Y. Lin and Y. Jeon. Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101:578–590, 2006.

V. N. Vapnik. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5):988–999, 1999.

Page 82: Consistency of Random Forests

RF and Additive regression model [Scornet et al., 2015]

The indicator that 𝑿′ falls in the same cell as 𝑿 in the random tree designed with ℒ_n and the random parameter Θ, where 𝑿′ is an independent copy of 𝑿, independent of Θ, 𝑋_1, …, 𝑋_n.

Page 83: Consistency of Random Forests

RF and Additive regression model [Scornet et al., 2015]