
Confidence Sets for Split Points in Decision Trees

Moulinath Banerjee, University of Michigan

Ian W. McKeague, Florida State University

December 19, 2003

Abstract

Bühlmann and Yu (2002) recently found the limiting distribution of the least squares estimator of split points in decision trees (CART). Their result is an instance of cube-root asymptotics with a non-normal limit, and raises the interesting question of how to construct confidence sets for split points. The bootstrap fails in this setting, and standard (Wald type) confidence intervals are found to be inaccurate. The present paper introduces a confidence set constructed by inverting a centered residual sum of squares statistic. A motivation for developing such a confidence set comes from the problem of phosphorus pollution in the Everglades. Ecologists have suggested that split points provide a phosphorus threshold at which biological imbalance occurs, and the lower endpoint of the confidence set may be interpreted as a level that is protective of the ecosystem. This is illustrated using data from a Duke University Wetlands Center phosphorus dosing study in the Everglades.

Key words and phrases: CART, change-point estimation, cube-root asymptotics, hazard function, Nelson–Aalen estimator, nonparametric regression.

1 Introduction

It has been about twenty years since decision trees (CART) came into widespread use for obtaining simple predictive rules for the classification of complex data. For each predictor variable $X$ in a (binary) regression tree analysis, the predicted response splits according to whether $x \le d$ or $x > d$, for some split point $d$. Although the rationale behind CART is primarily statistical, the split point can be

important in its own right, and in some applications it represents a parameter of real scientific interest. For example, split points have been interpreted as thresholds for the presence of environmental damage in the development of pollution control standards. In a recent study [5] of the effects of phosphorus pollution in the Everglades, split points are used in a novel way to identify threshold levels of phosphorus concentration that are associated with declines in the abundance of certain species. The present paper introduces an approach to finding accurate confidence sets for such split points.

The split point represents the best approximation of a binary decision tree (piecewise constant function with a single jump) to the regression curve $f(x) = E(Y \mid X = x)$, where $Y$ is the response. The limiting distribution of the least squares estimator $\hat d_n$ of the split point has recently been obtained by Bühlmann and Yu [2]. Their result involves a cube-root rate of convergence (instead of the usual $\sqrt{n}$-rate), a non-normal limit, and is developed in a homoscedastic nonparametric regression framework in which the regression curve is assumed to be a smooth function of $x$. Efron's bootstrap fails for $\hat d_n$, but standard (Wald type) confidence intervals of the form $\hat d_n \pm n^{-1/3}\,\hat c\, q_{\alpha/2}$ are still available, with plug-in estimates of various parameters in the limiting distribution. Currently, however, this is the only known way to obtain confidence sets for split points.

In the present paper we further develop the split point estimation methods initiated in [2]. Specifically, we investigate the following questions:

1. Can the assumption of homoscedasticity be dropped? We find that this is the case, under some mild conditions on the conditional variance function.

2. Are the standard Wald type confidence intervals accurate in the sense that their nominal coverage agrees with actual coverage? We find that they are grossly inaccurate, even at large sample sizes.

3. Is there a better way to produce confidence sets for split points? We introduce a centered residual sum of squares statistic and find that it has much greater stability than $\hat d_n$. This statistic is inverted to provide substantially more accurate confidence sets (at moderate sample sizes) than available using the standard Wald type confidence interval.

4. Are there other statistical models beyond nonparametric regression in which confidence sets for split points can be developed? We study how to obtain confidence sets for split points of hazard functions in the setting of right-censored survival data.

A more developed approach to the estimation of jump points is change-point analysis (see, e.g., [9–11], [14–16]) in which the aim is to estimate the locations of abrupt changes (jump discontinuities) in an otherwise smooth curve. However, in contrast to split point estimation, there is no distinction between the "working" model that has the jump point and the regression model that generates the data. The rate of convergence of change-point estimators is close to $n$, much faster than the $n^{1/3}$ rate for split point estimators; this is hardly surprising because split point estimation depends on local features of a smooth regression curve, which are harder to estimate than jumps.

Split point estimation is model robust in the sense that the confidence sets still apply under misspecification of the working model. Also, as we show, the working model of a piecewise constant function with a single jump can be naturally extended to allow a smooth parametric curve to the left of the jump and a smooth parametric curve of the same form to the right of the jump. A model of this type is a two-phase linear regression, which has been found useful, e.g., in change-point analysis for climate data [12].

Split point analysis can thus be seen as a complementary approach to change-point analysis; it is more appropriate in applications (such as [5]) in which the underlying curve is thought to be smooth, and does not require the existence of a jump discontinuity except in the working model, which is simply used as a device to define interesting structure. While the statistical literature on change-point analysis is comprehensive and vast, split point analysis has been quite neglected and is in need of further development. In particular, the problem of obtaining accurate confidence sets for split points has not yet been studied.

The paper is organized as follows. In Section 2 we consider the nonparametric regression setting, and in Section 3 the survival analysis setting. Simulation results and an application to Everglades data are presented in Section 4. Proofs of all the results are collected in Section 5. In Section 6 we discuss an extension of our procedure to decision trees that incorporate general parametric working models.

2 Split point estimation in nonparametric regression

In this section we study the problem of estimating the split point in a binary decision tree for nonparametric regression.

Let $X, Y$ denote the (one-dimensional) predictor and response variables, respectively. The nonparametric regression function $f(x) = E(Y \mid X = x)$ is to be approximated using a decision tree with a single (terminal) node, i.e., a piecewise constant function with a single jump. The predictor $X$ is assumed to vary in a compact interval $[a, b]$ and to have a density $p_X(\cdot)$. For convenience, we adopt the usual representation $Y = f(X) + \epsilon$, with the error $\epsilon = Y - E(Y \mid X)$ having zero conditional mean given $X$. The conditional (Lebesgue) density of $\epsilon$ given $X = x$ is denoted by $p_\epsilon(\cdot \mid x)$, and the conditional variance of $\epsilon$ given $X = x$ is denoted $\sigma^2(x)$.

Suppose we have $n$ i.i.d. observations $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ of $(X, Y)$. Consider the working model in which $f$ is treated as a piecewise constant function with a single jump, with parameters $(\beta_l, \beta_u, d)$, where $d$ is the point at which the function jumps, $\beta_l$ is the value to the left of the jump and $\beta_u$ is the value to the right of the jump. Best projected values are then defined by
\[
(\beta_l^0, \beta_u^0, d^0) = \operatorname*{argmin}_{\beta_l, \beta_u, d} E\big[ Y - \beta_l\, 1(X \le d) - \beta_u\, 1(X > d) \big]^2. \tag{2.1}
\]
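As a quick numerical illustration of (2.1) (ours, not from the paper), the following sketch computes the best projected values when $f(x) = x$ and $X \sim \mathrm{Uniform}[0,1]$, the design used later in Section 4. For a fixed split $d$ the optimal levels are the conditional means $\beta_l = E(Y \mid X \le d) = d/2$ and $\beta_u = E(Y \mid X > d) = (1+d)/2$, so only a one-dimensional search over $d$ remains; the grid sizes are arbitrary choices.

```python
import numpy as np

# Illustration (ours): best projected values (2.1) for f(x) = x, X ~ U[0,1].
def projected_criterion(d, xs):
    beta_l, beta_u = d / 2.0, (1.0 + d) / 2.0
    stump = np.where(xs <= d, beta_l, beta_u)
    return np.mean((xs - stump) ** 2)  # approximates E[f(X) - stump(X)]^2

xs = np.linspace(0.0, 1.0, 100001)    # integration grid for U[0,1]
ds = np.linspace(0.01, 0.99, 981)     # candidate split points
d0 = ds[np.argmin([projected_criterion(d, xs) for d in ds])]
print(round(d0, 3))                   # ~0.5, with beta_l^0 = 0.25, beta_u^0 = 0.75
```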

Before proceeding further, we impose some mild conditions.

Conditions

(A1) There is a unique vector $(\beta_l^0, \beta_u^0, d^0)$, with $\beta_l^0 \ne \beta_u^0$ and $a < d^0 < b$, that minimizes the right side of (2.1).

(A2) $f(x)$ is continuous and is continuously differentiable in an open neighborhood $N$ of $d^0$. Also, $f'(d^0) \ne 0$.

(A3) $p_X(x)$ does not vanish and is continuously differentiable on $N$.

(A4) $\sigma^2(x)$ is continuous on $N$.

(A5) For some $\rho > 0$, $|s|^{3+\rho} \sup_x p_\epsilon(s \mid x) \to 0$ as $|s| \to \infty$.

The vector $(\beta_l^0, \beta_u^0, d^0)$ then satisfies the normal equations
\[
\beta_l^0 = E(Y \mid X \le d^0), \qquad \beta_u^0 = E(Y \mid X > d^0), \qquad f(d^0) = \frac{\beta_l^0 + \beta_u^0}{2}.
\]
Estimates of these quantities are obtained using least squares as
\[
(\hat\beta_l, \hat\beta_u, \hat d_n) = \operatorname*{argmin}_{\beta_l, \beta_u, d}\, \frac1n \sum_{i=1}^n \big( Y_i - \beta_l\, 1(X_i \le d) - \beta_u\, 1(X_i > d) \big)^2.
\]
The goal is to estimate the (unique) split point $d^0$.

We now introduce the basic idea used to construct our proposed confidence interval for $d^0$. Consider testing the null hypothesis that the best projected value $d^0$ is located at some point $d$. The best piecewise constant approximation to $f$ under this null hypothesis is estimated by $(\hat\beta_l^d, \hat\beta_u^d, d)$, where
\[
(\hat\beta_l^d, \hat\beta_u^d) = \operatorname*{argmin}_{\beta_l, \beta_u}\, \frac1n \sum_{i=1}^n \big( Y_i - \beta_l\, 1(X_i \le d) - \beta_u\, 1(X_i > d) \big)^2.
\]
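For concreteness, here is a minimal sketch (ours, not part of the paper) of these least squares computations. For a fixed $d$ the optimal levels are the sample means on either side of the split, and $\hat d_n$ is found by searching over candidate splits at the observed covariate values; distinct covariate values are an implementation assumption.

```python
import numpy as np

# Minimal sketch (ours) of the stump least squares fit.
def stump_fit(x, y, d):
    """Return (beta_l, beta_u, rss) for a fixed split point d."""
    left = x <= d
    beta_l = y[left].mean()
    beta_u = y[~left].mean()
    resid = y - np.where(left, beta_l, beta_u)
    return beta_l, beta_u, float(np.sum(resid ** 2))

def split_point_estimate(x, y):
    """Least squares estimates (beta_l_hat, beta_u_hat, d_hat)."""
    # The criterion is piecewise constant in d, so candidate splits at the
    # order statistics suffice; trimming the extremes keeps both sides nonempty.
    candidates = np.sort(x)[1:-1]
    fits = [stump_fit(x, y, d) + (d,) for d in candidates]
    beta_l, beta_u, _, d_hat = min(fits, key=lambda f: f[2])
    return beta_l, beta_u, d_hat
```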


A reasonable test statistic is the (centered) residual sum of squares defined by
\[
\mathrm{RSS}_n(d) = \sum_{i=1}^n \big( Y_i - \hat\beta_l^d\, 1(X_i \le d) - \hat\beta_u^d\, 1(X_i > d) \big)^2 - \sum_{i=1}^n \big( Y_i - \hat\beta_l\, 1(X_i \le \hat d_n) - \hat\beta_u\, 1(X_i > \hat d_n) \big)^2.
\]
If the working model for $f$ is true, and the errors are Gaussian with constant variance, then up to constants, $\mathrm{RSS}_n(d)$ is a likelihood ratio statistic, but in general it has no such interpretation. The key result below provides the joint asymptotic distribution of $\hat d_n$ and $\mathrm{RSS}_n(d^0)$.

Theorem 2.1 Suppose conditions (A1)–(A5) hold. Then
\[
\big( n^{1/3} (\hat d_n - d^0),\, n^{-1/3}\mathrm{RSS}_n(d^0) \big) \to_d \big( \operatorname*{argmax}_t Q(t),\, 2\,|\beta_l^0 - \beta_u^0| \max_t Q(t) \big),
\]
where $Q(t) = a W(t) - b t^2$. Here $W(t)$ is standard two-sided Brownian motion started from 0 and $a, b$ are positive constants given by
\[
a^2 = p_X(d^0)\, \sigma^2(d^0), \qquad b = \tfrac12\, p_X(d^0)\, |f'(d^0)|.
\]

Remark: Denoting $Q(t) = Q_{a,b}(t)$, it can be shown using Brownian scaling (see, e.g., [1]) that
\[
Q_{a,b}(t) \stackrel{d}{=} a\, (a/b)^{1/3}\, Q_{1,1}\big( (b/a)^{2/3}\, t \big),
\]
so the limit in the above theorem can be expressed more simply as
\[
\Big( (a/b)^{2/3} \operatorname*{argmax}_t Q_{1,1}(t),\ 2\,|\beta_l^0 - \beta_u^0|\, a\, (a/b)^{1/3} \max_t Q_{1,1}(t) \Big).
\]

From Theorem 2.1 and the above remark, the standard (Wald type) confidence set for $d^0$ is seen to be $\hat d_n \pm n^{-1/3}\, (\hat a / \hat b)^{2/3}\, q_{\alpha/2}$, where $q_\alpha$ is the upper $\alpha$-quantile of $\operatorname*{argmax}_t Q_{1,1}(t)$ and $\hat a$, $\hat b$ are suitable estimates (discussed below) of the constants $a$, $b$. The proposed RSS type confidence set for $d^0$ is found by inverting the hypothesis test:
\[
\Big\{ d : \mathrm{RSS}_n(d) \le n^{1/3}\, 2\,|\hat\beta_l - \hat\beta_u|\, \hat a\, (\hat a / \hat b)^{1/3}\, r_\alpha \Big\},
\]
where $r_\alpha$ is the upper $\alpha$-quantile of $\max_t Q_{1,1}(t)$.
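The quantiles $q_\alpha$ and $r_\alpha$ have no simple closed form, so the following sketch (ours, not part of the paper) approximates them by Monte Carlo, simulating $Q_{1,1}(t) = W(t) - t^2$ on a truncated grid, and then inverts $\mathrm{RSS}_n(d)$ over candidate splits. The truncation level, grid sizes, and the helpers stump_fit and split_point_estimate (from the earlier sketch) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def q11_quantiles(alpha=0.05, reps=5000, T=4.0, m=1000):
    """Monte Carlo upper quantiles of |argmax_t Q(t)| and max_t Q(t),
    Q(t) = W(t) - t^2, with W truncated to [-T, T] (illustrative choice)."""
    dt = T / m
    t = np.concatenate([-dt * np.arange(m, 0, -1), [0.0], dt * np.arange(1, m + 1)])
    maxima, argabs = np.empty(reps), np.empty(reps)
    for r in range(reps):
        left = np.cumsum(rng.normal(0.0, np.sqrt(dt), m))[::-1]   # W on [-T, 0)
        right = np.cumsum(rng.normal(0.0, np.sqrt(dt), m))        # W on (0, T]
        q = np.concatenate([left, [0.0], right]) - t ** 2
        k = int(np.argmax(q))
        maxima[r], argabs[r] = q[k], abs(t[k])
    # |argmax| quantile gives the +/- half-width of a level 1-alpha Wald CI
    return np.quantile(argabs, 1 - alpha), np.quantile(maxima, 1 - alpha)

def rss_confidence_set(x, y, a_hat, b_hat, alpha=0.05):
    """Invert RSS_n(d) as in the text, keeping every candidate split d with
    RSS_n(d) <= n^{1/3} 2|beta_l - beta_u| a (a/b)^{1/3} r_alpha."""
    n = len(x)
    beta_l, beta_u, d_hat = split_point_estimate(x, y)   # sketch above
    rss_min = stump_fit(x, y, d_hat)[2]
    _, r_alpha = q11_quantiles(alpha)
    cut = n ** (1 / 3) * 2 * abs(beta_l - beta_u) * a_hat * (a_hat / b_hat) ** (1 / 3) * r_alpha
    keep = [d for d in np.sort(x)[1:-1] if stump_fit(x, y, d)[2] - rss_min <= cut]
    return min(keep), max(keep)   # report the range of the (typically interval) set
```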

3 Split point estimation for a hazard function

In this section we discuss a version of our approach in the previous section in the context of hazard function estimation. We consider the classical case of right censored survival data. A split point in this setting could represent, for example, the age at which a diagnostic test is to be recommended based on a decision tree analysis of the age-specific hazard rate of a certain disease.

Let $T$ be the survival time of interest with continuous distribution function $F$ and smooth instantaneous hazard function $\lambda(t)$, let $C$ be a censoring time which is independent of $T$ with absolutely continuous distribution function $G$, and let $X = T \wedge C$ and $\delta = 1(T \le C)$. Suppose we have $n$ i.i.d.

observations $(X_1, \delta_1), \ldots, (X_n, \delta_n)$ of $(X, \delta)$. A binary tree approximation to $\lambda(t)$ is identified by the 3-vector $(\beta_l, \beta_u, d)$ where $d$ is the jump (or split) point, $\beta_l$ is the value to the left of the jump and $\beta_u$ is the value to the right of the jump. Best projected values are defined by
\[
(\beta_l^0, \beta_u^0, d^0) = \operatorname*{argmin}_{\beta_l, \beta_u, d} \int_0^\tau \big[ \lambda(t) - \bar\lambda(t; \beta_l, \beta_u, d) \big]^2\, dt, \tag{3.2}
\]
where
\[
\bar\lambda(t; \beta_l, \beta_u, d) = \beta_l\, 1(t \le d) + \beta_u\, 1(t > d)
\]
is the working model for $\lambda(t)$ and $\tau > 0$ should be thought of as the time when monitoring of individuals for the disease is stopped. Before proceeding further, once again we impose some mild conditions.

Conditions

(B1) There is a unique vector $(\beta_l^0, \beta_u^0, d^0)$ with $\beta_l^0 \ne \beta_u^0$ and $0 < d^0 < \tau$ that minimizes the right side of (3.2).

(B2) $\lambda$ is continuously differentiable in a neighborhood of $d^0$, and $\lambda'(d^0) \ne 0$.

(B3) $\bar F(\tau) > 0$ and $\bar G(\tau) > 0$, where $\bar F$ and $\bar G$ are the survival functions of $T$ and $C$, respectively.

The normal equations are obtained by setting the partial derivatives of the right side of (3.2) with respect to $(\beta_l, \beta_u, d)$ equal to 0. This gives
\[
\beta_l^0 = \frac{\Lambda(d^0)}{d^0}, \qquad \beta_u^0 = \frac{\Lambda(\tau) - \Lambda(d^0)}{\tau - d^0}, \qquad \lambda(d^0) = \frac{\beta_l^0 + \beta_u^0}{2},
\]
where $\Lambda(t) = \int_0^t \lambda(s)\, ds$ is the cumulative hazard function.

Expanding the integral in (3.2) and discarding the terms not involving $(\beta_l, \beta_u, d)$ shows that $(\beta_l^0, \beta_u^0, d^0) = \operatorname*{argmin}_{\beta_l, \beta_u, d}\, g(\beta_l, \beta_u, d)$, where
\[
g(\beta_l, \beta_u, d) = (\beta_l^2 - \beta_u^2)\, d + \beta_u^2\, \tau + 2 (\beta_u - \beta_l)\, \Lambda(d) - 2 \beta_u\, \Lambda(\tau).
\]
Estimates of $(\beta_l^0, \beta_u^0, d^0)$ are obtained by replacing the unknown $\Lambda(t)$ in the above expression by the Nelson–Aalen estimator $\hat\Lambda_n(t)$ and then minimizing the resulting quantity with respect to $(\beta_l, \beta_u, d)$. Thus
\[
(\hat\beta_l, \hat\beta_u, \hat d_n) = \operatorname*{argmin}_{\beta_l, \beta_u, d}\, g_n(\beta_l, \beta_u, d),
\]
where
\[
g_n(\beta_l, \beta_u, d) = (\beta_l^2 - \beta_u^2)\, d + \beta_u^2\, \tau + 2 (\beta_u - \beta_l)\, \hat\Lambda_n(d) - 2 \beta_u\, \hat\Lambda_n(\tau)
\]
and
\[
\hat\Lambda_n(t) = \frac1n \sum_{i=1}^n \frac{\delta_i\, 1(X_i \le t)}{1 - H_n(X_i-)},
\]
where $H_n$ is the empirical distribution function of the $X_i$'s.
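A minimal sketch (ours, not part of the paper) of $\hat\Lambda_n$ in exactly this form follows; note that $1 - H_n(s-)$ is the at-risk proportion, the fraction of observations with $X_j \ge s$.

```python
import numpy as np

# Sketch (ours) of the Nelson-Aalen estimator as written in the text.
def nelson_aalen(x, delta):
    """Return a function t -> Lambda_hat_n(t)."""
    x = np.asarray(x, dtype=float)
    delta = np.asarray(delta, dtype=float)
    n = len(x)
    at_risk = np.array([(x >= xi).sum() for xi in x]) / n  # 1 - H_n(X_i-)
    jumps = delta / (n * at_risk)
    def Lambda_hat(t):
        return jumps[x <= t].sum()
    return Lambda_hat
```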

Now consider testing the null hypothesis that the optimal split point $d^0$ actually attains the value $d$. Interpreting $g_n(\beta_l, \beta_u, d)$ as a residual sum of squares (since, up to a constant, it estimates the integrated squared discrepancy of the stump based on $(\beta_l, \beta_u, d)$ from the true instantaneous hazard over the domain

of interest), we can introduce a centered residual sum of squares statistic, as in the regression setting. Formally define
\[
\mathrm{RSS}_n(d) = n \big( g_n(\beta_l^d, \beta_u^d, d) - g_n(\hat\beta_l, \hat\beta_u, \hat d_n) \big),
\]
where, for a fixed $d$, $(\beta_l^d, \beta_u^d)$ is the minimizer of $g_n(\beta_l, \beta_u, d)$ over all $(\beta_l, \beta_u)$. For $0 < d < \tau$,
\[
\beta_l^d = \frac{\hat\Lambda_n(d)}{d}, \qquad \beta_u^d = \frac{\hat\Lambda_n(\tau) - \hat\Lambda_n(d)}{\tau - d}.
\]

The null hypothesis would be rejected for large values of $\mathrm{RSS}_n(d)$.

We have the following key theorem that gives the joint distribution of the estimated split point and the residual sum of squares statistic.

Theorem 3.1 Suppose conditions (B1)–(B3) hold. Then
\[
\big( n^{1/3} (\hat d_n - d^0),\, n^{-1/3}\mathrm{RSS}_n(d^0) \big) \to_d \big( \operatorname*{argmax}_t Q(t),\, 2\,|\beta_l^0 - \beta_u^0| \max_t Q(t) \big),
\]
where $Q(t) = a W(t) - b t^2$ for positive constants $a$ and $b$ given by
\[
a^2 = \frac{\lambda(d^0)}{\bar F(d^0)\, \bar G(d^0)} \quad \text{and} \quad b = \tfrac12\, |\lambda'(d^0)|.
\]

The numerator in the expression for $a^2$ can be replaced by $(\beta_l^0 + \beta_u^0)/2$, which is easily estimated. Given estimates of $a, b$ (discussed below), the two types of confidence sets have the same form as in the nonparametric regression case.
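To make the hazard-setting estimator concrete, here is a sketch (ours, not part of the paper) that profiles out $(\beta_l, \beta_u)$ for each candidate $d$ and minimizes $g_n$. It reuses the nelson_aalen helper sketched above, and the grid over $d$ is an illustrative choice.

```python
import numpy as np

# Sketch (ours): split point estimate in the hazard setting by grid search.
def hazard_split_estimate(x, delta, tau, grid_size=400):
    Lam = nelson_aalen(x, delta)        # from the sketch above
    Lam_tau = Lam(tau)
    best = None
    for d in np.linspace(tau / grid_size, tau * (1 - 1 / grid_size), grid_size):
        bl, bu = Lam(d) / d, (Lam_tau - Lam(d)) / (tau - d)   # profiled levels
        g = (bl**2 - bu**2) * d + bu**2 * tau + 2*(bu - bl)*Lam(d) - 2*bu*Lam_tau
        if best is None or g < best[0]:
            best = (g, bl, bu, d)
    _, beta_l, beta_u, d_hat = best
    return beta_l, beta_u, d_hat
```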

4 Numerical examples

In this section we study examples with simulated data and an application to the Everglades data mentioned in the Introduction.

In the nonparametric regression setting, estimates of $f'(d^0)$, $p_X(d^0)$ and $\sigma^2(d^0)$ for estimating the constants $a$ and $b$ are found using David Ruppert's (Matlab) implementation of local polynomial regression and density estimation with empirical-bias bandwidth selection [7]. This provides a reliable way of estimating these nuisance parameters by plugging in the split point estimator. We use the default values of the various tuning parameters specified by Ruppert.
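The paper relies on Ruppert's Matlab routines; as a rough stand-in (our assumption, not the authors' implementation), the following sketch plugs $\hat d_n$ into a kernel density estimate and a crude local linear fit. The window width h is an illustrative fixed choice, not a data-driven bandwidth.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Rough plug-in sketch (ours) for the nuisance constants of Theorem 2.1.
def nuisance_estimates(x, y, d_hat, h=0.1):
    p_hat = gaussian_kde(x)(d_hat)[0]              # estimate of p_X(d_hat)
    w = np.abs(x - d_hat) <= h                     # local window (assumed width)
    slope, intercept = np.polyfit(x[w], y[w], 1)   # local linear fit -> f'(d_hat)
    resid = y[w] - (slope * x[w] + intercept)
    sigma2_hat = resid.var()                       # local residual variance
    a_hat = np.sqrt(p_hat * sigma2_hat)            # a^2 = p_X(d0) sigma^2(d0)
    b_hat = 0.5 * p_hat * abs(slope)               # b = p_X(d0)|f'(d0)|/2
    return a_hat, b_hat
```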

For the nonparametric regression simulation study we specify $f(x) = x$, $\sigma^2(x) = .25$, $X \sim \mathrm{Uniform}[0, 1]$, $\epsilon \sim N(0, .25)$ with $X$ and $\epsilon$ independent, and where the split point $d^0 = 1/2$. Figure 1 shows that the 95% CIs based on RSS are quite accurate, even with sample sizes as small as 50, but that the Wald type CIs are grossly inaccurate and completely misleading. For $n = 1000$, we found the Wald CI had coverage of around 77%, compared with 96% for the RSS based CI. For the simulation results in Figure 2, we took $\sigma^2(x) = \exp(-2.8 x)$; the 2.8 is chosen so that $\sigma^2(.5) \approx .25$, making it consistent with the constant variance case. The coverage percentages for the RSS type CI are almost the same as before, but the Wald type CI does noticeably worse.
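A sketch (ours, not the authors' code) of the simulation loop behind these coverage figures: generate data from the stated design and record how often a confidence set covers $d^0 = 1/2$. The argument confidence_set stands for either the Wald or the RSS construction, e.g. the rss_confidence_set sketched in Section 2.

```python
import numpy as np

rng = np.random.default_rng(1)

# Sketch (ours) of the coverage experiment: f(x) = x, X ~ U[0,1],
# eps ~ N(0, .25), split point d0 = 1/2.
def coverage(confidence_set, n=200, reps=1000, d0=0.5):
    hits = 0
    for _ in range(reps):
        x = rng.uniform(0.0, 1.0, n)
        y = x + rng.normal(0.0, 0.5, n)   # sd 0.5, so variance .25
        lo, hi = confidence_set(x, y)
        hits += (lo <= d0 <= hi)
    return hits / reps                    # empirical coverage probability
```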

Why is the RSS CI much more accurate than the Wald CI? From the proof of Theorem 2.1 (in Section 5), the RSS type CI is seen to be calibrated in a way that depends on the limiting distribution of $\max_t Z_n(t)$, in which the maximization is a continuous operation. For the Wald type CI, however, the calibration involves the limiting distribution of $\operatorname*{argmax}_t Z_n(t)$ and the argmax operation is discontinuous (in sup norm), so we expect less stability and a slower rate of convergence in that case. This is naturally reflected in the accuracy of the confidence intervals. Figure 3 shows the extent to which the large sample calibration differs from small sample behavior, with much greater discrepancy in the case of $\hat d_n$ (see the vertical lines, showing the nominal 95% critical values).

Figure 1: Simulation results for split point estimation in the nonparametric regression setting: coverage and width of 95% confidence intervals at selected sample sizes; coverage (left panel), width (right panel), Wald (circles), RSS (stars). Results based on 1000 replicated samples in each case.

Figure 2: As in Figure 1 except with non-constant variance.

Figure 3: Accuracy of the large sample approximations in the nonparametric regression setting. Results based on 10,000 replicated samples of size $n = 250$. Left panel: Q-Q plot of $n^{1/3}(\hat d_n - d^0)$ (normalized) vs. $\operatorname*{argmax}_t Q_{1,1}(t)$; right panel: Q-Q plot of $n^{-1/3}\mathrm{RSS}_n(d^0)$ (normalized) vs. $\max_t Q_{1,1}(t)$.

Figure 4: Data from the DUWC Everglades phosphorus dosing study showing variations in bladderwort (Utricularia P.) stem density (number of stems per square meter) in response to total phosphorus concentration (six month geometric mean, units of ppb). The vertical lines show the limits of the RSS 95% confidence interval for the split point. The local polynomial regression fit is also plotted.

The "river of grass" known as the Everglades is a majestic wetland covering much of South Florida. Severe damage to large swaths of this unique ecosystem has been caused by pollution from agricultural fertilizers and the disruption of water flow (e.g., from the construction of canals). Efforts to restore the Everglades started in earnest in the early 1990s. In 1994, the Florida legislature passed the Everglades Forever Act which called for a threshold level of total phosphorus that would prevent an "imbalance in natural populations of aquatic flora or fauna." This threshold may eventually be set at around 10 or 15 parts per billion (ppb), but it remains undecided despite extensive scientific study and much political and legal debate; see [3] for a discussion of the statistical issues involved.

Figure 5: Simulation results for split point estimation in the hazard function setting; see Figure 1 for the key. Results based on 10,000 replicated samples in each case.

Between 1992 and 1998, the Duke University Wetlands Center (DUWC) carried out a dosing experiment at two unimpacted sites in the Everglades. This experiment was designed to find the threshold level of total phosphorus concentration at which biological imbalance occurs. Changes in the abundance of various phosphorus-sensitive species were monitored along dosing channels in which a gradient of phosphorus concentration had been established. Qian, King and Richardson [5] analyzed data from this experiment using Bayesian change-point analysis, with the response along the phosphorus gradient assumed to be normally (or binomially) distributed with mean and variance changing at the threshold. This resulted in extremely narrow credible regions for the threshold, which was to be expected in view of the small number of parameters in the model. They also used split point estimation (just as we described it in the nonparametric regression setting of Section 2), the split point being interpreted as the threshold level at which biological imbalance occurs. Uncertainty in the split point was evaluated using the bootstrap (however, recall that the bootstrap fails for split point estimation, see [2], page 940).

We illustrate our approach with one particular species monitored in the DUWC dosing experiment: the bladderwort Utricularia purpurea, which is considered a keystone species for the health of the Everglades ecosystem. Figure 4 shows 340 observations of stem density plotted against the six month geometric mean of total phosphorus concentration. The displayed data were collected in August 1995, March 1996, April 1998 and August 1998 (observations taken at unusually low or high water levels, or before the system stabilized in 1995, are excluded). As expected, the estimated regression function shows a fairly steady decrease in stem density with increasing phosphorus concentration. The 95% Wald-type and RSS-type CIs for the split point are 9.5–16.1 ppb and 10.6–24.3 ppb, respectively. The lower endpoint of the RSS-type interval, 10.6 ppb, is only slightly higher than the 10 ppb threshold recommended by the Florida Department of Environmental Protection (based on an observational study, see [13]), yet considerably less than the 15.6 ppb recommended by the DUWC based on Bayesian change-point analysis.

The interpretation of the split point as a biological threshold is the source of some controversy in the debate over a numeric phosphorus criterion (see [13]). It can be argued that the split point is only crudely related to biological response and that it is a statistical construct depending on an artificial working model. Indeed, the CI for the split point can change substantially under a different working model. For instance, in a two-phase linear regression split point analysis of the above data, the 95% RSS-type CI for the split point becomes 11.8–17.2 ppb. Yet the approach has played an important role in laying the groundwork for the 'hard decision' of determining a phosphorus threshold and fulfills a clear need in the absence of better biological understanding. Also, as mentioned earlier, the approach is more appropriate than a change-point analysis.

Now we turn to a simulation example for the hazard function setting. We specify the survival time $T$ as Weibull with hazard function $\lambda(t) = t$, the censoring time $C$ as exponential with mean 3, and the end of follow-up $\tau = 1.5$. The censoring rate is about 50%, which includes the roughly 20% of observations that exceed the end of follow-up. The split point is $d^0 = .75$. The nuisance parameters to be estimated in this case are the survival functions (of $T$ and $C$) evaluated at the split point, and $\lambda'(d^0)$; this is done by plugging the split point estimate into the respective Kaplan–Meier estimates and the Ramlau-Hansen kernel estimate of the derivative of a hazard function (see [6], page 459). Here we use the Epanechnikov kernel $K(t) = .75 (1 - t^2)$, for $|t| \le 1$, and bandwidth .25. Figure 5 shows that the 95% CIs based on RSS perform well, with the observed coverage probability around 97% over the range $n = $ 200–2000, but that the Wald type CIs are again grossly inaccurate.
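For reference, a sketch (ours, not the authors' code) of this data-generating mechanism: with $\lambda(t) = t$ the cumulative hazard is $\Lambda(t) = t^2/2$, so survival times can be drawn by inverting $\Lambda$ at a standard exponential variate.

```python
import numpy as np

rng = np.random.default_rng(2)

# Sketch (ours) of the survival simulation: hazard lambda(t) = t, censoring
# exponential with mean 3, end of follow-up tau = 1.5, split point d0 = 0.75.
def simulate_censored(n, tau=1.5):
    t = np.sqrt(2.0 * rng.exponential(1.0, n))    # invert Lambda(t) = t^2/2
    c = np.minimum(rng.exponential(3.0, n), tau)  # censor at end of follow-up
    x = np.minimum(t, c)
    delta = (t <= c).astype(int)                  # 1 if the event is observed
    return x, delta
```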

5 Proofs

We follow the general approach van der Vaart and Wellner ([8], Part 3) developed for M-estimators, and also refer to their book for various results on empirical processes. For the sake of readability we have included some details that overlap with Kim and Pollard [4] and Bühlmann and Yu [2].

Proof of Theorem 2.1: Our goal is to derive the limiting distribution of�R� #"%$ �T�� � < � X �K& ��� #"%$ ��V� � � � X ���when the underlying split-point in the population is indeed

� X . Note that without loss of generality itsuffices to study the case that �RQ XS & Q XU � is known, since the vector �V�QTS & �QVU � converges at � � -rate to thepopulation projected values �RQ XS & Q XU � , which is faster than the rate of convergence of �� to

� X . Thus, forthis section, we take ��d�5� argmin a �� � � gÝ �

� <�Q XS c�� � � �f�W� <�Q XU c�� � � �f�W�%Þ Jand���T��� � � X ��� �� � � Ý �

� <8QYXS c�� � � �f� X � <�QYXU c�� � � �f� X � Þ J < �� � � � �� <8QCXS c�� � � � ����O� <8QYXU c�� � � � ��d�O� � J L

Also, in what follows we will assume that Q XS rjQ XU . The derivation for the other case is analogous. Now,elementary algebra shows that

Ý �� <8QYXS c�� � � �f�W� <8QYXU c�� � � �f�O� Þ J <Á� J� � Ý QYXS <?QCXU Þ Ý QCXS 9 QYXU < � �

� Þ � c�� � � �f�O��9 QYXU Ý QYXU < � �� Þ L

10

Page 11: Confidence Sets for Split Points in Decision Trees

Letting ß � denote the empirical measure of the pairs

¦ � � � & � � � « �� � , we have� � #"%$ �Ë�V��� � � X �à� � � #"%$ �� � � �RQ XS <�Q XU � Ý Q XS 9 Q XU < � �� Þ � c�� � � �f� X � <�c�� � � � ������ �� � � � #"%$ �RQ XS <8Q XU � �� � � á �

� < Q XS 9 Q XU� â � c�� � � � �� � � <fc�� � � �f� X � �� � � J�"%$ �RQ XS <8Q XU � ß ��ã �76 & ����T�o&where ã ��� �P& � �K&��W�À� á �b< Q XS 9 Q XU� â º c�� ���f�W� <fc�� ���f� X �#¿ÌLIt is easily seen that�� � �

argmax aåä �� � � Ý �� <8QYXS c�� � � �f� X � <8QYXU c�� � � �f� X �%Þ J< �� � � gÝ �

� <8Q XS c�� � � �f�W� <8QTUgc�� � � �f�W�%Þ JOæ�argmax a � �>�RQ XS <�Q XU � ß �;ã �76 &��W��argmax a � J�"%$ ß � ã �76 &��W�oL

Now, define the process � � �R� ��� � J�"%$ ß ��ã �76 &�� X 9 �W� � #"%$ �oLLetting �� � �� #"%$�� ���� < � X � , so that ����ç� � X 9 ��W� � #"%$ , we have �� �

argmax ��� � �R� � and� � #"%$ ���T��� � � X �H� � �RQ XS <bQ XU � � � � �� � . It therefore suffices to find the joint limit distribution of� �� & � � � �� ��� . Lemma 5.1 below shows that the random processes � � �R� � converge in distribution in thespace èËS×é%ê��Rë � (the space of locally bounded functions on ë equipped with the topology of uniformconvergence on compacta) to the Gaussian process �E�R� �Ë�Ä�©� �R� � <f�o� J whose distribution is a tightBorel measure concentrated on ¯!ì � x �Rë � (the separable subspace of èíSÛé7ê��Rë � of all continuous functionson ë that converge to < } as the argument runs off to

}or < } and that have a unique maximum).

Furthermore, the sequence

¦ �� � « of maximizers of

¦ � � �R� � « is îÙï!�%c � . The result then follows by Theorem5.1 below.

Proof of Theorem 3.1: As in the proof of Theorem 2.1, assume that Q XS and Q XU are known and thatQ XS rjQ XU . Then Ê � �RQTS & QVU &��W� is simply a function of�

and is given byÊ � � �O��� ���RQ XS � J <³�RQ XU � J �W�Ë9 �RQ XU � J  9 � �RQ XU <?Q XS � �à � � �O� < � Q XU �à � � �oLFor / ¨f��¨  , define the process ð � �O��� Ê � � � X � <?Ê � � �W�oLThen �� � � argmax a ð � �O� and

���V� � � � X �;� � ð �V�� � � . We argue the consistency of �� � at the end of thisproof.

11

Page 12: Confidence Sets for Split Points in Decision Trees

Elementary algebra shows thatð � �O��� � �RQ XS <�Q XU � � �à � � �W� <=®Y� � X �W� <³� �à � � � X � <=®C� � X �W� X � � LFor �©ñÜ�%< }�&K}�� , define the process� � �R� ��� � J�"%$ � �à � � � X 9 �W� � #"%$ � <=®C� � X � � � X 9 ��� � #"%$ � <³� �à � � � X � <=®C� � X �W� X � � LThen �� �*� argmax ��� � �R� ��� �� #"%$�� ���� < � X � and � � #"%$ �Ë�V��� � � X ��� � �RQ XS <åQ XU � � � � �� �O� . Now, note that� � �R� ��� � J�"%$ ß �Àã`� �76 &�� X 9 ��� � #"%$ �o&where ã � ���nµ &��?�K&��O�À� µjò � � �?� .óc�� �~¨��W� <fc�� ��¨f� X �h1 <=®C� � X � � � < � X �and ò � �R� ��� cc!<�Í � �R�¾< � |¹ô¾õ öhõ òÜ�R� �)� ccÁ<�͹�R�D< � � ccÁ<�͹�R� � � cÍÐ�R� � &where ͹�R� ���H÷ � � rj� �)� ­>�R� � °E�R� � . Define a surrogate process approximating � � �R� � byø� � �R� �@� � J�"%$ ß � øã � �76 &�� X 9 ��� � #"%$ �K&where øã`� ���nµ &��?�K&��O�À� µ¹òÜ� �?� .óc�� �~¨f�O� <fc�� �e¨f� X �h1 <=®Y� � X � � � < � X �KLLemma 5.3 shows that sup �y�ù �oú ] úÀû C� � �R� � < ø� � �R� � converges to 0 in probability. Also, under the

assumptions of Theorem 3.1,ø� � �R� � converges in distribution to

�!� �R� � <j�T� J as a process in èíS×é%ê��Rë �(the space of locally bounded functions on ë ) equipped with the topology of uniform convergence oncompact sets, with

�and � as defined in Theorem 3.1. This is the content of Lemma 5.2 below. Therefore,� � �R� � also converges in distribution, as a process, to

�!� �R� � <¹�o�KJ , whose distribution is a tight Borelmeasure concentrated on ¯©ì � x �Rë � . Furthermore the sequence �� � is tight as in the regression setting andtherefore, by Theorem 5.1, � �� � & � � � �� � ��� converges jointly to � �� & �>� �� ��� . We conclude that�R� #"%$ �N��d� < � X �K& � � #"%$ ���V��� � � X ����| a � argmax ���E�R� �K& � �RQ XS <8Q XU � max � �>�R� ���oLTo establish the consistency of ���� promised earlier, we proceed as follows: Noting that �à � � �O�f�ß � �nµ¹ò � � �8� c�� �e¨��W���

we have ð � �O��� � �RQ XS <8Q XU �düå� � �O�where

üå� � �O��� ß � . ã`� ���nµ &��?�K&��O�h1 . Since Q XS <åQ XU rf/ , it follows that ���� also maximizesüP� � �W� . Now,

let ü � �W�Ù�H÷ . øãz� ���nµ &��8�K&��W�h1�LOn using the fact that

÷ �nµ¹òÜ� �8� c�� �~¨f�W����� à � �O� we easily conclude thatü � �O�g� à � �O� < à � � X � <ЮY� � X � � � < � X �oLBy condition (B1),

� X ñ��/ &  � is the unique minimizer of Ê��RQ XS & Q XU &��W� (with Ê defined in Section3); consequently, it is the unique maximizer of the function Ê��RQ XS & Q XU &�� X � <AÊ��RQ XS & Q XU &��W� , which

12

Page 13: Confidence Sets for Split Points in Decision Trees

straightforward algebra shows to be� �RQ XS <�Q XU �dü � �O� . Since Q XS <�Q XU r»/ we conclude that

� X isthe unique maximizer of

ü � �O� . Note thatü m � �O��� ®C� �W� <�®Y� � X � and

ü m m � �W��� ® m � �W� ; henceü m � � X �)� /

andü m mý� � X �)� ®�mþ� � X � . Since

� X ñÜ�/ &  � maximizesü � �O� , we must have

ü m m�� � X �©¨ / . This implies that®C� � X �©� / , since by condition (B2) we have ® m � � X � i� / .By Corollary 3.2.3 of [8], it will follow that ���� converges in probability to

� X if we can show that

sup X¾ÿ a ÿ ¸ üå� � �O� < ü � �O� | /in outer probability. But üå� � �W� < ü � ��� ¨ ó�¶ß � < ÷p� øã`� ���nµ &��?�K&��O� 9��� ß � �nµ��Rò � � �8� <8òÜ� �?��� �%c�� �à¨f�W� <fc�� ��¨f� X ������¨ ó�¶ß � < ÷p��� ���nµ &��8�K&��W� 9 ß � ò � � �?� <8òÜ� �?� &with

� ���nµ &��?�K&��O�P� µ¹òÜ� �8� .óc�� � ¨ �O� <bc�� � ¨G� X �h1 . Now, sup aDy�ù X ] ¸ û �¶ß � < ÷Ì��� ���nµ &��8�K&��W� goes to 0 in outer probability since

¦ � �76 &��O��§ / ¨��³¨  « is a Glivenko–Cantelli class of functions.That sup a�y�ù X ] ¸ û ß � ò � � �8� <8òÜ� �8� goes to 0 in probability follows upon noticing that it is dominatedby sup X¾ÿ � ÿ ¸ ò � �R� � <8òÜ�R� � , which converges to 0 almost surely from a standard Glivenko–Cantelliargument and ͹� � rf/ (condition (B3)). This concludes the proof.

Theorem 5.1 Suppose that the process � � �R� � converges in distribution in the space è�SÛé%ê\�Rë � (the spaceof locally bounded functions on ë equipped with the topology of uniform convergence on compacta) tothe Gaussian process �E�R� � , whose distribution is a tight Borel measure concentrated on ¯�ì � x �Rë � (theseparable subspace of èíSÛé7ê\�Rë � of all continuous functions on ë that converge to < } as the argumentruns off to

}or < } and that have a unique maximum). Furthermore, suppose that the sequence

¦ �� � «of maximizers of

¦ � � �R� � « is îÙï!�%c � and converges to argmax � �E�R� � . Then,� �� �o& � � � �� ������| a � argmax � �E�R� �K& �E� argmax � �>�R� ����� max � �E�R� ���KLProof: By invoking Dudley’s representation theorem (Theorem 2.2 of [4]), for the processes � � , wecan construct a sequence of processes

ø� � and a processø� defined on a common probability space� ø� & ø� & ø÷Ì� with (a)

ø� � being distributed like � � , (b)ø� being distributed like � and (c)

ø� � convergingtoø� almost surely (with respect to

ø÷) under the topology of uniform convergence on compact sets.

Thus(i)ø� � , the maximizer of

ø� � , has the same distribution as �� � , (ii)ø� , the maximizer of

ø�E�R� � , hasthe same distribution as argmax �E�R� � and (iii)

ø� � � ø� �O� andø�*� ø� � have the same distribution as � � � �� ���

and max �>�R� � respectively. So to prove the theorem it suffices to show thatø� � converges in

ø÷��(outer)

probability toø� and

ø� � � ø� �V� converges inø÷ �

(outer) probability toø�>� ø� � . The convergence of

ø� � toø� in

outer probability is shown in Theorem 2.7 of [4].To show that

ø� � � ø� � � converges in probability toø�E� ø� � , let

: rÇ/ & q�rÔ/ be given. We need to showthat, eventually, ÷ � � ø� � � ø� �V� < ø�>� ø� � r³q � ��:4LSince

ø� � andø� are î ï �%c � , given

: rf/ , we can find ÐB;rf/ such that, with ê� � ¦ ø� � �ñÜ.ó<�=B & ÜB 1 « & è ê� � ¦ ø� �ñ�.ó<�ÜB & ÜB 1 « &÷�� � ê� �f�à:K�� and

÷�� �è ê� �j�@:K�� , eventually. Furthermore, as

ø� � converges almost surely andtherefore in probability, uniformly, to

ø� on every compact set, with¯ ê� � ¦ sup È yWù ����� ] ���Rû ø� � �nÉ � < ø�E�nÉ � Or�q « &13

Page 14: Confidence Sets for Split Points in Decision Trees

÷�� �þ¯ ê� �;��:K� � , eventually. Hence, eventually,÷�� � ê��� è ê��� ¯ ê� �©��: , so that

÷ � � ��� è ��� ¯ �W� r�cT< : .But ��� è ��� ¯ � � ¦ ø� � � ø� ��� < ø�*� ø� � ¨ q & « (5.3)

and consequently ÷ � �� ø� � � ø� � � < ø�E� ø� � ¨ q �Á�f÷ � � � � è � � ¯ � � r�c!< :for all sufficiently large � . This implies immediately that for all sufficiently large �÷ � �� ø� � � ø� �O� < ø�E� ø� � r�q �!��:4LIt remains to show (5.3). To see this, note that for any �fñ � � è � � ¯ � and ÉÌñÜ.ó<�=B & =B 1 ,ø� � �nÉ � � ø�E�nÉ �49 ø� � �nÉ � < ø�E�nÉ �;¨ ø�E� ø� �49 ø� � �nÉ � < ø�E�nÉ � LTaking the supremum over ÉÌñÜ.ó<� B & B 1 and noting that

ø� � ñÜ.ó<� B & B 1 on the set ��� è ��� ¯ � we

have ø� � � ø� ���;¨ ø�>� ø� �Y9 sup È y�ù ����� ] ���Rû ø� � �nÉ � < ø�>�nÉ � &or equivalently ø� � � ø� ��� < ø�*� ø� �;¨ sup È y�ù ����� ] ���^û ø� � �nÉ � < ø�*�nÉ � LAn analogous derivation (replacing

ø� � everywhere byø� , and

ø� � byø� , and vice–versa) yieldsø�E� ø� � < ø� � � ø� � �;¨ sup È y�ù ����� ] ���^û ø�>�nÉ � < ø� � �nÉ � L

Thus ø� � � ø� ��� < ø�E� ø� � ¨ sup È yWù ��� � ] � � û ø� � �nÉ � < ø�E�nÉ �;¨ q &which completes the proof.

Lemma 5.1 The process � � �R� � defined in the proof of Theorem 2.1 converges in distribution in the spaceèËS×é%ê\�Rë � (the space of locally bounded functions on ë equipped with the topology of uniform convergenceon compacta) to the Gaussian process �E�R� �)�H�!� �R� � <´�o� J whose distribution is a tight Borel measureconcentrated on ¯ ì � x �Rë � . Here

�and � are as defined in Theorem 2.1. Furthermore, the sequence

¦ �� � «of maximizers of

¦ � � �R� � « is îÙï!�%c � (and hence converges to argmax � �E�R� � by Theorem 5.1).

Proof: We apply the general approach outlined on page 288 of [8]. Lettingü�� � �W��� ß � . ã �76 &��O�h1 andü � ���+� ÷ . ã �76 &��W�h1 , we have ����Ð� argmax X¾ÿ a ÿoú üå� � �W� and

� X � argmax X¾ÿ a ÿoú ü � �O� and, in fact,� X is the unique maximizer ofü

under the stipulated conditions. The last assertion needs proof, whichwill be supplied later. We establish the consistency of ���� for

� X and then find the rate of convergence� � of ���� ; in other words that � � for which � � �T��d� < � X � is îÙïÁ�%c � . For consistency of ��d� , note that¦ ã �76 &��O�!§ / ¨���¨f0 « is a VC class of functions and hence universally Glivenko–Cantelli in probability.Therefore

sup X¾ÿ a ÿoú üå� � �W� < ü � �O� � sup X¾ÿ a ÿoú ó�¶ß � < ÷Ì�dã �76 &��O� | /in outer probability. Also

���| ü � �W� is continuous and therefore upper semicontinuous with uniquemaximizer

� X . It follows from Corollary 3.2.3 of [8] that ��d�5� sup X¾ÿ a ÿoú üå� � �O� converges in probabilityto� X .

14

Page 15: Confidence Sets for Split Points in Decision Trees

Next, to derive the rate � � , we invoke Theorem 3.2.5 of [8], with�

playing the role of � ,� X � � X and � . / &�0'1 . We have ü �!� � < ü �"� X ���³ü � ��� < ü � � X �©¨ <Ù¯j� � < � X � J

(for some positive constant ¯ ) for all�

in a neighborhood of� X , on using the continuity of

ü m m � �O� ina neighborhood of

� X and the fact thatü m m�� � X �*� / (which follows from arguments at the end of this

proof). We now need to find functions # � �nq � such that� �í� �%$ sup & a � a(' & ) w �� �¶ß � < ÷Ì� . ã �76 &��O�h1 <��¶ß � < ÷Ì� . ã �76 &�� X �h1*��,+ ¨f0 # � �nq �for some universal positive constant

0. Letting - �+� � �>�¶ß � < ÷p� denote the empirical process and. w �0/ á21 < Q XS 9 Q XU� â Ý c

¦ �´��� « <fc ¦ �?�f� X « Þ�§ � < � X� � q43 &what we need to show can be restated as� �ï ��5 - � 5*687 � ¨f0 # � �nq �oLFrom page 291 of [8] � �ï �95 - � 5 6 7 � ¨f0 m;: �%c & . w � � ÷ � Jw � #"%J &where w is an envelope function for

. w and0 m is some universal constant. Since

. w is a VC classof functions, it is easy to show that (in the notation of [8]): �%c & . w �)� sup <f· X>= c 9@? ��£ l¹� : 5 w 5 < ] J & . w &(A J �þ� ���O�À:Ù��B¹&where

Bis a positive number not depending on q . We take w � �Y& 1 ��� �� 1 <Ü()� � X � �� c�� � ñÜ. � X <Üq &�� X 9 q 1R�oL

Next, we have ���C w � �'& � � J �à� � º ��b<Ü()� � X ��� J c¦ � ñÜ. � X <�q &�� X 9 q 1 « ¿� · a ' v�wa ' � w � º �� <Ü()� � X ��� J � ��� ¿ 2 3 � ���W�d�� · a ' v�wa ' � w �I J � ���Y9 �n()� ��� <Ü()� � X ��� J � 2T35� �F�W���¨ 0 q &

whenever q � q X for some constant0

(depending on q X ) by the continuity of all functions involved inthe integrand in a neighborhood of

� X . Therefore��% Jw � #"%J � 0 #"%J q #"%Jand it follows that � � � 5 - � 5 6D7 �´¨ ø0 q #"%J for some constant

ø0depending neither on � , nor on q ,

whenever q � q X . Hence # � �nq �À� � q works. Indeed # � �nq ��� q   is decreasing for ¡ � c . Solving� J� # � �%c � � �T�;¨ � �15

Page 16: Confidence Sets for Split Points in Decision Trees

we find � �5� � #"%$ works. Since ��d� maximizesüP� � �W� , it follows that � #"%$ �o��d� < � X ��� îÙï!�%c � .

It remains to find the limiting distribution of �� �P� � #"%$ �o��d� < � X � . Now, �� �+� argmax �d� � �R� � with� � �R� ��� �4J�"%$oß ��ã �76 &�� X 9 ��� � #"%$ � . We show that � � �R� ��| a �E�R� � , a Gaussian process in ¯©ì � x �Rë � andthen use the argmax continuous mapping theorem to deduce that �� �5| a �� , the unique maximizer of �E�R� � .Write� J�"%$ ß � $ ã �76 &�� X 9 ����� #"%$ � + � � J�"%$ �¶ß � < ÷Ì� $ ã �76 &�� X 9 �W��� #"%$ � + 9 � J�"%$ ÷ $ ã �76 &�� X 9 �W��� #"%$ � +� ED� �R� �49@EFED� �R� �oLIn terms of the empirical process - � , we have

E��5� - � �n( � ] � � where( � ] � � �C& 1 ��� � #"HG � 1 <=()� � X ��� �%c�� �8¨f� X 9 ��� � #"%$ � <fc�� �?¨f� X ���oLWe will use Theorem 2.11.22 from [8]. On each compact set .ó< 0P&�0'1 , - � ( � ] � converges as a process inICJ .ó< 0'&�0P1 to the tight Gaussian process

�!� �R� � with� J � IFJ�� � X � 2T3p� � X � . Hence - � ( � ] � converges to�!� �R� � on the line, in the topology of uniform convergence on compact sets. Also,

EFE�� �R� � converges onevery .ó< 0P&�0'1 uniformly to the deterministic function <g�o� J , with � � (Nmn� � X � 2 3 � � X � � � rb/ . Hence� � �R� ��| a �E�R� ���H�!� �R� � <í�o� J in

I J .ó< 0P&�0'1 for all0 rf/ . Consequently, � �� �T& � � � �� �O����| a � �� & �>� �� ��� .

We now establish thatE\�

andEFED�

indeed converge to the claimed limits. As far asEt�

is concerned,provided we can verify the other conditions of Theorem 2.11.22, the covariance kernel

0 �nÉ & � � of thelimit of - � ( � ] � is given by the limit of

÷ �n( � ] È ( � ] � � < ÷ ( � ] È ÷ ( � ] � as � | }. We first compute÷ �n( � ] È ( � ] � � . This vanishes if É and � are of opposite signs. For É & �;r�/ ,÷ ( � ] È ( � ] � � ��. � #"%$ �� <Ü()� � X ��� J c ¦ � ñÜ� � X &�� X 9 �nÉ©²�� � � � #"%$ 1 « 1� · a ' vLK ÈNM �PO �RQRSUTWVa ' � #"%$ º �H. �n()� �?�49j: <Ü()� � X ��� J � ���T1 ¿ 2T35� ���W�d�� � #"%$ · a ' vLK ÈNM �PO � QRSUTWVa ' Ý I J � ���49 �n()� �F� <Ü()� � X ��� J Þ 2T3p� �F�W���| I J � � X � 2T35� � X � �nÉ+²�� �� � J �nÉ+²�� �oL

Also, it is easy to see that÷ ( � ] È and

÷ ( � ] � converge to 0. Thus, when É & �!rf/ ,÷ �n( � ] È ( � ] � � < ÷ ( � ] È ÷ ( � ] � |~� J �nÉ©²�� ���H0 �nÉ & � �oLSimilarly, it can be checked that for É & � � / , 0 �nÉ & � ���k� J��%<ÙÉÁ²Ü<Á� � . Thus

0 �nÉ & � � is the covariancekernel of the Gaussian process

�©� �R� � .Next we need to check

sup < · wWXX = ? ��£ lÐ� : 5 ­ � 5 < ] J &HY � &(A J �þ� �����À:À| / & (5.4)

for every q �5| / . HereYÁ�5�[Z � #"HG � 1 <Ü()� � X ��� .óc�� �?��� X 9 �W� � #"%$ � <fc�� �?�f� X �h1Y§ �©ñÜ.ó< 0'&�0'1]\16

Page 17: Confidence Sets for Split Points in Decision Trees

and ­ � � �Y& 1 ��� � #"HG �� 1 <Ü()� � X � �� c�� � ñÜ. � X < 0 � � #"%$ &�� X 9¹0 � � #"%$ 1R�is an envelope for

Y �. Now? ��£ l¹� : 5 ­ � 5 < ] J &HY � &(A J �þ� ���©¨f0_^ � Y � � �%ca`cb �Hd KfegX O á c: â J4K d KhegX O � O

for some universal constant0

. Here^ � Y��� is the VC-dimension of

YÙ�. Using the fact that

^ � YÙ��� isuniformly bounded, we see that the above inequality implies? ��£ l¹� : 5 ­ � 5 < ] J &HY � &(A J �þ� ���©¨f0 � á c: â Èwhere É � sup

� � � ^ � Y � � <fc �Á�³} and0i�

is a constant not depending upon � or � . To check (5.4) ittherefore suffices to check that · wWXX j < ? ��£;:V�;:;| /as q �*| / . But this is trivial. We finally check the conditions (2.11.21) in [8]; these are:÷ � ­ J� � îE�%c �K&=÷ � ­ J� c ¦ ­ � r@k � � « | / & l k�r�/ &and

sup m K È ] �PO ) wWX ÷ �n( � ] È <Ü( � ] � � J | / &nl q � | / LWith ­ � as defined above, an easy computation shows that÷ � ­ J� � 0 c0 � � #"%$ · a ' v ú �FQ9SUT!Va ' �oú � Q9SUT!V �I J � ���Y9 �n()� ��� <Ü()� � X ��� J � 2T3p� �F�W���P� î>�%c �oLDenote the set . � X < 0 � � #"%$ &�� X 9j0 � � #"%$ 1 by o � . Then÷ � �­ J� c ¦ ­ � r@k � � « �@� ��. � #"%$ z�Ô<=()� � X � J c ¦ � ñio � « c ¦ �� <�()� � X � ©c ¦ � ñpo � « r@k©� #"%$ « 1¨ � $ � #"%$ z�b<Ü()� � X � J c ¦ � ñpo � « c ¦ : r@k©� #"%$ � � « +¨ � $ � � #"%$ � : J 9 �n()� �?� <=()� � X ��� J � c ¦ � ñpo � « c ¦ : r�kÀ� #"%$ � � « + (5.5)

eventually, since for all sufficiently large �¦ z� <�()� � X � ©c ¦ � ñpo � « r@k©� #"%$ « � ¦ : r@k©� #"%$ � � « LNow, the right side of (5.5) can be written as ¬ 9 ¬ J where¬ � � � #"%$ ��. : J c ¦ : rqkÀ� #"%$ � � « c ¦ � ñpo � « 1and ¬ J � � � #"%$ �H. �n()� �?� <=()� � X ��� J c ¦ � ñpo � « c ¦ : r@k©� #"%$ � � « 1�L

17

Page 18: Confidence Sets for Split Points in Decision Trees

We will show that ¬ �>r �%c � . Similar arguments can be used to show that ¬ J is alsor �%c � . Using

condition (A5), eventually¬ � � � #"%$ · a ' v ú � Q9SUT!Va ' �oú � Q9SUT!V s · & U & t2u � SUTWV "%J s J 2 B �Rs8 �F�W� s2v�2T3*� ���W�d�¨ � � #"%$ · a ' v ú � Q9SUT!Va ' �oú � Q9SUT!V s · & U & t2u � SUTWV "%J s J ts� � K $�v�w O�� s v 2 3 � �F�W���� � � #"%$ · a ' v ú � Q9SUT!Va ' �oú � Q9SUT!V s · & U & t2u � SUTWV "%J `s?×� K¶ #v�w O � s v 2 3 � ���W�d�� øw �xk & q � � #"%$ · a ' v ú � QRSUTWVa ' �oú � QRSUTWV � � w7"%$ 2T35� ���W�d�� � øw �xk & q �� w%"%$ � #"%$ · a ' v ú � Q9SUT!Va(' �oú � QRSUTWV 2T3p� �F�W���� r �%c �oLIn the above display

øw �xk & q � is a quantity depending only on k and q . Finally, the fact that

sup & È � � & ) w!X ÷ �n( � ] È <�( � ] � � J | /as q � | / can be verified through analogous computations which are omitted.

We next deal withERE��

. For convenience we sketch the uniformity of the convergence ofEFEz� �R� � to

the claimed limit on / ¨ � ¨f0 . We haveEFED� �R� � � � J�"%$ � $ �� <Ü()� � X ��� �%c�� �~�f� X 9 �W� � #"%$ � <fc�� ����� X ��� +� � J�"%$ � $ �n()� �?� <Ü()� � X ��� c�� � ñÜ. � X &�� X 9 �W��� #"%$ ��� +� � J�"%$ · a ' v � � Q9SUT!Va ' �n()� �F� <Ü()� � X ��� 2T3*� ���W�d�� � #"%$ · �X �n()� � X 9 sg� � #"%$ � <Ü()� � X ��� 2T35� � X 9 sg� � #"%$ �W� s� · �X s ()� � X 9 sÙ� � #"%$ � <Ü()� � X �sÙ� � #"%$ 2T3*� � X 9 sÙ� � #"%$ �W� s| · �X sË( m � � X � 2T3p� � X �W� s �"y2z2{}| ���ý¤~?���� zE/ ¨ � ¨f0��� c� ( m � � X � 2T3*� � X � � J � <g�K� J LIt only remains to verify that (i)

� X is the unique maximizer ofü � ��� , and (ii) ( m � � X � 2o35� � X �©� / , so that

the process�©� �R� � <Ü�o� J is indeed in ¯;ì � x �Rë � . To show (i), recall thatü � �W�,� �³. ã ��� �P& � �K&��W�h1� ��� á �b< Q XS 9 Q XU� â �%c�� �e�f�O� <fc�� ���f� X ���W�8L

18

Page 19: Confidence Sets for Split Points in Decision Trees

Let, � � �O��� ��. �Ô<�Q XS c�� ���f�W� <8Q XU c�� �����W�h1 J LBy condition (A1), it follows immediately that

� X ñ �/ &�0�� is the unique minimizer of� � �O� .

Consequently,� X is also the unique maximizer of the function

� � � X � < � � �O� . Straightforward algebrashows that � � � X � < � � �O��� � �RQYXS <8QYXU �dü � �O�and since Q XS <8Q XU r�/ , it follows that

� X is also the unique maximizer ofü � �O� . This shows (i). Now,ü � �W�ç� ��� á ()� �8� < Q XS 9 Q XU� â �%c�� ���f�W� <fc�� ���j� X ���W�Ì9 � º : �%c�� ���f�O� <fc�� ���f� X ���#¿� � � á ()� �8� < Q XS 9 Q XU� â �%c�� ���f�W� <fc�� ���j� X ��� � 9 /� · úX á ()� ��� < Q XS 9 Q XU� â Ý c�� �?�f�W� <�c�� �´�f� X �%Þ 2T35� ���W�d�� · aX á ()� ��� < Q XS 9 Q XU� â 2T35� ���W�d� <=· a 'X á ()� ��� < Q XS 9 Q XU� â 2T3p� �F�W���ÌL

Thus, for / �f�>�f0 ü m � �W��� �n()� �W� <�()� � X ��� 2 3 � �O�o&on recalling that ()� � X ��� �RQ XS 9 Q XU ��� � andü�� � � �O��� ( m � �W� 2T3*� �W�Y9 �n()� �W� <=()� � X ��� 2 mx � �W�oLThus,

ü m � � X �=� / andü m m � � X �Ü� ( m � � X � 2T35� � X � . Since

� X is the maximizer at an interior point,ü m m�� � X �©¨ / . This implies (ii), since by our assumptions (Ym� � X � 2T3p� � X � i� / .Lemma 5.2 The process

ø� � �R� � defined in the proof of Theorem 3.1 converges in distribution in the spaceèËS×é%ê\�Rë � (the space of locally bounded functions on ë equipped with the topology of uniform convergenceon compacta) to the Gaussian process �E�R� �)�H�!� �R� � <´�o� J whose distribution is a tight Borel measureconcentrated on ¯ ì � x �Rë � . Here

�and � are as defined in Theorem 3.1. Furthermore, the sequence

¦ ø� � «of maximizers of

¦ ø� � �R� � « is îÙï!�%c � (and hence converges to argmax � �E�R� � ).Proof: Centering the process

Proof: Centering the process $\tilde M_n(h)$ by its mean, we obtain
$$\tilde M_n(h) = \big\{\tilde M_n(h) - E\,\tilde M_n(h)\big\} + E\,\tilde M_n(h) \equiv E_n(h) + EE_n(h).$$
We show that $E_n(h)$ converges in distribution to the Gaussian process $\sigma W(h)$, for some $\sigma > 0$ (to be identified later), on every set of the form $[-K, K]$ under the topology of uniform convergence. As in the proof of Lemma 5.1, we need to appeal to Theorem 2.11.22 of [8] and verify the conditions of that theorem. The computations are tedious and in spirit resemble those in the proof of Lemma 5.1, so we skip most of the details. The assumption of absolute continuity of the censoring distribution is just used to make it easy to check tightness. Here we only verify that the covariance kernel $K(s,t)$ of the limit of the process $E_n(h)$ is indeed that of the Gaussian process $\sigma W(h)$. Writing $E_n(h) = \sum_{i=1}^n \{g_{n,i}(h) - E\,g_{n,i}(h)\}$, where $g_{n,i}(h)$ denotes the contribution of the $i$th observation to $\tilde M_n(h)$, we have
$$K(s,t) = \lim_{n \to \infty} n\,\big\{ E\big(g_{n,1}(s)\,g_{n,1}(t)\big) - E\,g_{n,1}(s)\; E\,g_{n,1}(t) \big\}.$$
It is not difficult to check that $E\,g_{n,1}(s)$ and $E\,g_{n,1}(t)$ converge to 0 and that $n\,E(g_{n,1}(s)\,g_{n,1}(t)) \to 0$ if $s$ and $t$ are of opposite sign. Hence $K(s,t) = 0$ whenever $s$ and $t$ have opposite sign. We next compute the limit of $n\,E(g_{n,1}(s)\,g_{n,1}(t))$ for $s, t > 0$. Straightforward computations show that the only contribution to the limit comes from
$$n^{1/3}\,E\!\left[ \frac{\delta\,1\{T \in (d^0,\, d^0 + (s \wedge t)\,n^{-1/3}]\}}{y(T)^2} \right] = \int_{d^0}^{d^0 + (s \wedge t) n^{-1/3}} n^{1/3}\,\frac{f(u)\,\bar G(u)}{y(u)^2}\,du = n^{1/3} \int_{d^0}^{d^0 + (s \wedge t) n^{-1/3}} \frac{\alpha(u)}{y(u)}\,du \to \frac{\alpha(d^0)}{y(d^0)}\,(s \wedge t),$$
where $f$ denotes the density of the survival time (so that $f\,\bar G = \alpha\,y$, with $\bar G$ the censoring survival function). Thus $K(s,t) = \sigma^2\,(s \wedge t)$ for $s, t > 0$, with $\sigma^2 = \alpha(d^0)/y(d^0)$. It can be checked similarly that for $s, t < 0$, $K(s,t) = \sigma^2\,\{(-s) \wedge (-t)\}$. Hence the limit process (of $E_n(h)$) is indeed $\sigma W(h)$.

Next, $EE_n(h)$ converges uniformly to the deterministic function $\frac{1}{2}\,\alpha'(d^0)\,h^2$ on every compact interval $[-K, K]$. For simplicity we only show the uniformity of convergence on $0 \le h \le K$. Making use of condition (B2), we have
$$EE_n(h) = n^{2/3}\,E\!\left[ \frac{\delta\,1\{T \in (d^0,\, d^0 + h\,n^{-1/3}]\}}{y(T)} \right] - n^{1/3}\,h\,\alpha(d^0) = n^{2/3} \int_{d^0}^{d^0 + h n^{-1/3}} \frac{f(u)\,\bar G(u)}{y(u)}\,du - n^{1/3}\,h\,\alpha(d^0)$$
$$= n^{2/3} \int_{d^0}^{d^0 + h n^{-1/3}} \big\{ \alpha(u) - \alpha(d^0) \big\}\,du = n^{1/3} \int_0^h \big\{ \alpha(d^0 + s\,n^{-1/3}) - \alpha(d^0) \big\}\,ds = \int_0^h s\,\frac{\alpha(d^0 + s\,n^{-1/3}) - \alpha(d^0)}{s\,n^{-1/3}}\,ds \to \frac{1}{2}\,\alpha'(d^0)\,h^2$$
uniformly on $0 \le h \le K$. Thus $\tilde M_n(h) \to_d \sigma W(h) - a h^2$, with $a = |\alpha'(d^0)|/2$ (that $\alpha'(d^0) \ne 0$ was argued in the proof of Theorem 3.1), under the topology of uniform convergence on compact sets.

To show $\tilde h_n = O_P(1)$, apply techniques similar to those used in the regression function setting; the corresponding claim $\hat h_n = O_P(1)$ requires only a minor extension of the argument for $\tilde h_n$. We omit the details.
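The identification $\sigma^2 = \alpha(d^0)/y(d^0)$ can also be checked by simulation. The following sketch is a minimal Monte Carlo illustration under assumed exponential lifetimes (constant hazard $\alpha(t) = \lambda$) and exponential censoring, with arbitrary illustrative choices of $d^0$ and $h$; it compares the empirical variance of the centered, localized Nelson–Aalen increment with $\sigma^2 h$.

```python
import numpy as np

# Monte Carlo check of sigma^2 = alpha(d0)/y(d0): the variance of the
# n^{2/3}-scaled, centered Nelson-Aalen increment over (d0, d0 + h n^{-1/3}]
# should be close to sigma^2 * h.  All rates and d0, h are assumed values.
rng = np.random.default_rng(0)
n, reps, d0, h = 8000, 1000, 1.0, 1.5
lam, mu = 1.0, 0.5                          # hazard alpha(t) = lam; censoring rate mu

def na_increment(a, b, T, delta):
    """Nelson-Aalen mass on (a, b]: sum of delta_i / Ybar(T_i) over T_i in (a, b]."""
    order = np.argsort(T)
    T, delta = T[order], delta[order]
    at_risk = T.size - np.arange(T.size)    # Ybar(T_(i)) = n - i for sorted times
    sel = (T > a) & (T <= b)
    return np.sum(delta[sel] / at_risk[sel])

b = d0 + h * n ** (-1 / 3)
true_inc = lam * (b - d0)                   # Lambda(b) - Lambda(d0), constant hazard
vals = []
for _ in range(reps):
    X = rng.exponential(1 / lam, n)         # survival times
    C = rng.exponential(1 / mu, n)          # censoring times
    T, delta = np.minimum(X, C), (X <= C).astype(float)
    vals.append(na_increment(d0, b, T, delta) - true_inc)
vals = n ** (2 / 3) * np.asarray(vals)

sigma2 = lam / np.exp(-(lam + mu) * d0)     # alpha(d0) / y(d0), y(t) = P(T >= t)
print(vals.var(), sigma2 * h)               # the two numbers should be close
```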

Lemma 5.3 Under conditions (B1)–(B3), and with $M_n$ and $\tilde M_n$ as defined in the proof of Theorem 3.1,
$$\sup_{h \in [-K, K]} \big| M_n(h) - \tilde M_n(h) \big| \to_P 0$$
for every $K > 0$.

Proof: For simplicity we show that
$$A_n \equiv \sup_{h \in [0, K]} \big| M_n(h) - \tilde M_n(h) \big| \to_P 0;$$
it can be established similarly that $\sup_{h \in [-K, 0]} | M_n(h) - \tilde M_n(h) | \to_P 0$. Referring to the proof of Theorem 3.1 for the notation, we have
$$A_n = \sup_{h \in [0, K]} n^{2/3}\,\big| (\hat\Lambda_n - \tilde\Lambda_n)(d^0,\, d^0 + h\,n^{-1/3}] \big| = \sup_{h \in [0, K]} n^{2/3} \left| \frac{1}{n} \sum_{i=1}^n \delta_i\,1\{T_i \in (d^0,\, d^0 + h\,n^{-1/3}]\} \left( \frac{n}{\bar Y_n(T_i)} - \frac{1}{y(T_i)} \right) \right|$$
$$\le \left\{ n^{-2/3} \sum_{i=1}^n 1\{T_i \in (d^0,\, d^0 + K\,n^{-1/3}]\} \right\} \left\{ n^{1/3} \sup_{t \in (d^0,\, d^0 + K n^{-1/3}]} \left| \frac{n}{\bar Y_n(t)} - \frac{1}{y(t)} \right| \right\} \equiv B_n\,C_n.$$
Now, for some small $\kappa > 0$, using condition (B3),
$$\sup_{t \in [d^0 - \kappa,\, d^0 + \kappa]} \sqrt{n}\, \left| \frac{n}{\bar Y_n(t)} - \frac{1}{y(t)} \right| = \sup_{t \in [d^0 - \kappa,\, d^0 + \kappa]} \sqrt{n}\, \frac{\big| \bar Y_n(t)/n - y(t) \big|}{\{\bar Y_n(t)/n\}\, y(t)} = O_P(1),$$
since $\bar Y_n(t)/n$ is bounded away from zero on $[d^0 - \kappa,\, d^0 + \kappa]$ with probability tending to one; hence $C_n = n^{1/3 - 1/2}\,O_P(1) = o_P(1)$. By Markov's inequality, for any $\epsilon > 0$,
$$P(B_n > \epsilon) \le \frac{1}{\epsilon}\, E\!\left[ n^{-2/3} \sum_{i=1}^n 1\{T_i \in (d^0,\, d^0 + K\,n^{-1/3}]\} \right] \le \frac{1}{\epsilon}\, n^{1/3} \int_{d^0}^{d^0 + K n^{-1/3}} f(t)\,dt = O(1),$$
since condition (B2) implies that $f(t)$ is bounded in a neighborhood of $d^0$. Thus $B_n = O_P(1)$, and $A_n \le B_n\,C_n = O_P(1)\,o_P(1) \to_P 0$, completing the proof.
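A small numerical companion to Lemma 5.3, under the same assumed exponential setup as in the previous sketch: the $n^{2/3}$-scaled supremum gap between the Nelson–Aalen increments and their idealized counterparts computed with the true at-risk function $y$ shrinks, slowly, as $n$ grows.

```python
import numpy as np

# Numerical illustration of Lemma 5.3 under assumed exponential rates: the
# n^{2/3}-scaled sup-gap between Nelson-Aalen increments (random at-risk
# process) and idealized increments (true y) is asymptotically negligible.
rng = np.random.default_rng(3)
lam, mu, d0, K = 1.0, 0.5, 1.0, 2.0

def sup_gap(n):
    X = rng.exponential(1 / lam, n)
    C = rng.exponential(1 / mu, n)
    T, delta = np.minimum(X, C), X <= C
    order = np.argsort(T)
    T, delta = T[order], delta[order]
    at_risk = n - np.arange(n)                 # Ybar_n(T_(i)) for sorted times
    y = np.exp(-(lam + mu) * T)                # true at-risk fraction y(t)
    sel = delta & (T > d0) & (T <= d0 + K * n ** (-1 / 3))
    diff = 1.0 / at_risk[sel] - 1.0 / (n * y[sel])
    gaps = np.abs(np.cumsum(diff))             # sup over h in [0, K] is hit at event times
    return n ** (2 / 3) * (gaps.max() if gaps.size else 0.0)

for n in (1000, 10000, 100000):
    print(n, np.mean([sup_gap(n) for _ in range(200)]))  # decreases slowly in n
```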

6 Extending the decision tree approach

We conclude with a brief discussion of how the decision tree approach can be extended to parametrically specified working models.

Consider the nonparametric regression set-up of Section 2, but instead of fitting a piecewise constant function we now fit a piecewise parametric curve. Thus the working model for $f(x)$ becomes
$$g(\beta_l, x)\,1(x \le d) + g(\beta_u, x)\,1(x > d),$$
where $g(\beta, x)$ is (adequately) smooth in both $\beta$ and $x$, and $\beta$ varies in $\mathbb{R}^k$. We seek the best fit of this type in terms of mean squared error by minimizing
$$E\big\{ Y - g(\beta_l, X)\,1(X \le d) - g(\beta_u, X)\,1(X > d) \big\}^2.$$

21

Page 22: Confidence Sets for Split Points in Decision Trees

The motivation behind working models of this sort is that a well-specified parametric curve can often capture, at least partially, the complexities of the unknown regression function.

As in the decision tree setting, we require a unique vector $(\beta_l^0, \beta_u^0, d^0)$ to minimize the expression in the above display, with $0 < d^0 < K$ (so the optimal split in the population is unique) and $g(\beta_l^0, d^0) \ne g(\beta_u^0, d^0)$ (so the best-fitting working model has a jump at $d^0$). We also assume that conditions (A2)–(A5) hold. Again the goal is to estimate $d^0$. The normal equations are
$$E\!\left[ \frac{\partial}{\partial \beta_l} g(\beta_l^0, X)\, \big\{ Y - g(\beta_l^0, X) \big\}\, 1(X \le d^0) \right] = 0, \qquad E\!\left[ \frac{\partial}{\partial \beta_u} g(\beta_u^0, X)\, \big\{ Y - g(\beta_u^0, X) \big\}\, 1(X > d^0) \right] = 0,$$
$$f(d^0) = \frac{g(\beta_l^0, d^0) + g(\beta_u^0, d^0)}{2}.$$

Estimates $(\hat\beta_l, \hat\beta_u, \hat d_n)$ are obtained by minimizing
$$SS_n(\beta_l, \beta_u, d) = \sum_{i=1}^n \big\{ Y_i - g(\beta_l, X_i)\,1(X_i \le d) - g(\beta_u, X_i)\,1(X_i > d) \big\}^2.$$

For fixed $d$, the vector $(\hat\beta_l^{\,d}, \hat\beta_u^{\,d})$ that minimizes the above expression is obtained by solving
$$\sum_{i=1}^n \frac{\partial}{\partial \beta_l} g(\beta_l, X_i)\, \big\{ Y_i - g(\beta_l, X_i) \big\}\, 1(X_i \le d) = 0, \qquad \sum_{i=1}^n \frac{\partial}{\partial \beta_u} g(\beta_u, X_i)\, \big\{ Y_i - g(\beta_u, X_i) \big\}\, 1(X_i > d) = 0.$$

One then obtains $\hat d_n$ by minimizing
$$\sum_{i=1}^n \big\{ Y_i - g(\hat\beta_l^{\,d}, X_i)\,1(X_i \le d) - g(\hat\beta_u^{\,d}, X_i)\,1(X_i > d) \big\}^2$$
with respect to $d$. We define the centered residual sum of squares as
$$\mathrm{RSS}_n(d) = SS_n(\hat\beta_l^{\,d}, \hat\beta_u^{\,d}, d) - SS_n(\hat\beta_l, \hat\beta_u, \hat d_n).$$
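Here is a minimal numerical sketch of the profiled least-squares procedure just described, taking the two-phase linear curve $g(\beta, x) = \beta_0 + \beta_1 x$ as the working model; the simulated data, grid, and sample size are illustrative assumptions only.

```python
import numpy as np

def fit_segment(x, y):
    """OLS fit of y on (1, x); returns the residual sum of squares."""
    A = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ coef
    return r @ r

def profiled_ss(x, y, d):
    """SS_n(betahat_l^d, betahat_u^d, d): best two-phase linear fit given split d."""
    left = x <= d
    return fit_segment(x[left], y[left]) + fit_segment(x[~left], y[~left])

def split_point_fit(x, y, grid):
    ss = np.array([profiled_ss(x, y, d) for d in grid])
    dhat = grid[np.argmin(ss)]
    return dhat, ss - ss.min()        # (dhat_n, centered RSS_n over the grid)

# toy data: the regression curve is two-phase linear with a jump at 0.6
# (an assumed design, used only to exercise the procedure)
rng = np.random.default_rng(1)
n = 500
x = np.sort(rng.uniform(0, 1, n))
y = np.where(x <= 0.6, 1.0 + 0.5 * x, 3.0 - x) + rng.normal(0, 0.3, n)

grid = np.linspace(0.1, 0.9, 161)
dhat, rss = split_point_fit(x, y, grid)
print(dhat)                           # close to 0.6
```

Thresholding the centered RSS profile at a quantile of the limit law (simulated in the next sketch) then yields the inversion-based confidence set.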

As before, $(\hat\beta_l, \hat\beta_u)$ converges to the population projected values at $\sqrt{n}$-rate, and to study the limiting distribution of $\hat d_n$ it suffices to restrict to the case in which $(\beta_l^0, \beta_u^0)$ is known. Using routine extensions of the proofs in the previous section, we obtain
$$\Big( n^{1/3}\,(\hat d_n - d^0),\; n^{-1/3}\,\mathrm{RSS}_n(d^0) \Big) \to_d \Big( \mathop{\rm argmax}_h Z_{a,b}(h),\; \max_h Z_{a,b}(h) \Big),$$
where $Z_{a,b}(h) = a\,W(h) - b\,h^2$, for positive constants $a$ and $b$ given by
$$a = \big| g(\beta_l^0, d^0) - g(\beta_u^0, d^0) \big|\, \big\{ \sigma^2(d^0)\, p_X(d^0) \big\}^{1/2}, \qquad b = \tfrac{1}{2}\, \big| f'(d^0) \big|\, \big| g(\beta_l^0, d^0) - g(\beta_u^0, d^0) \big|\, p_X(d^0).$$
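The limit law above can be tabulated by simulation: discretize a two-sided Brownian motion, form $Z_{a,b}(h) = a W(h) - b h^2$ on a grid, and record the argmax and max. In the sketch below the constants $a$, $b$, the truncation, and the step size are assumptions; in practice $a$ and $b$ would be replaced by plug-in estimates.

```python
import numpy as np

def limit_law_samples(a, b, reps=5000, half_width=5.0, step=0.01, seed=2):
    """Draws of (argmax, max) of Z_{a,b}(h) = a W(h) - b h^2 on a grid."""
    rng = np.random.default_rng(seed)
    m = int(half_width / step)
    h = np.arange(-m, m + 1) * step
    out = np.empty((reps, 2))
    for r in range(reps):
        # two-sided Brownian motion: independent branches on each side of 0
        right = np.cumsum(rng.normal(0.0, np.sqrt(step), m))
        left = np.cumsum(rng.normal(0.0, np.sqrt(step), m))
        W = np.concatenate([left[::-1], [0.0], right])
        Z = a * W - b * h ** 2
        k = np.argmax(Z)
        out[r] = h[k], Z[k]
    return out

samples = limit_law_samples(a=1.0, b=0.5)          # a, b as if estimated by plug-in
print(np.quantile(samples[:, 1], 0.95))            # cutoff for the max (RSS) statistic
print(np.quantile(samples[:, 0], [0.025, 0.975]))  # Wald-type argmax quantiles
```

Quantiles of the max column calibrate the cutoff for inverting the centered RSS statistic, while quantiles of the argmax column give Wald-type intervals.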

In analyzing the Everglades data we considered the linear model $g(\beta, x) = \beta_0 + \beta_1 x$, with $\beta = (\beta_0, \beta_1)$, of two-phase linear regression, as well as the CART model. As we noted, the split point depends


crucially on the choice of working model. This presents a potential difficulty with split point modeling, so care needs to be exercised in choosing $g$. Important considerations in this regard are the complexity of the underlying curve (as assessed through background knowledge or a nonparametric fit) and the scientific interpretation (if any) of the working model.

The “piecewise parametric curve” fitting approach could also be extended to the hazard function estimation setting, and it would be of interest to explore this in future work.

Acknowledgements: The authors thank Song Qian, Ken Weaver, Jon Wellner and Michael Woodroofe for helpful discussions.

References

[1] Banerjee, M. and Wellner, J. A. (2001). Likelihood ratio tests for monotone functions. Ann. Statist. 29 1699–1731.

[2] Bühlmann, P. and Yu, B. (2002). Analyzing bagging. Ann. Statist. 30 927–961.

[3] Qian, S. S. and Lavine, M. (2003). Setting standards for water quality in the Everglades. Chance 16, No. 3, 10–16.

[4] Kim, J. and Pollard, D. (1990). Cube root asymptotics. Ann. Statist. 18 191–219.

[5] Qian, S. S., King, R. and Richardson, C. J. (2003). Two statistical methods for the detection of environmental thresholds. Ecological Modelling 166 87–97.

[6] Ramlau-Hansen, H. (1983). Smoothing counting process intensities by means of kernel functions. Ann. Statist. 11 453–466.

[7] Ruppert, D. (1997). Empirical-bias bandwidths for local polynomial nonparametric regression and density estimation. J. Amer. Statist. Assoc. 92 1049–1062.

[8] Van der Vaart, A. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer, New York.

[9] Antoniadis, A. and Gijbels, I. (2002). Detecting abrupt changes by wavelet methods. Statistical models and methods for discontinuous phenomena (Oslo, 1998). J. Nonparametric Statist. 14 7–29.

[10] Antoniadis, A., Gijbels, I. and MacGibbon, B. (2000). Nonparametric estimation for the location of a change-point in an otherwise smooth hazard function under random censoring. Scand. J. Statist. 27 501–519.

[11] Gijbels, I., Hall, P. and Kneip, A. (1999). On the estimation of jump points in smooth curves. Ann. Inst. Statist. Math. 51 231–251.

[12] Lund, R. and Reeves, J. (2002). Detection of undocumented changepoints: a revision of the two-phase regression model. Journal of Climate 15 2547–2554.

[13] Payne, G., Weaver, K. and Bennett, T. (2003). Development of a numeric phosphorus criterion for the Everglades protection area. Everglades Consolidated Report, Chapter 5. www.dep.state.fl.us/water/everglades/docs/ch5 03.pdf

[14] Wu, C. Q., Zhao, L. C. and Wu, Y. H. (2003). Estimation in change-point hazard function models. Statist. Probab. Letters 63 41–48.

[15] Chang, I. S., Chen, C. H. and Hsiung, C. A. (1994). Estimation in change-point hazard rate models with random censorship. In: Change-point Problems, IMS Lecture Notes–Monograph Ser. 23 78–92.

[16] Müller, H. G. and Wang, J. L. (1994). Change-point models for hazard functions. In: Change-point Problems, IMS Lecture Notes–Monograph Ser. 23 224–241.
