
Sankhyā: The Indian Journal of Statistics, 2014, Volume 76-A, Part 1, pp. 77-100. © 2013, Indian Statistical Institute

Explicit Formula for Asymptotic Higher Moments of the Nadaraya-Watson Estimator

Gery Geenens
Université catholique de Louvain, Louvain-la-Neuve, Belgium,

The University of Melbourne, Melbourne, Australia, and The University of New South Wales, Sydney, Australia

Abstract

The Nadaraya-Watson estimator is certainly the most popular nonparametric regression estimator. The asymptotic bias and variance of this estimator, say m̂(x), are well known. Nevertheless, its higher moments are rarely mentioned in the literature. In this paper, explicit formulas for asymptotic higher moments, such as E((m̂(x) − m(x))^γ) or E((m̂(x) − E(m̂(x)))^γ), for γ any positive integer, are derived and illustrated by some examples. In particular, explicit asymptotic expressions for the L_γ-errors of m̂(x), for any γ, are shown. These results also allow one to give alternative proofs of the asymptotic normality and a Large Deviation Principle for the estimator. Other kernel regression estimators are also briefly discussed.

AMS (2000) subject classification. Primary 62G08; Secondary 62G20.
Keywords and phrases. Nonparametric regression, higher moments, asymptotic normality, L_γ-errors, large deviation principle.

1 Introduction

Consider the model

$$Y = m(X) + \varepsilon,$$

with Y a scalar outcome, X a univariate regressor, ε a univariate random disturbance such that E(ε|X) = 0 and m an unknown smooth function from R to R. The usual way to analyze the effect of X on the response Y is to estimate the function m from a sample of n observations, say {(X_k, Y_k), k = 1, ..., n}. Note that m is nothing else but the conditional mean of Y given the value of X, that is E(Y|X = ·). Nonparametric estimators for this regression function have been developed in order to avoid rash parametric assumptions on its shape. Among them, and although being theoretically outperformed by other smoothers (like local polynomial estimators), the Nadaraya-Watson estimator (Nadaraya, 1964; Watson, 1964)


remains very popular. The reason for this is certainly to be found in its simplicity of derivation, interpretation and implementation, as evidenced by the ever abundant literature about it. It is based on the following very simple idea: as the function m is assumed to be smooth, most of the information needed to estimate it at a point x is to be found in the observations (X_k, Y_k) such that X_k is close to x. On the other hand, observations with X_k very distant from x should be of less importance. Setting up this idea leads to defining the estimator of m at the point x as

$$\hat m(x) = \frac{\sum_{k=1}^{n} K\!\left(\frac{x - X_k}{h}\right) Y_k}{\sum_{k=1}^{n} K\!\left(\frac{x - X_k}{h}\right)}, \qquad (1.1)$$

where K is a smooth function, called the kernel, and h a positive number, possibly depending on n, called the bandwidth. The estimator m̂(x) appears to be a weighted average of the observed responses {Y_k}, with weights depending on the distance between x and X_k: the kernel K defines the way the weights vary with the distance, while the bandwidth h quantifies the notion of closeness between two points of R.
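For concreteness, here is a minimal NumPy sketch of (1.1); the Epanechnikov kernel and the toy data-generating process are illustrative choices, not prescribed by the paper.

```python
import numpy as np

def nadaraya_watson(x, X, Y, h, kernel=None):
    """Nadaraya-Watson estimate of m at the point(s) x, cf. (1.1)."""
    if kernel is None:
        # Epanechnikov kernel, a common choice satisfying (A4)
        kernel = lambda u: 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    W = kernel((x[:, None] - X[None, :]) / h)   # weights K((x - X_k)/h)
    return (W @ Y) / W.sum(axis=1)              # weighted average of the Y_k

# toy illustration
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, 500)
Y = np.sin(X) + 0.3 * rng.standard_normal(500)
print(nadaraya_watson([0.0, 1.0], X, Y, h=0.3))
```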

This estimator has been exhaustively studied in the literature. See e.g. Bierens (1987) or Härdle et al. (2004, Section I.4) for comprehensive surveys. Usually, the following is assumed:

(A1) the sample {(X_k, Y_k) : k = 1, ..., n} is made up of independent and identically distributed observations, with E(Y_k|X_k = x) = m(x);

(A2) the functions m and f, the marginal density of X, are bounded and twice continuously differentiable on the support of X, say S_X, and the derivatives are bounded;

(A3) the function σ²(x) := var(Y|X = x) is continuous, bounded and bounded away from zero on S_X;

(A4) the kernel K is a bounded symmetric probability density on [−1, 1];

(A5) the bandwidth h := h_n satisfies h → 0 and nh → ∞, as n → ∞.

Note that, since the results derived in this paper will only concern a fixed value of x, the foregoing conditions (A2) and (A3) only have to hold in a


neighborhood of that x. With these assumptions, standard results state that, for any x in the interior¹ of S_X such that f(x) > 0,

$$E(\hat m(x)) = m(x) + \frac{1}{2}\,\psi_2 h^2\bigl(m''(x) + 2 f'(x) m'(x)/f(x)\bigr) + o(h^2) \qquad (1.2)$$

and

$$\mathrm{var}(\hat m(x)) = \frac{\nu_0\,\sigma^2(x)}{(nh)\,f(x)} + o\bigl((nh)^{-1}\bigr), \qquad (1.3)$$

as n → ∞, with ψ_q = ∫u^q K(u) du and ν_q = ∫u^q K²(u) du. From there, it follows that the asymptotic mean squared error of the estimator is

$$E\bigl((\hat m(x) - m(x))^2\bigr) = \frac{\nu_0\,\sigma^2(x)}{(nh)\,f(x)} + \frac{1}{4} h^4 \psi_2^2 \bigl(m''(x) + 2 f'(x) m'(x)/f(x)\bigr)^2 + o\bigl((nh)^{-1}\bigr) + o(h^4), \qquad (1.4)$$

which yields an asymptotic optimal value of h, defined as the minimizer of the integrated dominant terms, given by

$$h_{\mathrm{opt}} = C\, n^{-1/5}, \qquad (1.5)$$

with C a constant depending on m, f and σ².

Expressions of the bias and the variance of m̂(x), given by (1.2) and (1.3), are well established in nonparametric regression theory. Surprisingly enough, higher moments are, on the other hand, rarely mentioned in the literature. Let us only cite the related work of Doukhan and Lang (2009), who recently provided asymptotic bounds for ‖m̂(x) − m(x)‖_γ for arbitrary γ, as an application of more general results about moments of randomly weighted sums. In this paper, we tackle the question differently and derive explicit formulas (not only bounds) for

$$E\bigl((\hat m(x) - m(x))^{\gamma}\bigr), \qquad (1.6)$$

for any positive integer γ. It is acknowledged, however, that some proofs of the results below make explicit use of Theorem 2 of Doukhan and Lang (2009), as this result allows us to bypass a lengthy and technical part of the proof.

¹The interior of S_X is defined as the set of points lying at a distance of at least h from the boundaries of S_X. Indeed, the properties of the estimator are known to be slightly different close to the boundaries, as there are fewer observations, asymmetrically spread around x. See Wand and Jones (1995, Section 5.5) or Fan and Gijbels (1996, Section 2.2), among others, for a discussion of this topic.


Note that (1.2) and (1.4) actually give the asymptotic expression for this moment for γ = 1 and γ = 2. Theorems 2.1 and 2.2 below provide the result for any higher value of γ. To our knowledge, these are new results, at least under the synthesized form given here. Besides being interesting in their own right, these formulas will alleviate technical proofs of other results in kernel regression, as higher moments of the kernel estimator are often required. Our main results are presented in Section 2, with some examples. Interesting implications of these new results are also set out in Section 3. Some ideas for further applications are mentioned in Section 4, as a conclusion.

2 Main Results

2.1. Explicit Formulas for the Asymptotic Higher Moments. This paper addresses the problem of deriving the higher moments of the Nadaraya-Watson estimator. It will, therefore, be assumed throughout that these moments exist. This will actually follow from the existence of the (conditional) higher moments of the error ε. Consequently, we suppose:

(A6) any conditional moment of ε given X = x exists and is bounded.

This guarantees the existence of E((m̂(x) − m(x))^γ) for all positive integers γ. Note that the existence of E((m̂(x) − m(x))^γ), for a fixed value of γ, only requires the existence of the conditional moments of ε up to the order γ. However, Assumption (A6) is convenient in our setting, firstly as the focus here is on giving a generic expression for the higher moments holding true for any value of γ (given their existence), and secondly as we will make use of the whole family of moments {E((m̂(x) − m(x))^γ) : γ = 1, 2, ...} to highlight some interesting points in Section 2.3.

Now, assume first that the power γ appearing in (1.6) is even, and define α = γ/2. Assume also that the bandwidth is taken of optimal order, that is h ∼ n^{−1/5}, as stated by (1.5). Then, we have the following result.

Theorem 2.1. Under assumptions (A1)–(A6), for any nonnegative integer α, if h ∼ n^{−1/5}, it holds, for any x in the interior of S_X such that f(x) > 0,

$$E\bigl((\hat m(x)-m(x))^{2\alpha}\bigr) = \sum_{\kappa=\alpha}^{2\alpha} \frac{(2\alpha)!}{2^{\kappa}(2\kappa-2\alpha)!(2\alpha-\kappa)!}\, \frac{h^{2(2\kappa-2\alpha)}}{(nh)^{2\alpha-\kappa}}\, \nu_0^{2\alpha-\kappa}\psi_2^{2\kappa-2\alpha}\, \frac{\sigma^{2(2\alpha-\kappa)}(x)\bigl(m''(x)f(x)+2f'(x)m'(x)\bigr)^{2\kappa-2\alpha}}{f^{\kappa}(x)} + o\bigl(n^{-4\alpha/5}\bigr), \qquad (2.1)$$

as n → ∞.


Proof. See Appendix.

The next theorem is the analog for an odd power γ = 2α + 1, for any nonnegative integer α. Again, it is supposed that the bandwidth is of asymptotic optimal order.

Theorem 2.2. Under assumptions (A1)–(A6), for any nonnegative integer α, if h ∼ n^{−1/5}, it holds, for any x in the interior of S_X such that f(x) > 0,

$$E\bigl((\hat m(x) - m(x))^{2\alpha+1}\bigr) = \sum_{\kappa=\alpha+1}^{2\alpha+1} \frac{(2\alpha+1)!}{2^{\kappa}(2\kappa-(2\alpha+1))!((2\alpha+1)-\kappa)!}\, \frac{h^{2(2\kappa-(2\alpha+1))}}{(nh)^{(2\alpha+1)-\kappa}}\, \nu_0^{(2\alpha+1)-\kappa}\psi_2^{2\kappa-(2\alpha+1)}\, \frac{\sigma^{2((2\alpha+1)-\kappa)}(x)\bigl(m''(x)f(x)+2f'(x)m'(x)\bigr)^{2\kappa-(2\alpha+1)}}{f^{\kappa}(x)} + o\bigl(n^{-(4\alpha+2)/5}\bigr), \qquad (2.2)$$

as n → ∞.

The proof is entirely similar to that of Theorem 2.1, and is therefore omitted.

Theorems 2.1 and 2.2 give any asymptotic moment of (m̂(x) − m(x)), which makes them very powerful analytical tools. It is interesting to note that no conditional moments of ε given X = x higher than the conditional variance σ²(x) appear in these asymptotic expressions of higher moments of the estimator. Besides, from these results, it is not difficult to derive explicit expressions for the central moments of m̂(x), that is E((m̂(x) − E(m̂(x)))^γ). Those are provided by Corollaries 2.1 and 2.2.

Corollary 2.1. Under assumptions (A1)–(A6), for any nonnegative integer α, if h ∼ n^{−1/5}, it holds, for any x in the interior of S_X such that f(x) > 0,

$$E\bigl((\hat m(x) - E(\hat m(x)))^{2\alpha}\bigr) = \frac{(2\alpha)!}{2^{\alpha}\alpha!}\, \frac{\nu_0^{\alpha}}{(nh)^{\alpha}}\, \frac{\sigma^{2\alpha}(x)}{f^{\alpha}(x)} + o\bigl(n^{-4\alpha/5}\bigr),$$

as n → ∞.

Corollary 2.2. Under assumptions (A1)–(A6) and assuming that ξ₃(x) = E(ε³|X = x) is continuous at x, for any positive integer α, if h ∼ n^{−1/5}, it holds, for any x in the interior of S_X such that f(x) > 0,

$$E\bigl((\hat m(x) - E(\hat m(x)))^{2\alpha+1}\bigr) = \frac{(2\alpha+1)!}{3\times 2^{\alpha-1}(\alpha-1)!}\, \frac{\nu_0^{\alpha-1}}{(nh)^{\alpha+1}}\, \frac{\sigma^{2(\alpha-1)}(x)\,\xi_3(x)}{f^{\alpha+1}(x)} \int K^3(u)\,du + o\bigl(n^{-4(\alpha+1)/5}\bigr),$$

as n → ∞.

Sketches of the proofs of these two results are found in the Appendix. Remark that Corollary 2.2 requires an extra condition on the smoothness of ξ₃(x), the third (conditional) moment of the error term, which explicitly arises in the development only in this situation. Also, the result is not valid in the case α = 0, for evident reasons: E(m̂(x) − E(m̂(x))) = 0, trivially.

Remark 2.1. The results of Theorem 2.1, Theorem 2.2, Corollary 2.1 and Corollary 2.2 were derived for a bandwidth h of asymptotic optimal order. Nevertheless, analogous results hold for any order of h. Indeed, working with the asymptotic optimal bandwidth is actually the most complex situation to handle, in the sense that the bias and variance terms are then of equal order, so that neither can be neglected. On the other hand, when assuming h = o(n^{−1/5}) (the “undersmoothing” case) for example, the bias is seen to be asymptotically negligible with respect to the standard deviation (check expressions (1.2) and (1.3)). Then, it is not difficult to show that we get, for instance,

$$E\bigl((\hat m(x) - m(x))^{2\alpha}\bigr) = \frac{1}{(nh)^{\alpha} f^{\alpha}(x)}\, \frac{(2\alpha)!}{2^{\alpha}\alpha!}\, \nu_0^{\alpha}\sigma^{2\alpha}(x) + o\bigl((nh)^{-\alpha}\bigr),$$

as n → ∞ when h = o(n^{−1/5}). Incidentally, we see that the asymptotic values of E((m̂(x) − m(x))^{2α}) and E((m̂(x) − E(m̂(x)))^{2α}) are equal in this case, which obviously reflects the fact that |E(m̂(x)) − m(x)| is here asymptotically negligible in front of the stochastic term. In any case, expressions for any other order of the bandwidth can directly be deduced from the above results: only the dominant term in the sums appearing in (2.1) and (2.2) has to be considered.
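To make the formulas easier to use in practice, here is a small Python sketch evaluating the dominant parts of (2.1) and of Corollary 2.1 for plug-in values of the local quantities. The Epanechnikov kernel constants and the numerical inputs are illustrative assumptions, and the normalization by f^κ(x) follows the form of (2.1) as written above.

```python
from math import factorial

# Kernel constants for the Epanechnikov kernel K(u) = 0.75*(1 - u^2) on [-1, 1]
# (an illustrative choice): psi_2 = int u^2 K = 1/5, nu_0 = int K^2 = 3/5.
PSI2, NU0 = 1 / 5, 3 / 5

def even_moment_theorem21(alpha, n, h, f, df, sigma2, dm, d2m):
    """Dominant part of E((mhat(x) - m(x))^(2*alpha)) following (2.1).
    f, df = f(x), f'(x); sigma2 = sigma^2(x); dm, d2m = m'(x), m''(x)."""
    bias_core = d2m * f + 2 * df * dm                    # m''(x) f(x) + 2 f'(x) m'(x)
    total = 0.0
    for k in range(alpha, 2 * alpha + 1):
        coef = factorial(2 * alpha) / (2**k * factorial(2 * k - 2 * alpha)
                                       * factorial(2 * alpha - k))
        total += (coef * h**(2 * (2 * k - 2 * alpha)) / (n * h)**(2 * alpha - k)
                  * NU0**(2 * alpha - k) * PSI2**(2 * k - 2 * alpha)
                  * sigma2**(2 * alpha - k) * bias_core**(2 * k - 2 * alpha) / f**k)
    return total

def central_even_moment_cor21(alpha, n, h, f, sigma2):
    """Dominant part of E((mhat(x) - E mhat(x))^(2*alpha)) from Corollary 2.1."""
    return (factorial(2 * alpha) / (2**alpha * factorial(alpha))
            * (NU0 * sigma2 / (n * h * f))**alpha)

# alpha = 1 recovers the familiar MSE (1.4) and variance (1.3) at illustrative values
n, h = 1000, 1000**(-0.2)
print(even_moment_theorem21(1, n, h, f=0.25, df=0.0, sigma2=0.09, dm=1.0, d2m=-0.5))
print(central_even_moment_cor21(1, n, h, f=0.25, sigma2=0.09))
```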

2.2. Examples. Some examples of direct application of the previous results are given here. The first three examples obviously just recover well-known results, while Examples 2.4 and 2.5 go a step further. Suppose that h ∼ n^{−1/5}.


Example 2.1. Asymptotic bias. Take α = 0 in Theorem 2.2, and find

$$E(\hat m(x) - m(x)) = \frac{1}{2} h^2 \psi_2\bigl(m''(x) + 2 f'(x) m'(x)/f(x)\bigr) + o(h^2)$$

as n → ∞, in accord with (1.2).

Example 2.2. Asymptotic variance. Take α = 1 in Corollary 2.1, and find

$$E\bigl((\hat m(x) - E(\hat m(x)))^2\bigr) = \frac{\nu_0\,\sigma^2(x)}{(nh)\,f(x)} + o\bigl(n^{-4/5}\bigr) \qquad (2.3)$$

as n → ∞, in accord with (1.3).

Example 2.3. Asymptotic mean squared error. Take α = 1 in Theorem 2.1, and find

$$E\bigl((\hat m(x) - m(x))^2\bigr) = \frac{\nu_0\,\sigma^2(x)}{(nh)\,f(x)} + \frac{1}{4} h^4 \psi_2^2\bigl(m''(x) + 2 f'(x) m'(x)/f(x)\bigr)^2 + o\bigl(n^{-4/5}\bigr)$$

as n → ∞, in accord with (1.4).

Example 2.4. Asymptotic skewness. The skewness is the ratio of the third central moment to the third power of the standard deviation of a random variable, and quantifies the asymmetry of the underlying probability distribution. Take α = 1 in Corollary 2.2, and find

$$E\bigl((\hat m(x) - E(\hat m(x)))^3\bigr) = \frac{2}{(nh)^2 f^2(x)}\,\xi_3(x)\int K^3(u)\,du + o\bigl(n^{-8/5}\bigr)$$

as n → ∞, where ξ₃(x) = E(ε³|X = x) is assumed to be finite and continuous at x. Therefore, the asymptotic skewness of m̂(x) is found to be, from Example 2.2,

2ξ3(x)∫

K3(u)du

(nh)1/2f1/2(x)ν3/20 σ3(x)

+ o(n−2/5) n → ∞.
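For the record, the expression above is simply the ratio of the third central moment just obtained to the cube of the standard deviation implied by Example 2.2:

$$\frac{E\bigl((\hat m(x)-E(\hat m(x)))^3\bigr)}{\{\mathrm{var}(\hat m(x))\}^{3/2}} = \frac{2\,\xi_3(x)\int K^3(u)\,du}{(nh)^2 f^2(x)} \left(\frac{(nh)\,f(x)}{\nu_0\,\sigma^2(x)}\right)^{3/2}(1+o(1)) = \frac{2\,\xi_3(x)\int K^3(u)\,du}{(nh)^{1/2} f^{1/2}(x)\,\nu_0^{3/2}\sigma^3(x)}\,(1+o(1)).$$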

Example 2.5. Asymptotic kurtosis. The kurtosis is the ratio of the fourth central moment to the square of the variance of a random variable, and quantifies the “peakedness” of the underlying probability distribution. Take α = 2 in Corollary 2.1, and find

$$E\bigl((\hat m(x) - E(\hat m(x)))^4\bigr) = \frac{3}{(nh)^2 f^2(x)}\,\nu_0^2\sigma^4(x) + o\bigl(n^{-8/5}\bigr) \qquad (2.4)$$


as n → ∞. From Example 2.2, the asymptotic kurtosis of m̂(x) is therefore found to equal 3 + o(1). However, this first-order result is not very informative, as one usually works with the excess kurtosis, that is, the difference between the kurtosis and 3 (the kurtosis of the standard normal distribution). At this point, it can just be concluded that the excess kurtosis of m̂(x) tends to 0 as n grows to infinity. Nevertheless, explicitly writing the higher order terms in (2.3) and (2.4), in a tedious but direct way, shows that we have

$$E\bigl((\hat m(x) - E(\hat m(x)))^4\bigr) = \frac{3\nu_0^2\sigma^4(x)}{(nh)^2 f^2(x)} + \frac{3A(x)}{n^2} + \frac{3h^2 B(x)}{n^2} + \frac{\xi_4(x)\int K^4(u)\,du}{(nh)^3 f^3(x)} + o\bigl((nh)^{-3}\bigr)$$

and

$$\bigl\{E\bigl((\hat m(x) - E(\hat m(x)))^2\bigr)\bigr\}^2 = \frac{\nu_0^2\sigma^4(x)}{(nh)^2 f^2(x)} + \frac{A(x)}{n^2} + \frac{h^2 B(x)}{n^2} + o\bigl((nh)^{-3}\bigr)$$

as n → ∞, with A and B some bounded functions. Consequently, the ratio yields an asymptotic kurtosis equal to

$$3 + \frac{\xi_4(x)\int K^4(u)\,du}{nh\, f(x)\,\sigma^4(x)\,\nu_0^2} + o\bigl(n^{-4/5}\bigr), \qquad n \to \infty.$$

This, obviously, requires the existence and the boundedness of four derivatives of m and f, as well as the continuity of ξ₄ at x.
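The examples above are easy to probe by simulation. The sketch below compares the empirical variance and third central moment of m̂(x₀) over repeated samples with the dominant terms of Examples 2.2 and 2.4; the regression function, design density, error law and kernel are all illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, x0, reps = 2000, 0.0, 5000
h = n ** (-0.2)                                        # bandwidth of optimal order n^{-1/5}
K = lambda u: 0.75 * (1 - u**2) * (np.abs(u) <= 1)     # Epanechnikov kernel
nu0, intK3 = 3 / 5, 27 / 70                            # int K^2 and int K^3 for this kernel

m = np.sin                                             # true regression function
sigma = 0.3
xi3 = 2 * sigma**3                                     # E(eps^3) for eps = sigma*(Exp(1)-1)
f_x0 = 0.25                                            # X ~ Uniform(-2, 2), so f(x0) = 1/4

est = np.empty(reps)
for r in range(reps):
    X = rng.uniform(-2, 2, n)
    eps = sigma * (rng.exponential(1.0, n) - 1.0)      # skewed, mean-zero errors
    Y = m(X) + eps
    w = K((x0 - X) / h)
    est[r] = (w @ Y) / w.sum()

print("variance  empirical:", est.var(),
      " theory:", nu0 * sigma**2 / (n * h * f_x0))
cm3 = np.mean((est - est.mean())**3)
print("3rd c.m.  empirical:", cm3,
      " theory:", 2 * xi3 * intK3 / ((n * h)**2 * f_x0**2))
```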

2.3. Some Consequences.

2.3.1. Asymptotic Normality. Working with explicit expressions of the moments turns out to be a very powerful tool when analyzing random variables. Indeed, the expressions given in Subsection 2.1 directly imply several interesting results about the estimator, some of which are well known, others not so. For instance, the asymptotic normality of the estimator readily follows from Corollary 2.1 and Corollary 2.2, which can be rewritten

$$E\!\left(\left(\frac{(nh)^{1/2} f^{1/2}(x)}{\nu_0^{1/2}\sigma(x)}\,\bigl(\hat m(x) - E(\hat m(x))\bigr)\right)^{2\alpha}\right) = \frac{(2\alpha)!}{2^{\alpha}\alpha!} + o(1)$$

and

$$E\!\left(\left(\frac{(nh)^{1/2} f^{1/2}(x)}{\nu_0^{1/2}\sigma(x)}\,\bigl(\hat m(x) - E(\hat m(x))\bigr)\right)^{2\alpha+1}\right) = o(1),$$


as n → ∞. Since (2α)!/(2^α α!) and 0 are precisely the even and odd moments of the standard normal distribution, which is uniquely determined by its moments, we get

$$\frac{(nh)^{1/2} f^{1/2}(x)}{\nu_0^{1/2}\sigma(x)}\,\bigl(\hat m(x) - E(\hat m(x))\bigr) \xrightarrow{\;L\;} \mathcal{N}(0, 1).$$

Denoting λ = lim_{n→∞} nh⁵ and incorporating in the previous expression the bias term given by Example 2.1 completes an alternative proof of the well-known result

$$\sqrt{nh}\,\bigl(\hat m(x) - m(x)\bigr) \xrightarrow{\;L\;} Z_{b,v^2} \sim \mathcal{N}\bigl(b(x), v^2(x)\bigr), \qquad (2.5)$$

with b(x) = ½ λ^{1/2} ψ₂(m''(x) + 2f'(x)m'(x)/f(x)) and v²(x) = ν₀σ²(x)/f(x), not based on the usual Central Limit Theorem arguments (see, among others, Theorem 2.2.1 in Bierens, 1987).
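As a quick numerical cross-check of the moment values used in this argument (assuming SciPy is available), the closed form (2α)!/(2^α α!) can be compared with the even moments of the standard normal distribution:

```python
from math import factorial
from scipy.stats import norm

# (2*alpha)! / (2**alpha * alpha!) should equal E(Z^(2*alpha)) for Z ~ N(0, 1)
for alpha in range(1, 6):
    closed_form = factorial(2 * alpha) / (2**alpha * factorial(alpha))
    print(alpha, closed_form, norm.moment(2 * alpha))
```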

2.3.2. L_γ-errors. The convergence of moments towards the moments of the normal distribution also allows one to draw interesting conclusions about the uniform integrability of any sequence of the type (√(nh)(m̂(x) − E(m̂(x))))^γ or (√(nh)(m̂(x) − m(x)))^γ, for γ any positive integer. Besides, as the uniform integrability of (√(nh)(m̂(x) − m(x)))^{2α} = (√(nh)|m̂(x) − m(x)|)^{2α} implies, by Jensen's inequality, the uniform integrability of (√(nh)|m̂(x) − m(x)|)^{2α−1}, and as it follows from (2.5) that

$$\sqrt{nh}\,|\hat m(x) - m(x)| \xrightarrow{\;L\;} |Z_{b,v^2}|,$$

we have, for any positive integer γ,

$$(nh)^{\gamma/2}\, E\bigl(|\hat m(x) - m(x)|^{\gamma}\bigr) \to E\bigl(|Z_{b,v^2}|^{\gamma}\bigr),$$

from which the asymptotic L_γ-error of the estimator is readily deduced. Indeed, it is known that, for any normally distributed random variable Z_{b,v²}, |Z_{b,v²}|/v follows a non-central χ distribution with one degree of freedom and non-centrality parameter b(x)/v(x). From the properties of that distribution, we get

$$(nh)^{\gamma/2}\, E\bigl(|\hat m(x) - m(x)|^{\gamma}\bigr) \to v^{\gamma}(x) \sum_{q=0}^{\gamma} \binom{\gamma}{q} \left(\frac{b(x)}{v(x)}\right)^{\gamma-q} \left(I_q\!\left(-\frac{b(x)}{v(x)}\right) + (-1)^{\gamma-q} I_q\!\left(\frac{b(x)}{v(x)}\right)\right),$$

where

$$I_q(t) = \int_t^{\infty} u^q\, d\Phi(u)$$

is the incomplete moment of a standard normal distribution, with Φ the standard normal cumulative distribution function and φ the standard normal density. Seeing that I₀(t) = 1 − Φ(t), I₁(t) = I₁(−t) = φ(t) and I_q(t) = t^{q−1}φ(t) + (q−1)I_{q−2}(t), we can explicitly write any L_γ-error of the estimator. For instance, the L₁-error, that is the Mean Absolute Error, of the Nadaraya-Watson estimator is asymptotically given by

$$E\bigl(|\hat m(x) - m(x)|\bigr) = (nh)^{-1/2}\left( v(x)\sqrt{\frac{2}{\pi}}\,\exp\!\left(-\frac{b^2(x)}{2v^2(x)}\right) + b(x)\left(1 - 2\Phi\!\left(-\frac{b(x)}{v(x)}\right)\right)\right) + o\bigl((nh)^{-1/2}\bigr),$$

as n → ∞. Wand (1990) gives the MAE of the Gasser-Müller nonparametric regression estimator, but we are not aware of anything similar in the abundant Nadaraya-Watson estimator literature.
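The recursion for I_q makes these limits straightforward to evaluate numerically. The sketch below is one possible implementation (assuming SciPy is available); b(x) and v(x) are assumed to have been computed separately from (1.2)-(1.3), and the numerical values in the usage lines are purely illustrative.

```python
import numpy as np
from math import comb
from scipy.stats import norm

def incomplete_moment(q, t):
    """I_q(t) = int_t^inf u^q phi(u) du, via I_0, I_1 and the recursion above."""
    if q == 0:
        return 1.0 - norm.cdf(t)
    if q == 1:
        return norm.pdf(t)
    return t**(q - 1) * norm.pdf(t) + (q - 1) * incomplete_moment(q - 2, t)

def asymptotic_Lgamma_error(gamma, b, v, n, h):
    """Leading term of E(|mhat(x) - m(x)|^gamma), following Section 2.3.2."""
    r = b / v
    s = sum(comb(gamma, q) * r**(gamma - q)
            * (incomplete_moment(q, -r) + (-1)**(gamma - q) * incomplete_moment(q, r))
            for q in range(gamma + 1))
    return (n * h)**(-gamma / 2) * v**gamma * s

# gamma = 1 reproduces the MAE formula displayed above (sanity check)
b, v, n, h = 0.2, 1.0, 1000, 1000**(-0.2)
mae = (n * h)**(-0.5) * (v * np.sqrt(2 / np.pi) * np.exp(-b**2 / (2 * v**2))
                         + b * (1 - 2 * norm.cdf(-b / v)))
print(asymptotic_Lgamma_error(1, b, v, n, h), mae)
```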

2.3.3. Large Deviation Principle. Another important implication of the derived results concerns a Large Deviation Principle (LDP) for the NW estimator. Generally speaking, for a given sequence of random variables, an LDP provides asymptotic approximations of tail probabilities, based on the moment-generating function (mgf), which are usually more accurate than the ones ensuing from the Central Limit Theorem. See the classical references on the topic, such as Dembo and Zeitouni (1998) or Bucklew (2004). Corollaries 2.1 and 2.2 allow us to easily derive the mgf associated with the sequence {m̂(x) − E(m̂(x))}, provided this function exists. Note that a much deeper analysis of the remainder small-o terms in Corollaries 2.1 and 2.2 would be required to derive primitive conditions for the existence of that mgf, but this is probably beside the point of this paper. We will therefore merely assume that the mgf exists, which amounts to saying that the sum (2.6) below is convergent for all t ∈ (−ε, ε), for some ε > 0. This is, for instance, guaranteed if the error ε has bounded support in a neighborhood of x. Then, the mgf associated with {m̂(x) − E(m̂(x))}, rescaled by nh as required for the large deviation analysis, is

$$M_n(t) = E\Bigl(\exp\bigl(t\, nh\,(\hat m(x) - E(\hat m(x)))\bigr)\Bigr) = \sum_{q=0}^{\infty} \frac{(nh\, t)^q}{q!}\, E\bigl((\hat m(x) - E(\hat m(x)))^q\bigr). \qquad (2.6)$$

Replacing the even and odd moments given by Corollaries 2.1 and 2.2, we find, after some algebraic reorganization, that

$$M_n(t) = \exp\!\left(\frac{nh\, t^2 \nu_0 \sigma^2(x)}{2 f(x)}\right)\left(1 + \frac{nh\, t^3 \xi_3(x)\int K^3(u)\,du}{3 f^2(x)}\right)(1 + o(1)).$$

Then, standard results of LDP theory can be stated. For instance, the Gärtner-Ellis theorem applies as follows. The log-mgf is seen to be

$$\varphi_n(t) := \log M_n(t) = \frac{nh\, t^2 \nu_0 \sigma^2(x)}{2 f(x)} + \log\!\left(1 + \frac{nh\, t^3 \xi_3(x)\int K^3(u)\,du}{3 f^2(x)}\right) + o(1),$$

so that

$$\varphi(t) := \lim_{n\to\infty} \frac{1}{nh}\,\varphi_n(t) = \frac{t^2 \nu_0 \sigma^2(x)}{2 f(x)}.$$

It is easily checked that D_ϕ := {t ∈ R : ϕ(t) < ∞} has nonempty interior, that ϕ is differentiable on D_ϕ, that the origin belongs to D_ϕ and that ϕ is steep (i.e., if t_n → ±∞, |ϕ′(t_n)| → ∞). Then, the rate function I(z) = sup_t(tz − ϕ(t)), the supremum being attained at t = zf(x)/(ν₀σ²(x)), is found to be

$$I(z) = \frac{z^2 f(x)}{2\nu_0\sigma^2(x)},$$

and for any 0 < a < b < ∞, the theorem states that

$$-\inf_{z\in(a,b)} I(z) \le \liminf_{n\to\infty} \frac{1}{nh}\,\log P\bigl((\hat m(x) - E(\hat m(x))) \in (a,b)\bigr)$$

and

$$\limsup_{n\to\infty} \frac{1}{nh}\,\log P\bigl((\hat m(x) - E(\hat m(x))) \in [a,b]\bigr) \le -\inf_{z\in[a,b]} I(z).$$


Now, as the continuous nature of the random variables m̂(x) − E(m̂(x)) allows us to write

$$P\bigl((\hat m(x) - E(\hat m(x))) \in (a, b)\bigr) = P\bigl((\hat m(x) - E(\hat m(x))) \in [a, b]\bigr),$$

and as

$$-\inf_{z\in(a,b)} I(z) = -\inf_{z\in[a,b]} I(z) = -\frac{a^2 f(x)}{2\nu_0\sigma^2(x)},$$

b can be taken arbitrarily large so that

$$\lim_{n\to\infty} \frac{1}{nh}\,\log P\bigl((\hat m(x) - E(\hat m(x))) > a\bigr) = -\frac{a^2 f(x)}{2\nu_0\sigma^2(x)}.$$

Note that the bias term, of order O(h²), cannot alter the result, and it can also be stated that

$$\lim_{n\to\infty} \frac{1}{nh}\,\log P\bigl((\hat m(x) - m(x)) > a\bigr) = -\frac{a^2 f(x)}{2\nu_0\sigma^2(x)}.$$

The same argument with −∞ < b < a < 0 leads to

$$\lim_{n\to\infty} \frac{1}{nh}\,\log P\bigl((\hat m(x) - m(x)) < a\bigr) = -\frac{a^2 f(x)}{2\nu_0\sigma^2(x)},$$

so that finally it holds

$$\lim_{n\to\infty} \frac{1}{nh}\,\log P\bigl(|\hat m(x) - m(x)| > a\bigr) = -\frac{a^2 f(x)}{2\nu_0\sigma^2(x)},$$

for any positive a. See Theorem 2 and Corollary 2 of Louani (1999) for related results.
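In practice, the limit above suggests the crude log-scale approximation P(|m̂(x) − m(x)| > a) ≈ exp(−nh a²f(x)/(2ν₀σ²(x))). The small sketch below compares it with the CLT-based tail at illustrative plug-in values; since the LDP only controls the exponential rate, the prefactor of such an approximation is not meaningful.

```python
import numpy as np
from scipy.stats import norm

def ldp_tail_approx(a, n, h, f_x, sigma2_x, nu0):
    """exp(-nh * I(a)) with rate function I(a) = a^2 f(x) / (2 nu0 sigma^2(x))."""
    return np.exp(-n * h * a**2 * f_x / (2 * nu0 * sigma2_x))

# comparison with the CLT-based two-sided tail for the same deviation a
n, h, f_x, sigma2_x, nu0, a = 2000, 2000**(-0.2), 0.25, 0.09, 0.6, 0.1
clt_tail = 2 * norm.sf(a * np.sqrt(n * h * f_x / (nu0 * sigma2_x)))
print(ldp_tail_approx(a, n, h, f_x, sigma2_x, nu0), clt_tail)
```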

3 Other Kernel Estimators

The use of higher-order kernels in (1.1) has been suggested in order to reduce the bias of the estimator. The order of the kernel function K is defined as the integer J such that

$$\int u^{j} K(u)\, du = 0 \ \text{ for } j = 1, \ldots, J-1, \qquad \int u^{J} K(u)\, du \neq 0.$$


Kernels satisfying Assumption (A4) above are clearly of second order, as ∫u²K(u) du must be positive for non-degenerate probability density functions K. If that assumption is relaxed, though, we can work with a higher-order kernel, say of order J > 2, for which ψ₂ = ∫u²K(u) du = 0. As the asymptotic bias of the estimator m̂(x) behaves like ½h²ψ₂(m''(x) + 2f'(x)m'(x)/f(x)) + o(h²), the dominant bias term vanishes and thereby the bias is reduced. See Wand and Jones (1995, Section 2.8). Note that it is still assumed that K is symmetric, so that the kernel order J is necessarily even.

Of course, this bias reduction impacts the expression of the MSE of the estimator, and therefore the order of the optimal bandwidth (1.5). If m and f are assumed to be J times differentiable in a neighborhood of x, then the bias becomes of order h^J and the best MSE of order n^{−2J/(2J+1)}, with h ∼ n^{−1/(2J+1)}. It can be shown, in a way identical to the proof of Theorem 2.1, that with a kernel such that ψ₁ = ψ₂ = ⋯ = ψ_{J−1} = 0 and ψ_J ≠ 0,

$$E\bigl((\hat m(x) - m(x))^{2\alpha}\bigr) = \sum_{\kappa=\alpha}^{2\alpha} \Biggl\{ \frac{(2\alpha)!}{2^{2\alpha-\kappa}(2\kappa-2\alpha)!(2\alpha-\kappa)!}\, \frac{h^{J(2\kappa-2\alpha)}}{(nh)^{2\alpha-\kappa}}\, \frac{\nu_0^{2\alpha-\kappa}\sigma^{2(2\alpha-\kappa)}(x)}{f^{2\alpha-\kappa}(x)} \left(\frac{\psi_J}{J!}\right)^{2\kappa-2\alpha} \left( m^{(J)}(x) + \sum_{q=1}^{J-1}\binom{J}{q}\frac{m^{(J-q)}(x) f^{(q)}(x)}{f(x)} \right)^{2\kappa-2\alpha} \Biggr\} + o\bigl(n^{-2J\alpha/(2J+1)}\bigr).$$

The analog for an odd moment is

$$E\bigl((\hat m(x) - m(x))^{2\alpha+1}\bigr) = \sum_{\kappa=\alpha+1}^{2\alpha+1} \Biggl\{ \frac{(2\alpha+1)!}{2^{(2\alpha+1)-\kappa}(2\kappa-(2\alpha+1))!((2\alpha+1)-\kappa)!}\, \frac{h^{J(2\kappa-(2\alpha+1))}}{(nh)^{(2\alpha+1)-\kappa}}\, \frac{\nu_0^{(2\alpha+1)-\kappa}\sigma^{2((2\alpha+1)-\kappa)}(x)}{f^{(2\alpha+1)-\kappa}(x)} \left(\frac{\psi_J}{J!}\right)^{2\kappa-(2\alpha+1)} \left( m^{(J)}(x) + \sum_{q=1}^{J-1}\binom{J}{q}\frac{m^{(J-q)}(x) f^{(q)}(x)}{f(x)} \right)^{2\kappa-(2\alpha+1)} \Biggr\} + o\bigl(n^{-(2\alpha+1)J/(2J+1)}\bigr).$$


For instance, an explicit expression for the bias of the estimator when a fourth-order kernel K is used is

$$E(\hat m(x)) - m(x) = \frac{1}{24} h^4 \psi_4 \left( m^{(\mathrm{IV})}(x) + 4\,\frac{m'''(x) f'(x)}{f(x)} + 6\,\frac{m''(x) f''(x)}{f(x)} + 4\,\frac{m'(x) f'''(x)}{f(x)} \right) + o(h^4),$$

provided m and f are smooth enough. Finally, as the order of the kernel only impacts the bias terms in the above expansions, the expressions of the central moments (Corollaries 2.1 and 2.2) remain unchanged whatever the order J. It must be noted that the asymptotic reduction of bias offered by higher-order kernels often goes unnoticed with finite sample sizes. That is the reason why many authors have advised against using them, arguing that the price to pay in terms of interpretability and plausibility (due to the negative values taken by the kernel) is too high for the improvement they yield in practice.

In the same spirit, similar results for multivariate nonparametric regression or for other kernel smoothers, such as the local linear estimator, could theoretically be derived. Nevertheless, the development quickly becomes hardly manageable. For instance, for the local linear estimator, a result like (2.1) should be derived from

$$\frac{(N_0 D_2 - N_1 D_1)^{2\alpha}}{(D_0 D_2 - D_1^2)^{2\alpha}} = \frac{\sum_{\beta=0}^{2\alpha}\binom{2\alpha}{\beta}(-1)^{2\alpha-\beta}\, N_0^{\beta} D_2^{\beta} N_1^{2\alpha-\beta} D_1^{2\alpha-\beta}}{\sum_{\beta=0}^{2\alpha}\binom{2\alpha}{\beta}(-1)^{2\alpha-\beta}\, D_0^{\beta} D_2^{\beta} D_1^{4\alpha-2\beta}},$$

where N_q = Σ_k K((x − X_k)/h)(X_k − x)^q (Y_k − m(x)) and D_q = Σ_k K((x − X_k)/h)(X_k − x)^q, rather than from N₀^{2α}/D₀^{2α} for the Nadaraya-Watson estimator (see (A.2) in the proof of Theorem 2.1 in the Appendix). This makes the development intractable, and this work does not venture in that direction.

4 Concluding Remarks

In this paper, explicit formulas for any asymptotic moment of the Nadaraya-Watson kernel regression estimator are derived, therefore going further than the usual well-known asymptotic bias and variance formulas. As an illustration, the explicit expressions for the asymptotic skewness and kurtosis of the NW estimator are shown, and their interest when deriving alternative proofs of essential results such as the CLT, L_γ-convergence or an LDP is set out. The main results are proved for a bandwidth of asymptotic optimal order, but can easily be adapted to any other bandwidth choice, as Remark 2.1 points out. One can expect that these expressions could be used in many situations. We have, for instance, in mind the search for confidence intervals for nonparametric regression estimators. The quality of such intervals based on the asymptotic normality of m̂(x) (Cao-Abad, 1991) could probably be improved, in finite samples, by a correction for the skewness and the kurtosis of the estimator. Estimation of these quantities could be based on the results we obtained, and then injected into Edgeworth expansions (Hall, 1992a, b) in order to make the asymptotic approximation of the distribution more accurate. It could be worth comparing the performances of such corrected asymptotic confidence intervals with bootstrap-based confidence intervals (Härdle and Mammen, 1991), for example. This idea is investigated in another paper. In any case, higher moments of estimators often arise in technical calculations, and this gives a prime theoretical interest to the established results.

References

Bierens, H.J. (1987). Kernel estimators of regression functions. In Advances in Econometrics (T.F. Bewley, ed.), Cambridge University Press, 99–144.

Bucklew, J.A. (2004). Introduction to Rare Event Simulation. Springer, New York.

Cao-Abad, R. (1991). Rate of convergence for the wild bootstrap in nonparametric regression. Ann. Statist., 19, 2226–2231.

Dembo, A. and Zeitouni, O. (1998). Large Deviations Techniques and Applications, 2nd ed. Springer-Verlag, New York.

Doukhan, P. and Lang, G. (2009). Evaluation for moments of a ratio with application to regression estimation. Bernoulli, 15, 1259–1286.

Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. Chapman and Hall, London.

Hall, P. (1992a). The Bootstrap and Edgeworth Expansion. Springer-Verlag, New York.

Hall, P. (1992b). On the removal of skewness by transformation. J. R. Statist. Soc. B, 54, 221–228.

Härdle, W. and Mammen, E. (1991). Bootstrap methods in nonparametric regression. In Nonparametric Functional Estimation and Related Topics (G. Roussas, ed.), Kluwer Academic Publishers, 111–123.

Härdle, W., Müller, M., Sperlich, S. and Werwatz, A. (2004). Nonparametric and Semiparametric Models. An Introduction. Springer-Verlag, New York.

Louani, D. (1999). Some large deviations limit theorems in conditional nonparametric statistics. Statistics, 33, 171–196.

Nadaraya, E.A. (1964). On estimating regression. Theory Probab. Applic., 9, 141–142.

Wand, M.P. (1990). On exact L1 rates in nonparametric kernel regression. Scand. J. Statist., 251–256.

Wand, M.P. and Jones, M.C. (1995). Kernel Smoothing. Chapman and Hall, London.

Watson, G.S. (1964). Smooth regression analysis. Sankhyā A, 26, 359–372.

A Appendix

A.1. Proof of Theorem 2.1. First of all, see that we can write

$$\hat m(x) - m(x) = \frac{\sum_{k=1}^{n} K\!\left(\frac{x - X_k}{h}\right)(Y_k - m(x))}{\sum_{k=1}^{n} K\!\left(\frac{x - X_k}{h}\right)}, \qquad (\mathrm{A.1})$$

so that we have

$$(\hat m(x) - m(x))^{2\alpha} = \frac{(nh)^{-2\alpha}\Bigl(\sum_{k=1}^{n} K\!\left(\frac{x - X_k}{h}\right)(Y_k - m(x))\Bigr)^{2\alpha}}{(nh)^{-2\alpha}\Bigl(\sum_{k=1}^{n} K\!\left(\frac{x - X_k}{h}\right)\Bigr)^{2\alpha}} := \frac{N_{2\alpha}}{D_{2\alpha}}. \qquad (\mathrm{A.2})$$

The numerator N_{2α} can be developed by the multinomial theorem as

$$N_{2\alpha} = (nh)^{-2\alpha} \sum_{(\alpha_1,\alpha_2,\ldots,\alpha_n)\in R_{2\alpha}} \binom{2\alpha}{\alpha_1\,\alpha_2\,\ldots\,\alpha_n} \prod_{k=1}^{n} K^{\alpha_k}\!\left(\frac{x - X_k}{h}\right)(Y_k - m(x))^{\alpha_k},$$

where R_{2α} = {(α₁, α₂, ..., α_n) ∈ N^n : Σ_k α_k = 2α} and

$$\binom{2\alpha}{\alpha_1\,\alpha_2\,\ldots\,\alpha_n} = \frac{(2\alpha)!}{\alpha_1!\,\alpha_2!\,\ldots\,\alpha_n!}$$

are the multinomial coefficients. As we have E(N_{2α}) = E(E(N_{2α}|{X_k})) and the X_k's are independent, we can write

$$E(N_{2\alpha}) = (nh)^{-2\alpha} \sum_{(\alpha_1,\alpha_2,\ldots,\alpha_n)\in R_{2\alpha}} \binom{2\alpha}{\alpha_1\,\alpha_2\,\ldots\,\alpha_n} \prod_{k=1}^{n} E\Bigl( K^{\alpha_k}\!\left(\tfrac{x - X_k}{h}\right) E\bigl((Y_k - m(x))^{\alpha_k}\,\big|\,X_k\bigr) \Bigr). \qquad (\mathrm{A.3})$$


For α_k = 0, E(K^{α_k}((x − X_k)/h) E((Y_k − m(x))^{α_k}|X_k)) is trivially equal to 1, while for α_k > 0, we have

$$E\bigl((Y_k - m(x))^{\alpha_k}\,\big|\,X_k = \cdot\bigr) = E\!\left( \sum_{\beta=0}^{\alpha_k}\binom{\alpha_k}{\beta}(Y_k - m(\cdot))^{\alpha_k-\beta}(m(\cdot) - m(x))^{\beta}\,\Big|\,X_k = \cdot \right) = \sum_{\beta=0}^{\alpha_k}\binom{\alpha_k}{\beta}(m(\cdot) - m(x))^{\beta}\, E\bigl((Y_k - m(\cdot))^{\alpha_k-\beta}\,\big|\,X_k = \cdot\bigr) = \sum_{\beta=0}^{\alpha_k}\binom{\alpha_k}{\beta}(m(\cdot) - m(x))^{\beta}\,\xi_{\alpha_k-\beta}(\cdot) =: \varphi_{\alpha_k,x}(\cdot), \qquad (\mathrm{A.4})$$

where we defined

$$\xi_{\alpha_k}(\cdot) := E\bigl(\varepsilon^{\alpha_k}\,\big|\,X = \cdot\bigr). \qquad (\mathrm{A.5})$$

As E(K^{α_k}((x − X_k)/h) φ_{α_k,x}(X_k)) = ∫K^{α_k}((x − z)/h) φ_{α_k,x}(z) f(z) dz, with the change of variable u = (x − z)/h and the continuity of φ_{α_k,x} at x implied by Assumptions (A2) and (A6), we can write

$$E\Bigl(K^{\alpha_k}\!\left(\tfrac{x - X_k}{h}\right)\varphi_{\alpha_k,x}(X_k)\Bigr) = h\int K^{\alpha_k}(u)\,\varphi_{\alpha_k,x}(x - uh) f(x - uh)\,du = h\,\varphi_{\alpha_k,x}(x) f(x)\int K^{\alpha_k}(u)\,du + o(h). \qquad (\mathrm{A.6})$$

In particular, as φ_{1,x} is easily seen to be twice differentiable from Assumption (A2), it follows, in the particular case α_k = 1 and using the symmetry of K, that

$$E\Bigl(K\!\left(\tfrac{x - X_k}{h}\right)\varphi_{1,x}(X_k)\Bigr) = h\,\varphi_{1,x}(x) f(x) + \frac{1}{2} h^3 (\varphi_{1,x} f)''(x)\int u^2 K(u)\,du + o(h^3). \qquad (\mathrm{A.7})$$


Note also that, from (A.4), it directly appears that φ_{α_k,x}(x) = ξ_{α_k}(x), so that φ_{1,x}(x) = 0. Put (A.6) and (A.7) in (A.3) to find

$$E(N_{2\alpha}) = (nh)^{-2\alpha} \sum_{(\alpha_1,\alpha_2,\ldots,\alpha_n)\in R_{2\alpha}} \binom{2\alpha}{\alpha_1\,\alpha_2\,\ldots\,\alpha_n} \prod_{k:\alpha_k=1}\Bigl( \tfrac{1}{2} h^3 (\varphi_{1,x} f)''(x)\int u^2 K(u)\,du + o(h^3) \Bigr) \prod_{k:\alpha_k>1}\Bigl( h\,\xi_{\alpha_k}(x) f(x)\int K^{\alpha_k}(u)\,du + o(h) \Bigr).$$

Now, denote κ₁(α₁, α₂, ..., α_n) := #{k : α_k > 0} and κ₂(α₁, α₂, ..., α_n) := #{k : α_k = 1}, and see that we can write, distinguishing between the possible values of κ₁ in R_{2α},

$$E(N_{2\alpha}) = (nh)^{-2\alpha} \sum_{\kappa=1}^{2\alpha} h^{\kappa} f^{\kappa}(x) \sum_{(\alpha_1,\alpha_2,\ldots,\alpha_n)\in R^{\kappa}_{2\alpha}} \binom{2\alpha}{\alpha_1\,\alpha_2\,\ldots\,\alpha_n} \prod_{k:\alpha_k=1}\Bigl( \tfrac{1}{2} h^2 f^{-1}(x)(\varphi_{1,x} f)''(x)\int u^2 K(u)\,du + o(h^2) \Bigr) \prod_{k:\alpha_k>1}\Bigl( \xi_{\alpha_k}(x)\int K^{\alpha_k}(u)\,du + o(1) \Bigr), \qquad (\mathrm{A.8})$$

where R^κ_{2α} := {(α₁, α₂, ..., α_n) ∈ R_{2α} : κ₁(α₁, α₂, ..., α_n) = κ}. Also, it is not difficult to see that, for any κ,

$$\min_{(\alpha_1,\alpha_2,\ldots,\alpha_n)\in R^{\kappa}_{2\alpha}} \kappa_2(\alpha_1, \alpha_2, \ldots, \alpha_n) = \max(2\kappa - 2\alpha, 0), \qquad (\mathrm{A.9})$$

so that each term in the sum Σ_{(α₁,α₂,...,α_n)∈R^κ_{2α}} in (A.8) is at most of order O(h^{2 max(2κ−2α,0)}). Moreover, as #R^κ_{2α} = O(n^κ), each term in the sum Σ_{κ=1}^{2α} in (A.8) is at most O(n^κ h^{κ+2 max(2κ−2α,0)}).

Try now to maximize this order with respect to κ, to identify which terms in the sum are dominant. If κ ≤ α, the order is O((nh)^κ), which is clearly maximized when κ = α, since nh → ∞, and leads to a maximum order O((nh)^α), that is O(n^{4α/5}) as h ∼ n^{−1/5}. Now, if κ > α, the order is O(n^κ h^{κ+2(2κ−2α)}), that is O(n^{4α/5}) for any κ, as h ∼ n^{−1/5}. Therefore, the dominant terms in the sum Σ_{κ=1}^{2α} in (A.8) are all the terms for κ from α to 2α. Then, compute these terms. For a fixed κ, the maximum order in the sum Σ_{(α₁,α₂,...,α_n)∈R^κ_{2α}} is attained for vectors (α₁, α₂, ..., α_n) for which the minimum possible number of components are equal to one, that is 2κ − 2α, following (A.9). Thus, it remains 2α − (2κ − 2α) = 4α − 2κ multiplicities to be shared out between κ − (2κ − 2α) = 2α − κ components, without any among these ones equal to one. The only possibility is to have these 2α − κ components equal to 2. Hence, the vectors from R^κ_{2α} leading to the maximum order are those with (2κ − 2α) components equal to 1, (2α − κ) components equal to 2, and the others equal to zero. There are \binom{n}{κ}\binom{κ}{2κ−2α} such vectors: the former binomial coefficient is for the number of possibilities to select the κ non-zero components among n, the latter for the number of possibilities to select the 2κ − 2α components equal to 1 among those κ. We can now write, from (A.8) and the preceding argument,

$$E(N_{2\alpha}) = (nh)^{-2\alpha} \sum_{\kappa=\alpha}^{2\alpha} h^{\kappa} f^{\kappa}(x) \binom{n}{\kappa}\binom{\kappa}{2\kappa-2\alpha} \frac{(2\alpha)!}{2^{2\alpha-\kappa}} \left( \xi_2(x)\int K^2(u)\,du + O(h^2) \right)^{2\alpha-\kappa} \left( \tfrac{1}{2} h^2 (\varphi_{1,x} f)''(x)/f(x)\int u^2 K(u)\,du + o(h^2) \right)^{2\kappa-2\alpha} + \text{lower order terms}. \qquad (\mathrm{A.10})$$
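The counting step used to obtain (A.10) is easy to confirm by brute force for small values; the block below enumerates, for illustrative n, α, κ, the vectors of R^κ_{2α} with (2κ − 2α) components equal to 1 and (2α − κ) components equal to 2, and compares the count with \binom{n}{κ}\binom{κ}{2κ−2α}.

```python
from itertools import product
from math import comb

n, alpha, kappa = 6, 2, 3                    # small illustrative values
target = 2 * alpha
count = 0
for vec in product(range(target + 1), repeat=n):
    if sum(vec) != target or sum(v > 0 for v in vec) != kappa:
        continue
    if (sum(v == 1 for v in vec) == 2 * kappa - 2 * alpha
            and sum(v == 2 for v in vec) == 2 * alpha - kappa):
        count += 1
print(count, comb(n, kappa) * comb(kappa, 2 * kappa - 2 * alpha))   # both equal 60
```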

Develop

$$\binom{n}{\kappa}\binom{\kappa}{2\kappa-2\alpha}\frac{(2\alpha)!}{2^{2\alpha-\kappa}} = \frac{n!}{\kappa!(n-\kappa)!}\,\frac{\kappa!}{(2\kappa-2\alpha)!(2\alpha-\kappa)!}\,\frac{(2\alpha)!}{2^{2\alpha-\kappa}} = n(n-1)\cdots(n-\kappa+1)\,\frac{(2\alpha)!}{2^{2\alpha-\kappa}(2\kappa-2\alpha)!(2\alpha-\kappa)!} = n^{\kappa}\,\frac{(2\alpha)!}{2^{2\alpha-\kappa}(2\kappa-2\alpha)!(2\alpha-\kappa)!} + O(n^{\kappa-1})$$

and, from (A.4)–(A.5),

$$\varphi_{1,x}(x) = \xi_1(x) \equiv 0, \qquad \varphi'_{1,x}(x) = m'(x), \qquad \varphi''_{1,x}(x) = m''(x),$$

so that (φ_{1,x} f)''(x) = m''(x) f(x) + 2 m'(x) f'(x).


Finally, seeing that ξ₂(x) = σ²(x), we have

$$E(N_{2\alpha}) = \sum_{\kappa=\alpha}^{2\alpha} (nh)^{\kappa-2\alpha} h^{2(2\kappa-2\alpha)} f^{\kappa}(x)\, \frac{(2\alpha)!}{2^{2\alpha-\kappa}(2\kappa-2\alpha)!(2\alpha-\kappa)!}\, \sigma^{2(2\alpha-\kappa)}(x)\,\nu_0^{2\alpha-\kappa}\, \frac{1}{2^{2\kappa-2\alpha}}\bigl(m''(x) + 2 m'(x) f'(x)/f(x)\bigr)^{2\kappa-2\alpha}\psi_2^{2\kappa-2\alpha} + o\bigl(n^{-4\alpha/5}\bigr). \qquad (\mathrm{A.11})$$

Now for the denominator in (A.2). From the same argument as above, we write

$$E(D_{2\alpha}) = (nh)^{-2\alpha} \sum_{(\alpha_1,\alpha_2,\ldots,\alpha_n)\in R_{2\alpha}} \binom{2\alpha}{\alpha_1\,\alpha_2\,\ldots\,\alpha_n} \prod_{k=1}^{n} E\Bigl(K^{\alpha_k}\!\left(\tfrac{x - X_k}{h}\right)\Bigr).$$

We easily find that E(K^{α_k}((x − X_k)/h)) = 1 if α_k = 0 and

$$E\Bigl(K^{\alpha_k}\!\left(\tfrac{x - X_k}{h}\right)\Bigr) = h f(x)\int K^{\alpha_k}(u)\,du + O(h^3)$$

if not, so that

$$E(D_{2\alpha}) = (nh)^{-2\alpha} \sum_{\kappa=1}^{2\alpha} h^{\kappa} f^{\kappa}(x) \sum_{(\alpha_1,\alpha_2,\ldots,\alpha_n)\in R^{\kappa}_{2\alpha}} \binom{2\alpha}{\alpha_1\,\alpha_2\,\ldots\,\alpha_n} \prod_{k:\alpha_k>0}\Bigl( \int K^{\alpha_k}(u)\,du + O(h^2) \Bigr).$$

Here, again as #R^κ_{2α} = O(n^κ), it is clear that the term attaining the maximum order is the one with κ = 2α, as nh → ∞. In this case, the corresponding vectors (α₁, α₂, ..., α_n) are those with 2α components equal to 1, and the others zero. As there are \binom{n}{2α} such vectors, we can write

$$E(D_{2\alpha}) = (nh)^{-2\alpha} h^{2\alpha} f^{2\alpha}(x) \binom{n}{2\alpha}(2\alpha)!\,(1 + O(h^2))^{2\alpha} + O\bigl((nh)^{-1}\bigr),$$

that is

$$E(D_{2\alpha}) = f^{2\alpha}(x)(1 + O(h^2)) + O\bigl((nh)^{-1}\bigr). \qquad (\mathrm{A.12})$$


The variance also easily follows, as var(D_{2α}) = E((D_{2α})²) − (E(D_{2α}))², and as (D_{2α})² = D_{4α} we have, by (A.12),

$$\mathrm{var}(D_{2\alpha}) = \bigl(f^{4\alpha}(x)(1 + O(h^2)) + O((nh)^{-1})\bigr) - \bigl(f^{2\alpha}(x)(1 + O(h^2)) + O((nh)^{-1})\bigr)^2 = O\bigl((nh)^{-1}\bigr).$$

Now, see that

$$\frac{N_{2\alpha}}{D_{2\alpha}} = \frac{N_{2\alpha}}{E(D_{2\alpha})} \times \frac{1}{1 - \Delta}$$

with Δ = (E(D_{2α}) − D_{2α})/E(D_{2α}). As

$$\frac{1}{1 - \Delta} = 1 + \Delta + \frac{\Delta^2}{1 - \Delta},$$

we have

$$\frac{N_{2\alpha}}{D_{2\alpha}} = \frac{E(N_{2\alpha})}{E(D_{2\alpha})} + \frac{N_{2\alpha} - E(N_{2\alpha})}{E(D_{2\alpha})} - \frac{N_{2\alpha}(D_{2\alpha} - E(D_{2\alpha}))}{E(D_{2\alpha})^2} + \frac{N_{2\alpha}(D_{2\alpha} - E(D_{2\alpha}))^2}{D_{2\alpha}\, E(D_{2\alpha})^2},$$

so that, taking expectation of both sides,

$$E\!\left(\frac{N_{2\alpha}}{D_{2\alpha}}\right) = \frac{E(N_{2\alpha})}{E(D_{2\alpha})} - \frac{\mathrm{cov}(N_{2\alpha}, D_{2\alpha})}{E(D_{2\alpha})^2} + E\!\left(\frac{N_{2\alpha}(D_{2\alpha} - E(D_{2\alpha}))^2}{D_{2\alpha}\, E(D_{2\alpha})^2}\right). \qquad (\mathrm{A.13})$$

The covariance term can be shown to be o(n^{−4α/5}), similarly to the derivations which have yielded E(N_{2α}). The main steps are the following. We have

$$\mathrm{cov}(N_{2\alpha}, D_{2\alpha}) = E(N_{2\alpha} D_{2\alpha}) - E(N_{2\alpha}) E(D_{2\alpha}).$$

Write

$$N_{2\alpha} D_{2\alpha} = (nh)^{-4\alpha}\left(\sum_{k=1}^{n} K\!\left(\frac{x - X_k}{h}\right)(Y_k - m(x))\right)^{2\alpha} \left(\sum_{k=1}^{n} K\!\left(\frac{x - X_k}{h}\right)\right)^{2\alpha},$$


so that

$$E(N_{2\alpha} D_{2\alpha}) = (nh)^{-4\alpha} \sum_{\substack{(\alpha_1,\alpha_2,\ldots,\alpha_n)\in R_{2\alpha}\\ (\beta_1,\beta_2,\ldots,\beta_n)\in R_{2\alpha}}} \binom{2\alpha}{\alpha_1\,\alpha_2\,\ldots\,\alpha_n}\binom{2\alpha}{\beta_1\,\beta_2\,\ldots\,\beta_n} \prod_{k=1}^{n} E\Bigl(K^{\alpha_k+\beta_k}\!\left(\tfrac{x - X_k}{h}\right)(Y_k - m(x))^{\alpha_k}\Bigr).$$

For each term in the double sum, denote κ_a = #{k : α_k > 0}, κ_b = #{k : β_k > 0} and κ_{ab} = #{k : α_k > 0, β_k > 0}, and see that the order of each of them is at most O(n^{κ_a+κ_b−κ_{ab}} h^{κ_a+κ_b−κ_{ab}+2max(2κ_a−2α,0)}). The maximum order would be attained with κ_a = α, κ_b = 2α and κ_{ab} = 0, but see that this corresponds to the case where the sums over (α₁, α₂, ..., α_n) and (β₁, β₂, ..., β_n) can be entirely separated out, and therefore leads to the simplification with the terms arising in the development of E(N_{2α})E(D_{2α}) in the expression of the covariance. Therefore, the leading terms in the covariance development are of the same order as those such that κ_a = α, κ_b = 2α and κ_{ab} = 1, which are of order O((nh)^{3α−1}), so that

$$\mathrm{cov}(N_{2\alpha}, D_{2\alpha}) = O\bigl((nh)^{-(\alpha+1)}\bigr) = o\bigl((nh)^{-\alpha}\bigr) = o\bigl(n^{-4\alpha/5}\bigr). \qquad (\mathrm{A.14})$$

Next, the last term is shown to be negligible as well. The Cauchy-Schwarz inequality guarantees that

$$E\!\left(\frac{N_{2\alpha}(D_{2\alpha} - E(D_{2\alpha}))^2}{D_{2\alpha}}\right) \le E\!\left(\frac{(N_{2\alpha})^2}{(D_{2\alpha})^2}\right)^{1/2} E\bigl((D_{2\alpha} - E(D_{2\alpha}))^4\bigr)^{1/2}.$$

Under our assumptions, Theorem 2 and Proposition 6 of Doukhan and Lang (2009) apply, with (in their notation) p = 4α, q = r = s = 6α (say), d = 1 and ρ = 2, and state that

$$E\!\left(\frac{(N_{2\alpha})^2}{(D_{2\alpha})^2}\right) = E\bigl((\hat m(x) - m(x))^{4\alpha}\bigr) = \|\hat m(x) - m(x)\|_{4\alpha}^{4\alpha} = O\bigl((nh)^{-2\alpha}\bigr).$$

Tedious calculation in the same vein as above also yields

$$E\bigl((D_{2\alpha} - E(D_{2\alpha}))^4\bigr) = O\bigl((nh)^{-2}\bigr)$$

(it is actually enough to have E((D_{2α} − E(D_{2α}))⁴) = o(1)), so that

$$E\!\left(\frac{N_{2\alpha}(D_{2\alpha} - E(D_{2\alpha}))^2}{D_{2\alpha}}\right) = o\bigl((nh)^{-\alpha}\bigr) = o\bigl(n^{-4\alpha/5}\bigr). \qquad (\mathrm{A.15})$$


Finally, it remains, from (A.13), (A.11), (A.12), (A.14) and (A.15),

$$E\bigl((\hat m(x) - m(x))^{2\alpha}\bigr) = \sum_{\kappa=\alpha}^{2\alpha} \frac{h^{2(2\kappa-2\alpha)}}{(nh)^{2\alpha-\kappa}}\, \frac{(2\alpha)!}{2^{\kappa}(2\kappa-2\alpha)!(2\alpha-\kappa)!}\, \nu_0^{2\alpha-\kappa}\psi_2^{2\kappa-2\alpha}\, \frac{\sigma^{2(2\alpha-\kappa)}(x)\bigl(m''(x)f(x)+2f'(x)m'(x)\bigr)^{2\kappa-2\alpha}}{f^{\kappa}(x)} + o\bigl(n^{-4\alpha/5}\bigr),$$

as announced.

A.2. Proof of Corollary 2.1. The proof is the same as that of Theorem 2.1, from

$$\hat m(x) - E(\hat m(x)) = \frac{\sum_{k=1}^{n} K\!\left(\frac{x - X_k}{h}\right)(Y_k - E(\hat m(x)))}{\sum_{k=1}^{n} K\!\left(\frac{x - X_k}{h}\right)}$$

in place of (A.1). The basic modifications are the following. In the analog of (A.3), it will now appear

$$E\bigl((Y_k - E(\hat m(x)))^{\alpha_k}\,\big|\,X_k = \cdot\bigr) = \sum_{\beta=0}^{\alpha_k}\binom{\alpha_k}{\beta}\bigl(m(\cdot) - E(\hat m(x))\bigr)^{\beta}\,\xi_{\alpha_k-\beta}(\cdot) =: \varphi_{\alpha_k,x}(\cdot).$$

From Theorem 2.2, with α = 0, it is known that

$$\varphi_{\alpha_k,x}(x) = \xi_{\alpha_k}(x) - \alpha_k\,\xi_{\alpha_k-1}(x)\,\frac{1}{2} h^2 \psi_2\bigl(m''(x) + 2 f'(x) m'(x)/f(x)\bigr) + o(h^2),$$

so that

$$E\Bigl(K^{\alpha_k}\!\left(\tfrac{x - X_k}{h}\right)\varphi_{\alpha_k,x}(X_k)\Bigr) = h\,\xi_{\alpha_k}(x) f(x)\int K^{\alpha_k}(u)\,du - \frac{1}{2} h^3\Bigl( \alpha_k\,\xi_{\alpha_k-1}(x)\,\psi_2\bigl(m''(x) f(x) + 2 f'(x) m'(x)\bigr) - (\varphi_{\alpha_k,x} f)''(x)\int u^2 K^{\alpha_k}(u)\,du \Bigr)(1 + o(1)).$$

See that this is identically zero if α_k = 1, so that the term of maximum order in a sum such as (A.8) is uniquely the term with κ = α, that is, the highest term allowing for vectors (α₁, α₂, ..., α_n) whose components are all different from 1. When κ = α, the only possible vectors (α₁, α₂, ..., α_n) whose components are all different from 1 are those with α components equal to 2, and the others equal to zero. Therefore, we get

$$E(N_{2\alpha}) = (nh)^{-2\alpha} h^{\alpha} f^{\alpha}(x) \binom{n}{\alpha} \frac{(2\alpha)!}{2^{\alpha}} \bigl(\xi_2(x)\,\nu_0 + O(h^2)\bigr)^{\alpha} + o\bigl((nh)^{-\alpha}\bigr).$$

Continuing the proof with this expectation, as was done from (A.10), leads to the announced result.

A.3. Proof of Corollary 2.2. The proof is exactly the same as the previous one. Again, we would find that the term of maximum order in a sum such as (A.8) is the one with κ = α. The only vectors (α₁, α₂, ..., α_n) with α components greater than or equal to 2, whose sum is 2α + 1, are those with (α − 1) components equal to 2 and one component equal to 3. Hence, we get

$$E(N_{2\alpha+1}) = (nh)^{-(2\alpha+1)} h^{\alpha} f^{\alpha}(x) \binom{n}{\alpha}\binom{\alpha}{1} \frac{(2\alpha+1)!}{3\times 2^{\alpha-1}}\, \xi_3(x)\int K^3(u)\,du\, \bigl(\xi_2(x)\,\nu_0 + O(h^2)\bigr)^{\alpha-1} + o\bigl((nh)^{-(\alpha+1)}\bigr).$$

Continuing the proof with this expectation, as was done from (A.10), leads to the announced result.

Gery Geenens

Institut de Statistique

Université catholique de Louvain, Louvain-la-Neuve, Belgium

Department of Mathematics and Statistics

The University of Melbourne, Melbourne, Australia

Present address

School of Mathematics and Statistics

The University of New South Wales

Sydney, NSW 2052, Australia

E-mail: [email protected]

Paper received: 14 August 2012; revised: 18 June 2013.