Introduction to U statistics and UGEE · statistics (c.s.s) and h(T) is unbiased estimator of ,...

21
Introduction to U statistics and UGEE Jianghao Li Duke University December 2, 2016 Jianghao Li (Duke) UGEE December 2, 2016 1 / 21

Transcript of Introduction to U statistics and UGEE · statistics (c.s.s) and h(T) is unbiased estimator of ,...

Page 1: Introduction to U statistics and UGEE · statistics (c.s.s) and h(T) is unbiased estimator of , then h(T) is unique UMVUE of Ordered statistic is c.s.s for non-parametric family F

Introduction to U statistics and UGEE

Jianghao Li

Duke University

December 2, 2016

Jianghao Li (Duke) UGEE December 2, 2016 1 / 21

Page 2: Introduction to U statistics and UGEE · statistics (c.s.s) and h(T) is unbiased estimator of , then h(T) is unique UMVUE of Ordered statistic is c.s.s for non-parametric family F

Overview

1 U StatisticsOne sample U-statisticInference on U statisticsTwo sample and Multivariate U-statistic

2 Functional Response ModelsRegression modelUGEE

Jianghao Li (Duke) UGEE December 2, 2016 2 / 21

Page 3: Introduction to U statistics and UGEE · statistics (c.s.s) and h(T) is unbiased estimator of , then h(T) is unique UMVUE of Ordered statistic is c.s.s for non-parametric family F

Motivation

U-statistic is a effective way of constructing unbiased estimators.

For non-parametric families, U statistic usually produces uniformlyminimum variance unbiased estimator (UMVUE) .

By Lehmann-Scheff theorem, if T is an sufficient and completestatistics (c.s.s) and h(T) is unbiased estimator of θ, then h(T) isunique UMVUE of θOrdered statistic is c.s.s for non-parametric family Fr , the class ofdistribution functions having finite r th momentφ(X1,X2, ...,Xn) is a function of ordered statistic iff it’s symmetricThus, to find the UMVU estimator, it suffices to find a symmetricunbiased estimator

Jianghao Li (Duke) UGEE December 2, 2016 3 / 21

Page 4: Introduction to U statistics and UGEE · statistics (c.s.s) and h(T) is unbiased estimator of , then h(T) is unique UMVUE of Ordered statistic is c.s.s for non-parametric family F

One sample U statistics

For distribution family F with parameter θ and a sample{X1,X2, ...,Xr}, θ is called estimable wrt to F , if ∃ a statisticsh(X1,X2, ...,Xr ) s.t.

EFh(X1,X2, ...,Xr ) = θ,∀F ∈ F

h(x1, x2, ..., xr ) is called kernel function of θ

WLOG, suppose h(x1, x2, ..., xr ) is symmetric wrt (x1, x2, ..., xr )

Otherwise, let

h∗(x1, x2, ..., xr ) =1

r !

∑α1,...,αr

h(Xα1 ,Xα2 , ...,Xαr )

Jianghao Li (Duke) UGEE December 2, 2016 4 / 21

Page 5: Introduction to U statistics and UGEE · statistics (c.s.s) and h(T) is unbiased estimator of , then h(T) is unique UMVUE of Ordered statistic is c.s.s for non-parametric family F

Definition

Suppose X1,X2, ...,Xn are i.i.d sample with cdf F(x). Estimableparameter θ kernel h(X1,X2, ...,Xr ). The U Statistics constructed byh(X1,X2, ...,Xr ) is:

U(X1,X2, ...,Xn) =1(nr

) ∑β1,...,βr

h(Xβ1 ,Xβ2 , ...,Xβr )

Exampleθ = E (X1), then h(x) = x,

U(X1,X2, ...,Xn) =1(n1

) n∑i=1

Xi = X̄

θ = Var(X1), then h(x1, x2) = 12 (x1 − x2)2,

U(X1,X2, ...,Xn) =1(n2

) n∑i<j

1

2(Xi − Xj)

2 =1

n − 1

n∑i=1

(Xi − X̄ )2

Jianghao Li (Duke) UGEE December 2, 2016 5 / 21

Page 6: Introduction to U statistics and UGEE · statistics (c.s.s) and h(T) is unbiased estimator of , then h(T) is unique UMVUE of Ordered statistic is c.s.s for non-parametric family F

Variance

Var(U) = E

(nr

)−1 ∑β1,...,βr

(h(Xβ1 ,Xβ2 , ...,Xβr )− θ)

2

=

(n

r

)−2E [∑

β1,...,βr

(h(Xβ1 ,Xβ2 , ...,Xβr )− θ)

∑α1,...,αr

(h(Xα1 ,Xα2 , ...,Xαr )− θ)]

=

(n

r

)−2 ∑β1,...,βr

∑α1,...,αr

cov [h(Xβ1 ,Xβ2 , ...,Xβr ), h(Xα1 ,Xα2 , ...,Xαr )]

=

(n

r

)−2 n∑c=0

(n

r

)(n

c

)(n − r

n − c

)ζc

where ζc = cov{h(X1, ...,Xc ,Xc+1, ...,Xr ), h(X1, ...,Xc ,Xr+1, ...,X2r−c)}Jianghao Li (Duke) UGEE December 2, 2016 6 / 21

Page 7: Introduction to U statistics and UGEE · statistics (c.s.s) and h(T) is unbiased estimator of , then h(T) is unique UMVUE of Ordered statistic is c.s.s for non-parametric family F

Variance

General form of the variance of U statistics:

Var(U) =1(nr

) r∑c=1

(r

c

)(n − r

r − c

)ζc

nVar(U)→ r2ζ1

Example

θ = E (X1), then U(X1,X2, ...,Xn) = X̄ ,

VarX̄ =1(n1

)(1

1

)(n − 1

1

)ζ1 =

σ2

n

Jianghao Li (Duke) UGEE December 2, 2016 7 / 21

Page 8: Introduction to U statistics and UGEE · statistics (c.s.s) and h(T) is unbiased estimator of , then h(T) is unique UMVUE of Ordered statistic is c.s.s for non-parametric family F

Projection

U statistic is not in a form that is conductive to the usual treatmentby direct applications of LLN and CLT.

Basic idea to tackle the dependence of U statistics using projection

Define V = {V |V =∑n

i=1 K (Xi ),K (·)is a real function}. Then forany symmetric statistics W (X1,X2, ...,Xn) with E(W) = 0, itsprojection onto V can be proved as:

W ∗ =n∑

i=1

E{W |Xi}

This is a projection in the sense that

E (W − V ∗)2 ≤ E (W − V )2,∀V ∈ V

Jianghao Li (Duke) UGEE December 2, 2016 8 / 21

Page 9: Introduction to U statistics and UGEE · statistics (c.s.s) and h(T) is unbiased estimator of , then h(T) is unique UMVUE of Ordered statistic is c.s.s for non-parametric family F

Asymptotic Normality

Thus the projection of U(X1,X2, ...,Xn)− θ on V is:

U∗ =r

n

n∑i=1

{h∗(Xi )− θ}

where

h∗(Xi ) = E (h(Xβ1 ,Xβ2 , ...,Xβr |Xβ1 = Xi )

= E (h(X1,X2, ...,Xr )|X1 = Xi )

E [h∗(Xi )] = E (U) = θ, Var [h∗(Xi )] = ζ1 (when n ≥ 2r)

Theorem (Hoeffding)

√n[U(X1,X2, ...,Xn)− θ] =

√nU∗ + op(1)→d N(0, r2ζ1)

Jianghao Li (Duke) UGEE December 2, 2016 9 / 21

Page 10: Introduction to U statistics and UGEE · statistics (c.s.s) and h(T) is unbiased estimator of , then h(T) is unique UMVUE of Ordered statistic is c.s.s for non-parametric family F

Example: Wilcoxon signed rank test

Wilcoxon signed rank test statistic for testing whether distribution issymmetric around 0:

W+n =

n∑i=1

I{Xi>0}R+i

where R+i is the rank of |Xi |

Another way of expressing it:

W+n =

n∑i=1

n∑j≤i

I{Xi+Xj≥0}

=n∑

i=1

I{Xi≥0} +n∑

i=1

n∑j<i

I{Xi+Xj≥0}

Jianghao Li (Duke) UGEE December 2, 2016 10 / 21

Page 11: Introduction to U statistics and UGEE · statistics (c.s.s) and h(T) is unbiased estimator of , then h(T) is unique UMVUE of Ordered statistic is c.s.s for non-parametric family F

Example: Wilcoxon signed rank test

1(n2

)W+n =

2

n − 1

1

n

n∑i=1

I{Xi≥0} +1(n2

) n∑i=1

n∑j<i

I{Xi+Xj≥0}

=2

n − 1U1 + U2

∴√n

(W+

n − E (W+n )(n

2

) )=√n(U2 − E (U2)) + op(n−

12 )

→ N(0, 4ζ1)

where ζ1 = cov(I{X1+X2≥0}, I{X1+X3≥0}) = 112 under the hypothesis that

the distribution of Xi is symmetric around 0.

Jianghao Li (Duke) UGEE December 2, 2016 11 / 21

Page 12: Introduction to U statistics and UGEE · statistics (c.s.s) and h(T) is unbiased estimator of , then h(T) is unique UMVUE of Ordered statistic is c.s.s for non-parametric family F

Two sample U statistic

Suppose {X1, ...Xm} and {Y1, ...,Yn} are two independent samples,with cdf F(x) and G(y), respectively. For estimable θ if we have kernelh(·; ·) such that

E(F ,G)(h(X1, ...,Xr ;Y1, ...,Ys)) = θ

h(·; ·) is symmetric wrt {X1, ...Xm} and {Y1, ...,Yn} respectively.Then define U-statistic as:

U(X1, ...,Xm;Y1, ...,Yn)

=1(m

r

)(ns

) ∑β1,...,βr

∑α1,...,αs

h(Xβ1 , ...,Xβr ;Yα1 , ...,Yαs )

Jianghao Li (Duke) UGEE December 2, 2016 12 / 21

Page 13: Introduction to U statistics and UGEE · statistics (c.s.s) and h(T) is unbiased estimator of , then h(T) is unique UMVUE of Ordered statistic is c.s.s for non-parametric family F

Inference for Two sample U statistic

For two independent samples {X1, ...Xm} and {Y1, ...,Yn}, supposem

n+m → λ when N = n + m→∞, then

√N[U(X1, ...,Xm;Y1, ...,Yn)− θ]→d N

(0,

r2ζ1,0λ

+s2ζ0,11− λ

)where

ζc,d =cov [h(X1, ...,Xc ,Xc+1, ...,Xr ;Y1, ...,Yd ,Yd+1, ...,Ys),

h(X1, ...,Xc ,Xr+1, ...,X2r−c ;Y1, ...,Yd ,Ys+1, ...,Y2s−d)]

Jianghao Li (Duke) UGEE December 2, 2016 13 / 21

Page 14: Introduction to U statistics and UGEE · statistics (c.s.s) and h(T) is unbiased estimator of , then h(T) is unique UMVUE of Ordered statistic is c.s.s for non-parametric family F

Example: Mann- Whitney statistic

Mann- Whitney- Wilcoxon rank sum test is used to test whether there is alocation shift between two distributions. The statistics is written as:

U =m∑i=1

n∑j=1

I{Xi≤Yj}

Then under null hypothesis that the two samples are identicallydistributed, we have asymptotic distribution:

√N(

1

mnU − 1

2)→ N

(0,

1

12λ+

1

12(1− λ)

)

Jianghao Li (Duke) UGEE December 2, 2016 14 / 21

Page 15: Introduction to U statistics and UGEE · statistics (c.s.s) and h(T) is unbiased estimator of , then h(T) is unique UMVUE of Ordered statistic is c.s.s for non-parametric family F

Multivariate U-statistic

Consider i.i.d sample of p × 1 vector Xi, h(X1, ...,Xr) is a symmetric ldimensional vector-valued function, then multivariate U-statistic for vactorθis:

U(X1,X2, ...,Xn) =1(nr

) ∑β1,...,βr

h(Xβ1 ,Xβ2 , ...,Xβr)

Similarly define projection of U− θ as:

U∗ =n∑

i=1

E{U|Xi} − θ =r

n

n∑i=1

{E (h|Xi )− θ}

Similarly we have asymptotic normality:

√n(Un − θ)→d N(0,m2Var(E (h|Xi )))

Jianghao Li (Duke) UGEE December 2, 2016 15 / 21

Page 16: Introduction to U statistics and UGEE · statistics (c.s.s) and h(T) is unbiased estimator of , then h(T) is unique UMVUE of Ordered statistic is c.s.s for non-parametric family F

Functional responses model

Functional response models are generalization of traditional regressionmodels, through defining both the responses and predictors arefunctions.

The class of distribution free model regression model for functionalresponses is defined by

E [f(yi1 , ..., yir )|xi1 , ..., xir ] = h(xi1 , ..., xir ;θ)

where f is a vector-valued function, h is a smooth function withcontinuous derivatives up to second order.

This FRM subsumes all classic regression model.

Jianghao Li (Duke) UGEE December 2, 2016 16 / 21

Page 17: Introduction to U statistics and UGEE · statistics (c.s.s) and h(T) is unbiased estimator of , then h(T) is unique UMVUE of Ordered statistic is c.s.s for non-parametric family F

Motivating Example

Distribution-free regression model restricting for higher moments.

When only restricting for the first moment, i.e. E (yi ) = µ(xi ; θ)where, GEE provides asymptotic efficient estimator when correlationis correctly specified.

But sometimes, modeling for variance may also be of interest.

E.g. test for overdispersion in Possion-like model.

Solution in traditional framework: GEE II

E

(yi

(yi − µ(xi ; θ))2

)=

(µ(xi ; θ)V (xi ;θ)

)

Jianghao Li (Duke) UGEE December 2, 2016 17 / 21

Page 18: Introduction to U statistics and UGEE · statistics (c.s.s) and h(T) is unbiased estimator of , then h(T) is unique UMVUE of Ordered statistic is c.s.s for non-parametric family F

Motivating Example

But GEE II is not a generalization of regression model, as thedependent variables involve unknown parameter.

Alternatively, use FRM:

E

(12(yi + yj)(yi − yj)

2

)=

(µ(xi ; θ)

V (xi ; θ) + V (xj ; θ) + [µ(xi ; θ)− µ(xj ; θ)]2

)

Jianghao Li (Duke) UGEE December 2, 2016 18 / 21

Page 19: Introduction to U statistics and UGEE · statistics (c.s.s) and h(T) is unbiased estimator of , then h(T) is unique UMVUE of Ordered statistic is c.s.s for non-parametric family F

Inference for FRM: UGEE

For FRME [f(yi1 , ..., yir)|xi1 , ..., xir ] = h(xi1 , ..., xir ;θ)

The U-statistic based generalized estimating equation is given by

Un(θ) =

(n

r

)−1 ∑i∈Cn

r

Gi(xi)(fi − hi) = 0

where i = (i1, ..., ir ), Gi(xi) is some known p × l matrix function, p is thedimension of θ and l is the dimension of fi

Choice of Gi(xi) is not unique and does not affect consistency of theUGEE estimate.

Mostly set Gi(xi) = DTi V−1i =

(∂hi∂θ

)TV−1i

Jianghao Li (Duke) UGEE December 2, 2016 19 / 21

Page 20: Introduction to U statistics and UGEE · statistics (c.s.s) and h(T) is unbiased estimator of , then h(T) is unique UMVUE of Ordered statistic is c.s.s for non-parametric family F

Asymptotic features

Let Σ = Var(E (Gi(xi)(fi − hi)|yi1 , xi1)), B = E (GiDTi ), then under mild

regularity conditions, we have:

θ̂ is consistent

If V is correctly specified, then θ is asymptotically normal:

√n(θ̂ − θ)→d N(0, q2B−1ΣB−T )

Jianghao Li (Duke) UGEE December 2, 2016 20 / 21

Page 21: Introduction to U statistics and UGEE · statistics (c.s.s) and h(T) is unbiased estimator of , then h(T) is unique UMVUE of Ordered statistic is c.s.s for non-parametric family F

Reference

Sun, Shanze. Nonparametric Statistics. Peking University, 2000.

Kowalsi, Jeanne, and Tu, Xin M. Modern Applied U-Statistics. Wiley,2008.

Jianghao Li (Duke) UGEE December 2, 2016 21 / 21