Page 1:

Applications of the Randomized Probability Integral Transform

Dennis D. Cox, Rice University

Page 2:

Introduction

- We consider the problem of testing goodness of fit (GoF) for independent univariate random variables.

- Comment: Sometimes there are valid scientific reasons for wanting to test goodness of fit.

- For example, to test a theory which predicts some probability distribution for some data.

- If the data are i.i.d., then there are typically numerous GoF tests:

- Like χ2 tests for discrete data,

- Or the Kolmogorov-Smirnov (KS) test if the data are continuous.

- The KS test has a nice distribution-free property (assuming the distribution under H0 is completely specified).

Page 3:

- The Probability Integral Transform (PIT): Let univariate X have a continuous CDF F(x). Then F(X) ∼ Unif[0, 1].

- Let X1, X2, ... be independent with continuous CDFs F1, F2, ... (not necessarily i.i.d.!). Define the empirical distribution functions

  Gn(u) = (1/n) ∑_{i=1}^n I_(−∞,u](Fi(Xi)),

  and the Kolmogorov-Smirnov test statistic

  Kn = sup_{u∈[0,1]} |Gn(u) − u|.

- The distribution of Kn is independent of the Fi.

- We can also get an asymptotic distribution. This can be used to construct approximate tests that have better properties than the KS test.

Page 4:

- We would like to extend this to settings where the distributions are not continuous.

- For example, censored data. Let Yi, 1 ≤ i ≤ n, be i.i.d. lifetimes with assumed continuous distribution H.

- We observe Xi = min{Yi, ci}, where the ci are fixed (and known).

- Then the c.d.f. for Xi is

  Fi(x) = H(x) if x < ci,
          1    if x ≥ ci.

- In fact, the distribution of Yi need not be the same for all i.

Page 5:

- In many applications, a r.v. is either 0 with positive probability, or has a continuous distribution when it is positive. This setup occurs in rainfall modeling and in prediction of output from oil wells. The c.d.f. Fi(x) has a jump at x = 0 and is continuous for x > 0.

- Another class of examples: each Xi is Poisson or Bernoulli, with different parameters (e.g., parameters depending on covariates, as in log-linear Poisson or logistic regression models).

Page 6:

RPIT:

- The Randomized Probability Integral Transform (RPIT): Let univariate X have a CDF F(x) and let U ∼ Unif[0, 1] be independent of X. Then

  V = U F(X − 0) + (1 − U) F(X),

  where F(x − 0) denotes the left limit of F at x, satisfies V ∼ Unif[0, 1].

- I'll try to explain with an example.
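A short numerical check of the RPIT formula above, using a Bernoulli(p) variable as an illustrative discrete example (the choice p = 0.3 is arbitrary):

```python
import numpy as np

# Sketch of the RPIT V = U*F(X-0) + (1-U)*F(X) for X ~ Bernoulli(p).
# CDF: F(x) = 0 for x < 0, q = 1-p for 0 <= x < 1, and 1 for x >= 1.
rng = np.random.default_rng(1)
p, n = 0.3, 200_000
q = 1.0 - p

x = (rng.random(n) < p).astype(float)    # X ~ Bernoulli(p)
u = rng.random(n)                        # U ~ Unif[0,1], independent of X

F = np.where(x >= 1.0, 1.0, q)           # F(X)
F_left = np.where(x >= 1.0, q, 0.0)      # F(X - 0), the left limit

v = u * F_left + (1.0 - u) * F           # RPIT
# If V ~ Unif[0,1], the sample mean is near 1/2 and variance near 1/12.
print(v.mean(), v.var())
```

Conditional on X = 0, V is Unif[0, q]; conditional on X = 1, V is Unif[q, 1]; mixing with weights q and p recovers Unif[0, 1].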

Page 7:

[Figure: plot of the CDF F(x)]

Page 8:

[Figure: plot of the CDF of F(X)]

Page 9:

- Note that Law[V | X = x] is Unif[F(x − 0), F(x)], where Unif[a, a] is a degenerate distribution at a.

- Given X1, X2, ... independent, we could generate the V1, V2, ... and apply standard GoF tests to the Vi's, but this introduces unnecessary randomness.

- That of course has implications for reproducibility and for the power of the GoF test.

Page 10:

Empirical Measure Derived from RPIT:

- Assume X1, X2, ... independent with CDFs F1, F2, .... Define

  Ai = Fi(Xi − 0),
  Bi = Fi(Xi).

- Consider the sequence of empirical measures

  Qn = (1/n) ∑_{i=1}^n Unif[Ai, Bi],

- Corresponding (random) CDFs

  Gn(x) = Qn((−∞, x]).

- Note that if Fi is continuous at the observed value of Xi, then Ai = Bi and the contribution to Qn is discrete (a point mass); if Fi is discontinuous there, then Ai < Bi and the contribution is continuous.

- Gn will be an average of random functions that look like ...

Page 11:

[Figures: two examples of the component CDFs u ↦ Unif[Ai, Bi]((−∞, u]) on [0, 1]]
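The mixture CDF Gn can be evaluated directly from the pairs (Ai, Bi); a minimal sketch (the toy values below are hypothetical):

```python
import numpy as np

# Sketch: G_n(u) = (1/n) sum_i Unif[A_i, B_i]((-inf, u]).  Each term is
# clip((u - A_i)/(B_i - A_i), 0, 1); when A_i == B_i the term is the
# point-mass indicator 1{u >= A_i}.
def Gn(u, A, B):
    A, B = np.asarray(A, float), np.asarray(B, float)
    width = B - A
    safe = np.where(width > 0, width, 1.0)       # avoid division by zero
    terms = np.where(width > 0,
                     np.clip((u - A) / safe, 0.0, 1.0),
                     (u >= A).astype(float))
    return terms.mean()

# Toy example: one continuous-point term (A == B) and one jump term.
A = [0.4, 0.2]
B = [0.4, 0.7]
print(Gn(0.4, A, B))   # 0.5*1 + 0.5*((0.4-0.2)/0.5) = 0.7
```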

Page 12:

Empirical CDF Properties:

- Proposition (Generalization of the Glivenko-Cantelli Theorem): For all u ∈ [0, 1], E[Gn(u)] = u and

  lim_{n→∞} sup_{u∈[0,1]} |Gn(u) − u| = 0 a.s.

- Because of monotonicity, it suffices to show pointwise convergence.

- Define σ-fields

  Fn = σ(X1, ..., Xn),  F = σ(∪_{n=1}^∞ Fn).

  Then

  Gn(u) = E[ (1/n) ∑_{i=1}^n I_(−∞,u](Vi) | Fn ] = E[ same | F ],

  where the Vi are the i.i.d. Unif(0, 1) RVs given by the RPIT.

Page 13:

Empirical CDF Properties (cont.):

- Now (1/n) ∑_{i=1}^n I_(−∞,u](Vi) converges a.s. to u by the Strong Law.

- The Dominated Convergence Theorem for conditional expectations then implies Gn(u) → u a.s. for all u ∈ (0, 1).
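A numerical illustration (not from the slides) of this generalized Glivenko-Cantelli property, using independent Bernoulli(pi) observations with unequal parameters:

```python
import numpy as np

# Check that sup_u |G_n(u) - u| shrinks as n grows, for independent
# Bernoulli(p_i) data with heterogeneous p_i (an illustrative choice).
rng = np.random.default_rng(2)

def sup_dev(n):
    p = rng.uniform(0.1, 0.9, size=n)
    x = (rng.random(n) < p).astype(float)
    q = 1.0 - p
    A = np.where(x >= 1.0, q, 0.0)    # F_i(X_i - 0)
    B = np.where(x >= 1.0, 1.0, q)    # F_i(X_i); note B - A > 0 here
    grid = np.linspace(0.0, 1.0, 501)
    # G_n(u) averaged over the Unif[A_i, B_i] component CDFs.
    G = np.clip((grid[:, None] - A) / (B - A), 0.0, 1.0).mean(axis=1)
    return np.abs(G - grid).max()

print(sup_dev(100), sup_dev(20_000))   # the second value is much smaller
```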

Page 14:

Empirical Processes:

- The corresponding empirical process is given by

  Bn(u) = √n (Gn(u) − u), u ∈ [0, 1].

- If all the CDFs are continuous, then it is a classical result that Bn ⇒ Z in D[0, 1], where Z is a Brownian Bridge.

- The Brownian Bridge is a mean-0 Gaussian process with covariance function

  E[Z(s) Z(t)] = s(1 − t), 0 ≤ s ≤ t ≤ 1.

- This provides (for example) an easy way to get the asymptotic null distribution of the KS test statistic:

  √n Kn = sup_{u∈[0,1]} |Bn(u)| ⇒ sup_{u∈[0,1]} |Z(u)|.

Page 15:

Theorem:

Let Xn, n ≥ 1, be independent real-valued random variables with respective c.d.f.'s Fn, n ≥ 1. Define the empirical c.d.f.'s Gn and the empirical processes Bn as before. For 0 ≤ s ≤ t ≤ 1 put

Kn(s, t) = Cov(Bn(s), Bn(t)).

If

lim_{n→∞} Kn(s, t) = K(s, t),

then Bn ⇒ B, where B is a Gaussian process with mean 0 and covariance K.

Page 16:

Empirical Processes (cont.):

- Now it's just a matter of working out the covariance function of the limiting Gaussian process, although other representations are possible.

- Theorem: Suppose for all n, Fn = F1 = F, so the Xn are i.i.d. Put

  V = {F(x) : x ∈ R}.

  Then we may represent the limit process B as

  B(u) = E[Z(u) | σ({Z(v) : v ∈ V})],

  where Z is a Brownian Bridge.

- Basically, B(u) = Z(u) for u ∈ V, and B(u) is obtained by linear interpolation of Z for u ∈ (0, 1) \ V.

Page 17:

Application to Censored Data:

- Let Yi ≥ 0 be i.i.d. "lifetimes" with a continuous distribution, but suppose we observe Xi = min{Yi, c}, where c is a fixed "censoring time". Then F(c) − F(c − 0) = P[Yi > c].

- The limiting empirical process looks like this: B(u) is a Brownian Bridge for u ≤ F(c − 0), with linear interpolation between u = F(c − 0) and u = 1.

- The limiting null distribution of the KS test statistic in this case is

  √n Kn ⇒ sup_{u∈[0,F(c−0)]} |Z(u)|.

Page 18:

Empirical Processes (cont.):

- What if the Fi's are not all the same? The simplest nontrivial case is Xi ∼ B(1, pi), pi ∈ (0, 1).

- It's easier to work with qi = 1 − pi. Define the empirical CDF of the qi by

  Ln(u) = (1/n) ∑_{i=1}^n I_(−∞,u](qi).

  Assume there is a CDF L(u) such that

  sup_{q∈[0,1]} |Ln(q) − L(q)| → 0.

Page 19:

Empirical Processes (cont.):

Then the covariance of the limiting Gaussian process is, for 0 ≤ u ≤ v ≤ 1,

E[B(u) B(v)] = u(1 − v) ∫ [ ((1 − u)/u)(q/(1 − q)) I_[0,u](q) + I_(u,v)(q) + ((v/(1 − v))((1 − q)/q) I_[v,1](q) ] dL(q).

Page 20:

Empirical Processes (cont.):

- If log[pi/(1 − pi)] = a zi + b (a logistic regression model), where the zi are realizations of i.i.d. RVs from a given distribution, then one can show the existence of the limiting measure L and maybe give a formula for its density.

- Other examples? Censored RVs. Say Yi ≥ 0 are i.i.d., but we observe Xi = min{Yi, ci}, where the ci are censoring times. Then the Xi typically have a mixed discrete/continuous distribution (continuous for x < ci and discrete at x = ci). We should be able to get some results for this setting.

- Another model that arises frequently is Xi ∼ Poisson(λi).

- These results show that the limiting empirical process covariance depends in a complicated way on the values of the CDFs at their discontinuities.

Page 21:

Further Probability Research:

- A general characterization of the covariance of the limiting Gaussian process?

- Other characterizations, e.g., in terms of conditional expectations.

- What about when parameters are estimated? This "messes up" the covariance of the limiting Gaussian process even in the i.i.d. continuous-CDF case; it makes the limit distribution theory of test statistics very difficult and limits the practical application of the limit theory.

Page 22:

Applications in Statistics:

- In the classical case (continuous c.d.f.'s) the finite-sample distribution of the KS test can be worked out.

- In principle, assuming the c.d.f.'s are known, one can simulate from them to estimate the null distribution of any test statistic.

- Even in the classical case, if there are parameters to be estimated, this "messes up" the limiting distribution of the empirical processes.

- The covariance of the limiting Gaussian process depends in a complicated way on the model and the true distribution.

Page 23:

Applications in Statistics (cont.):

- However, one can use a parametric bootstrap to get an approximate null distribution (assuming a large sample size):

- Estimate parameters in the model (e.g., by maximum likelihood).

- Simulate new data sets from the estimated model.

- For each simulated data set, estimate parameters in the same way.

- Construct the realized empirical process using the estimated parameter values.

- Evaluate the test statistic on the simulated realization.
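The steps above can be sketched in code. This is a minimal illustration, not the slides' procedure: the model (Exp(rate), with the rate estimated by maximum likelihood) and the KS statistic via the PIT are arbitrary illustrative choices, and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)

def ks_stat(x, rate):
    # KS statistic after the PIT under the fitted Exp(rate) model.
    u = np.sort(1.0 - np.exp(-rate * x))
    n = x.size
    i = np.arange(1, n + 1)
    return max(np.max(i / n - u), np.max(u - (i - 1) / n))

def bootstrap_pvalue(x, n_boot=500):
    rate_hat = 1.0 / x.mean()            # MLE of the exponential rate
    t_obs = ks_stat(x, rate_hat)
    t_boot = np.empty(n_boot)
    for b in range(n_boot):
        xb = rng.exponential(1.0 / rate_hat, size=x.size)  # simulate
        t_boot[b] = ks_stat(xb, 1.0 / xb.mean())  # re-estimate each time
    return np.mean(t_boot >= t_obs)      # bootstrap p-value

x = rng.exponential(2.0, size=200)       # data actually from the model
print(bootstrap_pvalue(x))
```

Re-estimating the parameters on each simulated data set is the essential step: it mimics the effect of estimation on the empirical process, which the fixed-distribution KS critical values ignore.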

Page 24:

Applications in Statistics (cont.):

- This is future work ...

- There are better tests than the KS test in the classical setting.

- For example, the tests given in "Asymptotic optimality of data-driven Neyman's tests for uniformity," by Inglot and Ledwina, Ann. Statist., 1996.