Panel Data

Markku Rahiala, Professor (emeritus) of Econometrics, University of Oulu, Oulu, Finland

Panel data (or longitudinal data) are data sets in which information on a number of observational units has been collected at several time points. These observational units are usually called individuals or subjects. In economic applications they might be households, firms, individual persons, countries, investors or other economic agents. In medical, biological and social applications the subjects might be patients, test animals, individual persons, etc.

Panel data are most often used to study unidirectional relationships between some explanatory variables X_it = (x_1,i,t . . . x_m,i,t)′ and a continuous dependent variable y_it, where i (i = 1, . . . , N) refers to the individual and t (t = t_0i, . . . , T_i) refers to time or period. (To simplify notation, we will assume t_0i ≡ 1 and T_i ≡ T.) Panel data have many advantages over cross-sectional data or single aggregated time series. One can for instance study the dynamic effects of the covariates on a micro level, and one can explore heterogeneities among the individuals. Regression models of the linear type (for suitably transformed variables)

y_it = α_i + β′X_it + ε_it    (1)

are especially popular as model frameworks. The error terms ε_it are assumed to be independent of the explanatory variables X_it, the processes {ε_it} are assumed stationary with zero means, and the data from different individuals are assumed independent of each other. The length of the data T is often fairly short, whereas the number of subjects N might be large. The levels of any dependent variable usually show considerable individual variation, which makes it necessary to allow for individually varying intercept terms α_i in model (1). If these intercepts are taken as fixed parameters, the model is called a fixed effects model. When N is large, this will however lead to some inferential problems. For instance, ML estimators of the variance–covariance structure of the error processes would be inconsistent for fixed T and increasing N. This is why the intercept terms are often treated as mutually independent random variables α_i ∼ IID(α, τ²) (α_i also independent of {ε_it}), whenever the observed subjects can be interpreted as a sample from a larger population of potential subjects. These random effects models are special cases of so-called mixed models or variance components models. If the effects of the covariates X_it vary over individuals, the regression coefficients β_(i) can similarly be interpreted as random variables, (α_i β′_(i))′ ∼ IID_{m+1}((α β′)′, G). If all the random elements in the model were assumed normally distributed, the whole model for subject i could be written as

Y_i = (y_i1 . . . y_iT)′ = α 1_T + X_i β + Z_i U_i + ε_i,   ε_i independent of U_i,
U_i ∼ NID_{m+1}(0, G),   ε_i = (ε_i1 . . . ε_iT)′ ∼ NID_T(0, R),    (2)

where U_i = (α_i β′_(i))′ − (α β′)′, X_i = (X_i1 . . . X_iT)′, Z_i = (1_T X_i) and 1_T = (1 . . . 1)′. This is a standard form of a mixed model leading to GLS estimators for α and β once the parameters incorporated in the matrices G and R have been estimated.

To take account of possible unobserved changes in the general (either economic or biological) environment, one can include an additive term γ′_(i) F_t in model (1), where F_t denotes an r-dimensional vector of common factors and γ_(i) is a vector containing the factor loadings for individual i. These common factors will induce dependencies between the y_it-observations from different individuals. (See, e.g., Pesaran.)

In model (1), all the explanatory variables were assumed exogenous. However, econometric models quite often also include lagged values of the dependent variable as regressors. Much of economic theory starts from the assumption that the economic agents are optimizing an intertemporal utility function, and this assumption often induces an autoregressive, dynamic model

y_it = α_i + φ_1 y_i,t−1 + . . . + φ_p y_i,t−p + β′X_it + ε_it,   ε_it ∼ IID(0, σ²)    (3)

for the dependent variable y_it. In the case of random intercepts α_i, the lagged values of the dependent variable and


the combined error terms (α_i − α) + ε_it will be correlated. This would lead to inconsistent least squares estimators for the φ-parameters. The problem can be circumvented by the GMM (Generalized Method of Moments) estimation method (see, e.g., the appendices in Arellano). Once the order p of the autoregressive model (3) has been correctly specified, it is easy to find valid instruments for the GMM estimation among the lagged differences of the dependent variable. Model (3) can be straightforwardly extended to the vector-valued case (see, e.g., Hsiao).
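To make the instrumental-variable idea concrete, the following minimal sketch (not from the original entry) simulates a first-order version of model (3), first-differences it to remove α_i, and instruments Δy_{i,t−1} with the level y_{i,t−2}, in the spirit of the simplest IV/GMM estimators for dynamic panels.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, phi = 500, 8, 0.5

# Simulate y_it = alpha_i + phi * y_{i,t-1} + eps_it
alpha = rng.normal(size=N)
y = np.zeros((N, T))
y[:, 0] = alpha + rng.normal(size=N)
for t in range(1, T):
    y[:, t] = alpha + phi * y[:, t - 1] + rng.normal(size=N)

# First differencing removes alpha_i, but d(eps_it) is correlated with dy_{i,t-1},
# so instrument dy_{i,t-1} by the lagged level y_{i,t-2}.
dy = np.diff(y, axis=1)          # columns correspond to t = 2, ..., T
dep = dy[:, 1:].ravel()          # dy_it for t = 3, ..., T
lag = dy[:, :-1].ravel()         # dy_{i,t-1}
instr = y[:, :-2].ravel()        # y_{i,t-2}

phi_iv = (instr @ dep) / (instr @ lag)   # simple IV (Anderson-Hsiao type) estimate
print("IV estimate of phi:", phi_iv)
```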

If the dependent variable y_it is discrete (either a count variable or measured on a nominal or ordinal scale), one can combine the basic idea of model (1) and the concept of generalized linear models by assuming that the covariates X_it and the heterogeneity terms α_i affect the so-called linear predictors analogously to (1),

η_it = g(E(y_it ∣ X_it, α_i)) = α_i + β′X_it    (4)

and by assuming that, conditionally on X_it and α_i, y_it follows a distribution belonging to the exponential family of distributions. (See, e.g., Diggle et al. or Fitzmaurice et al.) Function g is called the link function, and models (4) are called generalized linear mixed models (GLMM). If, for instance, y_it were dichotomous, obtaining the values 0 or 1, the logit link function would lead to the model

P(y_it = 1 ∣ X_it, α_i) = exp(α_i + β′X_it) / (1 + exp(α_i + β′X_it)).

The resulting likelihood function contains complicated integral expressions, but they can be effectively approximated by numerical techniques.
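One common numerical technique for these integrals is Gauss–Hermite quadrature. The sketch below (an illustration, not the authors' code) evaluates the marginal log-likelihood contribution of a single subject under a random-intercept logit model with α_i ∼ N(α, τ²).

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss
from scipy.special import expit

def subject_loglik(y, X, beta, alpha, tau, n_nodes=20):
    """Log marginal likelihood of one subject's binary responses y (length T),
    integrating the random intercept alpha_i ~ N(alpha, tau^2) out of the
    logit model P(y_it = 1 | X_it, alpha_i) = expit(alpha_i + beta'X_it)."""
    nodes, weights = hermgauss(n_nodes)          # rule for integrals against exp(-x^2)
    a = alpha + np.sqrt(2.0) * tau * nodes       # quadrature points for alpha_i
    eta = a[:, None] + X @ beta                  # linear predictors, shape (n_nodes, T)
    p = expit(eta)
    lik_given_a = np.prod(p**y * (1 - p)**(1 - y), axis=1)
    return np.log(np.sum(weights * lik_given_a) / np.sqrt(np.pi))

# Illustrative call with simulated data for one subject
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))
y = rng.integers(0, 2, size=6)
print(subject_loglik(y, X, beta=np.array([0.8, -0.5]), alpha=0.2, tau=1.0))
```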

About the Author
Markku Rahiala received his Ph.D. in statistics at the University of Helsinki. He is Professor of Econometrics at the University of Oulu. He was an Associate Editor of the Scandinavian Journal of Statistics and is a Member of the Econometric Society.

Cross References
Data Analysis; Event History Analysis; Linear Mixed Models; Medical Statistics; Multilevel Analysis; Nonsampling Errors in Surveys; Principles Underlying Econometric Estimators for Identifying Causal Effects; Repeated Measures; Sample Survey Methods; Social Network Analysis; Statistical Analysis of Longitudinal and Correlated Data; Testing Variance Components in Mixed Linear Models

References and Further Reading
Arellano M () Panel data econometrics. Oxford University Press, Oxford
Diggle PJ, Heagerty P, Liang K-Y, Zeger SL () Analysis of longitudinal data, 2nd edn. Oxford University Press, Oxford
Fitzmaurice GM, Laird NM, Ware JH () Applied longitudinal analysis. Wiley, Hoboken
Hsiao C () Analysis of panel data, 2nd edn. Cambridge University Press, Cambridge
Pesaran MH () Estimation and inference in large heterogeneous panels with multifactor error structure. Econometrica

Parametric and Nonparametric Reliability Analysis

Chris P. Tsokos, Distinguished University Professor, University of South Florida, Tampa, FL, USA

Introduction
Reliability is “the probability that a piece of equipment (component, subsystem or system) successfully performs its intended function for a given period of time under specified (design) conditions” (Martz and Waller). Failure means that an item does not perform its required functions. To evaluate the performance of an item, to predict its failure time and to find its failure pattern is the subject of Reliability. Mathematically, we can define reliability, R(t), as follows:

R(t) = Pr(T > t) = 1 − ∫_0^t f(τ) dτ,   t ≥ 0,

where T denotes the failure time of the system or component, and f(t) the failure probability distribution.

Performing accurate reliability analysis depends mainly on having properly identified a classical discrete or continuous probability distribution that will characterize the behavior of the failure data. In practice, scientists and engineers either assume one, such as the exponential, Weibull, Poisson, etc., or perform a goodness of fit test to properly identify the failure distribution and then proceed with the reliability analysis. It is possible that the assumed failure distribution is not the correct one and,


furthermore, that the goodness of fit test methodology failed to identify a classical probability distribution. Thus, proceeding with the reliability analysis will result in misleading and incorrect results.

In this brief document we discuss a nonparametric reliability procedure for the case when one cannot identify a classical failure distribution, f(t), to characterize the failure data of the system. The method is based on estimating the failure density through the concept of the distribution-free kernel density method. Utilizing such methods in this subject area involves significant computational difficulties. Therefore, in order to use this method, one must be able to obtain the optimal bandwidth for the kernel density estimate. Here, we recommend a six-step procedure which one can apply to compute the optimal nonparametric probability distribution that characterizes the failure times. Some useful references on the subject matter are Bean and Tsokos, Liu and Tsokos, Qiao and Tsokos, Rust and Tsokos, Silverman, and Tsokos and Rust. First we briefly discuss the parametric approach to reliability using the popular three-parameter Weibull probability distribution as the failure model. Some additional useful failure models can be found in Tsokos.

Parametric Approach to Reliability
The two-parameter Weibull probability distribution is universally used to characterize the failure times of a system or component in studying its reliability behavior, instead of the three-parameter Weibull model. Recently, methods have been developed, along with effective algorithms, by which one can obtain estimates of the three-parameter Weibull probability distribution. Here we will use the three-parameter Weibull failure model.

The three-parameter Weibull failure model is given by

f(t) = [c (t − a)^(c−1) / b^c] exp[ −((t − a)/b)^c ],

where a, b and c are the maximum likelihood estimates of the location, scale and shape parameters, respectively. For calculation of the estimates of a, b and c, see Qiao and Tsokos.

Thus, the parametric reliability estimate R_p(t) of the three-parameter Weibull model is given by

R_p(t) = exp[ −((t − a)/b)^c ].

The goodness of fit criterion that one can use in identifying the appropriate classical failure probability distribution to characterize the failure times is the popular Kolmogorov–Smirnov test. Briefly, it tests the null hypothesis that the data {t_j}_{j=1}^n come from some specified classical probability distribution against the alternative hypothesis that they come from another probability distribution. That is,

H_0: {t_j}_{j=1}^n ∼ F_0(t),
H_1: {t_j}_{j=1}^n ≁ F_0(t).

Let F*_n(t) be the empirical distribution function for the failure data. The Kolmogorov–Smirnov statistic is defined by

D_n = sup_t ∣F*_n(t) − F_0(t)∣.

The statistic D_n can be easily calculated from the following formula:

D_n = max{ max_{1≤i≤n} [ i/n − F_0(t_(i)) ],  max_{1≤i≤n} [ F_0(t_(i)) − (i − 1)/n ] },

where t_(1) ≤ t_(2) ≤ ⋯ ≤ t_(n) are the order statistics of {t_j}_{j=1}^n.

Let D_{n,α} be the upper α-percent point of the distribution of D_n, that is,

P( D_n > D_{n,α} ) ≤ α.

Tables for the exact critical values D_{n,α} are available; see Miller and Owen, among others, to make the appropriate decision.
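As a practical illustration (with simulated failure times and scipy, not part of the original entry), the fitted three-parameter Weibull model can be checked with the Kolmogorov–Smirnov statistic as follows; note that when a, b, c are estimated from the same data, the tabulated critical values are only approximate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated failure times from a three-parameter Weibull (shape c, location a, scale b)
failure_times = stats.weibull_min.rvs(c=1.8, loc=10.0, scale=50.0, size=200,
                                      random_state=rng)

# Maximum likelihood estimates of (c, a, b)
c_hat, a_hat, b_hat = stats.weibull_min.fit(failure_times)

# Kolmogorov-Smirnov test of H0: the data follow the fitted Weibull model
D_n, p_value = stats.kstest(failure_times, "weibull_min", args=(c_hat, a_hat, b_hat))
print(D_n, p_value)
```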

Nonparametric Approach to Reliability
Let {t_j}_{j=1}^n be the failure data, characterized by the probability density function f(t). Then the nonparametric probability density estimate of f can be written as

f̂_h(t) = (1/(nh)) Σ_{j=1}^n K( (t − t_j)/h ),

where K(t) is the kernel, assumed here to be Gaussian and given by

K(t) = (1/√(2π)) e^{−t²/2},

and h is the estimate of the optimal bandwidth. There are several other choices for the kernel, namely, the Epanechnikov, cosine, biweight, triweight, triangle and uniform kernels. For further information regarding the selection process, see Silverman. The most important element in making f̂_h(t) effective in characterizing the failure data is the bandwidth h. Given below is a procedure that works fairly well for obtaining the optimal h and then f̂_h(t) for the failure data


(Bean and Tsokos; Silverman). The procedure is summarized below.

Step 1. Calculate the sample variance S² = (n − 1)^{−1} Σ_{j=1}^n (t_j − t̄)² and the corresponding higher-order sample central moment T from the failure data t_1, t_2, . . . , t_n.

Step 2. Determine values for U_1 and U_2: U_1 ≈ S², and U_2 is obtained from T and S through the moment relations given in Bean and Tsokos.

Step 3. Find estimates µ̂ and σ̂ of the parameters µ and σ of the reference density from U_1 and U_2.

Step 4. Calculate ∫_{−∞}^{∞} f″(t)² dt for the reference density evaluated at µ̂ and σ̂.

Step 5. Find h_opt from

h_opt = [ 2√π n ∫_{−∞}^{∞} f″(t)² dt ]^{−1/5}.

Step 6. Obtain the estimate of the nonparametric failure distribution of the data:

f̂(t) = (1/(n h_opt)) Σ_{j=1}^n K( (t − t_j)/h_opt ).

There are other methods for dealing with the optimal bandwidth selection; see Bean and Tsokos and Silverman.
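A quick numerical version of the kernel estimate f̂_h(t) is sketched below; for brevity it uses a Gaussian kernel with Silverman's reference bandwidth in place of the six-step procedure above, so it is a simplification rather than the authors' exact method.

```python
import numpy as np

def kernel_density(data, grid, h=None):
    """Gaussian-kernel estimate f_h(t) = (1/(n h)) sum_j K((t - t_j)/h) on a grid."""
    data = np.asarray(data, dtype=float)
    n = data.size
    if h is None:
        # Silverman's reference bandwidth, used here instead of the six-step rule
        h = 1.06 * data.std(ddof=1) * n ** (-1.0 / 5.0)
    u = (grid[:, None] - data[None, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return K.sum(axis=1) / (n * h), h

t_grid = np.linspace(0.0, 300.0, 400)
data = np.random.default_rng(0).weibull(1.8, size=150) * 50.0   # illustrative data
f_hat, h = kernel_density(data, t_grid)
```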

The nonparametric estimate R_np(t) of the reliability function can be obtained as

R_np(t) = ∫_t^∞ f̂(τ) dτ
        = ∫_t^∞ (1/(nh)) Σ_{j=1}^n K( (τ − t_j)/h ) dτ
        = (1/(nh)) Σ_{j=1}^n ∫_t^∞ K( (τ − t_j)/h ) dτ
        = (1/n) Σ_{j=1}^n ∫_{(t − t_j)/h}^∞ K(τ) dτ.

To evaluate the integral in the above equation, let

Φ(t) = ∫_{−∞}^t K(x) dx = (1/√(2π)) ∫_{−∞}^t e^{−x²/2} dx.

Then we have

R_np(t) = (1/n) Σ_{j=1}^n [ 1 − Φ( (t − t_j)/h ) ] = 1 − (1/n) Σ_{j=1}^n Φ( (t − t_j)/h ).

To calculate Φ(t), let

∫_0^u e^{−x²} dx = e^{−u²} Σ_{k=0}^∞ 2^k u^{2k+1} / (2k + 1)!!.

It follows that we can write

Φ(t) = ∫_0^t K(x) dx + 0.5
     = (1/√(2π)) ∫_0^t e^{−x²/2} dx + 0.5
     = (1/√(2π)) √2 ∫_0^{t/√2} e^{−τ²} dτ + 0.5
     = (1/√(2π)) √2 e^{−t²/2} Σ_{k=0}^∞ 2^k (t/√2)^{2k+1} / (2k + 1)!! + 0.5
     = (1/√(2π)) e^{−t²/2} Σ_{k=0}^∞ t^{2k+1} / (2k + 1)!! + 0.5.

Note that Φ(t) ≈ 0 when t is sufficiently negative and Φ(t) ≈ 1 when t is sufficiently positive, so the summation only needs to be carried out on a bounded interval of moderate ∣t∣. Since the sum converges quite fast there, the numerical difficulty in the calculation of the nonparametric reliability is overcome. An efficient numerical procedure for evaluating the above summation is given below.

Step 1. Construct a subroutine for calculating Φ(t) as follows:

1. Notation: at input, t stores the point; at output, p stores Φ(t). Other values used in the computation: cc holds each term of the sum; tt stores the value t ∗ t, to save computer time.
2. Let p = t, cc = t, tt = t ∗ t.
3. For k = 1, 2, 3, . . . , perform (4) and (5) below.
4. If ∣cc∣ < tolerance, then set

p = (1/√(2π)) e^{−tt/2} ⋅ p + 0.5,

output p and exit. Otherwise continue with (5).
5. cc = cc ⋅ tt/(2k + 1), p = p + cc.

Step 2. Find the optimal bandwidth h from the six-step procedure introduced in the section “Nonparametric Approach to Reliability”.


Step 3. The reliability function at any given point t is given by

R_np(t) = 1 − (1/n) Σ_{j=1}^n Φ( (t − t_j)/h ).

In several applications to real data, as well as in Monte Carlo simulations, the nonparametric kernel estimate of reliability gives very good results in comparison with the parametric version of the reliability function.

About the Author
For biography see the entry Mathematical and Statistical Modeling of Global Warming.

Cross References
Bayesian Reliability Modeling; Degradation Models in Reliability and Survival Analysis; Imprecise Reliability; Kolmogorov–Smirnov Test; Nonparametric Density Estimation; Ordered Statistical Data: Recent Developments; Tests of Fit Based on Empirical Distribution Function; Weibull Distribution

References and Further Reading
Bean S, Tsokos CP () Developments in nonparametric density estimation. Int Stat Rev
Bean S, Tsokos CP () Bandwidth selection procedures for kernel density estimates. Comm Stat A Theor
Liu K, Tsokos CP () Kernel estimates of symmetric density function. Int J Appl Math
Liu K, Tsokos CP (a) Nonparametric reliability modeling for parallel systems. Stochastic Anal Appl
Liu K, Tsokos CP (b) Optimal bandwidth selection for a nonparametric estimate of the cumulative distribution function. Int J Appl Math
Martz HF, Waller RA () Bayesian reliability analysis. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. Wiley, Chichester
Miller LH () Table of percentage points of Kolmogorov statistics. J Am Stat Assoc
Owen DB () Handbook of statistical tables. Addison-Wesley, Reading, MA
Qiao H, Tsokos CP () Parameter estimation of the Weibull probability distribution. Math Comput Simulat
Qiao H, Tsokos CP () Estimation of the three-parameter Weibull probability distribution. Math Comput Simulat
Rust A, Tsokos CP () On the convergence of kernel estimators of probability density functions. Ann Inst Stat Math
Silverman BW () Density estimation for statistics and data analysis. Monographs on Statistics and Applied Probability. Chapman and Hall, London
Tsokos CP () Reliability growth: nonhomogeneous Poisson process. In: Balakrishnan N, Cohen AC (eds) Recent advances in life-testing and reliability. CRC, Boca Raton
Tsokos CP () Ordinary and Bayesian approach to life testing using the extreme value distribution. In: Basu AP, Basu SK, Mukhopadhyay S (eds) Frontiers in reliability analysis. Series on Quality, Reliability and Engineering Statistics. World Scientific, Singapore
Tsokos CP, Rust A () Recent developments in nonparametric estimation of probability density. In: Applied stochastic processes (Proc Conf, Univ Georgia, Athens, GA). Academic, New York

Parametric Versus Nonparametric Tests

David J. Sheskin, Professor of Psychology, Western Connecticut State University, Danbury, CT, USA

A common distinction made with reference to statistical tests/procedures is the classification of a procedure as parametric versus nonparametric. This distinction is generally predicated on the number and severity of assumptions regarding the population that underlies a specific test. Although some sources use the term assumption free (as well as distribution free) in reference to nonparametric tests, the latter label is misleading, in that nonparametric tests are not typically assumption free. Whereas parametric statistical tests make certain assumptions with respect to the characteristics and/or parameters of the underlying population distribution upon which a test is based, nonparametric tests make fewer or less rigorous assumptions. Thus, as Marascuilo and McSweeney suggest, nonparametric tests should be viewed as assumption freer tests. Perhaps the most common assumption associated with parametric tests that does not apply to nonparametric tests is that the data are derived from a normally distributed population.

Many sources categorize a procedure as parametric versus nonparametric based on the level of measurement the set of data being evaluated represents. Whereas parametric tests are typically employed when interval or ratio data are evaluated, nonparametric tests are used with rank-order (ordinal) and categorical data. Common examples of parametric procedures employed with interval or ratio data are Student’s t tests, analysis of variance procedures (see Analysis of Variance), and the Pearson product moment coefficient of correlation (see Correlation Coefficient).


Examples of commonly described nonparametric tests employed with rank-order data are the Mann–Whitney U test, Wilcoxon’s signed-ranks and matched-pairs signed-ranks tests, the Kruskal–Wallis one-way analysis of variance by ranks, the Friedman two-way analysis of variance by ranks, and Spearman’s rank order correlation coefficient. Examples of commonly described nonparametric tests employed with categorical data are chi-square tests such as the goodness-of-fit test, the test of independence, and the test of homogeneity, and the McNemar test.

Researchers are in agreement that, since ratio and interval data contain a greater amount of information than rank-order and categorical data, if ratio or interval data are available it is preferable to employ a parametric test for the analysis. One reason for preferring a parametric test is that the latter type of test generally has greater power than its nonparametric analog (i.e., a parametric test is more likely to reject a false null hypothesis). If, however, one has reason to believe that one or more of the assumptions underlying a parametric test have been saliently violated (e.g., the assumption of underlying normal population distributions associated with the t test for independent samples), many sources recommend that a nonparametric test (e.g., a rank-order test that does not assume population normality, such as the Mann–Whitney U test, which can also be used to evaluate data involving two independent samples) will provide a more reliable analysis of the data. Yet other sources argue it is still preferable to employ a parametric test under the latter conditions, on the grounds that parametric tests are, for the most part, robust. A robust test is one that still allows a researcher to obtain reasonably reliable conclusions even if one or more of the assumptions underlying the test are violated. As a general rule, when a parametric test is employed in circumstances when one or more of its assumptions are thought to be violated, an adjusted probability distribution is employed in evaluating the data. Sheskin notes that in most instances the debate concerning whether a researcher should employ a parametric or nonparametric test for a specific experimental design turns out to be of little consequence, since in most cases data evaluated with both a parametric test and its nonparametric analog will lead the researcher to the same conclusions.

Cross References
Analysis of Variance; Analysis of Variance Model, Effects of Departures from Assumptions Underlying; Asymptotic Relative Efficiency in Testing; Chi-Square Test: Analysis of Contingency Tables; Explaining Paradoxes in Nonparametric Statistics; Fisher Exact Test; Frequentist Hypothesis Testing: A Defense; Kolmogorov–Smirnov Test; Measures of Dependence; Mood Test; Multivariate Rank Procedures: Perspectives and Prospectives; Nonparametric Models for ANOVA and ANCOVA Designs; Nonparametric Rank Tests; Nonparametric Statistical Inference; Permutation Tests; Randomization Tests; Rank Transformations; Robust Inference; Scales of Measurement and Choice of Statistical Methods; Sign Test; Significance Testing: An Overview; Significance Tests: A Critique; Statistical Inference; Statistical Inference: An Overview; Student’s t-Tests; Validity of Scales; Wilcoxon–Mann–Whitney Test; Wilcoxon Signed-Rank Test

References and Further Reading
Marascuilo LA, McSweeney M () Nonparametric and distribution-free methods for the social sciences. Brooks/Cole, Monterey, CA
Sheskin DJ () Handbook of parametric and nonparametric statistical procedures. Chapman and Hall/CRC, Boca Raton

Pareto Sampling

Lennart Bondesson, Professor Emeritus, Umeå University, Umeå, Sweden

Introduction
Let U = {1, 2, . . . , N} be a population of units. To get information about the population total Y of some interesting y-variable, unequal probability sampling is a widely applied method. Often a random sample of a fixed number, n, of distinct units is to be selected with prescribed inclusion probabilities π_i, i = 1, 2, . . . , N, for the units in U. These π_i should sum to n and are usually chosen to be proportional to some auxiliary variable. In this form unequal


probability sampling is called fixed size πps sampling (ps = proportional to size). With the help of a πps sample and the recorded y-values for the units in the sample, the total Y can be estimated unbiasedly by the Horvitz–Thompson estimator Ŷ_HT = Σ_{i∈s} y_i/π_i, where s denotes the sample. There are many possible fixed size πps sampling designs; see, e.g., Brewer and Hanif and Tillé.

This article treats Rosén’s (a, b) Pareto order πps sampling design. Contrary to many other fixed size πps designs, it is very easy to implement. However, it is only approximate. Independently of Rosén, Saavedra suggested the design. Both were inspired by unpublished work of Ohlsson; cf. Ohlsson.

The Pareto Design
Let U_i, i = 1, 2, . . . , N, be independent U(0, 1)-distributed random numbers. Further, let

Q_i = [U_i/(1 − U_i)] / [p_i/(1 − p_i)],   i = 1, 2, . . . , N,

be so-called ranking variables. Put p_i = π_i, i = 1, 2, . . . , N, and select as sample those n units that have the smallest Q-values. The factual inclusion probabilities π*_i do not equal the desired π_i, but they do so approximately. For d = Σ_{i=1}^N π_i(1 − π_i) not too small, the agreement is surprisingly good and better the larger d is.

The main advantages of the method are its simplicity, its high entropy, and the fact that the U_i can be used as permanent random numbers, i.e., can be reused when at a later occasion the population, more or less altered, is re-sampled. In this way changes can be better estimated.
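The design takes only a few lines of code; the sketch below (an illustration, not from the original entry) draws one Pareto πps sample for target inclusion probabilities that sum to the sample size n.

```python
import numpy as np

def pareto_pps_sample(pi, rng=None):
    """Pareto pi-ps sample: the n units with the smallest ranking variables
    Q_i = [U_i/(1 - U_i)] / [pi_i/(1 - pi_i)], where n = sum(pi)."""
    rng = np.random.default_rng() if rng is None else rng
    pi = np.asarray(pi, dtype=float)
    n = int(round(pi.sum()))
    U = rng.uniform(size=pi.size)            # can be stored as permanent random numbers
    Q = (U / (1.0 - U)) / (pi / (1.0 - pi))
    return np.sort(np.argsort(Q)[:n])        # indices of the sampled units

pi = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.5, 0.6, 0.7, 0.8, 0.9])   # sums to 5
print(pareto_pps_sample(pi, np.random.default_rng(7)))
```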

The name of the method derives from the fact that u/(1 − u) = F^{−1}(u), where F^{−1} is the inverse of the special Pareto distribution function F(x) = x/(1 + x) on (0, ∞). A general order sampling procedure uses instead Q_i = F^{−1}(U_i)/F^{−1}(π_i) for any specified F. As d → ∞, correct inclusion probabilities are obtained, but Rosén showed that the Pareto choice gives the smallest possible asymptotic bias for them. Ohlsson uses Q_i = U_i/π_i, i.e., the distribution function F(x) of a uniform distribution over (0, 1).

Probabilistically the Pareto design is very close to Sampford’s design, for which the factual inclusion probabilities agree with the desired ones. Let x be a binary N-vector such that x_i = 1 if unit i is sampled and x_i = 0 otherwise. The probability function p(x) of the Sampford design is given by

p(x) = C ∏_{i=1}^N π_i^{x_i}(1 − π_i)^{1−x_i} × Σ_{k=1}^N (1 − π_k)x_k,   Σ_{i=1}^N x_i = n,

where C is a constant. For the Pareto design the probability function has the same form, but the factor 1 − π_k is replaced by c_k, where c_k is given by an integral that is closely proportional to 1 − π_k if d is large (Bondesson et al.).

For the Pareto design the factual inclusion probabilities π*_i can be calculated in different ways (Aires; Bondesson et al.). By an iterative procedure based on recalculated factual inclusion probabilities, it is possible to adjust the parameters p_i = π_i for the Pareto procedure so that the desired inclusion probabilities π_i are exactly obtained (Aires). The iterative procedure is time consuming. It is also possible to get a good improvement by a simple adjustment. The ranking variables Q_i are replaced by the adjusted ranking variables

Q_i^Adj = Q_i exp( π_i(1 − π_i)(π_i − 1/2)/d² ).

[Table in the source: for a small population, the target inclusion probabilities π_i are compared with the factual inclusion probabilities π*_i under plain Pareto sampling and under the adjusted ranking variables, illustrating the improvement obtained by the adjustment.]

Restricted Pareto Sampling
Pareto sampling can be extended to cases where there are further restrictions on the sample than just a fixed sample size (Bondesson). Such restrictions appear if the population is stratified in different ways. The restrictions are usually of the form Ax = b, where A is an M × N matrix. Then

Σ_{i=1}^N x_i log Q_i = Σ_{i=1}^N x_i ( log[U_i/(1 − U_i)] − log[π_i/(1 − π_i)] )

is minimized with respect to x given the linear restrictions. This minimization can be done by using a program for combinatorial optimization, but it usually also suffices to use linear programming and the simplex algorithm with the additional restrictions 0 ≤ x_i ≤ 1 for all i. Under some conditions asymptotically correct inclusion probabilities are obtained. However, in practice some adjustment is often needed. A simple adjustment is to replace Q_i by

Q_i^Adj = Q_i exp( π_i(1 − π_i)(π_i − 1/2)(a_i^T Σ^{−1} a_i) ),

where Σ = ADA^T with D = diag(π_1(1 − π_1), . . . , π_N(1 − π_N)), and a_i is the ith column vector of A. This method


is suggested by Bondesson, who also presents another method for improvement, conditional Pareto sampling. For the latter method, the random numbers are conditioned to satisfy AU = ½A1, i.e., AU is forced to equal its expected value.
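A rough sketch of the linear-programming route (an illustration under the relaxation 0 ≤ x_i ≤ 1; any fractional entries in the solution would still need a rounding or repair step, which is omitted here):

```python
import numpy as np
from scipy.optimize import linprog

def restricted_pareto_sample(pi, A, b, rng=None):
    """Relaxed restricted Pareto sampling: minimize sum_i x_i * log(Q_i)
    subject to A x = b and 0 <= x_i <= 1."""
    rng = np.random.default_rng() if rng is None else rng
    pi = np.asarray(pi, dtype=float)
    U = rng.uniform(size=pi.size)
    logQ = np.log(U / (1.0 - U)) - np.log(pi / (1.0 - pi))
    res = linprog(c=logQ, A_eq=A, b_eq=b,
                  bounds=[(0.0, 1.0)] * pi.size, method="highs")
    return res.x     # mostly 0/1 in practice

# Example: total sample size 4, of which 2 units from the first stratum (units 1-5)
pi = np.full(10, 0.4)
A = np.vstack([np.ones(10), np.r_[np.ones(5), np.zeros(5)]])
b = np.array([4.0, 2.0])
print(restricted_pareto_sample(pi, A, b, np.random.default_rng(3)))
```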

About the Author
Lennart Bondesson (academically a grandson of Harald Cramér) received his Ph.D. at Lund. He became professor of mathematical statistics in forestry at the Swedish University of Agricultural Sciences in Umeå, and later professor of mathematical statistics at the Department of Mathematics and Mathematical Statistics, Umeå University. He has published numerous papers and the research book Generalized Gamma Convolutions and Related Classes of Distributions and Densities (Lecture Notes in Statistics, Springer). Professor Bondesson was Editor of the Scandinavian Journal of Statistics.

Cross References
Horvitz–Thompson Estimator; Sampling Algorithms; Uniform Random Number Generators

References and Further Reading
Aires N () Algorithms to find exact inclusion probabilities for conditional Poisson sampling and Pareto πps sampling designs. Meth Comput Appl Probab
Aires N () Comparisons between conditional Poisson sampling and Pareto πps sampling designs. J Stat Plann Infer
Bondesson L () Conditional and restricted Pareto sampling: two new methods for unequal probability sampling. Scand J Stat
Bondesson L, Traat I, Lundqvist A () Pareto sampling versus Sampford and conditional Poisson sampling. Scand J Stat
Brewer KRW, Hanif M () Sampling with unequal probabilities. Lecture Notes in Statistics. Springer, New York
Ohlsson E () Sequential Poisson sampling. J Official Stat
Rosén B (a) Asymptotic theory for order sampling. J Stat Plann Infer
Rosén B (b) On sampling with probability proportional to size. J Stat Plann Infer
Saavedra P () Fixed sample size PPS approximations with a permanent random number. Joint Statistical Meetings, American Statistical Association, Orlando, Florida
Sampford MR () On sampling without replacement with unequal probabilities of selection. Biometrika
Tillé Y () Sampling algorithms. Springer Series in Statistics. Springer Science + Business Media, New York

Partial Least Squares Regression Versus Other Methods

Smail Mahdi, Professor of Mathematical Statistics, Mathematics and Physics, University of the West Indies, Cave Hill Campus, Barbados

Introduction
The concept of regression originated in genetics, and the word regression was introduced into the statistical literature in the paper published by Sir Francis Galton on the relationship between the stature of children and their parents. The relationship was found to be approximately linear with an approximate gradient of 2/3, and this suggested that very tall parents tend to have children shorter than themselves and vice versa. This phenomenon was referred to as regression, or return, to mediocrity or to an average value. The general framework of the linear regression model (see Linear Regression Models) is

Y = Xβ + ε,

where Y is an n × q matrix of observations on q dependent variables, X is an n × p explicative matrix on p variables, β is a p × q matrix of unknown parameters, and ε is an n × q matrix of errors. The rows of ε are independent and identically distributed, often assumed to be Gaussian. The justification for the use of a linear relationship comes from the fact that the conditional mean of a Gaussian random vector given the value of another Gaussian random vector is linear when the joint distribution is Gaussian. Without loss of generality, we will assume throughout that X and Y are mean centered and scaled. The aim is to estimate the unknown matrix β. The standard approach is to solve the least squares optimization problem

L(β) = min_β ∥Y − Xβ∥².

If X has full rank, then a unique solution exists and is given by

β̂_OLS = (X^T X)^{−1} X^T Y.

When the errors are assumed Gaussian, β̂_OLS is also the maximum likelihood estimator and therefore the best linear unbiased estimator (BLUE) of β. Unfortunately, when some of the explanatory variables are collinear, the matrix X does not have full rank and the ordinary least squares estimator may not exist or


becomes inappropriate. To overcome this multicollinearity problem (see Multicollinearity), many other estimators have been proposed in the literature. These include the ridge regression estimator (RR), the principal component regression estimator (PCR), and, more recently, the partial least squares regression estimator (PLSR).

Ridge Regression
The slightest multicollinearity of the independent vectors may make the matrix X^T X ill conditioned and increase the variance of the components of β̂_OLS, which can lead to an unstable estimation of β. Hoerl proposed the ridge regression method (see Ridge and Surrogate Ridge Regressions), which consists in adding a small positive constant λ to the diagonal of the standardized matrix X^T X to obtain the RR estimator as

β̂_RR = (X^T X + λI)^{−1} X^T Y,

where I is the p × p identity matrix. The matrix X^T X + λI is always invertible since it is positive definite. Several techniques are available in the literature for finding the optimum value of λ. Another solution for the collinearity problem is given by the principal component regression (PCR) technique that is outlined below.
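Before turning to PCR, here is a small numpy sketch of the two estimators side by side (the value of λ is arbitrary; in practice it would be chosen by one of the techniques mentioned above):

```python
import numpy as np

def ols(X, Y):
    return np.linalg.solve(X.T @ X, X.T @ Y)                 # (X'X)^{-1} X'Y

def ridge(X, Y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X[:, 3] = X[:, 2] + 1e-3 * rng.normal(size=50)               # nearly collinear columns
Y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=50)
print(ols(X, Y))
print(ridge(X, Y, lam=1.0))
```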

Principal Component Regression
Principal component regression is a principal component analysis (PCA) followed by a linear regression. In this case, the response variables are regressed on the leading principal components of the matrix X, which can be obtained from its singular value decomposition (SVD). PCA is a data-compression technique that consists of finding a rank-k, k = 1, . . . , p, projection matrix P_k such that the variance of the projected data X P_k is maximized. The columns of the matrix P_k consist of the k unit-norm leading eigenvectors of X^T X. Using the matrix P_k, we regress Y on the orthogonal score matrix T_k = X P_k; this yields the k-latent-component regression coefficient matrix

β̂_PCR = P_k (T_k^T T_k)^{−1} T_k^T Y.

However, two major problems may occur: firstly, the choice of k; and secondly, the ignored principal components corresponding to the small eigenvalues may in fact be relevant for explaining the covariance structure of the Y variables. For this reason, PLS comes into play to deal better with the multicollinearity problem by creating X latent components for explaining Y through the maximization of the covariance structure between the X and Y spaces.

Partial Least Squares Regression
Partial least squares (PLS), also known as projection to latent structures, is applied to a broad area of data analysis including generalized linear models, regression, classification, and discrimination. It is a multivariate technique that generalizes and combines ideas from principal component analysis (PCA) and ordinary least squares (OLS) regression. It is designed to confront not only situations with correlated predictors but also relatively small samples, and even the situation where the number of dependent variables exceeds the number of cases. The original idea came from the work of the Swedish statistician Herman Wold and became popular in computational chemistry and sensory evaluation through the work of his son Svante, who developed the popular software SIMCA (soft independent modeling of class analogies) (Wold). PLS first finds two sets of weights ω = (w_1, . . . , w_m) and U = (u_1, . . . , u_m) for X and Y such that Cov(t_l^X, t_l^Y), where t_l^X = X w_l, t_l^Y = Y u_l and l = 1, . . . , m, is maximal. Then the Y variables are regressed on the latent matrix T whose columns are the latent vectors t_l; see, e.g., Vinzi et al. for more details. Classically, ordinary least squares (OLS) regression is used, but other methods have been considered along with the corresponding algorithms. We describe the PLS technique through the algorithm outlined, for instance, in Mevik and Wehrens, which is based on the singular value decomposition (SVD) of cross products of the type X^T Y. First, we set E = X and F = Y. Then we perform the singular value decomposition of E^T F and take the first left-singular vector ω and the first right-singular vector q of E^T F to obtain the scores t and u as follows:

t = Eω   and   u = Fq.

The first normalized score is then obtained as t_1 = t/(t^T t)^{1/2}. The effect of this score t_1 on E and F is obtained by regressing E and F on t_1. This gives the loadings p = E^T t_1 and q = F^T t_1. This effect is then subtracted from E and F to obtain the deflated matrices E′ and F′, each of one rank less; see, for example, Wedderburn. The matrices E′ and F′ are then substituted for E and F, and the process is reiterated again from the singular value decomposition of E^T F to obtain the second normalized latent component t_2. The process is continued until the required number m of latent components is obtained. The optimal value of m is often obtained by cross-validation or the bootstrap; see, for example, Tenenhaus. The obtained latent components t_l, l = 1, . . . , m, are


saved after each iteration as the columns of the latent matrix T. Finally, the original Y variables are regressed on the latent components T to obtain the regression estimator

β̂_PLS = (T^T T)^{−1} T^T Y

and the estimator

Ŷ = T (T^T T)^{−1} T^T Y

of Y. Note that several of the earlier PLS algorithms available in the literature lack robustness. Recently, Simonetti et al. systematically substituted the least median of squares regression (Rousseeuw) for the least squares regression in Garthwaite’s PLS setup to obtain a robust estimation for the considered data set.

Cross References
Chemometrics; Least Squares; Linear Regression Models; Multicollinearity; Multivariate Statistical Analysis; Principal Component Analysis; Ridge and Surrogate Ridge Regressions

References and Further Reading
Galton F () Regression toward mediocrity in hereditary stature. Nature
Garthwaite PH () An interpretation of partial least squares. J Am Stat Assoc
Hoerl AE () Application of ridge analysis to regression problems. Chem Eng Prog
Mevik B, Wehrens R () The pls package: principal component and partial least squares regression in R. J Stat Softw
Rousseeuw PJ () Least median of squares regression. J Am Stat Assoc
Simonetti B, Mahdi S, Camminatiello I () Robust PLS regression based on simple least median squares regression. MTISD Conference, Procida, Italy
Tenenhaus M () La Régression PLS: Théorie et Pratique. Technip, Paris
Vinzi VE, Lauro CN, Amato S () PLS typological regression: algorithms, classification and validation issues. In: Vichi M, Monari P, Mignani S, Montanari A (eds) New developments in classification and data analysis: Proceedings of the Meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society, University of Bologna. Springer, Berlin, Heidelberg
Wedderburn JHM () Lectures on matrices. AMS Colloquium Publications. AMS, New York
Wold H () Estimation of principal components and related models by iterative least squares. In: Krishnaiah PR (ed) Multivariate analysis. Academic, New York
Wold S () Pattern recognition by means of disjoint principal components models. Pattern Recogn

Pattern Recognition, Aspects of

Dražen Domijan, Associate Professor, University of Rijeka, Rijeka, Croatia
Bojana Dalbelo Bašić, Professor, University of Zagreb, Zagreb, Croatia

Introduction
Recognition is regarded as a basic attribute of human beings and other living organisms. A pattern is a description of an object (Tou and Gonzalez). Pattern recognition is the process of assigning category labels to a set of patterns. For instance, the visual patterns “A,” “a,” and “A” are members of the same category, which is labeled “letter A” and can easily be distinguished from the patterns “B,” “b,” and “B,” which belong to another category labeled “letter B.” Humans perform pattern recognition very well, and the central problem is how to design a system to match human performance. Such systems find practical applications in many domains such as medical diagnosis, image analysis, face recognition, speech recognition, handwritten character recognition, and more (Duda et al.; Fukunaga).

The problem of pattern recognition can be tackled using handcrafted rules or heuristics for distinguishing the category of objects, though in practice such an approach leads to a proliferation of rules and exceptions to the rules, and invariably gives poor results (Bishop). Some classification problems can be tackled by syntactic (linguistic) pattern recognition methods, but most real-world problems are tackled using the machine learning approach. Pattern recognition is an interdisciplinary field involving statistics, probability theory, computer science, machine learning, linguistics, cognitive science, psychology, etc. Pattern recognition systems involve the following phases: sensing, feature generation, feature selection, classifier design, and system performance evaluation (Tou and Gonzalez).

In the previous example, the pattern was a set of black and white pixels. However, patterns might also be a set of continuous variables. In statistical pattern recognition, features are treated as random variables. Therefore, patterns are random vectors that are assigned to a class or category with a certain probability. In this case, patterns can be conceived as points in a high-dimensional feature space. Pattern recognition is the task of estimating a function that divides the feature space into regions, where each region corresponds to one of the categories (classes).


Such a function is called the decision or discriminant function, and the surface realized by the function is known as the decision surface.

Feature Generation and Feature Selection
Before the measurement data obtained from the sensor can be utilized for the design of the pattern classifier, it is sometimes necessary to perform several preprocessing steps such as outlier removal, data normalization, and treatment of missing data. After preprocessing, features are generated from the measurements using data reduction techniques, which exploit and remove redundancies in the original data set. A popular way of generating features is to use linear transformations such as the Karhunen–Loève transformation (principal component analysis), independent component analysis, the discrete Fourier transform, discrete cosine and sine transforms, and the Hadamard and Haar transforms. An important consideration in using these transformations is that they should preserve as much of the information that is crucial for the classification task as possible, while removing as much redundant information as possible (Theodoridis and Koutroumbas).

After the features are generated, they can be independently tested for their discriminatory capability. We might select features based on statistical hypothesis testing. For instance, we may employ the t-test or the Kruskal–Wallis test to investigate the difference in mean feature values for two classes. Another approach is to construct the receiver operating characteristic curve and to explore how much overlap exists between the distributions of feature values for the two classes. Furthermore, we might compute class separability measures that take into account correlations between features, such as Fisher’s discriminant ratio and divergence. Besides testing individual features, we might ask what is the feature vector or combination of features that gives the best classification performance. There are several searching techniques, such as sequential backward or forward selection, that can be employed in order to find an answer (Theodoridis and Koutroumbas).

Design of a Pattern Classifier
The principled way to design a pattern classifier would involve characterization of the class probability density functions in the feature space and finding an appropriate discriminant function to separate the classes in this space. Every classifier can make an error by assigning the wrong class label to a pattern. The goal is to find the classifier with the minimal probability of classification error. The best classifier is based on the Bayes decision rule (Fukunaga). The basic idea is to assign a pattern to the class having the highest a posteriori probability for the given pattern. A posteriori probabilities for a given pattern are computed from a priori class probabilities and conditional density functions. However, in practice it is often difficult to compute a posteriori probabilities, as a priori class probabilities are not known in advance.

Therefore, although Bayesian classifiers are optimal, they are rarely used in practice. Instead, classifiers are designed directly from the data. A simple and computationally efficient approach to the design of a classifier is to assume that the discriminant function is linear. In that case, we can construct a decision hyperplane through the feature space defined by

g(x) = w^T x + w_0 = 0,

where w = [w_1, w_2, . . .]^T is a vector of unknown coefficients or weights, x = [x_1, x_2, . . .]^T is a feature vector, and w_0 is a threshold or bias. Finding the unknown weights is called learning or training. In order to find appropriate values for the weights, we can use iterative procedures such as the perceptron learning algorithm. The basic idea is to compute an error or cost function, which measures the difference between the actual classifier output and the desired output. The error function is used to adjust the current values of the weights. This process is repeated until the perceptron converges, that is, until all patterns are correctly classified. This is possible if the patterns are linearly separable. After the perceptron converges, new patterns can be classified according to the simple rule:

If w^T x + w_0 > 0, assign x to class ω_1;
If w^T x + w_0 < 0, assign x to class ω_2,

where ω_1 and ω_2 are class labels.

The problem with linear classifiers is that they lead to suboptimal performance when the classes are not linearly separable. An example of a pattern recognition problem that is not linearly separable is the logical predicate XOR, where the patterns (0,0) and (1,1) belong to class ω_1, and the patterns (0,1) and (1,0) belong to ω_2. It is not possible to draw a straight line (linear decision boundary) in the two-dimensional feature space that will discriminate between these two classes. One approach to dealing with such problems is to build a linear classifier that minimizes the mean square error between the desired and the actual output of the classifier. This can be achieved using the least mean square (LMS) or Widrow–Hoff algorithm for weight adjustment.


Another approach is to design a nonlinear classifier (Bishop). Examples of nonlinear classifiers are the multilayer perceptron trained with error backpropagation, radial basis function networks, the k-nearest neighbor classifier, and decision trees. In practice, it is possible to combine the outputs from several different classifiers in order to achieve better performance. Classification is an example of so-called supervised learning, in which each feature vector has a preassigned target class.

Performance Evaluation of the Pattern Classifier
An important question in the design of a pattern classifier is how well it will perform when faced with new patterns. This is the issue of generalization. During learning, the classifier builds a model of the environment to which it is exposed. The model might vary in complexity. A complex model might offer a better fit to the data, but it might also capture more noise or irrelevant characteristics in the data, and thus be poor in the classification of new patterns. Such a situation is called over-fitting. On the other hand, if the model is of low complexity, it might not fit the data well. The problem is how to select an appropriate level of complexity that will enable the classifier to fit the observed data well, while preserving enough flexibility to classify unobserved patterns. This is known as the bias–variance dilemma (Bishop).

The performance of the designed classifier is evaluated by counting the number of errors committed during testing with a set of feature vectors. Error counting provides an estimate of the classification error probability. The important question is how to choose the set of feature vectors that will be used for building the classifier and the set of feature vectors that will be used for testing. One approach is to exclude one feature vector from the sample, train the classifier on all other vectors, and then test the classifier with the excluded vector. If misclassification occurs, an error is counted. This procedure is repeated N times with a different excluded feature vector every time. The problem with this procedure is that it is computationally demanding. Another approach is to split the data set into two subsets: (1) a training sample used to adjust (estimate) the classifier parameters, and (2) a testing sample that is not used during training but is applied to the classifier following completion of the training. The problem with this approach is that it reduces the size of the training and testing samples, which reduces the reliability of the estimate of the classification error probability (Theodoridis and Koutroumbas).

Acknowledgment
We would like to thank Professor Sergios Theodoridis for helpful comments and suggestions that significantly improved our presentation.

Cross References
Data Analysis; Fuzzy Sets: An Introduction; ROC Curves; Significance Testing: An Overview; Statistical Pattern Recognition Principles

References and Further Reading
Bishop CM () Neural networks for pattern recognition. Oxford University Press, Oxford, UK
Bishop CM () Pattern recognition and machine learning. Springer, Berlin
Duda RO, Hart PE, Stork DG () Pattern classification, 2nd edn. Wiley, New York
Fukunaga K () Introduction to statistical pattern recognition, 2nd edn. Academic, San Diego, CA
Theodoridis S, Koutroumbas K () Pattern recognition. Elsevier
Tou JT, Gonzalez RC () Pattern recognition principles. Addison-Wesley, Reading, MA

Permanents in Probability Theory

Ravindra B. Bapat, Professor and Head, New Delhi Centre, Indian Statistical Institute, New Delhi, India

Preliminaries
If A is an n × n matrix, then the permanent of A, denoted by per A, is defined as

per A = Σ_{σ∈S_n} ∏_{i=1}^n a_{iσ(i)},

where S_n is the set of permutations of 1, 2, . . . , n. Thus the definition of the permanent is similar to that of the determinant, except that all the terms in the expansion have a positive sign.

Example: For a 3 × 3 matrix A = (a_ij), the expansion runs over the 3! = 6 permutations, so that


per A = a_11 a_22 a_33 + a_11 a_23 a_32 + a_12 a_21 a_33 + a_12 a_23 a_31 + a_13 a_21 a_32 + a_13 a_22 a_31.

Permanents find numerous applications in probability theory, notably in the theory of discrete distributions and order statistics. There are two main advantages of employing permanents in these areas. Firstly, permanents serve as a convenient notational device, which makes it feasible to write complex expressions in a compact form. The second advantage, which is more important, is that some theoretical results for the permanent lead to statements of interest in probability theory. This is true mainly of the properties of permanents of entrywise nonnegative matrices.

Although the definition of the permanent is similar to that of the determinant, many of the nice properties of the determinant do not hold for the permanent. For example, the permanent is not well behaved under elementary transformations, except under the transformation of multiplying a row or column by a constant. Similarly, the permanent of the product of two matrices does not equal the product of the permanents in general. The Laplace expansion for the determinant holds for the permanent as well and is a convenient tool for dealing with the permanent. Thus, if A(i, j) denotes the submatrix obtained by deleting row i and column j of the n × n matrix A, then

per A = Σ_{k=1}^n a_ik per A(i, k) = Σ_{k=1}^n a_ki per A(k, i),   i = 1, 2, . . . , n.

We refer to van Lint and Wilson for an introduction to permanents.
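For small matrices the permanent is easy to compute directly from the Laplace expansion; the sketch below does this recursively (exponential time, so only suitable for small n), using an illustrative numeric matrix rather than the symbolic example above.

```python
import numpy as np

def permanent(A):
    """Permanent by Laplace expansion along the first row:
    per A = sum_k a_{1k} * per A(1, k)."""
    A = np.asarray(A)
    n = A.shape[0]
    if n == 1:
        return A[0, 0]
    total = 0
    for k in range(n):
        minor = np.delete(A[1:, :], k, axis=1)   # delete row 1 and column k
        total += A[0, k] * permanent(minor)
    return total

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
print(permanent(A))   # 450: all six products of the expansion added with + signs
```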

Combinatorial ProbabilityMatrices all of whose entries are either or , the so called(, )-matrices, play an important role in combinatorics.Several combinatorial problems can be posed as prob-lems involving counting certain permutations of a niteset of elements and hence can be represented in termsof (, )-matrices. We give two well-know examples incombinatorial probability.Consider n letters and n envelopes carrying the corre-

sponding addresses. If the letters are put in the envelopesat random, what is the probability that none of the lettersgoes into the right envelope?e probability is easily seento be

(1/n!) per [ 0 1 ⋯ 1
             1 0 ⋯ 1
             ⋮ ⋮ ⋱ ⋮
             1 1 ⋯ 0 ],

the matrix being the n × n matrix with zeros on the diagonal and ones elsewhere, and this equals

1 − 1/1! + 1/2! − 1/3! + ⋯ + (−1)^n/n!.

The permanent in the above expression counts the number of derangements of n symbols.
Another problem, which may also be posed as a probability question, is the problème des ménages (Kaplansky and Riordan). In how many ways can n couples be placed at a round table so that men and women sit in alternate places and no one is sitting next to his or her spouse? This number equals n! times the permanent of the matrix J_n − I_n − P_n, where J_n is the n × n matrix of all ones, I_n is the n × n identity matrix, and P_n is the full cycle permutation matrix, having ones at positions (1, 2), (2, 3), . . . , (n − 1, n), (n, 1) and zeroes elsewhere. The permanent can be expressed as

per (J_n − I_n − P_n) = ∑_{i=0}^{n} (−1)^i [2n/(2n − i)] C(2n − i, i) (n − i)!,

where C(2n − i, i) denotes a binomial coefficient.
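The derangement identity above can be checked numerically for small n. The following Python sketch is an added illustration with n chosen arbitrarily; it evaluates per(J_n − I_n)/n! by brute force and compares it with the alternating series.

import math
from itertools import permutations

import numpy as np


def per(A):
    """Permanent by direct expansion (small n only)."""
    n = A.shape[0]
    return sum(np.prod([A[i, s[i]] for i in range(n)]) for s in permutations(range(n)))


n = 6
J, I = np.ones((n, n)), np.eye(n)

p_permanent = per(J - I) / math.factorial(n)
p_series = sum((-1) ** k / math.factorial(k) for k in range(n + 1))
print(p_permanent, p_series)  # both approximately 0.3681 for n = 6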

Discrete Distributions
The densities of some discrete distributions can be conveniently expressed in terms of permanents. We illustrate by an example of the multiparameter ▶multinomial distribution. We first consider the multiparameter binomial. Suppose n coins, not necessarily identical, are tossed once, and let X be the number of heads obtained. Let p_i be the probability of heads on a single toss of the i-th coin, i = 1, 2, . . . , n. Let p be the column vector with components p_1, . . . , p_n, and let 1 be the column vector of all ones. Then it can be verified that

Prob(X = x) = [1/(x!(n − x)!)] per [ p, . . . , p, (1 − p), . . . , (1 − p) ],   (1)

where the column p is repeated x times and the column 1 − p is repeated n − x times.
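Formula (1) can be verified directly for a small number of coins. The Python sketch below is an added illustration: the heads probabilities are hypothetical, the permanent is computed by brute-force expansion (feasible only for small n), and the result is cross-checked by convolving the individual coins.

import math
from itertools import permutations

import numpy as np


def per(A):
    """Permanent by direct expansion (small n only)."""
    n = A.shape[0]
    return sum(np.prod([A[i, s[i]] for i in range(n)]) for s in permutations(range(n)))


p = np.array([0.2, 0.5, 0.7, 0.9])   # hypothetical heads probabilities
n = len(p)

# Prob(X = x) from formula (1): x columns equal to p, n - x columns equal to 1 - p.
for x in range(n + 1):
    M = np.column_stack([p] * x + [1 - p] * (n - x))
    prob = per(M) / (math.factorial(x) * math.factorial(n - x))
    print(x, round(prob, 6))

# Cross-check: the same distribution obtained by convolving the coins one at a time.
dist = np.array([1.0])
for pi in p:
    dist = np.convolve(dist, [1 - pi, pi])
print(np.round(dist, 6))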

Now consider an experiment which can result in any of r possible outcomes, and suppose n trials of the experiment are performed. Let p_{ij} be the probability that the experiment results in the j-th outcome at the i-th trial, i = 1, 2, . . . , n; j = 1, 2, . . . , r. Let P denote the n × r matrix (p_{ij}), which is row stochastic. Let P_1, . . . , P_r be the columns of P. Let X_j denote the number of times the j-th outcome is obtained in the n trials, j = 1, 2, . . . , r. If k_1, . . . , k_r are nonnegative integers summing to n, then as a generalization of (1) we have (Bapat)

Prob(X_1 = k_1, . . . , X_r = k_r) = [1/(k_1! ⋯ k_r!)] per [ P_1, . . . , P_1, . . . , P_r, . . . , P_r ],

where the column P_j is repeated k_j times, j = 1, . . . , r.
Similar representations exist for the multiparameter negative binomial.


Order Statistics
Permanents provide an effective tool in dealing with order statistics corresponding to independent random variables which are not necessarily identically distributed. Let X_1, . . . , X_n be independent random variables with distribution functions F_1, . . . , F_n and densities f_1, . . . , f_n, respectively. Let Y_1 ≤ ⋯ ≤ Y_n denote the corresponding order statistics. We introduce the following notation. For a fixed y, let

f(y) = (f_1(y), . . . , f_n(y))′   and   F(y) = (F_1(y), . . . , F_n(y))′

denote the column vectors of the densities and distribution functions evaluated at y.

For 1 ≤ r ≤ n, the density function of Y_r is given by Vaughan and Venables as

g_r(y) = [1/((r − 1)!(n − r)!)] per [ f(y), F(y), . . . , F(y), 1 − F(y), . . . , 1 − F(y) ],   −∞ < y < ∞,

where the column f(y) appears once, F(y) appears r − 1 times, and 1 − F(y) appears n − r times.

For 1 ≤ r ≤ n, the distribution function of Y_r is given by Bapat and Beg as

Prob(Y_r ≤ y) = ∑_{i=r}^{n} [1/(i!(n − i)!)] per [ F(y), . . . , F(y), 1 − F(y), . . . , 1 − F(y) ],   −∞ < y < ∞,

where the column F(y) appears i times and 1 − F(y) appears n − i times.
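For small n the permanental expression for Prob(Y_r ≤ y) can be evaluated directly. The sketch below is an added illustration, not taken from the cited papers: the component distributions (exponentials with arbitrary rates) are hypothetical, and the permanental value is compared with a Monte Carlo estimate.

import math
from itertools import permutations

import numpy as np


def per(A):
    """Permanent by direct expansion (small n only)."""
    n = A.shape[0]
    return sum(np.prod([A[i, s[i]] for i in range(n)]) for s in permutations(range(n)))


rates = np.array([0.5, 1.0, 2.0, 4.0])   # hypothetical exponential rates
n, r, y = len(rates), 2, 0.8

F = 1.0 - np.exp(-rates * y)             # F_k(y) for each component

# Prob(Y_r <= y): i columns equal to F(y) and n - i columns equal to 1 - F(y).
cdf = sum(
    per(np.column_stack([F] * i + [1 - F] * (n - i)))
    / (math.factorial(i) * math.factorial(n - i))
    for i in range(r, n + 1)
)

# Monte Carlo check on the r-th order statistic.
rng = np.random.default_rng(1)
samples = np.sort(rng.exponential(1 / rates, size=(200_000, n)), axis=1)
print(cdf, (samples[:, r - 1] <= y).mean())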

The permanental representation can be used to extend several recurrence relations for order statistics from the i.i.d. case to the case of nonidentical, independent random variables. Using the Alexandroff inequality for the permanent of a nonnegative matrix, it can be shown (Bapat) that for any y, the sequence Prob(Y_i ≤ y ∣ Y_{i−1} ≤ y) is nonincreasing in i.
For additional material on applications of permanents in order statistics we refer to Balakrishnan and the references contained therein.

Acknowledgments
The support of the JC Bose Fellowship, Department of Science and Technology, Government of India, is gratefully acknowledged.

About the Author
Past President of the Indian Mathematical Society, Professor Bapat joined the Indian Statistical Institute, New Delhi, where he currently holds the position of Head, Delhi Centre. He has held visiting positions at various universities in the U.S., including the University of Connecticut and Oakland University, and has visited several institutes abroad, in countries including France, Holland, Canada, China and Taiwan, for collaborative research and seminars. The main areas of research interest of Professor Bapat are nonnegative matrices, matrix inequalities, matrices in graph theory, and generalized inverses. He has published numerous research papers in these areas in reputed national and international journals, and has written books on linear algebra published by Hindustan Book Agency, Springer and Cambridge University Press. He has recently written a book on mathematics for general readers, in Marathi, which won the state government award for best literature in science. He is an Elected Fellow of the Indian Academy of Sciences, Bangalore, and of the Indian National Science Academy, Delhi.

Cross References
▶Multinomial Distribution
▶Order Statistics
▶Probability Theory: An Outline
▶Random Permutations and Partition Models
▶Univariate Discrete Distributions: An Overview

References and Further Reading
Balakrishnan N () Permanents, order statistics, outliers, and robustness. Rev Mat Complut ():–
Bapat RB () Permanents in probability and statistics. Linear Algebra Appl :–
Bapat RB, Beg MI () Order statistics for nonidentically distributed variables and permanents. Sankhya A ():–
Kaplansky I, Riordan J () The problème des ménages. Scripta Math :–
van Lint JH, Wilson RM () A course in combinatorics, 2nd edn. Cambridge University Press, Cambridge
Vaughan RJ, Venables WN () Permanent expressions for order statistic densities. J R Stat Soc B :–

Permutation Tests

Markus Neuhäuser
Professor
Koblenz University of Applied Sciences, Remagen, Germany

A permutation test is illustrated here for a two-sample comparison. The notation is as follows: two independent groups with sample sizes n and m have independently and


identically distributed values X_1, . . . , X_n and Y_1, . . . , Y_m, respectively, with n + m = N. The means are denoted by X̄ and Ȳ, and the distribution functions by F and G. These distribution functions of the two groups are identical with the exception of a possible location shift: F(t) = G(t − θ) for all t, −∞ < θ < ∞. The null hypothesis states H_0: θ = 0, whereas θ ≠ 0 under the alternative H_1.
In this case Student's t test (see ▶Student's t-Tests) can be applied. However, if F and G are not normal distributions, it may be better to avoid using the t distribution. An alternative method is to use the permutation null distribution of the t statistic.

In order to generate the permutation distribution, all possible permutations under the null hypothesis have to be generated. In the two-sample case, each permutation is a possible (re-)allocation of the N observed values to two groups of sizes n and m. Hence, there are (N choose n) = N!/(n! m!) possible permutations. The test statistic is calculated for each permutation. The null hypothesis can then be accepted or rejected using the permutation distribution of the test statistic, the p-value being the probability of the permutations giving a value of the test statistic as or more supportive of the alternative than the observed value. Thus, inference is based upon how extreme the observed test statistic is relative to other values that could have been obtained under the null hypothesis.
Under H_0 all permutations have the same probability.

Hence, the p-value can simply be computed as the proportion of the permutations with a test statistic value as or more supportive of the alternative than the observed value.

The order of the permutations is important, rather than the exact values of the test statistic. Therefore, modified test statistics can be used (Manly). For example, the difference X̄ − Ȳ can be used instead of the t statistic

t = (X̄ − Ȳ) / (S √(1/n + 1/m)),

where S is the estimated standard deviation.
This permutation test is called the Fisher–Pitman permutation test or randomization test (see ▶Randomization Tests); it is a nonparametric test (Siegel; Manly). However, at least an interval measurement is required for the Fisher–Pitman test because the test uses the numerical values X_1, . . . , X_n and Y_1, . . . , Y_m (Siegel).

The permutation distribution depends on the observed values; therefore a permutation test is a conditional test, and a huge amount of computing is required. As a result, the Fisher–Pitman permutation test was hardly ever applied before the advent of fast PCs, although it was proposed in the 1930s and its high efficiency has been known for decades. Nowadays, the test is often recommended and is implemented in standard software such as SAS (for references see Lehmann, or Neuhäuser and Manly).

Please note that the randomization model of inference does not require randomly sampled populations. For a permutation test it is only required that the groups or treatments have been assigned to the experimental units at random (Lehmann).
A permutation test can be applied with other test statistics, too. Rank-based statistics such as Wilcoxon's rank sum can also be used as the test statistic (see the entry about the Wilcoxon–Mann–Whitney test). Rank tests can also be applied to ordinal data. Moreover, rank tests had advantages in the past because the permutation null distribution can be tabulated. Nowadays, with modern PCs and fast algorithms, permutation tests can be carried out with any suitable test statistic. However, rank tests are relatively powerful in the commonly occurring situation where the underlying distributions are non-normal (Higgins). Hence, permutation tests on ranks are still useful despite the fact that more complicated permutation tests can be carried out (Neuhäuser).

When the sample sizes are very large, the number of possible permutations can be extremely large. Then, a ▶simple random sample of M permutations can be drawn in order to estimate the permutation distribution; M is commonly chosen to be in the thousands or tens of thousands. Please note that the original observed values must be one of the M selected permutations (Edgington and Onghena).
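The random-sampling version of the test is easy to program. The following Python sketch is an added illustration: the two samples and the number of permutations M are arbitrary choices, the difference of means serves as the modified test statistic, and the observed allocation is counted as one of the selected permutations.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observations from the two groups.
x = np.array([4.2, 5.1, 6.3, 5.8, 4.9, 6.0])
y = np.array([3.9, 4.4, 5.0, 4.1, 4.7])

pooled = np.concatenate([x, y])
n = len(x)
observed = x.mean() - y.mean()        # modified test statistic

M = 9_999                             # arbitrary number of random permutations
count = 1                             # the observed allocation counts as one of the M + 1
for _ in range(M):
    perm = rng.permutation(pooled)
    stat = perm[:n].mean() - perm[n:].mean()
    if abs(stat) >= abs(observed):    # two-sided comparison
        count += 1

print("p-value:", count / (M + 1))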

It should be noted that permutation procedures do have some disadvantages. First, they are computer-intensive, although the computational effort no longer seems to be a severe objection against permutation tests. Second, conservatism is often the price for exactness. For this reason, the virtues of permutation tests continue to be debated in the literature (Berger).

When a permutation test is performed for situations other than the comparison of two samples, the principle is analogous. The Fisher–Pitman test can also be carried out with the ANOVA F statistic. Permutation tests can also be applied in the case of more complex designs. For example, the residuals can be permuted (ter Braak; Anderson).

Permutation tests are also possible for situations other than the location-shift model. For example, Janssen presents a permutation test for the ▶Behrens–Fisher problem, where the population variances may differ.


About the Author
For biography see the entry ▶Wilcoxon–Mann–Whitney Test.

Cross References
▶Behrens–Fisher Problem
▶Nonparametric Statistical Inference
▶Parametric Versus Nonparametric Tests
▶Randomization
▶Randomization Tests
▶Statistical Fallacies: Misconceptions, and Myths
▶Student's t-Tests
▶Wilcoxon–Mann–Whitney Test

References and Further Reading
Anderson MJ () Permutation tests for univariate or multivariate analysis of variance and regression. Can J Fish Aquat Sci :–
Berger VW () Pros and cons of permutation tests in clinical trials. Stat Med :–
Edgington ES, Onghena P () Randomization tests, 4th edn. Chapman and Hall/CRC, Boca Raton
Higgins JJ () Letter to the editor. Am Stat
Janssen A () Studentized permutation tests for non-i.i.d. hypotheses and the generalized Behrens–Fisher problem. Stat Probab Lett :–
Lehmann EL () Nonparametrics: statistical methods based on ranks. Holden-Day, San Francisco
Manly BFJ () Randomization, bootstrap and Monte Carlo methods in biology, 3rd edn. Chapman and Hall/CRC, London
Neuhäuser M () Efficiency comparisons of rank and permutation tests. Stat Med :–
Neuhäuser M, Manly BFJ () The Fisher–Pitman permutation test when testing for differences in mean and variance. Psychol Rep :–
Siegel S () Nonparametric statistics for the behavioural sciences. McGraw-Hill, New York
ter Braak CJF () Permutation versus bootstrap significance tests in multiple regression and ANOVA. In: Jöckel KH, Rothe G, Sendler W (eds) Bootstrapping and related techniques. Springer, Heidelberg, pp –

Pharmaceutical Statistics: Bioequivalence

Francis Hsuan
Professor Emeritus
Temple University, Philadelphia, PA, USA

In the terminology of pharmacokinetics, the bioavailability (BA) of a drug product refers to the rate and extent to which its absorbed active ingredient or active moiety becomes available at the site of action. A new/test drug product

(T) is considered bioequivalent to an existing/reference product (R) if there is no significant difference in the bioavailabilities of the two products when they are administered at the same dose under similar conditions in an appropriately designed study. Studies to demonstrate either in-vivo or in-vitro bioequivalence are required for government regulatory approval of generic drug products, or of new formulations of existing products with a known chemical entity.
Statistically, an in-vivo bioequivalence (BE) study employs a crossover design with the T and R drug products administered to healthy subjects on separate occasions according to a pre-specified randomization schedule, with ample washout between the occasions. The concentration of the active ingredient of interest in blood is measured over time for each subject in each occasion, resulting in multiple concentration-time profile curves for each subject. From these curves a number of bioavailability measures, such as the area under the curve (AUC) and the peak concentration (Cmax), in either raw or logarithmic scale, are then computed, statistically modeled, and analyzed for bioequivalence. Let Y_{ijkt} be a log-transformed bioavailability measure of subject j in treatment sequence i at period t, having received treatment k. In the simplest 2 × 2 crossover design, with two treatment sequences T/R (i = 1) and R/T (i = 2), Y_{ijkt} is assumed to follow the mixed-effects ANOVA model

Y_{ijkt} = µ + ω_i + ϕ_k + π_t + S_{ij} + ε_{ijkt},

where µ is an overall constant; ω_i, ϕ_k and π_t are, respectively, (fixed) effects of sequence i, formulation k and period t; S_{ij} is the (random) effect of subject j in sequence i; and ε_{ijkt} is the measurement error. In this model, the between-subject variation is captured by the random component S_{ij} ∼ N(0, σ²_B), and the within-subject variation ε_{ijkt} ∼ N(0, σ²_{Wk}) is allowed to depend on the formulation k. Bioequivalence of T and R is declared when the null hypothesis H_0: ϕ_T ≤ ϕ_R − ε or ϕ_T ≥ ϕ_R + ε can be rejected against the alternative H_1: −ε ≤ ϕ_T − ϕ_R ≤ ε at some significance level α, where ε = log(1.25) ≈ 0.223 and α = 0.05 are set by the regulatory agencies. This is referred to as the criterion for average BE. A common method to establish average BE is to calculate the shortest (1 − 2α) × 100% confidence interval for δ = ϕ_T − ϕ_R and show that it is contained in the equivalence interval (−ε, ε).

Establishing average BE using any nonreplicated k × k crossover design can be conducted along the same lines as described above. For certain types of drug products, such as those with a Narrow Therapeutic Index (NTI), questions have been raised regarding the adequacy of the average BE method (e.g., Blakesley et al.).


To address this issue, other criteria and/or statistical methods for bioequivalence have been proposed and studied in the literature. In particular, population BE (PBE) assesses the difference between T and R in both the means and the variances of the bioavailability measures, and individual BE (IBE) assesses, in addition, the variation of the average T and R difference among individuals. Some of these new concepts, notably individual bioequivalence, require higher-order crossover designs such as TRT/RTR. Statistical designs and analyses for population BE and individual BE are described in detail in a statistical guidance for industry (USA FDA) and in several monographs (Chow and Shao; Wellek). Hsuan and Reeve proposed a unified procedure to establish IBE using any high-order crossover design and a multivariate ANOVA model. Recently the USA FDA has proposed a process of making publicly available guidances on how to design bioequivalence studies for specific drug products.
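A minimal sketch of the average BE calculation for a 2 × 2 crossover is given below. It is an added illustration, not the FDA-specified analysis: the log(AUC) values are invented, and the confidence interval is obtained from the per-subject period differences within each sequence, under the mixed-effects model stated above.

import numpy as np
from scipy import stats

# Hypothetical log(AUC) values; columns are period 1 and period 2.
seq_TR = np.array([[4.51, 4.38], [4.80, 4.72], [4.10, 4.15], [4.62, 4.50], [4.33, 4.29]])
seq_RT = np.array([[4.41, 4.52], [4.65, 4.70], [4.22, 4.19], [4.58, 4.66], [4.37, 4.45]])

# Half the within-subject period difference; the subject effect S_ij cancels.
d1 = (seq_TR[:, 0] - seq_TR[:, 1]) / 2.0      # sequence T/R
d2 = (seq_RT[:, 0] - seq_RT[:, 1]) / 2.0      # sequence R/T

n1, n2 = len(d1), len(d2)
delta_hat = d1.mean() - d2.mean()             # estimates phi_T - phi_R
s2 = ((n1 - 1) * d1.var(ddof=1) + (n2 - 1) * d2.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(s2 * (1.0 / n1 + 1.0 / n2))

alpha, eps = 0.05, np.log(1.25)
t_crit = stats.t.ppf(1 - alpha, df=n1 + n2 - 2)
lower, upper = delta_hat - t_crit * se, delta_hat + t_crit * se   # shortest 90% CI

print("90% CI for delta:", (lower, upper))
print("average BE declared:", (-eps < lower) and (upper < eps))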

About the Author
Dr. Francis Hsuan has been a faculty member in the Department of Statistics at the Fox School, Temple University, for many years. He received his Ph.D. in Statistics from Cornell University and his B.S. in Physics from the National Taiwan University. He was also a visiting scholar at the Harvard School of Public Health, working on projects related to the analysis of longitudinal categorical data with informative missingness.

Cross References
▶Biopharmaceutical Research, Statistics in
▶Equivalence Testing
▶Statistical Analysis of Drug Release Data Within the Pharmaceutical Sciences

References and Further Reading
Blakesley V, Awni W, Locke C, Ludden T, Granneman GR, Braverman LE () Are bioequivalence studies of levothyroxine sodium formulations in euthyroid volunteers reliable? Thyroid :–
Chow SC, Shao J () Statistics in drug research: methodologies and recent developments. Marcel Dekker, New York
Hsuan F, Reeve R () Assessing individual bioequivalence with high-order crossover designs: a unified procedure. Stat Med :–
FDA () Guidance for industry on statistical approaches to establishing bioequivalence. Center for Drug Evaluation and Research, Food and Drug Administration, Rockville, Maryland, USA, http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/ucm.pdf
FDA () Guidance for industry on bioequivalence recommendations for specific products. Center for Drug Evaluation and Research, Food and Drug Administration, Rockville, Maryland, USA, http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/ucm.pdf
Wellek S () Testing statistical hypotheses of equivalence. Chapman and Hall/CRC, Florida, USA

Philosophical Foundations of Statistics

Inge S. Helland
Professor
University of Oslo, Oslo, Norway

The philosophical foundations of statistics involve issues in theoretical statistics, such as goals and methods to meet these goals, and the interpretation of the meaning of inference using statistics. They are related to the philosophy of science and to the ▶philosophy of probability.

As with any other science, the philosophical foundations of statistics are closely connected to its history, which again is connected to the men with whom different philosophical directions can be associated. Some of the most important names in this connection are Thomas Bayes, Ronald A. Fisher and Jerzy Neyman.

The standard statistical paradigm is tied to the concept of a statistical model, an indexed family of probability measures P_θ(⋅) on the observations, indexed by the parameters θ. Inference is done on the parameter space. This paradigm was challenged by Breiman, who argued for an algorithmic, more intuitive model concept. Breiman's tree models are still much in use, together with other algorithmic bases for inference, for instance within chemometrics. For an attempt to explain some of these within the framework of the ordinary statistical paradigm, see Helland.

On the other hand, not all indexed families of distributions lead to sensible models. McCullagh showed that several absurd models can be produced.

The standard statistical model concept can be extended by implementing some kind of model reduction (Wittgenstein: "The process of induction is the process of assuming the simplest law that can be made to harmonize with our experience") or by, e.g., adjoining a symmetry group to the model (Helland).

To arrive at methods of inference, the model concept must be supplemented by certain principles. In this connection, an experiment is ordinarily seen as given by a statistical model together with some focus parameter.


Most statisticians agree at least with some variants of the following three principles: the conditionality principle (when you choose an experiment randomly, the information in this large experiment, including the ▶randomization, is no more than the information in the selected experiment), the sufficiency principle (experiments with equal values of a sufficient statistic have equal information), and the likelihood principle (all the information about the parameter is contained in the likelihood for the parameter, given the observations). Birnbaum's famous theorem says that the likelihood principle follows from the conditionality principle together with the sufficiency principle (for some precisely defined versions of these principles). This, and the principles themselves, are discussed in detail in Berger and Wolpert.

Berger and Wolpert also argue that the likelihood principle "nearly" leads to Bayesian inference, as the only mode of inference which really satisfies the likelihood principle. This whole chain of reasoning has been countered by Kardaun et al. and by leCam, who states that he prefers to be a little "unprincipled."

To arrive at statistical inference, whether it is point estimates, confidence intervals (credibility intervals for Bayesians) or hypothesis testing, we need some decision theory (see ▶Decision Theory: An Introduction, and ▶Decision Theory: An Overview). Such decision theory may be formulated differently by different authors. The foundation of statistical inference from a Bayesian point of view is discussed by Good, Lindley and Savage. From the frequentist point of view it is argued by Efron that one should be a little more informal; but note that a decision theory may be very useful also in this setting.

The philosophy of the foundations of statistics involves many further questions which have a direct impact on the theory and practice of statistics: conditioning, randomization, shrinkage, subjective or objective priors, reference priors, the role of information, the interpretation of probabilities, the choice of models, optimality criteria, nonparametric versus parametric inference, the principle of an axiomatic foundation of statistics, and so on. Some papers discussing these issues are Cox (with discussion), Kardaun et al., and Efron. The last paper takes up the important issue of the relationship between statistical theory and the ongoing revolution in computers.

The struggle between ▶Bayesian statistics and frequentist statistics is not as hard today as it used to be some years ago, partly because it has been realized that the two schools often lead to similar results. As far as hypothesis testing and confidence intervals are concerned, the frequentist school and the Bayesian school must be adjoined by the Fisherian or fiducial school, although the latter is largely out of fashion today.

The question of whether these three schools can, in some sense, agree on testing is addressed by Berger.

The field of design of experiments also has its own philosophical foundation, touching upon practical issues like randomization, blocking and replication, and linked to the philosophy of statistical inference. A good reference here is Cox and Reid.

About the Author
Inge Svein Helland is Professor, University of Oslo. He was Head of the Department of Mathematical Sciences, Agricultural University of Norway, and Head of the Statistics Division, Department of Mathematics, University of Oslo. He was Associate Editor of the Scandinavian Journal of Statistics. Professor Helland has (co-)authored numerous papers, reports and manuscripts, including the book Steps Towards a Unified Basis for Scientific Models and Methods (World Scientific, Singapore).

Cross References
▶Bayesian Analysis or Evidence Based Statistics?
▶Bayesian Versus Frequentist Statistical Reasoning
▶Bayesian vs. Classical Point Estimation: A Comparative Overview
▶Decision Theory: An Introduction
▶Decision Theory: An Overview
▶Foundations of Probability
▶Likelihood
▶Model Selection
▶Philosophy of Probability
▶Randomization
▶Statistical Inference: An Overview

References and Further Reading
Berger JO () Could Fisher, Jeffreys and Neyman have agreed on testing? Stat Sci :–
Berger JO, Wolpert RL () The likelihood principle. Lecture Notes, Monograph Series, vol , Institute of Mathematical Statistics, Hayward
Breiman L () Statistical modelling: the two cultures. Stat Sci :–
Cox DR () The current position of statistics: a personal view. Int Stat Rev :–
Cox DR, Reid N () The theory of design of experiments. Chapman and Hall/CRC, Boca Raton, FL
Efron B () Controversies in the foundations of statistics. Am Math Mon :–
Efron B () Computers and the theory of statistics: thinking the unthinkable. SIAM Rev :–
Efron B () Why isn't everyone Bayesian? Am Stat :–
Good IJ () The interface between statistics and philosophy of science. Stat Sci :–
Helland IS () Statistical inference under symmetry. Int Stat Rev :–
Helland IS () Steps towards a unified basis for scientific models and methods. World Scientific, Singapore
Kardaun OJWF, Salomé D, Schaafsma W, Steerneman AGM, Willems JC, Cox DR () Reflections on fourteen cryptic issues concerning the nature of statistical inference. Int Stat Rev :–
leCam L () Discussion in Berger and Wolpert
Lindley DV () The philosophy of statistics. Statistician :–
McCullagh P () What is a statistical model? Ann Stat :–
Savage LJ () The foundations of statistics. Dover, New York
Wittgenstein L () Tractatus Logico-Philosophicus (trans. Pears and McGuinness). Routledge Kegan Paul, London

Philosophy of Probability

Martin Peterson
Associate Professor
Eindhoven University of Technology, Eindhoven, Netherlands

The probability calculus was created in 1654 by Pierre de Fermat and Blaise Pascal. The philosophy of probability is the philosophical inquiry into the semantic and epistemic properties of this mathematical calculus. The question at the center of the philosophical debate is what it means to say that the probability of an event or proposition equals a certain numerical value, or in other words, what the truth-conditions for a probabilistic statement are. There is significant disagreement about this, and there are two major camps in the debate, viz., objectivists and subjectivists.

Objectivists maintain that statements about probability refer to some features of the external world, such as the relative frequency of some event. Probabilistic statements are thus objectively true or false, depending on whether they correctly describe the relevant features of the external world. For example, some objectivists maintain that it is true that the probability that a coin will land heads is 1/2 if and only if the relative frequency of this type of event is 1/2. (Other objective interpretations will be discussed below.)

Subjectivists disagree with this picture. They deny that statements about probability should be understood as claims about the external world. On their view, they should rather be understood as claims about the speaker's degree of belief that an event will occur. Consider, for example, Mary's probability that her suitor will propose. What is Mary's probability that this event will occur? It seems rather pointless to count the number of marriage proposals that other people get, because this does not tell us anything about the probability that Mary will be faced with a marriage proposal. If it is true that Mary's probability takes a certain value, then it is true because of her mental state, i.e., her degree of belief that her suitor will propose. Unfortunately, nothing follows from this about whether her suitor actually will propose or not. Mary's probability that her suitor will propose may be high even if the suitor feels that marriage is totally out of the question.

Both the objective and subjective interpretations are compatible with Kolmogorov's axioms. These axioms come out as true irrespective of whether we interpret them along the lines suggested by objectivists or subjectivists. However, more substantial questions about what the probability of an event is may depend on which interpretation is chosen. For example, Mary's subjective belief that her suitor will propose might be low, although the objective probability is quite high.

The notion of subjective probability is closely related to ▶Bayesian statistics, presented elsewhere in this book. (In recent years some authors have, however, also developed objectivist accounts of Bayesian statistics.)

In what follows we shall give a more detailed overview of some of the most well-known objective and subjective interpretations, viz. the relative-frequency view, the propensity interpretation, the logical interpretation, and the subjective interpretation. We shall start, however, with the classical interpretation. It is, strictly speaking, neither an objective nor a subjective interpretation, since it is silent on many of the key questions discussed by objectivists and subjectivists.

The classical interpretation, advocated by Laplace, Pascal, Bernoulli and Leibniz, holds the probability of an event to be a fraction of the total number of possible ways in which the event can occur. Hence, the probability that you will get a six if you roll a six-sided die is 1/6. However, it takes little reflection to realize that this interpretation presupposes that all possible outcomes are equally likely. This is not always a plausible assumption, as Laplace and others were keen to stress. For an extreme example, consider the weather in the Sahara desert. It seems to be much more probable that it will be sunny in the Sahara desert tomorrow than not, but according to the classical interpretation the probability is 1/2 (since there are two possible outcomes, sun or no sun). Another problem with the classical interpretation is that it is not applicable when the number of possible outcomes is infinite. Then the probability of every possibility would be zero, since the ratio between any finite number and infinity is zero.

The frequency interpretation, briefly mentioned above, holds that the probability of an event is the ratio between the number of times the event has occurred and the total number of observed cases. Hence, if you toss a coin a large number of times and it lands heads up in a certain proportion of the tosses, then that relative frequency (the number of heads divided by the number of tosses) is taken to be the probability of heads. A major challenge for anyone seeking to defend the frequency interpretation is to specify which reference class is the relevant one and why. For example, suppose I toss the coin the same number of times again, and that it lands heads up in a somewhat different proportion of the tosses. Does this imply that the probability has changed from the first relative frequency to the second? Or is the "new" probability the relative frequency computed over the two series combined? The physical constitution of the coin is clearly the same.

The problem of identifying the relevant reference class becomes particularly pressing when the frequency interpretation is applied to unique events, i.e., events that only occur once, such as the US presidential election in 2000 in which George W. Bush won over Al Gore. The week before the election, many political commentators assigned a definite probability to a Bush victory. Now, to which reference class does this event belong? If this is a unique event, the reference class has, by definition, just one element, viz., the event itself. So according to the frequency interpretation the probability that Bush was going to win was 1, since Bush actually won the election. This cannot be the right conclusion.

Venn famously argued that the frequency interpretation makes sense only if the reference class is taken to be infinitely large. More precisely put, he pointed out that one should distinguish sharply between the underlying limiting frequency of an event and the frequency observed so far. The limiting frequency is the proportion of successful outcomes one would get if one were to repeat one and the same experiment infinitely many times. So even though the US presidential election in 2000 did actually take place just once, we can nevertheless imagine what would have happened had it been repeated infinitely many times.
Of course, we cannot actually toss a coin infinitely many times, but we could imagine doing so. Therefore, the limiting frequency is often thought of as an abstraction, rather than as a series of events that take place in the real world. This point has some important philosophical implications. First, it seems that one can never be sure that a limiting relative frequency exists. When tossing a coin, the relative frequency of heads will perhaps never converge towards a specific number. In principle, it could oscillate forever. Moreover, the limiting relative frequency seems to be inaccessible from an epistemic point of view, even in principle. If you observe that the relative frequency of a coin landing heads up is close to 1/2 in a series of ten million tosses, this does not exclude that the true long-run frequency is much lower or higher than 1/2. No finite sequence of observations can prove that the limiting frequency is even close to the observed frequency.

According to the propensity interpretation, probabilities should be identified with another feature of the external world, namely the propensity (or disposition or tendency) of an object to give rise to a certain effect. For instance, symmetrical coins typically have a propensity to land heads up about every second time they are tossed, which means that their probability of doing so is about one in two.

The propensity interpretation was developed by Popper in the 1950s. His motivation for developing this view was that it avoids the problem, faced by the frequency view, of assigning probabilities to unique events. Even an event that cannot take place more than once can nevertheless have a certain propensity (or tendency) to occur. However, Popper's version of the theory also sought to connect propensities with long-run frequencies whenever the latter existed. Thus, his theory is perhaps best thought of as a hybrid between the two views. Contemporary philosophers have proposed "pure" versions of the propensity interpretation, which make no reference whatsoever to long-run frequencies.

A well-known objection to the propensity interpretation is Humphreys' paradox. To state this paradox, recall that conditional probabilities can be "inverted" by using ▶Bayes' theorem. Thus, if we know the probability of A given B, we can calculate the probability of B given A, provided that we know the priors. The point made by Humphreys is that propensities cannot be inverted in this sense. Suppose, for example, that we know the probability that a train will arrive on time at its destination given that it departs on time. Then it makes sense to say that if the train departs on time, it has a propensity to arrive on time at its destination. However, even though it makes sense to speak of the inverted probability, i.e., the probability that the train departed on time given that it arrived on time, it makes no sense to speak of the corresponding inverted propensity. No one would admit that the on-time arrival of the train has a propensity to make it depart on time a few hours earlier.

The thrust of Humphreys' paradox is thus the following: even though we may not know exactly what a propensity (or disposition or tendency) is, we do know that propensities have a temporal direction. If A has a propensity to give rise to B, then A cannot occur after B. In this respect, propensities function very much like causality; if A causes B, then A cannot occur after B. However, probabilities lack this temporal direction. What happens now can tell us something about the probability of past events, and reveal information about past causes and propensities, although probability in itself is a non-temporal concept. Hence, it seems that it would be a mistake to identify probabilities with propensities.

The logical interpretation of probability was developed by Keynes and Carnap. Its basic idea is that probability is a logical relation between a hypothesis and the evidence supporting it.


More precisely put, the probability relation is best thought of as a generalization of the principles of deductive logic, from the deterministic case to the indeterministic one. For example, if an unhappy housewife assigns a certain numerical probability to the hypothesis that her marriage will end in a divorce, this means that the evidence she has at hand (no romantic dinners, etc.) entails the hypothesis that the marriage will end in a divorce to a certain degree, represented by that number. Coin tossing can be analyzed along the same lines. The evidence one has about the shape of the coin and past outcomes entails the hypothesis that it will land heads up to a certain degree, and this degree is identical with the probability of the hypothesis being true.

Carnap's analysis of the logical interpretation is quite sophisticated, and cannot easily be summarized here. However, a general difficulty with logical interpretations is that they run the risk of being too dependent on evidence. Sometimes we wish to use probabilities for expressing mere guesses that have no correlation whatsoever with any evidence. For instance, I may think it quite probable that it will be sunny in Rio de Janeiro tomorrow without basing this guess on any meteorological evidence. I am just guessing: the set of premises leading up to the hypothesis that it will be sunny is empty; hence, there is no genuine "entailment" going on here. So how can the hypothesis that it will be sunny in Rio de Janeiro tomorrow be entailed to that, or to any other, degree?

It could be replied that pure guesses are irrational, and that it is therefore not a serious problem if the logical interpretation cannot handle this example. However, it is not evident that this is a convincing reply. People do use probabilities for expressing pure guesses, and the probability calculus can easily be applied for checking whether a set of such guesses is coherent or not. If one assigns probability p to sun, it is, for instance, correct to conclude that the probability that it will not be sunny is 1 − p. This is no doubt a legitimate way of applying the probability calculus. But if we accept the logical interpretation we cannot explain why this is so, since this interpretation defines probability as a relation between a (non-empty) set of evidential propositions and a hypothesis.

Let us now take a closer look at the subjective interpretation. The main idea is that probability is a kind of mental phenomenon. Probabilities are not part of the external world; they are entities that human beings somehow create in their minds. If you claim that the probability of sun tomorrow equals a certain number, this merely means that your subjective degree of belief that it will be sunny tomorrow has a strength that can be represented by that number. Of course, whether it actually will rain tomorrow depends on objective events in the external world, rather than on your beliefs. So it is probable that it will rain tomorrow just in case you believe to a certain degree that it will rain, irrespective of what the weather is actually like tomorrow. However, this should not be taken to mean that any subjective degree of belief is a probability. Advocates of the subjective approach stress that for a partial belief to qualify as a probability, one's degrees of belief must be compatible with the axioms of the probability calculus.

Subjective probabilities can vary across people. Mary's degree of belief that it will rain tomorrow might be strong, at the same time as your degree of belief is much lower. This just means that your mental dispositions are different. When two decision makers hold different subjective probabilities, they just happen to believe something to different degrees. It does not follow that at least one person has to be wrong. Furthermore, if there were no humans around at all, i.e., if all believing entities were extinct, it would simply be false that some events happen with a certain probability, including quantum-mechanical events. According to the pioneering subjectivist Bruno de Finetti, "Probability does not exist."
Subjective views have been around for almost a century. de Finetti's pioneering work was published in 1931. Ramsey presented a similar subjective theory in a paper written in 1926 and published posthumously in 1931. However, most modern accounts of subjective probability start off from Savage's theory, presented in 1954, which is more precise from a technical point of view. The key idea in all three accounts is to introduce an ingenious way in which subjective probabilities can be measured. The measurement process is based on the insight that the degree to which a decision maker believes something is closely linked to his or her behavior. Imagine, for instance, that we wish to measure Mary's subjective probability that the coin she is holding in her hand will land heads up the next time it is tossed. First, we ask her which of the following very generous options she would prefer.

(a) "If the coin lands heads up you win a trip to Bahamas; otherwise you win nothing"

(b) "If the coin does not land heads up you win a trip to Bahamas; otherwise you win nothing"

Suppose Mary prefers A to B. We can then safely conclude that she thinks it is more probable that the coin will land heads up rather than not. This follows from the assumption that Mary prefers winning a trip to Bahamas to winning nothing, and that her preference between uncertain prospects is entirely determined by her beliefs and desires with respect to her prospects of winning the trip to Bahamas. If she, on the other hand, prefers B to A, she thinks it is more probable that the coin will not land heads up, for the same reason.


Furthermore, if Mary is indifferent between A and B, her subjective probability that the coin will land heads up is exactly 1/2. This is because no other probability would make both options come out as equally attractive, irrespective of how strongly she desires a trip to Bahamas, and irrespective of which decision rule she uses for aggregating her desires and beliefs into preferences.

Next, we need to generalize the measurement procedure outlined above so that it always allows us to represent Mary's degree of belief by a precise numerical probability. To do this, we need to ask Mary to state preferences over a much larger set of options and then reason backwards. Here is a rough sketch of the main idea. Suppose that Mary wishes to measure her subjective probability that her valuable etching by Picasso will be stolen within one year, and consider the highest price she is prepared to pay for a gamble in which she receives an amount equal to the value of the Picasso if the event S: "The Picasso is stolen within a year" takes place, and nothing otherwise; that is, the price she regards as a fair price for insuring her Picasso. Given that she forms her preferences in accordance with the principle of maximizing expected monetary value, Mary's subjective probability for S is then the ratio of this fair insurance price to the value of the Picasso: the more she is prepared to pay for the insurance, the higher her subjective probability.

Now, it seems that we have a general method for measuring Mary's subjective probability: we just ask her how much she is prepared to pay for "buying a contract" that will give her a fixed income if the event we wish to assign a subjective probability to takes place. The highest price she is prepared to pay is, by assumption, so high that she is indifferent between paying the price and not buying the contract. (This assumption is required for representing probabilities with precise numbers; if buying and selling prices are allowed to differ, we can sometimes use intervals for representing probabilities. See, e.g., Borel and Baudain, and Walley.)
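The fair-price argument can be spelled out in one line. Writing c for the highest price Mary is prepared to pay for the contract and V for the amount she receives if the event S occurs, and assuming, as above, that she maximizes expected monetary value, indifference at the price c means that the expected value of buying the contract equals the expected value of abstaining:

\[ -c + p(S)\,V = 0 \quad\Longrightarrow\quad p(S) = \frac{c}{V}. \]

This is only a sketch under the expected-monetary-value assumption; as discussed below, that assumption is usually replaced by purely structural axioms on preferences.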

The problem with this method is that very few people form their preferences in accordance with the principle of maximizing expected monetary value. Most people have a decreasing marginal utility for money. However, since we do not know anything about Mary's utility function for money, we cannot replace the monetary outcomes in the examples with the corresponding utility numbers. Furthermore, it also makes little sense to presuppose that Mary uses a specific decision rule, such as the expected utility principle, for forming preferences over uncertain prospects. Typically, we do not know anything about how people form their preferences.

Fortunately, there is a clever solution to all these problems. The main idea is to impose a number of structural conditions on preferences over uncertain options. The structural conditions, or axioms, merely restrict what combinations of preferences it is legitimate to have. For example, if Mary strictly prefers option A to option B in the Bahamas example, then she must not strictly prefer B to A. Then, the subjective probability function is established by reasoning backwards while taking the structural axioms into account: since the decision maker preferred some uncertain options to others, and her preferences over uncertain options satisfy a number of structural axioms, the decision maker behaves as if she were forming her preferences over uncertain options by first assigning subjective probabilities and utilities to each option, and thereafter maximizing expected utility. A peculiar feature of this approach is, thus, that probabilities (and utilities) are derived from "within" the theory. The decision maker does not prefer an uncertain option to another because she judges the subjective probabilities (and utilities) of the outcomes to be more favorable than those of another. Instead, the well-organized structure of the decision maker's preferences over uncertain options logically implies that they can be described as if her choices were governed by a subjective probability function and a utility function, constructed such that a preferred option always has a higher expected utility than a non-preferred option. These probability and utility functions need not coincide with the ones outlined above in the Bahamas example; all we can be certain of is that there exist some functions that have the desired technical properties.

Cross References
▶Axioms of Probability
▶Bayes' Theorem
▶Bayesian Statistics
▶Foundations of Probability
▶Fuzzy Set Theory and Probability Theory: What is the Relationship?
▶Philosophical Foundations of Statistics
▶Probability Theory: An Outline
▶Probability, History of

References and Further Reading
Borel E, Baudain M () Probabilities and life. Dover Publishers, New York
de Finetti B (/) Probabilism: a critical essay on the theory of probability and on the value of science. Erkenntnis :– (trans of: de Finetti B () Probabilismo. Logos :–)
Humphreys P () Why propensities cannot be probabilities. Philos Rev :–
Jeffrey R () The logic of decision, 2nd edn (significant improvements from 1st edn). University of Chicago Press, Chicago
Kreps DM () Notes on the theory of choice. Westview Press, Boulder
Laplace PS () A philosophical essay on probabilities (English transl). Dover Publications, New York
Mellor DH () The matter of chance. Cambridge University Press, Cambridge
Popper K () The propensity interpretation of the calculus of probability and the quantum theory. In: Körner S (ed) The Colston papers, vol , pp –
Ramsey FP () Truth and probability. In: Braithwaite RB (ed) The foundations of mathematics and other logical essays, Ch. VII. Kegan, Paul, Trench, Trubner & Co, London; Harcourt, Brace and Company, New York, pp –
Savage LJ () The foundations of statistics. Wiley, New York; 2nd edn, Dover
Walley P () Statistical reasoning with imprecise probabilities. Monographs on statistics and applied probability. Chapman & Hall, London

Point Processes

David Vere-Jones
Professor Emeritus
Victoria University of Wellington, Wellington, New Zealand

"Point processes" are locally finite (i.e., having no finite accumulation points) families of events, typically occurring in time, but often with additional dimensions (marks) to describe their locations, sizes and other characteristics.

The subject originated in attempts to develop life-tables. The first such studies included one by Newton and another by his younger contemporary Halley.
Newton's study was provoked by his life-long religious concerns. In his book "Chronology of Ancient Kingdoms Amended" he set out to estimate the dates of various biblical events by counting the number of kings or other rulers between two such biblical events and allowing each ruler a characteristic length of reign. The value he used seems to have been arrived at by putting together all the observations from history that he could find (he included rulers from British, classical, European and biblical histories and even included Methuselah's quoted age among his data points) and taking some average of their lengths of rule, but he acknowledged himself that his methods were informal and personal.

Halley, by contrast, was involved in a scientific exercise, namely the development of actuarial tables for use in calculating pensions and annuities. Indeed, he was requested by the newly established Royal Society to prepare such a table from records in Breslau, a city selected because it had been less severely affected by the plague than most of the larger cities in Europe, so that its mortality data were felt more likely to be typical of those of cities and periods in normal times.

Another important early stimulus for point process studies was the development of telephone engineering. This application prompted Erlang's early studies of the Poisson process (see ▶Poisson Processes), which laid down many of the concepts and procedures, such as ▶renewal processes and forward and backward recurrence times, subsequently entering into point process theory. It was also the context of Palm's deep studies of telephone-traffic issues. Palm was the first to use the term "point process" (Punkt-Prozesse) itself.

for example, as a sequence of delta functions, as a sequenceof time intervals between event occurrences; as an integer-valued randommeasure; or as the sum of a regular increas-ing component and a jump-type martingale.Treating each point as a delta-function in time yields a

Treating each point as a delta-function in time yields a time series which has generalized functions as realizations, but in other respects has many similarities with a continuous time series. In particular, stationarity, ergodicity, and a "point process spectrum" can be developed through this approach.

If attention is focused on the process of intervals between successive points, the paradigm example is the renewal process, where the intervals between events are independent, and, save possibly for the first interval, identically distributed.

Counting the number of events, say N(A), falling into a pre-specified set A (an interval, or more general Borel set) leads to treating the point process as an integer-valued random measure.

An important underlying theorem, first enunciated and analyzed by Slivnyak, asserts the equivalence of the counting- and interval-based approaches.

Random measures form a generalization of point processes which are of considerable importance in their own right. Their first and second order moment measures

M(A) = E[N(A)],
M_2(A × B) = E[N(A)N(B)],

and the associated signed measure, the covariance measure

C(A × B) = M_2(A × B) − M(A)M(B),

form non-random measures which have been extensively studied.


A third point of view originated more recently from martingale ideas applied to point processes, and has proved a rich source of both new models and new methods of analysis. The key here is to define the point process in terms of its conditional intensity (or conditional hazard) function, representing the conditional rate of occurrence of events, given the history (the record of previously occurring events and any other relevant information) up to the present.

These ideas are closely linked to the martingale representation of point processes. This takes the form of an increasing step function, with unit steps at the time points when the events occur, less a continuous increasing part given by the integral of the conditional intensity.

The Hawkes processes [introduced by Hawkes (a, b)] form an important class of point processes introduced and defined by their conditional intensities, which take the general form

λ(t) = µ + ∑_{i: t_i < t} g(t − t_i),

where µ is a non-negative constant ("the arrival rate") and g is a non-negative integrable function ("the infectivity function").
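The conditional intensity above translates directly into a simulation algorithm. The Python sketch below is an added illustration, not part of the entry: it simulates a Hawkes process with an exponential infectivity function g(u) = α e^{−βu} by thinning (in the spirit of Ogata's method), with arbitrary parameter values.

import numpy as np

rng = np.random.default_rng(0)


def intensity(t, events, mu, alpha, beta):
    """Conditional intensity: mu plus the decaying contributions
    g(t - t_i) = alpha * exp(-beta * (t - t_i)) of past events."""
    return mu + sum(alpha * np.exp(-beta * (t - ti)) for ti in events if ti <= t)


def simulate_hawkes(mu, alpha, beta, T):
    """Thinning: between candidate points the exponential kernel only decays,
    so the intensity evaluated at the current time is a valid upper bound."""
    t, events = 0.0, []
    while True:
        lam_bar = intensity(t, events, mu, alpha, beta)
        t += rng.exponential(1.0 / lam_bar)
        if t > T:
            return np.array(events)
        if rng.uniform(0.0, lam_bar) <= intensity(t, events, mu, alpha, beta):
            events.append(t)


pts = simulate_hawkes(mu=0.5, alpha=0.8, beta=1.2, T=500.0)
# For a stationary Hawkes process the mean rate is mu / (1 - alpha/beta).
print(len(pts) / 500.0, 0.5 / (1 - 0.8 / 1.2))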

The archetypal point process is the simple Poisson process, where the intervals between successive events in time are independent and (with the possible exception of the initial interval) identically and exponentially distributed with a common mean, say m. For this process, the conditional intensity is constant, equal to the rate of occurrence (intensity) of the Poisson process, here 1/m. Processes with continuous conditional intensities are sometimes referred to as "processes of Poisson type," since they behave locally like Poisson processes over intervals small enough for the conditional intensity to be considered approximately constant.

ter processes, where cluster centers follow a simple Poissonprocess, and the clusters are independent subprocesses,identically distributed relative to their cluster centers, it ispossible to write down simple expressions for the charac-teristic functional

Φ[h] = E[ei ∫ h(t)dN(t)]

where the carrying function h is integrable againstM, orthe essentially equivalent probability generating functional

G[ξ] = E[Πξ(ti)]

where ξ plays the role of h in the characteristic functional.

For example, the Poisson process with continuous intensity m has probability generating functional

G[h] = exp{ −m ∫ [1 − h(u)] du }.

Such functionals provide a comprehensive summary of the process and its attributes, especially the moment structure.

However, the usefulness of characteristic or generating functionals in practice is restricted by the relatively few examples for which they can be obtained in convenient closed form. Nevertheless, where available, their form is essentially independent of the space (phase space) in which the points are located. By contrast, finding extensions of the conditional intensity to spaces of more than one dimension has proved extremely difficult, on account of the absence of a clear linear ordering, a problem which equally affects other attempts to extend martingale ideas to higher dimensions.

About the Author
Dr. David Vere-Jones is Emeritus Professor, Victoria University of Wellington. He is Past President of the New Zealand Statistical Association, Past President of the Interim Executive of the International Association for Statistical Education, and Founding President of the New Zealand Mathematical Society. He received the Rutherford Medal, New Zealand's top science award, for "outstanding and fundamental contributions to research and education in probability, statistics and the mathematical sciences, and for services to the statistical and mathematical communities, both within New Zealand and internationally." Professor Vere-Jones was also awarded the NZ Science and Technology Gold Medal and the NZSA Campbell Award, in recognition of his contributions to the statistical sciences. He has (co-)authored numerous refereed publications and three books. A symposium in his honor was held in Wellington to mark his birthday.
"David Vere-Jones is New Zealand's leading resident mathematical statistician. He has made major contributions to the theory of statistics, its applications, and to the teaching of statistics in New Zealand. He is highly regarded internationally and is involved in numerous international activities. One of his major research areas has been concerned with point processes. ... A substantial body of the existing theory owes its origins to him, either directly or via his students. Of particular importance and relevance to New Zealand is his pioneering work on the applications of point process theory to seismology" (NZMS Newsletter).


Cross References
▸Khmaladze Transformation
▸Martingales
▸Non-Uniform Random Variate Generations
▸Poisson Processes
▸Renewal Processes
▸Spatial Point Pattern
▸Spatial Statistics
▸Statistics of Extremes
▸Stochastic Processes

References and Further Reading
Cox DR, Isham V () Point processes. Chapman and Hall, London
Cox DR, Lewis PAW () The statistical analysis of series of events. Methuen, London
Daley DJ, Vere-Jones D () An introduction to the theory of point processes, 1st edn. Springer, New York; 2nd edn, vols 1 and 2, Springer, New York
Hawkes AG (a) Spectra of some self-exciting and mutually exciting point processes. Biometrika
Hawkes AG (b) Point spectra of some mutually exciting point processes. J R Stat Soc
Palm C () Intensitätsschwankungen im Fernsprechverkehr. Ericsson Technics
Slivnyak IM () Some properties of stationary flows of homogeneous random events. Teor Veroyatnostei i Primen; translation in Theory of Probability and Its Applications
Stoyan D, Stoyan H () Fractals, random shapes and point fields. Wiley, Chichester
Stoyan D, Kendall WS, Mecke J () Stochastic geometry, 2nd edn. Wiley, Chichester; 1st edn. Akademie, Berlin

Poisson Distribution and Its Application in Statistics

Lelys Bravo de Guenni
Professor
Universidad Simón Bolívar, Caracas, Venezuela

The Poisson distribution was first introduced by the French mathematician Siméon-Denis Poisson (1781–1840) to describe the probability of a number of events occurring in a given time or space interval, with the probability of occurrence of each of these events being very small. However, since the number of trials is very large, these events do actually occur.

It was first published in 1837 in his work Recherches sur la probabilité des jugements en matière criminelle et en matière civile (Research on the Probability of Judgments in Criminal and Civil Matters). In this work, the behavior of certain random variables X that count the number of occurrences (or arrivals) of such events in a given interval in time or space was described. Some examples of these events are infant mortality in a city, the number of misprints in a book, the number of bacteria on a plate, the number of activations of a Geiger counter, and so on.

Assuming that λ is the expected value of such arrivals in a time interval of fixed length, the probability of observing exactly k events is given by the probability mass function

f(k | λ) = λ^k e^(−λ) / k!,   for k = 0, 1, 2, . . .

The parameter λ is a positive real number, which represents the average number of events occurring during a fixed time interval. For example, if the event occurs on average three times per second, then in t seconds it will occur on average 3t times, and λ = 3t. When the number of trials n is large and the probability of occurrence of the event, λ/n, approaches zero, the ▸binomial distribution with parameters n and p = λ/n can be approximated by a Poisson distribution with parameter λ. The binomial distribution gives the probability of x successes in n trials.
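As a brief numerical illustration (ours, not part of the original entry), the following Python sketch evaluates the Poisson probability mass function with scipy and shows the binomial-to-Poisson approximation described above; the parameter values are arbitrary.

from scipy.stats import poisson, binom

lam = 3.0          # expected number of events in the interval
for k in range(6):
    print(k, poisson.pmf(k, lam))

# Binomial(n, p = lam/n) approaches Poisson(lam) as n grows
for n in (10, 100, 10_000):
    print(n, binom.pmf(2, n, lam / n), poisson.pmf(2, lam))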

If X is a random variable with a Poisson distribution, the expected value of X and the variance of X are both equal to λ. To estimate λ by maximum likelihood, given a sample k_1, k_2, . . . , k_n, the log-likelihood function

L(λ) = log ∏_{i=1}^{n} [ λ^{k_i} e^{−λ} / k_i! ]

is maximized with respect to λ, and the resulting estimate is

λ̂ = (∑_{i=1}^{n} k_i) / n.

This is an unbiased estimator, since the expected value of each k_i is equal to λ; it is also an efficient estimator, since its variance achieves the Cramér–Rao lower bound. From the Bayesian inference perspective, a conjugate prior distribution for the parameter λ is the Gamma distribution. Suppose that λ follows a Gamma prior distribution with parameters α and β, such that

p(λ | α, β) = [ β^α / Γ(α) ] λ^{α−1} e^{−βλ},   λ > 0.

If a sample k_1, k_2, . . . , k_n of size n is observed, the posterior probability distribution for λ is given by

p(λ | k_1, k_2, . . . , k_n) ∼ Gamma( α + ∑_{i=1}^{n} k_i,  β + n ).

When α → 0 and β → 0, we have a diffuse prior distribution, and the posterior expected value of λ, E[λ | k_1, k_2, . . . , k_n], approximates the maximum likelihood estimator λ̂.

A use for this distribution was not found until the late nineteenth century,

when an individual named Bortkiewicz (O'Connor and Robertson) was asked by the Prussian Army to investigate accidental deaths of soldiers attributed to being kicked by horses. In 1898, he published The Law of Small Numbers. In this work he was the first to note that events with low frequency in a large population followed a Poisson distribution even when the probabilities of the events varied. Bortkiewicz studied the distribution of men kicked to death by horses among Prussian army corps over 20 years. This famous data set has been used in many statistical textbooks as a classical example of the use of the Poisson distribution (see Yule and Kendall or Fisher). He found that in about half of the army corps–year combinations there were no deaths from horse kicking. For the other combinations of corps and years, the number of deaths ranged from 1 to 4. Although the probability of horse-kick deaths might vary across corps and years, the overall observed frequencies were very close to the expected frequencies estimated by using a Poisson distribution.

In epidemiology, the Poisson distribution has been

used as a model for deaths. In one of the oldest textbooks of statistics, published by Bowley (cited by Hill), he fitted a Poisson distribution to deaths from splenic fever over a number of years and showed a reasonable agreement with the theory. At that time, splenic fever was a synonym for present-day anthrax.

A more extensive use of the Poisson distribution can

be found within Poisson generalized linear models, usually called Poisson regression models (see ▸Poisson Regression). These models are used to model count data and contingency tables. For contingency tables, the Poisson regression model is best known as the log-linear model. In this case, the response variable Y has a Poisson distribution and, usually, the logarithm of its expected value E[Y] is expressed as a linear predictor Xβ, where X is an n × p matrix of explanatory variables and β is a parameter vector of size p. The link function g(·), which relates the expected value of the response variable Y to the linear predictor, is the logarithm function, so that the mean of the response variable is μ = g⁻¹(Xβ).

Poisson regression can also be used to model what is

called the relative risk. This is the ratio between the counts and an exposure factor E. For example, to model the relative risk of disease in a region, we set η = Y/E, where Y is the observed number of cases and E is the expected number of cases, which depends on the number of persons at risk. The usual model for Y is

Y ∣η ∼ Poisson(Eη)

where η is the true relative risk of disease (Banerjee et al.). When applying the Poisson model to data, the main assumption is that the variance is equal to the mean.
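The relative-risk formulation above is usually fitted as a Poisson regression with log E as an offset, since E[Y] = Eη implies log E[Y] = log E + log η. A hedged sketch in Python using statsmodels follows; the exposure values, covariate, and coefficients are invented for illustration only.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
E = rng.uniform(50, 500, size=n)          # expected counts (exposure), hypothetical
x = rng.normal(size=n)                    # a hypothetical risk factor
eta = np.exp(-0.2 + 0.4 * x)              # true relative risk
y = rng.poisson(E * eta)                  # observed counts: Y | eta ~ Poisson(E * eta)

X = sm.add_constant(x)
model = sm.GLM(y, X, family=sm.families.Poisson(), offset=np.log(E))
print(model.fit().summary())              # coefficients estimate the log relative risk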

In many cases this assumption does not hold, since the variance of the counts is usually greater than the mean; in this case we have overdispersion. One way to deal with this problem is to use the negative binomial distribution, which is a two-parameter family that allows the mean and variance to be fitted separately. In this case, the mean of the Poisson distribution, λ, is treated as a random variable drawn from a Gamma distribution.

Another common problem with Poisson regression is

excess zeros: if there are two processes at work, one determining whether there are zero events or any events, and a Poisson process (see ▸Poisson Processes) determining how many events there are, there will be more zeros than a Poisson regression would predict. An example would be the distribution of cigarettes smoked in an hour by members of a group in which some individuals are nonsmokers. Such data sets can be modeled with zero-inflated Poisson models, where p is the probability of observing zero counts and 1 − p is the probability of observing a count variable modeled as a Poisson(λ).
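A minimal sketch (ours, not from the entry) of fitting such a zero-inflated Poisson model by maximizing its log-likelihood directly; the data are simulated, and p is parametrized on the logit scale and λ on the log scale to keep the optimization unconstrained.

import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, gammaln

rng = np.random.default_rng(2)
n, p_true, lam_true = 2000, 0.3, 2.5
y = np.where(rng.random(n) < p_true, 0, rng.poisson(lam_true, size=n))

def neg_loglik(theta):
    p, lam = expit(theta[0]), np.exp(theta[1])
    # P(Y=0) = p + (1-p)e^{-lam};  P(Y=k) = (1-p) lam^k e^{-lam}/k! for k >= 1
    ll_zero = np.log(p + (1 - p) * np.exp(-lam))
    ll_pos = np.log(1 - p) + y * np.log(lam) - lam - gammaln(y + 1)
    return -np.sum(np.where(y == 0, ll_zero, ll_pos))

res = minimize(neg_loglik, x0=np.array([0.0, 0.0]), method="BFGS")
print(expit(res.x[0]), np.exp(res.x[1]))   # estimates of p and lambda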

About the Author
For the biography see the entry ▸Normal Scores.

Cross References
▸Contagious Distributions
▸Dispersion Models
▸Expected Value
▸Exponential Family Models
▸Fisher Exact Test
▸Geometric and Negative Binomial Distributions
▸Hypergeometric Distribution and Its Application in Statistics
▸Modeling Count Data
▸Multivariate Statistical Distributions
▸Poisson Processes
▸Poisson Regression
▸Relationships Among Univariate Statistical Distributions
▸Spatial Point Pattern
▸Statistics, History of
▸Univariate Discrete Distributions: An Overview

References and Further Reading
Banerjee S, Carlin BP, Gelfand AE () Hierarchical modeling and analysis of spatial data. Chapman and Hall/CRC, Boca Raton
Fisher RA () Statistical methods for research workers. Oliver and Boyd, Edinburgh
Hill G () Horse kicks, anthrax and the Poisson model for deaths. Chronic Dis Can
O'Connor JJ, Robertson EF () http://www-groups.dcs.st-and.ac.uk/history/Mathematicians/Bortkiewicz.html
Yule GU, Kendall MG () An introduction to the theory of statistics. Charles Griffin, London


Poisson Processes

M. R. Leadbetter
Professor
University of North Carolina, Chapel Hill, NC, USA

Introduction
Poisson Processes are surely ubiquitous in the modeling of point events in widely varied settings, and anything resembling a brief exhaustive account is impossible. Rather, we aim to survey several "Poisson habitats" and properties, with glimpses of the underlying mathematical framework for these processes and close relatives. We refer to three (of many) authoritative works (Cox and Lewis; Daley and Vere-Jones; Kallenberg) for convenient detailed accounts of Poisson Process and general related theory, tailored to varied mathematical tastes.

In this entry we first consider Poisson Processes in their classical setting as series of random events (▸point processes) on the real line (e.g., in time), their importance in one dimension for stochastic modeling being rivaled only by the Wiener Process – both being basic (for different purposes) in their own right, and as building blocks for more complex models. Classical applications of the Poisson process abound – exemplified by births, radioactive disintegrations, customer arrival times in queues, instants of new cases of disease in an epidemic, and (a favorite of ours) the crossings of a high level by a stationary process. The classical format for Poisson Processes will be described in section "Poisson Processes on the Real Line", and important variants in section "Important Variants".

Unlike many situations in which generalizations to more than one dimension seem forced, the Poisson process has important and natural extensions to two and higher dimensions, as indicated in section "Poisson Processes in Higher Dimensions". Further, an even more attractive feature (at least to those with theoretical interests) is the fact that Poisson Processes can be defined on spaces with very little structure, as indicated in section "Poisson Processes on Abstract Spaces".

Poisson Processes on the Real Line
A point process on the real line is simply a sequence of events occurring in time (or some other one-dimensional parameter, e.g., distance) according to a probabilistic mechanism. One way to describe the probability structure is to define a sequence 0 ≤ τ1 < τ2 < τ3 < · · · < ∞, where the τi are the "times" of occurrences of interest ("events" of the point process). They are assumed to be random variables (written here as distinct, i.e., strictly increasing, in which case the point process is termed "simple"; successive τi can be taken to be equal if "multiple events" are to be considered). It is assumed that the points τi tend to infinity as i → ∞, so that there are no "accumulation points". One may define a point process as a random set {τi : i = 1, 2, . . .} of such points (cf. Ryll-Nardzewski). Alternatively, the occurrence times τi are a family of random variables for i = 1, 2, . . . and may be discussed within the framework of random sequences (discrete parameter stochastic processes). However, often the important quantities for a point process on the real line are the random variables N(B), which are the (random) numbers of events (τi) in sets B of interest (usually Borel sets). When B = (0, t] we write Nt for N((0, t]), i.e., the number of events occurring from time zero up to and including time t. Nt is thus a family of random variables for positive t – or a non-negative integer-valued continuous parameter stochastic process on the positive real line. Likewise N(B) defines a non-negative integer-valued stochastic process indexed by the (Borel) sets B (finite for bounded B).

Here (as is customary) we focus on the "counting" r.v.s Nt or N(B) rather than on more geometric properties of the sets {τi}. Note that these two families are essentially equivalent, since knowledge of N(B) for each Borel set determines that of the sub-family Nt (B = (0, t]), and the converse is also true since N(B) is a measure defined on the Borel sets and is determined by its values on the intervals (0, t]. Further, their distributions are determined by those of the occurrence times τk and conversely, in view of the equality of the events (τk > t) and (Nt < k).

We (finally!) come to the subject of this article – the

Poisson Process. In its simplest context this is defined as a family N(B) (or Nt) as above on the positive real line by the requirement that each Nt be Poisson,

P{Nt = r} = e^{−λt}(λt)^r / r!,   r = 0, 1, 2, . . . ,

and that "increments" (N_{t2} − N_{t1}), (N_{t4} − N_{t3}) are independent for 0 < t1 < t2 ≤ t3 < t4. Equivalently, N(B) is Poisson with mean λ m(B), where m denotes Lebesgue measure, and N(B1), N(B2) are independent for disjoint B1, B2. Here λ = E N1 is the expected number of events per unit time, or the intensity of the Poisson process.
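A short simulation sketch (ours, with arbitrary λ and horizon) illustrating this definition: event times are generated from i.i.d. exponential inter-arrival gaps, and the counts Nt are checked against their Poisson(λt) mean and variance.

import numpy as np

rng = np.random.default_rng(3)
lam, t = 4.0, 2.0

def count_up_to(t, lam):
    """Number of events in (0, t] for a Poisson process built from exponential gaps."""
    total, n = 0.0, 0
    while True:
        total += rng.exponential(1.0 / lam)   # inter-arrival time with mean 1/lam
        if total > t:
            return n
        n += 1

counts = np.array([count_up_to(t, lam) for _ in range(100_000)])
print(counts.mean(), lam * t)     # sample mean of N_t versus its theoretical mean
print(counts.var(), lam * t)      # for a Poisson count, the variance equals the mean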

That the Poisson – rather than some other – distribution plays a central role is due in part to the long history of its use to describe rare events, such as the classical number of deaths by horse kicks in the Prussian army. But more significantly, the very simplest modeling of a point process would surely require independence of increments, which holds as noted above for the Poisson Process. Further, this process has the stationarity property that the distribution of the "increment" (N_{t+h} − N_t) depends only on the length h of the interval, not on its starting point t. Moreover, the probability of an event in a small interval of length h is approximately λh, and the probability of more than one in that interval is of smaller order, i.e.,

P{N_{t+h} − N_t = 1} = λh + o(h),
P{N_{t+h} − N_t ≥ 2} = o(h),   as h → 0.

It turns out that the Poisson Process as defined is the only point process exhibiting stationarity and these two latter properties (see, e.g., Durrett for a proof), which demonstrates what Kingman aptly describes as the "inevitability" of the Poisson distribution in his volume (Kingman).

The Poisson Process has extensive properties which are well described in many works (e.g., Cox and Lewis, and Daley and Vere-Jones). For example, the equivalence of the events (τk > t) and (Nt < k) noted above readily shows that the inter-arrival times (τk − τ_{k−1}) are i.i.d. exponential random variables with mean λ⁻¹ (with τ0 = 0), and τk itself is the sum of the first k of these, thus distributed as (2λ)⁻¹ χ²_{2k}. On the other hand, we have the famous apparent paradox that the random interval which contains a fixed point t is distributed as the sum of two independent such exponential variables – the time from the preceding event plus that to the following event. This and a host of other useful properties may conveniently be found in Cox and Lewis and in Daley and Vere-Jones. Finally, the above discussion has focused on Poisson Processes on the positive real line. It is a simple matter to add an independent Poisson Process on the negative real line to obtain one on the entire real line (−∞, ∞).
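The apparent paradox just mentioned is easy to verify by simulation; the following sketch (ours, with arbitrary parameter choices) estimates the mean length of the inter-event interval covering a fixed time point and compares it with 2/λ rather than 1/λ.

import numpy as np

rng = np.random.default_rng(4)
lam, t0, horizon = 1.0, 50.0, 200.0
lengths = []
for _ in range(20_000):
    # event times of a Poisson process on (0, horizon]
    n = rng.poisson(lam * horizon)
    times = np.sort(rng.uniform(0.0, horizon, size=n))
    times = np.concatenate(([0.0], times, [horizon]))
    j = np.searchsorted(times, t0)            # interval (times[j-1], times[j]] contains t0
    lengths.append(times[j] - times[j - 1])
print(np.mean(lengths), 2.0 / lam, 1.0 / lam)  # mean length is close to 2/lam, not 1/lam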

Important Variants
The stationarity requirement of a constant intensity λ may be generalized to include a time-varying intensity function λ(t), for which the number of events N(B) in a (Borel) set B is still Poisson but with mean Λ(B) = ∫_B λ(t) dt, keeping independence of N(B1), N(B2) for disjoint B1, B2. Then Nt is Poisson with mean ∫_0^t λ(u) du. More generally, one may take Λ to be a "measure" on the Borel sets, not necessarily of this integral (absolutely continuous) form, which, unlike the simple Poisson Process above, does not necessarily prohibit the occurrence of more than one event at the same instant (multiple events) and may allow positive probability of an event occurring at a given fixed time point. Further, one may consider random versions of the intensity, e.g., with λ(t) being itself a stochastic process ("stochastic intensity"), as for example the blood pressure of an individual (varying randomly in time) leading to (conditionally) Poisson chest pain incidents. The resulting point processes are termed doubly stochastic Poisson or Cox Processes, and are widely used in medical trials, e.g., of new treatments. For other widely used variants of Poisson Processes (e.g., "mixed" and "compound" Poisson processes), as well as extensive theory of point process properties, we recommend the very readable volume by Daley and Vere-Jones.
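For the time-varying-intensity case just described, a standard way to simulate is Lewis–Shedler thinning: generate a homogeneous process at a dominating rate λ_max and keep each candidate point u with probability λ(u)/λ_max. A sketch of ours, with an arbitrary λ(t):

import numpy as np

rng = np.random.default_rng(5)
T = 10.0
lam = lambda t: 2.0 + 1.5 * np.sin(t)      # an illustrative intensity function
lam_max = 3.5                              # an upper bound for lam on [0, T]

def thinned_poisson(T, lam, lam_max):
    """Lewis-Shedler thinning: simulate an inhomogeneous Poisson process on (0, T]."""
    t, events = 0.0, []
    while True:
        t += rng.exponential(1.0 / lam_max)    # candidate points at the dominating rate
        if t > T:
            return np.array(events)
        if rng.random() < lam(t) / lam_max:    # accept with probability lam(t)/lam_max
            events.append(t)

pts = thinned_poisson(T, lam, lam_max)
u = np.linspace(0, T, 1001)
print(len(pts), np.trapz(lam(u), u))   # event count versus the mean measure Lambda((0, T])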

Poisson Processes in Higher Dimensions
Point processes (especially Poisson) have also been traditionally very useful in modeling point events in space and space-time dimensions. The locations of ore deposits in two or three spatial dimensions, and the occurrences of earthquakes in two dimensions and perhaps time ("spatio-temporal"), are important examples. The mathematical framework extends naturally from one dimension, N(B) being the number of point events in the two- or three-dimensional (Borel) set B, with notational extensions such as Nt(B) for the number of events in a spatial set B up to time t.

Not infrequently a time parameter is considered simply as equivalent to the addition of just one more spatial dimension, but the obvious differences in the questions to be asked for space and time suggest that the notation reflect the different character of the parameters. Further, natural dependence structure (correlation assumptions, mixing conditions, long range dependence) may differ for spatial and time parameters. Further, "coordinatewise mixing" (introduced in Leadbetter and Rootzen) seems promising in current investigation to facilitate point process theory in higher dimensions, where the parameters have different roles. A reader interested in the theory and applications in higher dimensions should consult Daley and Vere-Jones and the wealth of references therein.
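A two-dimensional homogeneous Poisson process on a rectangle is equally simple to simulate (a sketch of ours, with arbitrary intensity): the number of points is Poisson with mean λ times the area, and given that number the points are independent and uniform over the rectangle.

import numpy as np

rng = np.random.default_rng(6)
lam = 5.0                     # expected number of points per unit area
a, b = 4.0, 3.0               # rectangle [0, a] x [0, b]

n = rng.poisson(lam * a * b)                  # N(B) ~ Poisson(lam * area(B))
points = np.column_stack((rng.uniform(0, a, n), rng.uniform(0, b, n)))
print(n, points[:5])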

Poisson Processes on Abstract Spaces
There is substantial development of point process (and more general "random measure") theory in more abstract spaces, usually with an intricate topological structure (see Kallenberg). However, for discussion of existence and useful basic modeling properties, the topological assumptions are typically used solely for the definition of a natural simple measure-theoretic structure, without any necessary underlying topology – though they are useful for deeper considerations such as weak convergence, beyond pure modeling. Further, a charming property of Poisson processes in particular is that they may be defined simply on spaces with very little structure, as we now indicate.

Specifically, let S be a space and S a σ-ring (here a σ-field for simplicity) of subsets of S. For a given probability space (Ω, F, P), a random measure is defined to be any family of non-negative (possibly infinite) valued random variables Nω(B), for each B ∈ S, which is a measure (countably additive) on S for each ω ∈ Ω. A point process is a random measure for which each Nω(B) is integer-valued (or +∞). In this very general context one may construct a Poisson Process by simple steps (cf. Kallenberg and Kingman), which we indicate. Define i.i.d. random elements τ1, τ2, . . . , τn on S for each positive integer n with common distribution ν = Pτ_j⁻¹, yielding a point process on S consisting of a finite number (n) of points. By regarding n as random, having a Poisson distribution with mean a > 0 (or mixing the (joint) distributions of this point process with Poisson weights), one obtains a finite-valued Poisson process N(B) with the finite intensity measure EN(B) = aν(B). Finally, if λ is a σ-finite measure on S, we may write S = ⋃_{i=1}^{∞} S_i where λ(S_i) < ∞. Let N_i(B), i = 1, 2, . . . , be independent point processes with the finite intensity measures EN_i(B) = λ(B ∩ S_i). The superposition of these point processes gives a Poisson Process with intensity λ.

Relatives of the Poisson Process such as those above (mixed, compound, doubly stochastic, . . .) may be constructed in a similar way to the one-dimensional framework. These sometimes require small or modest additional assumptions about the space S, such as measurability of singleton sets and the separation of two of its points by measurable sets. One may also obtain many general results analogous to those of one dimension by assuming the existence of a countable semiring which covers S and plays the role of the bounded sets on which the point process is finite-valued, in this general non-topological context. Finally, as noted, the reference Kingman gives an account of Poisson processes primarily in this general framework, along with the basic early paper (Kendall). In a topological setting the monograph by Kallenberg gives a comprehensive development of random measures, motivating our own non-topological approach.

About the Author
M. R. Leadbetter obtained his B.Sc. and M.Sc. from the University of New Zealand, and his Ph.D. from UNC-Chapel Hill. He is a Fellow of the American Statistical Association and of the Institute of Mathematical Statistics, and a member of the International Statistical Institute. He has published a book with Harald Cramér, Stationary and Related Stochastic Processes (Wiley). Professor Leadbetter was awarded an Honorary Doctorate from Lund University. His name is associated with the Theorem of Cramér and Leadbetter.

Cross References
▸Erlang's Formulas
▸Lévy Processes
▸Markov Processes
▸Non-Uniform Random Variate Generations
▸Point Processes
▸Poisson Distribution and Its Application in Statistics
▸Probability on Compact Lie Groups
▸Queueing Theory
▸Radon–Nikodým Theorem
▸Renewal Processes
▸Spatial Point Pattern
▸Statistical Modelling in Market Research
▸Stochastic Models of Transport Processes
▸Stochastic Processes
▸Stochastic Processes: Classification
▸Univariate Discrete Distributions: An Overview

References and Further Reading
Cox DR, Lewis PAW () The statistical analysis of series of events. Methuen Monograph, London
Daley D, Vere-Jones D () An introduction to the theory of point processes. Springer, New York
Durrett R () Probability: theory and examples, 3rd edn. Duxbury, Belmont, CA
Kallenberg O () Random measures. Academic, New York
Kendall DG () Foundations of a theory of random sets. In: Kendall DG, Harding EF (eds) Stochastic geometry. Wiley, London
Kingman JFC () Poisson processes. Clarendon, Oxford
Leadbetter MR, Rootzen H () On extreme values in stationary random fields. In: Karatzas et al. (eds) Stochastic processes and related topics, volume in memory of Stamatis Cambanis. Birkhäuser
Ryll-Nardzewski C () Remarks on processes of calls. Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability

Poisson Regression

Gerhard Tutz
Professor
Ludwig-Maximilians-Universität München, Germany

Introduction
The Poisson regression model is a standard model for count data, where the response variable is given in the form of event counts, such as the number of insurance claims within a given period of time or the number of cases with a specific disease in epidemiology. Let (Y_i, x_i), i = 1, . . . , n, denote n independent observations, where x_i is a vector of explanatory variables and Y_i is the response variable. It is assumed that the response given x_i follows a Poisson distribution, which has probability function

P(Y_i = r) = λ_i^r e^{−λ_i} / r!   for r ∈ {0, 1, 2, . . .},   and 0 otherwise.

Mean and variance of the Poisson distribution are given by E(Y_i) = var(Y_i) = λ_i. Equality of the mean and variance is often referred to as the equidispersion property of the Poisson distribution. Thus, in contrast to the normal distribution, for which mean and variance are unlinked, the Poisson distribution implicitly models stronger variability for larger means, a property which is often found in real-life data. The support of the Poisson distribution is {0, 1, 2, . . .}, which makes it an appropriate distribution model for count data.

A Poisson regression model assumes that the conditional mean μ_i = E(Y_i | x_i) is determined by

μ_i = h(x_i^T β),   or equivalently   g(μ_i) = x_i^T β,

where g is a known link function and h = g⁻¹ denotes the response function. Since the Poisson distribution is from the simple exponential family, the model is a generalized linear model (GLM; see ▸Generalized Linear Models). The most widely used model uses the canonical link function by specifying

μ_i = exp(x_i^T β)   or   log(μ_i) = x_i^T β.

Since the logarithm of the conditional mean is linear in the parameters, the model is called a log-linear model. The log-linear version of the model is particularly attractive because interpretation of the parameters is very easy. The model implies that the conditional mean given x^T = (x_1, . . . , x_p) has the multiplicative form

μ(x) = exp(x^T β) = exp(x_1 β_1) · · · exp(x_p β_p).

Thus e^{β_j} represents the multiplicative effect on μ(x) if variable x_j changes by one unit to x_j + 1.
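A compact illustration (ours, with simulated data) of fitting the log-linear Poisson model with statsmodels and reading off the multiplicative effects exp(β̂_j):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 1000
X = rng.normal(size=(n, 2))
beta_true = np.array([0.5, 0.8, -0.3])          # intercept and two slopes (invented)
mu = np.exp(sm.add_constant(X) @ beta_true)
y = rng.poisson(mu)

fit = sm.GLM(y, sm.add_constant(X), family=sm.families.Poisson()).fit()
print(fit.params)            # estimates of beta
print(np.exp(fit.params))    # multiplicative effects exp(beta_j)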

Inference
Since the model is a generalized linear model, inference may be based on the methods that are available for that class of models (see, for example, McCullagh and Nelder). For the derivative of the log-likelihood one obtains the so-called score function

s(β) = ∑_{i=1}^{n} x_i [ h′(x_i^T β) / h(x_i^T β) ] ( y_i − h(x_i^T β) ),

and the Fisher matrix

F(β) = E(−∂²l / ∂β ∂β^T) = ∑_{i=1}^{n} x_i x_i^T [h′(x_i^T β)]² / h(x_i^T β).

Under regularity conditions, the estimator β̂ defined

by s(β̂) = 0 is consistent and asymptotically normally distributed,

β̂ ∼ N(β, F(β)⁻¹),

where F(β) may be replaced by F(β̂) to obtain standard errors. Goodness-of-fit measures and tests on the significance of parameters based on the deviance are provided within the framework of GLMs.
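The score function and Fisher matrix above lead directly to Fisher scoring (equivalently, iteratively reweighted least squares) for the canonical log link, where h = exp and hence h′ = h. A bare-bones numpy sketch of ours (no safeguards such as step halving):

import numpy as np

def poisson_fisher_scoring(X, y, n_iter=25):
    """Fisher scoring for the log-linear Poisson model mu_i = exp(x_i' beta)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        score = X.T @ (y - mu)            # s(beta) = sum_i x_i (y_i - mu_i), since h' = h
        fisher = X.T @ (X * mu[:, None])  # F(beta) = sum_i mu_i x_i x_i'
        beta = beta + np.linalg.solve(fisher, score)
    return beta

# minimal usage example with simulated data
rng = np.random.default_rng(8)
X = np.column_stack((np.ones(500), rng.normal(size=500)))
y = rng.poisson(np.exp(X @ np.array([1.0, 0.5])))
print(poisson_fisher_scoring(X, y))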

Extensions
In many applications count data are overdispersed, with the conditional variance exceeding the conditional mean. Several extensions of the basic model that account for overdispersion are available, in particular quasi-likelihood methods and more general distribution models like the Gamma–Poisson or negative binomial model. Quasi-likelihood uses the same estimation equations as maximum likelihood, the estimates being computed by solving

∑_{i=1}^{n} x_i (∂μ_i/∂η) ( y_i − μ_i ) / v(μ_i) = 0,

where μ_i = h(η_i) and v(μ_i) is the variance function. But instead of assuming the variance function of the Poisson model, v(μ_i) = μ_i, one uses a more general form. For example, Poisson with overdispersion uses v(μ_i) = φμ_i for some unknown constant φ. The case φ > 1 represents overdispersion of the Poisson model; the case φ < 1, which is rarely found in applications, is called underdispersion. Alternative variance functions usually continue to model the variance as a function of the mean. The variance function v(μ_i) = μ_i + γμ_i², with additional parameter γ, corresponds to the variance of the negative binomial distribution.

It may be shown that the asymptotic properties of quasi-likelihood estimates are similar to those for GLMs. In particular, one obtains asymptotically a normal distribution with the covariance given in the form of a pseudo-Fisher matrix; see McCullagh, and McCullagh and Nelder.

A source book for the modeling of count data which includes many applications is Cameron and Trivedi. An econometric view on count data is outlined in Winkelmann, and in Kleiber and Zeileis.
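In practice the overdispersion constant φ is often estimated by the Pearson statistic divided by the residual degrees of freedom, and standard errors are then inflated by √φ̂. A small sketch of ours (simulated, deliberately overdispersed data) on top of a fitted Poisson GLM:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
n = 800
x = rng.normal(size=n)
mu = np.exp(0.3 + 0.6 * x)
y = rng.negative_binomial(n=2, p=2 / (2 + mu))    # overdispersed counts with mean mu

X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
phi_hat = np.sum((y - fit.mu) ** 2 / fit.mu) / (n - X.shape[1])   # Pearson chi^2 / df
print(phi_hat)                                   # > 1 indicates overdispersion
print(fit.bse * np.sqrt(phi_hat))                # quasi-Poisson standard errors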

About the Author
Prof. Dr. Gerhard Tutz works at the Department of Statistics, Ludwig-Maximilians-Universität München. He served as Head of the department for several years. He coauthored the book Multivariate Statistical Modeling Based on Generalized Linear Models (with Ludwig Fahrmeir, Springer).

Cross References
▸Dispersion Models
▸Generalized Linear Models
▸Geometric and Negative Binomial Distributions
▸Modeling Count Data
▸Poisson Distribution and Its Application in Statistics
▸Robust Regression Estimation in Generalized Linear Models
▸Statistical Methods in Epidemiology

References and Further Reading
Cameron AC, Trivedi PK () Regression analysis of count data. Econometric Society Monographs. Cambridge University Press, Cambridge
Kleiber C, Zeileis A () Applied econometrics with R. Springer, New York
McCullagh P () Quasi-likelihood functions. Ann Stat
McCullagh P, Nelder JA () Generalized linear models, 2nd edn. Chapman and Hall, New York
Winkelmann R () Count data models: econometric theory and application to labor mobility, 2nd edn. Springer, Berlin

Population Projections

Janez Malacic
Professor, Faculty of Economics
University of Ljubljana, Ljubljana, Slovenia

Population projections are a basic tool that demographers use to forecast a future population. They can be produced in the form of a prognosis or as prospects. The first is the most likely future development according to the expectations of the projections' author(s) and is produced in a single variant. The second type is based more on an "if-then" approach and is calculated in several variants. Usually, there are three or four variants, namely low, medium, high, and constant variants. In practice, the medium variant is the most widely used and is taken as the most likely or accurate variant, that is, as a prognosis.

Population projections can be produced by mathematical or analytical methods. The mathematical methods use extrapolation(s) of various mathematical functions. For example, a census population can be extrapolated for a certain period into the future based on a linear, geometric, exponential, or some other functional form. The functional form is chosen on the basis of (1) past population development(s), (2) the developments of neighboring and other similar populations, as well as (3) general and particular demographic knowledge. In the great majority of cases, the mathematical methods are used for short- and midterm periods in the future. For population projections of small settlements and regions, only mathematical methods can be used in any reasonable way. Exceptionally, for very long-term periods (several centuries), a logistic curve can be used for a projection of the total population.

Analytical methods for population projections are

much more complex. The population development in the projection period is decomposed at the level of basic components. These components are mortality, fertility, and migration. For each component, a special hypothesis of future development is produced. Very rich and complex statistical data are needed for analytical population projections. They are considered suitable for a period of a few decades into the future. They cannot be used to project populations in small areas, because the required life tables (see ▸Life Table) and some other data will not be available. Analytical population projections offer very detailed information on particular population structures at present and in the future, as well as data on the development of the basic population components (e.g., mortality, fertility, and migration). They are also the basis for several other derived projections, like those of households and of the active, rural, urban, and pensioned population.

Suppose that we have the census population divided by

gender and age (in five-year age groups, say) as our starting point: V^t_{m,x} and V^t_{f,x}, the male and female populations aged x to x+5 at time t, where V stands for population size, x for age, t for time, and m and f for male and female. To make a projection, we need three hypotheses. The mortality hypothesis is constructed on the basis of life table indicators. We take survival ratios P_{m,x} and P_{f,x} for all five-year age groups (0–4, 5–9, . . . , up to the open-ended oldest age group). Our hypothesis can be that mortality is constant, declining, or increasing, or we can have a combination of all three for each gender and each age group. Then we apply the population projection model aging procedure in the following form (spelled out for males):

V^t_{m,0} · P_{m,0} = V^{t+5}_{m,5}  →  V^{t+5}_{m,5} · P_{m,5} = V^{t+10}_{m,10},  etc.

Evidently, in this case a constant mortality hypothesis was used. The aging procedure would be applied to both genders and to all five-year age groups.

In the next step, we would construct a fertility hypothesis. This one can also be that fertility is constant, declining, increasing, or a combination of all three. It provides the newborn population for each year in the projection period. A set of different fertility indicators is available. The most convenient indicators are the age-specific fertility rates f_x (for the age group x to x+5), where x is equal to 15, 20, 25, 30, 35, 40, and 45. In principle, we calculate the future number of births (N) for the seven five-year age groups with the formula:

N_x^{t−(t+5)} = ( ( V^t_{f,x} + V^{t+5}_{f,x} ) / 2 ) · f^t_x .

The number of births, N^{t−(t+5)}, should be subgrouped by gender. We can apply the demographic "constant" alternative and suppose that roughly 100 girls are born per 205 births. Then we calculate

V^{t+5}_{m,0} = P_{m,r} · N_m^{t−(t+5)}   and   V^{t+5}_{f,0} = P_{f,r} · N_f^{t−(t+5)},

where P_{m,r} and P_{f,r} are survival ratios for newborn boys and girls. In the case of a closed population, or a population with zero migration, our projection is finished.

However, real populations have in- and out-migration.

To cover this case, we should construct a migration hypothesis. The procedure is similar to the mortality and fertility hypotheses. Suppose we use net migration rates for five-year age groups, separately by gender. In principle, we calculate age-specific net migration factors (for males; here nm denotes net migration):

F_{m,x} = 1 + ( nm_{m,x} / 1,000 ).

The population aging procedure then changes slightly:

V^{t+5}_{m,x+5} = V^t_{m,x} · P_{m,x} · F_{m,x}.

The procedure for the female population is parallel to the procedure for the male population. The most serious problem in practice is that age-specific migration data may be unavailable or of poor quality.

Such simple analytical population projection procedures have been improved considerably in the literature during recent decades. Probably the most important improvement is the construction of probabilistic population projections, for which considerable progress has been made in recent decades (Lutz et al.). Many analytical population projections for countries and regions are now supplemented by several national probabilistic population forecasts.
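The cohort-component procedure described above translates directly into a few lines of code. The sketch below (ours; every input number is an invented placeholder, not real data) advances hypothetical female and male populations one five-year step, applying survival ratios, age-specific fertility, a constant sex ratio at birth, and net migration factors.

import numpy as np

# hypothetical inputs for 18 five-year age groups (0-4, ..., 85+); placeholder values only
ages = np.arange(0, 90, 5)
Vf = np.full(18, 1000.0)             # female population by age group at time t
Vm = np.full(18, 1000.0)             # male population by age group at time t
Pf = np.full(18, 0.99)               # female survival ratios to the next age group
Pm = np.full(18, 0.98)               # male survival ratios
F = np.full(18, 1.0)                 # net migration factors, 1 + nm/1000 (here: no migration)
fert = dict(zip([15, 20, 25, 30, 35, 40, 45],    # age-specific fertility rates (per woman)
                [0.02, 0.08, 0.10, 0.07, 0.03, 0.01, 0.002]))
Pf_birth, Pm_birth = 0.995, 0.994    # survival ratios for newborn girls and boys
girls_share = 100.0 / 205.0          # assumed constant: about 100 girls per 205 births

def project_one_step(V, P, F):
    """Age the population: group x at time t becomes group x+5 at time t+5."""
    out = np.zeros_like(V)
    out[1:] = V[:-1] * P[:-1] * F[:-1]
    out[-1] += V[-1] * P[-1] * F[-1]      # survivors of the open-ended oldest group stay in it
    return out

Vf_next, Vm_next = project_one_step(Vf, Pf, F), project_one_step(Vm, Pm, F)

# births over the interval from the (approximate) mid-period female population
births = sum(0.5 * (Vf[ages == x][0] + Vf_next[ages == x][0]) * r for x, r in fert.items())
Vf_next[0] = Pf_birth * girls_share * births
Vm_next[0] = Pm_birth * (1 - girls_share) * births
print(Vf_next.sum(), Vm_next.sum())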

About the Author
Dr. Janez Malacic is a Professor of demography and labor economics, Faculty of Economics, Ljubljana University, Slovenia. He is a Former President of the Slovenian Statistical Society, a Former President of the Society of Yugoslav Statistical Societies, and a member of the IUSSP. He has authored two books and many papers. His papers have been published in eight languages.

Cross References
▸Actuarial Methods
▸African Population Censuses
▸Census
▸Demographic Analysis: A Stochastic Approach
▸Demography
▸Life Table
▸Survival Data

References and Further Reading
Eurostat, European Commission () Work session on demographic projections, Bucharest, October. Methodologies and working papers
Lutz W, Goldstein JR (guest eds) () How to deal with uncertainty in population forecasting? Int Stat Rev
Lutz W, Vaupel JW, Ahlburg DA (eds) () Frontiers of population forecasting. A supplement to Population and Development Review. The Population Council, New York

Portfolio Theory

Harry M. Markowitz
Professor, Winner of the Nobel Memorial Prize in Economic Sciences in 1990
University of California, San Diego, CA, USA

Portfolio Theory considers the trade-off between some measure of risk and some measure of return on the portfolio-as-a-whole. The measures used most frequently in practice are expected (or mean) return and variance or, equivalently, standard deviation. This article discusses the justification for the use of mean and variance, sources of data needed in a mean-variance analysis, how mean-variance trade-off curves are computed, and semi-variance as an alternative to variance.

Mean-Variance Analysis and Its Justification
While the idea of trade-off curves goes back at least to Pareto, the notion of a trade-off curve between risk and return (later dubbed the efficient frontier) was introduced in Markowitz (1952). Markowitz proposed expected return and variance both as a hypothesis about how investors act and as a rule for guiding action in fact. By Markowitz (1959) he had given up the notion of mean and variance as a hypothesis but continued to propose them as criteria for action.

Tobin (1958) said that the use of mean and variance as criteria assumed either a quadratic utility function or a Gaussian probability distribution. This view is sometimes ascribed to Markowitz, but he never justified the use of mean and variance in this way. His views evolved considerably from Markowitz (1952) to Markowitz (1959). Concerning these matters Markowitz (1952) should be ignored. Markowitz (1959) accepts the views of Von Neumann and Morgenstern when probability distributions are known, and of L. J. Savage when they are not. The former asserts that one should maximize expected utility; the latter asserts that when objective probabilities are not known one should maximize expected utility using probability beliefs.

Markowitz (1959) conjectures that a suitably chosen

point from the efficient frontier will approximately maximize expected utility for the kinds of utility functions that are commonly proposed for cautious investors, and for the kinds of probability distributions that are found in practice. Levy and Markowitz expand on this notion considerably. Specifically, Levy and Markowitz show that for such probability distributions and utility functions the correlation between the actual expected utility and the mean-variance approximation is typically very close to 1. They also show that the Pratt and Arrow objection to quadratic utility does not apply to the kind of approximations used by Levy and Markowitz, or in Markowitz (1959).

Models of Covariance
If covariances are computed from historical returns with more securities than there are observations, e.g., several thousand securities and only a few dozen months of observations, then the covariance matrix will be singular. A preferable alternative is to use a model of covariance in which the return on the ith security is assumed to obey the relationship

r_i = α_i + ∑_k β_{ik} f_k + u_i,

where the u_i are independent of each other and of the f_k. The f_k may be either factors or scenarios, or some of each. These ideas are carried out in, for example, Sharpe, Rosenberg, and Markowitz and Perold (a, b).
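Under the factor model above, the implied covariance matrix of returns is Σ = B Cov(f) Bᵀ + D, where B collects the loadings β_ik and D is the diagonal matrix of residual variances; it is non-singular even when assets far outnumber observations. A small sketch of ours, with made-up dimensions and values:

import numpy as np

rng = np.random.default_rng(9)
n_assets, n_factors = 1000, 3                            # illustrative sizes
B = rng.normal(scale=0.5, size=(n_assets, n_factors))    # factor loadings beta_ik
cov_f = np.diag([0.04, 0.02, 0.01])                      # factor covariance matrix
resid_var = rng.uniform(0.01, 0.05, n_assets)            # variances of the independent u_i

Sigma = B @ cov_f @ B.T + np.diag(resid_var)             # Sigma = B Cov(f) B' + D
print(Sigma.shape, np.linalg.eigvalsh(Sigma).min() > 0)  # positive definite despite n >> T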

Estimation of Parameters
Covariance matrices are sometimes estimated from historical returns and sometimes from factor or scenario models such as the one-factor model of Sharpe, the many-factor model of Rosenberg, or the scenario models of Markowitz and Perold cited above.

Expected returns are estimated in a great variety of ways. I do not believe that anyone suggests that, in practice, historical average returns should be used as the expected returns of individual stocks. The Ibbotson and Sinquefield series are frequently used to estimate expected returns for asset classes. Black and Litterman propose a very interesting Bayesian approach to the estimation of expected returns. Richard Michaud proposes to use estimates for asset classes based on what he refers to as a "resampled frontier". Additional methods for estimating expected return are based on statistical methods for "disentangling" various anomalies (see Jacobs and Levy), or on estimates based on factors that Graham and Dodd might use: see Lakonishok et al., Ohlson, and Bloch et al. The last-mentioned paper is based on results obtained by back-testing many alternative hypotheses concerning how to achieve excess returns. When many estimation methods are tested, the expected future return for the best of the lot should not be estimated as if this were the only procedure tested. Estimates should be corrected for "data mining." See Markowitz and Xu.

Computation of M-V Efficient Sets
The set of mean-variance efficient portfolios is piecewise linear. The critical line algorithm (CLA) traces out this set, one linear piece at a time, without having to search for optima. CLA is described in Appendix A of Markowitz (1959) and, less compactly, in Markowitz and Todd.
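The CLA itself is beyond a short sketch, but with only the budget constraint and a target mean (short sales allowed) the minimum-variance weights have a closed form obtained from the Lagrangian, and sweeping the target return traces a frontier. A hedged illustration of ours; the expected returns and covariance matrix are invented.

import numpy as np

mu = np.array([0.05, 0.08, 0.12])                      # hypothetical expected returns
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])                 # hypothetical covariance matrix
ones = np.ones_like(mu)
Sinv = np.linalg.inv(Sigma)

def min_var_weights(target):
    """Minimize w'Sigma w subject to w'mu = target and w'1 = 1 (short sales allowed)."""
    A = np.array([[mu @ Sinv @ mu, mu @ Sinv @ ones],
                  [ones @ Sinv @ mu, ones @ Sinv @ ones]])
    lam, gam = np.linalg.solve(A, np.array([target, 1.0]))
    return Sinv @ (lam * mu + gam * ones)

for t in np.linspace(0.05, 0.12, 5):
    w = min_var_weights(t)
    print(round(t, 3), w.round(3), round(float(w @ Sigma @ w) ** 0.5, 4))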

Downside Risk
"Semi-variance," or downside risk, is like variance but only considers deviations below the mean or below some target return. It is proposed in Markowitz (1959) and championed by Sortino and Satchell. It is used less frequently in practice than variance.

About the Author
Professor Markowitz has applied computer and mathematical techniques to various practical decision making areas. In finance: in an article in 1952 and a book in 1959 he presented what is now referred to as MPT, "modern portfolio theory." This has become a standard topic in college courses and texts on investments, and is widely used by institutional investors for asset allocation, risk control and attribution analysis. In other areas: Dr. Markowitz developed "sparse matrix" techniques for solving very large mathematical optimization problems. These techniques are now standard in production software for optimization programs. Dr. Markowitz also designed and supervised the development of the SIMSCRIPT programming language. SIMSCRIPT has been widely used for programming computer simulations of systems like factories, transportation systems and communication networks.


In 1989 Dr. Markowitz received the John von Neumann Award from the Operations Research Society of America for his work in portfolio theory, sparse matrix techniques and SIMSCRIPT. In 1990 he shared the Nobel Prize in Economics for his work on portfolio theory.

Cross References
▸Actuarial Methods
▸Business Statistics
▸Copulas in Finance
▸Heteroscedastic Time Series
▸Optimal Statistical Inference in Financial Engineering
▸Semi-Variance in Finance
▸Standard Deviation
▸Statistical Modeling of Financial Markets
▸Variance

References and Further Reading
Arrow K () Aspects of the theory of risk bearing. Markham Publishing Company, Chicago
Black F, Litterman R () Asset allocation: combining investor views with market equilibrium. J Fixed Income
Black F, Litterman R () Global portfolio optimization. Financ Anal J
Bloch M, Guerard J, Markowitz H, Todd P, Xu G () A comparison of some aspects of the US and Japanese equity markets. Jpn World Econ
Graham B, Dodd DL () Security analysis, 2nd edn. McGraw-Hill, New York
Ibbotson RG, Sinquefield RA () Stocks, bonds, bills and inflation yearbook. Morningstar, New York
Jacobs BI, Levy KN () Disentangling equity return regularities: new insights and investment opportunities. Financ Anal J
Lakonishok J, Shleifer A, Vishny RW () Contrarian investment, extrapolation and risk. J Financ
Levy H, Markowitz HM () Approximating expected utility by a function of mean and variance. Am Econ Rev
Markowitz HM () Portfolio selection. J Financ
Markowitz HM () Portfolio selection: efficient diversification of investments. Wiley, New York (Yale University Press; 2nd edn, Basil Blackwell)
Markowitz HM, Perold AF (a) Portfolio analysis with factors and scenarios. J Financ
Markowitz HM, Perold AF (b) Sparsity and piecewise linearity in large portfolio optimization problems. In: Duff IS (ed) Sparse matrices and their uses. Academic Press, New York
Markowitz HM, Todd P () Mean-variance analysis in portfolio choice and capital markets. Frank J. Fabozzi Associates, New Hope (revised reissue of Markowitz (1959), with a chapter by Peter Todd)
Markowitz HM, Xu GL () Data mining corrections. J Portfolio Manage
Michaud RO () The Markowitz optimization enigma: is optimized optimal? Financ Anal J
Ohlson JA () Risk return, security-valuation and the stochastic behavior of accounting numbers. J Financ Quant Anal
Pratt JW () Risk aversion in the small and in the large. Econometrica
Rosenberg B () Extra-market components of covariance in security returns. J Financ Quant Anal
Savage LJ () The foundations of statistics. Wiley, New York (2nd revised edn, Dover, New York)
Sharpe WF () A simplified model for portfolio analysis. Manage Sci
Sortino F, Satchell S () Managing downside risk in financial markets: theory, practice and implementation. Butterworth-Heinemann, Burlington
Tobin J () Liquidity preference as behavior towards risk. Rev Econ Stud
Von Neumann J, Morgenstern O () Theory of games and economic behavior, 3rd edn. Princeton University Press, Princeton

Posterior Consistency in BayesianNonparametrics

Jayanta K. Ghosh, R. V. Ramamoorthi
Professor of Statistics, Purdue University, West Lafayette, IN, USA
Professor, Michigan State University, East Lansing, MI, USA

Bayesian Nonparametrics (see ▸Bayesian Nonparametric Statistics) took off with two papers of Ferguson, followed by Antoniak. However, consistency or asymptotics were not major issues in those papers, which were more concerned with taking the first steps towards a usable, easy to interpret prior with easy to choose hyperparameters and a rich support. Unfortunately, the fact that the Dirichlet sits on discrete distributions diminished the early enthusiasm.

The idea of consistency came from Laplace and informally may be defined as follows: let P be a set of probability measures on a sample space X, and Π a prior on P. The posterior is said to be consistent at a true value P_0 if the following holds: for sample sequences with P_0-probability 1, the posterior probability of any neighborhood U of P_0 converges to 1.

The choice of neighborhoods U determines the strength of consistency. One choice, when the sample space is separable metric, is weak neighborhoods U of P_0. When elements of P have densities, L_1 neighborhoods of P_0 are often the relevant choice. If the family P is parametrized by θ, then these notions easily translate to θ, via continuity requirements of the map θ ↦ P_θ.

The early papers on consistency were by Freedman, on multinomials with (countably) infinite classes. They were very interesting but provided a

Page 35: International Encyclopedia of Statistical Science || Probit Analysis

Posterior Consistency in Bayesian Nonparametrics P

P

somewhat negative picture, namely, that in a topologicalsense, for most priors (i.e., outside a class of rst cat-egory) consistency fails to hold. is lead Freedman toconsider tail-free priors including the Dirichlet prior forthe countably many probabilities of the countably innitemultinomial. Around the same time, in her posthumouspaper (Schwartz ) of , arising from her thesis atBerkeley, written under Le Cam, Schwartz showed amongother things that, if the prior assigned positive probabilityto all Kullback-Leibler neighborhoods of the true density,then consistency holds in the sense of weak convergence.is is an important result which showed that this is theright notion of support in these problems, not the moreusual one adopted by Freedman.

These early papers were followed by Ferguson's Dirichlet process on the set of all probability measures, along with Antoniak's study of mixtures. Even though the set of discrete measures had full measure under the Dirichlet process, it still enjoyed the property of consistency at all distributions. The paper by Diaconis and Freedman, along with its discussions, revived interest in consistency issues. Diaconis and Freedman showed that with a Dirichlet process prior, consistency can go awry in the presence of a location parameter. Barron, in his discussion of the paper, provided insight as to why it would be unreasonable to expect consistency in this example. Ghosh and Ramamoorthi discuss several different explanations for the lack of consistency if one uses the Dirichlet process in a semiparametric problem with a location parameter.

(Barron ), Barron, Schervish and Wasserman (Barronet al. ), Walker (Walker ) and Coram and Lalley(Coram and Lalley ). A thorough review up to is available in Choi and Ramamoorthi (Choi andRamamoorthi ). Contributions to rates of conver-gence have been made by Ghosal, Ghosh and van derVaart, (Ghosal et al. ), Ghosal and van der Vaart(Ghosal and van der Vaart ), Shen and Waseerman(), Kleijn and van der Vaart (Kleijn and van derVaart ) and (van der Vaart and van Zanten ),(van der Vaart and van Zanten ). see also the bookby Ghosh and Ramamoorthi (Ghosh and Ramamoorthi) for many basic early results on consistency andother aspects of, like choice of priors and consistency fordensity estimation, semiparametric problems and survivalanalysis.Ghosh and Ramamoorthi (Ghosh and Ramamoorthi

) deal only with nonparametric estimation problems.Work on nonparametric testing and consistency problemsthere have begun only recently. A survey is available inTokdar, Chakravarti and Ghosh (Tokdar et al. ).

About the Authors
For the biography of Professor Ghosh see the entry ▸Bayesian P-Values.

Professor R. V. Ramamoorthi obtained his doctoral degree from the Indian Statistical Institute. His early work was on sufficiency and decision theory, a notable contribution there being the joint result with Blackwell showing that Bayes sufficiency is not equivalent to Fisherian sufficiency. Over the last few years his research has been on Bayesian nonparametrics, where J. K. Ghosh and he authored one of the first books on the topic, Bayesian Nonparametrics (Springer).

"This is the first book to present an exhaustive and comprehensive treatment of Bayesian nonparametrics. Ghosh and Ramamoorthi present the theoretical underpinnings of nonparametric priors in a rigorous yet extremely lucid style… It is indispensable to any serious Bayesian. It is bound to become a classic in Bayesian nonparametrics." (Jayaram Sethuraman, Review of Bayesian Nonparametrics, Sankhya).

Cross References
▸Bayesian Nonparametric Statistics
▸Bayesian Statistics
▸Nonparametric Statistical Inference

References and Further Reading
Antoniak CE () Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Ann Stat
Barron A () The exponential convergence of posterior probabilities with implications for Bayes estimators of density functions. Technical Report, Department of Statistics, University of Illinois, Champaign
Barron A, Schervish MJ, Wasserman L () The consistency of posterior distributions in nonparametric problems. Ann Stat
Choi T, Ramamoorthi RV () Remarks on consistency of posterior distributions. In: Pushing the limits of contemporary statistics: contributions in honor of J. K. Ghosh. Inst Math Stat Collect. Institute of Mathematical Statistics, Beachwood
Coram M, Lalley SP () Consistency of Bayes estimators of a binary regression function. Ann Stat
Diaconis P, Freedman D () On the consistency of Bayes estimates. Ann Stat (with a discussion and a rejoinder by the authors)
Ferguson TS () Prior distributions on spaces of probability measures. Ann Stat
Ferguson TS () Bayesian density estimation by mixtures of normal distributions. In: Recent advances in statistics. Academic, New York
Freedman DA () On the asymptotic behavior of Bayes' estimates in the discrete case. Ann Math Stat
Freedman DA () On the asymptotic behavior of Bayes estimates in the discrete case II. Ann Math Stat
Ghosal S, Ghosh JK, van der Vaart AW () Convergence rates of posterior distributions. Ann Stat


Ghosal S, van der Vaart AW () Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities. Ann Stat
Ghosh JK, Ramamoorthi RV () Bayesian nonparametrics. Springer Series in Statistics. Springer, New York
Kleijn BJK, van der Vaart AW () Misspecification in infinite-dimensional Bayesian statistics. Ann Stat
Schwartz L () On Bayes procedures. Z Wahrscheinlichkeitstheorie und Verw Gebiete
Shen X, Wasserman L () Rates of convergence of posterior distributions. Ann Stat
Tokdar S, Chakrabarti A, Ghosh J () Bayesian nonparametric goodness of fit tests. In: Chen M-H, Dey DK, Mueller P, Sun D, Ye K (eds) Frontiers of statistical decision making and Bayesian analysis. Inst Math Stat Collect. Institute of Mathematical Statistics, Beachwood
van der Vaart AW, van Zanten JH () Reproducing kernel Hilbert spaces of Gaussian priors. In: Pushing the limits of contemporary statistics: contributions in honor of J. K. Ghosh. Inst Math Stat Collect. Institute of Mathematical Statistics, Beachwood
van der Vaart AW, van Zanten JH () Adaptive Bayesian estimation using a Gaussian random field with inverse gamma bandwidth. Ann Stat
Walker S () New approaches to Bayesian consistency. Ann Stat

Power Analysis

Kevin R. Murphy
Professor
Pennsylvania State University, University Park, PA, USA

One of the most common applications of statistics in the social and behavioral sciences is in testing null hypotheses. For example, a researcher wanting to compare the outcomes of two treatments will usually do so by testing the hypothesis that in the population there is no difference in the outcomes of the two treatments. The power of a statistical test is defined as the likelihood that a researcher will be able to reject a specific null hypothesis when it is in fact false.

Cohen, Lipsey, and Kraemer and Thiemann provided excellent overviews of the methods, assumptions, and applications of power analysis. Murphy and Myors extended traditional methods of power analysis to tests of hypotheses about the size of treatment effects, not merely tests of whether or not such treatment effects exist.

The power of a null hypothesis test is a function of sample size (n), effect size (ES), and the standard used to define statistical significance (α), and the equations that define this relation can easily be rearranged to solve for any of the four quantities (i.e., power, n, ES, and α) given the other three. The two most common applications of statistical power analysis are: (1) determining the power of a study, given n, ES, and α; and (2) determining how many observations will be needed (i.e., the required n), given a desired level of power, an ES estimate, and the α value. Both of these methods are widely used in designing studies; one widely accepted convention is that studies should be designed so that they achieve power levels of .80 or greater (i.e., so that they have at least an 80% chance of rejecting a false null hypothesis; Cohen; Murphy and Myors).

There are two other applications of power analysis that are less common, but no less informative. First, power analysis may be used to evaluate the sensitivity of studies. That is, power analysis can indicate what sorts of effect sizes might be reliably detected in a study. If one expects the effect of a treatment to be small, it is important to know whether the study will detect that effect, or whether the study as planned only has sufficient sensitivity to detect larger effects. Second, one may use power analysis to make rational decisions about the criteria used to define "statistical significance."
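As an illustration of the two common applications listed above, the following Python sketch uses statsmodels to compute the power of a two-sample t test for a given n and effect size, and then the n required to reach power .80; the effect size and α are arbitrary choices of ours.

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# power achieved with n = 64 per group, effect size d = 0.5, alpha = .05
power = analysis.solve_power(effect_size=0.5, nobs1=64, alpha=0.05, ratio=1.0)
print(power)

# sample size per group needed to reach power = .80
n_needed = analysis.solve_power(effect_size=0.5, power=0.80, alpha=0.05, ratio=1.0)
print(n_needed)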

tical analysis packages (e.g., SPSS provides Sample Power,a exible and powerful program) and it is possible to usenumerous websites to perform simple power analyses. Twonotable soware packages designed for power analysis are:

G∗Power (Faul et al. ; http://www.psycho.uni-duesseldorf.de/abteilungen/aap/gpower/ is distribu-ted as a freeware program that is available for bothMacintosh and Windows environments. It is simple,fast, and exible.

Power and Precision, distributed by Biostat, was devel-oped by leading researchers in the eld (e.g., J. Cohen).is program is very exible, covers a large arrayof statistical tests, and provides power analyses andcondence intervals for most tests.

About the Author
Kevin Murphy is a Professor of Psychology and Information Sciences and Technology at the Pennsylvania State University. He has served as President of the Society for Industrial and Organizational Psychology and as Editor of the Journal of Applied Psychology. He is the author of eleven books, including Statistical Power Analysis: A Simple and General Model for Traditional and Modern Hypothesis Tests (with Brett Myors, Erlbaum), and of numerous articles and chapters.


Cross References
Effect Size
Presentation of Statistical Testimony
Psychology, Statistics in
Sample Size Determination
Significance Testing: An Overview
Statistical Fallacies: Misconceptions and Myths
Statistical Significance

References and Further Reading
Cohen J () Statistical power analysis for the behavioral sciences, 2nd edn. Erlbaum, Hillsdale
Faul F, Erdfelder E, Lang A-G, Buchner A () G∗Power : a flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behav Res Meth :–
Kraemer HC, Thiemann S () How many subjects? Sage, Newbury Park
Lipsey MW () Design sensitivity. Sage, Newbury Park
Murphy K, Myors B () Statistical power analysis: a simple and general model for traditional and modern hypothesis tests, 3rd edn. Erlbaum, Mahwah

Preprocessing in Data Mining

Edgar Acuña
Professor, University of Puerto Rico at Mayaguez, Mayaguez, Puerto Rico

Introduction
Data mining is the process of extracting hidden patterns in a large dataset. Azzopardi () breaks the data mining process into five stages:

(a) Selecting the domain – data mining should be assessed to determine whether there is a viable solution to the problem at hand, and a set of objectives should be defined to characterize these problems.

(b) Selecting the target data – this entails the selection of data that is to be used in the specified domain; for example, selection of subsets of features or data samples from larger databases.

(c) Preprocessing the data – this phase is primarily aimed at preparing the data in a suitable and useable format, so that a knowledge extraction process can be applied.

(d) Extracting the knowledge/information – during this stage, the types of data mining operations (association rules, regression, supervised classification, clustering, etc.), the data mining techniques, and the data mining algorithms are chosen and the data is then mined.

(e) Interpretation and evaluation – this stage of the data mining process is the interpretation and evaluation of the discoveries made. It includes filtering the information that is to be presented, visualizing it graphically, or locating the useful patterns and translating the patterns discovered into an understandable form.

In the process of data mining, many patterns are found in the data. Patterns that are interesting for the miner are those that are easily understood, valid, potentially useful, and novel (Fayyad et al. ). These patterns should validate the hypothesis that the user seeks to confirm. The quality of the patterns obtained depends on the quality of the data analyzed. It is common practice to prepare data before applying traditional data mining techniques such as regression, association rules, clustering, and supervised classification.

Section "Reasons for Applying Data Preprocessing" of this article provides a more precise justification for the use of data preprocessing techniques. This is followed by a description, in section "Techniques for Data Preprocessing," of some of the data preprocessing techniques currently in use.

Reasons for Applying Data Preprocessing
Pyle () suggests that about % of the total time required to complete a data mining project should be spent on data preparation, since it is one of the most important contributors to the success of the project. Transforming the data at hand into a format appropriate for knowledge extraction has a significant influence on the final models generated, as well as on the amount and quality of the knowledge discovered during the process. At the same time, the effect caused by changes made to a dataset during data preprocessing can either facilitate or complicate even further the knowledge discovery process; thus changes made must be selected with care.

Today's real-world datasets are highly susceptible to noise, missing and inconsistent data due to human errors, mechanical failures, and to their typically large size. Data affected in this manner is known as "dirty." During the past decades, a number of techniques have been developed to preprocess data gathered from real-world applications before the data is further processed for other purposes.

Cases where data mining techniques are applied directly to raw data without any kind of data preprocessing are still frequent; yet, data preprocessing has been recommended as an obligatory step. Data preprocessing techniques should never be applied blindly to a dataset, however. Prior to any data preprocessing effort, the dataset should be explored and characterized. Two methods for exploring the data prior to preprocessing are data characterization and data visualization.

Data Characterization
Data characterization describes data in ways that are useful to the miner and begins the process of understanding what is in the data. Engels and Theusinger () describe the following characteristics as standard for a given dataset: the number of classes, the number of observations, the percentage of missing values in each attribute, the number of attributes, the number of features with numeric data type, and the number of features with symbolic data type. These characteristics can provide a first indication of the complexity of the problem being studied.

In addition to the above-mentioned characteristics, parameters of location and dispersion can be calculated as single-dimensional measurements that describe the dataset. Location parameters are measurements such as the minimum, maximum, arithmetic mean, median, and empirical quartiles. Dispersion parameters, on the other hand, such as the range, standard deviation, and quartile deviation, provide measurements that indicate the spread of the values of a feature.

Location and dispersion parameters can be divided into two classes: those that can deal with extreme values and those that are sensitive to them. A parameter that can deal well with extreme values is called robust. Some statistical software packages provide the computation of robust parameters in addition to the traditional non-robust parameters. Comparing robust and non-robust parameter values can provide insight into the existence of outliers during the data characterization phase.

Data Visualization
Visualization techniques can also be of assistance during this exploration and characterization phase. Visualizing the data before preprocessing it can improve the understanding of the data, thereby increasing the likelihood that new and useful information will be gained from the data. Visualization techniques can be used to identify the existence of missing values and outliers, as well as to identify relationships among attributes. These techniques can, in effect, assist in ranking the "impurity" of the data and in selecting the most appropriate data preprocessing technique to apply.

Techniques for Data Preprocessing
Applying the correct data preprocessing techniques can improve the quality of the data, thereby helping to improve the accuracy and efficiency of the subsequent mining process. Lu et al. (), Pyle (), and Azzopardi () present descriptions of common techniques for preparing data for analysis. The techniques described by these authors can be summarized as follows:

(a) Data cleaning – filling in missing values, smoothing noisy data, removing outliers, and resolving inconsistencies.

(b) Data reduction – reducing the volume of data (but preserving the patterns) by removing repeated observations and applying instance selection as well as feature selection techniques. Discretization of continuous attributes is also a form of data reduction.

(c) Data transformation – converting text and graphical data to a format that can be processed, normalizing or scaling the data, aggregation, and generalization.

(d) Data integration – correcting differences in coding schemes due to the combining of several sources of data.

Data Cleaning
Data cleaning provides methods to deal with dirty data. Since dirty datasets can cause problems for data exploration and analysis, data cleaning techniques have been developed to clean data by filling in missing values (value imputation), smoothing noisy data, identifying and/or removing outliers, and resolving inconsistencies. Noise is a random error or variability in a measured feature, and several methods can be applied to remove it. Data can also be smoothed by using regression to find a mathematical equation to fit the data. Smoothing methods that involve discretization are also methods of data reduction, since they reduce the number of distinct values per attribute. Clustering methods can also be used to remove noise by detecting outliers.
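A hedged sketch of two routine cleaning steps, median imputation and a simple quartile-based outlier filter, assuming a numeric pandas DataFrame df (hypothetical data, not from the article):

```python
import pandas as pd

def clean_numeric(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Fill missing values with column medians, then drop rows containing a value
    outside [Q1 - k*IQR, Q3 + k*IQR] in any column (Tukey's fences)."""
    filled = df.fillna(df.median(numeric_only=True))
    q1, q3 = filled.quantile(0.25), filled.quantile(0.75)
    iqr = q3 - q1
    inside = (filled >= q1 - k * iqr) & (filled <= q3 + k * iqr)
    return filled[inside.all(axis=1)]
```

In practice the imputation and outlier rules should be chosen per attribute; dropping rows is only one option, and for small datasets winsorizing or model-based imputation may be preferable.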

Data Integration
Some studies require the integration of multiple databases or files. This process is known as data integration. Since attributes representing a given concept may have different names in different databases, care must be taken to avoid causing inconsistencies and redundancies in the data. Inconsistencies are observations that have the same values for each of the attributes but that are assigned to different classes. Redundant observations are observations that contain the same information.

Attributes that have been derived or inferred from others may create redundancy problems. Again, having a large amount of redundant and inconsistent data may slow down the knowledge discovery process for a given dataset.


Data Transformation
Many data mining algorithms provide better results if the data has been normalized or scaled to a specific range before these algorithms are applied. The use of normalization techniques is crucial when distance-based algorithms are applied, because the distance measurements taken on attributes that assume many values will generally outweigh distance measurements taken on attributes that assume fewer values. Other methods of data transformation include data aggregation and generalization techniques. These methods create new attributes from existing information by applying summary operations to data or by replacing raw data with higher-level concepts. For example, monthly sales data may be aggregated to compute annual sales.
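A minimal sketch of the two most common scalings, min-max and z-score, written with plain NumPy so the assumptions are explicit (column-wise scaling of a numeric array X; the function names are illustrative):

```python
import numpy as np

def min_max_scale(X: np.ndarray) -> np.ndarray:
    """Rescale each column to the [0, 1] range."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns

def z_score_scale(X: np.ndarray) -> np.ndarray:
    """Center each column and scale it to unit standard deviation."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / np.where(sigma > 0, sigma, 1.0)
```

Either transform keeps a high-variance attribute from dominating Euclidean distances in clustering or nearest-neighbor methods, which is the concern raised above.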

Data Reduction
The increased size of current real-world datasets has led to the development of techniques that can reduce the size of the dataset without jeopardizing the data mining results. The process known as data reduction obtains a reduced representation of the dataset that is much smaller in volume, yet maintains the integrity of the original data. This means that data mining on the reduced dataset should be more efficient yet produce similar analytical results. Han and Kamber () mention the following strategies for data reduction:

(a) Dimension reduction, where algorithms are applied to remove irrelevant, weakly relevant, or redundant attributes.

(b) Data compression, where encoding mechanisms are used to obtain a reduced or compressed representation of the original data. Two common types of data compression are wavelet transforms and principal component analysis.

(c) Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as parametric models (which store only the model parameters instead of the actual data), or nonparametric methods such as clustering and the use of histograms.

(d) Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges or higher conceptual levels. For example, concept hierarchies can be used to replace a low-level concept, such as age, with a higher-level concept, such as young, middle-aged, or senior. Some detail may be lost by such data generalizations (a short discretization sketch follows this list).

(e) Instance selection, where a subset of the best instances of the whole dataset is selected. Some instances are more relevant than others for a given data mining task, and working only with a near-optimal subset of instances is more cost- and time-efficient. Variants of classical sampling techniques can be used.

Final Remarks
Acuña () has developed dprep, an R package for data preprocessing and visualization. dprep performs most of the data preprocessing techniques mentioned in this article. Currently, research is being done in order to apply preprocessing methods to data streams; see Aggarwal () for more details.

About the Author
Dr. Edgar Acuña is a Professor in the Department of Mathematical Sciences, University of Puerto Rico at Mayaguez. He is also the leader of the group in Computational and Statistical Learning from Databases at the University of Puerto Rico. He has authored and co-authored more than  papers, mainly on data preprocessing. He is the author of the book (in Spanish) Analisis Estadistico De Datos usando Minitab (John Wiley & Sons, ). In , he was honored with the Power Hitter Award in Business and Technology. In , he was selected as a Fulbright Visiting Scholar. Currently, he is an Associate Editor for the Revista Colombiana de Estadistica.

Cross References
Box–Cox Transformation
Data Mining
Multi-Party Inference and Uncongeniality
Outliers

References and Further Reading
Acuna E () dprep: data preprocessing and visualization functions for classification. http://cran.r-project.org/package=dprep. R package version .
Aggarwal CC (ed) () Data streams: models and algorithms. Springer, New York
Azzopardi L () "Am I Right?" asked the classifier: preprocessing data in the classification process. Comput Inform Syst :–
Engels R, Theusinger C () Using a data metric for preprocessing advice for data mining applications. In: Proceedings of the th European conference on artificial intelligence, pp –
Fayyad UM, Piatetsky-Shapiro G, Smyth P () From data mining to knowledge discovery: an overview. In: Advances in knowledge discovery and data mining, Chapter , AAAI Press/MIT Press, pp –
Han J, Kamber M () Data mining: concepts and techniques, 2nd edn. Morgan Kaufmann Publishers
Lu H, Sun S, Lu Y () On preprocessing data for effective classification. ACM SIGMOD workshop on research issues on data mining and knowledge discovery, Montreal, QC
Pyle D () Data preparation for data mining. Morgan Kaufmann, San Francisco


Presentation of Statistical Testimony

Joseph L. Gastwirth
Professor of Statistics and Economics, George Washington University, Washington, DC, USA

Introduction
Unlike scientific research, where we have the luxury of carrying out new studies to replicate and determine the domain of validity of prior investigations, the law also considers other social goals. For example, in many nations one spouse cannot be forced to testify against the other. A main purpose of the law is to resolve a dispute, so a decision needs to be made within a reasonable amount of time after the charge is filed. Science is primarily concerned with determining the true mechanism underlying a phenomenon. Typically no limits are placed on the nature of the experiment or approach an investigator may take to a problem. In most uses of statistics in legal proceedings, the relevant events happened several years ago; rarely will you be able to collect additional data. (One exception occurs in cases concerned with violations of laws protecting intellectual property, such as the Lanham Act in the USA, which prohibits a firm from making a product that infringes on an established one. Surveys of potential consumers are conducted to estimate the percentage who might be confused as to the source of the product in question.) Often, the database will be one developed for administrative purposes, e.g., payroll or attendance records, which you will need to rely on.

Civil cases differ from criminal cases in that the penalty for violating a civil statute is not time in prison but rather compensation for the harm done. Consequently, the burden of proof a plaintiff has in a discrimination or tort case is to show that the defendant caused the alleged harm by "the preponderance of the evidence" rather than the stricter standard of "beyond a reasonable doubt" used in criminal cases. Thus, many statistical studies are more useful in civil cases.

A major difference between presenting testimony in court or a regulatory hearing and giving a talk at a major scientific conference is that expert witnesses, like all others, are only allowed to answer the questions put to them by the lawyers. Thus, particular findings or analyses that you believe are very important may not be submitted as evidence if the lawyer who hired you does not ask you about them when you are testifying. Unless the judge, or in some jurisdictions a juror, asks you a question related to that topic, you are not allowed to discuss it.

Courts have also adopted criteria to assess the reliability of scientific evidence, as well as some traditional ways of presenting and analyzing some types of data. (The leading case is Daubert v. Merrell-Dow Pharmaceuticals Inc.,  U.S.  (). The impact of this case and two subsequent ones on scientific evidence is described by Berger () and Rosenblum (). Some of the criteria courts consider are: whether the methodology was subject to peer review, whether it can be replicated and tested, and whether the potential error rates are known and considered in the expert's report and testimony.) This may limit the range of analyses you can use in the case at hand, although subsequently it can stimulate interesting statistical research. A potentially more serious threat to the credibility of your research and subsequent testimony is due to the fact that the lawyers provide you with the data and background information. Thus, you may not even be informed that other information or data sets exist.

A related complication can arise when the lawyer hires both a consulting expert, who has complete access to all the data as he or she is protected by the "work product" rule, and a "testifying expert." This second expert may only be asked to analyze the data favorable to the defendant and not be told that any other data exist. Sometimes the analytic approach, e.g., regression analysis, may be suggested to this expert because the lawyer already knows the outcome. If one believes an alternative statistical technique would be preferable, or at least deserves exploration, the expert may be constrained as to the choice of methodology.

This entry describes examples of actual uses of statistical evidence, along with suggestions to aid courts in understanding the implications of the data. Section "Presenting the Data or Summary Tables That Will Be Analyzed" discusses the presentation of data and the results of statistical tests. One dataset illustrates the difficulty of getting lawyers and judges to appreciate the statistical concept of "power," the ability of a test to detect a real or important difference. As a consequence, an analysis that used a test with no power was accepted by a court. Section "A More Informative Summary of Promotion Data: Hogan v. Pierce ( F.E.P.  (D.D.C. ))" shows how, in a subsequent case, I was able to present more detailed data, which helped make the data clearer to the court. The last section offers some suggestions for improving the quality of statistical analyses and their presentation.

Presenting the Data or Summary Tables That Will Be Analyzed
In the classic Castaneda v. Partida ( U.S. ,  S. Ct.  ()) case concerning whether Mexican-Americans were discriminated against in the jury selection process, the Court summarized the data by years, i.e., the data for all juries during a year were aggregated and the minority fraction compared to their fraction of the total population as well as to the subgroup eligible for jury service. (The data is reported in footnote  of the opinion as well as in the texts by Finkelstein and Levin () and Gastwirth ().) The data showed a highly significant difference between the Mexican-American fraction of jurors (%) and both their fraction of the total population (.%) and of adults with some schooling (%). From a statistical view the case is important because it established that formal statistical hypothesis testing would be used rather than intuitive judgments about whether the percentage of jurors who were from the minority group differed sufficiently from the percentage of minorities eligible for jury service. When the lower courts followed the methodology laid out in Castaneda, they also adopted the tradition of presenting yearly summaries of the data in discrimination cases.

Unlike jury discrimination cases, which are typically brought by a defendant in a criminal case rather than by the minority juror who was dismissed, in equal employment cases the plaintiff is the individual who suffered the alleged discriminatory act. In the United States the plaintiff has  days from the time of the alleged act, e.g., not being hired or promoted, or being laid off, to file a formal complaint with the Equal Employment Opportunity Commission (EEOC). Quite often, after receiving notice of the complaint, the employer will modify their system to mitigate the effect of the employment practice under scrutiny on the minority group in question. The impact of this "change" in policy on statistical analysis has often been overlooked by courts. In particular, if a change occurs during the year the charge is filed, the employer may change their policy and include the post-charge minority hires or promotions in their analysis.

To illustrate this, let me use data from a case in which I was an expert for the plaintiffs, Capaci v. Katz & Besthoff ( F. Supp.  (E.D. La. ), aff'd in part, rev'd in part,  F.d  (th Cir. )). On January , , the plaintiff filed a charge of discrimination against women in promotions in the pharmacy department. One way such discriminatory practices may be carried out is to require female employees to work longer at the firm before promotion than males. Therefore, a study comparing the length of time male and female Pharmacists served before they were promoted to Chief Pharmacist was carried out. The time frame considered started in July , the effective date of the Civil Rights Act, and ran until the date of the charge. The times each Pharmacist who was promoted had served until they received their promotion are reported in Table ; only the initials of the employees are given.

Presentation of Statistical Testimony. Table  Months of service for male and female pharmacists employed at K&B during the period July  through January , before receiving a promotion to Chief Pharmacist

Females: ; .

Males: ; ; ; ; ; ; ; ; , ; ; ; ; ; ; ; ; ;; ; ; ; ; .

Applying the Wilcoxon test (see Wilcoxon–Mann–Whitney Test), incorporating ties, yielded a statistically significant difference (p-value = .). The average number of months the two females worked until they were promoted was , while the average male worked for  months before promotion. The corresponding medians were  and  months, respectively. The defendant's expert presented the data, given in Table , broken out into two time periods, each ending at the end of a year. The first was from  until the end of  and the second was from  until . There are three more females and eight more males in the defendant's data because their expert included essentially the first year's data subsequent to the charge. Furthermore, the three females fell into the seniority categories of –, –, and – months, i.e., they had much less seniority than the two females who were promoted prior to the complaint.

The defendant's expert did not utilize the Wilcoxon test; rather, he analyzed all the data sets with the median test and found no significant differences. In contrast with the Wilcoxon analysis of the data in Table , the median test did not find the difference in the time to promotion data in the pre-charge period to be statistically significant.

Because only two females were promoted from July  until the charge was filed in January , the median test has zero power to detect a difference between the two samples. Thus, I suggested that the plaintiffs' lawyer ask the defendant's expert the following series of questions on cross examination:

1. What is the difference between the average times to promotion of the male and female pharmacists in Table ?
   Expected Answer: about  years. The actual difference was  years, as the mean female took  months while the mean male took  months to be promoted.

2. Suppose the difference in the two means or averages was  years, would the median test have found a statistically significant difference between the times that females and males had to work before being promoted?
   Expected Answer: No.

3. Suppose the difference in the two means or averages was  years, would the median test have found a statistically significant difference between the times that females and males had to work before being promoted?
   Expected Answer: No.

4. Suppose the difference in the two means or averages was , years, would the median test have found a statistically significant difference between the times that females and males had to work before being promoted?
   Expected Answer: No.

5. Suppose the difference in the two means or averages was one million years, would the median test have found a statistically significant difference between the times that females and males had to work before being promoted?
   Expected Answer: No.

My thought was that the above sequence of questions would have shown the judge the practical implication of finding a non-significant result with a test that did not have any power, in the statistical sense. Unfortunately, after the lawyer asked the first question, she jumped to the last one. By doing so, the issue was not made clear to the judge. When I asked why, I was told that she felt that the other expert realized the point. Of course, the questions were designed to explain the practical meaning of statistical "power" to the trial judge, not the expert. A while later, while describing the trial to another, more experienced lawyer, he told me that after receiving the No answers to the five questions he would have turned to the expert and asked him:

6. As your statistical test could not detect a difference of a million years between the times to promotion of male and female employees, just how long would my client and other females have to work without receiving a promotion before your test would find a statistically significant difference?

This experience motivated me to look further into the power properties of nonparametric tests, especially in the unbalanced sample size setting (Gastwirth and Wang ; Freidlin and Gastwirth a). The data from the Capaci case is discussed by Finkelstein and Levin (, p. ) and Gastwirth (, p. ), and the need to be cautious when a test with low power accepts the null hypothesis is emphasized by Zeisel and Kaye (, p. ).
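To make the power point concrete, the hedged simulation below compares the Wilcoxon rank-sum test with a median test (pooled-median dichotomization plus Fisher's exact test) in an unbalanced design of 2 versus 23 observations, loosely mirroring the Capaci sample sizes; the shift size and normal distribution are invented for illustration, not the case data.

```python
import numpy as np
from scipy.stats import mannwhitneyu, fisher_exact

rng = np.random.default_rng(0)

def median_test_p(x, y):
    """Median test: dichotomize both samples at the pooled median and
    apply Fisher's exact test to the resulting 2x2 table."""
    m = np.median(np.concatenate([x, y]))
    table = [[np.sum(x > m), np.sum(x <= m)],
             [np.sum(y > m), np.sum(y <= m)]]
    return fisher_exact(table)[1]

def simulated_power(n_x=2, n_y=23, shift=2.0, reps=2000, alpha=0.05):
    """Estimated rejection rates of the two tests when group x is shifted upward."""
    rej_wilcoxon = rej_median = 0
    for _ in range(reps):
        x = rng.normal(loc=shift, size=n_x)   # e.g., the two promoted females
        y = rng.normal(loc=0.0, size=n_y)     # e.g., the promoted males
        rej_wilcoxon += mannwhitneyu(x, y, alternative="two-sided").pvalue < alpha
        rej_median += median_test_p(x, y) < alpha
    return rej_wilcoxon / reps, rej_median / reps

if __name__ == "__main__":
    print(simulated_power())   # (Wilcoxon power, median-test power)
```

With only two observations in one group, the median test essentially never rejects no matter how large the shift is, while the Wilcoxon test retains some power, which is the practical point the cross-examination questions were meant to convey.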

To further illustrate the change in practices the employer made, one could examine the data for –. It turns out that the mean (median) time to promotion for males was . () and for females was . (). Thus, after the charge, males had to work at least a year more than females before they were promoted to Chief Pharmacist. This is an example of a phenomenon I refer to as "A Funny Thing Happens on the Way to the Courtroom." From both a "common sense" standpoint as well as a legal one, the employment actions in the period leading up to the complaint have the most relevance for determining what happened when the plaintiff was being considered for promotion. (Similar issues of timing occur in contract law, where the meaning and conditions of a contract at the time it was signed are used to determine whether it has been properly carried out by both parties. In product liability law, a manufacturer is not held liable for risks that were not known when the product was sold to the plaintiff but were discovered subsequently.) Indeed, quite often a plaintiff applies for promotion on several occasions and only after being denied it on all of them files a formal charge. (For example, in Watson v. Fort Worth Bank & Trust,  U.S.  (), the plaintiff had applied for promotion four times. The opinion indicates that the Justices felt that she was unfairly denied promotion on her fourth attempt.)

Comment : Baldus and Cole (, p. ) refer to the dispute concerning the Wilcoxon and median tests in the Capaci case in a section concerning the difference between practical and statistical significance. Let me quote them:

Exclusive concern for the statistical significance of a disparity encourages highly technical skirmishes between plaintiffs' and defendants' experts, who may find competing methods of computing statistical significance advantageous in arguing their respective positions (citing the Capaci case). The importance of such skirmishes may be minimized by limiting the role of a test of significance to that of aiding in the interpretation of a measure of impact whose practical significance may be evaluated on non-technical grounds.

Thus, even the authors of a commonly cited text on statistical proof of discrimination at the time did not appreciate the importance of the theory of hypothesis testing and the role of statistical power in choosing between tests. More importantly, a difference in the median time to promotion of  −  = , or about  years (or the difference in the average times of about . years), would appear to me to be of practical significance. Thus, well-respected authors as well as the judiciary allowed the defendant's expert to use the median test, which had no power to detect a difference in time to promotion, to obfuscate a practically meaningful difference.

Comment : The expert for the defendant was a social

scientist rather than a statistician. Other statisticians who have faced experts of this type have mentioned to me that non-statisticians often haven't had sufficient training in our subject to know how one should properly choose a procedure. They may select the first procedure that comes to mind or choose a method that helps their client even though it is quite inappropriate, and their ignorance of statistical theory makes it difficult for the lawyer you are working for to get them to admit that their method is not as powerful (or accurate or reliable) as yours. An example of this occurred in a case concerning sex discrimination in pay, when an "expert" compared the wages of small groups of roughly similar males and females with the t-test. It is well known that income and wage data are typically quite skewed and that the distribution of the two-sample t-test in small samples depends on the assumption of normality. I provided this information to the lawyer, who then asked the other "expert" whether he had ever done any research using income or wage data (Ans. No) and whether he had ever carried out any research or read the literature on the t-test and its properties (Ans. No). Thus, it was difficult for the lawyer to get this "expert" to admit that using the t-test in such a situation is questionable and that the significance levels might not be reliable. On a more positive note, the Daubert ( U.S.  ()) opinion listed several criteria for courts to evaluate expert scientific testimony, one of which is that the method used has a known error rate. Today one might be able to fit a skewed distribution to the data and then show by simulation that a nominal . level test has an actual level (α) of . or more. Similarly, if one must use the t-test in such a situation, one could conduct a simulation study to obtain "correct" critical values that will ensure that a nominal . level test has a true level between . and .. (Although I use the . level for illustration, I agree with Judge Posner () that it should not be used as a talisman. Indeed, Judge P. Higginbotham's statement in Vuyanich v. Republic National Bank,  F. Supp.  (N.D. Texas ), that the p-value is a sliding scale makes more sense than a simple yes-no dichotomy of significance or not in the legal context, as the statistical evidence is only part of the story. The two-sided p-value of . on the post-charge time until promotion data from the Capaci case illustrates the wisdom of the statements of these judges. Not only do the unbalanced sample sizes diminish the power of two-sample tests, but the change in the promotion practices of the defendant subsequent to the charge is quite clear from the change in the difference in average waiting times until promotion of males and females, as well as the change from a significant difference in favor of males before the charge to a nearly significant difference in favor of females after the charge.)

Another way of demonstrating that an expert does not possess relevant knowledge is for the lawyer to show them a book or article that states the point you wish to make and ask the expert to read it and then say whether they agree or disagree with the point. (Dr. Charles Mann told the author that he has been able to use this technique successfully.) If the expert disagrees, a follow-up question can inquire why, or on what grounds, he or she disagrees with it.
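A hedged sketch of the kind of simulation mentioned above, estimating the actual Type I error rate of a nominal .05-level two-sample t-test when the underlying wage distribution is skewed (here lognormal, with invented group sizes rather than the case data):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

def actual_level(n1=5, n2=20, reps=10000, alpha=0.05):
    """Rejection rate of the pooled two-sample t-test when both groups are
    drawn from the same lognormal 'wage' distribution (the null is true)."""
    rejections = 0
    for _ in range(reps):
        x = rng.lognormal(mean=10.0, sigma=1.0, size=n1)
        y = rng.lognormal(mean=10.0, sigma=1.0, size=n2)
        rejections += ttest_ind(x, y).pvalue < alpha
    return rejections / reps

if __name__ == "__main__":
    print(actual_level())   # estimate of the true level; compare with the nominal 0.05
```

The same loop, run over a grid of candidate critical values instead of the default one, is one way to obtain the "corrected" critical values the text describes.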

The tradition of reporting data by year also makes it more difficult to demonstrate that a change occurred after a charge was filed. In Valentino v. USPS ( F.d  (DC Cir. )), the plaintiff had applied for a promotion in  and filed the charge in May . The data are reported in Table  and have been reanalyzed by Freidlin and Gastwirth (b) and Kadane and Woodworth (); the reanalyses suggested that females received fewer promotions than expected in the two previous years, but afterwards they received close to their expected number. Let me recall the data and analysis that the yearly summaries enabled us to present.

The data for each year were analyzed by the Mantel-Haenszel test applied to the 2 × 2 tables for each grade grouping in Tables  and . This was done because the Civil Service Commission reported its data in this fashion. Notice that during two time periods, the number of grade advancements awarded to females each year is significantly less than expected. After , when the charge was filed, the female employees start to receive their expected number of promotions.

ti applied for was the th competitive one lled in .Unfortunately, data on all of the applicants was not avail-able even though EEOC guidelines do require employersto preserve records for at least months. Of the or sopositions for which data on the actual applicants was avail-able every one was given to amale candidate. Since femalesdid receive their expected number of promotions over theentire year it would appear that the defendant changed itspractices aer the charge and consequently prevailed inthe case. (e data discussed in the cited references con-sidered employees in job categories classied by their levelin the system.e district court accepted the defendant’scriticism that since each job level contains positions in avariety of occupations, the plaintis should have stratiedthe data by occupation; see F. Supp , (D.C. DC,). Later in the opinion, at F. Supp. , the opinionaccepted a regression analysis submitted by the defendant


Presentation of Statistical Testimony. Table  Number of employees and promotions they received, from the Valentino v. U.S.P.S. case

Grade – Grade – Grade – Grade – Grade –

Time period M F M F M F M F M F

/–/

/–/

/–/

/–/

/–/

Key to symbols: F = females; M = males; for any time period and grade group the number of promotions is below the number of employees. For example, in grades – during the /–/ period,  out of  eligible males were promoted compared to  out of  eligible females

Presentation of Statistical Testimony. Table  The results obtained from the Mantel-Haenszel test for equality of promotion rates applied to the stratified data for each period in Table 

Year Observed Expected p-value (two-sided)

– . .

– . .

– . .

– . .

– . .

that used grade level as a predictor of salary, noting that "level" is a good proxy for occupation. While upholding the ultimate finding that the U.S.P.S. did not discriminate against women, the appellate opinion, fn.  at , did not accept the district court's criticism that a regression submitted by plaintiffs should have included the level of a position. The reason is that in a promotion case, it is advancement in one's job level that is the issue. Thus, courts do accept regressions that include the major other job-related characteristics such as experience, education, special training, and objective measures of productivity.) There was one unusual aspect of the case; the Postal Service had changed the system of reviewing candidates for promotions as of January , . As no other charges of discrimination in promotion had been filed in either  or , the analysis of data for those years is considered background information unless the plaintiff can demonstrate that the same process carried over into the time period when the promotion in question was made. (In Evans v. United Airlines, the Supreme Court stated that since charges of employment discrimination need to be filed within  days of the alleged act, earlier data is useful background information but is not sufficient by itself to establish a prima facie case of discrimination. If there is a continuing violation, however, then the earlier data can be used by the plaintiffs in conjunction with more recent data. The complex legal issues involved in determining when past practices have continued into the time period relevant to a particular case are beyond the scope of the present paper.)

When data is reported in yearly periods, invariably some post-charge data will be included in the data for the year in which the charge was filed, and statisticians should examine it to see if there is evidence of a change in employment practice subsequent to the charge. In the Capaci case, the defendant included eleven months of post-charge data in their first time period, –. The inclusion of three additional females promoted in that period lessens the impact of the fact that only two female Pharmacists were promoted during the previous seven and a half years. Similarly, reporting the data by year in Valentino enabled the


defendant to monitor the new promotion system begun at the start of , subsequent to the complaint, so females did receive their expected number of promotions for , the year of the charge.

A More Informative Summary of Promotion Data: Hogan v. Pierce ( F.E.P.  (D.D.C. ))
The plaintiff in the case alleged that he had been denied a promotion to a GS- level position in the computer division of a government agency and filed a formal complaint in . As the files on the actual applicants were unavailable, we considered all individuals employed in GS- level jobs whose records indicated that they met the Civil Service criteria for a promotion to the next job level to be the pool of "qualified applicants." (These qualifications were that they were employed in an appropriate computer-related position and had at least one year of experience at the previous level (GS-) or its equivalent.) There were about ten opportunities for promotion to a GS- post during the several years prior to the complaint. About three of the successful GS- job applicants were outside candidates who had a Ph.D. in computer science, all of whom were white. Since they had a higher level of education than the internal candidates, they are excluded from Table , which gives the number of employees who were eligible and the number promoted, by race, for the ten job announcements.

nically one might want to apply a survival test such as thelog-rank procedure, it was easier to analyze the data by theMantel-Haenszel (MH) test that combines the observedminus expected numbers from the individual the individ-ual × tables (this is, of course, the way the log-rank testis also computed and the resulting statistics are the same).Although there were only promotions awarded to inter-nal candidates during the period under consideration noneof them went to a black. Moreover, the exact p-value of theMH test was ., a clearly signicant result.e analysiscan be interpreted as a test for the eect of race controllingfor eligibility by Feinberg (, p. ) and has been dis-cussed by Agresti () in the STATXACTmanual (,p. ) where it is shown that the lower end of a % con-dence interval of the odds a white employee receives apromotion relative to a minority employee is about two.us, we can conclude that the odds of a black employeehad of being promoted were half those of a white, which isclearly of practical as well as statistical signicance.In order to demonstrate that the most plausible poten-

tial explanation of the promotion data, the white employ-ees had greater seniority, data was submitted that showed

Presentation of Statistical Testimony. Table  Promotion data for GS- positions obtained by internal job candidates, by race, from Hogan v. Pierce (from plaintiff's exhibit on file with the D.C. District Court; reproduced in Gastwirth (, p. ) and in the STATXACT manual (Version , p. ))

Whites Blacks

Date of Promotion Eligible Promoted Eligible Promoted

July

August

Sept.

April

May

Oct.

Nov.

Feb.

March

Nov.

that by  the average black employee had worked over a year more at the GS- level than the average white one. Thus, if seniority were a major factor that could explain why whites received the earlier promotions, it should have worked in favor of the black employees in the later part of the period. The defendant did not suggest an alternative analysis of this data but concentrated on post-charge data; Judge A. Robinson observed at the trial, however, that their analysis should have considered a time frame around the time of the complaint.
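A minimal sketch of the Mantel-Haenszel computation described above, written from scratch for a list of stratified 2×2 promotion tables; the example tables are invented and are not the Hogan v. Pierce data.

```python
import numpy as np
from scipy.stats import norm

def mantel_haenszel(tables):
    """Mantel-Haenszel test for a common group effect across 2x2 tables.
    Each table is ((a, b), (c, d)): rows = group (e.g., white/black),
    columns = promoted / not promoted within one stratum."""
    obs = exp = var = 0.0
    for (a, b), (c, d) in tables:
        n = a + b + c + d
        row1, col1 = a + b, a + c
        obs += a
        exp += row1 * col1 / n                              # expected count under H0
        var += row1 * (n - row1) * col1 * (n - col1) / (n**2 * (n - 1))
    z = (obs - exp) / np.sqrt(var)                          # MH statistic, no continuity correction
    return z, 2 * norm.sf(abs(z))                           # two-sided normal p-value

# Hypothetical strata: promoted / not-promoted counts for the two groups
tables = [((1, 5), (0, 7)), ((1, 6), (0, 8)), ((1, 4), (0, 9))]
z, p = mantel_haenszel(tables)
print(round(z, 3), round(p, 4))
```

The exact p-value quoted in the entry conditions on the table margins; the sketch above uses the usual normal approximation, which follows the same observed-minus-expected logic as the log-rank test.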

A Few Suggestions to Statisticians Desiring to Increase the Validity of and Weight Given to Statistical and Scientific Evidence
Mann (), cited and endorsed by Mallows (), noted that if your analysis does not produce the results the lawyer who hired you desired, you will likely be replaced. Indeed, he notes that many attorneys act as though they will be able to find a statistician who will give them the results they want and that "regrettably they may often be correct." This state of affairs unfortunately perpetuates the statement that there are "lies, damn lies and statistics," attributed to both B. Disraeli and M. Twain. Statisticians have suggested questions that judges may ask experts (Fienberg et al. ) and discussed related ethical issues arising in giving testimony (Meier ; Kadane ). Here we mention a few more suggestions for improving statistical testimony and close with a brief discussion questioning the wisdom of a recent editorial that appeared in Science.

1. Mann () is correct in advising statisticians who are asked to testify to inquire as to the existence of other data or the analyses of any previous expert the lawyer consulted. I would add that these are especially important considerations if you are asked to examine data very shortly before a trial, as you will have limited time to understand the data and how it was generated, so you will need to rely totally on the data and background information the lawyer provides. In civil cases, it is far preferable for an expert to be involved during the period when discovery is being carried out. Both sides are asking the other for relevant data and information, and it is easier for you to tell the lawyer the type of data that you feel would help answer the relevant questions. If the lawyers ignore your requests or don't give you a sensible justification, you should become concerned. (Let me give an example of a business justification. You are asked to work for a defendant in an equal employment case concerning the fairness of a layoff carried out in a plant making brakes for cars and SUVs. Data on absenteeism would appear to be quite useful. The employer knows that the rate of absenteeism in the plant was "higher" than normal for the industry and might be concerned that if this information was put into the public record, they might become involved in many suits arising from accidents involving cars with brakes made there. Since the corporation is a profit-making organization, not a scientific one, they will carry out a "cost-benefit" analysis to decide whether you will be given the data.)

The author's experience in Valentino illustrates this point. One reason the original regression equation we developed for the plaintiffs was considered incomplete was that it did not contain relevant information concerning any special degrees the employees had. This field was omitted from the original file we were given, and we only learned about it a week or so before trial. After finding that employees with business, engineering, or law degrees received about the same salary increase, we included a single indicator for having a degree in one of the three areas. To assist the court, it would have been preferable to use separate indicators for each degree. (Both opinions, see  F.d , fn. , downplayed this factor because the defendant's expert testified that it was unreliable, as the information depended on whether applicants listed any specialty on their form. More importantly, use of the single indicator variable for three different degrees may not have made it clear that they all had a similar effect on salary and that the indicator was restricted to employees in these three specialties.) Given the very short time available to incorporate the new but important information, which did reduce the estimated coefficient on sex by a meaningful amount although it remained significant, one did not have time to explore how best to present the analysis.

2. When you are approached by counsel, you should ask for permission to write an article about any interesting statistical problems or data arising in the case. (Of course, data and exhibits submitted formally into evidence are generally considered to be in the public domain, so one can use them in scholarly writings. Sometimes, however, the data in the exhibits are summaries, and either the individual data or summaries of smaller sub-groups are required to illustrate the methodology.) While the lawyers and client might request that some details of the case not be reported so that the particular case or parties involved are not identifiable, you should be given permission to show the profession how you used or adapted existing methodology. If you perceive that the client or lawyer want to be very restrictive about the use and dissemination of the data and your analysis, you should think seriously before becoming involved. I realize that it is much easier for an academic statistician to say "no" in this situation than for statisticians who do consulting for a living, especially if they have a long-term business relationship with a law firm.

3. Statisticians are now being asked by judges to advise them in "Daubert" hearings, which arise when one party in the case challenges the scientific validity or reliability of the expert testimony the other side desires to use. This is a useful service that all scientific professions can provide the legal system. Before assessing a proposed expert's testimony and credentials, it is wise for you to look over the criteria the court suggested in the Daubert opinion as well as some examples of decisions in such matters. Because courts take a skeptical view of "new techniques" that have not been subject to peer review but appear to have been developed specifically for the case at hand, one should not fault an expert who does not use the most recently developed method but uses a previously existing method that has been generally regarded in the field as appropriate for the particular problem. The issue is not whether you would have performed the same analysis the expert conducted but whether the analysis is appropriate and statistically sound. Indeed, Kadane () conducted a Bayesian analysis of data arising in an equal employment case and confirmed it with a frequentist method used in biostatistics that had been suggested for the problem by Gastwirth and Greenhouse (). He noted that it is reassuring when two approaches lead to the same inference. In my experience, this often occurs when data sets of moderate to large sample size are examined and one uses some common sense in interpreting any difference. (By common sense I mean that if one statistician analyzes a data set with test A and obtains a p-value of ., i.e., a statistically significant one, but the other statistician uses another appropriate test B and obtains a p-value of ., a so-called non-significant one, the results are really not meaningfully different.)

4. A "Daubert" hearing can be used to provide the judge with questions to ask an expert about the statistical methodology utilized. (This also gives the court's expert the opportunity to find the relevant portions of books and articles that can be shown to the expert, thereby implementing the approach described in Comment  of section "A More Informative Summary of Promotion Data: Hogan v. Pierce ( F.E.P.  (D.D.C. ))".) This may be effective in demonstrating to a court that a non-statistician really does not have a reasonable understanding of some basic concepts, such as power or the potential effect of violations of the assumptions underlying the methods used.

5. One task experts are asked to perform is to criticize or raise questions about the findings and methodology of the testimony given by the other party's expert. This is one place where it is easy to become overly partisan and lose one's scientific objectivity. This important problem is discussed by Meier (). Before raising a criticism, one should ask whether it is important: could it make a difference in either the conclusion or the weight given to it? In addition to checking that the assumptions underlying the analysis they are presenting are basically satisfied, one step experts can take to protect their own statistical analyses from being unfairly criticized is to carry out a sensitivity analysis. Many methods for assessing whether an omitted variable could change an inference have been developed (Rosenbaum ), and van Belle () has discussed the relative importance of violations of the assumptions underlying commonly used techniques.

6. While this entry has focused on statistical testimony in civil cases, similar issues arise in the use of statistical evidence in criminal cases. The reader is referred to Aitken and Taroni () and Balding () for the statistical methods commonly used in this setting. Articles by a number of experts in both civil and criminal cases appearing in Gastwirth () provide additional perspectives on issues arising in the use of statistical evidence in the legal setting.

Acknowledgments
This entry was adapted from Gastwirth (). The author wishes to thank Prof. M. Lovric for his helpful suggestions.

About the Author
Professor Joseph L. Gastwirth is a Fellow of the ASA (), IMS (), and AAS (), and an elected member of the ISI (). He was President of the Washington Statistical Society (–). He has served three times on the American Statistical Association's Law and Justice Statistics Committee and currently is its Chair. He has written over  articles concerning both the theory and application of statistics, especially in the legal setting. In , he received a Guggenheim Fellowship, which led to his two-volume book Statistical Reasoning in Law and Public Policy (), and the Shiskin Award () for Research in Economic Statistics for his development of statistical methods for measuring economic inequality and discrimination. He edited the book Statistical Science in the Courtroom (), and in  his article with Dr. B. Freidlin, using change-point methods to analyze data arising in equal employment cases, shared the "Applications Paper of the Year" award of the American Statistical Association. He has served on the editorial boards of several major statistical journals, and recently he was appointed Editor of the journal Law, Probability and Risk. On August , , a workshop celebrating  years of statistical activity of Professor Gastwirth was organized at the George Washington University to "recognize his outstanding contributions to the development of nonparametric and robust methods and their use in genetic epidemiology, his pioneering research in statistical methods and measurement for the analysis of data arising in the areas of economic inequality, health disparities and equal employment, and in other legal applications".

Cross References
Frequentist Hypothesis Testing: A Defense
Null-Hypothesis Significance Testing: Misconceptions
Power Analysis
P-Values
Significance Testing: An Overview
Significance Tests: A Critique
Statistical Evidence
Statistics and the Law
Student's t-Tests
Wilcoxon–Mann–Whitney Test

References and Further Reading
Agresti A () An introduction to categorical data analysis. Wiley, New York
Aitken C, Taroni F () Statistics and the evaluation of evidence for forensic scientists, 2nd edn. Wiley, Chichester, UK
Balding DJ () Weight of the evidence for forensic DNA profiles. Wiley, Chichester, UK
Baldus DC, Cole JWL () Statistical proof of discrimination: cumulative supplement. Shepard's/McGraw-Hill, Colorado Springs
Berger MA () The supreme court's trilogy on the admissibility of expert testimony. In: Reference manual on scientific evidence, 2nd edn. Federal Judicial Center, Washington, DC, pp –
Fienberg SE () The evolving role of statistical assessments as evidence in courts. Springer, New York
Fienberg SE, Krislov SH, Straf M () Understanding and evaluating statistical evidence in litigation. Jurimetrics J :–
Finkelstein MO, Levin B () Statistics for lawyers, 2nd edn. Springer, New York
Freidlin B, Gastwirth JL (a) Should the median test be retired from general use? Am Stat :–
Freidlin B, Gastwirth JL (b) Change point tests designed for the analysis of hiring data arising in equal employment cases. J Bus Econ Stat :–
Gastwirth JL, Greenhouse SW () Estimating a common relative risk: application in equal employment. J Am Stat Assoc :–
Gastwirth JL, Wang JL () Nonparametric tests in small unbalanced samples: application in employment discrimination cases. Can J Stat :–
Gastwirth JL () Statistical reasoning in law and public policy, vol : statistical concepts and issues of fairness. Academic, Orlando, FL
Gastwirth JL (ed) () Statistical science in the courtroom. Springer, New York
Gastwirth JL () Some issues arising in the presentation of statistical testimony. Law, Probability and Risk :–
Kadane JB () A statistical analysis of adverse impact of employer decisions. J Am Stat Assoc :–
Kadane JB () Ethical issues in being an expert witness. Law, Probability and Risk :–
Kadane JB, Woodworth GG () Hierarchical models for employment decisions. J Bus Econ Stat :–
Mallows C () Parity: implementing the telecommunications act of . Stat Sci :–
Mann CR () Statistical consulting in the legal environment. In: Gastwirth JL (ed) Statistical science in the courtroom. Springer, New York, pp –
Meier P () Damned liars and expert witnesses. J Am Stat Assoc :–
Rosenbaum PR () Observational studies, 2nd edn. Springer, New York
Rosenblum M () On the evolution of analytic proof, statistics and the use of experts in EEO litigation. In: Gastwirth JL (ed) Statistical science in the courtroom. Springer, New York, pp –
van Belle G () Statistical rules of thumb. Wiley, New York
Zeisel H, Kaye DH () Prove it with figures: empirical methods in law and litigation. Springer, New York

Principal Component Analysis

Ian Jolliffe
Professor, University of Exeter, Exeter, UK

Introduction
Large or massive data sets are increasingly common and often include measurements on many variables. It is frequently possible to reduce the number of variables considerably while still retaining much of the information in the original data set. Principal component analysis (PCA) is probably the best known and most widely used dimension-reducing technique for doing this. Suppose we have n measurements on a vector x of p random variables, and we wish to reduce the dimension from p to q, where q is typically much smaller than p. PCA does this by finding linear combinations, a1′x, a2′x, . . . , aq′x, called principal components, that successively have maximum variance for the data, subject to being uncorrelated with the previous ak′x's. Solving this maximization problem, we find that the vectors a1, a2, . . . , aq are the eigenvectors of the covariance matrix, S, of the data, corresponding to the q largest eigenvalues (see Eigenvalue, Eigenvector and Eigenspace). The eigenvalues give the variances of their respective principal components, and the ratio of the sum of the first q eigenvalues to the sum of the variances of all p original variables represents the proportion of the total variance in the original data set accounted for by the first q principal components. The familiar algebraic form of PCA was first presented by Hotelling (), though Pearson () had earlier given a geometric derivation. The apparently simple idea actually has a number of subtleties, and a surprisingly large number of uses, and has a vast literature, including at least two comprehensive textbooks (Jackson ; Jolliffe ).
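A minimal NumPy sketch of the computation just described, assuming a data matrix X with n rows (observations) and p columns (variables); setting use_correlation=True gives the correlation-based variant discussed later in the entry.

```python
import numpy as np

def pca(X: np.ndarray, q: int, use_correlation: bool = False):
    """Return the first q principal component loadings, the component
    scores, and the proportion of total variance they account for."""
    Xc = X - X.mean(axis=0)                        # center each variable
    if use_correlation:
        Xc = Xc / Xc.std(axis=0, ddof=1)           # standardize -> correlation matrix
    S = np.cov(Xc, rowvar=False)                   # covariance (or correlation) matrix
    eigvals, eigvecs = np.linalg.eigh(S)           # eigendecomposition of symmetric S
    order = np.argsort(eigvals)[::-1]              # sort eigenvalues in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    loadings = eigvecs[:, :q]                      # a1, ..., aq as columns
    scores = Xc @ loadings                         # values of the components
    explained = eigvals[:q].sum() / eigvals.sum()  # proportion of total variance
    return loadings, scores, explained
```

The proportion returned corresponds to the ratio of eigenvalue sums described above and is the quantity usually consulted when deciding how many components to keep.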

An ExampleAs an illustration we use an example that has been widelyreported in the literature, and which is originally due to

Page 49: International Encyclopedia of Statistical Science || Probit Analysis

Principal Component Analysis P

P

Principal Component Analysis. Table Principal ComponentAnalysis Vectors of coefficients for the first two principalcomponents for data from Yule et al. ()

Variable a a

x . .

x . .

x . .

x . .

x . .

x . −.

x . −.

x . −.

x . −.

x . −.

Yule et al. ().e data consist of scores, between and, for children aged / − years from the Isle ofWight, on ten subtests of theWechsler Pre-School and Pri-mary Scale of Intelligence. Five of the tests were “verbal”tests and vewere ’performance’ tests. Table gives the vec-tors a, a that dene the rst two principal components forthese data.e rst component is a linear combination of the tenscores with roughly equal weight (maximum ., mini-mum .) given to each score. It can be interpreted as ameasure of the overall ability of a child to dowell on the fullbattery of ten tests, and represents themajor (linear) sourceof variability in the data. On its own it accounts for %of the original variability.e second component contraststhe rst ve scores (verbal tests) with the ve scores on theperformance tests. It accounts for a further % of the totalvariability.e form of this second component tells us thatonce we have accounted for overall ability, the next mostimportant (linear) source of variability in the tests scoresis between those children who do well on the verbal testsrelative to the performance tests and those children whosetest score prole has the opposite pattern.

Covariance or CorrelationPrincipal components successively maximize variance,and can be found from the eigenvalues/eigenvectors ofa covariance matrix. Oen a modication is adopted, in

order to avoid two problems. If the p variables are mea-sured in a mixture of units, then it is dicult to interpretthe principal components. What is meant by a linear com-bination of weight, height and temperature, for example?Furthermore, if we measure temperature and weight inF and pounds respectively, we may get completely dif-ferent principal components from those obtained from thesame data but using C and kilograms. To avoid this arbi-trariness, we standardize each variable to have zero meanand unit variance. Finding linear combinations of thesestandardized variables that successively maximize vari-ance, subject to being uncorrelated with previous linearcombinations, leads to principal components dened bythe eigenvalues and eigenvectors of the correlation matrix,rather than the covariance matrix, of the original vari-ables. When all variables are measured in the same units,covariance-based PCA may be appropriate, but even herethey can be uninformative when a few variables havemuchlarger variances than the remainder. In such cases the rstfew components are dominated by the high-variance vari-ables and tell us little that could not have been deduced byinspection of the original variances. Circumstances existwhere covariance-based PCA is of interest but most PCAsencountered in practice are correlation-based. Our exam-ple is a case where either approach could be used. eresults given above are based on the correlationmatrix but,because the variances of all tests are similar, results froma covariance-based analysis would be little dierent.

How Many Components?We have talked about q principal components account-ing for most of the variation in the p variables? What ismeant by “most” and, more generally, how do we decidehow many components to keep?ere is a large literatureon this topic – see, for example, Jollie (), Chap. . Per-haps the simplest procedure is to set a threshold, say %,and stop when the rst q components account for a per-centage of total variation greater than this threshold. In ourexample the rst two components accounted for only %of the variation.e threshold is oen set higher than this– % to % are the usual sort of values, but it depends onthe context of the data set and can be higher or lower.Othertechniques are based on the values of the eigenvalues oron the dierences between consecutive eigenvalues. Someof these simple ideas, as well as more sophisticated ones(Jollie , Chap. ) have been borrowed from factoranalysis (see 7Factor Analysis and Latent Variable Mod-elling).is is unfortunate because the dierent objectivesof PCA and factor analysis (see below for more on this)mean that typically fewer dimensions should be retained in

Page 50: International Encyclopedia of Statistical Science || Probit Analysis

P Principal Component Analysis

factor analysis than in PCA, so the factor analysis rules areoen inappropriate. It should also be noted that althoughit is usual to discard low-variance principal components,they can sometimes be useful in their own right, for exam-ple in nding 7outliers (Jollie , Chap. ) and inquality control (Jackson ).

Confusion with Factor Analysisere is much confusion between principal componentanalysis and factor analysis, partly because some widelyused soware packages treat PCA as a special case offactor analysis, which it is not. ere are several techni-cal dierences between PCA and factor analysis, but themost fundamental dierence is that factor analysis explic-itly species a model relating the observed variables to asmaller set of underlying unobservable factors. Althoughsome authors express PCA in the framework of amodel, itsmain application is as a descriptive, exploratory technique,with no thought of an underlying model.is descriptivenature means that distributional assumptions are unnec-essary to apply PCA in its usual form. It can be used,although caution may be needed in interpretation, on dis-crete and even binary data, as well as continuous variables.One notable feature of factor analysis is that it is generallya two-stage procedure; having found an initial solution, itis rotated towards simple structure.is idea can be bor-rowed and used in PCA; having decided to keep q principalcomponents, we may rotate within the q-dimensional sub-space dened by the components in a way that makes theaxes as easy as possible to interpret.is is one of numberof techniques that attempt to simplify the results of PCAby post-processing them in someway, or by replacing PCAwith a modied technique (Jollie , Chap. ).

Uses of Principal Component Analysisere are many variations on the basic use of PCAas a dimension reducing technique whose results areused in a descriptive/exploratory manner – see Jackson(), Jollie (). PCA is oen used a rst step,reducing dimensionality before undertaking another tech-nique, such as multiple regression, cluster analysis (see7Cluster Analysis: An Introduction), discriminant anal-ysis (see 7Discriminant Analysis: An Overview, and7DiscriminantAnalysis: Issues andProblems) or indepen-dent component analysis.

Extensions to Principal ComponentAnalysisPCA has been extended in many ways. For example, onerestriction of the technique is that it is linear. A num-ber of non-linear versions have therefore been suggested.

ese include the Gi approach to multivariate analysis.Another area in which many variations have been pro-posed is when the data are time series, so that there isdependence between observations as well as between vari-ables (Jollie , Chap. ).ere are many other exten-sions and modications, and the list continues to grow.

Acknowledgmentsis article is a revised and shortened version of an entrythat appeared ine Encyclopedia of Statistics in Behav-ioral Science, published by Wiley.

About the AuthorProfessor Ian Jollie is Honorary Visiting Professor at theUniversity of Exeter. Before his early retirement he wasProfessor of Statistics at the University of Aberdeen. Hereceived the InternationalMeetings in Statistical Climatol-ogy Achievement Award in and was elected a Fellowof the American Statistical Association in . He hasauthored, co-authored or co-edited four books, includ-ing Principal Component Analysis (Springer, nd edition,) and Forecast Verication: A Practitioner’s Guide inAtmospheric Science (jointly edited with D B Stephenson,Wiley, , nd edition due ). He has also publishedover papers in peer-reviewed journal and is currentlyAssociate Editor forWeather and Forecasting.

Cross References7Analysis of Multivariate Agricultural Data7Data Analysis7Eigenvalue, Eigenvector and Eigenspace7Factor Analysis and Latent Variable Modelling7Fuzzy Logic in Statistical Data Analysis7Multivariate Data Analysis: An Overview7Multivariate Reduced-Rank Regression7Multivariate Statistical Analysis7Multivariate Technique: Robustness7Partial Least Squares Regression Versus Other Methods

References and Further ReadingHotelling H () Analysis of a complex of statistical variables into

principal components. J Educ Psychol :–, –Jackson JE () A user’s guide to principal components. Wiley,

New YorkJolliffe IT () Principal component analysis, nd edn. Springer,

New YorkPearson K () On lines and planes of closest fit to systems of

points in space. Philos Mag :–Yule W, Berger M, Butler S, Newham V, Tizard J () The WPPSI:

an empirical evaluation with a British sample. Brit J EducPsychol :–

Page 51: International Encyclopedia of Statistical Science || Probit Analysis

Principles Underlying Econometric Estimators for Identifying Causal Effects P

P

Principles UnderlyingEconometric Estimators forIdentifying Causal Effects

James J. HeckmanWinner of the Nobel Memorial Prize in EconomicSciences in , Henry Schultz Distinguished ServiceProfessor of Economicse University of Chicago, Chicago, IL, USAUniversity College Dublin, Dublin, Ireland

is paper reviews the basic principles underlying theidentication of conventional econometric evaluationestimators for causal eects and their recent extensions.Heckman () discusses the econometric approachto causality and compares it to conventional statisticalapproaches.is paper considers alternative methods foridentifying causal models.

e paper is in four parts. e rst part presents aprototypical economic choice model that underlies econo-metric models of causal inference. It is a framework that isuseful for analyzing andmotivating the economic assump-tions underlying alternative estimators. e second partdiscusses general identication assumptions for leadingeconometric estimators at an intuitive level.e third partelaborates the discussion of matching in the second part.Matching is widely used in applied work andmakes stronginformational assumptions about what analysts know rela-tive to what the people they analyze know.e fourth partconcludes.

A Prototypical Policy Evaluation ProblemConsider the following prototypical policy problem. Sup-pose a policy is proposed for adoption in a country. It hasbeen tried in other countries and we know outcomes there.We also know outcomes in countries where it was notadopted. From the historical record, what can we concludeabout the likely eectiveness of the policy in countries thathave not implemented it?To answer questions of this sort, economists build

models of counterfactuals. Consider the following model.Let Y be the outcome of a country (e.g., GDP) under ano-policy regime. Y is the outcome if the policy is imple-mented. Y −Y is the “treatment eect” or causal eect ofthe policy. It may vary among countries. We observe char-acteristics X of various countries (e.g., level of democracy,level of population literacy, etc.). It is convenient to decom-pose Y into its mean given X, µ(X) and deviation from

mean U. One can make a similar decomposition for Y:

Y = µ(X) +U

Y = µ(X) +U.()

Additive separability is not needed, but it is convenientto assume it, and I initially adopt it to simplify the expo-sition and establish a parallel regression notation thatserves to link the statistical literature on treatment eectswith the economic literature. (Formally, it involves noloss of generality if we dene U = Y − E(Y ∣ X) andU = U − E(Y ∣ X).)It may happen that controlling for the X, Y −Y is the

same for all countries. is is the case of homogeneoustreatment eects given X. More likely, countries vary intheir responses to the policy even aer controlling for X.Figure plots the distribution of Y − Y for a bench-

mark X. It also displays the various conventional treat-ment parameters. I use a special form of a “generalizedRoy” model with constant cost C of adopting the policy(see Heckman and Vytlacil a, for a discussion of thismodel). is is called the “extended Roy model.” I usethis model because it is simple and intuitive. (e preciseparameterization of the extended Roy model used to gen-erate the gure and the treatment eects is given at the baseof Fig. .)e special case of homogeneity in Y −Y ariseswhen the distribution collapses to its mean. It would beideal if one could estimate the distribution of Y −Y givenX and there is research that does this.More oen, economists focus on some mean of the

distribution in the literature and use a regression frame-work to interpret the data. To turn () into a regressionmodel, it is conventional to use the switching regressionframework. (Statisticians sometimes attribute this repre-sentation to Rubin (, ), but it is due to Quandt(, ). It is implicit in the Roy () model. Seethe discussion of this basic model of counterfactuals inHeckman and Vytlacil (a)). Dene D = if a countryadopts a policy; D = if it does not. Substituting () intothis expression, and keeping all X implicit, one obtains

Y = Y + (Y − Y)D

= µ + (µ − µ +U −U)D +U.()

is is the Roy-Quandt “switching regression” model.Using conventional regression notation,

Y = α + βD + ε ()

Page 52: International Encyclopedia of Statistical Science || Probit Analysis

P Principles Underlying Econometric Estimators for Identifying Causal Effects

5 0 0.2 50

0.05

0.1

0.15

0.2

0.25TT

ATETUT

b = Y1 – Y0

C = 1.5

Return to marginal agent

TT = 2.666, TUT = −0.632Return to Marginal Agent = C = 1.5

ATE = m1 − m0 = b = 0.2¯

The Model

Outcomes Choice Model

Y = µ + U = α + β + U D =

⎧⎪⎪⎪⎪⎨⎪⎪⎪⎪⎩

if D∗ >

if D∗ ≤ Y = µ + U = α + U

General Case

(U ≠ U) ⊥Ò⊥ DATE≠ TT≠ TUT

The Researcher Observes (Y ,D,C)

Y = α + βD + U where β = Y − Y

Parameterization

α = . (U,U) ∼ N (,Σ) D∗ = Y − Y − C

β = . Σ =⎡⎢⎢⎢⎢⎢⎣

−.

−.

⎤⎥⎥⎥⎥⎥⎦

C = .

Principles Underlying Econometric Estimators for Identify-ing Causal Effects. Fig. Distribution of gains, the Roy econ-omy (Heckman et al. )

where α = µ, β = (Y − Y) = µ − µ + U − U andε = U. I will also use the notation that υ = U − U, let-ting β = µ − µ and β = β + υ.roughout this paper Iuse treatment eect and regression notation interchange-ably.e coecient on D is the treatment eect.e case

where β is the same for every country is the case con-ventionally assumed. More elaborate versions assume thatβ depends on X (β(X)) and estimates interactions of DwithX.e case where β varies even aer accounting forXis called the “random coecient” or “heterogenous treat-ment eect” case. e case where υ = U − U dependson D is the case of essential heterogeneity analyzed byHeckman et al. (). is case arises when treatmentchoices depend at least in part on the idiosyncratic returnto treatment. A great deal of attention has been focused onthis case in recent decades and I develop the implicationsof this model in this paper.

An Index Model of Choice and TreatmentEffects: Definitions and Unifying PrinciplesI now present the model of treatment eects developedin Heckman and Vytlacil (, , , a,b) andHeckman et al. (), which relaxes the normality, sep-arability and exogeneity assumptions invoked in the tra-ditional economic selection models. It is rich enough togenerate all of the treatment eects in the program eval-uation literature as well as many other policy parameters.It does not require separability. It is a nonparametric gener-alized Roymodel with testable restrictions that can be usedto unify the treatment eect literature, identify dierenttreatment eects, link the literature on treatment eects tothe literature in structural econometrics and interpret theimplicit economic assumptions underlying 7instrumentalvariables, regression discontinuity designmethods, controlfunctions and matching methods.Y is the measured outcome variable. It is produced

from the switching regression model (). Outcomes aregeneral nonlinear, nonseparable functions of observablesand unobservables:

Y = µ(X,U) ()Y = µ(X,U). ()

Examples of models that can be written in this forminclude conventional latent variable models for discretechoice that are generated by a latent variable crossing athreshold:Yi = (Y∗i ≥ ), whereY

∗i = µi (X)+Ui, i = , .

Notice that in the general case, µi(X,Ui)−E(Yi ∣ X) ≠ Ui,i = , .

e individual treatment eect associated withmovingan otherwise identical person from “” to “” isY − Y = ∆and is dened as the causal eect on Y of a ceteris paribusmove from “” to “”. To link this framework to the lit-erature on economic choice models, I characterize the

Page 53: International Encyclopedia of Statistical Science || Probit Analysis

Principles Underlying Econometric Estimators for Identifying Causal Effects P

P

decision rule for program participation by an indexmodel:

D∗= µD(Z) − V ; D = if D

∗≥ ;

D = otherwise, ()

where, from the point of view of the econometrician,(Z,X) is observed and (U,U,V) is unobserved.e ran-dom variable V may be a function of (U,U). For exam-ple, in the original Roy Model, µ and µ are additivelyseparable in U and U respectively, and V = −[U − U].In the original formulations of the generalized Roy model,outcome equations are separable andV = −[U−U−UC],where UC arises from the cost function. Without loss ofgenerality, I dene Z so that it includes all of the elementsofX as well as any additional variables unique to the choiceequation.I invoke the following assumptions that are weaker

than those used in the conventional literature on structuraleconometrics or the recent literature on semiparametricselection models and at the same time can be used bothto dene and to identify dierent treatment parameters.(A much weaker set of conditions is required to dene theparameters than is required to identify them. SeeHeckmanand Vytlacil (b, Appendix B).)e assumptions are:

(A-) (U,U,V) are independent of Z conditional on X(Independence);

(A-) µD(Z) is a nondegenerate random variable condi-

tional on X (Rank Condition);

(A-) e distribution of V is continuous; (Absolutely con-

tinuous with respect to Lebesgue measure.)

(A-) e values of E∣Y∣ and E∣Y∣ are nite (Finite

Means);

(A-) < Pr(D = ∣ X) < .

(A-) assumes that V is independent of Z given X andis used below to generate counterfactuals. For the deni-tion of treatment eects one does not need either (A-) or(A-). e denitions of treatment eects and their uni-cation do not require any elements of Z that are notelements of X or independence assumptions. However,an analysis of instrumental variables requires that Z con-tain at least one element not in X. Assumptions (A-) or(A-) justify application of instrumental variablesmethodsand nonparametric selection or control function methods.Some parameters in the recent IV literature are dened byan instrument so I make assumptions about instrumentsup front, noting where they are not needed. Assumption(A-) is needed to satisfy standard integration conditions.It guarantees that the mean treatment parameters are well

dened. Assumption (A-) is the assumption in the popu-lation of both a treatment and a control group for each X.Observe that there are no exogeneity requirements for X.is is in contrast with the assumptions commonly madein the conventional structural literature and the semipara-metric selection literature (see, e.g., Powell ).A counterfactual “no feedback” condition facilitates

interpretability so that conditioning on X does not maskthe eects ofD. Letting Xd denote a value of X ifD is set tod, a sucient condition that rules out feedback from D toX is:

(A-) Let X denote the counterfactual value of X that

would be observed if D is set to . X is dened anal-ogously. Assume Xd = X for d = , . (e XD areinvariant to counterfactual manipulations.)

Condition (A-) is not strictly required to formulate anevaluation model, but it enables an analyst who conditionson X to capture the “total” or “full eect” of D on Y (seePearl ).is assumption imposes the requirement thatX is an external variable determined outside themodel andis not aected by counterfactual manipulations ofD. How-ever, the assumption allows for X to be freely correlatedwith U, U and V so it can be endogenous.In this notation, P(Z) is the probability of receiv-

ing treatment given Z, or the “propensity score” P(Z) ≡

Pr(D = ∣ Z) = FV∣X(µD(Z)), where FV∣X(⋅) denotesthe distribution of V conditional on X. (roughout thispaper, I will refer to the cumulative distribution function ofa randomvectorA by FA(⋅) and to the cumulative distribu-tion function of a random vectorA conditional on randomvector B by FA∣B(⋅). I will write the cumulative distributionfunction of A conditional on B = b by FA∣B(⋅ ∣ b).) I denoteP(Z) by P, suppressing the Z argument. I also work withUD, a uniform random variable (UD ∼ Unif[, ]) denedby UD = FV∣X(V). (is representation is valid whether ornot (A-) is true. However, (A-) imposes restrictions oncounterfactual choices. For example, if a change in govern-ment policy changes the distribution of Z by an externalmanipulation, under (A-) the model can be used to gen-erate the choice probability fromP (z) evaluated at the newarguments, i.e., the model is invariant with respect to thedistribution Z.)e separability between V and µD (Z) orD (Z) and UD is conventional. It plays a crucial role injustifying instrumental variable estimators in the generalmodels analyzed in this paper.Vytlacil () establishes that assumptions

(A-)–(A-) for the model of Eqs. ()–() are equivalentto the assumptions used to generate the LATE model ofImbens and Angrist ().us the nonparametric selec-tion model for treatment eects developed by Heckman

Page 54: International Encyclopedia of Statistical Science || Probit Analysis

P Principles Underlying Econometric Estimators for Identifying Causal Effects

and Vytlacil is implied by the assumptions of the Imbensand Angrist instrumental variable model for treatmenteects.e Heckman and Vytlacil approach is more gen-eral and links the IV literature to the literature on economicchoice models. e latent variable model is a version ofthe standard sample selection bias model. is weavestogether two strands of the literature oen thought to bedistinct (see e.g., Angrist and Krueger ). Heckmanet al. () develop this parallelism in detail. (e modelof Eqs. ()–() and assumptions (A-)–(A-) impose twotestable restrictions on the distribution of (Y , D, Z, X).First, it imposes an index suciency restriction: for any setA and for j = , ,

Pr(Yj ∈ A ∣ X,Z,D = j) = Pr(Yj ∈ A ∣ X,P(Z),D = j).

Z (given X) enters the model only through the propensityscore P(Z) (the sets of A are assumed to be measurable).is restriction has empirical content when Z containstwo or more variables not in X. Second, the model alsoimposes monotonicity in p for E(YD ∣ X = x, P = p) andE(Y ( −D) ∣ X = x, P = p). Heckman and Vytlacil (,Appendix A) develop this condition further, and show thatit is testable.Even though this model of treatment eects is not the

most general possible model, it has testable implicationsand hence empirical content. It unites various literaturesand produces a nonparametric version of the selectionmodel, and links the treatment literature to economicchoice theory.)

Definitions of Treatment Effects in the TwoOutcome Modele diculty of observing the same individual in bothtreated and untreated states leads to the use of various pop-ulation level treatment eects widely used in the biostatis-tics literature and oen applied in economics. (Heckmanet al. () discuss panel data cases where it is possibleto observe both Y and Y for the same person.)e mostcommonly invoked treatment eect is the Average Treat-ment Eect (ATE): ∆ATE(x) ≡ E(∆ ∣ X = x) where ∆ =

Y − Y.is is the eect of assigning treatment randomlyto everyone of typeX assuming full compliance, and ignor-ing general equilibrium eects. (See, e.g., Imbens ().)e average impact of treatment on persons who actu-ally take the treatment is Treatment on the Treated (TT):∆TT(x) ≡ E(∆ ∣ X = x,D = ).is parameter can alsobe dened conditional on P(Z): ∆TT(x, p) ≡ E(∆ ∣ X =

x,P(Z) = p,D = ). (ese two denitions of treatment onthe treated are related by integrating out the conditioning

p variable: ∆TT(x) = ∫ ∆

TT(x, p)dFP(Z)∣X,D(p∣x, ) whereFP(Z)∣X,D(⋅∣x, ) is the distribution of P(Z) givenX = x andD = .)

e mean eect of treatment on those for whom X = x

and UD = uD, the Marginal Treatment Eect (MTE), playsa fundamental role in the analysis of the next subsection:

∆MTE(x,uD) ≡ E(∆ ∣ X = x,UD = uD). ()

is parameter is dened independently of any instru-ment. I separate the denition of parameters from theiridentication.e MTE is the expected eect of treatmentconditional on observed characteristics X and conditionalon UD, the unobservables from the rst stage decisionrule. For uD evaluation points close to zero, ∆MTE(x,uD)is the expected eect of treatment on individuals with thevalue of unobservables that make them most likely to par-ticipate in treatment and who would participate even if themean scale utility µD (Z) is small. If UD is large, µD (Z)

would have to be large to induce people to participate.One can also interpret E(∆ ∣ X = x, UD = uD) as the

mean gain in terms of Y − Y for persons with observedcharacteristics X who would be indierent between treat-ment or not if they were randomly assigned a value of Z,say z, such that µD(z) = uD.WhenY andY are value out-comes,MTE is ameanwillingness-to-paymeasure.MTE isa choice-theoretic building block that unites the treatmenteect, selection, matching and control function literatures.A third interpretation is thatMTE conditions onX and

the residual dened by subtracting the expectation of D∗

from D∗: UD = D∗ − E (D∗ ∣ Z,X).is is a “replacementfunction” interpretation in the sense ofHeckman andRobb(a) andMatzkin (), or “control function” interpre-tation in the sense of Blundell and Powell (). (esethree interpretations are equivalent under separability inD∗, i.e., when () characterizes the choice equation, butlead to three dierent denitions of MTE when a moregeneral nonseparable model is developed. See Heckmanand Vytlacil (b).)e additive separability of Eq. interms of observables and unobservables plays a crucial rolein the justication of instrumental variable methods.

e LATE parameter of Imbens and Angrist () isa version of MTE. I dene LATE independently of anyinstrument aer rst presenting the Imbens–Angrist def-inition. Dene D(z) as a counterfactual choice variable,with D(z) = if D would have been chosen if Z had beenset to z, andD(z) = otherwise. LetZ(x) denote the sup-port of the distribution of Z conditional on X = x. For any(z, z′) ∈ Z(x) × Z(x) such that P(z) > P(z′), LATE isE(∆ ∣ X = x,D(z) = ,D(z′) = ) = E(Y − Y ∣ X =

x,D(z) = ,D(z′) = ), the mean gain to persons whowould be induced to switch from D = to D = if Z were

Page 55: International Encyclopedia of Statistical Science || Probit Analysis

Principles Underlying Econometric Estimators for Identifying Causal Effects P

P

manipulated externally from z′ to z. In an example of thereturns to education, z′ could be the base level of tuitionand z a reduced tuition level. Using the latent indexmodel,Heckman andVytlacil (, ) show that LATE can bewritten as

E(Y − Y ∣ X = x,D(z) = ,D(z′) = )= E (Y − Y ∣ X = x,u′D < UD < uD)

= ∆LATE (x,uD,u′D)

for uD = Pr(D(z) = ) = P(z), u′D = Pr(D(z′) = ) =

P(z′), where assumption (A-) implies that Pr(D(z) = ) =Pr(D = ∣ Z = z) and Pr(D(z′) = ) = Pr(D = ∣ Z = z′).Imbens and Angrist dene the LATE parameter as

the probability limit of an estimator. eir analysis con-ates issues of denition of parameters with issues of iden-tication. e representation of LATE given here allowsanalysts to separate these two conceptually distinct mat-ters and to dene the LATE parametermore generally. Onecan in principle evaluate the right hand side of the pre-ceding equation at any uD, u′D points in the unit intervaland not only at points in the support of the distribu-tion of the propensity score P (Z) conditional on X = x

where it is identied. From assumptions (A-), (A-), and(A-), ∆LATE (x,uD,u′D) is continuous in uD and u

′D and

limu′D↑uD∆LATE (x,uD,u′D) = ∆

MTE(x,uD). (is follows from

Lebesgue’s theorem for the derivative of an integral andholds almost everywhere with respect to Lebesgue mea-sure. e ideas of the marginal treatment eect and thelimit form of LATE were rst introduced in the contextof a parametric normal generalized Roy model by Björk-lund andMott (), and were analyzed more generallyin Heckman (). Angrist et al. () also dene anddevelop a limit form of LATE.)Heckman and Vytlacil () use assumptions (A-)–

(A-) and the latent index structure to develop the rela-tionship between MTE and the various treatment eectparameters shown in the rst three lines of Table a.eypresent the formal derivation of the parameters and asso-ciated weights and graphically illustrates the relationshipbetween ATE and TT. All treatment parameters may beexpressed as weighted averages of the MTE:

Treatment Parameter (j)

= ∫ ∆MTE (x,uD) ωj (x,uD) duD

where ωj (x,uD) is the weighting function for the MTEand the integral is dened over the full support of uD.Except for theOLSweights, theweights in the table all inte-grate to one, although in some cases the weights for IVmaybe negative (Heckman et al. ).

In Table a, ∆TT (x) is shown as a weighted average of∆MTE:

∆TT (x) = ∫

∆MTE (x,uD) ωTT (x,uD)duD,

where

ωTT (x,uD) = − FP∣X (uD ∣ x)

∫ ( − FP∣X (t ∣ x))dt

=SP∣X (uD ∣ x)

E (P (Z) ∣ X = x),

()

and SP∣X(uD ∣ x) is Pr(P (Z) > uD ∣ X = x) andωTT (x,uD) is a weighted distribution. e parameter∆TT (x) oversamples ∆MTE (x,uD) for those individualswith low values of uD that make them more likely to par-ticipate in the program being evaluated. Treatment onthe untreated (TUT) is dened symmetrically with TTand oversamples those least likely to participate.e var-ious weights are displayed in Table b. A central theme ofthe analysis of Heckman and Vytlacil is that under theirassumptions all estimators and estimands can be writtenas weighted averages of MTE. is allows them to unifythe treatment eect literature using a common functionalMTE (uD).Observe that if E(Y − Y ∣ X = x,UD = uD) =

E(Y −Y ∣ X = x), so ∆ = Y −Y is mean independent ofUD givenX = x, then ∆MTE = ∆ATE = ∆TT = ∆LATE.ere-fore in cases where there is no heterogeneity in terms ofunobservables in MTE (∆ constant conditional on X = x)or agents do not act on it so that UD drops out of theconditioning set, marginal treatment eects are averagetreatment eects, so that all of the evaluation parametersare the same. Otherwise, they are dierent. Only in thecase where the marginal treatment eect is the averagetreatment eect will the “eect” of treatment be uniquelydened.Figure a plots weights for a parametric normal gener-

alized Roy model generated from the parameters shown atthe base of Fig. b.e model allows for costs to vary inthe population and is more general than the extended Roymodel used to construct Fig. .e weights for IV depictedin Fig. b are discussed in Heckman et al. () and theweights for OLS are discussed in the next section. A highuD is associatedwith higher cost, relative to return, and lesslikelihood of choosing D = .e decline of MTE in termsof higher values of uD means that people with higher uDhave lower gross returns. TT overweights low values of uD(i.e., it oversamples UD that make it likely to have D = ).ATE samples UD uniformly. Treatment on the Untreated

Page 56: International Encyclopedia of Statistical Science || Probit Analysis

P Principles Underlying Econometric Estimators for Identifying Causal Effects

Principles Underlying Econometric Estimators for Identifying Causal Effects. Table

(a) Treatment effects and estimands as weighted averages of the marginal treatment effect

ATE(x) = E (Y − Y ∣ X = x) = ∫

∆MTE(x,uD)duD

TT(x) = E (Y − Y ∣ X = x,D = ) = ∫

∆MTE(x,uD) ωTT(x,uD)duD

TUT(x) = E (Y − Y ∣ X = x,D = ) = ∫

∆MTE (x,uD) ωTUT (x,uD) duD

Policy relevant treatment effect (x) = E (Ya′ ∣ X = x) − E (Ya ∣ X = x) = ∫

∆MTE (x,uD) ωPRTE (x,uD) duD for two policies

a and a′ that affect the Z but not the X

IVJ(x) = ∫

∆MTE(x,uD) ωJ

IV(x,uD)duD, given instrument J

OLS(x) = ∫

∆MTE(x,uD) ωOLS(x,uD)duD

(b) Weights (Heckman and Vytlacil )

ωATE(x,uD) =

ωTT(x,uD) = [∫uDf(p ∣ X = x)dp]

E(P ∣ X = x)

ωTUT (x,uD) = [∫uD

f (p∣X = x)dp]

E (( − P) ∣X = x)

ωPRTE(x,uD) = [FPa′ ,X(uD) − FPa ,X(uD)

∆P]

ωJIV(x,uD) = [∫

uD(J(Z) − E(J(Z) ∣ X = x)) ∫ fJ,P∣X (j, t ∣ X = x) dt dj]

Cov(J(Z),D ∣ X = x)

ωOLS(x,uD) = +E(U ∣ X = x,UD = uD) ω(x,uD) − E(U ∣ X = x,UD = uD) ω(x,uD)

∆MTE(x,uD)

ω(x,uD) = [∫uDf(p ∣ X = x)dp] [

E(P ∣ X = x)]

ω(x,uD) = [∫uD

f(p ∣ X = x)dp]

E(( − P) ∣ X = x)

(E(Y − Y ∣ X = x,D = )), or TUT, oversamples thevalues of UD which make it unlikely to have D = .Table shows the treatment parameters produced from

the dierent weighting schemes for themodel used to gen-erate the weights in Fig. a and b. Given the decline of theMTE in uD, it is not surprising that TT>ATE>TUT.is isthe generalized Roy version of the principle of diminishingreturns.ose most likely to self select into the programbenet the most from it.e dierence between TT andATE is a sorting gain: E(Y − Y ∣ X,D = ) − E(Y −Y ∣ X), the average gain experienced by people whosort into treatment compared to what the average personwould experience. Purposive selection on the basis of gainsshould lead to positive sorting gains of the kind found inthe table. If there is negative sorting on the gains, thenTUT≥ATE≥TT.

The Weights for a Generalized Roy ModelHeckman et al. () show that all of the weights fortreatment eects and IV estimators can be estimated overthe available support. Since the MTE can be estimated bythe method of Local Instrumental variables, we can formeach treatment eect and each IV estimand as an inte-gral to two estimable functions (subject to support). Forthe case of continuous Z, I plot the weights associatedwith the MTE for IV. is analysis draws on Heckmanet al. (), who derive the weights. Figure plots E(Y ∣

P(Z)) and MTE for the extended Roy models generatedby the parameters displayed at the base of the gure. Incases where β⊥⊥D, ∆MTE (uD) is constant in uD. is istrivial when β is a constant. When β is random but selec-tion into D does not depend on β, MTE is still at. emore interesting case termed “essential heterogeneity” by

Page 57: International Encyclopedia of Statistical Science || Probit Analysis

Principles Underlying Econometric Estimators for Identifying Causal Effects P

P

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

3.5

3

a

2.5

2

1.5

1

0.5

0

w (uD)

uD

MTE0.35

MTE

ATE

TT

0

TUT

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1–3

–2

–1

0

1

2

3

b

4

5wTT(uD)

uD

MTE

0.5

MTE

IV

OLS

–0.3

wOLS (uD)

Y1 U1 U1

U0 U0

UD

YD Z

Z NV V N~

~VV

V V

0

1 1

0

–1 if 0

0.67 0.012–0.050–1.000(–0.0026, 0.2700)

0.2(0,1)

+ ++0

= b= a=

= = ===

===

=

s ses e ss e s

σ σε

e

a ab

f

Principles Underlying Econometric Estimators for Identify-ing Causal Effects. Fig. (a) Weights for the marginal treat-ment effect for different parameters (Heckman and Vytlacil) (b) Marginal treatment effect vs linear instrumentalvariables and ordinary least squares weights (Heckman andVytlacil )

Heckman and Vytlacil has β⊥⊥/ D. e le hand side(Fig. a) depicts E(Y ∣ P(Z)) in the two cases.e rstcase makes E(Y ∣ P(Z)) linear in P(Z).e second case isnonlinear in P(Z).is arises when β⊥⊥/ D.e derivativeof E(Y ∣ P(Z)) is presented in the right panel (Fig. b). Itis a constant for the rst case (at MTE) but declining inUD = P(Z) for the casewith selection on the gain. A simpletest for linearity in P(Z) in the outcome equation revealswhether or not the analyst is in cases I and II (β⊥⊥D) orcase III (β⊥⊥/ D). (Recall that we keep the conditioning on

Principles Underlying Econometric Estimators forIdentifying Causal Effects. Table Treatment parameters andestimands in the generalized Roy example

Treatment on the treated .

Treatment on the untreated .

Average treatment effect .

Sorting gaina .

Policy relevant treatmenteffect (PRTE)

.

Selection biasb −.

Linear instrumental variablesc .

Ordinary least squares .

aTT − ATE = E(Y − Y ∣ D = ) − E(Y − Y)bOLS − TT = E(Y ∣ D = ) − E(Y ∣ D = )cUsing Propensity Score P (Z) as the instrument.Note: The model used to create Table is the same as those used tocreate Fig. a and b. The PRTE is computed using a policy t characterizedas follows:If Z > then D = if Z( + t) − V > .If Z ≤ t then D = if Z − V > .For this example t is set equal to ..

X implicit.)ese cases are the extended Roy counterpartsto E (Y ∣ P (Z) = p) and MTE shown for the generalizedRoy model in Figs. a and b.MTE gives the mean marginal return for persons who

have utility P(Z) = uD.us, P(Z) = uD is the margin ofindierence.ose with low uD values have high returns.ose with high uD values have low returns. Figure high-lights that in the general case MTE (and LATE) identifyaverage returns for persons at the margin of indierence atdierent levels of the mean utility function (P(Z)).Figure plots MTE and LATE for dierent intervals of

uD using the model generating Fig. . LATE is the chord ofE(Y ∣ P(Z)) evaluated at dierent points.e relationshipbetween LATE and MTE is depicted in the right panel (b)of Fig. . LATE is the integral under theMTE curve dividedby the dierence between the upper and lower limits.Treatment parameters associated with the second case

are plotted in Fig. . e MTE is the same as that pre-sented in Fig. . ATE has the same value for all p. eeect of treatment on the treated for P (Z) = p, ∆TT (p) =E (Y − Y ∣ D = ,P (Z) = p) declines in p (equivalentlyit declines in uD). Treatment on the untreated given p,TUT(p) = ∆TUT (p) = E(Y − Y ∣ D = ,P(Z) = p)

also declines in p.

Page 58: International Encyclopedia of Statistical Science || Probit Analysis

P Principles Underlying Econometric Estimators for Identifying Causal Effects

E (Y ⎪P (Z ) = p)

0 0.2 0.4 0.6 0.8 10

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

p

Cases IA and IBCase II

a

ΔMTE(uD)

0 0.2 0.4 0.6 0.8 1p=uD

−5

−4

−3

−2

−1

0

1

2

3

4

5Cases IA and IBCase II

b

Outcomes Choice model

Y = α +−β+U D =

⎧⎪⎪⎪⎪⎨⎪⎪⎪⎪⎩

if D∗ >

if D∗ ≤ Y = α + U

Case IA Case IB Case II

U = U U − U ⊥⊥ D U − U ⊥Ò⊥ D−β =ATE=TT=TUT=IV

−β =ATE=TT=TUT=IV

−β =ATE≠TT≠TUT≠IV

Parameterization

Cases IA, IB, and II Cases IB and II Case II

α = . (U,U) ∼ N (,Σ) D∗ = Y − Y − γZ

−β = . with Σ =

⎡⎢⎢⎢⎢⎢⎣

−.

−.

⎤⎥⎥⎥⎥⎥⎦

Z ∼ N (µZ ,ΣZ)

µZ = (,−) and ΣZ=⎡⎢⎢⎢⎢⎢⎣

⎤⎥⎥⎥⎥⎥⎦

γ = (., .)

Principles Underlying Econometric Estimators for Identifying Causal Effects. Fig. Conditional expectation of Y on P(Z) andthe marginal treatment effect (MTE) the extended Roy economy (Heckman et al. )

LATE(p, p′) =∆TT(p′)p′ − ∆TT(p)p

p′ − p, p′ ≠ p

MTE =∂[∆TT(p)p]

∂p.

One can generate all of the treatment parameters from∆TT (p).Matching on P = p (which is equivalent to nonpara-

metric regression given P = p) produces a biased estimatorof TT(p). Matching assumes a at MTE (average return

Page 59: International Encyclopedia of Statistical Science || Probit Analysis

Principles Underlying Econometric Estimators for Identifying Causal Effects P

P

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

5

10

15

20

25

p

E[Y

|P(Z

)=p]

E[Y|P(Z)=p]

E[Y|P(Z)=p]

b D||__||____b D

a

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

b

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

uD

MT

E

MTEMTE

b D||__||____b D

Principles Underlying Econometric Estimators for Identify-ing Causal Effects. Fig. (a) Plot of the E(Y∣P(Z) = p), (b)Plot of the identified marginal treatment effect from Fig. a (thederivative). Note: Parameters for the general heterogeneouscase are the same as those used in Fig. a and b. For the homo-geneous case we impose U = U (σ = σ = .). (Heckmanand Vytlacil )

equals marginal return) as we develop below.erefore itis systematically biased for ∆TT (p) in a model with essen-tial heterogeneity, where β⊥⊥/ D. Making observables alikemakes the unobservables dissimilar. Holding p constantacross treatment and control groups understates TT(p) forlow values of p and overstates it for high values of p. Idevelop this point further in section “7Matching”, whereI discuss the method of matching. First I present a uni-ed approach that integrates all evaluation estimators in acommon framework.

The Basic Principles Underlying theIdentification of the LeadingEconometric Evaluation Estimatorsis section reviews the main principles underlying theevaluation estimators commonly used in the economet-ric literature. I assume two potential outcomes (Y,Y).D = if Y is observed, and D = corresponds to Y beingobserved.e observed outcome is

Y = DY + ( −D)Y. ()

e evaluation problem arises because for each personwe observe eitherY orY but not both.us in general it isnot possible to identify the individual level treatment eectY−Y for any person.e typical solution to this problemis to reformulate the problem at the population level ratherthan at the individual level and to identify certain meanoutcomes or quantile outcomes or various distributions ofoutcomes as described in Heckman and Vytlacil (a).For example, a commonly used approach focuses attentionon average treatment eects, such as ATE = E(Y − Y).If treatment is assigned or chosen on the basis of poten-

tial outcomes, so(Y,Y)⊥⊥/ D,

where ⊥⊥/ denotes “is not independent” and “⊥⊥” denotesindependent, we encounter the problem of selection bias.Suppose that we observe people in each treatment stateD = and D = . If Yj⊥⊥/ D, then the observed Yj will beselectively dierent from randomly assigned Yj, j = , .us E(Y ∣ D = ) ≠ E(Y) and E(Y ∣ D = ) ≠ E(Y).Using unadjusted data to constructE(Y−Y)will produceone source of evaluation bias:

E(Y ∣ D = ) − E(Y ∣ D = ) ≠ E(Y − Y).

e selection problem underlies the evaluation prob-lem. Many methods have been proposed to solve bothproblems.

e method with the greatest intuitive appeal, which issometimes called the “gold standard” in evaluation anal-ysis, is the method of random assignment. Nonexperi-mental methods can be organized by how they attemptto approximate what can be obtained by an ideal randomassignment. If treatment is chosen at random with respectto (Y,Y), or if treatments are randomly assigned andthere is full compliance with the treatment assignment,

(R-) (Y,Y)⊥⊥D.

It is useful to distinguish several cases where (R-) will besatised. e rst is that agents (decision makers whosechoices are being analyzed) pick outcomes that are ran-dom with respect to (Y,Y).us agents may not know

Page 60: International Encyclopedia of Statistical Science || Probit Analysis

P Principles Underlying Econometric Estimators for Identifying Causal Effects

E(Y |P(Z )= p) and ΔLATE(pℓ, pℓ+1)2

LATE(0.1, 0.35)= 1.719LATE(0.6, 0.9) = –1.17

00 0.2 0.4

p0.6 0.8 1

1.8

1.6

1.4

1.2

1

0.8

0.6

0.4

0.2

a

ΔMTE(uD) and ΔLATE

0 0.2 0.4uD = p

0.6 0.8 1−5

−4

−3

−2

−1

0

1

2

3

4

5

A

B

C D

E

G

F

HArea(A, B, C, D)/0.25 = LATE(0.1, 0.35) = 1.719

− Area(E, F, G, H)/0.3 = LATE(0.6, 0.9) = −1.17

b

(pℓ, pℓ+1)

∆LATE(pℓ ,pℓ+) =

E (Y∣P(Z) = pℓ+) − E (Y∣P(Z) = pℓ)pℓ+ − pℓ

=

pℓ+

∫pℓ

∆MTE(uD)duD

pℓ+ − pℓ

∆LATE(., .) = −.

∆LATE(., .) = .

Outcomes Choice model

Y = α +−β+U D =

⎧⎪⎪⎪⎪⎨⎪⎪⎪⎪⎩

if D∗ >

if D∗ ≤

Y = α + U with D∗ = Y − Y − γZ

Parameterization

(U,U) ∼ N (,Σ) and Z ∼ N (µZ ,ΣZ)

Σ =⎡⎢⎢⎢⎢⎢⎣

−.

−.

⎤⎥⎥⎥⎥⎥⎦

, µZ = (,−) and ΣZ=⎡⎢⎢⎢⎢⎢⎣

⎤⎥⎥⎥⎥⎥⎦

α = .,−β = ., γ = (., .)

Principles Underlying Econometric Estimators for Identifying Causal Effects. Fig. The local average treatment effect theextended Roy economy (Heckman et al. )

(Y,Y) at the time they make their choices to partici-pate in treatment or at least do not act on (Y,Y), so thatPr (D = ∣ X,Y,Y) = Pr (D = ∣ X) for all X. Match-ing assumes a version of (R-) conditional on matchingvariables X: (Y,Y)⊥⊥D ∣ X.A second case arises when individuals are randomly

assigned to treatment status even if they would chooseto self select into no-treatment status, and they complywith the randomization protocols. Let ξ be randomized

assignment status. With full compliance, ξ = implies thatY is observed and ξ = implies that Y is observed.en,under randomized assignment,

(R-) (Y,Y)⊥⊥ξ,

even if in a regime of self-selection, (Y,Y) ⊥⊥/ D. If7randomization is performed conditional on X, we obtain(Y,Y)⊥⊥ξ ∣ X.

Page 61: International Encyclopedia of Statistical Science || Probit Analysis

Principles Underlying Econometric Estimators for Identifying Causal Effects P

P

0 0.2 0.4 0.6 0.8 1

–5

–4

–3

–2

–1

0

1

2

3

4

5

p

TT(p)

MTE(p)

ATE(p)

TUT(p)

Matching(p)

Parameter Definition Under assumptionsa

Marginal treatment effect E [Y − Y∣D∗ = , P(Z) = p]

−β+σU−UΦ

−( − p)

Average treatment effect E [Y − Y∣P(Z) = p] β

Treatment on the treated E [Y − Y∣D∗ > , P(Z) = p]

−β+σU−U

ϕ(Φ−(−p))p

Treatment on the untreated E [Y − Y∣D∗ ≤ , P(Z) = p]

−β−σU−U

ϕ(Φ−(−p))−p

OLS/matching on P(Z) E [Y∣D∗ > , P(Z) = p]

−β+(

σU−σU ,U√σU−U

)( −pp(−p)) ϕ (Φ−( − p))

-E [Y∣D∗ ≤ , P(Z) = p]

Principles Underlying Econometric Estimators for Identifying Causal Effects. Fig. Treatment parameters and OLS/matchingas a function of P(Z) = p. Note: Φ (⋅) and ϕ (⋅) represent the cdf and pdf of a standard normal distribution, respectively. Φ− (⋅)

represents the inverse ofΦ (⋅). a The model in this case is the same as the one presented below Fig. . (Heckman et al. )

LetA denote actual treatment status. If the randomiza-tionhas full complianceamongparticipants, ξ = ⇒ A = ;ξ = ⇒ A = .is is entirely consistent with a regimein which a person would choose D = in the absence ofrandomization, but would have no treatment (A = ) ifsuitably randomized, even though the agent might desiretreatment.If treatment status is chosen by self-selection, D = ⇒

A = and D = ⇒ A = . If there is imperfect compli-ance with randomization, ξ = ⇏ A = because of agentchoices. In general, A = ξD so that A = only if ξ = and D = . If treatment status is randomly assigned, eitherthrough randomization or randomized self-selection,

(R-) (Y,Y)⊥⊥A.

is version of randomization can also be dened con-ditional on X. Under (R-), (R-), or (R-), the average

treatment eect (ATE) is the same as the marginal treat-ment eect of Björklund and Mott () and Heckmanand Vytlacil (, , a), and the parameters treat-ment on the treated (TT) (E(Y − Y ∣ D = )) andtreatment on the untreated (TUT) (E(Y − Y ∣ D = )).(e marginal treatment eect is formally dened in thenext section.) ese parameters can be identied frompopulation means:

TT =MTE = TUT = ATE = E(Y − Y) = E(Y) − E(Y).

Forming averages over populations of persons who aretreated (A = ) or untreated (A = ) suces to identifythis parameter. If there are conditioning variablesX, we candene themean treatment parameters for allX where (R-)or (R-) or (R-) hold.Observe that even with random assignment of treat-

ment status and full compliance, one cannot, in gen-eral, identify the distribution of the treatment eects

Page 62: International Encyclopedia of Statistical Science || Probit Analysis

P Principles Underlying Econometric Estimators for Identifying Causal Effects

(Y −Y), although one can identify the marginal distri-butions F(Y ∣ A = ,X = x) = F(Y ∣ X = x)

and F(Y ∣ A = ,X = x) = F(Y ∣ X = x). Onespecial assumption, common in the conventional econo-metrics literature, is that Y − Y = ∆ (x), a constant givenx. Since ∆ (x) can be identied from E(Y ∣ A = ,X =

x) − E(Y ∣ A = ,X = x) because A is randomly allo-cated, in this special case the analyst can identify the jointdistribution of (Y,Y). (Heckman (); Heckman et al.().)is approach assumes that (Y,Y) have the samedistribution up to a parameter ∆ (Y and Y are perfectlydependent). One can make other assumptions about thedependence across ranks from perfect positive or nega-tive ranking to independence. (Heckman et al. ().)ejoint distribution of (Y,Y) or of (Y − Y) is not identi-ed unless the analyst can pin down the dependence across(Y,Y).us, even with data from a randomized trial onecannot, without further assumptions, identify the propor-tion of people who benet from treatment in the senseof gross gain (Pr(Y ≥ Y)). is problem plagues allevaluationmethods. Abbring andHeckman () discussmethods for identifying joint distributions of outcomes.(See also Aakvik et al. (); Carneiro et al. (, );and Cunha et al. ().)Assumption (R-) is very strong. In many cases, it is

thought that there is selection bias with respect to Y, Y,so persons who select into status or are selectivelydierent from randomly sampled persons in the popula-tion. e assumption most commonly made to circum-vent problems with (R-) is that even though D is notrandom with respect to potential outcomes, the analysthas access to control variables X that eectively producea randomization of D with respect to (Y,Y) given X.is is the method of matching, which is based on thefollowing conditional independence assumption:

(M-) (Y,Y)⊥⊥D ∣ X.

Conditioning on X randomizes D with respect to(Y,Y). (M-) assumes that any selective sampling of(Y,Y) can be adjusted by conditioning on observedvariables. (R-) and (M-) are dierent assumptions andneither implies the other. In a linear equations model,assumption (M-) that D is independent from (Y,Y)given X justies application of 7least squares on D toeliminate selection bias in mean outcome parameters.For means, matching is just nonparametric regression.(Barnow et al. () present one application of match-ing in a regression setting.) In order to be able to compareX-comparable people in the treatment regime one mustassume

(M-) < Pr(D = ∣ X = x) < .

Assumptions (M-) and (M-) justify matching. Assump-tion (M-) is required for any evaluation estimator thatcompares treated and untreated persons. It is produced byrandom assignment if the randomization is conducted forall X = x and there is full compliance.Observe that from (M-) and (M-), it is possible to

identify F(Y ∣ X = x) from the observed data F(Y ∣ D =

,X = x) since we observe the le hand side of

F(Y ∣ D = ,X = x) = F(Y ∣ X = x)

= F(Y ∣ D = ,X = x).

e rst equality is a consequence of conditional indepen-dence assumption (M-).e second equality comes from(M-) and (M-). By a similar argument, we observe thele hand side of

F(Y ∣ D = ,X = x) = F(Y ∣ X = x)

= F(Y ∣ D = ,X = x),

and the equalities are a consequence of (M-) and (M-).Since the pair of outcomes (Y,Y) is not identied foranyone, as in the case of data from randomized trials, thejoint distributions of (Y,Y) given X or of Y − Y givenX are not identied without further information.is is aproblem that plagues all selection estimators.From the data on Y given X andD = and the data on

Y given X and D = , since E(Y ∣ D = ,X = x) = E(Y ∣

X = x) = E(Y ∣ D = ,X = x) and E(Y ∣ D = ,X = x) =

E(Y ∣ X = x) = E(Y ∣ D = ,X = x) we obtain

E(Y − Y ∣ X = x) = E(Y − Y ∣ D = ,X = x)

= E(Y − Y ∣ D = ,X = x).

Eectively, we have a randomization for the subset of thesupport of X satisfying (M-).At values ofX that fail to satisfy (M-), there is no vari-

ation in D given X. One can dene the residual variationin D not accounted for by X as

E(x) = D − E(D ∣ X = x) = D − Pr(D = ∣ X = x).

If the variance of E(x) is zero, it is not possible to constructcontrasts in outcomes by treatment status for those X val-ues and (M-) is violated. To see the consequences of thisviolation in a regression setting, use Y = Y +D(Y − Y)

and take conditional expectations, under (M-), to obtain

E(Y ∣ X,D) = E(Y ∣ X) +D[E(Y − Y ∣ X)].

is follows because E(Y ∣ X,D) = E(Y ∣ X,D) +DE(Y−Y ∣ X,D) but from (M-),E(Y ∣ X,D)=E(Y∣X)and E(Y − Y∣X,D) = E(Y − Y∣X). If Var(E(x)) >

Page 63: International Encyclopedia of Statistical Science || Probit Analysis

Principles Underlying Econometric Estimators for Identifying Causal Effects P

P

for all x in the support of X, one can use nonparametricleast squares to identify E(Y − Y ∣ X = x) =ATE(x)by regressing Y on D and X.e function identied fromthe coecient onD is the average treatment eect. (Underthe conditional independence assumption (M-), it is alsothe eect of treatment on the treated E (Y − Y ∣ X,D = )and the marginal treatment eect formally dened in thenext section.) If Var(E(x)) = , ATE(x) is not identiedat that x value because there is no variation inD that is notfully explained by X. A special case of matching is linearleast squares where one can write

Y = Xα +U Y = Xα + β +U,

U = U = U and hence under (M-),

E(Y ∣ X,D) = Xα + βD.

If D is perfectly predictable by X, one cannot identifyβ because of a multicollinearity problem (see 7Multi-collinearity). (M-) rules out perfect collinearity. (Clearly(M-) and (M-) are sucient but not necessary condi-tions. For the special case of OLS, as a consequence of theassumed linearity in the functional form of the estimatingequation, we achieve identication of β if Cov (X,U) = ,Cov (D,U) = and (D,X) are not perfectly collinear.ese conditions are much weaker than (M-) and (M-)and can be satised if (M-) and (M-) are only identiedin a subset of the support ofX.) Matching is a nonparamet-ric version of least squares that does not impose functionalform assumptions on outcome equations, and that imposessupport condition (M-).Conventional econometric choice models make a dis-

tinction between variables that appear in outcome equa-tions (X) and variables that appear in choice equations (Z).e same variables may be in (X) and (Z) but more typi-cally, there are some variables not in common. For exam-ple, the instrumental variable estimator is based on vari-ables that are not in X but that are in Z. Matching makesno distinction between the X and the Z. (Heckman et al.() distinguish X and Z in matching.ey consider acase where conditioning on X may lead to failure of (M-)and (M-) but conditioning on (X,Z) satises a suitablymodied version of this condition.) It does not rely onexclusion restrictions.e conditioning variables used toachieve conditional independence can in principle be aset of variables Q distinct from the X variables (covariatesfor outcomes) or the Z variables (covariates for choices). Iuse X solely to simplify the notation.e key identifyingassumption is the assumed existence of a random variableX with the properties satisfying (M-) and (M-).

Conditioning on a larger vector (X augmented withadditional variables) or a smaller vector (Xwith some com-ponents removed) may or may not produce suitably modi-ed versions of (M-) and (M-).Without invoking furtherassumptions there is no objective principle for determiningwhat conditioning variables produce (M-).Assumption (M-) is strong. Many economists do not

have enough faith in their data to invoke it. Assumption(M-) is testable and requires no act of faith. To justify(M-), it is necessary to appeal to the quality of the data.Using economic theory can help guide the choice of

an evaluation estimator. A crucial distinction is the onebetween the information available to the analyst and theinformation available to the agent whose outcomes arebeing studied. Assumptionsmade about these informationsets drive the properties of econometric estimators. Ana-lysts using matching make strong informational assump-tions in terms of the data available to them. In fact,all econometric estimators make assumptions about thepresence or absence of informational asymmetries, and Iexposit them in this paper.To analyze the informational assumptions invoked in

matching, and other econometric evaluation strategies, itis helpful to introduce ve distinct information sets andestablish some relationships among them. (See also the dis-cussion in Barros (), Heckman and Navarro (),and Gern and Lechner ().) () An information setσ(IR) with an associated random variable that satisesconditional independence (M-) is dened as a relevantinformation set; () e minimal information set σ(IR)

with associated random variable needed to satisfy condi-tional independence (M-), theminimal relevant informa-tion set; () e information set σ(IA) available to theagent at the time decisions to participate are made; ()e information available to the economist, σ(IE∗); and ()e information σ(IE) used by the economist in conduct-ing an empirical analysis. I will denote the random vari-ables generated by these sets as IR∗ , IR, IA, IE∗ , IE, respec-tively. (I start with a primitive probability space (Ω, σ ,P)with associated random variables I. I assume minimal σ-algebras and assume that the random variables I are mea-surable with respect to these σ-algebras. Obviously, strictlymonotonic or ane transformations of the I preserve theinformation and can substitute for the I.)

Definition Define σ(IR∗) as a relevant information set if the information set is generated by the random variable IR∗, possibly vector valued, and satisfies condition (M-1), so

(Y0, Y1) ⊥⊥ D | IR∗.


Definition Define σ(IR) as a minimal relevant information set if it is the intersection of all sets σ(IR∗) and satisfies (Y0, Y1) ⊥⊥ D | IR. The associated random variable IR is a minimum amount of information that guarantees that condition (M-1) is satisfied. There may be no such set.

(Observe that the intersection of all sets σ(IR∗) may be empty and hence may not be characterized by a (possibly vector valued) random variable IR that guarantees (Y0, Y1) ⊥⊥ D | IR. If the information sets that produce conditional independence are nested, then the intersection of all sets σ(IR∗) producing conditional independence is well defined and has an associated random variable IR with the required property, although it may not be unique (e.g., strictly monotonic transformations and affine transformations of IR also preserve the property). In the more general case of non-nested information sets with the required property, it is possible that no uniquely defined minimal relevant set exists. Among collections of nested sets that possess the required property, there is a minimal set defined by intersection, but there may be multiple minimal sets corresponding to each collection.)

If one defines the relevant information set as one that produces conditional independence, it may not be unique. If the set σ(IR∗) satisfies the conditional independence condition, then the set σ(IR∗, Q), such that Q ⊥⊥ (Y0, Y1) | IR∗, would also guarantee conditional independence. For this reason, I define the relevant information set to be minimal, that is, to be the intersection of all relevant sets that still produce conditional independence between (Y0, Y1) and D. However, no minimal set may exist.

Definition The agent's information set, σ(IA), is defined by the information IA used by the agent when choosing among treatments. Accordingly, call IA the agent's information.

By the agent I mean the person making the treatment decision, not necessarily the person whose outcomes are being studied (e.g., the agent may be the parent; the person being studied may be a child).

Definition The econometrician's full information set, σ(IE∗), is defined as all of the information available to the econometrician, IE∗.

Definition The econometrician's information set, σ(IE), is defined by the information used by the econometrician when analyzing the agent's choice of treatment, IE, in conducting an analysis.

For the case where a unique minimal relevant information set exists, only three restrictions are implied by the structure of these sets: σ(IR) ⊆ σ(IR∗), σ(IR) ⊆ σ(IA), and σ(IE) ⊆ σ(IE∗). (This formulation assumes that the agent makes the treatment decision. The extension to the case where the decision maker and the agent are distinct is straightforward. The requirement σ(IR) ⊆ σ(IR∗) is satisfied by nested sets.) I have already discussed the first restriction. The second restriction requires that the minimal relevant information set must be part of the information the agent uses when deciding which treatment to take or assign. It is the information in σ(IA) that gives rise to the selection problem, which in turn gives rise to the evaluation problem.

The third restriction requires that the information used by the econometrician must be part of the information that the econometrician observes. Aside from these orderings, the econometrician's information set may be different from the agent's or the relevant information set. The econometrician may know something the agent does not know, for typically he is observing events after the decision is made. At the same time, there may be private information known to the agent but not the econometrician. Matching assumption (M-1) implies that σ(IR) ⊆ σ(IE), so that the econometrician uses at least the minimal relevant information set, but of course he or she may use more. However, using more information is not guaranteed to produce a model with conditional independence property (M-1) satisfied for the augmented model. Thus an analyst can "overdo" it. Heckman and Navarro () and Abbring and Heckman () present examples of the consequences of the asymmetry in agent and analyst information sets.

The possibility of asymmetry in information between the agent making participation decisions and the observing economist creates the potential for a major identification problem that is ruled out by assumption (M-1). The methods of control functions and instrumental variables estimators (and closely related regression discontinuity design methods) address this problem but in different ways. Accounting for this possibility is a more conservative approach to the selection problem than the one taken by advocates of 7least squares, or its nonparametric counterpart, matching. Those advocates assume that they know the X that produces a relevant information set. Heckman and Navarro () show the biases that can result in matching when standard econometric model selection criteria are applied to pick the X that are used to satisfy (M-1). Conditional independence condition (M-1) cannot be tested without maintaining other assumptions. (These assumptions may or may not be testable. The required "exogeneity" conditions are discussed in Heckman and Navarro (). Thus randomization of assignment of treatment status might be used to test (M-1), but this requires that there be full compliance and that the


randomization be valid (no anticipation effects or general equilibrium effects).) Choice of the appropriate conditioning variables is a problem that plagues all econometric estimators.

The methods of control functions, replacement functions, proxy variables, and instrumental variables all recognize the possibility of asymmetry in information between the agent being studied and the econometrician, and recognize that even after conditioning on X (variables in the outcome equation) and Z (variables affecting treatment choices, which may include the X), analysts may fail to satisfy conditional independence condition (M-1). (The term and concept of control function is due to Heckman and Robb (a,b, a,b). See Blundell and Powell (), who call the Heckman and Robb replacement functions control functions. A more recent nomenclature is "control variate." Matzkin () provides a comprehensive discussion of identification principles for econometric estimators.) These methods postulate the existence of some unobservables θ, which may be vector valued, with the property that

(U-1) (Y0, Y1) ⊥⊥ D | X, Z, θ,

but allow for the possibility that

(U-2) (Y0, Y1) ⊥⊥/ D | X, Z.

In the event (U-2) holds, these approaches model the relationships of the unobservable θ with Y0, Y1 and D in various ways. The content of the control function principle is to specify the exact nature of the dependence between observables and unobservables in a nontrivial fashion that is consistent with economic theory. Heckman and Navarro () present examples of models that satisfy (U-1) but not (M-1).

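To make the content of (U-1) and (U-2) concrete, here is a minimal simulation sketch (the data-generating process, variable names and parameter values are my own illustration, not the author's): a latent θ drives both the treatment choice and the potential outcomes, so independence fails given (X, Z) alone but is restored once θ is also held fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Hypothetical data-generating process: theta is an unobservable that
# shifts both potential outcomes and the treatment choice.
theta = rng.normal(size=n)
x = rng.normal(size=n)
z = rng.normal(size=n)

y0 = x + theta + rng.normal(size=n)          # untreated potential outcome
y1 = 1.0 + x + theta + rng.normal(size=n)    # treated potential outcome
d = (0.5 * z + theta + rng.normal(size=n) > 0).astype(int)  # choice loads on theta

# (U-2): within a narrow (X, Z) cell, mean potential outcomes still differ by D.
cell = (np.abs(x) < 0.1) & (np.abs(z) < 0.1)
print("E[Y0 | D=1, cell] - E[Y0 | D=0, cell] =",
      y0[cell & (d == 1)].mean() - y0[cell & (d == 0)].mean())  # far from 0

# (U-1): additionally conditioning on theta (a narrow theta cell) removes the gap.
cell_theta = cell & (np.abs(theta) < 0.1)
print("same contrast, also conditioning on theta =",
      y0[cell_theta & (d == 1)].mean() - y0[cell_theta & (d == 0)].mean())  # near 0
```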
The early literature focused on mean outcomes conditional on covariates (Heckman and Robb a, b, a, b) and assumes a weaker version of (U-1) based on conditional mean independence rather than full conditional independence. More recent work analyzes distributions of outcomes (e.g., Aakvik et al. ; Carneiro et al. , ). This work is reviewed in Abbring and Heckman ().

The normal Roy selection model makes distributional assumptions and identifies the joint distribution of outcomes. A large literature surveyed by Matzkin () makes alternative assumptions to satisfy (U-1) in nonparametric settings. Replacement functions (Heckman and Robb a) are methods that proxy θ. They substitute out for θ using observables. (This is the "control variate" of Blundell and Powell (). Heckman and Robb (a) and Olley and Pakes () use a similar idea. Matzkin () discusses replacement functions.) Aakvik et al. (, ), Carneiro et al. (, ), Cunha et al. (), and Cunha et al. (, ) develop methods that integrate out θ from the model assuming θ ⊥⊥ (X, Z), or invoking weaker mean independence assumptions, and assuming access to proxy measurements for θ. They also consider methods for estimating the distributions of treatment effects. These are discussed in Abbring and Heckman ().

The normal selection model produces partial identification of a generalized Roy model and full identification of a Roy model under separability and normality. It models the conditional expectation of U0 and U1 given X, Z, D. In terms of (U-1), it models the conditional mean dependence of Y0, Y1 on D and θ given X and Z. Powell () and Matzkin () survey methods for producing semiparametric versions of these models. Heckman and Vytlacil (a, Appendix B) or the appendix of Heckman and Navarro () present a prototypical identification proof for a general selection model that implements (U-1) by estimating the distribution of θ, assuming θ ⊥⊥ (X, Z), and invoking support conditions on (X, Z).

Central to both the selection approach and the instrumental variable approach for a model with heterogeneous responses is the probability of selection. This is an integral part of the Roy model previously discussed. Let Z denote variables in the choice equation. Fixing Z at different values (denoted z), I define D(z) as an indicator function that is "1" when treatment is selected at the fixed value of z and that is "0" otherwise. In terms of a separable index model UD = µD(Z) − V, for a fixed value of z,

D(z) = 1[µD(z) ≥ V],

where Z ⊥⊥ V | X. Thus, fixing Z = z, values of z do not affect the realizations of V for any value of X. An alternative way of representing the independence between Z and V given X, due to Imbens and Angrist (), writes that D(z) ⊥⊥ Z for all z ∈ Z, where Z is the support of Z. The Imbens and Angrist independence condition for IV is

{D(z), z ∈ Z} ⊥⊥ Z | X.

Thus the probabilities that D(z) = 1, z ∈ Z, are not affected by the occurrence of Z. Vytlacil () establishes the equivalence of these two formulations under general conditions. (See Heckman and Vytlacil (b) for a discussion of these conditions.)

The method of instrumental variables (IV) postulates that

(IV-1) (Y0, Y1, {D(z), z ∈ Z}) ⊥⊥ Z | X. (Independence)


One consequence of this assumption is that E(D | Z) = P(Z), the propensity score, is random with respect to potential outcomes. Thus (Y0, Y1) ⊥⊥ P(Z) | X. So are all other functions of Z given X. The method of instrumental variables also assumes that

(IV-2) E(D | X, Z) = P(X, Z) is a nondegenerate function of Z given X. (Rank Condition)

Alternatively, one can write that Var(E(D | X, Z)) ≠ Var(E(D | X)).

Comparing (IV-1) to (M-1), in the method of instrumental

variables, Z is independent of (Y0, Y1) given X, whereas in matching D is independent of (Y0, Y1) given X. So in (IV-1), Z plays the role of D in matching condition (M-1). Comparing (IV-2) with (M-2), in the method of IV the choice probability Pr(D = 1 | X, Z) is assumed to vary with Z conditional on X, whereas in matching, D varies conditional on X. Unlike the method of control functions, no explicit model of the relationship between D and (Y0, Y1) is required in applying IV.

(IV-2) is a rank condition and can be empirically verified. (IV-1) is not testable as it involves assumptions about counterfactuals. In a conventional common coefficient regression model

Y = α + βD + U,

where β is a constant and where I allow for Cov(D, U) ≠ 0, (IV-1) and (IV-2) identify β (β = Cov(Z, Y)/Cov(Z, D)). When β varies in the population and is correlated with D, additional assumptions must be invoked for IV to identify interpretable parameters. Heckman et al. () and Heckman and Vytlacil (b) discuss these conditions.

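For the common coefficient case above, a minimal sketch (my own simulated data and parameter values) of the IV ratio β = Cov(Z, Y)/Cov(Z, D), compared with the biased OLS contrast:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
beta = 2.0                                   # assumed constant treatment effect

z = rng.normal(size=n)                       # instrument: shifts D, excluded from Y
u = rng.normal(size=n)                       # error term correlated with D
d = (z + u + rng.normal(size=n) > 0).astype(int)
y = 1.0 + beta * d + u

cov = np.cov(np.vstack([z, y, d]))
beta_iv = cov[0, 1] / cov[0, 2]              # Cov(Z, Y) / Cov(Z, D)
c_dy = np.cov(d, y)
beta_ols = c_dy[0, 1] / c_dy[0, 0]           # biased because Cov(D, U) != 0

print("IV estimate :", round(beta_iv, 3))    # close to 2.0
print("OLS estimate:", round(beta_ols, 3))   # biased upward
```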
Assumptions (IV-1) and (IV-2), with additional assumptions in the case where β varies in the population (which I discuss in this paper), can be used to identify mean treatment parameters. Replacing Y1 with 1(Y1 ≤ t) and Y0 with 1(Y0 ≤ t), where t is a constant, the IV approach allows us to identify the marginal distributions F1(y1 | X) or F0(y0 | X).

In matching, the variation in D that arises after conditioning on X provides the source of randomness that switches people across treatment status. Nature is assumed to provide an experimental manipulation conditional on X that replaces the randomization assumed in (R-1)–(R-3). When D is perfectly predictable by X, there is no variation in it conditional on X, and the randomization assumed to be given by nature in the matching model breaks down. Heuristically, matching assumes a residual ε(X) = D − E(D | X) that is nondegenerate and is one manifestation of the randomness that causes persons to switch status. (It is heuristically illuminating, but technically incorrect, to replace ε(X) with D in (R-1) or R in (R-2) or T in (R-3). In general ε(X) is not independent of X even if it is mean independent.)

In the IV method, it is the choice probability E(D | X, Z) = P(X, Z) that is random with respect to (Y0, Y1), not the components of D not predictable by (X, Z). Variation in Z for a fixed X provides the required variation in D that switches treatment status and still produces the required conditional independence:

(Y0, Y1) ⊥⊥ P(X, Z) | X.

Variation in P(X, Z) produces variations in D that switch treatment status. Components of variation in D not predictable by (X, Z) do not produce the required independence. Instead, the predicted component provides the required independence. It is just the opposite in matching. Versions of the method of control functions use measurements to proxy θ in (U-1) and (U-2) and remove the spurious dependence that gives rise to selection problems. These are called replacement functions (see Heckman and Robb a) or control variates (see Blundell and Powell ).

The methods of replacement functions and proxy variables all start from characterizations (U-1) and (U-2). θ is not observed and (Y0, Y1) are not observed directly, but Y is observed:

Y = DY1 + (1 − D)Y0.

Missing variables θ produce selection bias, which creates a problem with using observational data to evaluate social programs. From (U-1), if one conditions on θ, condition (M-1) for matching would be satisfied, and hence one could identify the parameters and distributions that can be identified if the conditions required for matching are satisfied.

The most direct approach to controlling for θ is to assume access to a function τ(X, Z, Q) that perfectly proxies θ:

θ = τ(X, Z, Q).

This approach based on a perfect proxy is called the method of replacement functions by Heckman and Robb (a). In (U-1), one can substitute for θ in terms of observables (X, Z, Q). Then

(Y0, Y1) ⊥⊥ D | X, Z, Q.

It is possible to condition nonparametrically on (X, Z, Q) without having to know the exact functional form of τ. θ can be a vector and τ can be a vector of functions. This method has been used in the economics of education for decades (see the references in Heckman and Robb a). If θ is ability and τ is a test score, it is sometimes assumed


that the test score is a perfect proxy (or replacement function) for θ and that one can enter it into the regressions of earnings on schooling to escape the problem of ability bias (typically assuming a linear relationship between τ and θ). (Thus if τ = α0 + α1X + α2Q + α3Z + θ, one can write

θ = τ − α0 − α1X − α2Q − α3Z,

and use this as the proxy function. Controlling for τ, X, Q, Z controls for θ. Notice that one does not need to know the coefficients (α0, α1, α2, α3) to implement the method; one can condition on τ, X, Q, Z.) Heckman and Robb (a) discuss the literature that uses replacement functions in this way. Olley and Pakes () apply this method and consider nonparametric identification of the τ function. Matzkin () provides a rigorous proof of identification for this approach in a general nonparametric setting.

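A minimal simulation of the replacement-function idea just described (the variable names "earnings", "schooling" and "test" and all coefficients are hypothetical; the test score is assumed to be a perfect linear proxy for ability θ):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300_000

theta = rng.normal(size=n)                       # unobserved ability
schooling = 0.8 * theta + rng.normal(size=n)     # schooling choice loads on ability
test = 1.0 + 0.5 * schooling + 2.0 * theta       # perfect proxy: linear in theta
earnings = 1.0 * schooling + 1.5 * theta + rng.normal(size=n)

def ols(y, regressors):
    X = np.column_stack([np.ones(len(y))] + list(regressors))
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Omitting ability: the schooling coefficient suffers from ability bias.
print("short regression  :", ols(earnings, [schooling])[1])
# Conditioning on the test score (the replacement function) controls for theta
# without knowing the coefficients of the proxy equation.
print("with the proxy    :", ols(earnings, [schooling, test])[1])   # close to 1.0
```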
The method of replacement functions assumes that the proxy equation θ = τ(X, Z, Q) is perfect. In many applications, this assumption is far too strong. More often, θ is measured with error. This produces a factor model or measurement error model (Aigner et al. ). Matzkin () surveys this method. One can represent the factor model in a general way by a system of equations:

Yj = gj(X, Z, Q, θ, εj),   j = 1, . . . , J.

A linear factor model separable in the unobservables writes

Yj = gj(X, Z, Q) + αj θ + εj,   j = 1, . . . , J,

where

(X, Z, Q) ⊥⊥ (θ, εj),   εj ⊥⊥ θ,   j = 1, . . . , J,

and the εj are mutually independent. Observe that under the linear factor model and the independence assumptions just displayed, Yj controlling for X, Z, Q only imperfectly proxies θ because of the presence of εj. θ is called a factor, the αj factor loadings, and the εj "uniquenesses" (see, e.g., Aigner ).

A large literature, reviewed in Abbring and Heckman () and Matzkin (), shows how to establish identification of econometric models under factor structure assumptions. Cunha et al. (), Schennach () and Hu and Schennach () establish identification in nonlinear models of the general form Yj = gj(X, Z, Q, θ, εj). (Cunha et al. (, ) apply and extend this approach to a dynamic factor setting where the θt are time dependent.) The key to identification is multiple, but imperfect (because of εj), measurements on θ from the Yj, j = 1, . . . , J and X, Z, Q, and possibly other measurement systems that depend on θ. Carneiro et al. (), Cunha et al. (, ) and Cunha and Heckman (, ) apply and develop these methods.

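A small sketch of the linear, separable factor model above, showing how two or more imperfect measurements identify the factor variance and loadings from cross-equation covariances (the loadings, variances and normalization α1 = 1 are my own illustrative choices, and covariates are dropped for brevity):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
var_theta = 2.0
alpha = np.array([1.0, 0.7, 1.3])               # factor loadings (alpha_1 normalized to 1)

theta = rng.normal(scale=np.sqrt(var_theta), size=n)
eps = rng.normal(size=(3, n))                   # mutually independent uniquenesses
Y = alpha[:, None] * theta + eps                # Y_j = alpha_j * theta + eps_j

C = np.cov(Y)
# With alpha_1 = 1: Cov(Y1,Y2) = alpha_2 Var(theta), Cov(Y1,Y3) = alpha_3 Var(theta),
# Cov(Y2,Y3) = alpha_2 alpha_3 Var(theta), so Var(theta) = C12*C13/C23.
var_theta_hat = C[0, 1] * C[0, 2] / C[1, 2]
print("Var(theta) estimate:", round(var_theta_hat, 3))       # close to 2.0
print("alpha_2, alpha_3   :", round(C[0, 1] / var_theta_hat, 3),
      round(C[0, 2] / var_theta_hat, 3))
```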
Under the assumption that (X, Z, Q) ⊥⊥ (θ, εj) and εj ⊥⊥ θ, they show how to nonparametrically identify the econometric model and the distributions of the unobservables FΘ(θ) and Fεj(εj). In the context of classical simultaneous equations models, identification is secured by using covariance restrictions across equations, exploiting the low dimensionality of the vector θ compared to the high dimensional vector of (imperfect) measurements on it. The recent literature (Cunha et al. ; Cunha et al. ; Hu and Schennach ) extends the linear model to a nonlinear setting.

The recent econometric literature applies in special cases the idea of the control function principle introduced in Heckman and Robb (a). This principle, versions of which can be traced back to Telser (), partitions θ in (U-1) into two or more components, θ = (θ1, θ2), where only one component of θ is the source of bias. Thus it is assumed that (U-1) is true, and (U-1)′ is also true:

(U-1)′ (Y0, Y1) ⊥⊥ D | X, Z, θ1,

and (U-2) holds. For example, in a normal selection model with additive separability, one can break U1, the error term associated with Y1, into two components:

U1 = E(U1 | V) + ε,

where V plays the role of θ1 and is associated with the choice equation. Under normality, ε is independent of E(U1 | V). Further,

E(U1 | V) = [Cov(U1, V)/Var(V)] V,

assuming E(U1) = 0 and E(V) = 0. Heckman and Robb (a) show how to construct a control function in the context of the choice model

D = 1[µD(Z) > V].

Controlling for V controls for the component of θ in (U-1)′ that gives rise to the spurious dependence. The Blundell and Powell (, ) application of the control function principle assumes the functional form of E(U1 | V) just displayed, but assumes that V can be perfectly proxied by a first stage equation. Thus they use a replacement function in their first stage. Their method does not work when one can only condition on D rather than on D∗ = µD(Z) − V, or measure D∗ directly. (Imbens and Newey () extend their approach. See the discussion in Matzkin ().) In the sample selection model, it is not necessary to identify V. As developed in Heckman and Robb (a) and


E (Y ∣ X,Z,D = ) = µ(X) + E (U ∣ µD(Z) > V)´¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¸¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¶control function

,

so the analyst "expects out" rather than solves out the effect of the component of V on U1 and thus controls for selection bias under the maintained assumptions. In terms of the propensity score, under the conditions specified in Heckman and Vytlacil (a), one may write the preceding expression in terms of P(Z):

E(Y1 | X, Z, D = 1) = µ1(X) + K(P(Z)),

where K(P(Z)) = E(U1 | X, Z, D = 1). It is not literally necessary to know V or be able to estimate it. The Blundell and Powell (, ) application of the control function principle assumes that the analyst can condition on and estimate V.

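A minimal sketch of the sample-selection version of this idea under joint normality (simulated data and parameter values are my own; to keep the code short, the choice index µD(Z) is treated as known rather than estimated by a probit first stage, which a real application would do). Under normality, K(P(Z)) is proportional to the inverse Mills ratio of the index, and including it as a regressor among the D = 1 observations removes the selection bias:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
n = 400_000
rho = 0.8                                     # Corr(U1, V): source of selection bias

z = rng.normal(size=n)
x = rng.normal(size=n)
v = rng.normal(size=n)
u1 = rho * v + np.sqrt(1 - rho**2) * rng.normal(size=n)

mu_d = 0.5 + 1.0 * z                          # choice index mu_D(Z), assumed known here
d = (mu_d > v).astype(int)
y1 = 2.0 + 1.0 * x + u1                       # outcome observed only when D = 1

# E(U1 | V < mu_D(Z)) = -rho * phi(mu_D)/Phi(mu_D); include the inverse Mills ratio
# as a regressor so its coefficient absorbs the sign and scale.
lam = norm.pdf(mu_d) / norm.cdf(mu_d)

sel = d == 1
X_naive = np.column_stack([np.ones(sel.sum()), x[sel]])
X_cf = np.column_stack([np.ones(sel.sum()), x[sel], lam[sel]])
b_naive = np.linalg.lstsq(X_naive, y1[sel], rcond=None)[0]
b_cf = np.linalg.lstsq(X_cf, y1[sel], rcond=None)[0]
print("intercept without control function:", round(b_naive[0], 3))  # biased
print("intercept with control function   :", round(b_cf[0], 3))     # close to 2.0
```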
The Blundell and Powell method and the method of Imbens and Newey () build heavily on the formula for E(U1 | V) above and implicitly make strong distributional and functional form assumptions that are not intrinsic to the method of control functions. As just noted, their method uses a replacement function to obtain E(U1 | V) in the first step of their procedures. The general control function method does not require a replacement function approach. The literature has begun to distinguish between the more general control function approach and the control variate approach that uses a first stage replacement function.

Matzkin () develops the method of unobservable instruments, which is a version of the replacement function approach applied to 7nonlinear models. Her unobservable instruments play the role of the covariance restrictions used to identify classical simultaneous equations models (see Fisher ). Her approach is distinct from and therefore complementary with linear factor models. Instead of assuming (X, Z, Q) ⊥⊥ (θ, εj), she assumes in a two equation system that (θ, ε1) ⊥⊥ Y2 | Y1, X, Z. See Matzkin ().

I do not discuss panel data methods in this paper. The most commonly used panel data method is difference-in-differences, as discussed in Heckman and Robb (a), Blundell et al. (), Heckman et al. (), and Bertrand et al. (), to cite only a few of the key papers. Most of the estimators I have discussed can be adapted to a panel data setting. Heckman et al. () develop difference-in-differences matching estimators. Abadie () extends this work. (There is related work by Athey and Imbens (), which exposits the Heckman et al. () difference-in-differences matching estimator.) Separability between errors and observables is a key feature of the panel data approach in its standard application. Altonji and Matzkin () and Matzkin () present analyses of nonseparable panel data methods. Regression discontinuity estimators, which are versions of IV estimators, are discussed by Heckman and Vytlacil (b).

The table summarizes some of the main lessons of this section. I stress that the stated conditions are necessary conditions. There are many versions of the IV and control functions principles and extensions of these ideas which refine these basic postulates. See Heckman and Vytlacil (b). Matzkin () is an additional reference on sources of identification in econometric models.

I next introduce the generalized Roy model and the concept of the marginal treatment effect, which helps to link the econometric literature to the statistical literature. The Roy model also provides a framework for thinking about the difference in information between the agents and the statistician.

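As background for what follows, here is a small simulation sketch of a generalized Roy setup (all parameter values, including the correlation structure, are my own illustration): potential outcomes (Y0, Y1), a latent index choice D = 1[µD(Z) > V], and the observed outcome Y = DY1 + (1 − D)Y0. Because U1 and V are correlated, the naive treated/untreated contrast differs from both the average treatment effect and treatment on the treated.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500_000

# Correlated unobservables (U0, U1, V): selection on unobserved gains.
cov = np.array([[1.0, 0.5, 0.3],
                [0.5, 1.0, -0.6],
                [0.3, -0.6, 1.0]])
u0, u1, v = rng.multivariate_normal(np.zeros(3), cov, size=n).T

z = rng.normal(size=n)
y0 = 1.0 + u0
y1 = 1.5 + u1
d = (1.0 * z > v).astype(int)       # D = 1[mu_D(Z) > V]
y = d * y1 + (1 - d) * y0           # observed outcome

ate = (y1 - y0).mean()                                   # average treatment effect
tt = (y1 - y0)[d == 1].mean()                            # treatment on the treated
naive = y[d == 1].mean() - y[d == 0].mean()              # raw treated/untreated contrast
print(f"ATE={ate:.3f}  TT={tt:.3f}  naive contrast={naive:.3f}")
```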
Matching
The method of matching is widely used in statistics. It is based on strong assumptions which often make its application to economic data questionable. Because of its popularity, I single it out for attention. The method of matching assumes selection of treatment based on potential outcomes,

(Y0, Y1) ⊥⊥/ D,

so Pr(D = 1 | Y0, Y1) depends on (Y0, Y1). It assumes access to variables Q such that conditioning on Q removes the dependence:

(Y0, Y1) ⊥⊥ D | Q. (Q-1)

Thus,

Pr(D = 1 | Q, Y0, Y1) = Pr(D = 1 | Q).

Comparisons between treated and untreated can be made at all points in the support of Q such that

0 < Pr(D = 1 | Q) < 1. (Q-2)

The method does not explicitly model choices of treatment or the subjective evaluations of participants, nor is there any distinction between the variables in the outcome equations (X) and the variables in the choice equations (Z) that is central to the IV method and the method of control functions. In principle, condition (Q-1) can be satisfied using a set of variables Q distinct from all or some of the components of X and Z. The conditioning variables do not have to be exogenous.

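A minimal nearest-neighbour matching sketch under (Q-1) and (Q-2) (my own simulated data; here Q is a scalar and matching is done directly on Q, which plays the same role as matching on an estimated propensity score):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20_000

q = rng.normal(size=n)                              # observed conditioning variable Q
p = 1 / (1 + np.exp(-q))                            # Pr(D=1 | Q) strictly inside (0,1): (Q-2)
d = (rng.uniform(size=n) < p).astype(int)
y0 = q + rng.normal(size=n)                         # selection works only through Q: (Q-1)
y1 = 1.0 + q + rng.normal(size=n)
y = d * y1 + (1 - d) * y0

treated, controls = np.where(d == 1)[0], np.where(d == 0)[0]
order = controls[np.argsort(q[controls])]           # controls sorted by Q for fast lookup

# For each treated unit, impute Y0 with the control whose Q is nearest.
pos = np.searchsorted(q[order], q[treated])
pos = np.clip(pos, 1, len(order) - 1)
left, right = order[pos - 1], order[pos]
nearest = np.where(np.abs(q[left] - q[treated]) <= np.abs(q[right] - q[treated]), left, right)

att_matched = (y[treated] - y[nearest]).mean()
att_naive = y[treated].mean() - y[controls].mean()
print("naive contrast:", round(att_naive, 3))       # biased: treated units have higher Q
print("matched TT    :", round(att_matched, 3))     # close to 1.0
```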

Principles Underlying Econometric Estimators for Identifying Causal Effects. Table: Identifying assumptions under commonly used methods

Random assignment
  Identifying assumptions: (Y0, Y1) ⊥⊥ ξ, with ξ = 1 ⟹ A = 1 and ξ = 0 ⟹ A = 0 (full compliance). Alternatively, if self-selection is random with respect to outcomes, (Y0, Y1) ⊥⊥ D. Assignment can be conditional on X.
  Identifies marginal distributions? Yes. Exclusion condition needed? No.

Matching
  Identifying assumptions: (Y0, Y1) ⊥⊥/ D, but (Y0, Y1) ⊥⊥ D | X and 0 < Pr(D = 1 | X) < 1 for all X, so D conditional on X is a nondegenerate random variable.
  Identifies marginal distributions? Yes. Exclusion condition needed? No.

Control functions and extensions
  Identifying assumptions: (Y0, Y1) ⊥⊥/ D | X, Z, but (Y0, Y1) ⊥⊥ D | X, Z, θ. The method models the dependence induced by θ or else proxies θ (replacement function). Version (i): replacement functions (substitute out θ by observables) (Blundell and Powell ; Heckman and Robb b; Olley and Pakes ); factor models (Carneiro et al. ) allow for measurement error in the proxies. Version (ii): integrate out θ assuming θ ⊥⊥ (X, Z) (Aakvik et al. ; Carneiro et al. ). Version (iii): for separable models for mean response, expect θ conditional on X, Z, D as in standard selection models (control functions in the same sense of Heckman and Robb).
  Identifies marginal distributions? Yes. Exclusion condition needed? Yes (for semiparametric models).

IV
  Identifying assumptions: (Y0, Y1) ⊥⊥/ D | X, Z, but (Y0, Y1) ⊥⊥ Z | X, and Pr(D = 1 | Z) is a nondegenerate function of Z.
  Identifies marginal distributions? Yes. Exclusion condition needed? Yes.

Notes: (Y1, Y0) are potential outcomes that depend on X. D = 1 if a person is assigned to (or chooses) status 1, D = 0 otherwise. Z are determinants of D; θ is a vector of unobservables. For random assignments, A is a vector of actual treatment status: A = 1 if treated, A = 0 if not; ξ = 1 if a person is randomized to treatment status 1, ξ = 0 otherwise (Heckman and Vytlacil b).

From condition (Q-1) one recovers the distributions of Y0 and Y1 given Q – Pr(Y0 ≤ y0 | Q = q) = F0(y0 | Q = q) and Pr(Y1 ≤ y1 | Q = q) = F1(y1 | Q = q) – but not the joint distribution F1,0(y1, y0 | Q = q), because the analyst does not observe the same persons in the treated and untreated states. This is a standard evaluation problem common to all econometric estimators. Methods for determining which variables belong in Q rely on untested exogeneity assumptions, which we discuss in this section.


OLS is a special case of matching that focuses on the identification of conditional means. In OLS, linear functional forms are maintained as exact representations or valid approximations. Considering a common coefficient model, OLS writes

Y = πQ + Dα + U, (Q-3)

where α is the treatment effect and

E(U | Q, D) = 0. (Q-4)

The assumption is made that the variance–covariance matrix of (Q, D) is of full rank:

Var(Q, D) is of full rank. (Q-5)

Under these conditions, one can identify α even though D and U are dependent: D ⊥⊥/ U. Controlling for the observable Q eliminates any spurious mean dependence between D and U: E(U | D) ≠ 0 but E(U | D, Q) = 0. (Q-4) is the linear regression counterpart to (Q-1). (Q-5) is the linear regression counterpart to (Q-2). Failure of (Q-5) would mean that, using a nonparametric estimator, one might perfectly predict D given Q, so that Pr(D = 1 | Q = q) = 1 or 0. (This condition might be met only at certain values of Q = q. For certain parameterizations (e.g., the linear probability model), one may obtain predicted probabilities outside the unit interval.) Matching can be implemented as a nonparametric

method. When this is done, the procedure does not require specification of the functional form of the outcome equations. It enforces the requirement that (Q-2) be satisfied by estimating functions pointwise in the support of Q. Assume that Q = (X, Z) and that X and Z are the same except where otherwise noted. Thus I invoke assumptions (M-1) and (M-2) presented in section "7The Basic Principles Underlying the Identification of the Leading Econometric Evaluation Estimators", even though in principle one can use a more general conditioning set.

Assumptions (M-1) and (M-2) or (Q-1) and (Q-2) rule out the possibility that, after conditioning on X (or Q), agents possess more information about their choices than econometricians, and that the unobserved information helps to predict the potential outcomes. Put another way, the method allows for potential outcomes to affect choices but only through the observed variables, Q, that predict outcomes. This is the reason why Heckman and Robb (a, b) call the method selection on observables.

Heckman and Vytlacil (b) establish the following points. (1) Matching assumptions (M-1) and (M-2) generically imply a flat MTE in uD, i.e., they assume that E(Y1 − Y0 | X = x, UD = uD) does not depend on uD. Thus the unobservables central to the Roy model and its extensions and the unobservables central to the modern IV literature are assumed to be absent once the analyst conditions on X. (M-1) implies that all mean treatment parameters are the same. (2) Even if one weakens (M-1) and (M-2) to mean independence instead of full independence, generically the MTE is flat in uD under the assumptions of the nonparametric generalized Roy model developed in section "7An Index Model of Choice and Treatment Effects: Definitions and Unifying Principles", so again all mean treatment parameters are the same. (3) IV and matching make distinct identifying assumptions even though they both invoke conditional independence assumptions. (4) Comparing matching with IV and control function (sample selection) methods, matching assumes that conditioning on observables eliminates the dependence between (Y0, Y1) and D. The control function principle models the dependence. (5) Heckman and Navarro () and Heckman and Vytlacil (b) demonstrate that if the assumptions of the method of matching are violated, the method can produce substantially biased estimators of the parameters of interest. (6) Standard methods for selecting the conditioning variables used in matching assume exogeneity. Violations of the exogeneity assumption can produce biased estimators.

Nonparametric versions of matching embodying (M-2)

avoid the problem of making inferences outside the support of the data. This problem is implicit in any application of least squares. The figure below shows the support problem that can arise in linear least squares when the linearity of the regression is used to extrapolate estimates determined in one empirical support to new supports. Careful attention to support problems is a virtue of any nonparametric method, including, but not unique to, nonparametric matching. Heckman, Ichimura, Smith, and Todd () show that the bias from neglecting the problem of limited support can be substantial. See also the discussion in Heckman, LaLonde, and Smith ().

[Figure: The least squares extrapolation problem avoided by using nonparametric regression or matching (Heckman and Vytlacil b). The plot contrasts the true function Y = g(X) + U with the least squares approximating line Y = ΠX + V fitted to the observed data, plotted against X.]

Summary
This paper exposits the basic economic model of causality and compares it to models in statistics. It exposits the key identifying assumptions of commonly used econometric estimators for causal inference. The emphasis is on the economic content of these assumptions. I discuss how matching makes strong assumptions about the information available to the economist/statistician.

Acknowledgments
University of Chicago, Department of Economics, E. th Street, Chicago, IL, USA. This research was supported by NSF: --, -, and SES-, and NICHD: R-HD-. I thank Mohan Singh, Sergio Urzua, and Edward Vytlacil for useful comments.

About the Author
Professor Heckman shared the Nobel Memorial Prize in Economics with Professor Daniel McFadden for his development of theory and methods for analyzing selective samples. Professor Heckman has also received numerous awards for his work, including the John Bates Clark Award of the American Economic Association, the Jacob Mincer Award for Lifetime Achievement in Labor Economics, the University College Dublin Ulysses Medal, the Aigner award from the Journal of Econometrics, and the Gold Medal of the President of the Italian Republic, awarded by the International Scientific Committee. He holds six honorary doctorates. He is considered to be among the five most influential economists in the world (http://ideas.repec.org/top/top.person.all.html).

Cross References
7Causal Diagrams
7Causation and Causal Inference
7Econometrics
7Factor Analysis and Latent Variable Modelling
7Instrumental Variables
7Measurement Error Models
7Panel Data
7Random Coefficient Models
7Randomization

References and Further Reading
Aakvik A, Heckman JJ, Vytlacil EJ () Training effects on employment when the training effects are heterogeneous: an application to Norwegian vocational rehabilitation programs. University of Bergen Working Paper, University of Chicago
Aakvik A, Heckman JJ, Vytlacil EJ () Estimating treatment effects for discrete outcomes when responses to treatment vary: an application to Norwegian vocational rehabilitation programs. J Econometrics :–
Abadie A () Bootstrap tests of distributional treatment effects in instrumental variable models. J Am Stat Assoc :–
Abbring JH, Heckman JJ () Econometric evaluation of social programs, part III: distributional treatment effects, dynamic treatment effects, dynamic discrete choice, and general equilibrium policy evaluation. In: Heckman J, Leamer E (eds) Handbook of econometrics, vol B. Elsevier, Amsterdam, pp –
Aigner DJ () The residential electricity time-of-use pricing experiments: what have we learned? In: Hausman JA, Wise DA (eds) Social experimentation. University of Chicago Press, Chicago, pp –
Aigner DJ, Hsiao C, Kapteyn A, Wansbeek T () Latent variable models in econometrics. In: Griliches Z, Intriligator MD (eds) Handbook of econometrics, vol , chap . Elsevier, Amsterdam, pp –
Altonji JG, Matzkin RL () Cross section and panel data estimators for nonseparable models with endogenous regressors. Econometrica :–
Angrist J, Graddy K, Imbens G () The interpretation of instrumental variables estimators in simultaneous equations models with an application to the demand for fish. Rev Econ Stud :–
Angrist JD, Krueger AB () Empirical strategies in labor economics. In: Ashenfelter O, Card D (eds) Handbook of labor economics, vol A. North-Holland, New York, pp –
Athey S, Imbens GW () Identification and inference in nonlinear difference-in-differences models. Econometrica :–
Barnow BS, Cain GG, Goldberger AS () Issues in the analysis of selectivity bias. In: Stromsdorfer E, Farkas G (eds) Evaluation studies, vol . Sage Publications, Beverly Hills, pp –
Barros RP () Two essays on the nonparametric estimation of economic models with selectivity using choice-based samples. PhD thesis, University of Chicago
Bertrand M, Duflo E, Mullainathan S () How much should we trust differences-in-differences estimates? Q J Econ :–
Björklund A, Moffitt R () The estimation of wage gains and welfare gains in self-selection. Rev Econ Stat :–
Blundell R, Duncan A, Meghir C () Estimating labor supply responses using tax reforms. Econometrica :–
Blundell R, Powell J () Endogeneity in nonparametric and semiparametric regression models. In: Dewatripont M, Hansen LP, Turnovsky SJ (eds) Advances in economics and econometrics: theory and applications, eighth world congress, vol . Cambridge University Press, Cambridge
Blundell R, Powell J () Endogeneity in semiparametric binary response models. Rev Econ Stud :–
Carneiro P, Hansen K, Heckman JJ () Removing the veil of ignorance in assessing the distributional impacts of social policies. Swedish Econ Policy Rev :–
Carneiro P, Hansen K, Heckman JJ () Estimating distributions of treatment effects with an application to the returns to schooling and measurement of the effects of uncertainty on college choice. Int Econ Rev :–
Cunha F, Heckman JJ () Identifying and estimating the distributions of ex post and ex ante returns to schooling: a survey of recent developments. Labour Econ :–
Cunha F, Heckman JJ () A new framework for the analysis of inequality. Macroecon Dyn :–
Cunha F, Heckman JJ, Matzkin R () Nonseparable factor analysis. Unpublished manuscript, University of Chicago, Department of Economics
Cunha F, Heckman JJ, Navarro S () Separating uncertainty from heterogeneity in life cycle earnings. The Hicks Lecture. Oxford Economic Papers :–
Cunha F, Heckman JJ, Navarro S () Counterfactual analysis of inequality and social mobility. In: Morgan SL, Grusky DB, Fields GS (eds) Mobility and inequality: frontiers of research in sociology and economics, chap . Stanford University Press, Stanford, pp –
Cunha F, Heckman JJ, Schennach SM () Nonlinear factor analysis. Unpublished manuscript, University of Chicago, Department of Economics, revised
Cunha F, Heckman JJ, Schennach SM () Estimating the technology of cognitive and noncognitive skill formation. Econometrica, forthcoming
Fisher RA () The design of experiments. Hafner Publishing, New York
Gerfin M, Lechner M () A microeconomic evaluation of the active labor market policy in Switzerland. Econ J :–
Heckman JJ () Randomization and social policy evaluation. In: Manski C, Garfinkel I (eds) Evaluating welfare and training programs. Harvard University Press, Cambridge, pp –
Heckman JJ () Instrumental variables: a study of implicit behavioral assumptions used in making program evaluations. J Hum Resour :–; addendum published vol no (Winter)
Heckman JJ () Econometric causality. Int Stat Rev :–
Heckman JJ, Ichimura H, Smith J, Todd PE () Characterizing selection bias using experimental data. Econometrica :–
Heckman JJ, LaLonde RJ, Smith JA () The economics and econometrics of active labor market programs. In: Ashenfelter O, Card D (eds) Handbook of labor economics, vol A, chap . North-Holland, New York, pp –
Heckman JJ, Navarro S () Using matching, instrumental variables, and control functions to estimate economic choice models. Rev Econ Stat :–
Heckman JJ, Navarro S () Dynamic discrete choice and dynamic treatment effects. J Econometrics :–
Heckman JJ, Robb R (a) Alternative methods for evaluating the impact of interventions. In: Heckman J, Singer B (eds) Longitudinal analysis of labor market data, vol . Cambridge University Press, New York, pp –
Heckman JJ, Robb R (b) Alternative methods for evaluating the impact of interventions: an overview. J Econometrics :–
Heckman JJ, Robb R (a) Alternative methods for solving the problem of selection bias in evaluating the impact of treatments on outcomes. In: Wainer H (ed) Drawing inferences from self-selected samples. Springer, New York, pp –; reprinted in , Erlbaum, Mahwah
Heckman JJ, Robb R (b) Postscript: a rejoinder to Tukey. In: Wainer H (ed) Drawing inferences from self-selected samples. Springer, New York, pp –; reprinted in , Erlbaum, Mahwah
Heckman JJ, Smith JA, Clements N () Making the most out of programme evaluations and social experiments: accounting for heterogeneity in programme impacts. Rev Econ Stud :–
Heckman JJ, Urzua S, Vytlacil EJ () Understanding instrumental variables in models with essential heterogeneity. Rev Econ Stat :–
Heckman JJ, Vytlacil EJ () Local instrumental variables and latent variable models for identifying and bounding treatment effects. Proc Natl Acad Sci :–
Heckman JJ, Vytlacil EJ () Local instrumental variables. In: Hsiao C, Morimune K, Powell JL (eds) Nonlinear statistical modeling: proceedings of the thirteenth international symposium in economic theory and econometrics: essays in honor of Takeshi Amemiya. Cambridge University Press, New York, pp –
Heckman JJ, Vytlacil EJ () Structural equations, treatment effects and econometric policy evaluation. Econometrica :–
Heckman JJ, Vytlacil EJ (a) Econometric evaluation of social programs, part I: causal models, structural models and econometric policy evaluation. In: Heckman J, Leamer E (eds) Handbook of econometrics, vol B. Elsevier, Amsterdam, pp –
Heckman JJ, Vytlacil EJ (b) Econometric evaluation of social programs, part II: using the marginal treatment effect to organize alternative economic estimators to evaluate social programs and to forecast their effects in new environments. In: Heckman J, Leamer E (eds) Handbook of econometrics, vol B. Elsevier, Amsterdam, pp –
Hu Y, Schennach SM () Instrumental variable treatment of nonclassical measurement error models. Econometrica :–
Imbens GW () Nonparametric estimation of average treatment effects under exogeneity: a review. Rev Econ Stat :–
Imbens GW, Angrist JD () Identification and estimation of local average treatment effects. Econometrica :–
Imbens GW, Newey WK () Identification and estimation of triangular simultaneous equations models without additivity. Technical working paper, National Bureau of Economic Research
Matzkin RL () Nonparametric estimation of nonadditive random functions. Econometrica :–
Matzkin RL () Nonparametric identification. In: Heckman J, Leamer E (eds) Handbook of econometrics, vol B. Elsevier, Amsterdam
Olley GS, Pakes A () The dynamics of productivity in the telecommunications equipment industry. Econometrica :–
Pearl J () Causality. Cambridge University Press, Cambridge
Powell JL () Estimation of semiparametric models. In: Engle R, McFadden D (eds) Handbook of econometrics, vol . Elsevier, Amsterdam, pp –
Quandt RE () The estimation of the parameters of a linear regression system obeying two separate regimes. J Am Stat Assoc :–
Quandt RE () A new approach to estimating switching regressions. J Am Stat Assoc :–
Roy A () Some thoughts on the distribution of earnings. Oxford Econ Pap :–
Rubin DB () Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol :–
Rubin DB () Bayesian inference for causal effects: the role of randomization. Ann Stat :–
Schennach SM () Estimation of nonlinear models with measurement error. Econometrica :–
Telser LG () Iterative estimation of a set of linear regression equations. J Am Stat Assoc :–
Vytlacil EJ () Independence, monotonicity, and latent index models: an equivalence result. Econometrica :–

Prior Bayes: Rubin's View of Statistics
Herman Rubin
Purdue University, West Lafayette, IN, USA

Introduction
What is statistics? The common way of looking at it is as a collection of methods, somehow or other produced, and one should use one of those methods for a given set of data. The typical user who has little more than this comes to a statistician asking for THE answer, as if the data are sufficient to get this without knowing the problem.

No; the first thing is to formulate the problem. One cannot do better than to assume that there is an unknown state of nature, that there is a probability distribution of observations given a state of nature, a set of possible actions, and that each action in each state of nature has (possibly random) consequences.

Following the von Neumann–Morgenstern () axioms for utility, in (see Rubin a) I was able to show that if one has a self-consistent evaluation of actions in each state of nature, the utility function for an unknown state of nature has to be an integral of the utilities for the states of nature. Another way of looking at this in the discrete case is that one assigns a weight to each result in each state of nature, and should choose the action which produces the best sum; this generalizes to the best value of the integral. This is the prior Bayes approach.

Let us simplify to the usual Bayes model; it can be generalized to include more, and in fact must be for the infinite parametric (usually called non-parametric) problems encountered.

So the quantity to be minimized is

∫ ∫ L(ω, q(x))dµ(x∣ω)dξ(ω).

If this integral is finite, and if, for example, L is positive, the integration can be interchanged and the result written as

∫ ∫ L(ω, q(x)) dϕ(ω|x) dm(x),

and if the properties of q for different x are unrestricted, one can (hopefully) use the usual Bayes procedure of minimizing the inner integral.

But this can require a huge amount of computing power, possibly even exceeding the capacity of the universe, and sufficient assurance that one has a sufficiently good approximation to the loss-prior combination. One uses the latter because it is only the product of L and ξ which is relevant. One can try to approximate, but posterior robustness results are hard, and often impossible, to come by.

On the other hand, the prior Bayes approach can show what is and what is not important, and can, in many cases, provide methods which are not much worse than full Bayes methods, or at least the approximations made. One might question how this can be shown, considering that the full Bayes procedure cannot be calculated; however, a smaller problem with one which can be calculated might be shown to come close. Also, one can get procedures which are good with considerable uncertainty about the prior distribution of the parameter.

Results and Examples
Some early approaches, some by those not believing in the Bayes approach, were the empirical Bayes results of Robbins and his followers. Empirical Bayes extends much farther now, and it is in the spirit of prior Bayes, as the performance of the procedures is what is considered. There is a large literature on this, and I will not go into it in any great detail.

Suppose we consider the case of the usual test of a point null, when the distribution is normal. If we assume the prior is symmetric, we need only consider procedures which accept if the mean is close enough to the null, and reject otherwise. If we assume that the prior gives a point mass at the null, and is otherwise given by a smooth density, the prior Bayes risk is

ξ(0) P(|X| > c) + ∫ Q(ω) h(ω) P(|X| ≤ c | ω) dω,

where Q is the loss of incorrect acceptance.

This shows that the tails of the prior distribution are essentially irrelevant if the variance is at all small, so the prior probability of the null is NOT an important consideration. Only the ratio of the probability of the null to the density at the null of the alternative is important. This shows that the large-sample results of Rubin and Sethuraman () can be a good approximation for moderate, or even small, samples. It also shows how the "p-value" should change with the sample size. An expository paper is in preparation.

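A numerical sketch of the prior Bayes risk displayed above (the particular loss Q, density h, point mass ξ(0) and noise scale are my own illustrative choices, not Rubin's): the risk of the rule "accept when |X| ≤ c" is computed over a grid of cutoffs and minimized.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

sigma = 0.2          # standard deviation of X given omega (e.g., sigma0/sqrt(n))
xi0 = 0.5            # prior mass at the point null omega = 0
h = lambda w: norm.pdf(w, scale=2.0)     # smooth prior density off the null
Q = lambda w: w**2                       # loss of incorrectly accepting when omega = w

def prior_bayes_risk(c):
    # xi(0) * P(|X| > c | omega = 0): cost of false rejection (unit loss assumed)
    reject_null = xi0 * 2 * norm.sf(c, scale=sigma)
    # integral of Q(w) h(w) P(|X| <= c | omega = w) dw: cost of incorrect acceptance
    accept_alt, _ = quad(
        lambda w: Q(w) * h(w) * (norm.cdf(c, loc=w, scale=sigma)
                                 - norm.cdf(-c, loc=w, scale=sigma)),
        -np.inf, np.inf)
    return reject_null + accept_alt

cs = np.linspace(0.0, 1.5, 151)
risks = [prior_bayes_risk(c) for c in cs]
print("best cutoff c =", round(cs[int(np.argmin(risks))], 3))
```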

If the null is not a point null, this is still a good approximation if the width of the null is smaller than the standard deviation of the observations; if it is not, the problem is much harder, and the form of the loss-prior combination under the null becomes of considerable importance. Again, the prior Bayes approach shows where the problem lies, in the behavior of the loss-prior combination near the division point between acceptance and rejection. In "standard" units, a substantial part of the parameter space may be involved.

For another example, consider the problem of estimating an infinite dimensional normal mean with the covariance matrix a multiple of the identity, or the similar problem of estimating a spectral density function. The first problem was considered by Rubin (b), and was corrected in Hui Xu's thesis in . If one assumes the prior mean square of the kth mean, or the corresponding Fourier coefficient, is non-increasing, an empirical Bayes type result can be obtained, and can be shown to be asymptotically optimal in the class of "simple" procedures, which are the ones currently being used with more stringent assumptions, and which are generally not as good. The results are good even if the precise form is not known, while picking a kernel is much more restrictive and generally not even asymptotically optimal.

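A rough sketch of the flavor of such "simple" empirical Bayes rules for many normal means with non-increasing prior mean squares (this is my own blockwise shrinkage illustration, not the specific procedure of Rubin or Xu): each block of coordinates is shrunk by an estimated signal fraction.

```python
import numpy as np

rng = np.random.default_rng(7)
K, block = 5000, 50

k = np.arange(1, K + 1)
tau2 = 4.0 / k**1.2                       # non-increasing prior mean squares of the means
theta = rng.normal(scale=np.sqrt(tau2))   # the (truncated) infinite-dimensional mean
x = theta + rng.normal(size=K)            # observations with unit noise variance

# Blockwise empirical Bayes: within each block estimate E(X_k^2) and shrink toward
# zero by the estimated signal fraction max(mean(X^2) - 1, 0) / mean(X^2).
est = np.empty(K)
for start in range(0, K, block):
    xb = x[start:start + block]
    m2 = np.mean(xb**2)
    est[start:start + block] = max(m2 - 1.0, 0.0) / m2 * xb

print("risk of raw estimator      :", round(np.mean((x - theta)**2), 4))
print("risk of blockwise shrinkage:", round(np.mean((est - theta)**2), 4))
```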
For the latter problem, obtaining a reasonable prior seems to be extremely difficult, but the simple procedures obtained are likely to be rather good even if one is found. The positive definiteness of the Toeplitz matrix of the covariances is a difficult condition to work with.

I see a major set of applications to non-parametric problems, properly called infinite parametric, problems such as density estimation, testing with infinite dimensional alternatives, etc. With a large number of dimensions, the classical approach does not work well, and the only "simple Bayesian" approaches use priors which look highly constrained to me.

About the Author
"Professor Rubin has contributed in deep and original ways to statistical theory and philosophy. The statistical community has been vastly enriched by his contributions through his own research and through his influence, direct or indirect, on the research and thinking of others." (Eric Marchand and William Strawderman, A festschrift for Herman Rubin, A. DasGupta (ed.), p. .) "He is well known for his broad ranging mathematical research interests and for fundamental contributions in Bayesian decision theory, in set theory, in estimations for simultaneous equations, in probability and in asymptotic statistics." (Mary Ellen Bock, ibid., p. .) Professor Rubin is a Fellow of the IMS, and a Fellow of the AAAS (American Association for the Advancement of Science).

Cross References
7Bayesian Analysis or Evidence Based Statistics?
7Bayesian Statistics
7Bayesian Versus Frequentist Statistical Reasoning
7Model Selection
7Moderate Deviations
7Statistics: An Overview
7Statistics: Nelder's View

References and Further Reading
Rubin H (a) A weak system of axioms for "rational" behavior and the non-separability of utility from prior. Stat Decisions :–
Rubin H (b) Robustness in generalized ridge regression and related topics. Third Valencia Symp Bayesian Stat :–
Rubin H, Sethuraman J () Bayes risk efficiency. Sankhya A :–
von Neumann J, Morgenstern O () Theory of games and economic behavior. Princeton University Press, Princeton
Xu H () Some applications of the prior Bayes approach. Unpublished thesis

Probabilistic Network Models

Ove Frank
Professor Emeritus
Stockholm University, Stockholm, Sweden

A network on vertex set V is represented by a function y on the set V² of ordered vertex pairs. The function can be univariate or multivariate and its variables can be numerical or categorical. A graph G with vertex set V and edge set E in V² is represented by a binary function y = {(u, v, yuv) : (u, v) ∈ V²}, where yuv indicates whether (u, v) ∈ E. If V = {1, . . . , N} and vertices are ordered according to their labels, y can be given as an N by N adjacency matrix y = (yuv). Simple undirected graphs have yuv = yvu and yvv = 0 for all u and v. Colored graphs have a categorical variable y with the categories labeled by colors. Graphs with more general variables y are called valued graphs or networks. If Y is a random variable with outcomes y representing networks in a specified family of networks, the probability distribution induced on this family is a probabilistic network model, and Y is a representation of a random network.

butions or Bernoulli distributions. Uniformmodels assign


equal probabilities to all graphs in a specified finite family of graphs, such as all graphs of order N and size M, or all connected graphs of order N, or all trees of order N. Bernoulli graphs (Bernoulli digraphs) have edge indicators that are independent Bernoulli variables for all unordered (ordered) vertex pairs. There is an extensive literature on such random graphs, especially on the simplest Bernoulli(p) graph, which has a common edge probability p for all vertex pairs. An extension of a fixed graph G to a Bernoulli(G, α, β) graph is a Bernoulli graph that is obtained by independently removing edges in G with probability α and independently inserting edges in the complement of G with probability β. Such models have been applied to study reliability problems in communication networks. Attempts to model the web have recently contributed to an interest in random graph models with specified degree distributions and random graph processes for very large dynamically changing graphs.

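A small sketch of the Bernoulli(G, α, β) perturbation just described, for a simple undirected graph stored as an adjacency matrix (the base graph and the probabilities are arbitrary illustrations of mine):

```python
import numpy as np

def bernoulli_perturbation(adj, alpha, beta, rng):
    """Remove each edge of G with prob. alpha, insert each non-edge with prob. beta."""
    n = adj.shape[0]
    keep = rng.uniform(size=(n, n)) >= alpha          # survival of existing edges
    add = rng.uniform(size=(n, n)) < beta             # insertion in the complement
    new = np.where(adj == 1, keep, add).astype(int)
    new = np.triu(new, 1)                             # symmetrize, no self-loops
    return new + new.T

rng = np.random.default_rng(8)
n = 6
g = np.zeros((n, n), dtype=int)                       # a cycle on 6 vertices as G
for i in range(n):
    g[i, (i + 1) % n] = g[(i + 1) % n, i] = 1

print(bernoulli_perturbation(g, alpha=0.1, beta=0.05, rng=rng))
```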
The literature on social networks describes models for finite random digraphs on V = {1, . . . , N} in which the dyads (Yuv, Yvu) for u < v are independent and have probabilities that depend on parameters governing in- and out-edges of each vertex and mutual edges of each vertex pair. Special cases of such models with independent dyads are obtained by assuming homogeneity for the parameters of different vertices or different groups of vertices. Extensions to models with dependent dyads include Markov graphs that allow dependence between incident dyads. Other extensions are log-linear models that assume that the log-likelihood function is a linear function of specified network statistics chosen to reflect various properties of interest in the network.

Statistical analysis of network data comprises exploratory tools for selecting appropriate probabilistic network models as well as confirmatory tools for estimating and testing various models. Many of these tools use computer intensive methods.

Applications of probabilistic network models appear in many different areas in which relationships between the units studied are essential for an understanding of their properties and characteristics. The social and behavioral sciences have contributed to the development of many network models for the study of social interaction, friendship, dominance, co-operation and competition. There are applications to criminal networks and co-offending, communication and transportation networks, vaccination programs in epidemiology, information retrieval and organizational systems, particle systems in physics, and biometric cell systems. Random graphs and random fields are also theoretically developed in computer science, mathematics, and statistics. There is an exciting interplay between model development and new applications in a variety of important areas.

Many references to the literature on graphs, random graphs, and random networks are provided by the following sources.

About the Author
For biography see the entry 7Network Sampling.

Cross References
7Graphical Markov Models
7Network Models in Probability and Statistics
7Network Sampling
7Social Network Analysis
7Uniform Distribution in Statistics

References and Further Reading
Bonato A () A course on the web graph. American Mathematical Society, Providence
Carrington P, Scott J, Wasserman S (eds) () Models and methods in social network analysis. Cambridge University Press, New York
Diestel R () Graph theory. Springer, Berlin/Heidelberg
Durrett R () Random graph dynamics. Cambridge University Press, New York
Kolaczyk E () Statistical analysis of network data. Springer, New York
Meyers R (ed) () Encyclopedia of complexity and systems science. Springer, New York

Probability on Compact Lie Groups
David Applebaum
Professor, Head
University of Sheffield, Sheffield, UK

Introduction
Probability on groups enables us to study the interaction between chance and symmetry. In this article I'll focus on the case where symmetry is generated by continuous groups, specifically compact Lie groups. This class contains many examples such as the n-torus, the special orthogonal groups SO(n) and the special unitary groups SU(n), which are important in physics and engineering applications. It is also a very good context to demonstrate the key role played by non-commutative harmonic analysis via group representations. The classic treatise (Heyer ) by Heyer gives a systematic mathematical introduction to this topic, while


For motivation, let $\rho$ be a probability measure on the real line. Its characteristic function $\hat{\rho}$ is the Fourier transform $\hat{\rho}(u) = \int_{\mathbb{R}} e^{iux}\rho(dx)$, and $\hat{\rho}$ uniquely determines $\rho$. Note that the mappings $x \mapsto e^{iux}$ are the irreducible unitary representations of $\mathbb{R}$.

Now let $G$ be a compact Lie group and $\rho$ be a probability measure defined on $G$. The group law of $G$ will be written multiplicatively. If we are given a probability space $(\Omega, \mathcal{F}, P)$, then $\rho$ might be the law of a $G$-valued random variable defined on $\Omega$. The convolution of two such measures $\rho_1$ and $\rho_2$ is the unique probability measure $\rho_1 * \rho_2$ on $G$ such that

$$\int_G f(\sigma)\,(\rho_1 * \rho_2)(d\sigma) = \int_G \int_G f(\sigma\tau)\,\rho_1(d\sigma)\,\rho_2(d\tau),$$

for all continuous functions $f$ defined on $G$. If $X_1$ and $X_2$ are independent $G$-valued random variables with laws $\rho_1$ and $\rho_2$ (respectively), then $\rho_1 * \rho_2$ is the law of $X_1 X_2$.
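As a concrete illustration (not part of the original entry), the sketch below checks the convolution identity numerically on the simplest compact Lie group, the circle group U(1), whose group product corresponds to addition of angles modulo 2π. The von Mises and uniform laws, the test function, and all numerical settings are arbitrary choices made for the example.

```python
# Check on U(1): the average of f over samples of the product X1*X2 equals the
# iterated integral of f(sigma*tau) against rho1(d sigma) rho2(d tau).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200_000

rho1 = stats.vonmises(kappa=2.0)                       # an arbitrary law on U(1)
theta1 = rho1.rvs(size=n, random_state=rng) % (2 * np.pi)
theta2 = rng.uniform(0.0, np.pi / 2, size=n)           # uniform on [0, pi/2)

def f(t):                                              # continuous test function
    return np.cos(t)

# Left-hand side: Monte Carlo average over the law of the product X1*X2
lhs = f((theta1 + theta2) % (2 * np.pi)).mean()

# Right-hand side: iterated integral on a grid of angles (Riemann sum)
sig = np.linspace(0.0, 2 * np.pi, 800, endpoint=False)
tau = np.linspace(0.0, np.pi / 2, 400, endpoint=False)
S, T = np.meshgrid(sig, tau, indexing="ij")
pdf1 = rho1.pdf(((S + np.pi) % (2 * np.pi)) - np.pi)   # periodic von Mises density
integrand = f((S + T) % (2 * np.pi)) * pdf1 * (2 / np.pi)
rhs = integrand.mean() * (2 * np.pi) * (np.pi / 2)

print(f"Monte Carlo: {lhs:.4f}   numerical integral: {rhs:.4f}")
```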

Characteristic Functions
Let $\hat{G}$ be the set of all irreducible unitary representations of $G$. Since $G$ is compact, $\hat{G}$ is countable. For each $\pi \in \hat{G}$ and $\sigma \in G$, $\pi(\sigma)$ is a unitary (square) matrix acting on a finite-dimensional complex inner product space $V_\pi$ having dimension $d_\pi$. Every group has the trivial representation $\delta$ acting on $\mathbb{C}$ by $\delta(\sigma) = 1$ for all $\sigma \in G$. The characteristic function of the probability measure $\rho$ is the matrix-valued function $\hat{\rho}$ on $\hat{G}$ defined uniquely by

$$\langle u, \hat{\rho}(\pi)v \rangle = \int_G \langle u, \pi(\tau)v \rangle\,\rho(d\tau),$$

for all $u, v \in V_\pi$. $\hat{\rho}$ has a number of desirable properties (Siebert ). It determines $\rho$ uniquely, and for all $\pi \in \hat{G}$:

$$\widehat{\rho_1 * \rho_2}(\pi) = \hat{\rho}_1(\pi)\,\hat{\rho}_2(\pi).$$

In particular $\hat{\rho}(\delta) = 1$.

Lo and Ng () considered a family of matrices $(C_\pi, \pi \in \hat{G})$ and asked when there is a probability measure $\rho$ on $G$ such that $C_\pi = \hat{\rho}(\pi)$. They found a necessary and sufficient condition to be that $C_\delta = 1$ and that the following non-negativity condition holds: for all families of matrices $(B_\pi, \pi \in \hat{G})$, where $B_\pi$ acts on $V_\pi$, and for which $\sum_{\pi \in S} d_\pi \operatorname{tr}(\pi(\sigma)B_\pi) \ge 0$ for all $\sigma \in G$ and all finite subsets $S$ of $\hat{G}$, we must have $\sum_{\pi \in S} d_\pi \operatorname{tr}(\pi(\sigma)C_\pi B_\pi) \ge 0$.

Densities
Every compact group has a bi-invariant finite Haar measure which plays the role of Lebesgue measure on $\mathbb{R}^d$ and which is unique up to multiplication by a positive real number. It is convenient to normalise this measure (so that it has total mass 1) and denote it by $d\tau$ inside integrals of functions of $\tau$. We say that a probability measure $\rho$ has a density $f$ if $\rho(A) = \int_A f(\tau)\,d\tau$ for all Borel sets $A$ in $G$. To investigate the existence of densities we need the Peter–Weyl theorem, which states that the functions $\{\sqrt{d_\pi}\,\pi_{ij};\ 1 \le i, j \le d_\pi,\ \pi \in \hat{G}\}$ form a complete orthonormal basis for $L^2(G, \mathbb{C})$. So any $f \in L^2(G, \mathbb{C})$ can be written

$$f(\sigma) = \sum_{\pi \in \hat{G}} d_\pi \operatorname{tr}\big(\pi(\sigma)\hat{f}(\pi)\big), \qquad (1)$$

where $\hat{f}(\pi) = \int_G f(\tau)\pi(\tau^{-1})\,d\tau$ is the Fourier transform. In Applebaum () it was shown that $\rho$ has a square-integrable density $f$ (which then has an expansion as in (1)) if and only if $\sum_{\pi \in \hat{G}} d_\pi \operatorname{tr}\big(\hat{\rho}(\pi)\hat{\rho}(\pi)^*\big) < \infty$, where $*$ is the usual matrix adjoint. A sufficient condition for $\rho$ to have a continuous density is that $\sum_{\pi \in \hat{G}} d_\pi^{3/2} \big[\operatorname{tr}\big(\hat{\rho}(\pi)\hat{\rho}(\pi)^*\big)\big]^{1/2} < \infty$, in which case the series on the right-hand side of (1) converges absolutely and uniformly (see Faraut ()).
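On the circle group U(1) the unitary dual is $\mathbb{Z}$, every $d_\pi = 1$, and the Fourier transform reduces to ordinary Fourier coefficients, so expansion (1) and the square-integrability criterion can be checked numerically. The following is a hypothetical illustration, not from the original entry; the von Mises law, grid, and truncation order are arbitrary.

```python
# Peter-Weyl expansion (1) on U(1): f(theta) = sum_n rho_hat(n) exp(i n theta),
# where rho_hat(n) = E[exp(-i n Theta)], and sum_n |rho_hat(n)|^2 < infinity.
import numpy as np
from scipy import stats

rho_pdf = stats.vonmises(kappa=2.0).pdf                 # density w.r.t. d(theta)
theta_grid = np.linspace(-np.pi, np.pi, 4000, endpoint=False)
dtheta = theta_grid[1] - theta_grid[0]
w = rho_pdf(theta_grid)

def rho_hat(n):
    return np.sum(np.exp(-1j * n * theta_grid) * w) * dtheta

ns = np.arange(-40, 41)
coeffs = np.array([rho_hat(n) for n in ns])

# Square-integrability criterion (truncated): finite, so an L^2 density exists
print("sum |rho_hat(n)|^2 ≈", np.sum(np.abs(coeffs) ** 2).round(4))

# Reconstruct the density w.r.t. normalised Haar measure at a test angle
theta0 = 0.7
f_series = np.real(np.sum(coeffs * np.exp(1j * ns * theta0)))
f_exact = 2 * np.pi * rho_pdf(theta0)                   # rescale Lebesgue -> Haar
print(f"series: {f_series:.4f}   exact: {f_exact:.4f}")
```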

Conjugate Invariant Probabilities
Many interesting examples of probability measures are conjugate invariant, i.e., $\rho(\sigma A \sigma^{-1}) = \rho(A)$ for all $\sigma \in G$. In this case there exists $c_\pi \in \mathbb{C}$ such that $\hat{\rho}(\pi) = c_\pi I_\pi$, where $I_\pi$ is the identity matrix on $V_\pi$ (Said et al. ). If a density exists it takes the form $f(\sigma) = \sum_{\pi \in \hat{G}} d_\pi c_\pi \chi_\pi(\sigma)$, where $\chi_\pi(\sigma) = \operatorname{tr}(\pi(\sigma))$ is the group character.

Example 1: Gauss Measure. Here $c_\pi = e^{\sigma^2 \kappa_\pi}$, where $\kappa_\pi \le 0$ is the eigenvalue of the group Laplacian corresponding to the Casimir operator $\kappa_\pi I_\pi$ on $V_\pi$, and $\sigma > 0$. For example, if $G = SU(2)$ then it can be parametrized by the Euler angles $\psi$, $\phi$ and $\theta$; $\hat{G} \cong \mathbb{Z}_+$, $\kappa_m = -m(m+2)$, and we have a continuous density depending only on $0 \le \theta \le \pi$:

$$f(\theta) = \sum_{m=0}^{\infty} (m+1)\, e^{-\sigma^2 m(m+2)}\, \frac{\sin((m+1)\theta)}{\sin(\theta)}.$$
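As a quick sanity check (added here, not from the original entry, and using the series as reconstructed above), one can truncate the sum numerically and verify that the resulting class function integrates to one against the image of Haar measure on the class angle, which for SU(2) has weight $(2/\pi)\sin^2\theta$ on $[0,\pi]$; the value of $\sigma^2$ and the truncation order are arbitrary.

```python
# Truncated SU(2) heat-kernel series: evaluate f(theta) and check that it
# integrates to 1 against the Weyl weight (2/pi) * sin(theta)^2 on [0, pi].
import numpy as np

sigma2 = 0.3                      # "time" parameter sigma^2 (arbitrary choice)
M = 60                            # truncation order (arbitrary choice)

theta = np.linspace(1e-6, np.pi - 1e-6, 20000)
m = np.arange(0, M + 1)[:, None]

f = np.sum((m + 1) * np.exp(-sigma2 * m * (m + 2))
           * np.sin((m + 1) * theta) / np.sin(theta), axis=0)

weight = (2 / np.pi) * np.sin(theta) ** 2
total = np.sum(f * weight) * (theta[1] - theta[0])
print("integral against Haar measure ≈", round(total, 4))   # ≈ 1
print("density non-negative on the grid:", bool(np.all(f > -1e-9)))
```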

Example 2: Laplace Distribution. This is a generalization of the double exponential distribution on $\mathbb{R}$ (with equal parameters). In this case $c_\pi = (1 - \beta\kappa_\pi)^{-1}$, where $\beta > 0$ and $\kappa_\pi$ is as above.

Infinite Divisibility
A probability measure $\rho$ on $G$ is infinitely divisible if for each $n \in \mathbb{N}$ there exists a probability measure $\rho_n$ on $G$ such that the $n$th convolution power satisfies $(\rho_n)^{*n} = \rho$. Equivalently, $\hat{\rho}(\pi) = \hat{\rho}_n(\pi)^n$ for all $\pi \in \hat{G}$. If $G$ is connected as well as compact, any such $\rho$ can be realised as $\mu_1$ in a weakly continuous convolution semigroup of probability measures $(\mu_t, t \ge 0)$. For a general Lie group such an embedding may not be possible, and the investigation of this question has generated much research over many years (McCrudden ). The structure of convolution semigroups has been intensely analyzed. These give the laws of group-valued ▶Lévy processes, i.e., processes with stationary and independent increments (Liao ). In particular there is a Lévy–Khintchine type formula (originally due to G.A. Hunt) which classifies these in terms of the structure of the infinitesimal generator of the associated Markov semigroup that acts on the space of continuous functions. One of the most important examples is Brownian motion (see ▶Brownian Motion and Diffusions), and this has a Gaussian distribution. Another important example is the compound Poisson process (see ▶Poisson Processes)

$$Y(t) = X_1 X_2 \cdots X_{N(t)}, \qquad (2)$$

where $(X_n, n \in \mathbb{N})$ is a sequence of i.i.d. random variables having common law $\nu$ (say) and $(N(t), t \ge 0)$ is an independent Poisson process of intensity $\lambda > 0$. In this case $\mu_t = \sum_{n=0}^{\infty} e^{-\lambda t}\frac{(\lambda t)^n}{n!}\nu^{*n}$. Note that $\mu_t$ does not have a density, and it is conjugate invariant if $\nu$ is.
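The compound Poisson construction (2) is easy to simulate. The sketch below (a hypothetical example, not from the original entry) works again on the circle group U(1), where the product of the jumps is the sum of their angles modulo 2π; on U(1) the characteristic function of $\mu_t$ is $\exp(\lambda t(\hat{\nu}(k)-1))$, which the simulation reproduces. The intensity, time horizon, and jump law are arbitrary choices.

```python
# Compound Poisson process (2) on U(1): Y(t) = X_1 * ... * X_{N(t)}, with the
# group product realised as addition of angles modulo 2*pi.
import numpy as np

rng = np.random.default_rng(1)
lam, t, a = 2.0, 1.5, 1.0                 # intensity, time, jump-law parameter
n_paths = 100_000

N = rng.poisson(lam * t, size=n_paths)    # number of jumps by time t
Y = np.array([rng.uniform(0.0, a, size=k).sum() % (2 * np.pi) for k in N])

k = 1                                     # a representation of U(1)
emp = np.mean(np.exp(1j * k * Y))                       # empirical mu_t-hat(k)
nu_hat = (np.exp(1j * k * a) - 1) / (1j * k * a)        # nu = uniform on [0, a]
theo = np.exp(lam * t * (nu_hat - 1))
print(f"empirical: {emp:.4f}   theoretical: {theo:.4f}")
```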

Applications
There is intense interest among statisticians and engineers in the deconvolution problem on groups. The problem is to estimate the signal density $f_X$ from the observed density $f_Y$ when the former is corrupted by independent noise having density $f_\epsilon$, so the model is $Y = X\epsilon$ and the inverse problem is to untangle $f_Y = f_X * f_\epsilon$. Inverting the characteristic function enables the construction of non-parametric estimators for $f_X$, and optimal rates of convergence are known for these when $f_\epsilon$ has certain smoothness properties (Kim and Richards ; Koo and Kim ). In Said et al. () the authors consider the problem of decompounding, i.e., obtaining non-parametric estimates of the density of $X_1$ in (2) based on i.i.d. observations of a noisy version of $Y$: $Z(t) = \epsilon Y(t)$, where $\epsilon$ is independent of $Y(t)$. This is applied to multiple scattering of waves from complex media by working with the group SO(3), which acts as rotations on the sphere.

About the Author
David Applebaum is Professor of Probability and Statistics and is currently Head of the Department of Probability and Statistics, University of Sheffield, UK. He has published numerous research papers and is the author of two books, Probability and Information (second edition, Cambridge University Press) and Lévy Processes and Stochastic Calculus (second edition, Cambridge University Press). He is joint managing editor of the Tbilisi Mathematical Journal and an associate editor for Bernoulli, Methodology and Computing in Applied Probability, Journal of Stochastic Analysis and Applications, and Communications on Stochastic Analysis. He is a member of the Atlantis Press Advisory Board for Probability and Statistics Studies.

Cross References
▶Brownian Motion and Diffusions
▶Characteristic Functions
▶Lévy Processes
▶Poisson Processes

References and Further Reading
Applebaum D () Probability measures on compact groups which have square-integrable densities. Bull Lond Math Soc :–
Diaconis P () Group representations in probability and statistics. Lecture Notes–Monograph Series. Institute of Mathematical Statistics, Hayward
Faraut J () Analysis on Lie groups. Cambridge University Press, Cambridge
Heyer H () Probability measures on locally compact groups. Springer, Berlin/Heidelberg
Kim PT, Richards DS () Deconvolution density estimators on compact Lie groups. Contemp Math :–
Koo J-Y, Kim PT () Asymptotic minimax bounds for stochastic deconvolution over groups. IEEE Trans Inf Theory :–
Liao M () Lévy processes in Lie groups. Cambridge University Press, Cambridge
Lo JT-H, Ng S-K () Characterizing Fourier series representations of probability distributions on compact Lie groups. SIAM J Appl Math :–
McCrudden M () An introduction to the embedding problem for probabilities on locally compact groups. In: Hilgert J, Lawson JD, Neeb K-H, Vinberg EB (eds) Positivity in Lie theory: open problems. Walter de Gruyter, Berlin/New York, pp –
Said S, Lageman C, Le Bihan N, Manton JH () Decompounding on compact Lie groups. IEEE Trans Inf Theory ():–
Siebert E () Fourier analysis and limit theorems for convolution semigroups on a locally compact group. Adv Math :–

Probability Theory: An Outline

Tamás Rudas
Professor, Head of Department of Statistics, Faculty of Social Sciences
Eötvös Loránd University, Budapest, Hungary

Sources of Uncertainty in Statistics
Statistics is often defined as the science of the methods of data collection and analysis, but from a somewhat more conceptual perspective, statistics is also the science of methods for dealing with uncertainty. The sources of uncertainty in statistics may be divided into two groups: uncertainty associated with the data one has, and uncertainty associated with the mechanism which produced the data. These kinds of uncertainty are often interrelated in practice, yet it is useful to distinguish them.

Uncertainties in Data
One may distinguish two main sources of uncertainty with respect to data: one is related to measurement error, the other to sampling error.

Measurement error is the difference between the measurement actually obtained and the true value of what was measured. This applies both to cases when a numerical measurement is taken (typical in the physical, biological, medical, and psychological sciences) and to qualitative observations in which subjects are classified (typical in the social and behavioral sciences), except that in the latter case, instead of the numerical value of the error, its presence or absence is considered. In the former case, it is often observed that the errors are approximately normally distributed, even if very precise and expensive measuring instruments are used. In the latter case, the existence of measurement error, that is, misclassification, is often attributed to the self-reflection of the human beings observed. In both situations, the lack of understanding of the precise mechanisms behind measurement errors suggests applying a stochastic model in which the result of the measurement is the sum of the true value and a random error.

Sampling error stems from the uncertainty about how our results would differ if a sample different from the actual one were observed. It is usually associated with the entire sample (and not with the individual observations) and is measured as the difference between the estimates obtained from the actual sample and the census value that could be obtained if the same data collection methods were applied to the entire population. The census value may or may not be equal to the true population parameter of interest. For example, if the measurement error is assumed to be constant, then the census value differs from the population value by this quantity. Usually, the likely size of the sampling error is characterized by the standard deviation (standard error) of the estimates. Under many common random sampling schemes, the distribution of the estimates is approximately normal, and the choice of the standard error as a characterizing quantity is well justified.
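A small simulation (added here for illustration, not part of the original entry) makes the idea of sampling error concrete: repeated samples from the same population give different estimates, and their spread matches the usual standard-error formula. The population distribution and sample sizes are arbitrary choices.

```python
# Sampling error: the spread of sample means over repeated samples is close to
# the standard error sigma / sqrt(n).
import numpy as np

rng = np.random.default_rng(42)
n, reps = 100, 5_000
shape, scale = 2.0, 3.0                      # a hypothetical skewed population

samples = rng.gamma(shape, scale, size=(reps, n))
means = samples.mean(axis=1)                 # one estimate per repeated sample

print("spread of the repeated estimates:", means.std(ddof=1).round(3))
print("standard error sigma/sqrt(n):   ",
      (np.sqrt(shape) * scale / np.sqrt(n)).round(3))
```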

Uncertainties in Modeling
While uncertainties associated with the data seem to be inherent characteristics, uncertainties related to modeling are determined more by our choice of models, which very often depends on the existing knowledge regarding the research problem at hand. The most common assumption is that the quantity of interest has a specified distribution, unknown to the researcher, which may or may not be assumed to belong to some parametric family of distributions. A further possible choice, which has gained increasing popularity in recent decades, is that the distribution of interest belongs to a parametric family, though not with a specified parameter value; rather, it is characterized by a probability distribution (the prior distribution) on the possible parameter values. The former view is adopted in frequentist statistics, and the latter is the Bayesian approach to statistics.

To model uncertainty, frequentist statistics uses frequentist or classical probability theory, while ▶Bayesian statistics often relies on a subjective concept of probability.

Classical and Frequentist Probability
Historically, there are two sources of modern probability theory. One is the theory of gambling, where the main goal was to determine how probable certain outcomes were in a game of chance. These problems could be handled appropriately under the assumption that all possible outcomes of an experiment (rolling a die, for example) are equally likely, so that probabilities could be determined as the ratio of the number of outcomes with a certain characteristic to the total number of outcomes. This interpretation of probability is called classical probability. Questions related to gambling also made important contributions to developing the concepts of Boolean algebra (the algebra of events associated with an experiment), conditional probability, and infinite sequences of random variables (which play an important role in the frequentist interpretation of probability; see below). The other source of modern probability theory is the analysis of errors associated with measurement. This led, among other things, to an understanding of the central role played by the normal distribution.

It is remarkable that the main concepts and results of these two apparently very different fields may all be based on one set of axioms, proposed by Kolmogorov and given in the next section.

The Kolmogorov Axioms
The axioms, summarizing the concepts developed within the classical and frequentist approaches, apply to experiments that may be repeated infinitely many times, where all circumstances of the experiment are supposed to remain constant. An experiment may be identified with its possible outcomes. Certain subsets of outcomes are called events, with the assumption that no outcome (the impossible event) and all the outcomes (the certain event) are events, and that countable unions or intersections of events are also events. This means that the set of events associated with an experiment forms a sigma-field. Then the Kolmogorov axioms (basic assumptions that are accepted to be true) are the following:

1. For any event A, its probability satisfies P(A) ≥ 0.
2. For the certain event Ω, P(Ω) = 1.
3. For a sequence A₁, A₂, . . . of pairwise disjoint events, P(⋃ᵢ Aᵢ) = Σᵢ P(Aᵢ).

The heuristic interpretation is that the probability of an event manifests itself via the relative frequency of this event over long series of repetitions of the experiment. This is why this approach to probability is often called frequentist probability. Indeed, the axioms are true for relative frequencies instead of probabilities.
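The last remark, that relative frequencies themselves satisfy the axioms, can be seen in a toy simulation (added here for illustration, not part of the original text); the die and the chosen events are arbitrary.

```python
# Relative frequencies of simulated die rolls obey the three axioms:
# non-negativity, total mass one, and additivity for disjoint events.
import numpy as np

rng = np.random.default_rng(7)
rolls = rng.integers(1, 7, size=100_000)          # fair six-sided die

def rel_freq(event):                              # event = a set of outcomes
    return np.isin(rolls, list(event)).mean()

A, B = {1, 2}, {5}                                # two disjoint events
print("non-negative:", rel_freq(A) >= 0 and rel_freq(B) >= 0)
print("certain event frequency:", rel_freq({1, 2, 3, 4, 5, 6}))
print("additivity holds:", np.isclose(rel_freq(A | B), rel_freq(A) + rel_freq(B)))
```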

The Laws of Large Numbers
The link between the heuristic notion of probability and the mathematical theory of probability is established by the result that if fₙ(A) denotes the frequency of event A after n repetitions of an experiment, then

$$f_n(A)/n \to P(A),$$

where the convergence → may be given various interpretations. More generally, if X is a random variable (that is, a function such that {X ∈ I} is an event for every interval I), then for the average X̄ₙ of n independent observations of X,

$$\bar{X}_n \to E(X),$$

where E(X) is the expected value of X. Here, convergence is in probability (weak law) or almost surely (strong law) (see also ▶Laws of Large Numbers).
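A short simulation (added for illustration, not part of the original text) shows the running averages approaching the expectation; the exponential distribution used is an arbitrary choice.

```python
# Law of large numbers: running averages of i.i.d. observations approach E(X).
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)          # E(X) = 2 for this choice
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)

for n in (10, 1_000, 100_000):
    print(f"n = {n:>7}: running mean = {running_mean[n - 1]:.4f}")
print("expected value E(X) = 2.0")
```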

The Central Limit Theorem
This fundamental result explains why the normal distribution plays such a central role in statistics. Many statistics are sample averages, and for their asymptotic distributions the following result holds. If V(X) denotes the variance of X, then the asymptotic distribution of

$$\frac{\bar{X}_n - E(X)}{\sqrt{V(X)/n}}$$

is standard normal (see also ▶Central Limit Theorems).
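The following small simulation (added for illustration, not part of the original text) compares quantiles of standardized sample means with those of the standard normal distribution; the exponential parent distribution and sample size are arbitrary choices.

```python
# Central limit theorem: standardized sample means of a skewed distribution
# are approximately standard normal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 50, 20_000
x = rng.exponential(scale=1.0, size=(reps, n))     # E(X) = 1, V(X) = 1

z = (x.mean(axis=1) - 1.0) / np.sqrt(1.0 / n)      # standardized sample means
for p in (0.05, 0.5, 0.95):
    print(f"p = {p}: empirical {np.quantile(z, p):+.3f}  "
          f"normal {stats.norm.ppf(p):+.3f}")
```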

Subjective Probability
This interpretation of the concept of probability associates it with the strength of trust or belief that a person has in the occurrence of an event. Such beliefs manifest themselves, for example, in betting preferences: out of two events, a rational person would have a betting preference for the one with which he or she associates a larger subjective probability. A fundamental difference between frequentist and subjective probability is that the latter may also be applied to experiments and events that cannot be repeated many times. Of course, the subjective probabilities of different individuals may be drastically different from each other, and it has been demonstrated repeatedly that the subjective probabilities an individual associates with different events may not be logically consistent. Bayesian statistics sometimes employs the elicitation of such subjective probabilities to construct a prior distribution.

About the Author
Professor Rudas was the Founding Dean of the Faculty of Social Sciences, Eötvös Loránd University. He is also an Affiliate Professor in the Department of Statistics, University of Washington, Seattle, and a Recurrent Visiting Professor at the Central European University, Budapest. He is Vice President of the European Association of Methodology. Professor Rudas has been awarded the Erdei Prize of the Hungarian Sociological Association for the application of statistical methods in sociology, and the Golden Memorial Medal of the Eötvös Loránd University.

Cross References
▶Axioms of Probability
▶Bayesian Analysis or Evidence Based Statistics?
▶Bayesian Versus Frequentist Statistical Reasoning
▶Bayesian vs. Classical Point Estimation: A Comparative Overview
▶Central Limit Theorems
▶Convergence of Random Variables
▶Foundations of Probability
▶Fuzzy Set Theory and Probability Theory: What is the Relationship?
▶Laws of Large Numbers
▶Limit Theorems of Probability Theory
▶Measure Theory in Probability
▶Philosophy of Probability
▶Probability, History of
▶Statistics and Gambling

References and Further Reading
Billingsley P () Probability and measure, 3rd edn. Wiley, New York
Kolmogorov AN () Foundations of the theory of probability. Chelsea, New York (Original work: Grundbegriffe der Wahrscheinlichkeitsrechnung, Berlin: Springer-Verlag)
Rudas T (ed) () Handbook of probability: theory and applications. Sage, Thousand Oaks

Probability, History of

Jordi Vallverdú
Universitat Autònoma de Barcelona, Barcelona, Spain

Five thousand years ago dice were invented in India (David ). This fact implies that their users had at least a common-sense approach to the idea of probability. Those dice were not the contemporary cubical standard dice, but fruit stones or animal bones (Dandoy ). They must surely have been used for fun and gambling as well as for fortune-telling practices. Worries about the future and the absurd idea that the world was causally guided by supernatural forces led those people to a belief in the explanatory power of rolling dice.

In fact, cosmogonical answers were the first attempt to explain in a causal way the existence of things and beings. The Greek creation myth involved a game of dice between Zeus, Poseidon, and Hades. Also in the classic Hindu book Mahabharata (section "Sabha-parva") we can find the use of dice for gambling. But in both cases there is no theory regarding probability in dice, just their use "for fun."

Later, and beyond myths, Aristotle was the strongest defender of the causal and empirical approach to reality (Physics, II), although he considered the possibility of chance, especially the problem of the game of dice (On the Heavens, II), and the probabilities implied in it. These ideas had nothing to do with those about atomistic chance by Leucippus and Democritus, nor with Lucretius' controversial theory of the clinamen. Hald () affirms the existence of probabilistic rather than mathematical thought in Classical Antiquity; we can accept that some authors (like Aristotle) were worried about the idea of chance (as well as about the primordial emptiness and other conceptual culs-de-sac), but they made no formal analysis of it. Later, we can find traces of interest in the moral aspects of gambling with dice in Talmudic (Babylonian Talmud, Tract Sanhedrin, chapter III, Mishnas I to III) and Rabbinical texts, and we know that Bishop Wibolf of Cambrai calculated diverse ways of playing with three dice. De Vetula, a Latin poem from the thirteenth century, tells us of the number of possibilities. But the first occurrence of combinatorics per se arose from Chinese interest in future prediction through the hexagrams of the I Ching (previously eight trigrams derived from four binary combinations of two elemental forces, yin and yang).

Luca Paccioli defined the basic principles of algebra and multiplication tables in his book Summa de arithmetica, geometria, proportioni e proportionalita. He posed the first serious statistical problem, of two men playing a game called "balla," which is to end when one of them has won six rounds. However, when they stop playing, A has only won five rounds and B three. How should they divide the wager? It would be more than a century before this problem was solved.

Girolamo Cardano wrote the books Ars magna (the great art) and Liber de ludo aleae (the book on games of chance). This was the first attempt to use mathematics to describe statistics and probability, and it accurately described the probabilities of throwing various numbers with dice. Galileo expanded on this by calculating probabilities using two dice. At the same time, the quantification of all aspects of daily life (art, music, time, space) made possible the numerical analysis of nature and, consequently, the discovery of the distribution of events and their rules (Crosby ).

It was finally Blaise Pascal who refined the theories of statistics and, with Pierre de Fermat, solved the "balla" problem of Paccioli (Devlin ). All this paved the way for modern statistics, which essentially began with the use of actuarial tables to determine insurance for merchant ships (Hacking , ). Pascal was also the first to apply probability studies to the theory of decision (see his Pensées), curiously, in the field of religious decisions. It is at this historical moment that the Latin term "probabilis" acquires its present meaning, evolving from "worthy of approbation" to "numerical assessment of likelihood on a determined scale" (Moussy ).

Antoine Arnauld and Pierre Nicole then published the influential La logique ou l'art de penser, where we can find statistical probabilities. Games and their statistical roots worried people like Cardano, Pascal, Fermat, and Huygens (Weatherford ), although all of them were immersed in a strict mechanistic paradigm. Huygens is considered the first scientist interested in scientific probability, and he published De ratiociniis in ludo aleae. Pierre Raymond de Montmort later published his Essay d'analyse sur les jeux de hazard, probably the first comprehensive text on probability theory. It was the next step after Pascal's work on combinatorics and its application to the solution of problems on games of chance. Later, De Moivre wrote the influential De mensura sortis, and Laplace subsequently published his Philosophical Essay About Probability.


Daniel Bernoulli (Jacob Bernoulli's nephew) developed the idea of utility as the mathematical combination of the quantity and the perception of risk. Gottfried Leibniz, at the beginning of the eighteenth century, argued in several of his writings against the idea of chance, defending deterministic theories. According to him, chance was not part of the true nature of reality but the result of our incomplete knowledge. In this sense, probability is the estimation of facts that could be completely known and predicted, not the basic nature of things. Even morality was guided by natural laws, as Immanuel Kant argued in his Foundations of the Metaphysics of Morals.

An influential paper written by the Reverend Thomas Bayes was published posthumously. Richard Price, who was a friend of his, worked on the results of his efforts to find the solution to the problem of computing a distribution for the parameter of a ▶binomial distribution: An Essay towards solving a Problem in the Doctrine of Chances. One proposition in the essay represented the main result of Bayes. Degrees of belief are therein considered as a basis for statistical practice. In a nutshell, Bayes proposed a theorem in which "probability" is defined as an index of subjective confidence, at the same time taking into account the relationships that exist within an array of simple and conditional probabilities. ▶Bayes' theorem is a tool for assessing how probable evidence can make a given hypothesis (Swinburne ). So, we can revise predictions in the light of relevant evidence and make a Bayesian inference, based on the assignment of some a priori distribution of a parameter under investigation (Stigler ). The classical formula of Bayes' rule is:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)},$$

where our posterior belief P(A | B) is calculated by multiplying our prior belief P(A) by the likelihood P(B | A) that B will occur if A is true. This classical version of Bayesianism had a long history, beginning with Bayes and continuing through Laplace to Jeffreys, Keynes, and Carnap in the twentieth century. Later, a new type of Bayesianism appeared, the "subjective Bayesianism" of Ramsey and De Finetti (Ramsey ; de Finetti ; Savage ).
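To make the mechanics of the rule concrete, here is a small numerical illustration added to this article (the prevalence and test accuracies are invented for the example).

```python
# Bayes' rule with invented numbers: updating the probability of a condition A
# after observing a positive test result B.
p_A = 0.01                      # prior P(A)
p_B_given_A = 0.95              # likelihood P(B | A)
p_B_given_notA = 0.05           # false-positive rate P(B | not A)

p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)    # total probability
p_A_given_B = p_B_given_A * p_A / p_B                   # posterior P(A | B)

print(f"posterior P(A | B) = {p_A_given_B:.3f}")        # ≈ 0.161
```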

At the end of the nineteenth century, a lot of things were changing in the scientific and philosophical arena. The end of the idea of "causality" and the conflicts about observation lay at the heart of the debate. Gödel attacked Hilbert's axiomatic approach to mathematics, and Bertrand Russell, as clever as ever, told us: "The law of causality (...) is a relic of a bygone age, surviving, like the monarchy, only because it is erroneously supposed to do no harm (...) The principle 'same cause, same effect,' which philosophers imagine to be vital to science, is therefore utterly otiose" (Suppes ). Nevertheless, scientists like Einstein were reluctant to accept the loss of determinism in favor of a purely random Universe; Einstein's words "God does not play dice" are an example of the difficulty of considering the whole world as a world of probabilities, with no inner intentionality nor moral direction. On the other hand, scientists like Monod (Chance and Necessity) accepted this situation. In both cases, there is a deep consideration of the role of probability and chance in the construction of philosophical and scientific meaning about reality.

In the early twentieth century there arose, from the works of Fisher () and Neyman and Pearson (), the classic statistical paradigm: frequentism. It uses the relative frequency concept, that is, you must perform one experiment many times and measure the proportion where you get a positive result. This proportion, if you perform the experiment enough times, is the probability. While Neyman and Pearson wrote their first joint paper and presented their approach as one among alternatives, Fisher, with his null hypothesis testing, gave a different message: his statistics was the formal solution of the problem of inductive inference (Gigerenzer ).

From then on, these two main schools, Bayesian and frequentist, fought each other to demonstrate that theirs was the superior and only valid approach (Vallverdú ).

Finally, with the advent of the information era and all the (super)computer scientific simulations, Bayesianism has again achieved a higher status inside the community of experts on probability. Bayesian inference also allows intelligent and real-time monitoring of computational clusters, and its application in belief networks has proved to be a good technique for diagnosis, forecasting, and decision analysis tasks. This fact has contributed to the increasing application of parallel techniques for large Bayesian networks in expert systems (automated causal discovery, AI, and so on) (Korb and Nicholson ).

Acknowledgments
This research was supported by the project "El diseño del espacio en entornos de cognición distribuida: plantillas y affordances," MCI [FFI-/FISO].

About the Author
Jordi Vallverdú is a lecturer professor of philosophy in science and computing at Universitat Autònoma de Barcelona. He holds a Ph.D. in philosophy of science (UAB) and a master's degree in history of sciences (UAB). After a short research stay as a fellowship researcher at the Glaxo-Wellcome Institute for the History of Medicine, London, and as a research assistant of Dr. Jasanoff at the J.F.K. School of Government, Harvard University, he worked on computing epistemology issues, bioethics, and synthetic emotions. He is listed as an EU Biosociety Research Expert and is a member of the E-CAP Steering Committee (http://www.ia-cap.org/administration.php). He leads a research group, SETE (Synthetic Emotions in Technological Environments), which has published on computational models of synthetic emotions and their implementation in social robotic systems. He is Editor-in-Chief of the International Journal of Synthetic Emotions (IJSE) and has edited (and written as an included author) the following books: Handbook of research on synthetic emotions and sociable robotics: new applications in affective computing and artificial intelligence (with D. Casacuberta, Information Science Publishing) and Thinking machines and the philosophy of computer science: concepts and principles (Ed.).

Cross References
▶Actuarial Methods
▶Bayes' Theorem
▶Bayesian Analysis or Evidence Based Statistics?
▶Bayesian Versus Frequentist Statistical Reasoning
▶Bayesian vs. Classical Point Estimation: A Comparative Overview
▶Foundations of Probability
▶Philosophy of Probability
▶Probability Theory: An Outline
▶Statistics and Gambling
▶Statistics, History of

References and Further Reading
Crosby AW () The measure of reality: quantification in Western Europe. Cambridge University Press, Cambridge
Dandoy JR () Astragali through time. In: Maltby M (ed) Integrating zooarchaeology. Oxbow Books, Oxford, pp –
David FN () Games, gods and gambling, a history of probability and statistical ideas. Dover, Mineola
De Finetti B () La prévision: ses lois logiques, ses sources subjectives. Annales de l'Institut Henri Poincaré :–
Devlin K () The unfinished game: Pascal, Fermat, and the seventeenth-century letter that made the world modern. Basic Books, New York
Fisher RA () On the mathematical foundations of theoretical statistics. Philos Trans Roy Soc London Ser A :–
Gigerenzer G et al () The empire of chance: how probability changed science and everyday life. Cambridge University Press, Cambridge
Hacking I () The emergence of probability: a philosophical study of early ideas about probability, induction and statistical inference. Cambridge University Press, Cambridge
Hacking I () The taming of chance. Cambridge University Press, Cambridge
Hald A () A history of probability and statistics and their applications. Wiley, New York
Korb KB, Nicholson AE () Bayesian artificial intelligence. CRC Press, Boca Raton
Moussy C () 'Probare, probatio, probabilis' dans le vocabulaire de la démonstration. Pallas :–
Neyman J, Pearson ES () On the use and interpretation of certain test criteria for purposes of statistical inference. Biometrika (A):–, –
Ramsey FP () Truth and probability. In: Braithwaite RB (ed) The foundations of mathematics and other logical essays, Ch. VII. Kegan Paul, Trench, Trubner/Harcourt, Brace, London/New York, pp –
Savage LJ () The foundations of statistics. Wiley, New York
Stigler SM () The history of statistics: the measurement of uncertainty. Harvard University Press, Cambridge
Suppes P () A probabilistic theory of causality. North-Holland, Helsinki
Swinburne R (ed) () Bayes's theorem. In: Proceedings of the British Academy. Oxford University Press, London
Vallverdú J () The false dilemma: Bayesian vs. frequentist. E-LOGOS Electron J Philos. http://e-logos.vse.cz/index.php?target=indexyear
Weatherford R () Philosophical foundations of probability theory. Routledge & Kegan Paul, London

Probit Analysis

Tiberiu Postelnicu
Professor Emeritus
Romanian Academy, Bucharest, Romania

Introduction
The idea of probit analysis was originally published in Science by Chester Ittner Bliss (–). He was primarily concerned with finding an effective pesticide to control insects that fed on grape leaves. By plotting the response of the insects to various concentrations of pesticides, he could visually see that each pesticide affected the insects at different concentrations, but he did not have a statistical method to compare this difference. The most logical approach would be to fit a regression of the response versus the concentration or dose and compare between the different pesticides. The relationship of response to dose was sigmoid in nature, and at that time regression was only used on linear data. Therefore, Bliss developed the idea of transforming the sigmoid dose–response curve into a straight line. When biological responses are plotted against their causal stimuli (or their logarithms) they often form a sigmoid curve. Sigmoid relationships can be linearized by transformations such as the logit, probit, and angular transformations. For most systems the probit (normal sigmoid) and logit (logistic sigmoid) give the most closely fitting result. Logistic methods are useful in epidemiology, but in biological assay work probit analysis is preferred. David Finney, from the University of Edinburgh, took Bliss' idea and wrote a book entitled Probit Analysis, and it remains the preferred statistical method for understanding dose–response relationships.

Probit
In probability theory and statistics, the probit function is the inverse cumulative distribution function (CDF), or quantile function, associated with the standard normal distribution. It has applications in exploratory statistical graphics and in specialized regression modeling of binary response variables. For the standard normal distribution, the CDF is commonly denoted Φ(z); it is a continuous, monotone-increasing sigmoid function whose domain is the real line and whose range is (0, 1). The probit function gives the inverse computation, generating the value of an N(0, 1) random variable associated with a specified cumulative probability. Formally, the probit function is the inverse of Φ(z), denoted Φ⁻¹(p). In general, we have Φ(probit(p)) = p and probit(Φ(z)) = z. Bliss proposed transforming the percentage responding into a "probability unit" (or "probit") and included a table to aid researchers in converting their kill percentages to his probits, which they could then plot against the logarithm of the dose and thereby, it was hoped, obtain a more or less straight line. Such a probit model is still important in toxicology, as well as in other fields. It should be observed that probit methodology, including numerical optimization for the fitting of probit functions, was introduced before the widespread availability of electronic computing and, therefore, it was convenient to have probits uniformly positive.
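As a quick numerical illustration (not part of the original entry), the probit function and Bliss's shifted "probability unit" can be computed with a standard normal quantile routine; the probabilities used below are arbitrary.

```python
# The probit function is the standard normal quantile Phi^{-1}(p); Bliss's
# original "probability unit" adds 5 so that values are usually positive.
from scipy.stats import norm

for p in (0.10, 0.50, 0.91):
    probit = norm.ppf(p)             # Phi^{-1}(p)
    print(f"p = {p:.2f}  probit = {probit:+.3f}  Bliss probit = {5 + probit:.3f}")

# Inverse relationships: Phi(probit(p)) = p and probit(Phi(z)) = z
print(norm.cdf(norm.ppf(0.3)))       # 0.3
print(norm.ppf(norm.cdf(1.7)))       # 1.7
```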

Related Topics
The probit function is useful in statistical analysis for diagnosing deviation from normality, according to the method of Q–Q plotting. If a set of data is actually a sample from a normal distribution, a plot of the values against their probit scores will be approximately linear. Specific deviations from normality such as asymmetry, heavy tails, or bimodality can be diagnosed from specific deviations from linearity. While the Q–Q plot can be used for comparison with any distribution family (not only the normal), the normal Q–Q plot is a relatively standard exploratory data analysis procedure, because the assumption of normality is often a starting point for analysis.
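The following minimal sketch (a hypothetical example, not from the original entry) builds the normal Q–Q comparison directly from probit scores; the simulated sample and plotting positions are arbitrary choices.

```python
# Normal Q-Q check: plot ordered data against probits of plotting positions;
# near-linearity (correlation close to 1) is consistent with normality.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
x = np.sort(rng.normal(loc=10.0, scale=2.0, size=200))      # sample to check

n = x.size
probit_scores = norm.ppf((np.arange(1, n + 1) - 0.5) / n)   # theoretical quantiles

r = np.corrcoef(probit_scores, x)[0, 1]
print(f"correlation between ordered data and probit scores: {r:.4f}")
```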

The normal distribution CDF and its inverse are not available in closed form, and computation requires careful use of numerical procedures. However, the functions are widely available in software for statistics and probability modeling, and also in spreadsheets. In computing environments where numerical implementations of the inverse error function are available, the probit function may be obtained as probit(p) = √2 erf⁻¹(2p − 1). An example is MATLAB, where an "erfinv" function is available, and the language MATHEMATICA implements "InverseErf." Other environments, such as the R programming language, implement the probit function directly.

Closely related to the probit function is the logit function, based on the "odds" p/(1 − p), where p is the proportional response, i.e., if r out of n responded then p = r/n, and logit(p) = log odds = log(p/(1 − p)). Analogously to the probit model, it is possible to assume that such a quantity is related linearly to a set of predictors, resulting in the logit model, the basis in particular of the logistic regression model (see ▶Logistic Regression), the most prevalent form of regression analysis for binary response data. In current statistical practice, probit and logit regression models are often handled as cases of the generalized linear model (see ▶Generalized Linear Models).

Probit Model
In statistics and related fields, a probit model is a specification for a binary response model that employs a probit link function. This model is most often estimated using the standard maximum likelihood procedure; such an estimation is called probit regression. A fast method for computing maximum likelihood estimates for probit models was proposed by Ronald Fisher in an appendix to the article of Bliss.

Probit analysis is a method of analyzing the relationship between a stimulus (dose) and the quantal (all or nothing) response. Quantitative responses are almost always preferred, but in many situations they are not practical. In these cases, it is only possible to determine whether a certain response has occurred. In a typical quantal response experiment, groups of animals are given different doses of a drug and the percent dying at each dose level is recorded. These data may then be analyzed using probit analysis. StatPlus includes two different methods of probit analysis, but the Finney method is the most important and useful. The probit model assumes that the percent response is related to the log dose as the cumulative normal distribution, that is, the log doses may be used as variables to read the percent dying from the cumulative normal. Using the normal distribution, rather than other probability distributions, influences the predicted response rate at the high and low ends of possible doses, but has little influence near the middle. Much of the comparison of different drugs is done using response rates of 50%. The probit model may be expressed mathematically as follows:

P = a + b log(Dose),

where P is five plus the inverse normal transform of the response rate (called the probit). The five is added to reduce the possibility of negative probits, a situation that caused confusion when solving the problem by hand.

Suppose the response variable Y is binary, that is, it can have only two possible outcomes, which we will denote as 1 and 0. For example, Y may represent presence/absence of a certain condition, success/failure of some device, or the answer yes/no on a survey. We also have a vector of regressors X, which are assumed to influence the outcome Y. Specifically, we assume that the model takes the form

P[Y = 1 | X] = Φ(X′β),

where P denotes probability and Φ is the CDF of the standard normal distribution. The parameters β are typically estimated by maximum likelihood. For more complex probit analysis, such as the calculation of relative potencies from several related dose–response curves, consider nonlinear optimization software or specialist dose–response analysis software. The latter includes a FORTRAN routine written by David Finney and Ian Craigie from the Edinburgh University Computing Centre. MLP or GENSTAT can be used for more general nonlinear model fitting. We must take into account that standard probit analysis is designed to handle only quantal responses with binomial error distributions. Quantal data, such as the number of subjects responding versus the total number of subjects tested, usually have binomial error distributions. We should not use continuous data, such as percent maximal response, with probit analysis, as these data are likely to require regression methods that assume a different error distribution.

Applications
Probit analysis is used to analyze many kinds of dose–response or binomial response experiments in a variety of fields. It is commonly used in toxicology to determine the relative toxicity of chemicals to living organisms. This is done by testing the response of an organism under various concentrations of each of the chemicals in question and then comparing the concentrations at which one encounters a response. The response is always binomial, and the relationship between the response and the various concentrations is always sigmoid. Probit analysis acts as a transformation from sigmoid to linear and then runs a regression on the relationship. Once the regression is run, we can use the output of the probit analysis to compare the amount of chemical required to create the same response in each of the various chemicals. There are many points used to compare the differing toxicities of chemicals, but the LC50 (liquids) and LD50 (solids) are the most widely used outcomes of modern dose–response experiments. The LC50/LD50 represent the concentration (LC50) or dose (LD50) at which 50% of the population responds. It is possible to carry out probit analysis with various statistical packages such as SPSS, SAS, R, or S, but it is good to know the history of the methodology to get a thorough understanding of the material. We must take care that probit analysis assumes that the relationship between the number responding (not the percent response) and concentration is normally distributed; if not, the logit is preferred.

The properties of the estimates given by probit analysis have also been studied by Ola Hertzberg (). The up-and-down technique is the best known among staircase methods for estimating the parameters of quantal response curves (QRC). Some small-sample properties of probit analysis are considered there, and the medians of the estimate distributions are used as a measure of location.

About the Author
Tiberiu Postelnicu (born in Campina) received his Ph.D. in Mathematics from the University of Bucharest. He was Head of the Department of Biostatistics, Carol Davila University of Medicine and Pharmacy, Bucharest, and of the Department of Biometrics, Centre of Mathematical Statistics of the Romanian Academy, Bucharest. He is a member of the International Statistical Institute, the Biometric Society, the Bernoulli Society for Mathematical Statistics and Probability, the Italian Society for Statistics, and the New York Academy of Sciences. Dr. Postelnicu was a member of the editorial boards of the Biometrical Journal and the Journal of Statistical Planning and Inference. He was awarded the Gheorghe Lazar Prize for Mathematics of the Romanian Academy. Currently, Professor Postelnicu is President of the Commission for Biometrics of the Romanian Academy.

Cross References
▶Agriculture, Statistics in
▶Econometrics
▶Generalized Linear Models
▶Logistic Regression
▶Nonlinear Models
▶Normal Distribution, Univariate

References and Further Reading
Bliss CI () The method of probits. Science ():–
Bliss CI () The calculation of the dosage–mortality curve. Ann Appl Biol :–
Bliss CI () The determination of the dosage–mortality curve from small numbers. Q J Pharmacol :–
Collett D () Modelling binary data. Chapman & Hall, London
Finney DJ () Probit analysis, 1st edn. Cambridge University Press, Cambridge
Finney DJ, Stevens WL () A table for the calculation of working probits and weights in probit analysis. Biometrika (–):–
Finney DJ () Probit analysis, 3rd edn. Cambridge University Press, Cambridge
Greenberg BG () Chester I Bliss, –. Int Stat Rev ():–
Hertzberg JO () On small sample properties of probit analysis. In: Proceedings of the International Biometric Conference, Constantza, Romania, pp –
McCullagh P, Nelder J () Generalized linear models. Chapman & Hall, London

Promoting, Fostering and Development of Statistics in Developing Countries

Noël H. Fonton, Norbert Hounkonnou
Professor, Head of Centre of Biostatistics and Director, Laboratory of Modeling of Biological Phenomena, University of Abomey-Calavi, Cotonou, Benin Republic
Professor, Chairman of the International Chair of Mathematical Physics and Applications (ICMPA UNESCO Chair), Cotonou, Benin Republic

Statistical methods are universal, and hence their applicability depends neither on geographical area nor on a people's culture. Promoting and increasing the use of statistics in developing countries can help to find solutions to the needs of their citizens. Developing countries are confronted with endemic poverty that requires implementable solutions for alleviating suffering. Such poverty is a signal call to the world to meet fundamental human needs: food, adequate shelter, access to education and healthcare, protection from violence, and freedom. Statistics and statistical tools, and the matching between hypothesis, data collection, and statistical method, are necessary as development strategies in the developing countries are formulated to address these needs.

First, a reliable basis for the implementation of strategies against poverty and for the achievement of the Millennium Development Goals requires good statistics, an essential element of good governance. Important indicators to inform and monitor development policies are therefore often derived from household surveys, which have become a dominant form of data collection in developing countries. Such surveys are an important source of socioeconomic data. Azouvi () has proposed a low-cost statistical program in four areas: statistical coordination, national accounts, economic and social conjuncture, and dissemination. To increase awareness that good statistics are important for achieving better development results, the Marrakech Action Plan for Statistics (MPS) was developed. This global plan for improving development statistics was agreed upon at the second international roundtable on managing for development results (World Bank ). The idea is that better data are needed for better results. One indicator of the application of this MPS is the full participation of developing countries in the census round. Additionally, funds allocated by the World Bank from the Development Grant Facility and technical assistance from national universities are critical.

Second, national capacity-building in statistics is very limited. Even in developed countries, secondary school students, together with their teachers, rarely see the applicability and the challenge of statistical thinking (Boland ). This situation exists to a greater degree in the university training systems of many developing countries. It is suggested that statistical programs should be reviewed and executed by statisticians, and that examples of a local nature should be used. This is possible only with a sufficient number of statisticians in various fields. According to Lo (), most countries have only a few holders of doctoral degrees in statistics, with higher numbers in some countries. There is a low number of statisticians in developing countries due to the lack of master's degree programs in statistics. In sub-Saharan, French-speaking African countries, the "Statistiques pour l'Afrique Francophone et Applications au vivant" (STAFAV) project is being implemented in three parts: a network for master's training in applied statistics at Cotonou (Benin), Saint-Louis (Senegal), and Yaoundé (Cameroon); PhD candidates jointly supervised by scientists from African universities and French universities; and the development of statistical research through the African Network of Mathematical Statistics and Applications (RASMA). STAFAV constitutes a good means for increasing statistical capacity in developing countries.


With the launch of the Statistical Pan African Society (SPAS), more visibility for research and for the use of statistics and probability may be achieved via the African Visibility Program in Statistics and Probability. This society is an antidote to the very isolated work of statistics researchers and allows researchers to share experiences and scientific work.

Third, statistics are a tool for solving the problems of developing countries, but the development of research activities is a challenge. Many of these countries are located in tropical areas; therefore, there are major differences between them and other countries due to high biological variability, and the probability distribution of the studied phenomena is frequently misunderstood. Control of biological variability requires wide use of probability theory. Because of the disparate populations and subpopulations in the experimental data, the use of a single probability distribution should be called into question. Lacking sophisticated statistical methods, the development of statistical methods for mixture distributions becomes a challenge. Economic loss, agriculture, water policies, and health (malaria, HIV/AIDS, and recently H1N1) are the major areas for research programs in which statistics have a major role to play. Development of statistical research programs directed at the well-being of local people is necessary.

The sustainability of statistical development is another important issue in the field. Graduate statisticians, when returning to their native countries, often do not have facilities for continuing education, documentation, or statistical software packages. To foster statistics in developing countries, national statistics institutes, universities, and research centers need to increase the funds allocated for subscribing to statistical journals, mainly online, and to software, with staff properly trained in those fields. Statisticians must be offered opportunities to attend annual conferences and to do research and/or professional training in a statistical institute outside of their countries.

Contributions of statisticians from developed countries, working or teaching in developing countries, are welcome. Specifically, they can join research teams in developing countries, share experiences with them, help acquire funding, and teach.

Cross References
▶African Population Censuses
▶Careers in Statistics
▶Learning Statistics in a Foreign Language
▶National Account Statistics
▶Online Statistics Education
▶Rise of Statistics in the Twenty First Century
▶Role of Statistics
▶Role of Statistics: Developing Country Perspective
▶Selection of Appropriate Statistical Methods in Developing Countries
▶Statistics and Climate Change
▶Statistics Education

References and Further Reading
Azouvi A () Proposals for a minimum programme for statistics in developing countries. Int Stat Rev ():–
Boland PJ () Promoting statistical thinking amongst secondary school students in a national context. ICOTS
Lo GS () Probability and statistics in Africa. IMS Bull ()
World Bank () The Marrakech action plan for statistics. Second international roundtable on managing for development results, Morocco

Properties of Estimators

Paul H. Garthwaite
Professor of Statistics
The Open University, Milton Keynes, UK

Estimation is a primary task of statistics, and estimators play many roles. Interval estimators, such as confidence intervals or prediction intervals, aim to give a range of plausible values for an unknown quantity. Density estimators aim to approximate a probability distribution. These and other varied roles of estimators are discussed in other sections. Here attention is restricted to point estimation, where the aim is to calculate from data a single value that is a good estimate of an unknown parameter.

We will denote the unknown parameter by θ, which is assumed to be a scalar. In the standard situation there is a statistic T whose value, t, is determined by sample data. T is a random variable, and it is referred to as a (point) estimator of θ if t is an estimate of θ. Usually there will be a variety of possible estimators, so criteria are needed to separate good estimators from poor ones. There are a number of desirable properties which we would like estimators to possess, though a property will not necessarily identify a unique "best" estimator, and rarely will there be an estimator that has all the properties mentioned here. Also, caution must be exercised in using the properties, as a reasonable property will occasionally lead to an estimator that is unreasonable.

One property that is generally useful is unbiasedness. T is an unbiased estimator of θ if, for any θ, E(T) = θ. Thus T is unbiased if, on average, it tends to be neither bigger nor smaller than the quantity it estimates, regardless of the actual value of the quantity.


The bias of T is defined to be E(T) − θ. Obviously a parameter can have more than one unbiased estimator. For example, if θ is the mean of a symmetric distribution from which a random sample is taken, then T is an unbiased estimator if it is the mean, median or mid-range of the sample. It is also the case that sometimes a unique unbiased estimator is not sensible. For example, Cox and Hinkley () show that if a single observation is taken from a geometric distribution with parameter θ, then there is only one unbiased estimator, and its estimate of θ is either 1 (if the observation's value is 1) or 0 (if the observation's value is greater than 1). In most circumstances these are not good estimates.

It is desirable, almost by definition, that the estimate t should be close to θ. Hence the quality of an estimator might be judged by its expected absolute error, E(|T − θ|), or its mean squared error, E[(T − θ)²]. The latter is used far more commonly, partly because of its relationship to the mean and variance of T:

mean squared error = variance + (bias)².     (1)

If the aim is to find an estimator with small mean squared error (MSE), clearly unbiasedness is desirable, as then the last term in Eq. (1) vanishes. However, unbiasedness is not essential, and trading a small amount of bias for a large reduction in variance will reduce the MSE. Perhaps the best known biased estimators are the regression coefficients given by ridge regression (see ▶Ridge and Surrogate Ridge Regressions), which handles multicollinearities in a regression problem by allowing a small amount of bias in the coefficient estimates, thereby reducing the variance of the estimates.

It may seem natural to try to find estimators which minimize MSE, but this is often difficult to do. Moreover, given any estimator, there is usually some value of θ for which that estimator's MSE is greater than the MSE of some other estimator. Hence the existence of an estimator with a uniformly minimum MSE is generally in doubt. For example, consider the trivial and rather stupid estimator that ignores the data and chooses some constant θ₀ as the estimator of θ. Should θ actually equal θ₀, then this estimator has an MSE of 0, and other estimators will seldom match it. Thus other estimators will not have a uniformly smaller MSE than this trivial estimator.
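The decomposition (1) and the bias–variance trade-off can be checked directly by simulation. The sketch below (not part of the original article) compares the unbiased sample mean with a deliberately biased shrinkage estimator of a normal mean; the shrinkage factor and parameter values are arbitrary choices.

```python
# MSE = variance + bias^2: compare the unbiased sample mean with the biased
# shrinkage estimator c * xbar for a N(theta, 1) sample.
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps, c = 0.5, 10, 200_000, 0.8

x = rng.normal(theta, 1.0, size=(reps, n))
xbar = x.mean(axis=1)                      # unbiased estimator
shrunk = c * xbar                          # biased towards zero, lower variance

for name, est in (("sample mean", xbar), ("shrunk mean", shrunk)):
    bias = est.mean() - theta
    var = est.var()
    mse = np.mean((est - theta) ** 2)
    print(f"{name:12s} bias^2 + var = {bias**2 + var:.4f}   MSE = {mse:.4f}")
```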

Restricting attention to unbiased estimators solves many of the difficulties of working with MSE. The task of minimizing MSE reduces to that of minimizing variance, and substantial theory has been developed about minimum variance unbiased estimators (MVUEs). This includes two well-known results, the Cramer–Rao lower bound and the ▶Rao–Blackwell theorem. The Cramer–Rao lower bound is Iθ⁻¹, where Iθ is the Fisher information about θ. (Iθ is determined from the likelihood for θ.) Subject to certain regularity conditions, the Cramer–Rao lower bound is a lower bound to the variance of any unbiased estimator of θ.

A benefit of the Cramer–Rao lower bound is that it provides a numerical, scale-free measure for judging an estimator: the efficiency of an unbiased estimator is defined as the ratio of the Cramer–Rao lower bound to the variance of the estimator. Also, an unbiased estimator is said to have the property of being efficient if its variance equals the Cramer–Rao lower bound. Efficient estimators are not uncommon. For example, the sample mean is an efficient estimator of the population mean when sampling is from a normal distribution or a Poisson distribution, and there are many others. By definition, only an MVUE might be efficient.

Sufficiency is a property of a statistic that can lead to good estimators. A statistic S (which may be a vector) is sufficient for θ if it captures all the information about θ that the data contain. For example, the sample variance is sufficient for the population variance when data are a random sample from a normal distribution – hence, to make inferences about the population variance we only need to know the sample variance and not the individual data values. The definition of sufficiency is a little more transparent in ▶Bayesian statistics than in classical statistics (though the definitions are equivalent). In the Bayesian approach, S is sufficient for θ if the distribution of θ, given the value of S, is the same as θ's distribution given all the data, i.e., g(θ | S) = g(θ | data), where g(·) is the p.d.f. of θ. In the classical definition (where θ cannot be considered to have a distribution), S is sufficient for θ if the conditional distribution of the data, given the value of S, does not depend on θ. A sufficient statistic may contain much superfluous information along with the information about θ, so the concept of a minimal sufficient statistic is also useful. A statistic is minimal sufficient if it can be expressed as a function of every other sufficient statistic.

The Rao–Blackwell theorem shows the importance of sufficient statistics when seeking unbiased estimators with small variance. It states that if θ̂ is an unbiased estimator of θ and S is a sufficient statistic, then

1. T_S = E(θ̂ ∣ S) is a function of S alone and is an unbiased estimator of θ.

2. Var(T_S) ≤ Var(θ̂).

The theorem means that we can try to improve on any unbiased estimator by taking its expectation conditional on a sufficient statistic – the resulting estimator will also be unbiased and its variance will be smaller than, or equal to, the variance of the original estimator.


Stronger results hold if a minimal sufficient statistic is also complete: S is complete if E[h(S)] cannot equal 0 for all θ unless h(S) ≡ 0 almost everywhere (where h is any function). If S is complete, there is at most one function of S that is an unbiased estimator of θ. Suppose now that S is a complete minimal sufficient statistic for θ. An important result is that if h(S) is an unbiased estimator of θ, then h(S) is an MVUE for θ, if an MVUE exists. The consequence is that, when searching for an MVUE, attention can be confined to functions of a complete sufficient statistic.
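As an illustration of Rao–Blackwellization (a sketch added here, not part of the original entry; it assumes a Poisson model and uses only NumPy), consider estimating P(X = 0) = exp(−λ) from a Poisson sample. The crude unbiased estimator 1{X1 = 0} can be conditioned on the sufficient (and complete) statistic S = ΣXi, giving E[1{X1 = 0} ∣ S] = (1 − 1/n)^S, which is therefore the MVUE; the simulation checks that both estimators are unbiased and that the conditioned one has much smaller variance.

    import numpy as np

    rng = np.random.default_rng(1)
    lam, n, reps = 2.0, 20, 50000

    x = rng.poisson(lam, size=(reps, n))
    naive = (x[:, 0] == 0).astype(float)   # unbiased but crude: 1{X1 = 0}
    s = x.sum(axis=1)                      # sufficient statistic S = sum of the Xi
    rao_blackwell = (1 - 1/n) ** s         # E[1{X1 = 0} | S] = (1 - 1/n)^S

    print("target exp(-lambda):", np.exp(-lam))
    print("means:", naive.mean(), rao_blackwell.mean())     # both close to exp(-lambda)
    print("variances:", naive.var(), rao_blackwell.var())   # the second is much smaller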

Turning to asymptotic properties, suppose that the data consist of a simple random sample of size n and consider the behavior of an estimator T as n → ∞. An almost essential property is that the estimator should be consistent: T is a consistent estimator of θ if T converges to θ in probability as n → ∞. Consistency implies that, as the sample size increases, any bias in T tends to 0 and the variance of T also tends to 0.

Two useful properties that do not relate directly to the accuracy of an estimator are asymptotic normality and invariance. When sample sizes are large, confidence intervals and hypothesis tests are often based on the assumption that the distribution of an estimator is approximately normal. Hence asymptotic normality is a useful property in an estimator, especially if approximate normality holds quite well for modest sample sizes. Invariance of estimators relates to the method of forming them. It is the notion that if we take a transformation of a parameter, then ideally its estimator should transform in the same way. For example, let ϕ = g(θ), where ϕ is a one-to-one function of θ. Then if a method of forming estimators gives t1 and t2 as estimates of ϕ and θ, invariance would imply that t1 necessarily equalled g(t2). Maximum likelihood estimators are invariant.

We have assumed that the unknown parameter (θ) is a scalar. Concepts such as unbiasedness, sufficiency, consistency, invariance and asymptotic normality extend very naturally to the case where the unknown parameter is a vector. If θ is a vector but an estimate of just one of its components is required, then a vector form of the Cramér–Rao lower bound yields a minimum variance for any unbiased estimator of the component. Simultaneous estimation of more than one component, however, raises new challenges unless estimating each component separately and combining the estimates is optimal.

While a search for MVUEs has been a focus of one area of statistics, other branches of statistics want different properties in estimators. Robust methods want point estimators that are comparatively insensitive to a few aberrant observations or the odd outlier.

Nonparametric methods want to estimate a population mean or variance, say, without making strong assumptions about the population distribution. These branches of statistics do not place paramount importance on unbiasedness or minimal variance, but they nevertheless typically seek estimators with low bias and small variance – it is just that their estimators must also satisfy other requirements. In contrast, Bayesian statistics uses a markedly different framework for choosing estimators. In its basic form the parameters of the sampling model are given a prior distribution, while a loss function specifies the penalty for inaccuracy in estimating a parameter. The task is then to select an estimator or decision rule that will minimize the expected loss, so minimizing expected loss is the property of dominant importance.

Many other sections of this encyclopedia also consider point estimators and point estimation methods. These include sections on nonparametrics, robust estimation, Bayesian methods and decision theory. The focus in this section has been the classical properties of point estimators. Deeper discussion of this topic and proofs of results are given in most advanced textbooks on statistical inference or theoretical statistics, such as Bickel and Doksum, Cox and Hinkley, Garthwaite et al., and Lehmann and Casella (see References and Further Reading).

About the Author
Paul Garthwaite is Professor of Statistics at the Open University, UK, where he has served as Head of the Department of Statistics. He worked for twenty years at the University of Aberdeen and has held visiting positions at universities in New York State, Minnesota, Brisbane and Sydney. He was awarded the L J Savage Prize, a prize now under the auspices of the International Society for Bayesian Analysis. He has published many journal papers and is co-author of two books, Statistical Inference (Oxford University Press) and Uncertain Judgements: Eliciting Experts' Probabilities (Wiley).

Cross References
Approximations for Densities of Sufficient Estimators
Asymptotic Normality
Asymptotic Relative Efficiency in Estimation
Bayesian Statistics
Bayesian vs. Classical Point Estimation: A Comparative Overview
Cramér–Rao Inequality
Estimation
Estimation: An Overview
Minimum Variance Unbiased
Rao–Blackwell Theorem


Sufficient Statistics
Unbiased Estimators and Their Applications

References and Further Reading
Bickel PJ, Doksum KA () Mathematical statistics: basic ideas and selected topics, 2nd edn. Prentice Hall, London
Cox DR, Hinkley DV () Theoretical statistics. Wiley, New York
Garthwaite PH, Jolliffe IT, Jones B () Statistical inference, 2nd edn. Oxford University Press, Oxford
Lehmann EL, Casella G () Theory of point estimation, 2nd edn. Springer, New York

Proportions, Inferences, and Comparisons

George A. F. Seber
Emeritus Professor of Statistics
Auckland University, Auckland, New Zealand

A common problem in statistics, and especially in sample surveys, is how to estimate the proportion p (= 1 − q) of people with a given characteristic (e.g., being left-handed) in a population of known size N. If there are M left-handed people in the population, then p = M/N. The usual method of estimating p is to take a simple random sample (SRS), that is, a random sample without replacement, of size n, and count the number of left-handed people, x, in the sample. If the sample is representative, then p can be estimated by the sample proportion p̂ = x/n. To make inferences about p we need to use the probability distribution of x, namely the Hypergeometric distribution (see Hypergeometric Distribution and Its Application in Statistics), a distribution that is difficult to use. From this distribution we can get the mean and variance of x and hence of p̂, namely

μ_p̂ = p and σ²_p̂ = r pq/n,

where r = (N − n)/(N − 1) ≈ 1 − f, and f is the sampling fraction n/N, which can be ignored if it is sufficiently small. Fortunately, for sufficiently large N, M and n, x and p̂ are approximately normal, so that z = (p̂ − p)/σ_p̂ has an approximate standard normal distribution (with mean 0 and variance 1). This approximation will still hold if we replace p by p̂ in the denominator of σ_p̂ to get σ̂_p̂, giving us an approximate 95% confidence interval p̂ ± 1.96 σ̂_p̂.
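A short numerical sketch of this interval (added for illustration; the population size, sample size and count below are invented) computes the finite-population-corrected standard error and the approximate 95% confidence interval:

    import math

    N, n, x = 10000, 400, 52          # hypothetical population size, sample size, count
    p_hat = x / n
    q_hat = 1 - p_hat
    r = (N - n) / (N - 1)             # finite population correction, roughly 1 - n/N
    se = math.sqrt(r * p_hat * q_hat / n)
    ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)
    print(f"p_hat = {p_hat:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")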

Inverse sampling can also be used to estimate p, especially when the characteristic is rare. Random sampling is continued until x units with the given characteristic are selected, n now being random, and Haldane gave the estimate p̂_I = (x − 1)/(n − 1).

This has variance estimate σ̂²_p̂I = r_I p̂_I q̂_I/(n − 2), where r_I = 1 − (n − 1)/N for sampling without replacement and r_I = 1 for sampling with replacement, and an approximate 95% confidence interval for p is p̂_I ± 1.96 σ̂_p̂I (Salehi and Seber).

Another application of this theory is the case where M consists of a known number of marked animals released into a population of unknown size, but with f known to be sufficiently small that we can set r = 1. The confidence interval for p can then be rearranged to give a confidence interval for N. This simple idea has led to a very large literature on capture–recapture (Seber).

Returning to our example relating to left-handed people, when we choose the first person from the population, the probability of getting a left-handed person will be p, so the terms "probability" and "proportion" tend to be used interchangeably in the literature, although they are distinct concepts. They can be brought even closer together if sampling is with replacement, for then the probability of getting a left-handed person at each selection remains at p and x now has a Binomial distribution, as we have n independent trials with probability of "success" being p. The above formulas for means, variances and the confidence interval are still the same except that r is now exactly 1. This is not surprising, as we expect sampling with replacement to be a good approximation to sampling without replacement when a small proportion of a population is sampled. Confidence intervals for the Binomial distribution have been studied for many years and a variety of approximations and modifications have been considered, for example, Newcombe (a). "Exact" confidence intervals, usually referred to as the Clopper–Pearson intervals, can also be computed using the so-called "tail" probabilities of the binomial distribution, which are related to a Beta distribution (cf. Agresti and Coull). We can also use the above theory to test null hypotheses like H0: p = p0, though such hypotheses apply more to probabilities than to proportions.

When it comes to comparing two proportions, there are three different experimental situations that need to be considered. Our example for explaining these relates to voting preferences. Suppose we wish to compare the proportions, say pi (i = 1, 2), of people voting for a particular candidate in two separate areas, and we do so by taking an SRS of size ni in each area and computing p̂i for each area. In comparing the areas we will be interested in estimating θ = p1 − p2 using θ̂ = p̂1 − p̂2. As the two estimates are statistically independent, and assuming f can be neglected for each sample, we have

σ²_θ̂ = p1q1/n1 + p2q2/n2,


which we can estimate by replacing each pi by its estimate. Assuming the normal approximation is valid for each sample, we have the usual approximate 95% confidence interval θ̂ ± 1.96 σ̂_θ̂ for θ. This theory also applies to comparing two Binomial probabilities and, as with a single probability, a number of methods have been proposed (see Newcombe b). Several procedures for testing the null hypothesis H0: θ = 0 are available, including an "exact" test due to Fisher and an approximate Chi-square test with or without a correction for continuity.
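As a quick numerical sketch (illustrative values only, not taken from the entry), the interval for the difference of two independent sample proportions can be computed as follows:

    import math

    # hypothetical samples from the two areas
    n1, x1 = 500, 230
    n2, x2 = 400, 150
    p1, p2 = x1 / n1, x2 / n2
    theta = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    print(f"theta_hat = {theta:.3f}, "
          f"95% CI = ({theta - 1.96*se:.3f}, {theta + 1.96*se:.3f})")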

A second situation is when we have a single population and the pi now refer to two different candidates. The estimates p̂i (i = 1, 2) are no longer independent, and we now find, from Scott and Seber, that

σ²_θ̂ = (1/n)[p1 + p2 − (p1 − p2)²],

where n is the sample size. Once again we can replace the unknown pi by their estimates and, assuming f can be ignored, we can obtain an approximate confidence interval for θ, as before.
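A small sketch of this same-survey case (again with invented numbers) shows how the variance formula is used:

    import math

    n = 1000                      # one sample; both proportions come from the same respondents
    p1, p2 = 0.42, 0.36           # hypothetical support for the two candidates
    theta = p1 - p2
    var_same_survey = (p1 + p2 - (p1 - p2)**2) / n
    se = math.sqrt(var_same_survey)
    print(f"theta_hat = {theta:.3f}, "
          f"95% CI = ({theta - 1.96*se:.3f}, {theta + 1.96*se:.3f})")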

A third situation occurs when two proportions from the same population overlap in some way. Suppose we carry out a sample survey questionnaire of n questions that have answers "Yes" and "No." Considering the first two questions, Q1 and Q2, let p11 be the proportion of people who say "Yes" to both questions, p12 the proportion who say "Yes" to Q1 and "No" to Q2, p21 the proportion who say "No" to Q1 and "Yes" to Q2, and p22 the proportion who say "No" to both questions. Then p1 = p11 + p12 is the proportion saying "Yes" to Q1 and p2 = p11 + p21 the proportion saying "Yes" to Q2. We want to estimate θ = p1 − p2, as before. If the xij are the numbers observed in the category with probability pij, and x1 = x11 + x12 and x2 = x11 + x21, then, from Wild and Seber,

σ²_θ̂ = (1/n)[p12 + p21 − (p12 − p21)²].

To estimate the above variance, we would replace each pij by its estimate p̂ij. In many surveys, though, only x1 and x2 are recorded, so that the only parameters we can estimate are the pi, using p̂i = xi/n. However, we can use these estimates in the following bounds

d(1 − d)/n ≤ σ²_θ̂ ≤ (1/n)[min(p1 + p2, q1 + q2) − (p1 − p2)²],

where d = ∣p1 − p2∣ = ∣p12 − p21∣. Further comments about constructing confidence intervals, testing hypotheses, and dealing with non-responses to the questions are given in the above paper.
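The following sketch (hypothetical marginal counts; it assumes only the marginal "Yes" totals are recorded) evaluates these bounds on the variance of θ̂:

    import math

    n = 800
    x1, x2 = 480, 440                 # hypothetical numbers answering "Yes" to Q1 and Q2
    p1, p2 = x1 / n, x2 / n
    q1, q2 = 1 - p1, 1 - p2
    d = abs(p1 - p2)

    lower = d * (1 - d) / n
    upper = (min(p1 + p2, q1 + q2) - (p1 - p2)**2) / n
    print(f"bounds on var(theta_hat): [{lower:.5f}, {upper:.5f}]")
    print(f"bounds on SE: [{math.sqrt(lower):.4f}, {math.sqrt(upper):.4f}]")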

For an elementary discussion of some of the above ideas see Wild and Seber.

About the Author
For biography see the entry Adaptive Sampling.

Cross References
Asymptotic Normality
Binomial Distribution
Fisher Exact Test
Hypergeometric Distribution and Its Application in Statistics
Inverse Sampling
Statistical Inference
Statistical Inference in Ecology
Statistical Inference: An Overview

References and Further Reading
Agresti A, Coull BA () Approximate is better than "exact" for interval estimation of binomial proportions. Am Stat
Brown LD, Cai TT, DasGupta A () Interval estimation for a binomial proportion. Stat Sci (followed by a discussion by several authors)
Newcombe R (a) Two-sided confidence intervals for the single proportion: comparison of seven methods. Stat Med
Newcombe R (b) Interval estimation for the difference between independent proportions: comparison of eleven methods. Stat Med (with correction)
Salehi MM, Seber GAF () A new proof of Murthy's estimator which applies to sequential sampling. Aust N Z J Stat
Seber GAF () The estimation of animal abundance, 2nd edn. Blackburn, Caldwell (reprint)
Scott AJ, Seber GAF () Difference of proportions from the same survey. Am Stat
Wild CJ, Seber GAF () Comparing two proportions from the same survey. Am Stat (with correction)
Wild CJ, Seber GAF () Chance encounters: a first course in data analysis and inference. Wiley, New York

Psychiatry, Statistics in

Graham Dunn
Professor of Biomedical Statistics and Head of the Health Methodology Research Group
University of Manchester, Manchester, UK

Why "Statistics in Psychiatry"? What makes statistics in psychiatry a particularly interesting intellectual challenge? Why is it not merely a sub-discipline of medical statistics, such as the application of statistics in rheumatology or statistics in cardiology? It is in the nature of mental illness and of mental health.


Mental illness extends beyond medicine into the realms of the social and behavioral sciences. Similarly, statistics in psychiatry owes as much, or more, to developments in social and behavioral statistics as it does to medical statistics. Statisticians in this area typically use a much wider variety of multivariate statistical methods than do medical statisticians elsewhere. Scientific psychiatry has always taken the problems of measurement much more seriously than appears to be the case in other clinical specialties. This is partly because the measurement problems in psychiatry are obviously rather complex, but partly also because the other clinical fields appear to have been a bit backward by comparison. It is also an academic discipline where, at its best, there is fruitful interplay between the ideas typical of the 'medical model' of disease and those coming from the psychometric traditions of, say, educationalists and personality theorists.

"Mental diseases have both psychological, sociological and biological aspects and their study requires a combination of the approaches of the psychologist, the sociologist and the biologist, using the last word rather than physician since the latter must be all three. In each of these aspects statistical reasoning plays a part, whether it be in the future planning of hospitals, the classification of the various forms of such illnesses, the study of causation or the evaluation of methods of treatment." (Moran – my italics)

"The etiology of mental illness inevitably involves complex and inadequately understood interactions between social stressors and genetically and socially determined vulnerabilities – the whole area being overlaid by a thick carpet of measurement and misclassification errors." (Dunn)

Do the varieties of mental illness fall into discrete, theoretically justified, diagnostic categories? Or are the boundaries entirely pragmatic and artificial? Should we be considering dimensions (matters of degree) or categories? Is the borderline between bipolar depression and schizophrenia, for example, real or entirely arbitrary? The same question even applies to the existence of mental illness itself. What distinguishes illness from social deviance and eccentricity? Establishing the validity and utility of psychiatric diagnoses has been, and still is, a major application of statistical thinking involving a whole range of complex multivariate methods (factor analysis being one of the most prominent). Once psychiatrists have created a diagnostic system they also need to be able to demonstrate its reliability. They need to be confident that psychiatrists will consistently allocate patients with a particular profile of symptoms to the same diagnostic group. They need to know that the category "schizophrenia" means the same thing and is used in the same way by all scientific psychiatrists, whether they work in the USA, China or Uganda.

Here, again, statistical methods (kappa coefficients, for example) hold the centre stage.

The development of rating scales and the evaluation of the associated measurement errors form the central core of statistics in psychiatry. It is the problem of measurement that makes psychiatry stand apart from the other medical specialties. Scientific studies in all forms of medicine (including psychiatry) need to take account of confounding and selection effects. Demonstration of treatment efficacy for mental illness, like treatment efficacy elsewhere in medicine, always needs the well-designed controlled randomized trial. But measurement and measurement error make psychiatry stand out. Here, the statistician's world has long been populated by latent variable models of all sorts – finite mixture and latent class models, factor analysis and item response models and, of course, structural equation models (SEM), allowing one to investigate associations and – with luck and careful design – causal links between these latent variables.

About the Author
Graham Dunn has been Professor of Biomedical Statistics and Head of the Biostatistics Group at the University of Manchester. Before that he was Head of the Department of Biostatistics and Computing at the Institute of Psychiatry. Graham's research is primarily focussed on the design and analysis of randomised trials of complex interventions, specialising in the evaluation of cognitive behavioural and other psychological approaches to the treatment of psychosis, depression and other mental health problems. Of particular interest is the design and analysis of multi-centre explanatory trials from which it is possible to test for and estimate the effects of mediation and moderation, and the effects of dose (sessions attended) and the quality of the therapy provided (including therapist effects). He also has interests in the design and analysis of measurement reliability studies. A key methodological component of both of these fields of applied research is the development and implementation of econometric methods such as the use of instrumental variables. He is author or co-author of seven statistics books, including Statistical Evaluation of Measurement Errors (Wiley, 2nd edition), Statistics in Psychiatry (Hodder Arnold) and (with Brian Everitt) Applied Multivariate Data Analysis (Wiley, 2nd edition).

Cross References
Factor Analysis and Latent Variable Modelling
Kappa Coefficient of Agreement
Medical Statistics


Rating Scales
Structural Equation Models

References and Further Reading
Dunn G () Statistics in psychiatry. Arnold, London
Moran PAP () Statistical methods in psychiatric research (with discussion). J Roy Stat Soc A

Psychological Testing Theory

Vesna Buško
Associate Professor, Faculty of Humanities and Social Sciences
University of Zagreb, Zagreb, Croatia

Testing and assessment of individual differences have been a critical part of the professional work of scientists and practitioners in psychology and related disciplines. It is generally acknowledged that psychological tests, along with the existing conceptualizations of measurements of human potential, are among the most valuable contributions of the behavioral sciences to society. Testing practice is for many reasons an extremely sensitive issue, and is not only a professional but also a public matter. As the decisions based on test results and their interpretations often entail important individual and societal consequences, psychological testing has been the target of substantial public attention and long-standing criticism (AERA, APA, and NCME).

The theory of psychological tests and measurement, or, as it is typically referred to, test theory or psychometric theory, offers a general framework and a set of techniques for evaluating the development and use of psychological tests. Due to their latent nature, the majority of psychological constructs are typically measured indirectly, i.e., by observing behavior on appropriate tasks or responses to test items. Different test theories have been proposed to provide rationales for behaviorally based measurement.

Classical test theory (CTT) has been the foundation of psychological test development since the turn of the twentieth century (Lord and Novick). It comprises a number of psychometric models and techniques intended to estimate theoretical parameters, including the description of different psychometric properties, such as the derivation of reliability estimates and ways to assess the validity of test use.

This knowledge is crucial if we are to make sound inferences and interpretations from the test scores.

The central notion of CTT is that any observed test score (X) can be decomposed into two additive components: a true score (T) and a random measurement error term (e). Different models of CTT have been proposed, each defined by specific sets of assumptions that determine the circumstances under which the model may be reasonably applied. Some assumptions are associated with properties of measurement error as random discrepancies between true and observed test scores, whereas others include variations of the assumption that two tests measure the same attribute. The latter assumption is essential for deducing test reliability, i.e., the ratio of true score variance to observed score variance, from the discrepancy between two measurements of the same attribute in the same person.
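The decomposition X = T + e and the reliability ratio can be illustrated with a small simulation (an added sketch, not from the entry; the variance values are arbitrary): two parallel test scores share the same true score, and their correlation recovers the reliability σ²_T/(σ²_T + σ²_e).

    import numpy as np

    rng = np.random.default_rng(2)
    n_persons = 100000
    sd_true, sd_error = 1.0, 0.8

    t = rng.normal(0, sd_true, n_persons)           # latent true scores
    x1 = t + rng.normal(0, sd_error, n_persons)     # parallel test form 1
    x2 = t + rng.normal(0, sd_error, n_persons)     # parallel test form 2

    theoretical = sd_true**2 / (sd_true**2 + sd_error**2)
    observed = np.corrcoef(x1, x2)[0, 1]            # correlation of parallel forms
    print("theoretical reliability:", round(theoretical, 3))
    print("estimated reliability:  ", round(observed, 3))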

CTT and its applications have been criticized for various weaknesses, such as the population dependence of its parameters, the focus on a single undifferentiated random error, or the arbitrary definition of test score variables. Generalizability theory (see Brennan) was developed as an extension of the classical test theory approach, providing a framework for estimating the effects of multiple sources of error or other factors determining test scores. Another generalization of CTT has been put forward within the formulation of Latent State-Trait Theory (see Steyer). Formal definitions of states and traits have been introduced, and models allowing the separation of person, situation, and/or interaction effects from the measurement error components of the test scores are presented.

A more recent development in psychometric theory, item response theory (IRT), emerged to address some of the limitations of classical test theory (Embretson and Reise). The core assumption of IRT is that the probability of a person's expected response to an item is a joint function of that person's ability, or her/his position on the latent trait, and one or more parameters characterizing the item. The response probability is presented in the form of an item characteristic curve as a function of the latent trait.
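For instance, under the commonly used two-parameter logistic model (offered here purely as an illustrative sketch; the discrimination and difficulty values are invented), the item characteristic curve gives the probability of a correct response as a function of the latent trait θ:

    import math

    def icc_2pl(theta, a, b):
        """Two-parameter logistic item characteristic curve:
        P(correct | theta) = 1 / (1 + exp(-a * (theta - b)))."""
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))

    a, b = 1.2, 0.5          # hypothetical discrimination and difficulty parameters
    for theta in (-2, -1, 0, 1, 2):
        print(f"theta = {theta:+d}: P(correct) = {icc_2pl(theta, a, b):.3f}")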

Despite the controversies and criticisms surrounding CTT, and the important advances and challenges in the field of IRT, both classical and modern test theories appear today to be widely used and are complementary in designing and evaluating psychological and educational tests.

Cross References
Psychology, Statistics in


References and Further Reading
AERA, APA, NCME () Standardi za pedagoško i psihološko testiranje (Standards for Educational and Psychological Testing). Naklada Slap, Jastrebarsko
Brennan RL () Generalizability theory. Springer, New York
Embretson SE, Reise SP () Item response theory for psychologists. LEA, Mahwah
Lord FC, Novick MR () Statistical theories of mental test scores. Addison-Wesley Publishing Company, Reading
Steyer R () Wahrscheinlichkeit und regression. Springer, Berlin

Psychology, Statistics in

Joseph S. Rossi
Professor, Director of Research at the Cancer Prevention Research Center
University of Rhode Island, Kingston, RI, USA

The use of quantitative methods in psychology has been present essentially from its beginning as an independent discipline, and many of the early developers of statistical methods, such as Galton, Pearson, and Yule, are generally considered by psychologists to be among the major contributors to the development of psychology itself. In addition, many early psychologists made major contributions to the development of statistical methods, often in the context of psychometric measurement theory and multivariate methods (e.g., Spearman, Thurstone). Among the techniques that psychologists developed or helped to develop during the early part of the twentieth century are the correlation coefficient, the chi-square test, regression analysis, factor analysis (see Factor Analysis and Latent Variable Modelling), principal components analysis, and various multivariate procedures. The use of the analysis of variance (ANOVA) in psychology did not begin until the mid-1930s and quickly became widespread.

During the succeeding decades, a kind of schism arose among psychologists, with experimental psychologists favoring the use of ANOVA techniques and psychologists interested in measurement and individual differences favoring correlation and regression techniques, culminating in Cronbach's famous declaration concerning the "two disciplines" of scientific psychology. That these procedures were both aspects of the general linear model (see General Linear Models) and essentially equivalent mathematically did not become widely known among psychologists until decades later.

A similar sort of schism with respect to models of statistical inference has been resolved with a kind of hybrid model that accommodates both the Fisher and Neyman–Pearson approaches, although in this case most researchers in psychology are completely unaware that such a schism ever existed, and that the models of statistical decision-making espoused in their textbooks and in common everyday use represent a combination of views thought completely antithetical by their original proponents. Bayesian approaches, while not unknown in psychology, remain vastly underutilized.

Statistical methods currently in common use in psychology include: the Pearson product-moment correlation coefficient, the chi-square test (see Chi-Square Tests), the t test, univariate and multivariate analysis of variance (see Analysis of Variance and Multivariate Analysis of Variance (MANOVA)) and covariance with associated follow-up procedures (e.g., the Tukey test), multiple regression, factor analysis (see Factor Analysis and Latent Variable Modelling) and principal components analysis, discriminant function analysis, path analysis, and structural equation modeling (see Structural Equation Models). Psychologists have been instrumental in the continued development and refinement of many of these procedures, particularly measurement-oriented procedures, such as item response theory, and structural equation modeling techniques, including confirmatory factor analysis, latent growth curve modeling, multiple group structural invariance modeling, and models to detect mediation and moderation effects. There is considerable emphasis on group-level data analysis using parametric statistical procedures and the assumptions of univariate and multivariate normality. The use of nonparametric procedures, once fairly common, has declined substantially in recent decades. The use of more modern nonparametric techniques and robust methods is almost unknown among applied researchers.

The Null Hypothesis Significance Testing Controversy
Common to many of the procedures in use in psychology is an emphasis on null hypothesis significance testing (NHST) and a concomitant reliance on statistical test p values for assessing the merit of scientific hypotheses. Considering its still dominant position, the use of the NHST paradigm in psychology and related disciplines has been subject to numerous criticisms over a surprisingly long period of time, starting many decades ago. Until recently, these criticisms have not gained much traction. Common objections raised against the NHST paradigm include the following:

The null is not a meaningful hypothesis and is essentially always false.


Rejection of the null hypothesis provides only weak support for the alternative hypothesis.

Failure to reject the null hypothesis does not mean that the null can be accepted, so that null results are inconclusive.

Significance test p values are misleading in that they depend largely on sample size and consequently do not indicate the magnitude or importance of the obtained effect.

The obtained p value is unrelated to, but frequently confused with, both the study alpha (α) level and 1 − α.

Reliance on p values has led to an overemphasis on the type I error rate and to the neglect of the type II error rate.

Statistical significance is not the same as scientific or practical significance.

The NHST approach encourages an emphasis on point estimates of parameter values rather than confidence intervals.

The use of the p < .05 criterion for statistical significance is arbitrary and has led to dichotomous decision making with regard to the acceptance/rejection of study hypotheses. This has resulted in the phenomenon of "publication bias," which is the tendency for studies that report statistical significance to be published while those that do not are not published, regardless of the overall quality or merit of the research.

The dichotomous decision-making approach inherent to the NHST paradigm has seriously compromised the ability of researchers to accumulate data and evidence across studies. This has hindered the development of theories in many areas of psychology, since a few negative results tend to be accorded more weight than numerous positive results.

The extent and seriousness of these criticisms has led some to suggest an outright ban on the use of significance testing in psychology. Once inconceivable, this position is receiving serious consideration in numerous articles in the most prestigious journals in psychology, has been discussed by recent working groups and task forces on quantitative methods and reporting standards in psychology, and is even the subject of one recent book. Even among those not willing to discard significance testing entirely, there is widespread agreement on a number of alternative approaches that would reduce reliance on p values. These include the use of confidence intervals, effect size indices, power analysis, and meta-analysis. Confidence intervals (see Confidence Interval) provide useful information beyond that supplied by point estimates of parameters and p values.

In psychology, the most frequently used has been the 95% confidence interval. Despite its simplicity, confidence intervals are still not widely used and reported, or even that well understood, by many applied researchers. For example, some recent studies have indicated that even when error bars are shown on graphs, it is not at all clear whether the authors intended to show standard deviations, standard errors, or confidence intervals.

Measures of effect size have been recommended as supplements to, or even as substitutes for, reporting p values. These provide a more direct index of the magnitude of study results and are not directly influenced by study sample size. Measures of effect size fall into two broad categories. The most common historically are measures of the proportion of shared variance between two (or more) variables, such as r² and R². These indices have a long history of use in research employing correlational and regression methods, particularly within the tradition of individual differences research. Within the tradition of experimental psychology, comparable measures have been available for many years but have not seen widespread use until relatively recently. Most commonly used for the t test and ANOVA are η² and ω², which index the proportion of variance in the dependent variable that is accounted for by the independent variable. Again, under the general linear model (see General Linear Models), these indices are essentially equivalent mathematically and retain their separate identities primarily for historical purposes. Multivariate analogs exist but are not well known and not often used.

A second and more recent approach to measuring the magnitude of study outcomes independent of p values is represented by standardized measures of effect size. These were developed by Cohen as early as the 1960s but did not see widespread use until relatively recently. While he developed a wide range of such indices designed to cover numerous study designs and statistical methods, by far the most widely known and used is Cohen's d. An advantage of d is its simplicity:

d = (M1 − M2)/sp

where M1 and M2 represent the means of two independent groups and sp is the pooled within-groups standard deviation. It is also easily related to proportion-of-variance measures:

η² = d²/(d² + 4).
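A brief sketch of these two formulas (the group summaries are invented, used only for illustration):

    import math

    # hypothetical summary statistics for two independent groups
    m1, m2 = 24.0, 21.5
    s1, s2 = 5.0, 4.6
    n1, n2 = 40, 40

    sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))  # pooled SD
    d = (m1 - m2) / sp
    eta_sq = d**2 / (d**2 + 4)       # proportion of variance implied by d
    print(f"Cohen's d = {d:.2f}, eta-squared = {eta_sq:.3f}")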

A disadvantage of d is that it is applicable only to the comparison of two groups. A generalized version applicable to the ANOVA F test is Cohen's f. However, this measure is much less well known and not often utilized.


As a guide to use and interpretation, Cohen developed a simple rubric for categorizing the magnitudes of effect sizes. For d, small, medium, and large effect sizes are defined as 0.2, 0.5, and 0.8, respectively. The equivalent magnitudes for the proportion of variance accounted for are about .01, .06, and .14, respectively. Cohen recognized these definitions as arbitrary, but subsequent research suggests they hold up well across a broad range of research areas in the "softer" areas of psychology.

Although methods for aggregating data across independent studies have been in use for many decades, a more formal and systematic approach did not begin to take shape until the 1970s with the independent work of Rosenthal and Glass, the latter of whom coined the term meta-analysis. While very controversial at first, and to a lesser extent still, the technique caught on rapidly as an efficient way to summarize quantitatively the results of a large number of studies, thus overcoming the heavy reliance on p values used in the more traditional narrative literature review. In principle any outcome measure can be employed; however, in practice meta-analysis relies heavily on the use of effect sizes as the common metric integrated across studies. Many studies employ the Pearson correlation coefficient r for this purpose, although Cohen's d is without question the most frequently used, primarily due to its simplicity and easy applicability to the wide range of focused two-group comparisons characteristic of many studies in psychology (e.g., control vs treatment, men vs women). It is probably fair to say that the rise of meta-analysis over the past few decades has greatly facilitated and popularized the concept of the effect size in psychology. As a consequence, a great deal of work has been conducted on d to investigate its properties as a statistical estimator. This has resulted in substantial advances in meta-analysis as a statistical procedure. Most early meta-analyses employed simple two-group comparisons of effect size across studies using a fixed effects model approach (often implicitly). Most recent applications have emphasized a regression model approach in which numerous study-level variables are quantified and used as predictors of effect size (e.g., subject characteristics, study setting, measures and methods used, study design quality, funding source, publication status and date, author characteristics, etc.). Fixed effects models still predominate, but there is growing recognition that random (or mixed) effects models may be more appropriate in many cases. A considerable array of follow-on procedures has been developed as aids in the interpretation of meta-analysis results (e.g., the effect size heterogeneity test Q, funnel and forest plots, assessment of publication bias, the fail-safe number, power analysis, etc.). When done well, meta-analysis not only summarizes the literature, it identifies gaps and provides clear suggestions for future research.
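A minimal fixed-effects sketch (the study results below are hypothetical; inverse-variance weighting is used here only as an illustration of the general idea) combines several d values into a pooled estimate:

    import math

    # hypothetical (d, variance of d) pairs from independent studies
    studies = [(0.30, 0.020), (0.55, 0.015), (0.10, 0.040), (0.42, 0.025)]

    weights = [1 / v for _, v in studies]          # inverse-variance weights
    d_bar = sum(w * d for (d, _), w in zip(studies, weights)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    print(f"pooled d = {d_bar:.3f}, "
          f"95% CI = ({d_bar - 1.96*se:.3f}, {d_bar + 1.96*se:.3f})")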

Current Trends and Future Directions
Quantitative specialists in psychology continue to work on methods and design in a number of areas. These include descriptive and exploratory data procedures and alternatives to parametric methods, such as exploratory data analysis and cluster analysis (see Cluster Analysis: An Introduction), robust methods, and computer-intensive methods. Work focusing on design includes alternatives to randomized designs, methods for field experiments and quasi-experimental designs, and the use of fractional ANOVA designs. Structural equation modeling continues to receive a lot of attention, including work on latent growth curve modeling, latent transition analysis, intensive longitudinal modeling, invariance modeling, multiple group models, multilevel models, hierarchical linear modeling, and models to detect mediation and moderation effects.

Missing data analysis and multiple imputation methods (see Multiple Imputation), especially for longitudinal designs, are also receiving considerable emphasis. The increased interest in longitudinal approaches has not been limited to group designs. The single-subject/idiographic approach, also known as the person-specific paradigm, has been the focus of much recent work. This approach focuses on change over time at the individual level, exemplified by time series analysis, intensive longitudinal modeling, dynamic factor analysis, and dynamic cluster analysis.

Psychologists also continue to conduct a great deal of work on meta-analysis and integrative data analysis, particularly on random effects and hierarchical model approaches, as well as on investigations of the properties of various effect size indices as statistical estimators and the application and development of effect size indices for a wider range of study designs. Increased use of meta-analysis by applied researchers is encouraging the use of alternatives to null hypothesis testing, including the specification of non-zero null hypotheses and the use of alternative hypotheses that predict the magnitude of the expected effect sizes.

About the Author
Joseph S. Rossi, Ph.D., is Professor and Director of the Behavioral Science Ph.D. program in the Department of Psychology, and Director of Research at the Cancer Prevention Research Center, at the University of Rhode Island. He has been principal investigator or co-investigator on numerous grants and has published many papers and chapters. The American Psychological Society and the Institute for Scientific Information have listed Dr. Rossi among the most highly ranked authors in impact (citations/paper) and in number of citations (APS Observer).


He was named one of the most highly cited researchers in the world in the fields of Psychology/Psychiatry by Thomson Reuters (http://isihighlycited.com/). He won the University of Rhode Island's Scholarly Excellence Award, was elected to membership in the Society of Multivariate Experimental Psychology, and is a fellow of a division of the American Psychological Association. Dr. Rossi is one of the principal developers of the transtheoretical model of health behavior change. His areas of interest include quantitative psychology, health promotion and disease prevention, and expert system development for health behavior change. Dr. Rossi is a member of the University of Rhode Island's Institutional Review Board.

Cross References
Confidence Interval
Cross Classified and Multiple Membership Multilevel Models
Effect Size
Factor Analysis and Latent Variable Modelling
Frequentist Hypothesis Testing: A Defense
Meta-Analysis
Moderating and Mediating Variables in Psychological Research
Multidimensional Scaling
Multidimensional Scaling: An Introduction
Null-Hypothesis Significance Testing: Misconceptions
Psychological Testing Theory
Sociology, Statistics in
Statistics: Controversies in Practice

References and Further Reading
APA Publications and Communications Board Working Group on Journal Article Reporting Standards () Reporting standards for research in psychology: why do we need them? What might they be? Am Psychol
Cohen J () Statistical power analysis for the behavioral sciences, 2nd edn. Lawrence Erlbaum, Hillsdale, NJ
Cohen J () Things I have learned (so far). Am Psychol
Cohen J () The earth is round (p < .05). Am Psychol
Cowles M () Statistics in psychology: an historical perspective. Lawrence Erlbaum, Hillsdale, NJ
Cronbach LJ () The two disciplines of scientific psychology. Am Psychol
Cumming G, Finch S () Inference by eye: confidence intervals and how to read pictures of data. Am Psychol
Cumming G, Fidler F, Leonard M, Kalinowski P, Christiansen A, Kleinig A, Lo J, McMenamin N, Wilson S () Statistical reform in psychology: is anything changing? Psychol Sci
Grissom RJ, Kim JJ () Effect sizes for research: a broad practical approach. Lawrence Erlbaum, Hillsdale, NJ
Harlow LL, Mulaik SA, Steiger JH (eds) () What if there were no significance tests? Lawrence Erlbaum, Hillsdale, NJ
Kline RB () Beyond significance testing: reforming data analysis methods in behavioral research. American Psychological Association, Washington, DC
Lipsey MW, Wilson DB () Practical meta-analysis. Sage, Thousand Oaks, CA
MacKinnon DP () Introduction to statistical mediation analysis. Lawrence Erlbaum, New York
Maxwell SE () The persistence of underpowered studies in psychological research: causes, consequences, and remedies. Psychol Meth
Meehl PE () Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. J Consult Clin Psychol
Molenaar P, Campbell CG () The new person-specific paradigm in psychology. Current Directions in Psychological Science
Nickerson RS () Null hypothesis significance testing: a review of an old and continuing controversy. Psychol Meth
Shadish WR, Cook TD () The renaissance of field experimentation in evaluating interventions. Annual Review of Psychology
Wilkinson L, and the Task Force on Statistical Inference () Statistical methods in psychology journals: guidelines and explanations. American Psychologist

Public Opinion Polls

Christopher Wlezien
Professor of Political Science
Temple University, Philadelphia, PA, USA

A public opinion poll is a survey of the views of a sample of people. It is what we use to measure public opinion in the modern day. This was not always true, of course, as non-random "straw polls" have been in regular use at least since the early nineteenth century. We thus had information even back then about what the public thought and wanted, though it was not very reliable. The development of probability sampling, and its application by George Gallup, Archibald Crossley, and Elmo Roper, changed things in important ways, as we now had more reliable information about public opinion (see Geer). The explosion of survey data since that time has fueled the growth in research on attitudes, opinion and behavior that continues today.

There are various forms of probability sampling. In simple random sampling, respondents are selected purely at random from the population. This is the most basic form of probability sampling and does a good job of representing


the population, particularly as the sample size increases and sampling error declines. In stratified random sampling, the population is divided into strata, e.g., racial or ethnic groups, and respondents are selected randomly from within the strata. This approach helps reduce sampling error across groups, which can result from simple random sampling. Traditionally, most survey organizations have relied on cluster sampling. Here the population is divided into geographic clusters, and the survey researcher draws a sample of these clusters and then samples randomly from within them. This is particularly useful when respondents are geographically dispersed. Survey organizations using any of these methods traditionally have relied on face-to-face interviews.
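To make the point about sampling error concrete (a small added illustration, not from the entry), the standard error of a sample proportion under simple random sampling shrinks with the square root of the sample size:

    import math

    p = 0.5                     # worst case for a proportion
    for n in (400, 1000, 1600):
        moe = 1.96 * math.sqrt(p * (1 - p) / n)   # 95% margin of error
        print(f"n = {n:>4}: margin of error is about +/-{moe:.3f}")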

The technology of public opinion polling has changed quite dramatically over time. The invention of random digit dialing (RDD) had an especially significant impact, as interviewing could be done over the telephone based on lists of randomly generated phone numbers. The more recent introduction of internet polling is having a similar impact. These developments have clear and increasing advantages in cost and speed, and have made it much easier to conduct polls. Witness the growth in the number of pre-election trial-heat polls in presidential election years in the United States (US) (Wlezien and Erikson). Consider also that the National Annenberg Election Survey (NAES) conducted tens of thousands of telephone interviews during each of its presidential election campaign studies. In the same election years Knowledge Networks conducted repeated interviews with large national samples via the internet. Similar developments can be seen in other countries, including Canada and the United Kingdom. These numbers would be almost inconceivable using face-to-face interviews.

The developments also come with disadvantages. To begin with, there is coverage error. Not everyone has a telephone, and the number relying solely on a cell phone – which poses special challenges for telephone surveys – is growing. Fewer have access to the internet and we cannot randomly e-mail them (note that this precludes calculations of sampling error, and thus confidence intervals; internet polls do have a number of advantages for scholarly research, however (Clarke et al.)). Even among those we can reach, nonresponse is a problem. The issue here is that respondents who select out of surveys may not be representative of the population. Survey organizations and scholars have long relied on weighting devices to address coverage and nonresponse error. (Besides sampling error and coverage and nonresponse error, all polls are subject to measurement error, which reflects flaws in the survey instrument itself, including question wording, order, interviewer training and other things.

For a treatment of the different forms of survey error, see Weissberg.) In recent years more complicated approaches have begun to be used, including "propensity scores" (see, e.g., Terhanian).

A recent analysis of polling methods in the US presidential election campaign suggested that the survey mode had little effect on poll estimates (AAPOR). The extent to which the weighting fixes used by survey organizations succeed is the subject of ongoing research – see, e.g., work on internet polls by Malhotra and Krosnick and by Sanders et al. It is of special importance given the appeal of internet surveys owing to the speed with which they can be conducted and their comparatively low cost.

Despite the difficulties, polls have performed very well. Pre-election polls have proved very accurate at predicting the final vote, particularly at the end of the campaign (Traugott). They also predict well earlier on, though it is not an identity relation, e.g., in US presidential elections, early leads tend to decline by Election Day (Wlezien and Erikson). Pre-election polls in recent election years have provided more information about the election outcome than highly touted election prediction markets (Erikson and Wlezien). Polls also tell us quite a lot about public opinion (see, e.g., Stimson; Page and Shapiro; Erikson and Tedin). Policymakers now have reliable information about the preferences of those with a stake in the policies they make. It also appears to make a difference to what they actually do (Geer).

Acknowledgments
I thank my colleague Michael G. Hagen for very helpful comments.

About the Author
Dr. Christopher Wlezien is Professor, Department of Political Science, Temple University, Philadelphia. While at Oxford University, he co-founded the Spring School in Quantitative Methods for Social Research. He is an elected member of the International Statistical Institute, and holds or has held visiting positions at Columbia University, the European University Institute, Instituto de Empresa, the Juan March Institute, McGill University, L'Institut d'Etudes Politiques de Paris, and the University of Manchester. He has authored or co-authored many papers and books, including Degrees of Democracy (Cambridge), in which he develops and tests a thermostatic model of public opinion and policy. Currently, he is founding co-editor of the Journal of Elections, Public Opinion and Parties and Associate Editor of Public Opinion Quarterly.


Cross References
Margin of Error
Nonresponse in Surveys
Questionnaire
Social Statistics
Sociology, Statistics in
Telephone Sampling: Frames and Selection Techniques

References and Further Reading
American Association for Public Opinion Research () An evaluation of the methodology of the pre-election primary polls. Available at http://aapor.org/uploads/AAPOR_Rept_FINAL-Rev---.pdf
Clarke HD, Sanders D, Stewart MC, Whiteley P () Internet surveys and national election studies: a symposium. Journal of Elections, Public Opinion and Parties
Erikson RS, Tedin KL () American public opinion. Longman, New York
Erikson RS, Wlezien C () Are political markets really superior to polls as election predictions? Public Opin Quart
Geer J () From tea leaves to public opinion polls. Columbia University Press, New York
Malhotra N, Krosnick JA () The effect of survey mode on inferences about political attitudes and behavior: comparing the ANES to internet surveys with non-probability samples. Polit Anal
Page B, Shapiro RY () The rational public. University of Chicago Press, Chicago
Sanders D, Clarke HD, Stewart MC, Whiteley P () Does mode matter for modeling political choice: evidence from the British Election Study. Polit Anal
Stimson JN () Public opinion in America: moods, cycles and swings. Westview Press, Boulder, CO
Terhanian G () Changing times, changing modes: the future of public opinion polling. Journal of Elections, Public Opinion and Parties
Traugott MW () The accuracy of the pre-election polls in the presidential election. Public Opin Quart
Weissberg H () The total survey error approach. University of Chicago Press, Chicago
Wlezien C, Erikson RS () The timeline of presidential election campaigns. J Polit

P–Values

Raymond Hubbard
Thomas F. Sheehan Distinguished Professor of Marketing
Drake University, Des Moines, IA, USA

The origin of the p-value is credited to Karl Pearson, who introduced it in connection with his chi-square test (see Chi-Square Tests). However, it was Sir Ronald Fisher who popularized significance tests and p-values in the

multiple editions of his hugely influential books Statistical Methods for Research Workers and The Design of Experiments, first published in 1925 and 1935, respectively. Fisher used divergencies in the data to reject the null hypothesis by calculating the probability of the data on a true null hypothesis, or Pr(x ∣ H0).

More formally, p = Pr(T(X) ≥ T(x) ∣ H0). The p-value is the probability of getting a test statistic T(X) larger than or equal to the observed result, T(x), as well as more extreme ones, assuming a true null hypothesis, H0, of no effect or relationship. Thus, the p-value is an index of the (im)plausibility of the actual observations (together with more extreme, unobserved ones) if the null is true, and is a random variable whose distribution is uniform over the interval [0, 1].
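The uniform-under-the-null behavior is easy to check by simulation (an added sketch; it assumes NumPy and SciPy are available and uses a one-sample z test with known variance purely for convenience):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    n, reps = 30, 10000

    # data generated under H0: mean 0, known sigma = 1
    z = rng.normal(0, 1, size=(reps, n)).mean(axis=1) * np.sqrt(n)
    p = 2 * (1 - stats.norm.cdf(np.abs(z)))        # two-sided p-values

    # under H0 the p-values are approximately Uniform(0, 1)
    print("proportion of p-values below 0.05:", (p < 0.05).mean())   # close to 0.05
    print("deciles:", np.round(np.quantile(p, np.arange(0.1, 1.0, 0.1)), 2))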

The reasoning is that if the data are viewed as being rare or extremely unlikely under H0, this constitutes inductive evidence against the null hypothesis. Fisher immortalized a p-value of .05 for rejecting the null: "It is usual and convenient for experimenters to take 5 per cent. as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard." Consequently, values like p < .01, p < .001, and so on, are said to furnish even stronger evidence against H0. So Fisher considered p-values to play an important epistemological role (Hubbard and Bayarri).

Moreover, Fisher saw the p-value as an objective measure for judging the (im)plausibility of H0:

"…the feeling induced by a test of significance has an objective basis in that the probability statement on which it is based is a fact communicable to and verifiable by other rational minds. The level of significance in such cases fulfils the conditions of a measure of the rational grounds for the disbelief [in the null hypothesis] it engenders."

Researchers across the world have enthusiastically adopted p-values as a "scientific" and "objective" criterion for certifying knowledge claims.

Unfortunately, the p-value is neither an objective nor a very useful measure of evidence in statistical significance testing (see Hubbard and Lindsay, and the references therein). In particular, p-values exaggerate the evidence against H0. Because of this, the validity of much published research with comparatively small (including .05) p-values must be called into question.

Of great concern, though obviously no fault of the index itself, members of the research community insist on investing p-values with capabilities they do not possess (for critiques of this see, among others, Carver and Nickerson). Some common misconceptions regarding the p-value are that it denotes an objective measure of:


The probability of the null hypothesis being true
The probability (in the sense of 1 − p) of the alternative hypothesis being true

The probability (again, in the sense of 1 − p) that the results will replicate

The magnitude of an effect
The substantive or practical significance of a result
The Type I error rate
The generalizability of a result

Despite its ubiquity, the p-value is of very limited use. Indeed, I agree with Nelder's assertion that the most important task in developing a helpful statistical science is "to demolish the P-value culture."

About the Author
Dr. Raymond Hubbard is Thomas F. Sheehan Distinguished Professor of Marketing in the College of Business and Public Administration, Drake University, Des Moines, Iowa, USA. He has served several terms as Chair of the Marketing Department. He is a member of the American Marketing Association, the Academy of Marketing Science, and the Association for Consumer Research. He has authored or coauthored numerous journal articles, many of them methodological in nature. He is presently working on a book (with R. Murray Lindsay), tentatively titled "From Significant Difference to Significant Sameness: A Proposal for a Paradigm Shift in Managerial and Social Science Research."

Cross References
Bayesian P-Values
Effect Size
False Discovery Rate
Marginal Probability: Its Use in Bayesian Statistics as Model Evidence
Misuse of Statistics
Null-Hypothesis Significance Testing: Misconceptions
Psychology, Statistics in
P-Values, Combining of
Role of Statistics
Significance Testing: An Overview
Significance Tests, History and Logic of
Significance Tests: A Critique
Statistical Evidence
Statistical Fallacies: Misconceptions, and Myths
Statistical Inference: An Overview
Statistical Significance
Statistics: Controversies in Practice

References and Further Reading
Carver RP () The case against statistical significance testing. Harvard Educ Rev
Fisher RA () Statistical methods and scientific inference, 2nd edn (revised). Oliver and Boyd, Edinburgh
Fisher RA () The design of experiments. Oliver and Boyd, Edinburgh
Hubbard R, Bayarri MJ () Confusion over measures of evidence (p's) versus errors (α's) in classical statistical testing (with comments). Am Stat
Hubbard R, Lindsay RM () Why P-values are not a useful measure of evidence in statistical significance testing. Theory Psychol
Nelder JA () From statistics to statistical science (with comments). Stat
Nickerson RS () Null hypothesis significance testing: a review of an old and continuing controversy. Psychol Meth
Pearson K () On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. London Edinburgh Dublin Philos Mag J Sci

P-Values, Combining of

Dinis Pestana
Professor, Faculty of Sciences
Universidade de Lisboa, DEIO, and CEAUL – Centro de Estatística e Aplicações da Universidade de Lisboa, Lisboa, Portugal

Let us assume that the p-values p_k are known for testing H_k versus H_{Ak}, k = 1, …, n, in n independent studies on some common issue, and our aim is to achieve a decision on the overall question H*: all the H_k are true, versus H*_A: some of the H_{Ak} are true. As there are many different ways in which H* can be false, selecting an appropriate test is in general unfeasible. On the other hand, combining the available p_k's so that T(p_1, …, p_n) is the observed value of a random variable whose sampling distribution under H* is known is a simple issue, since under H* the vector p is the observed value of a random sample P = (P_1, …, P_n) from a Uniform(0, 1) population. In fact, several different sensible combined testing procedures are often used.

A rational combined procedure should of course be monotone, in the sense that if one set of p-values p = (p_1, …, p_n) leads to rejection of the overall null hypothesis H*, any set of componentwise smaller p-values p′ = (p′_1, …, p′_n), p′_k ≤ p_k, k = 1, …, n, must also reject H*.


Tippett () used the fact that P_{1:n} = min(P_1, …, P_n) | H* ~ Beta(1, n) to reject H* if the minimum observed p-value satisfies p_{1:n} < 1 − (1 − α)^{1/n}. Tippett's minimum method is a special case of Wilkinson's method (Wilkinson ), which advises rejection of H* when some low rank order statistic satisfies p_{k:n} < c; as P_{k:n} ~ Beta(k, n + 1 − k), to reject H* at level α the cut-off point c is the solution of
\[
\int_0^{c} u^{k-1}(1-u)^{n-k}\,du \;=\; \alpha\, B(k,\, n+1-k).
\]
The exact distribution of the arithmetic mean P̄_n = (P_1 + ⋯ + P_n)/n is cumbersome, but for large n an approximation based on the central limit theorem (see 7Central Limit Theorems) can be used to perform an overall test of H* vs. H*_A. On the other hand, the probability density function of the 7geometric mean G_n = (P_1 ⋯ P_n)^{1/n} of n independent uniform random variables is readily computed,
\[
f_{G_n}(x) \;=\; \frac{n\,(-n\,x\ln x)^{\,n-1}}{\Gamma(n)}\, I_{(0,1)}(x),
\]
leading to a more powerful test; see, however, the discussion below on publication bias.
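As a concrete illustration of these two order-statistic rules, here is a minimal Python sketch (assuming NumPy and SciPy are available; the function names are mine, not taken from the cited sources). Tippett's critical value is 1 − (1 − α)^{1/n}, and the Wilkinson cut-off c is simply the α-quantile of the Beta(k, n + 1 − k) distribution, which solves the incomplete-beta equation above.

import numpy as np
from scipy import stats

def tippett_reject(pvalues, alpha=0.05):
    """Reject H* if the smallest p-value falls below 1 - (1 - alpha)**(1/n)."""
    p = np.asarray(pvalues)
    critical = 1.0 - (1.0 - alpha) ** (1.0 / p.size)
    return p.min() < critical

def wilkinson_reject(pvalues, k, alpha=0.05):
    """Reject H* if the k-th smallest p-value falls below the alpha-quantile
    of Beta(k, n + 1 - k), the null distribution of the k-th order statistic."""
    p = np.sort(np.asarray(pvalues))
    n = p.size
    critical = stats.beta.ppf(alpha, k, n + 1 - k)  # solves the incomplete-beta equation for c
    return p[k - 1] < critical

# Example with five independent p-values (illustrative values only)
pvals = [0.02, 0.20, 0.35, 0.60, 0.90]
print(tippett_reject(pvals))           # Tippett's minimum method
print(wilkinson_reject(pvals, k=2))    # Wilkinson's method based on the 2nd order statistic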

Another way of constructing combined p-values is to use additive properties of simple functions of uniform random variables. Fisher () used the fact that P_k ~ Uniform(0, 1) implies −2 ln P_k ~ χ²_2, and therefore
\[
-2\sum_{k=1}^{n} \ln P_k \;\Big|\; H^* \;\sim\; \chi^2_{2n}.
\]
Then H* is rejected at the significance level α if −2 ∑_{k=1}^{n} ln p_k > χ²_{2n, 1−α}. Stouffer et al. () used as test statistic
\[
\frac{1}{\sqrt{n}}\sum_{k=1}^{n}\Phi^{-1}(P_k) \;\Big|\; H^* \;\sim\; \text{Gaussian}(0, 1),
\]
where Φ^{−1} denotes the inverse of the distribution function of the standard Gaussian, rejecting H* at level α if |∑_{k=1}^{n} Φ^{−1}(p_k)|/√n > z_{1−α}.

Another simple transformation of uniform random variables P_k is the logit transformation, ln(P_k/(1 − P_k)) ~ Logistic(0, 1). As
\[
\frac{\displaystyle\sum_{k=1}^{n} \ln\frac{P_k}{1-P_k}}{\sqrt{\dfrac{n\pi^2(5n+2)}{3(5n+4)}}} \;\approx\; t_{5n+4},
\]
H* is rejected at the significance level α if
\[
-\sum_{k=1}^{n} \ln\frac{p_k}{1-p_k} \;\Big/\; \sqrt{\dfrac{n\pi^2(5n+2)}{3(5n+4)}} \;>\; t_{5n+4,\,1-\alpha}.
\]
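The three transformation-based rules just described translate directly into code. The following is a minimal Python sketch (again assuming SciPy; the function names are illustrative, and the functions return one-sided combined p-values, a slight simplification of the two-sided Stouffer rule quoted above).

import numpy as np
from scipy import stats

def fisher_combined_p(pvalues):
    """Fisher's method: -2*sum(log p) is chi-square with 2n df under H*."""
    p = np.asarray(pvalues)
    return stats.chi2.sf(-2.0 * np.sum(np.log(p)), df=2 * p.size)

def stouffer_combined_p(pvalues):
    """Stouffer's method: the sum of probits divided by sqrt(n) is standard normal under H*."""
    p = np.asarray(pvalues)
    z = np.sum(stats.norm.ppf(p)) / np.sqrt(p.size)
    return stats.norm.cdf(z)    # small p-values give a very negative z, hence a small combined value

def logit_combined_p(pvalues):
    """Logit method: the scaled sum of logits is approximately t with 5n + 4 df under H*."""
    p = np.asarray(pvalues)
    n = p.size
    statistic = -np.sum(np.log(p / (1.0 - p)))
    scale = np.sqrt(n * np.pi ** 2 * (5 * n + 2) / (3.0 * (5 * n + 4)))
    return stats.t.sf(statistic / scale, df=5 * n + 4)

pvals = [0.02, 0.20, 0.35, 0.60, 0.90]
for combine in (fisher_combined_p, stouffer_combined_p, logit_combined_p):
    print(combine.__name__, round(combine(pvals), 4))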

Birnbaum () has shown that every monotone combined test procedure is admissible, i.e., provides a most powerful test against some alternative hypothesis for combining some collection of tests, and is therefore optimal for some combined testing situation whose goal is to harmonize eventually conflicting evidence, or to pool inconclusive evidence. In the context of the social sciences, Mosteller and Bush () recommend Stouffer's method, but Littell and Folks (, ) have shown that under mild conditions Fisher's method is optimal for combining independent tests.

As in many other techniques used in 7meta-analysis,

publication bias can easily lead to erroneous conclusions. In fact, the set of available p-values comes only from studies considered worth publishing because the observed p-values were small, seeming to point out significant results. Thus the assumption that the p_k's are observations from independent Uniform(0, 1) random variables is questionable, since in general they are in fact a set of low order statistics, given that p-values greater than 0.05 have not been recorded. For instance,
\[
E\bigl(G_n^{\,k}\bigr) \;=\; \Bigl(1+\frac{k}{n}\Bigr)^{-n} \;\xrightarrow[n\to\infty]{}\; e^{-k},
\]
and in particular E(G_n) = (n/(n + 1))^n decreases to e^{−1} ≈ 0.37 as n → ∞; the standard deviation decreases to zero, the skewness steadily decreases after an early maximum, and the kurtosis increases with n towards its limiting value. Whenever p_{n:n}, the largest reported p-value, falls below the critical rejection point, this test will lead to the rejection of H*; but a p_{n:n} smaller than the critical point (recall that the expected value of G_n exceeds e^{−1} ≈ 0.37 for every n, while its standard deviation is small) is precisely what should be expected as a consequence of publication bias.
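A small Monte Carlo sketch of the publication-bias effect just described (my own illustration, not taken from the cited references): under H* the geometric mean of n uniform p-values concentrates near e^{−1}, but if only p-values below 0.05 are recorded (so that, conditionally, they are Uniform(0, 0.05)), the observed geometric mean is far smaller and, for instance, Fisher's method rejects essentially always.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, replications = 10, 5_000

def fisher_combined_p(pvalues):
    # Fisher's chi-square combination of independent p-values.
    return stats.chi2.sf(-2.0 * np.sum(np.log(pvalues)), df=2 * len(pvalues))

geo_all, geo_published, reject_published = [], [], 0
for _ in range(replications):
    p = rng.uniform(size=n)                  # all n null hypotheses are true
    published = rng.uniform(size=n) * 0.05   # only p-values below 0.05 are recorded: Uniform(0, 0.05)
    geo_all.append(stats.gmean(p))
    geo_published.append(stats.gmean(published))
    reject_published += fisher_combined_p(published) < 0.05

print("mean geometric mean, all p-values:     ", round(np.mean(geo_all), 3))        # about (n/(n+1))**n, i.e. 0.39 for n = 10
print("mean geometric mean, published only:   ", round(np.mean(geo_published), 3))  # far smaller
print("Fisher rejection rate, published only: ", reject_published / replications)   # essentially 1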

Another important issue: H*_A states that some of the H_{Ak} are true, and so a meta-decision on H* implicitly assumes that some of the P_k may have a non-uniform distribution; cf. Hartung et al. (, pp. –) and Kulinskaya et al. (, pp. –), and references therein, on the promising concepts of generalized and of random p-values. Gomes et al. () investigated the effect of augmenting the available set of p-values with uniform and with non-uniform pseudo-p-values, using results such as the following. Let X_{1,m₁} and X_{2,m₂} be independent random variables, X_{i,m} denoting a random variable with probability density function
\[
f_m(x) \;=\; \Bigl(mx + 1 - \frac{m}{2}\Bigr)\, I_{(0,1)}(x), \qquad m \in [-2, 2],
\]
i.e., a convex mixture of uniform and Beta(1, 2) (if m ∈ [−2, 0)), thus favoring pseudo-p-values near 0, the more so the sharper the slope is, with m = 0 corresponding to the standard uniform, or of uniform and Beta(2, 1) (if m ∈ (0, 2]), in this case favoring the occurrence of p-values near 1. Then min(X_{1,m₁}/X_{2,m₂}, (1 − X_{1,m₁})/(1 − X_{2,m₂})) is a member of the same family – more precisely, with parameter m₁m₂/2. In particular, if either m₁ = 0 or m₂ = 0, then min(X_{1,m₁}/X_{2,m₂}, (1 − X_{1,m₁})/(1 − X_{2,m₂})) can be used to generate a new set of uniform random variables, which moreover are independent of the ones used to generate them.
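The simplest case of this construction (m₁ = m₂ = 0, i.e., both parents uniform) is easy to check by simulation. The sketch below (my own illustration) draws two independent uniform samples, forms min(U₁/U₂, (1 − U₁)/(1 − U₂)), tests the result for uniformity with a Kolmogorov–Smirnov test, and reports its near-zero sample correlation with each of the generating variables.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
u1, u2 = rng.uniform(size=100_000), rng.uniform(size=100_000)

# Offspring variable built from two independent uniforms (the m1 = m2 = 0 case above).
offspring = np.minimum(u1 / u2, (1.0 - u1) / (1.0 - u2))

# Kolmogorov-Smirnov test against Uniform(0, 1): a large p-value gives no evidence against uniformity.
print(stats.kstest(offspring, "uniform"))

# Sample correlations with the generating uniforms are near zero.
print(np.corrcoef(offspring, u1)[0, 1], np.corrcoef(offspring, u2)[0, 1])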

Extensive simulation, namely with computationally augmented samples of p-values (Gomes et al. ;


Brilhante et al. ) led to the conclusion that, in what concerns decreasing power and the increasing number of unreported cases needed to reverse the overall conclusion of a meta-analysis, the methods of combining p-values rank as follows:

1. Arithmetic mean
2. 7Geometric mean
3. Chi-square transformation (Fisher's method)
4. Logistic transformation
5. Gaussian transformation (Stouffer's method)
6. Selected order statistics (Wilkinson's method)
7. Minimum (Tippett's method)
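This ranking can be probed with a rough simulation such as the one below (my own sketch; the alternative used, a one-sided normal mean shift in only some of the studies, is an assumption of this illustration and not the design of the cited studies). It estimates the empirical power of Fisher's, Stouffer's, and Tippett's rules at level α = 0.05.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_studies, n_false_nulls, shift = 8, 2, 1.5    # assumptions of this illustration
alpha, reps = 0.05, 20_000

power = {"fisher": 0, "stouffer": 0, "tippett": 0}
for _ in range(reps):
    means = np.r_[np.full(n_false_nulls, shift), np.zeros(n_studies - n_false_nulls)]
    p = stats.norm.sf(rng.normal(loc=means))   # one-sided p-values; some nulls are false
    power["fisher"] += -2 * np.sum(np.log(p)) > stats.chi2.ppf(1 - alpha, 2 * n_studies)
    power["stouffer"] += np.sum(stats.norm.ppf(p)) / np.sqrt(n_studies) < stats.norm.ppf(alpha)
    power["tippett"] += p.min() < 1 - (1 - alpha) ** (1 / n_studies)

for method, count in power.items():
    print(method, round(count / reps, 3))      # empirical power at level alpha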

About the Author
Professor Dinis Pestana has been President of the Department of Statistics and Operations Research, University of Lisbon, for two consecutive terms (–), President of the Faculty of Sciences of Lisbon extension in Madeira (–) before the University of Madeira was founded, and President of the Center of Statistics and Applications, Lisbon University, for three consecutive terms (–). He supervised the Ph.D. studies of many students who played a leading role in the development of Statistics at Portuguese universities, and is co-author of numerous papers published in international journals or as chapters of books, as well as many other papers, namely explaining Statistics to the layman. He launched the annual meetings of the Portuguese Statistical Society, and had a leading role in the local organization of international events, such as the European Meeting of Statisticians in Funchal.

Cross References
7Bayesian P-Values
7Meta-Analysis
7P-Values

References and Further Reading
Birnbaum A () Combining independent tests of significance. J Am Stat Assoc :–
Brilhante MF, Pestana D, Sequeira F () Combining p-values and random p-values. In: Luzar-Stiffler V, Jarec I, Bekic Z (eds) Proceedings of the ITI , nd International Conference on Information Technology Interfaces, IEEE CFP-PRT, pp –
Fisher RA () Statistical methods for research workers, th edn. Oliver and Boyd, London
Gomes MI, Pestana D, Sequeira F, Mendonça S, Velosa S () Uniformity of offsprings from uniform and non-uniform parents. In: Luzar-Stiffler V, Jarec I, Bekic Z (eds) Proceedings of the ITI , st International Conference on Information Technology Interfaces, pp –
Hartung J, Knapp G, Sinha BK () Statistical meta-analysis with applications. Wiley, New York
Kulinskaya E, Morgenthaler S, Staudte RG () Meta analysis: a guide to calibrating and combining statistical evidence. Wiley, Chichester
Littell RC, Folks LJ (, ) Asymptotic optimality of Fisher's method of combining independent tests, I and II. J Am Stat Assoc :–, :–
Mosteller F, Bush R () Selected quantitative techniques. In: Lindzey G (ed) Handbook of social psychology: theory and methods, vol I. Addison-Wesley, Cambridge
Stouffer SA, Schuman EA, DeVinney LC, Star S, Williams RM () The American Soldier, vol I: Adjustment during army life. Princeton University Press, Princeton
Tippett LHC () The methods of statistics. Williams and Norgate, London
Wilkinson B () A statistical consideration in psychological research. Psychol Bull :–

Pyramid Schemes

Robert T. Smythe
Professor
Oregon State University, Corvallis, OR, USA

A pyramid scheme is a business model in which payment is made primarily for enrolling other people into the scheme. Some schemes involve a legitimate business venture, but in others no product or services are delivered. A typical pyramid scheme combines a plausible business opportunity (such as a dealership) with a recruiting operation that promises substantial rewards. A recruited individual makes an initial payment, and can earn money by recruiting others who also make a payment; the recruiter receives part of these receipts, and a cut of future payments as the new recruits go on to recruit others. In reality, because of the geometrical progression of (hypothetical) recruits, few participants in a pyramid scheme will be able to recruit enough others to recover their initial investment, let alone make a profit, because the pool of potential recruits is rapidly exhausted.

Although they are illegal in many countries, pyramid schemes have existed for over a century. As recently as November 2008, riots broke out in several towns in Colombia after the collapse of several pyramid schemes, and Ireland subsequently launched a website to better educate consumers about pyramid fraud after a series of schemes were perpetrated in Cork and Galway.


Perhaps the best-known type of pyramid scheme is a chain letter, which often does not involve even a fictitious product. A chain letter may contain k names; purchasers of the letter invest $2x, with $x paid to the name at the top of the letter and $x to the seller of the letter. The purchaser deletes the name at the top of the list, adds his own at the bottom, and sells the letter to new recruits. The promoter's pitch is that if the purchaser, and each subsequent recruit for k − 1 stages, sells just two letters, there will be 2^{k−1} people selling 2^k letters featuring the purchaser's name at the top of the list, so that the participant would net $2^k x from the venture. Many variants of this basic "get rich quick" scheme have been, and continue to be, promoted.

A structure that can be used to model many pyramid schemes is that of recursive trees. A tree with n vertices labeled 1, 2, …, n is a recursive tree if node 1 is distinguished as the root, and for each j with 2 ≤ j ≤ n, the labels of the vertices in the unique path from the root to node j form an increasing sequence. The special case of random or uniform recursive trees, in which all trees in the set of trees of given order n are equally probable, has been extensively analyzed (cf. Smythe and Mahmoud (), for example); however, most pyramid schemes or chain letters in practice have restrictions making their probability models non-uniform. The number of places where the next node may join the tree is then a random variable, unlike the uniform case. This complicates the analysis considerably (and may account for the relative sparsity of mathematical analysis of the properties of pyramid schemes).

Bhattacharya and Gastwirth () analyze a chain letter scheme allowing reentry, in which each purchaser may sell only two letters, unless he purchases a new letter to re-enter the chain. In terms of recursive trees, this means that a node of the tree is saturated once it has two offspring nodes, and no further nodes can attach to it. It is further assumed that at each stage, participants who have not yet sold two letters all have an equal chance to make the next sale, i.e., all unsaturated nodes of the recursive tree have an equal chance of being the "parent" of the next node to be added. If L_n denotes the number of leaves of the recursive tree (nodes with no offspring) at stage n under this growth rule, L_n/n corresponds to the proportion of "shutouts" (those receiving no revenue) in this chain letter scheme. The analysis of Bhattacharya and Gastwirth sets up a nonhomogeneous Markov chain model and derives a diffusion approximation for large n. They find that L_n/n converges to (3 − √5)/2 ≈ 0.38 in this model and that the (centered and scaled) number of shutouts has a normally distributed limit. Mahmoud () considers the height h_n of the tree of order n in this same "random pyramid" scheme and shows that, suitably normalized, it converges with probability one to a constant; the proof involves embedding the discrete-time growth process of the pyramid in a continuous-time birth-and-death process. Mahmoud notes that a similar analysis could be carried out for schemes permitting the sale of m letters, provided that the probabilistic behavior of the total number of shutouts could be derived (as it was in the binary case by Bhattacharya and Gastwirth).

Gastwirth () and Gastwirth and Bhattacharya () analyze another variant of pyramid schemes, known as a quota scheme. This places a limit on the maximum number of participants, so that the scheme corresponds to a recursive tree of some fixed size n. This scheme derives from a case in a Connecticut court (Naruk ) in which people bought dealerships in a "Golden Book of Values," then were paid to recruit other dealers. In this scheme, each participant receives a commission from all of his descendants; thus for the jth participant, the size of the branch of the tree rooted at j determines his profit. If S_j denotes the size of this branch, and j/n converges to a limit θ, Gastwirth and Bhattacharya showed that the distribution of S_j converges to the geometric law
\[
P(S_j = i + 1) \;=\; \theta(1-\theta)^{i}, \qquad i = 0, 1, 2, \dots
\]
(It was later shown (Mahmoud and Smythe ()) that if j is fixed, the limiting distribution of S_j/n is Beta(1, j − 1).) Calculations made by Gastwirth and Bhattacharya show, for example, that for a quota scheme of fixed size a typical participant has only a modest probability of recruiting two or more new entrants, and a smaller probability still of recruiting three or more. Gastwirth () shows that for large n, the expected proportion of all participants who are able to recruit at least r persons is 2^{−r}.

Other variants of pyramid schemes include the "8-Ball Model" and the "2-Up System" (http://www.mathmotivation.com/money/pyramid-scheme.html). In the eight-Ball model, the participant again recruits two new entrants, but does not receive any payment until two further levels have been successfully recruited. Thus a person at any level in the scheme would theoretically receive 2³ = 8 times his "participation fee," providing incentive to help those in lower levels succeed. In the two-Up scheme, the income from a participant's first two recruits goes to the individual who recruited the participant; if the participant succeeds in recruiting three or more new entrants, the income received from these goes to the participant, along with the income from the first two sales made by each subsequent recruit. This scheme creates considerable incentive to pursue the potentially lucrative third recruit. For both of these schemes, it is easily calculated that when the pool of prospective recruits is exhausted, the majority of the participants in the scheme end up losing money.
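The reentry chain-letter model of Bhattacharya and Gastwirth is straightforward to simulate. The sketch below (my own illustration of the growth rule as described above, not code from the cited papers) grows a random binary pyramid by attaching each new participant to a uniformly chosen unsaturated node and reports the proportion of shutouts L_n/n, which settles near (3 − √5)/2 ≈ 0.38.

import random

def shutout_proportion(n, seed=0):
    """Grow a random binary pyramid with n nodes, attaching each new node to a
    uniformly chosen unsaturated node, and return the proportion of leaves (shutouts)."""
    rng = random.Random(seed)
    children = [0]       # node 0 is the root (the scheme's promoter)
    unsaturated = [0]    # nodes that have sold fewer than two letters
    for new_node in range(1, n):
        idx = rng.randrange(len(unsaturated))
        parent = unsaturated[idx]
        children[parent] += 1
        if children[parent] == 2:               # parent is now saturated
            unsaturated[idx] = unsaturated[-1]  # O(1) removal by swap-and-pop
            unsaturated.pop()
        children.append(0)
        unsaturated.append(new_node)
    leaves = sum(1 for c in children if c == 0)
    return leaves / n

print(shutout_proportion(100_000))   # close to (3 - 5 ** 0.5) / 2, i.e., about 0.382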


About the Author
Robert Smythe is Professor of Statistics at Oregon State University. He was Chair of the Department of Statistics for ten years and previously chaired the Department of Statistics at George Washington University for eight years. He has written in many areas of probability and statistics, and is a Fellow of the American Statistical Association and a Fellow of the Institute of Mathematical Statistics.

Cross References
7Beta Distribution
7Geometric and Negative Binomial Distributions
7Markov Chains

References and Further Reading
Bhattacharya P, Gastwirth J () A non-homogeneous Markov model of a chain letter scheme. In: Rizvi M, Rustagi J, Siegmund D (eds) Recent advances in statistics: papers in honor of Herman Chernoff. Academic, New York, pp –
Gastwirth J () A probability model of a pyramid scheme. Am Stat :–
Gastwirth J, Bhattacharya P () Two probability models of pyramids or chain letter schemes demonstrating that their promotional claims are unreliable. Oper Res :–
Mahmoud H () A strong law for the height of random binary pyramids. Ann Appl Probab :–
Mahmoud H, Smythe RT () On the distribution of leaves in rooted subtrees of recursive trees. Ann Probab :–
http://www.mathmotivation.com/money/pyramid-scheme.html
Naruk H () Memorandum of decision: State of Connecticut versus Bull Investment Group, Conn. Sup.
Smythe RT, Mahmoud H () A survey of recursive trees. Theor Probab Math Stat :–
