LSAC RESEARCH REPORT SERIES
Distributions of Kullback–Leibler Divergence and Its Application for the LSAT

Dmitry I. Belov, Law School Admission Council
Ronald D. Armstrong, Rutgers University

Law School Admission Council Research Report 09-02
October 2009
A Publication of the Law School Admission Council
The Law School Admission Council (LSAC) is a nonprofit corporation that provides unique, state-of-the-art admission products and services to ease the admission process for law schools and their applicants worldwide. More than 200 law schools in the United States, Canada, and Australia are members of the Council and benefit from LSAC's services. ©2010 by Law School Admission Council, Inc. LSAT, The Official LSAT PrepTest, The Official LSAT SuperPrep, ItemWise, and LSAC are registered marks of the Law School Admission Council, Inc. Law School Forums and LSAC Credential Assembly Service are service marks of the Law School Admission Council, Inc. 10 Actual, Official LSAT PrepTests; 10 More Actual, Official LSAT PrepTests; The Next 10 Actual, Official LSAT PrepTests; The New Whole Law School Package; ABA-LSAC Official Guide to ABA-Approved Law Schools; Whole Test Prep Packages; LLM Credential Assembly Service; ACES
2; ADMIT-LLM; Law School Admission Test; and Law School Admission Council are trademarks of
the Law School Admission Council, Inc. All rights reserved. No part of this work, including information, data, or other portions of the work published in electronic form, may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying and recording, or by any information storage and retrieval system, without permission of the publisher. For information, write: Communications, Law School Admission Council, 662 Penn Street, Box 40, Newtown, PA, 18940-0040. LSAC fees, policies, and procedures relating to, but not limited to, test registration, test administration, test score reporting, misconduct and irregularities, Credential Assembly Service (CAS), and other matters may change without notice at any time. Up-to-date LSAC policies and procedures are available at LSAC.org, or you may contact our candidate service representatives.
Table of Contents
Executive Summary
Introduction
Distributions of Kullback–Leibler Divergence
    Lemma 1
    Proposition 1
    Proposition 2
    Proposition 3
    Proposition 4
    Proposition 5
    Lemma 2
Bayesian Posteriors
Simulated Experiments
    Illustration of Proposition 1
    Illustration of Proposition 2
    Illustration of Proposition 3
Application for Test Security
Discussion
References
Acknowledgments
Executive Summary
It is crucial for any testing organization to be able to recognize inconsistencies in test-taker performance across multiple standardized test administrations or across sections within a single administration. One method of recognizing inconsistent performance is to construct multiple posterior distributions and to compare the divergence between pairs of such distributions. (Note: In the context of a test taker’s performance, a posterior distribution summarizes what we know about the probability associated with certain levels of performance by the test taker.)
Comparing posteriors takes into account all information available from the responses and allows consideration of various partitions of the test taker’s responses to test questions (items). Practical partitions include scored versus unscored items, hard versus easy items, unexposed versus exposed items, uncompromised versus compromised items, and items of one type versus items of another type. A large divergence between posteriors indicates a significant change in a test taker’s performance. Such changes could be an indication of answer copying, item pre-knowledge, or test-taker pre-identification of the unscored section.
This paper evaluates the use of Kullback–Leibler divergence (KLD) to compare posterior distributions in the context of the Law School Admission Test. The statistical characteristics of KLD are presented and evaluated. KLD has been applied for this purpose in the context of magnetic resonance imaging, human gene analysis, stochastic complexity, and sample size selection. It is concluded that the properties of KLD support its use in the comparison of posterior distributions to identify inconsistent test-taker response cases.
Introduction
Various approaches can be used to assess the divergence between two probability density functions $g(x)$ and $h(x)$, such as the $\varphi$-divergence or $(h, \varphi)$-divergence measures (Pardo, 2006).
Kullback–Leibler divergence (KLD) is popular because it is a likelihood ratio approach that provides the relative entropy of a distribution with respect to a reference distribution. KLD is always non-negative and equals zero if and only if the two distributions are identical. The definition of KLD is valid for both discrete and continuous distributions.
KLD has been applied in theory and practice. Arizono and Ohta (1989) used order statistics and KLD to test for the normality of a distribution based on sampling. Song (2002) also used order statistics and KLD to create a new nonparametric goodness-of-fit test. Li and Wang (2008) proposed a test for homogeneity based on KLD. Clarke (1999) applied KLD for stochastic complexity and sample size selection. Lin, Pittman, and Clarke (2007) used KLD to optimize sampling with an application in bioinformatics. Cabella et al. (2008) and Volkau, Bhanu Prakash, Anand, Aziz, and Nowinski (2006) presented applications of KLD for magnetic resonance imaging. Belov, Pashley, Lewis, and Armstrong (2007) used KLD to compare the performance of a test taker on two separate tests.
Generally, the theoretical distribution of KLD is unknown. A common approach to compute a critical value is via simulation (Arizono & Ohta, 1989; Li & Wang, 2008; Cabella et al., 2008).
Another approach is based on approximation. For example, Cabella et al. (2008) found that the distribution of the KLD mean was closely estimated by a gamma distribution. Belov and Armstrong (2009) used a lognormal approximation of KLD to ensure a low type I error rate.
The following is the only analytical result on the distribution of KLD. Consider a parametric family $g(x) = h(x) = f(x \mid \tau)$, where $\tau$ is an $m_0$-dimensional vector of parameters. Let $\hat\tau$ be a maximum likelihood estimator of $\tau$ from a sample of size $n_0$. Kupperman (1957) proved that under the hypothesis $\tau = \tau_0$, $2n_0$ times the KLD between $f(x \mid \hat\tau)$ and $f(x \mid \tau_0)$ is asymptotically chi-square with $m_0$ degrees of freedom. Salicru, Morales, Menendez, and Pardo (1994) generalized this result to $\varphi$-divergence and $(h, \varphi)$-divergence. Unfortunately, this result is unusable in our case, because in psychometrics $\tau_0$ is the latent trait and it is unknown.
If $g(x)$ and $h(x)$ represent normal, exponential, or Poisson distributions, the resultant KLD has a closed form (Li & Wang, 2008), which simplifies an analysis of the KLD distribution. Chang and Stout (1993) proved that in unidimensional item response theory (IRT), the posterior of the latent trait is asymptotically normal for a long test. Gelman, Carlin, Stern, and Rubin (2004) discuss the asymptotic normality of the posterior distribution, in general, for data that are independent and identically distributed (i.i.d.).
Distributions of Kullback–Leibler Divergence
Let $g(x)$ and $h(x)$ be two probability density functions defined over the same support. The KLD (Cover & Thomas, 1991; Kullback & Leibler, 1951) between the two distributions is computed by the following formula:

$$D(g \,\|\, h) = \int_{-\infty}^{+\infty} g(x)\,\ln\frac{g(x)}{h(x)}\,dx. \qquad (1)$$

This definition of KLD takes the $g(x)$ distribution to be the dominant distribution of the pair; that is, the probability weights for the log ratio come from $g(x)$. The symmetric KLD defined as $D(g \,\|\, h) + D(h \,\|\, g)$ is discussed at the end of this section.
Lemma 1
Assume that $g(x)$ and $h(x)$ are normal probability density functions:

$$g(x) = \frac{1}{\sigma_g\sqrt{2\pi}}\,e^{-\frac{(x-\mu_g)^2}{2\sigma_g^2}} \qquad (2)$$

$$h(x) = \frac{1}{\sigma_h\sqrt{2\pi}}\,e^{-\frac{(x-\mu_h)^2}{2\sigma_h^2}}. \qquad (3)$$

Then Equation (1) reduces to the following:

$$D(g \,\|\, h) = \frac{1}{2}\left(\ln\frac{\sigma_h^2}{\sigma_g^2} + \frac{\sigma_g^2}{\sigma_h^2} + \frac{(\mu_g-\mu_h)^2}{\sigma_h^2} - 1\right). \qquad (4)$$

Let $\sigma_g^2 = \varphi\sigma_h^2$. Then Equation (4) transforms to the following:

$$D(g \,\|\, h) = \frac{1}{2}\left(\ln\frac{1}{\varphi} + \varphi + \frac{(\mu_g-\mu_h)^2}{\sigma_h^2} - 1\right). \qquad (5)$$

If $\varphi = 1$, then Equation (5) transforms to the following:

$$D(g \,\|\, h) = \frac{(\mu_g-\mu_h)^2}{2\sigma_h^2}. \qquad (6)$$
Proof
$$D(g \,\|\, h) = \int_{-\infty}^{+\infty} g(x)\,\ln\frac{g(x)}{h(x)}\,dx = \int_{-\infty}^{+\infty} g(x)\left(\ln\frac{\sigma_h}{\sigma_g} - \frac{(x-\mu_g)^2}{2\sigma_g^2} + \frac{(x-\mu_h)^2}{2\sigma_h^2}\right)dx.$$

Under $g(x)$, the expectation of $(x-\mu_g)^2$ is $\sigma_g^2$, and the expectation of $(x-\mu_h)^2 = (x-\mu_g)^2 + 2(x-\mu_g)(\mu_g-\mu_h) + (\mu_g-\mu_h)^2$ is $\sigma_g^2 + (\mu_g-\mu_h)^2$, since the cross term integrates to zero. Therefore,

$$D(g \,\|\, h) = \ln\frac{\sigma_h}{\sigma_g} - \frac{1}{2} + \frac{\sigma_g^2 + (\mu_g-\mu_h)^2}{2\sigma_h^2} = \frac{1}{2}\left(\ln\frac{\sigma_h^2}{\sigma_g^2} + \frac{\sigma_g^2}{\sigma_h^2} + \frac{(\mu_g-\mu_h)^2}{\sigma_h^2} - 1\right).$$
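As a numerical sanity check on Lemma 1 (not part of the original report), the closed form (4) can be compared against a direct Riemann-sum evaluation of the integral in (1); the function names below are ours.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def kld_numeric(mu_g, sig_g, mu_h, sig_h, lo=-20.0, hi=20.0, n=80000):
    """Equation (1) evaluated by a midpoint Riemann sum over a wide grid."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        g = normal_pdf(x, mu_g, sig_g)
        if g > 0.0:
            total += g * math.log(g / normal_pdf(x, mu_h, sig_h)) * dx
    return total

def kld_closed(mu_g, sig_g, mu_h, sig_h):
    """Equation (4): closed-form KLD between two normal densities."""
    return 0.5 * (math.log(sig_h ** 2 / sig_g ** 2)
                  + sig_g ** 2 / sig_h ** 2
                  + (mu_g - mu_h) ** 2 / sig_h ** 2
                  - 1.0)
```

For equal variances the closed form collapses to (6); for example, `kld_closed(1.0, 1.0, 0.0, 1.0)` returns 0.5, matching $(1-0)^2/(2 \cdot 1)$.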
The value of $D(g \,\|\, h)$ from (5) is equal to $D(g \,\|\, h)$ from (6) shifted by the non-negative constant $\frac{1}{2}(\varphi - \ln\varphi - 1)$. When the value of $\varphi$ is close to 1, the shift is small (see experiments with simulated and real data below). For example, $\varphi = 0.8$ gives a shift of 0.01 and $\varphi = 1.2$ gives a shift of 0.01. Therefore, the primary analysis will be given for the distribution of $D(g \,\|\, h)$ from (6), and it will be shown that the derived results are applicable in practice.

Let the values of $D(g \,\|\, h)$ from (6) be represented by the random variable $X$. To analyze the asymptotic distribution of $X$ in general populations, we have to assume that $g(x)$ and $h(x)$ are random. Since they are both from the same parametric family, it is sufficient to assume that $\mu_g$, $\mu_h$, $\sigma_g^2$, and $\sigma_h^2$ are random variables with means $E(\mu_g)$, $E(\mu_h)$, $E(\sigma_g^2)$, and $E(\sigma_h^2)$ and with variances $Var(\mu_g)$, $Var(\mu_h)$, $Var(\sigma_g^2)$, and $Var(\sigma_h^2)$, respectively. We will consider dependence only between $\mu_g$ and $\mu_h$. Then $\mu_g - \mu_h$ is a random variable $D$ with mean and variance

$$E(D) = E(\mu_g) - E(\mu_h),$$
$$Var(D) = Var(\mu_g) + Var(\mu_h) - 2\,Cov(\mu_g, \mu_h). \qquad (7)$$
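The size of that shift is easy to verify numerically; the short check below (ours, not from the report) evaluates the constant $\frac{1}{2}(\varphi - \ln\varphi - 1)$ at the two quoted values of $\varphi$.

```python
import math

def kld_shift(phi):
    """Shift between (5) and (6): 0.5 * (phi - ln(phi) - 1).
    Non-negative for phi > 0 and exactly zero at phi = 1."""
    return 0.5 * (phi - math.log(phi) - 1.0)

print(round(kld_shift(0.8), 2))  # 0.01
print(round(kld_shift(1.2), 2))  # 0.01
```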
Proposition 1
Let $D = \mu_g - \mu_h$ be normally distributed and let $Var(\sigma_h^2) = 0$, so that $\sigma_h^2$ is constant, denoted by $s$. Then $X$ is distributed as a scaled noncentral chi-square with one degree of freedom.

Proof

From the conditions of the proposition it follows that

$$X = \frac{(\mu_g - \mu_h)^2}{2\sigma_h^2} = \frac{Var(D)}{2s}\cdot\frac{D^2}{Var(D)} = \frac{1}{\omega}\,\chi_1^2(\lambda), \qquad (8)$$

where $\omega = \frac{2s}{Var(D)}$ is a scale and $\chi_1^2(\lambda)$ is a noncentral chi-square random variable with one degree of freedom and noncentrality parameter $\lambda = \frac{E(D)^2}{Var(D)}$. The cumulative distribution function (CDF) of $X$ is expressed as follows (Johnson, Kotz, & Balakrishnan, 1995):

$$F(x) = e^{-\lambda/2}\sum_{i=0}^{\infty}\frac{(\lambda/2)^i}{i!\;2^{1/2+i}\,\Gamma(1/2+i)}\int_0^{\omega x} t^{\,i-1/2}\,e^{-t/2}\,dt, \qquad x \ge 0. \qquad (9)$$
Given a sample of $X$, the method of moments can be applied to calibrate the distribution from Proposition 1. Let $E(X)$ be the expectation and $Var(X)$ the variance. Then $E(X) = \frac{1+\lambda}{\omega}$ and $Var(X) + E(X)^2 = \frac{2(1+2\lambda) + (1+\lambda)^2}{\omega^2}$. The scale is $\omega = \frac{2E(X)}{Var(X)} + \alpha$, where

$$\alpha = \sqrt{\frac{4E(X)^2}{Var(X)^2} - \frac{2}{Var(X)}} \ge 0.$$

The noncentrality parameter is $\lambda = \omega E(X) - 1$.
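The calibration above can be sketched in a few lines; below it is checked against the exact moments of a scaled noncentral chi-square with known parameters (the function name is ours).

```python
import math

def calibrate(mean_x, var_x):
    """Method-of-moments estimates of (omega, lam) for X = chi2_1(lam) / omega,
    using E(X) = (1 + lam) / omega and Var(X) = 2(1 + 2 lam) / omega**2."""
    alpha = math.sqrt((2 * mean_x / var_x) ** 2 - 2 / var_x)  # alpha >= 0
    omega = 2 * mean_x / var_x + alpha
    lam = omega * mean_x - 1
    return omega, lam

# Exact moments for omega = 0.5, lam = 1.5, then recover both parameters.
omega_true, lam_true = 0.5, 1.5
mean_x = (1 + lam_true) / omega_true              # 5.0
var_x = 2 * (1 + 2 * lam_true) / omega_true ** 2  # 32.0
omega_hat, lam_hat = calibrate(mean_x, var_x)
print(omega_hat, lam_hat)  # 0.5 1.5
```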
Proposition 2
Let $D = \mu_g - \mu_h$ be normally distributed with $E(\mu_g) = E(\mu_h)$, and let $Var(\sigma_h^2) = 0$, so that $\sigma_h^2$ is constant, denoted by $s$. Then $X$ is distributed as a scaled chi-square with one degree of freedom.

Proof

From the conditions of the proposition it follows that

$$X = \frac{(\mu_g - \mu_h)^2}{2\sigma_h^2} = \frac{Var(D)}{2s}\cdot\frac{D^2}{Var(D)} = \frac{1}{\omega}\,\chi_1^2, \qquad (10)$$

where $\omega = \frac{2s}{Var(D)}$ is a scale and $\chi_1^2$ is a chi-square random variable with one degree of freedom. The CDF of $X$ is expressed as follows (Johnson, Kotz, & Balakrishnan, 1994):

$$F(x) = \frac{1}{2^{1/2}\,\Gamma(1/2)}\int_0^{\omega x} t^{-1/2}\,e^{-t/2}\,dt, \qquad x \ge 0. \qquad (11)$$

Given a sample of $X$, the method of moments can be applied to calibrate the distribution from Proposition 2. Then the scale is $\omega = 1/E(X)$.
Proposition 3
Let $\frac{\mu_g - \mu_h}{\sqrt{2}\,\sigma_h} \sim N(0, 1)$; then $X$ is distributed as chi-square with one degree of freedom.

Proof

From the conditions of the proposition it follows that

$$X = \frac{(\mu_g - \mu_h)^2}{2\sigma_h^2} = \chi_1^2, \qquad (12)$$

where, since $\frac{\mu_g - \mu_h}{\sqrt{2}\,\sigma_h} \sim N(0, 1)$, $\chi_1^2$ is a chi-square random variable with one degree of freedom.
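Proposition 3 is easy to check by simulation (our sketch, not from the report): draw $D \sim N(0, 2\sigma_h^2)$, form $X$ as in (12), and compare its first two moments with those of $\chi_1^2$ (mean 1, variance 2).

```python
import random

random.seed(7)
sigma_h2 = 0.4   # an arbitrary fixed value of sigma_h^2
n = 200000
xs = []
for _ in range(n):
    d = random.gauss(0.0, (2 * sigma_h2) ** 0.5)  # D ~ N(0, 2 sigma_h^2)
    xs.append(d * d / (2 * sigma_h2))             # X from Equation (12)

mean_x = sum(xs) / n
var_x = sum((x - mean_x) ** 2 for x in xs) / n
print(round(mean_x, 1), round(var_x, 1))  # close to 1 and 2
```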
Proposition 4
Let $\sigma_h^2$ be distributed as a noncentral chi-square with $\nu$ degrees of freedom and noncentrality parameter $\eta$; then $X$ is distributed as a scaled doubly noncentral $F$.

Proof

From the conditions of the proposition it follows that

$$X = \frac{(\mu_g - \mu_h)^2}{2\sigma_h^2} = \frac{Var(D)}{2}\cdot\frac{D^2/Var(D)}{\chi_\nu^2(\eta)} = \frac{Var(D)}{2\nu}\cdot\frac{\chi_1^2(\lambda)}{\chi_\nu^2(\eta)/\nu} = \tau\,\frac{\chi_1^2(\lambda)}{\chi_\nu^2(\eta)/\nu}, \qquad (13)$$

where $\tau = \frac{Var(D)}{2\nu}$ is a scale factor, $\chi_1^2(\lambda)$ is a noncentral chi-square random variable with one degree of freedom and noncentrality parameter $\lambda = \frac{E(D)^2}{Var(D)}$, and $\sigma_h^2 = \chi_\nu^2(\eta)$. It is known that $f = \frac{\chi_1^2(\lambda)}{\chi_\nu^2(\eta)/\nu}$ is doubly noncentral $F$-distributed with $(1, \nu)$ degrees of freedom (Johnson et al., 1995).
Proposition 5
Let $\sigma_h^2$ be distributed as a scaled chi-square with $\nu$ degrees of freedom and let $E(\mu_g) = E(\mu_h)$; then $X$ is distributed as a scaled $F$.

Proof

From the conditions of the proposition it follows that

$$X = \frac{(\mu_g - \mu_h)^2}{2\sigma_h^2} = \frac{Var(D)}{2}\cdot\frac{D^2/Var(D)}{\omega\chi_\nu^2} = \frac{Var(D)}{2\omega\nu}\cdot\frac{\chi_1^2}{\chi_\nu^2/\nu} = \tau\,\frac{\chi_1^2}{\chi_\nu^2/\nu}, \qquad (14)$$

where $\tau = \frac{Var(D)}{2\omega\nu}$ is a scale factor, $\chi_1^2$ is a chi-square random variable with one degree of freedom, and $\sigma_h^2 = \omega\chi_\nu^2$. It is known that $f = \frac{\chi_1^2}{\chi_\nu^2/\nu}$ is $F$-distributed with $(1, \nu)$ degrees of freedom (Johnson et al., 1995).
Consider the following modification of Equation (1), which provides a symmetric divergence between $g(x)$ and $h(x)$:

$$D(g, h) = D(g \,\|\, h) + D(h \,\|\, g) = \int_{-\infty}^{+\infty} g(x)\,\ln\frac{g(x)}{h(x)}\,dx + \int_{-\infty}^{+\infty} h(x)\,\ln\frac{h(x)}{g(x)}\,dx. \qquad (15)$$
Analogous results with this definition can be developed. We state the analogue of Lemma 1 without proof.
Lemma 2
Let the conditions of Lemma 1 be satisfied. Then Equation (15) reduces to the following:

$$D(g, h) = \frac{1}{2}\left(\frac{\sigma_g^2}{\sigma_h^2} + \frac{\sigma_h^2}{\sigma_g^2} + \frac{(\mu_g-\mu_h)^2}{\sigma_h^2} + \frac{(\mu_g-\mu_h)^2}{\sigma_g^2} - 2\right). \qquad (16)$$

Let $\sigma_g^2 = \varphi\sigma_h^2$. Then Equation (16) transforms to the following:

$$D(g, h) = \frac{1}{2}\left(\varphi + \frac{1}{\varphi} + \frac{1+\varphi}{\varphi}\cdot\frac{(\mu_g-\mu_h)^2}{\sigma_h^2} - 2\right). \qquad (17)$$

If $\varphi = 1$, then Equation (17) transforms to the following:

$$D(g, h) = \frac{(\mu_g-\mu_h)^2}{\sigma_h^2}. \qquad (18)$$

Comparing (18) with (6), one can see that $D(g, h) = 2D(g \,\|\, h)$. Then Propositions 1–5 are immediately applicable to $\frac{1}{2}D(g, h)$. The first two terms inside the parentheses of (16) could be treated with a (doubly noncentral) $F$ distribution; however, our results have indicated that $\frac{\sigma_g^2}{\sigma_h^2} + \frac{\sigma_h^2}{\sigma_g^2} - 2$ is small enough to be ignored from a practical standpoint.
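Lemma 2 and the relation $D(g, h) = 2D(g \,\|\, h)$ under equal variances can be confirmed directly from the closed forms (4) and (16); the check below is ours.

```python
import math

def kld(mu_g, s2_g, mu_h, s2_h):
    """Directed KLD between normals, Equation (4); s2_* denote variances."""
    return 0.5 * (math.log(s2_h / s2_g) + s2_g / s2_h
                  + (mu_g - mu_h) ** 2 / s2_h - 1)

def sym_kld(mu_g, s2_g, mu_h, s2_h):
    """Symmetric KLD, Equation (16)."""
    return 0.5 * (s2_g / s2_h + s2_h / s2_g
                  + (mu_g - mu_h) ** 2 / s2_h
                  + (mu_g - mu_h) ** 2 / s2_g - 2)

# Equation (15): the symmetric form is the sum of the two directed divergences.
a = kld(1.0, 1.5, 0.0, 0.9) + kld(0.0, 0.9, 1.0, 1.5)
b = sym_kld(1.0, 1.5, 0.0, 0.9)
# Equations (18) vs (6): with equal variances, D(g, h) = 2 D(g || h).
c = sym_kld(1.0, 1.0, 0.0, 1.0)  # 1.0
d = kld(1.0, 1.0, 0.0, 1.0)      # 0.5
```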
Bayesian Posteriors
Consider two test takers with latent traits (abilities) $\theta_g$ and $\theta_h$ taking two different tests $T_g$ and $T_h$ with $m$ and $n$ items, respectively. Let $\mathbf{r}_g = (r_{g1}, r_{g2}, \ldots, r_{gm})$ and $\mathbf{r}_h = (r_{h1}, r_{h2}, \ldots, r_{hn})$ represent the binary response vectors of the test takers to $T_g$ and $T_h$, respectively. Bayesian analysis (Gelman et al., 2004) is used to approximate the distributions of $\theta_g$ and $\theta_h$. In this study, we take a finite set $\{\theta_1, \theta_2, \ldots, \theta_k\}$ of ability levels equally spaced in the interval $[-5, +5]$, where $k = 41$. A uniform prior is used, and Bayesian posteriors are computed based on the responses. The posterior probabilities for $\theta_g$ based on responses $\mathbf{r}_g$ to the test $T_g$ are

$$G(\theta_i) = \frac{\prod_{j=1}^{m} P(r_{gj} \mid \theta_i)}{\sum_{l=1}^{k}\prod_{j=1}^{m} P(r_{gj} \mid \theta_l)}, \quad i = 1, \ldots, k, \qquad (19)$$

where $P(r_{gj} \mid \theta_i)$ is the probability of response $r_{gj} \in \{0, 1\}$ given ability level $\theta_i$. Similarly, the posterior probabilities for $\theta_h$ based on responses $\mathbf{r}_h$ to the test $T_h$ are

$$H(\theta_i) = \frac{\prod_{j=1}^{n} P(r_{hj} \mid \theta_i)}{\sum_{l=1}^{k}\prod_{j=1}^{n} P(r_{hj} \mid \theta_l)}, \quad i = 1, \ldots, k. \qquad (20)$$

The KLD between posteriors $G$ and $H$ is computed by the following formula:

$$D(G \,\|\, H) = \sum_{i=1}^{k} G(\theta_i)\,\ln\frac{G(\theta_i)}{H(\theta_i)}. \qquad (21)$$
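The computations in (19)–(21) can be sketched directly. In the illustrative code below (ours, not from the report), the 3PL response function and the item parameters are assumptions chosen only for the example:

```python
import math
import random

K = 41
GRID = [-5 + 10 * i / (K - 1) for i in range(K)]  # ability levels in [-5, +5]

def p3pl(theta, a, b, c):
    """3PL probability of a correct response at ability theta."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def posterior(responses, items):
    """Equations (19)/(20): uniform prior over GRID times the likelihood
    of the binary responses; `items` is a list of (a, b, c) triples."""
    weights = []
    for theta in GRID:
        like = 1.0
        for r, (a, b, c) in zip(responses, items):
            p = p3pl(theta, a, b, c)
            like *= p if r == 1 else 1 - p
        weights.append(like)
    z = sum(weights)
    return [w / z for w in weights]

def kld(G, H):
    """Equation (21); a term with G(theta_i) = 0 contributes 0."""
    return sum(g * math.log(g / h) for g, h in zip(G, H) if g > 0)

random.seed(1)
items = [(1.0, random.uniform(-3, 3), 0.1) for _ in range(50)]
theta_true = 0.5
r_g = [int(random.random() < p3pl(theta_true, *it)) for it in items]
r_h = [int(random.random() < p3pl(theta_true, *it)) for it in items]
G, H = posterior(r_g, items), posterior(r_h, items)
print(abs(sum(G) - 1) < 1e-9, kld(G, H) >= 0)  # True True
```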
Under five regularity conditions that are common in educational practice, Chang and Stout (1993) proved that $G$ and $H$ are asymptotically of the forms (2) and (3), respectively.
Given a population of test takers, the latent traits $\theta_g$, $\theta_h$ are random variables with means $E(\theta_g)$, $E(\theta_h)$ and variances $Var(\theta_g)$, $Var(\theta_h)$, respectively. If $\theta_g = \theta_h$ and tests $T_g$, $T_h$ are parallel, then $\sigma_g^2 = \sigma_h^2$ and (21) is approximated by (6). If $\theta_g \ne \theta_h$ and tests $T_g$, $T_h$ are parallel with flat information functions, then $\sigma_g^2 = \sigma_h^2$ and (21) is approximated by (6); also, in this case $Var(\sigma_h^2) = 0$, so $\sigma_h^2$ is constant, denoted by $s$. Two tests are parallel if they provide the same score distribution for a given sample of abilities. Usually this implies that $m = n$ and that their information functions are similar to each other.

Assume that $\sigma_g^2 = \sigma_h^2$ and $\sigma_h^2 = s$ is constant; then the following linear model holds:

$$\theta_g = \mu_g + \varepsilon_g, \quad \varepsilon_g \sim N(0, s),$$
$$\theta_h = \mu_h + \varepsilon_h, \quad \varepsilon_h \sim N(0, s). \qquad (22)$$

From Equation (22) it follows that $D = \mu_g - \mu_h$ is normally distributed when $\theta_g - \theta_h$ is a normal random variable or $\theta_g - \theta_h$ is a constant. Also, Equation (22) allows transforming Equation (7) into

$$E(D) = E(\theta_g) - E(\theta_h),$$
$$Var(D) = Var(\theta_g) + Var(\theta_h) + 2s - 2\,Cov(\theta_g, \theta_h) - 2\,Cov(\varepsilon_g, \varepsilon_h). \qquad (23)$$
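Equation (23) can be verified by simulation in the simplest case $Cov(\varepsilon_g, \varepsilon_h) = 0$; the sketch below (ours) draws correlated latent traits and independent errors as in (22).

```python
import random

random.seed(3)
s, rho, n = 0.25, 0.6, 200000   # error variance s, Corr(theta_g, theta_h) = rho
ds = []
for _ in range(n):
    z0, z1 = random.gauss(0, 1), random.gauss(0, 1)
    theta_g = z0                                     # Var(theta_g) = 1
    theta_h = rho * z0 + (1 - rho ** 2) ** 0.5 * z1  # Var = 1, Cov = rho
    eps_g = random.gauss(0, s ** 0.5)                # independent N(0, s) errors
    eps_h = random.gauss(0, s ** 0.5)
    mu_g, mu_h = theta_g - eps_g, theta_h - eps_h    # from Equation (22)
    ds.append(mu_g - mu_h)

mean_d = sum(ds) / n
var_d = sum((d - mean_d) ** 2 for d in ds) / n
var_pred = 1 + 1 + 2 * s - 2 * rho  # Equation (23) with Cov(eps_g, eps_h) = 0
print(abs(var_d - var_pred) < 0.05)  # True
```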
Consider two parallel tests $T_g$ and $T_h$ with smooth information functions administered to a test taker with ability $\theta$. According to Chang and Stout (1993), the following holds:

$$\xi_g = \frac{\theta - \mu_g}{\sigma_g} \sim N(0, 1), \quad \sigma_g^2 = 1/J(T_g, \mu_g),$$
$$\xi_h = \frac{\theta - \mu_h}{\sigma_h} \sim N(0, 1), \quad \sigma_h^2 = 1/J(T_h, \mu_h), \qquad (24)$$

where $J(T_g, \mu_g)$ is the Fisher information of test $T_g$ at ability $\mu_g$. Since tests $T_g$ and $T_h$ are parallel and have smooth information functions, $\sigma_g^2 = \sigma_h^2$. Since tests $T_g$ and $T_h$ are different, $\xi_g$ and $\xi_h$ are independent, and

$$\xi_h - \xi_g = \frac{\theta - \mu_h}{\sigma_h} - \frac{\theta - \mu_g}{\sigma_g} = \frac{\mu_g - \mu_h}{\sigma_h} \sim N(0, 2),$$
$$\frac{\mu_g - \mu_h}{\sqrt{2}\,\sigma_h} \sim N(0, 1). \qquad (25)$$

Thus, given two parallel tests administered to a population, according to Proposition 3 the resultant statistic $D(G \,\|\, H)$ is asymptotically distributed as chi-square with one degree of freedom. This is an interesting result because it does not depend on the distribution of $\theta$ and allows random $\sigma_g^2 = \sigma_h^2$.

One can estimate the influence of the tails of the population on the approximation represented by Equation (25). If the assumption $\sigma_g^2 = \sigma_h^2$ does not hold, then

$$\xi_h - \xi_g = \frac{\theta - \mu_h}{\sigma_h} - \frac{\theta - \mu_g}{\sigma_g} = \frac{\sigma_g - \sigma_h}{\sigma_g\sigma_h}\,\theta + \frac{\mu_g}{\sigma_g} - \frac{\mu_h}{\sigma_h}. \qquad (26)$$
Simulated Experiments
This section presents four simulated experiments that illustrate some of the above propositions. The Kolmogorov–Smirnov statistic $K_D$ is employed to measure the fit between the empirical CDF $F_D(x)$ of $D(G \,\|\, H)$ and the CDF $F(x)$ of the asymptotic distribution:

$$K_D = \sup_x \left| F_D(x) - F(x) \right|. \qquad (27)$$

The posterior distributions of ability were computed by (19) and (20), followed by $D(G \,\|\, H)$ calculated by (21).
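The statistic (27) can be computed without any statistics library; the sketch below (ours) evaluates $K_D$ for a sample that is exactly $\chi_1^2$ (squares of standard normals), using the closed-form $\chi_1^2$ CDF $\operatorname{erf}(\sqrt{x/2})$.

```python
import math
import random

def chi2_1_cdf(x):
    """CDF of chi-square with one degree of freedom."""
    return math.erf(math.sqrt(x / 2)) if x > 0 else 0.0

def ks_statistic(sample, cdf):
    """Equation (27): the empirical CDF jumps at each order statistic,
    so the supremum is attained just before or at one of them."""
    xs = sorted(sample)
    n = len(xs)
    return max(max(abs(i / n - cdf(x)), abs((i + 1) / n - cdf(x)))
               for i, x in enumerate(xs))

random.seed(11)
sample = [random.gauss(0, 1) ** 2 for _ in range(20000)]
kd = ks_statistic(sample, chi2_1_cdf)
print(kd < 0.02)  # True: the sample really is chi-square(1)
```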
Illustration of Proposition 1
Consider the following simulation experiment to illustrate Proposition 1. Two independent groups of 10,000 test takers, $\theta_g \sim N(0.5, 1)$ and $\theta_h \sim N(-0.5, 1)$, were given one test. The test consisted of 50 items modeled by the three-parameter logistic (3PL) IRT model (Lord, 1980) with the following parameters: $a = 1$, $b \sim U(-5, +5)$, $c = 0.1$. Test takers from the two groups were randomly paired to provide the two posterior distributions. Then from the above section on Bayesian posteriors it follows that $\sigma_g^2 = \sigma_h^2 = s$ is constant, and from Equation (23) it follows that $E(D) = 1$ and $Var(D) = 2 + 2s$. Figure 1 shows the empirical CDF of $D(G \,\|\, H)$ and the CDF of the asymptotic distribution from Proposition 1 with parameters $\lambda = 0.352$ and $\omega = 0.124$ computed from the data. One can observe an almost perfect fit, with $K_D = 0.009$.
FIGURE 1. Illustration of Proposition 1: Empirical and asymptotic CDFs of $D(G \,\|\, H)$ computed from simulated data
Figure 2 shows the distribution of $\sigma_g^2$ and $\sigma_h^2$, with means 0.16, 0.17 and variances 0.0005, 0.0009, respectively. One can see that these distributions are not correlated (correlation coefficient = 0.01) and are mostly concentrated around their means with a slight right tail. From the above it follows that

$$\lambda = \frac{E(D)^2}{Var(D)} = \frac{1}{2 + 2\times 0.17} \approx 0.4 \quad\text{and}\quad \omega = \frac{2s}{Var(D)} = \frac{2\times 0.17}{2 + 2\times 0.17} \approx 0.1,$$

which is close to the parameters computed from the data.
FIGURE 2. Illustration of Proposition 1: Distribution of empirical $\sigma_g^2$ (abscissa) and $\sigma_h^2$ (ordinate)
Illustration of Proposition 2
Consider the following simulation experiment to illustrate Proposition 2. Two independent groups of 10,000 test takers, $\theta_g \sim N(0, 1)$ and $\theta_h \sim N(0, 1)$, were given one test. The test consisted of 30 items modeled by the 3PL IRT model (Lord, 1980) with the following parameters: $a = 1$, $b \sim U(-5, +5)$, $c = 0.1$. Test takers from the two groups were randomly paired to provide the two posterior distributions. Then from the above section on Bayesian posteriors it follows that $\sigma_g^2 = \sigma_h^2 = s$ is constant, and from Equation (23) it follows that $E(D) = 0$ and $Var(D) = 2 + 2s$. Figure 3 shows the empirical CDF of $D(G \,\|\, H)$ and the CDF of the asymptotic distribution from Proposition 2 with parameter $\omega = 0.314$ computed from the data. One can observe a good fit, with $K_D = 0.018$.
FIGURE 3. Illustration of Proposition 2: Empirical and asymptotic CDFs of $D(G \,\|\, H)$ computed from simulated data
Figure 4 shows the distribution of $\sigma_g^2$ and $\sigma_h^2$, with means 0.41, 0.40 and variances 0.005, 0.005, respectively. One can see that these distributions are not correlated (correlation coefficient = −0.02) and are mostly concentrated around their means with a right tail. From the above it follows that

$$\omega = \frac{2s}{Var(D)} = \frac{2\times 0.40}{2 + 2\times 0.40} \approx 0.3,$$

which is close to the parameter computed from the data.
FIGURE 4. Illustration of Proposition 2: Distribution of empirical $\sigma_g^2$ (abscissa) and $\sigma_h^2$ (ordinate)
Illustration of Proposition 3
Consider the following simulation experiment to illustrate Proposition 3. Ten thousand test takers with $\theta_g = \theta_h \sim N(0, 1)$ were given one test two times. The test consisted of 100 items modeled by the 3PL IRT model (Lord, 1980) with the following parameters: $a = 1$, $b \sim N(0, 1)$, $c = 0.1$. Then from the above section on Bayesian posteriors it follows that $\sigma_g^2 = \sigma_h^2$ and $\frac{\mu_g - \mu_h}{\sqrt{2}\,\sigma_h} \sim N(0, 1)$. Figure 5 shows the empirical CDF of $D(G \,\|\, H)$ and the CDF of the asymptotic distribution from Proposition 3. One can observe a good fit, with $K_D = 0.012$.
FIGURE 5. Illustration of Proposition 3: Empirical and asymptotic CDFs of $D(G \,\|\, H)$ computed from simulated data
Figure 6 shows the observed distribution of $\sigma_g^2$ and $\sigma_h^2$, with means 0.048, 0.049 and variances 0.006, 0.007, respectively. These distributions are correlated (correlation coefficient = 0.64) and mostly concentrated around their means with a right tail.
FIGURE 6. Illustration of Proposition 3: Distribution of empirical $\sigma_g^2$ (abscissa) and $\sigma_h^2$ (ordinate)
The tails in Figure 6 are now larger. To make them extreme, we changed the test so that its length was 50 items with parameter $b \sim N(1, 1)$.
Figure 7 shows the empirical CDF of $D(G \,\|\, H)$ and the CDF of the asymptotic distribution from Proposition 3. One can observe a good fit, with $K_D = 0.009$.
FIGURE 7. Illustration of Proposition 3 (extreme case): Empirical and asymptotic CDFs of $D(G \,\|\, H)$ computed from simulated data
Figure 8 shows the distribution of $\sigma_g^2$ and $\sigma_h^2$, with means 0.50, 0.50 and variances 0.39, 0.40, respectively. These distributions are correlated (correlation coefficient = 0.70) and have a large right tail.
FIGURE 8. Illustration of Proposition 3 (extreme case): Distribution of empirical $\sigma_g^2$ (abscissa) and $\sigma_h^2$ (ordinate)
Both experiments in this subsection demonstrated that even if $\sigma_g^2 = \sigma_h^2$ is random, the statistic $D(G \,\|\, H)$ is asymptotically chi-square with one degree of freedom (see Equation (25)). Also, the setup of these experiments is closer to practice than that of the previous ones.
Application for Test Security
The major advantage of comparing posteriors for test security is that it takes into account all information available from the responses and allows various ways to partition the test taker's response vector into $\mathbf{r}_g$ and $\mathbf{r}_h$. Practical partitions include: operational items versus variable items, hard items versus easy items, unexposed items versus exposed items, uncompromised items versus compromised items, and items of one type versus items of another type. A large divergence between posteriors indicates a significant change in the test taker's performance (Belov et al., 2007).
Consider a test taker taking two distinct but parallel tests $T_g$, $T_h$. The value of $D(G \,\|\, H)$ from (21) provides an index of similarity between the performances on the two tests. Large divergence values indicate a significant change in performance. From the above section on Bayesian posteriors it follows that the posteriors $G$ and $H$ are approximated by (2) and (3), $\sigma_g^2 = \sigma_h^2$, and $\frac{\mu_g - \mu_h}{\sqrt{2}\,\sigma_h} \sim N(0, 1)$. Therefore, the conditions of Proposition 3 are satisfied. To conduct a statistical hypothesis test to establish aberrant behavior, compute a critical value for a given significance level $\alpha$. Values of $D(G \,\|\, H)$ greater than the critical value give a critical region for a hypothesis test (Lehmann, 1986) on the consistency of performance between two tests.
Let $T_g$ and $T_h$ represent the operational and variable parts of a high-stakes test, respectively. The operational part is the same for all test takers taking the test, while the variable part is generally unique within an adjacent seating area. If a test taker copies answers from his or her neighbor, then the change in the test taker's performance between the operational part and the variable part should be large. Values of $D(G \,\|\, H)$ falling into the critical region indicate a significant change in the test taker's performance between the operational part and the variable part. This can be applied to identify test takers involved in answer copying (Belov & Armstrong, 2009).
In general, the number of items ($m$) on the operational part and the number of items ($n$) on the variable part can be different. For example, in the Law School Admission Test (LSAT), $m \approx 4n$. Therefore, to compute the critical value for a given $\alpha$, one cannot apply Proposition 3 directly. To address this problem, we introduce two distinct methods to satisfy $m = n$.
Method 1: Assume that the operational part $w$ has a subset $w_0$ of responses to items of the same type as the variable part $v$. If $w_0$ has a size different from that of $v$, apply bootstrapping. Then assign $(w \setminus w_0) \cup v$ to the variable part and, thus, $m = n$. As a result, however, the responses to the operational and variable parts have a substantial overlap. Therefore, $\mu_g$ and $\mu_h$ become dependent and Equation (25) does not hold. One can assume that $\sigma_g^2 = \sigma_h^2 = s$ is constant and apply Proposition 2. Since $Var(D) = 2s - 2\,Cov(\varepsilon_g, \varepsilon_h)$ and $Cov(\varepsilon_g, \varepsilon_h) > 0$ (see Equation (23)), the scale $\omega = \frac{2s}{Var(D)} > 1$.

Method 2: Assume that the operational part $w$ has a subset $w_0$ of responses to items of the same type as the variable part $v$. Then consider $v = v \cup (\text{even responses of } w \setminus w_0)$ and $w = w_0 \cup (\text{odd responses of } w \setminus w_0)$. If the resultant $w$ has a size different from that of $v$, apply bootstrapping to satisfy $m = n$.
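Method 2 can be sketched as follows. The code below (ours; the function name and the exact bootstrap rule are illustrative assumptions, not the operational LSAC procedure) alternates the responses of $w \setminus w_0$ between the two parts and then resamples with replacement until $m = n$.

```python
import random

def method2_partition(w, w0_idx, v, rng):
    """Split the responses of w \\ w0 alternately between the variable and
    operational parts, then bootstrap the operational part to the size of v.
    `w`, `v` are lists of 0/1 responses; `w0_idx` indexes the subset w0 of w."""
    idx = set(w0_idx)
    w0 = [w[i] for i in w0_idx]
    rest = [w[i] for i in range(len(w)) if i not in idx]
    v_new = v + rest[0::2]        # even-position responses of w \\ w0
    w_new = w0 + rest[1::2]       # odd-position responses of w \\ w0
    if len(w_new) != len(v_new):  # bootstrap (resample with replacement)
        w_new = [rng.choice(w_new) for _ in range(len(v_new))]
    return w_new, v_new

rng = random.Random(5)
w = [rng.randint(0, 1) for _ in range(100)]  # operational part (m = 100)
v = [rng.randint(0, 1) for _ in range(25)]   # variable part (n = 25)
w2, v2 = method2_partition(w, list(range(20)), v, rng)
print(len(w2) == len(v2))  # True: the two parts now have equal length
```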
Results with empirical data derived from an administration of the LSAT are now given. Items in the operational and variable parts were modeled by the 3PL IRT model (Lord, 1980). The operational part included 100 items. There were 10 variable parts uniformly distributed over about 22,000 test takers. The posterior distributions of ability were computed by (19) and (20), followed by $D(G \,\|\, H)$ calculated by (21).

Apply Method 1 and Proposition 2. Figure 9 shows the empirical CDF of $D(G \,\|\, H)$ and the CDF of the asymptotic distribution from Proposition 2 with parameter $\omega = 3.26$ computed from the data. One can observe a good fit, with $K_D = 0.018$.
FIGURE 9. Illustration of Proposition 2 and Method 1: Empirical and asymptotic CDFs of $D(G \,\|\, H)$ computed from real data
Figure 10 shows the distribution of $\sigma_g^2$ and $\sigma_h^2$, with means 0.065, 0.067 and variances 0.002, 0.002, respectively. One can see that these distributions are correlated (correlation coefficient = 0.9) and mostly concentrated around their means with a slight right tail.
FIGURE 10. Illustration of Proposition 2 and Method 1: Distribution of empirical $\sigma_g^2$ (abscissa) and $\sigma_h^2$ (ordinate)
Apply Method 2 and Proposition 3. Figure 11 shows the empirical CDF of $D(G \,\|\, H)$ and the CDF of the asymptotic distribution from Proposition 3. One can observe a good fit, with $K_D = 0.020$.
FIGURE 11. Illustration of Proposition 3 and Method 2: Empirical and asymptotic CDFs of $D(G \,\|\, H)$ computed from real data
Figure 12 shows the distribution of $\sigma_g^2$ and $\sigma_h^2$, with means 0.11, 0.12 and variances 0.005, 0.006, respectively. These distributions are correlated (correlation coefficient = 0.63) and mostly concentrated around their means with a right tail.
[Figure: scatterplot over the range 0 to 2 on both axes]
FIGURE 12. Illustration of Proposition 3 and Method 2: Distribution of empirical σ²_g (abscissa) and σ²_h (ordinate)
Since the empirical variances of σ²_g and σ²_h are close to zero, one can apply Proposition 2, where ω should be close to 1.
Figure 13 shows the empirical CDF of D̂(G||H) and the CDF of the asymptotic distribution from Proposition 2, with parameter ω = 0.92 computed from the data. One can observe an almost perfect fit, with D_K = 0.012.
[Figure: CDF (ordinate, 0 to 1) versus D(G||H) (abscissa, 0.0 to 10.7), showing the asymptotic and empirical CDFs]
FIGURE 13. Illustration of Proposition 2 and Method 2: Empirical and asymptotic CDFs of D̂(G||H) computed from real data
Discussion
Kullback–Leibler divergence (KLD) is used in many areas to measure the discrepancy between two distributions. The references show that KLD is applied in standardized testing, magnetic resonance imaging, human gene analysis, stochastic complexity, and sample-size selection. Thus, results from the study of the distribution of KLD are important to several fields.
The assumptions made in this paper to identify the distribution of KLD are commonly observed in educational practice. Normality of the Bayesian posterior in various settings has been shown in the references. The assumptions σ²_g = σ²_h and (μ_g − μ_h)/σ_h ~ N(0, 1) hold as soon as two parallel tests are administered to a population. In some cases, it is possible to control the variance of the posterior distribution so that it is close to a chosen value, in other words, to maintain the assumption σ²_g = σ²_h. For example, in a computerized adaptive test, the stopping criterion could
be driven by this variance.
For different combinations of assumptions, we proved that KLD is asymptotically distributed as a scaled (noncentral) chi-square with one degree of freedom or as a scaled (doubly noncentral) F. For details, see Propositions 1–5. Computer experiments with simulated and real data demonstrated a good fit between the asymptotic and empirical distributions.
The results of this paper are directly applicable to educational practice. In particular, one can use Propositions 1–5 to compute the critical value of KLD for a given partitioning of the test: operational items versus variable items, hard items versus easy items, unexposed items versus exposed items, uncompromised items versus compromised items, and items of one type versus items of another type. These propositions, therefore, can be used to identify aberrant responding in a high-stakes test such as the LSAT.
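Computing such a critical value amounts to inverting the asymptotic CDF at the chosen significance level. The sketch below does this numerically by bisection, using the central χ²₁ CDF as a placeholder for the scaled distributions of Propositions 1–5 (the actual cutoff would use the scaled noncentral chi-square or F form from the relevant proposition):

```python
import math

def chi2_1_cdf(x):
    """CDF of the chi-square distribution with one degree of freedom."""
    return math.erf(math.sqrt(x / 2.0)) if x > 0 else 0.0

def critical_value(alpha, cdf=chi2_1_cdf, hi=100.0, tol=1e-10):
    """Smallest c with cdf(c) >= 1 - alpha, found by bisection.
    A response pattern with KLD above c would be flagged as aberrant."""
    lo = 0.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if cdf(mid) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return hi

c = critical_value(0.05)   # ≈ 3.84, the familiar chi-square cutoff
```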
References
Arizono, I., & Ohta, H. (1989). A test for normality based on Kullback-Leibler information. The American Statistician, 43, 20–22.
Belov, D. I., & Armstrong, R. D. (2009). Automatic detection of answer copying via Kullback–Leibler divergence and K-Index (Research Report 09-01). Newtown, PA: Law School Admission Council.
Belov, D. I., Pashley, P. J., Lewis, C., & Armstrong, R. D. (2007). Detecting aberrant responses with Kullback–Leibler distance. In K. Shigemasu, A. Okada, T. Imaizumi, & T. Hoshino (Eds.), New trends in psychometrics (pp. 7–14). Tokyo: Universal Academy Press.
Cabella, B. C. T., Sturzbecher, M. J., Tedeschi, W., Filho, O. B., de Araujo, D. B., & Neves, U. P. C. (2008). A numerical study of the Kullback-Leibler distance in functional magnetic resonance imaging. Brazilian Journal of Physics, 38, 20–25.
Chang, H. H., & Stout, W. (1993). The asymptotic posterior normality of the latent trait in an IRT model. Psychometrika, 58, 37–52.
Clarke, B. (1999). Asymptotic normality of the posterior in relative entropy. IEEE Transactions on Information Theory, 45, 165–176.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: John Wiley & Sons, Inc.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis. Boca Raton: Chapman & Hall.
Johnson, N. L., Kotz, S., & Balakrishnan, N. (1994). Continuous univariate distributions, Volume 1 (2nd ed.). New York: Wiley.
Johnson, N. L., Kotz, S., & Balakrishnan, N. (1995). Continuous univariate distributions, Volume 2 (2nd ed.). New York: Wiley.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22, 79–86.
Kupperman, M. (1957). Further applications of information theory to multivariate analysis and statistical inference (Ph.D. Dissertation). The George Washington University.
Lehmann, E. L. (1986). Testing statistical hypotheses (2nd ed.). New York: Wiley.
Li, Y., & Wang, L. (2008). Testing for homogeneity in mixture using weighted relative entropy. Communications in Statistics—Simulation and Computation, 37, 1981–1995.
Lin, X., Pittman, J., & Clarke, B. (2007). Information conversion, effective samples, and parameter size. IEEE Transactions on Information Theory, 53, 4438–4456.
Lord, F. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.
Pardo, L. (2006). Statistical inference based on divergence measures. New York: Chapman & Hall/CRC.
Salicru, M., Morales, D., Menendez, M. L., & Pardo, L. (1994). On the applications of divergence type measures in testing statistical hypotheses. Journal of Multivariate Analysis, 51, 372–391.
Song, K. S. (2002). Goodness-of-fit tests based on Kullback-Leibler discrimination information. IEEE Transactions on Information Theory, 48, 1103–1117.
Volkau, I., Bhanu Prakash, K. N., Anand, A., Aziz, A., & Nowinski, W. L. (2006). Extraction of the midsagittal plane from morphological neuroimages using the Kullback-Leibler’s measure. Medical Image Analysis, 10, 863–874.
Acknowledgments
We would like to thank Charles Lewis for his valuable comments and suggestions on previous versions of this manuscript.