LSAC RESEARCH REPORT SERIES
Distributions of Kullback–Leibler Divergence and Its Application for the LSAT

Dmitry I. Belov, Law School Admission Council
Ronald D. Armstrong, Rutgers University

Law School Admission Council Research Report 09-02
October 2009
A Publication of the Law School Admission Council
The Law School Admission Council (LSAC) is a nonprofit corporation that provides unique, state-of-the-art admission products and services to ease the admission process for law schools and their applicants worldwide. More than 200 law schools in the United States, Canada, and Australia are members of the Council and benefit from LSAC's services. ©2010 by Law School Admission Council, Inc. LSAT, The Official LSAT PrepTest, The Official LSAT SuperPrep, ItemWise, and LSAC are registered marks of the Law School Admission Council, Inc. Law School Forums and LSAC Credential Assembly Service are service marks of the Law School Admission Council, Inc. 10 Actual, Official LSAT PrepTests; 10 More Actual, Official LSAT PrepTests; The Next 10 Actual, Official LSAT PrepTests; The New Whole Law School Package; ABA-LSAC Official Guide to ABA-Approved Law Schools; Whole Test Prep Packages; LLM Credential Assembly Service; ACES
2; ADMIT-LLM; Law School Admission Test; and Law School Admission Council are trademarks of
the Law School Admission Council, Inc. All rights reserved. No part of this work, including information, data, or other portions of the work published in electronic form, may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying and recording, or by any information storage and retrieval system, without permission of the publisher. For information, write: Communications, Law School Admission Council, 662 Penn Street, Box 40, Newtown, PA, 18940-0040. LSAC fees, policies, and procedures relating to, but not limited to, test registration, test administration, test score reporting, misconduct and irregularities, Credential Assembly Service (CAS), and other matters may change without notice at any time. Up-to-date LSAC policies and procedures are available at LSAC.org, or you may contact our candidate service representatives.
Table of Contents
Executive Summary
Introduction
Distributions of Kullback–Leibler Divergence
    Lemma 1
    Proposition 1
    Proposition 2
    Proposition 3
    Proposition 4
    Proposition 5
    Lemma 2
Bayesian Posteriors
Simulated Experiments
    Illustration of Proposition 1
    Illustration of Proposition 2
    Illustration of Proposition 3
Application for Test Security
Discussion
References
Acknowledgments
Executive Summary
It is crucial for any testing organization to be able to recognize inconsistencies in test-taker performance across multiple standardized test administrations or across sections within a single administration. One method of recognizing inconsistent performance is to construct multiple posterior distributions and to compare the divergence between pairs of such distributions. (Note: In the context of a test taker’s performance, a posterior distribution summarizes what we know about the probability associated with certain levels of performance by the test taker.)
Comparing posteriors takes into account all information available from the responses and allows consideration of various partitions of the test taker’s responses to test questions (items). Practical partitions include scored versus unscored items, hard versus easy items, unexposed versus exposed items, uncompromised versus compromised items, and items of one type versus items of another type. A large divergence between posteriors indicates a significant change in a test taker’s performance. Such changes could be an indication of answer copying, item pre-knowledge, or test-taker pre-identification of the unscored section.
This paper evaluates the use of Kullback–Leibler divergence (KLD) to compare posterior distributions in the context of the Law School Admission Test. The statistical characteristics of KLD are presented and evaluated. KLD has been applied for this purpose in the context of magnetic resonance imaging, human gene analysis, stochastic complexity, and sample size selection. It is concluded that the properties of KLD support its use in the comparison of posterior distributions to identify inconsistent test-taker response cases.
Introduction
Various approaches can be used to assess the divergence between two probability density functions $g(x)$ and $h(x)$, such as the $\varphi$-divergence or $(h, \varphi)$-divergence measures (Pardo, 2006).
Kullback–Leibler divergence (KLD) is popular because it is a likelihood ratio approach that provides the relative entropy of a distribution with respect to a reference distribution. KLD is always non-negative and equals zero if and only if the two distributions are identical. The definition of KLD is valid for both discrete and continuous distributions.
KLD has been applied in theory and practice. Arizono and Ohta (1989) used order statistics and KLD to test for the normality of a distribution based on sampling. Song (2002) also used order statistics and KLD to create a new nonparametric goodness-of-fit test. Li and Wang (2008) proposed a test for homogeneity based on KLD. Clarke (1999) applied KLD for stochastic complexity and sample size selection. Lin, Pittman, and Clarke (2007) used KLD to optimize sampling with an application in bioinformatics. Cabella et al. (2008) and Volkau, Bhanu Prakash, Anand, Aziz, and Nowinski (2006) presented applications of KLD for magnetic resonance imaging. Belov, Pashley, Lewis, and Armstrong (2007) used KLD to compare the performance of a test taker on two separate tests.
Generally, the theoretical distribution of KLD is unknown. A common approach to compute a critical value is via simulation (Arizono & Ohta, 1989; Li & Wang, 2008; Cabella et al., 2008).
Another approach is based on approximation. For example, Cabella et al. (2008) found that the distribution of the KLD mean was closely estimated by a gamma distribution. Belov and Armstrong (2009) used a lognormal approximation of KLD to ensure a low type I error rate.
The following is the only analytical result on the distribution of KLD. Consider a parametric family $g(x) = h(x) = f(x \mid \tau)$, where $\tau$ is an $m_0$-dimensional vector of parameters. Let $\hat\tau$ be a maximum likelihood estimator of $\tau$ from a sample of size $n_0$. Kupperman (1957) proved that under the hypothesis $\tau = \tau_0$, $2n_0$ times the KLD between $f(x \mid \hat\tau)$ and $f(x \mid \tau_0)$ is asymptotically chi-square with $m_0$ degrees of freedom. Salicru, Morales, Menendez, and Pardo (1994) generalized this result to $\varphi$-divergence and $(h, \varphi)$-divergence. Unfortunately, this result is unusable in our case, because in psychometrics $\tau_0$ is the latent trait and it is unknown.
If $g(x)$ and $h(x)$ represent normal, exponential, or Poisson distributions, the resultant KLD has a closed form (Li & Wang, 2008), which simplifies an analysis of the KLD distribution. Chang and Stout (1993) proved that in unidimensional item response theory (IRT), the posterior of the latent trait is asymptotically normal for a long test. Gelman, Carlin, Stern, and Rubin (2004) discuss the asymptotic normality of the posterior distribution, in general, for data that are independent and identically distributed (i.i.d.).
Distributions of Kullback–Leibler Divergence
Let $g(x)$ and $h(x)$ be two probability density functions defined over the same support. The KLD (Cover & Thomas, 1991; Kullback & Leibler, 1951) between the two distributions is computed by the following formula:

$$D(g \,\|\, h) = \int_{-\infty}^{+\infty} g(x)\,\ln\frac{g(x)}{h(x)}\,dx. \qquad (1)$$

This definition of KLD takes the $g(x)$ distribution to be the dominant distribution of the pair; that is, the probability weights for the log ratio come from $g(x)$. The symmetric KLD defined as $D(g \,\|\, h) + D(h \,\|\, g)$ is discussed at the end of this section.
Lemma 1
Assume that $g(x)$ and $h(x)$ are normal probability density functions:

$$g(x) = \frac{1}{\sigma_g\sqrt{2\pi}}\,e^{-\frac{(x-\mu_g)^2}{2\sigma_g^2}} \qquad (2)$$

$$h(x) = \frac{1}{\sigma_h\sqrt{2\pi}}\,e^{-\frac{(x-\mu_h)^2}{2\sigma_h^2}}. \qquad (3)$$

Then Equation (1) reduces to the following:

$$D(g \,\|\, h) = \frac{1}{2}\left(\ln\frac{\sigma_h^2}{\sigma_g^2} + \frac{\sigma_g^2}{\sigma_h^2} + \frac{(\mu_g-\mu_h)^2}{\sigma_h^2} - 1\right). \qquad (4)$$

Let $\sigma_g^2 = \varphi\sigma_h^2$. Then Equation (4) transforms to the following:

$$D(g \,\|\, h) = \frac{1}{2}\left(\ln\frac{1}{\varphi} + \varphi + \frac{(\mu_g-\mu_h)^2}{\sigma_h^2} - 1\right). \qquad (5)$$

If $\varphi = 1$, then Equation (5) transforms to the following:

$$D(g \,\|\, h) = \frac{(\mu_g-\mu_h)^2}{2\sigma_h^2}. \qquad (6)$$
Proof
$$D(g \,\|\, h) = \int_{-\infty}^{+\infty} g(x)\,\ln\frac{g(x)}{h(x)}\,dx = \int_{-\infty}^{+\infty} g(x)\left(\ln\frac{\sigma_h}{\sigma_g} - \frac{(x-\mu_g)^2}{2\sigma_g^2} + \frac{(x-\mu_h)^2}{2\sigma_h^2}\right)dx.$$

Under $g(x)$, the expectation of $(x-\mu_g)^2$ is $\sigma_g^2$, and the expectation of $(x-\mu_h)^2 = (x-\mu_g)^2 + 2(x-\mu_g)(\mu_g-\mu_h) + (\mu_g-\mu_h)^2$ is $\sigma_g^2 + (\mu_g-\mu_h)^2$, since the cross term integrates to zero. Therefore,

$$D(g \,\|\, h) = \ln\frac{\sigma_h}{\sigma_g} - \frac{1}{2} + \frac{\sigma_g^2 + (\mu_g-\mu_h)^2}{2\sigma_h^2} = \frac{1}{2}\left(\ln\frac{\sigma_h^2}{\sigma_g^2} + \frac{\sigma_g^2}{\sigma_h^2} + \frac{(\mu_g-\mu_h)^2}{\sigma_h^2} - 1\right).$$
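As a numerical sanity check on Lemma 1 (not part of the original report), the closed form (4) can be compared against a direct Riemann-sum evaluation of the integral in (1); the function names below are ours.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def kld_numeric(mu_g, sig_g, mu_h, sig_h, lo=-20.0, hi=20.0, n=80000):
    """Equation (1) evaluated by a midpoint Riemann sum over a wide grid."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        g = normal_pdf(x, mu_g, sig_g)
        if g > 0.0:
            total += g * math.log(g / normal_pdf(x, mu_h, sig_h)) * dx
    return total

def kld_closed(mu_g, sig_g, mu_h, sig_h):
    """Equation (4): closed-form KLD between two normal densities."""
    return 0.5 * (math.log(sig_h ** 2 / sig_g ** 2)
                  + sig_g ** 2 / sig_h ** 2
                  + (mu_g - mu_h) ** 2 / sig_h ** 2
                  - 1.0)
```

For equal variances the closed form collapses to (6); for example, `kld_closed(1.0, 1.0, 0.0, 1.0)` returns 0.5, matching $(1-0)^2/(2 \cdot 1)$.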
The value of $D(g \,\|\, h)$ from (5) is equal to $D(g \,\|\, h)$ from (6) shifted by the non-negative constant $\frac{1}{2}(\varphi - \ln\varphi - 1)$. When the value of $\varphi$ is close to 1, the shift is small (see experiments with simulated and real data below). For example, $\varphi = 0.8$ gives a shift of 0.01 and $\varphi = 1.2$ gives a shift of 0.01. Therefore, the primary analysis will be given for the distribution of $D(g \,\|\, h)$ from (6), and it will be shown that the derived results are applicable in practice.

Let the values of $D(g \,\|\, h)$ from (6) be represented by the random variable $X$. To analyze the asymptotic distribution of $X$ in general populations, we have to assume that $g(x)$ and $h(x)$ are random. Since they are both from the same parametric family, it is sufficient to assume that $\mu_g$, $\mu_h$, $\sigma_g^2$, and $\sigma_h^2$ are random variables with means $E(\mu_g)$, $E(\mu_h)$, $E(\sigma_g^2)$, and $E(\sigma_h^2)$ and with variances $Var(\mu_g)$, $Var(\mu_h)$, $Var(\sigma_g^2)$, and $Var(\sigma_h^2)$, respectively. We will consider dependence only between $\mu_g$ and $\mu_h$. Then $\mu_g - \mu_h$ is a random variable $D$ with mean and variance

$$E(D) = E(\mu_g) - E(\mu_h),$$
$$Var(D) = Var(\mu_g) + Var(\mu_h) - 2\,Cov(\mu_g, \mu_h). \qquad (7)$$
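The size of that shift is easy to verify numerically; the short check below (ours, not from the report) evaluates the constant $\frac{1}{2}(\varphi - \ln\varphi - 1)$ at the two quoted values of $\varphi$.

```python
import math

def kld_shift(phi):
    """Shift between (5) and (6): 0.5 * (phi - ln(phi) - 1).
    Non-negative for phi > 0 and exactly zero at phi = 1."""
    return 0.5 * (phi - math.log(phi) - 1.0)

print(round(kld_shift(0.8), 2))  # 0.01
print(round(kld_shift(1.2), 2))  # 0.01
```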
Proposition 1
Let $D = \mu_g - \mu_h$ be normally distributed and let $Var(\sigma_h^2) = 0$, so that $\sigma_h^2$ is constant, denoted by $s$. Then $X$ is distributed as a scaled noncentral chi-square with one degree of freedom.

Proof

From the conditions of the proposition it follows that

$$X = \frac{(\mu_g - \mu_h)^2}{2\sigma_h^2} = \frac{Var(D)}{2s}\cdot\frac{D^2}{Var(D)} = \frac{1}{\omega}\,\chi_1^2(\lambda), \qquad (8)$$

where $\omega = \frac{2s}{Var(D)}$ is a scale and $\chi_1^2(\lambda)$ is a noncentral chi-square random variable with one degree of freedom and noncentrality parameter $\lambda = \frac{E(D)^2}{Var(D)}$. The cumulative distribution function (CDF) of $X$ is expressed as follows (Johnson, Kotz, & Balakrishnan, 1995):

$$F(x) = e^{-\lambda/2}\sum_{i=0}^{\infty}\frac{(\lambda/2)^i}{i!\;2^{1/2+i}\,\Gamma(1/2+i)}\int_0^{\omega x} t^{\,i-1/2}\,e^{-t/2}\,dt, \qquad x \ge 0. \qquad (9)$$
Given a sample of $X$, the method of moments can be applied to calibrate the distribution from Proposition 1. Let $E(X)$ be the expectation and $Var(X)$ the variance. Then $E(X) = \frac{1+\lambda}{\omega}$ and $Var(X) + E(X)^2 = \frac{2(1+2\lambda) + (1+\lambda)^2}{\omega^2}$. The scale is $\omega = \frac{2E(X)}{Var(X)} + \alpha$, where

$$\alpha = \sqrt{\frac{4E(X)^2}{Var(X)^2} - \frac{2}{Var(X)}} \ge 0.$$

The noncentrality parameter is $\lambda = \omega E(X) - 1$.
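The calibration above can be sketched in a few lines; below it is checked against the exact moments of a scaled noncentral chi-square with known parameters (the function name is ours).

```python
import math

def calibrate(mean_x, var_x):
    """Method-of-moments estimates of (omega, lam) for X = chi2_1(lam) / omega,
    using E(X) = (1 + lam) / omega and Var(X) = 2(1 + 2 lam) / omega**2."""
    alpha = math.sqrt((2 * mean_x / var_x) ** 2 - 2 / var_x)  # alpha >= 0
    omega = 2 * mean_x / var_x + alpha
    lam = omega * mean_x - 1
    return omega, lam

# Exact moments for omega = 0.5, lam = 1.5, then recover both parameters.
omega_true, lam_true = 0.5, 1.5
mean_x = (1 + lam_true) / omega_true              # 5.0
var_x = 2 * (1 + 2 * lam_true) / omega_true ** 2  # 32.0
omega_hat, lam_hat = calibrate(mean_x, var_x)
print(omega_hat, lam_hat)  # 0.5 1.5
```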
Proposition 2
Let $D = \mu_g - \mu_h$ be normally distributed with $E(\mu_g) = E(\mu_h)$, and let $Var(\sigma_h^2) = 0$, so that $\sigma_h^2$ is constant, denoted by $s$. Then $X$ is distributed as a scaled chi-square with one degree of freedom.

Proof

From the conditions of the proposition it follows that

$$X = \frac{(\mu_g - \mu_h)^2}{2\sigma_h^2} = \frac{Var(D)}{2s}\cdot\frac{D^2}{Var(D)} = \frac{1}{\omega}\,\chi_1^2, \qquad (10)$$

where $\omega = \frac{2s}{Var(D)}$ is a scale and $\chi_1^2$ is a chi-square random variable with one degree of freedom. The CDF of $X$ is expressed as follows (Johnson, Kotz, & Balakrishnan, 1994):

$$F(x) = \frac{1}{2^{1/2}\,\Gamma(1/2)}\int_0^{\omega x} t^{-1/2}\,e^{-t/2}\,dt, \qquad x \ge 0. \qquad (11)$$

Given a sample of $X$, the method of moments can be applied to calibrate the distribution from Proposition 2. Then the scale is $\omega = 1/E(X)$.
Proposition 3
Let $\frac{\mu_g - \mu_h}{\sqrt{2}\,\sigma_h} \sim N(0, 1)$; then $X$ is distributed as chi-square with one degree of freedom.

Proof

From the conditions of the proposition it follows that

$$X = \frac{(\mu_g - \mu_h)^2}{2\sigma_h^2} = \chi_1^2, \qquad (12)$$

where, since $\frac{\mu_g - \mu_h}{\sqrt{2}\,\sigma_h} \sim N(0, 1)$, $\chi_1^2$ is a chi-square random variable with one degree of freedom.
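Proposition 3 is easy to check by simulation (our sketch, not from the report): draw $D \sim N(0, 2\sigma_h^2)$, form $X$ as in (12), and compare its first two moments with those of $\chi_1^2$ (mean 1, variance 2).

```python
import random

random.seed(7)
sigma_h2 = 0.4   # an arbitrary fixed value of sigma_h^2
n = 200000
xs = []
for _ in range(n):
    d = random.gauss(0.0, (2 * sigma_h2) ** 0.5)  # D ~ N(0, 2 sigma_h^2)
    xs.append(d * d / (2 * sigma_h2))             # X from Equation (12)

mean_x = sum(xs) / n
var_x = sum((x - mean_x) ** 2 for x in xs) / n
print(round(mean_x, 1), round(var_x, 1))  # close to 1 and 2
```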
Proposition 4
Let $\sigma_h^2$ be distributed as a noncentral chi-square with $\nu$ degrees of freedom and noncentrality parameter $\eta$; then $X$ is distributed as a scaled doubly noncentral $F$.

Proof

From the conditions of the proposition it follows that

$$X = \frac{(\mu_g - \mu_h)^2}{2\sigma_h^2} = \frac{Var(D)}{2}\cdot\frac{D^2/Var(D)}{\chi_\nu^2(\eta)} = \frac{Var(D)}{2\nu}\cdot\frac{\chi_1^2(\lambda)}{\chi_\nu^2(\eta)/\nu} = \tau\,\frac{\chi_1^2(\lambda)}{\chi_\nu^2(\eta)/\nu}, \qquad (13)$$

where $\tau = \frac{Var(D)}{2\nu}$ is a scale factor, $\chi_1^2(\lambda)$ is a noncentral chi-square random variable with one degree of freedom and noncentrality parameter $\lambda = \frac{E(D)^2}{Var(D)}$, and $\sigma_h^2 = \chi_\nu^2(\eta)$. It is known that $f = \frac{\chi_1^2(\lambda)}{\chi_\nu^2(\eta)/\nu}$ is doubly noncentral $F$-distributed with $(1, \nu)$ degrees of freedom (Johnson et al., 1995).
Proposition 5
Let $\sigma_h^2$ be distributed as a scaled chi-square with $\nu$ degrees of freedom and let $E(\mu_g) = E(\mu_h)$; then $X$ is distributed as a scaled $F$.

Proof

From the conditions of the proposition it follows that

$$X = \frac{(\mu_g - \mu_h)^2}{2\sigma_h^2} = \frac{Var(D)}{2}\cdot\frac{D^2/Var(D)}{\omega\chi_\nu^2} = \frac{Var(D)}{2\omega\nu}\cdot\frac{\chi_1^2}{\chi_\nu^2/\nu} = \tau\,\frac{\chi_1^2}{\chi_\nu^2/\nu}, \qquad (14)$$

where $\tau = \frac{Var(D)}{2\omega\nu}$ is a scale factor, $\chi_1^2$ is a chi-square random variable with one degree of freedom, and $\sigma_h^2 = \omega\chi_\nu^2$. It is known that $f = \frac{\chi_1^2}{\chi_\nu^2/\nu}$ is $F$-distributed with $(1, \nu)$ degrees of freedom (Johnson et al., 1995).
Consider the following modification of Equation (1), which provides a symmetric divergence between $g(x)$ and $h(x)$:

$$D(g, h) = D(g \,\|\, h) + D(h \,\|\, g) = \int_{-\infty}^{+\infty} g(x)\,\ln\frac{g(x)}{h(x)}\,dx + \int_{-\infty}^{+\infty} h(x)\,\ln\frac{h(x)}{g(x)}\,dx. \qquad (15)$$
Analogous results with this definition can be developed. We state the analogue of Lemma 1 without proof.
Lemma 2
Let the conditions of Lemma 1 be satisfied. Then Equation (15) reduces to the following:

$$D(g, h) = \frac{1}{2}\left(\frac{\sigma_g^2}{\sigma_h^2} + \frac{\sigma_h^2}{\sigma_g^2} + \frac{(\mu_g-\mu_h)^2}{\sigma_h^2} + \frac{(\mu_g-\mu_h)^2}{\sigma_g^2} - 2\right). \qquad (16)$$

Let $\sigma_g^2 = \varphi\sigma_h^2$. Then Equation (16) transforms to the following:

$$D(g, h) = \frac{1}{2}\left(\varphi + \frac{1}{\varphi} + \frac{1+\varphi}{\varphi}\cdot\frac{(\mu_g-\mu_h)^2}{\sigma_h^2} - 2\right). \qquad (17)$$

If $\varphi = 1$, then Equation (17) transforms to the following:

$$D(g, h) = \frac{(\mu_g-\mu_h)^2}{\sigma_h^2}. \qquad (18)$$

Comparing (18) with (6), one can see that $D(g, h) = 2D(g \,\|\, h)$. Then Propositions 1–5 are immediately applicable to $\frac{1}{2}D(g, h)$. The first two terms inside the parentheses of (16) could be treated with a (doubly noncentral) $F$ distribution; however, our results have indicated that $\frac{\sigma_g^2}{\sigma_h^2} + \frac{\sigma_h^2}{\sigma_g^2} - 2$ is small enough to be ignored from a practical standpoint.
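Lemma 2 and the relation $D(g, h) = 2D(g \,\|\, h)$ under equal variances can be confirmed directly from the closed forms (4) and (16); the check below is ours.

```python
import math

def kld(mu_g, s2_g, mu_h, s2_h):
    """Directed KLD between normals, Equation (4); s2_* denote variances."""
    return 0.5 * (math.log(s2_h / s2_g) + s2_g / s2_h
                  + (mu_g - mu_h) ** 2 / s2_h - 1)

def sym_kld(mu_g, s2_g, mu_h, s2_h):
    """Symmetric KLD, Equation (16)."""
    return 0.5 * (s2_g / s2_h + s2_h / s2_g
                  + (mu_g - mu_h) ** 2 / s2_h
                  + (mu_g - mu_h) ** 2 / s2_g - 2)

# Equation (15): the symmetric form is the sum of the two directed divergences.
a = kld(1.0, 1.5, 0.0, 0.9) + kld(0.0, 0.9, 1.0, 1.5)
b = sym_kld(1.0, 1.5, 0.0, 0.9)
# Equations (18) vs (6): with equal variances, D(g, h) = 2 D(g || h).
c = sym_kld(1.0, 1.0, 0.0, 1.0)  # 1.0
d = kld(1.0, 1.0, 0.0, 1.0)      # 0.5
```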
Bayesian Posteriors
Consider two test takers with latent traits (abilities) $\theta_g$ and $\theta_h$ taking two different tests $T_g$ and $T_h$ with $m$ and $n$ items, respectively. Let $\mathbf{r}_g = (r_{g1}, r_{g2}, \ldots, r_{gm})$ and $\mathbf{r}_h = (r_{h1}, r_{h2}, \ldots, r_{hn})$ represent the binary response vectors of the test takers to $T_g$ and $T_h$, respectively. Bayesian analysis (Gelman et al., 2004) is used to approximate the distributions of $\theta_g$ and $\theta_h$. In this study, we take a finite set $\{\theta_1, \theta_2, \ldots, \theta_k\}$ of ability levels equally spaced in the interval $[-5, +5]$, where $k = 41$. A uniform prior is used, and Bayesian posteriors are computed based on the responses. The posterior probabilities for $\theta_g$ based on responses $\mathbf{r}_g$ to the test $T_g$ are

$$G(\theta_i) = \frac{\prod_{j=1}^{m} P(r_{gj} \mid \theta_i)}{\sum_{l=1}^{k}\prod_{j=1}^{m} P(r_{gj} \mid \theta_l)}, \quad i = 1, \ldots, k, \qquad (19)$$

where $P(r_{gj} \mid \theta_i)$ is the probability of response $r_{gj} \in \{0, 1\}$ given ability level $\theta_i$. Similarly, the posterior probabilities for $\theta_h$ based on responses $\mathbf{r}_h$ to the test $T_h$ are

$$H(\theta_i) = \frac{\prod_{j=1}^{n} P(r_{hj} \mid \theta_i)}{\sum_{l=1}^{k}\prod_{j=1}^{n} P(r_{hj} \mid \theta_l)}, \quad i = 1, \ldots, k. \qquad (20)$$

The KLD between posteriors $G$ and $H$ is computed by the following formula:

$$D(G \,\|\, H) = \sum_{i=1}^{k} G(\theta_i)\,\ln\frac{G(\theta_i)}{H(\theta_i)}. \qquad (21)$$
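The computations in (19)–(21) can be sketched directly. In the illustrative code below (ours, not from the report), the 3PL response function and the item parameters are assumptions chosen only for the example:

```python
import math
import random

K = 41
GRID = [-5 + 10 * i / (K - 1) for i in range(K)]  # ability levels in [-5, +5]

def p3pl(theta, a, b, c):
    """3PL probability of a correct response at ability theta."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def posterior(responses, items):
    """Equations (19)/(20): uniform prior over GRID times the likelihood
    of the binary responses; `items` is a list of (a, b, c) triples."""
    weights = []
    for theta in GRID:
        like = 1.0
        for r, (a, b, c) in zip(responses, items):
            p = p3pl(theta, a, b, c)
            like *= p if r == 1 else 1 - p
        weights.append(like)
    z = sum(weights)
    return [w / z for w in weights]

def kld(G, H):
    """Equation (21); a term with G(theta_i) = 0 contributes 0."""
    return sum(g * math.log(g / h) for g, h in zip(G, H) if g > 0)

random.seed(1)
items = [(1.0, random.uniform(-3, 3), 0.1) for _ in range(50)]
theta_true = 0.5
r_g = [int(random.random() < p3pl(theta_true, *it)) for it in items]
r_h = [int(random.random() < p3pl(theta_true, *it)) for it in items]
G, H = posterior(r_g, items), posterior(r_h, items)
print(abs(sum(G) - 1) < 1e-9, kld(G, H) >= 0)  # True True
```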
Under five regularity conditions that are common in educational practice, Chang and Stout (1993) proved that $G$ and $H$ are asymptotically of the forms (2) and (3), respectively.
Given a population of test takers, the latent traits $\theta_g$, $\theta_h$ are random variables with means $E(\theta_g)$, $E(\theta_h)$ and variances $Var(\theta_g)$, $Var(\theta_h)$, respectively. If $\theta_g = \theta_h$ and tests $T_g$, $T_h$ are parallel, then $\sigma_g^2 = \sigma_h^2$ and (21) is approximated by (6). If $\theta_g \ne \theta_h$ and tests $T_g$, $T_h$ are parallel with flat information functions, then $\sigma_g^2 = \sigma_h^2$ and (21) is approximated by (6); also, in this case $Var(\sigma_h^2) = 0$, so $\sigma_h^2$ is constant, denoted by $s$. Two tests are parallel if they provide the same score distribution for a given sample of abilities. Usually this implies that $m = n$ and that their information functions are similar to each other.

Assume that $\sigma_g^2 = \sigma_h^2$ and $\sigma_h^2 = s$ is constant; then the following linear model holds:

$$\theta_g = \mu_g + \varepsilon_g, \quad \varepsilon_g \sim N(0, s),$$
$$\theta_h = \mu_h + \varepsilon_h, \quad \varepsilon_h \sim N(0, s). \qquad (22)$$

From Equation (22) it follows that $D = \mu_g - \mu_h$ is normally distributed when $\theta_g - \theta_h$ is a normal random variable or $\theta_g - \theta_h$ is a constant. Also, Equation (22) allows transforming Equation (7) into

$$E(D) = E(\theta_g) - E(\theta_h),$$
$$Var(D) = Var(\theta_g) + Var(\theta_h) + 2s - 2\,Cov(\theta_g, \theta_h) - 2\,Cov(\varepsilon_g, \varepsilon_h). \qquad (23)$$
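Equation (23) can be verified by simulation in the simplest case $Cov(\varepsilon_g, \varepsilon_h) = 0$; the sketch below (ours) draws correlated latent traits and independent errors as in (22).

```python
import random

random.seed(3)
s, rho, n = 0.25, 0.6, 200000   # error variance s, Corr(theta_g, theta_h) = rho
ds = []
for _ in range(n):
    z0, z1 = random.gauss(0, 1), random.gauss(0, 1)
    theta_g = z0                                     # Var(theta_g) = 1
    theta_h = rho * z0 + (1 - rho ** 2) ** 0.5 * z1  # Var = 1, Cov = rho
    eps_g = random.gauss(0, s ** 0.5)                # independent N(0, s) errors
    eps_h = random.gauss(0, s ** 0.5)
    mu_g, mu_h = theta_g - eps_g, theta_h - eps_h    # from Equation (22)
    ds.append(mu_g - mu_h)

mean_d = sum(ds) / n
var_d = sum((d - mean_d) ** 2 for d in ds) / n
var_pred = 1 + 1 + 2 * s - 2 * rho  # Equation (23) with Cov(eps_g, eps_h) = 0
print(abs(var_d - var_pred) < 0.05)  # True
```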
Consider two parallel tests $T_g$ and $T_h$ with smooth information functions administered to a test taker with ability $\theta$. According to Chang and Stout (1993), the following holds:

$$\xi_g = \frac{\theta - \mu_g}{\sigma_g} \sim N(0, 1), \quad \sigma_g^2 = 1/J(T_g, \mu_g),$$
$$\xi_h = \frac{\theta - \mu_h}{\sigma_h} \sim N(0, 1), \quad \sigma_h^2 = 1/J(T_h, \mu_h), \qquad (24)$$

where $J(T_g, \mu_g)$ is the Fisher information of test $T_g$ at ability $\mu_g$. Since tests $T_g$ and $T_h$ are parallel and have smooth information functions, $\sigma_g^2 = \sigma_h^2$. Since tests $T_g$ and $T_h$ are different, $\xi_g$ and $\xi_h$ are independent, and

$$\xi_h - \xi_g = \frac{\theta - \mu_h}{\sigma_h} - \frac{\theta - \mu_g}{\sigma_g} = \frac{\mu_g - \mu_h}{\sigma_h} \sim N(0, 2),$$
$$\frac{\mu_g - \mu_h}{\sqrt{2}\,\sigma_h} \sim N(0, 1). \qquad (25)$$

Thus, given two parallel tests administered to a population, according to Proposition 3 the resultant statistic $D(G \,\|\, H)$ is asymptotically distributed as chi-square with one degree of freedom. This is an interesting result because it does not depend on the distribution of $\theta$ and allows random $\sigma_g^2 = \sigma_h^2$.

One can estimate the influence of the tails of the population on the approximation represented by Equation (25). If the assumption $\sigma_g^2 = \sigma_h^2$ does not hold, then

$$\xi_h - \xi_g = \frac{\theta - \mu_h}{\sigma_h} - \frac{\theta - \mu_g}{\sigma_g} = \frac{\sigma_g - \sigma_h}{\sigma_g\sigma_h}\,\theta + \frac{\mu_g}{\sigma_g} - \frac{\mu_h}{\sigma_h}. \qquad (26)$$
Simulated Experiments
This section presents four simulated experiments that illustrate some of the above propositions. The Kolmogorov–Smirnov statistic $K_D$ is employed to measure the fit between the empirical CDF $F_D(x)$ of $D(G \,\|\, H)$ and the CDF $F(x)$ of the asymptotic distribution:

$$K_D = \sup_x \left| F_D(x) - F(x) \right|. \qquad (27)$$

The posterior distributions of ability were computed by (19) and (20), followed by $D(G \,\|\, H)$ calculated by (21).
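The statistic (27) can be computed without any statistics library; the sketch below (ours) evaluates $K_D$ for a sample that is exactly $\chi_1^2$ (squares of standard normals), using the closed-form $\chi_1^2$ CDF $\operatorname{erf}(\sqrt{x/2})$.

```python
import math
import random

def chi2_1_cdf(x):
    """CDF of chi-square with one degree of freedom."""
    return math.erf(math.sqrt(x / 2)) if x > 0 else 0.0

def ks_statistic(sample, cdf):
    """Equation (27): the empirical CDF jumps at each order statistic,
    so the supremum is attained just before or at one of them."""
    xs = sorted(sample)
    n = len(xs)
    return max(max(abs(i / n - cdf(x)), abs((i + 1) / n - cdf(x)))
               for i, x in enumerate(xs))

random.seed(11)
sample = [random.gauss(0, 1) ** 2 for _ in range(20000)]
kd = ks_statistic(sample, chi2_1_cdf)
print(kd < 0.02)  # True: the sample really is chi-square(1)
```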
Illustration of Proposition 1
Consider the following simulation experiment to illustrate Proposition 1. Two independent groups of 10,000 test takers, $\theta_g \sim N(0.5, 1)$ and $\theta_h \sim N(-0.5, 1)$, were given one test. The test consisted of 50 items modeled by the three-parameter logistic (3PL) IRT model (Lord, 1980) with the following parameters: $a = 1$, $b \sim U(-5, +5)$, $c = 0.1$. Test takers from the two groups were randomly paired to provide the two posterior distributions. Then from the above section on Bayesian posteriors it follows that $\sigma_g^2 = \sigma_h^2 = s$ is constant, and from Equation (23) it follows that $E(D) = 1$ and $Var(D) = 2 + 2s$. Figure 1 shows the empirical CDF of $D(G \,\|\, H)$ and the CDF of the asymptotic distribution from Proposition 1 with parameters $\lambda = 0.352$ and $\omega = 0.124$ computed from the data. One can observe an almost perfect fit, with $K_D = 0.009$.
FIGURE 1. Illustration of Proposition 1: Empirical and asymptotic CDFs of $D(G \,\|\, H)$ computed from simulated data
Figure 2 shows the distribution of $\sigma_g^2$ and $\sigma_h^2$, with means 0.16, 0.17 and variances 0.0005, 0.0009, respectively. One can see that these distributions are not correlated (correlation coefficient = 0.01) and are mostly concentrated around their means with a slight right tail. From the above it follows that

$$\lambda = \frac{E(D)^2}{Var(D)} = \frac{1}{2 + 2\times 0.17} \approx 0.4 \quad\text{and}\quad \omega = \frac{2s}{Var(D)} = \frac{2\times 0.17}{2 + 2\times 0.17} \approx 0.1,$$

which is close to the parameters computed from the data.
FIGURE 2. Illustration of Proposition 1: Distribution of empirical $\sigma_g^2$ (abscissa) and $\sigma_h^2$ (ordinate)
Illustration of Proposition 2
Consider the following simulation experiment to illustrate Proposition 2. Two independent groups of 10,000 test takers, $\theta_g \sim N(0, 1)$ and $\theta_h \sim N(0, 1)$, were given one test. The test consisted of 30 items modeled by the 3PL IRT model (Lord, 1980) with the following parameters: $a = 1$, $b \sim U(-5, +5)$, $c = 0.1$. Test takers from the two groups were randomly paired to provide the two posterior distributions. Then from the above section on Bayesian posteriors it follows that $\sigma_g^2 = \sigma_h^2 = s$ is constant, and from Equation (23) it follows that $E(D) = 0$ and $Var(D) = 2 + 2s$. Figure 3 shows the empirical CDF of $D(G \,\|\, H)$ and the CDF of the asymptotic distribution from Proposition 2 with parameter $\omega = 0.314$ computed from the data. One can observe a good fit, with $K_D = 0.018$.
FIGURE 3. Illustration of Proposition 2: Empirical and asymptotic CDFs of $D(G \,\|\, H)$ computed from simulated data
Figure 4 shows the distribution of $\sigma_g^2$ and $\sigma_h^2$, with means 0.41, 0.40 and variances 0.005, 0.005, respectively. One can see that these distributions are not correlated (correlation coefficient = −0.02) and are mostly concentrated around their means with a right tail. From the above it follows that

$$\omega = \frac{2s}{Var(D)} = \frac{2\times 0.40}{2 + 2\times 0.40} \approx 0.3,$$

which is close to the parameter computed from the data.
FIGURE 4. Illustration of Proposition 2: Distribution of empirical $\sigma_g^2$ (abscissa) and $\sigma_h^2$ (ordinate)
Illustration of Proposition 3
Consider the following simulation experiment to illustrate Proposition 3. Ten thousand test takers with $\theta_g = \theta_h \sim N(0, 1)$ were given one test two times. The test consisted of 100 items modeled by the 3PL IRT model (Lord, 1980) with the following parameters: $a = 1$, $b \sim N(0, 1)$, $c = 0.1$. Then from the above section on Bayesian posteriors it follows that $\sigma_g^2 = \sigma_h^2$ and $\frac{\mu_g - \mu_h}{\sqrt{2}\,\sigma_h} \sim N(0, 1)$. Figure 5 shows the empirical CDF of $D(G \,\|\, H)$ and the CDF of the asymptotic distribution from Proposition 3. One can observe a good fit, with $K_D = 0.012$.
FIGURE 5. Illustration of Proposition 3: Empirical and asymptotic CDFs of $D(G \,\|\, H)$ computed from simulated data
Figure 6 shows the observed distribution of $\sigma_g^2$ and $\sigma_h^2$, with means 0.048, 0.049 and variances 0.006, 0.007, respectively. These distributions are correlated (correlation coefficient = 0.64) and mostly concentrated around their means with a right tail.
FIGURE 6. Illustration of Proposition 3: Distribution of empirical $\sigma_g^2$ (abscissa) and $\sigma_h^2$ (ordinate)
The tails in Figure 6 are now larger. To make them extreme, we changed the test so that its length was 50 items with parameter $b \sim N(1, 1)$.
Figure 7 shows the empirical CDF of $D(G \,\|\, H)$ and the CDF of the asymptotic distribution from Proposition 3. One can observe a good fit, with $K_D = 0.009$.
FIGURE 7. Illustration of Proposition 3 (extreme case): Empirical and asymptotic CDFs of $D(G \,\|\, H)$ computed from simulated data
Figure 8 shows the distribution of $\sigma_g^2$ and $\sigma_h^2$, with means 0.50, 0.50 and variances 0.39, 0.40, respectively. These distributions are correlated (correlation coefficient = 0.70) and have a large right tail.
FIGURE 8. Illustration of Proposition 3 (extreme case): Distribution of empirical $\sigma_g^2$ (abscissa) and $\sigma_h^2$ (ordinate)
Both experiments in this subsection demonstrated that even if $\sigma_g^2 = \sigma_h^2$ is random, the statistic $D(G \,\|\, H)$ is asymptotically chi-square with one degree of freedom (see Equation (25)). Also, the setup of these experiments is closer to practice than that of the previous ones.
Application for Test Security
The major advantage of comparing posteriors for test security is that it takes into account all information available from the responses and allows various ways to partition the test taker's response vector into $\mathbf{r}_g$ and $\mathbf{r}_h$. Practical partitions include: operational items versus variable items, hard items versus easy items, unexposed items versus exposed items, uncompromised items versus compromised items, and items of one type versus items of another type. A large divergence between posteriors indicates a significant change in the test taker's performance (Belov et al., 2007).
Consider a test taker taking two distinct but parallel tests $T_g$, $T_h$. The value of $D(G \,\|\, H)$ from (21) provides an index of similarity between the performances on the two tests. Large divergence values indicate a significant change in performance. From the above section on Bayesian posteriors it follows that the posteriors $G$ and $H$ are approximated by (2) and (3), $\sigma_g^2 = \sigma_h^2$, and $\frac{\mu_g - \mu_h}{\sqrt{2}\,\sigma_h} \sim N(0, 1)$. Therefore, the conditions of Proposition 3 are satisfied. To conduct a statistical hypothesis test to establish aberrant behavior, compute a critical value for a given significance level $\alpha$. Values of $D(G \,\|\, H)$ greater than the critical value give a critical region for a hypothesis test (Lehmann, 1986) on the consistency of performance between two tests.
Let $T_g$ and $T_h$ represent the operational and variable parts of a high-stakes test, respectively. The operational part is the same for all test takers taking the test, while the variable part is generally unique within an adjacent seating area. If a test taker copies answers from his or her neighbor, then the change in the test taker's performance between the operational part and the variable part should be large. Values of $D(G \,\|\, H)$ falling into the critical region indicate a significant change in the test taker's performance between the operational part and the variable part. This can be applied to identify test takers involved in answer copying (Belov & Armstrong, 2009).
In general, the number of items ($m$) on the operational part and the number of items ($n$) on the variable part can be different. For example, in the Law School Admission Test (LSAT), $m \approx 4n$. Therefore, to compute the critical value for a given $\alpha$, one cannot apply Proposition 3 directly. To address this problem, we introduce two distinct methods to satisfy $m = n$.
Method 1: Assume that the operational part $w$ has a subset $w_0$ of responses to items of the same type as the variable part $v$. If $w_0$ has a size different from that of $v$, apply bootstrapping. Then assign $(w \setminus w_0) \cup v$ to the variable part and, thus, $m = n$. As a result, however, the responses to the operational and variable parts have a substantial overlap. Therefore, $\mu_g$ and $\mu_h$ become dependent and Equation (25) does not hold. One can assume that $\sigma_g^2 = \sigma_h^2 = s$ is constant and apply Proposition 2. Since $Var(D) = 2s - 2\,Cov(\varepsilon_g, \varepsilon_h)$ and $Cov(\varepsilon_g, \varepsilon_h) > 0$ (see Equation (23)), the scale $\omega = \frac{2s}{Var(D)} > 1$.

Method 2: Assume that the operational part $w$ has a subset $w_0$ of responses to items of the same type as the variable part $v$. Then consider $v = v \cup (\text{even responses of } w \setminus w_0)$ and $w = w_0 \cup (\text{odd responses of } w \setminus w_0)$. If the resultant $w$ has a size different from that of $v$, apply bootstrapping to satisfy $m = n$.
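Method 2 can be sketched as follows. The code below (ours; the function name and the exact bootstrap rule are illustrative assumptions, not the operational LSAC procedure) alternates the responses of $w \setminus w_0$ between the two parts and then resamples with replacement until $m = n$.

```python
import random

def method2_partition(w, w0_idx, v, rng):
    """Split the responses of w \\ w0 alternately between the variable and
    operational parts, then bootstrap the operational part to the size of v.
    `w`, `v` are lists of 0/1 responses; `w0_idx` indexes the subset w0 of w."""
    idx = set(w0_idx)
    w0 = [w[i] for i in w0_idx]
    rest = [w[i] for i in range(len(w)) if i not in idx]
    v_new = v + rest[0::2]        # even-position responses of w \\ w0
    w_new = w0 + rest[1::2]       # odd-position responses of w \\ w0
    if len(w_new) != len(v_new):  # bootstrap (resample with replacement)
        w_new = [rng.choice(w_new) for _ in range(len(v_new))]
    return w_new, v_new

rng = random.Random(5)
w = [rng.randint(0, 1) for _ in range(100)]  # operational part (m = 100)
v = [rng.randint(0, 1) for _ in range(25)]   # variable part (n = 25)
w2, v2 = method2_partition(w, list(range(20)), v, rng)
print(len(w2) == len(v2))  # True: the two parts now have equal length
```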
Results with empirical data derived from an administration of the LSAT are now given. Items in the operational and variable parts were modeled by the 3PL IRT model (Lord, 1980). The operational part included 100 items. There were 10 variable parts uniformly distributed over about 22,000 test takers. The posterior distributions of ability were computed by (19) and (20), followed by $D(G \,\|\, H)$ calculated by (21).

Apply Method 1 and Proposition 2. Figure 9 shows the empirical CDF of $D(G \,\|\, H)$ and the CDF of the asymptotic distribution from Proposition 2 with parameter $\omega = 3.26$ computed from the data. One can observe a good fit, with $K_D = 0.018$.
FIGURE 9. Illustration of Proposition 2 and Method 1: Empirical and asymptotic CDFs of $D(G \,\|\, H)$ computed from real data
Figure 10 shows the distribution of $\sigma_g^2$ and $\sigma_h^2$, with means 0.065, 0.067 and variances 0.002, 0.002, respectively. One can see that these distributions are correlated (correlation coefficient = 0.9) and mostly concentrated around their means with a slight right tail.
FIGURE 10. Illustration of Proposition 2 and Method 1: Distribution of empirical $\sigma_g^2$ (abscissa) and $\sigma_h^2$ (ordinate)
Apply Method 2 and Proposition 3. Figure 11 shows the empirical CDF of $D(G \,\|\, H)$ and the CDF of the asymptotic distribution from Proposition 3. One can observe a good fit, with $K_D = 0.020$.
FIGURE 11. Illustration of Proposition 3 and Method 2: Empirical and asymptotic CDFs of $D(G \,\|\, H)$ computed from real data
Figure 12 shows the distribution of $\sigma_g^2$ and $\sigma_h^2$, with means 0.11, 0.12 and variances 0.005, 0.006, respectively. These distributions are correlated (correlation coefficient = 0.63) and mostly concentrated around their means with a right tail.
[Figure: scatterplot over the range 0 to 2 on both axes]
FIGURE 12. Illustration of Proposition 3 and Method 2: Distribution of empirical σ²_g (abscissa) and σ²_h (ordinate)
Since the empirical variances of σ²_g and σ²_h are close to zero, one can apply Proposition 2, where ω should be close to 1.
Figure 13 shows the empirical CDF of D̂(G||H) and the CDF of the asymptotic distribution from Proposition 2, with parameter ω = 0.92 computed from the data. One can observe an almost perfect fit, with D_K = 0.012.
[Figure: CDF (ordinate, 0 to 1) versus D(G||H) (abscissa, 0.0 to 10.7), showing the asymptotic and empirical CDFs]
FIGURE 13. Illustration of Proposition 2 and Method 2: Empirical and asymptotic CDFs of D̂(G||H) computed from real data
Discussion
Kullback–Leibler divergence (KLD) is used in many areas to measure the discrepancy between two distributions. The references show that KLD is applied in standardized testing, magnetic resonance imaging, human gene analysis, stochastic complexity, and sample-size selection. Thus, results from the study of the distribution of KLD are important to several fields.
The assumptions made in this paper to identify the distribution of KLD are commonly observed in educational practice. Normality of the Bayesian posterior in various settings has been shown in the references. The assumptions σ²_g = σ²_h and (μ_g − μ_h)/σ_h ~ N(0, 1) hold as soon as two parallel tests are administered to a population. In some cases, it is possible to control the variance of the posterior distribution so that it is close to a chosen value, in other words, to maintain the assumption σ²_g = σ²_h. For example, in a computerized adaptive test, the stopping criterion could
be driven by this variance.
For different combinations of assumptions, we proved that KLD is asymptotically distributed as a scaled (noncentral) chi-square with one degree of freedom or as a scaled (doubly noncentral) F. For details, see Propositions 1–5. Computer experiments with simulated and real data demonstrated a good fit between the asymptotic and empirical distributions.
The results of this paper are directly applicable to educational practice. In particular, one can use Propositions 1–5 to compute the critical value of KLD for a given partitioning of the test: operational items versus variable items, hard items versus easy items, unexposed items versus exposed items, uncompromised items versus compromised items, and items of one type versus items of another type. These propositions, therefore, can be used to identify aberrant responding in a high-stakes test such as the LSAT.
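Computing such a critical value amounts to inverting the asymptotic CDF at the chosen significance level. The sketch below does this numerically by bisection, using the central χ²₁ CDF as a placeholder for the scaled distributions of Propositions 1–5 (the actual cutoff would use the scaled noncentral chi-square or F form from the relevant proposition):

```python
import math

def chi2_1_cdf(x):
    """CDF of the chi-square distribution with one degree of freedom."""
    return math.erf(math.sqrt(x / 2.0)) if x > 0 else 0.0

def critical_value(alpha, cdf=chi2_1_cdf, hi=100.0, tol=1e-10):
    """Smallest c with cdf(c) >= 1 - alpha, found by bisection.
    A response pattern with KLD above c would be flagged as aberrant."""
    lo = 0.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if cdf(mid) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return hi

c = critical_value(0.05)   # ≈ 3.84, the familiar chi-square cutoff
```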
References
Arizono, I., & Ohta, H. (1989). A test for normality based on Kullback-Leibler information. The American Statistician, 43, 20–22.
Belov, D. I., & Armstrong, R. D. (2009). Automatic detection of answer copying via Kullback–Leibler divergence and K-Index (Research Report 09-01). Newtown, PA: Law School Admission Council.
Belov, D. I., Pashley, P. J., Lewis, C., & Armstrong, R. D. (2007). Detecting aberrant responses with Kullback–Leibler distance. In K. Shigemasu, A. Okada, T. Imaizumi, & T. Hoshino (Eds.), New trends in psychometrics (pp. 7–14). Tokyo: Universal Academy Press.
Cabella, B. C. T., Sturzbecher, M. J., Tedeschi, W., Filho, O. B., de Araujo, D. B., & Neves, U. P. C. (2008). A numerical study of the Kullback-Leibler distance in functional magnetic resonance imaging. Brazilian Journal of Physics, 38, 20–25.
Chang, H. H., & Stout, W. (1993). The asymptotic posterior normality of the latent trait in an IRT model. Psychometrika, 58, 37–52.
Clarke, B. (1999). Asymptotic normality of the posterior in relative entropy. IEEE Transactions on Information Theory, 45, 165–176.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: John Wiley & Sons, Inc.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis. Boca Raton: Chapman & Hall.
Johnson, N. L., Kotz, S., & Balakrishnan, N. (1994). Continuous univariate distributions, Volume 1 (2nd ed.). New York: Wiley.
Johnson, N. L., Kotz, S., & Balakrishnan, N. (1995). Continuous univariate distributions, Volume 2 (2nd ed.). New York: Wiley.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22, 79–86.
Kupperman, M. (1957). Further applications of information theory to multivariate analysis and statistical inference (Ph.D. Dissertation). The George Washington University.
Lehmann, E. L. (1986). Testing statistical hypotheses (2nd ed.). New York: Wiley.
Li, Y., & Wang, L. (2008). Testing for homogeneity in mixture using weighted relative entropy. Communications in Statistics—Simulation and Computation, 37, 1981–1995.
Lin, X., Pittman, J., & Clarke, B. (2007). Information conversion, effective samples, and parameter size. IEEE Transactions on Information Theory, 53, 4438–4456.
Lord, F. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.
Pardo, L. (2006). Statistical inference based on divergence measures. New York: Chapman & Hall/CRC.
Salicru, M., Morales, D., Menendez, M. L., & Pardo, L. (1994). On the applications of divergence type measures in testing statistical hypotheses. Journal of Multivariate Analysis, 51, 372–391.
Song, K. S. (2002). Goodness-of-fit tests based on Kullback-Leibler discrimination information. IEEE Transactions on Information Theory, 48, 1103–1117.
Volkau, I., Bhanu Prakash, K. N., Anand, A., Aziz, A., & Nowinski, W. L. (2006). Extraction of the midsagittal plane from morphological neuroimages using the Kullback-Leibler’s measure. Medical Image Analysis, 10, 863–874.
Acknowledgments
We would like to thank Charles Lewis for his valuable comments and suggestions on previous versions of this manuscript.