A Test for the Two-Sample Problem using Mutual Information to Fix Information Leak in e-Passports

Apratim Guha∗
School of Mathematics, University of Birmingham, Birmingham, U.K.

Tom Chothia
School of Computer Science, University of Birmingham, Birmingham, U.K.

February 8, 2011
Abstract

For two independently drawn samples from continuous distributions, a statistical test based on the mutual information statistic is proposed for the null hypothesis that both the samples originate from the same population. It is demonstrated through simulation that this test is more powerful than the commonly used nonparametric tests in many situations. As an application we discuss the information leak of e-passports. Using the mutual information based tests, we show that it is possible to trace a passport by comparing the time taken to respond to a recorded message. We establish that the mutual information based tests detect differences where other tests fail. We also explore the effect of adding an artificial fixed-time delay in specific situations to stop the information leak, and observe that the mutual information based test detects a leak in a situation where the other non-parametric tests fail.
Keywords: Nonparametric Test, Information Theory, Test of Independence, Kernel Density Estimation, Bandwidth, Anonymity.
∗We are grateful to Y. Xiao for his comments on the manuscript as well as his help with the coding. The main theorem in this paper derives from A. Guha's Ph.D. thesis supervised by D. Brillinger, whom he thanks for his encouragement, help and guidance.
1 Introduction
The motivation of this work arises from an anonymity problem in computer science. Chothia and
Smirnov (2010) discuss a time-based traceability attack on an e-passport: by eavesdropping
on its communication with a reader, it is possible to trace the passport later. In that work, one
session between the passport and a legitimate reader was recorded, and then, by comparing the
response time to a previously recorded message, a leak was inferred on the basis of a visual inspection
of the plot of the response times for the same and a different passport. However, no quantitative
measure was used. To quantify the difference in response times between different passports, we
introduce a mutual information (MI) based test statistic to compare two independently drawn samples
from continuous distributions, testing the null hypothesis that both samples originate from the
same population.
When the underlying distributions are continuous, several well-known non-parametric tests
are available. Most of them are based on the empirical distribution function: the Kolmogorov-
Smirnov (KS) test, the Wilcoxon test (Gibbons and Chakraborti 2010), the Anderson-Darling
(AD) test (Pettitt 1976) and the Cramer-von Mises (CVM) test (Anderson 1962) are some of the
most popular ones. A two-sample t-test can be used when a difference in location is suspected. A
modification of the Wilcoxon test proposed by Baumgartner, Weiß and Schindler (1998), henceforth
referred to as the BWS test, has superior power compared to the CVM and the KS tests
in a wide variety of situations. However, it should be used with caution, as this test does not
control the type I error rate in some cases; see Neuhauser (2005). Among the available parametric
and the semiparametric choices for two-sample tests, Zhang (2006) and Wylupek (2010) provide
two recent examples: the former introduces a likelihood-ratio based parametric test, the latter
discusses a "data-driven" semi-parametric test. However, as we may observe from the density
estimates of the response times of various passports in Chothia and Smirnov (2010), a parametric
or semiparametric distribution-based approach would not work well for the passports, as some of
them have bimodal response times and some do not. Hence in this article we limit our discussion
to non-parametric tests only.
For distributions with the same location parameter but otherwise different, neither the t test nor the
Wilcoxon test works well. Such a situation arises during the analysis of the e-passport information
leak in Section 4. The AD test, the CVM test, the KS test and the BWS test do work reasonably
well in such situations; however, we will see in Sections 3 and 4 that the MI test works better than
these tests in many situations.
Examples of applications of various two-sample tests in analysing computer science data can
be found in the literature; for a recent example see Jeske, Lockhart, Stephens and Zhang (2008). The
application of MI in computer science data analysis is also popular; some recent examples are
Alvim, Andres and Palamidessi (2010), Chatzikokolakis, Palamidessi and Panangden (2008) and
Chatzikokolakis, Chothia and Guha (2010). Applications also exist in other areas of science; for
some examples see Paninski (2003), Biswas and Guha (2010) and the references therein. MI has
also been used in the time series context; for example see Brillinger (2004) and Brillinger and Guha
(2007). However, as far as we know, a two-sample test based on MI has never been used before.
To fix ideas, let us start with k random variables X_1, X_2, \cdots, X_k with joint density p_{X_1 X_2 \cdots X_k}(\cdot)
with respect to some measure µ. Shannon (1948) introduced the concept of mutual information
(he called it 'relative entropy'), defined as
I_{X_1, X_2, \cdots, X_k} = \int_{p_{X_1 X_2 \cdots X_k}(x) > 0} \log_2 \left( \frac{p_{X_1 X_2 \cdots X_k}(x)}{p_{X_1}(x_1) p_{X_2}(x_2) \cdots p_{X_k}(x_k)} \right) p_{X_1 X_2 \cdots X_k}(x) \, d\mu(x)    (1)
where x = (x_1, x_2, \cdots, x_k) and p_{X_j}(\cdot), j = 1, \cdots, k are the marginal densities of X_1, X_2, \cdots, X_k
respectively. Notice that I_{X_1, X_2, \cdots, X_k} = 0 if the variables are independent. It may be noted
here that it is customary in the area of information theory and computer science to consider the
logarithm in entropic measures to the base 2. We are following that convention in this paper; and
henceforth the base in a logarithmic expression, unless mentioned otherwise, should be understood
to be 2.
When the joint distribution is continuous, the Lebesgue measure can be used as the dominating
measure and in a discrete setup, one may use the counting measure. In a hybrid setup, i.e. when
some random variables are discrete and the rest are continuous, µ is an appropriate product of the
two measures. The mutual information (MI) is non-negative; it is zero when the random variables
are mutually independent and attains its maximum when the concerned random variables have a
perfect functional relationship (Cover and Thomas 1991; Biswas and Guha 2009). Hence it is a
very useful extension of correlation techniques, which are only useful for studying linear dependence.
As a special case of (1), the MI statistic for two random variables X and Y with joint density
function p_{XY}(x, y) with respect to some dominating measure µ may be defined as
I_{XY} = \int\!\!\int_{p_{XY}(x,y) > 0} \log \left( \frac{p_{XY}(x, y)}{p_X(x) p_Y(y)} \right) p_{XY}(x, y) \, d\mu(x, y);    (2)
where pX(x) and pY (y) are the respective marginals.
Now, consider a hybrid pair (X, Y ) where X is a binary random variable and Y is a continuous
random variable. We may write
I_{XY} = \int_{y: p(0,y) > 0} \log \left( \frac{p(0, y)}{p_0 p(y)} \right) p(0, y) \, dy + \int_{y: p(1,y) > 0} \log \left( \frac{p(1, y)}{p_1 p(y)} \right) p(1, y) \, dy,    (3)
where the joint density parameters are defined as
P[X = 0, y < Y < y + dy] = p(0, y) \, dy, \qquad P[X = 1, y < Y < y + dy] = p(1, y) \, dy,    (4)
and the order 1 parameters are given by
P[y < Y < y + dy] = p(y) \, dy; \qquad P[X = 1] = P(1) = p_1, \quad P[X = 0] = p_0 = 1 - p_1,    (5)
so that p(y) = p(0, y) + p(1, y).
In this paper, we utilize the form of the MI statistic as described in (3) to assess the independence
of two samples. Towards that, let us denote the two samples by Y0 := {Y01, Y02, · · · , Y0n}
and Y1 := {Y11, Y12, · · · , Y1m}. We assume here that the samples are from two continuous
distributions: Y01, Y02, · · · , Y0n are independently sampled from the distribution F0, and further
Y11, Y12, · · · , Y1m are independently sampled from the distribution F1. Let us set the null
hypothesis H0: F0 = F1 and the alternative H1: F0 ≠ F1.
The idea of this test comes from the fact that the MI between two random variables is zero only
when they are independent. To utilise this idea, we combine Y0 and Y1 in one single vector Y,
and create a 0-1 valued vector X, whose j-th element, say Xj, is 0 or 1 according to whether the
jth element of Y , say Yj, is from Y0 or Y1. In other words, X is a vector of length N := (n+m)
with n zeroes followed by m 1s. Under H0, Y would be independent of X in the sense that whether
Xj is 0 or 1 will have no bearing on the value of Yj. Hence the estimated MI between Y and
X would differ only by random error from the estimated MI between Y′, a typical sample of length
N from F0, and X′, a typical sample of length N from the Bernoulli distribution with P(1) = p1 = m/N.
We may note here that as F0 and the above mentioned Bernoulli distribution are
independent, the true MI between these two distributions is 0.
Under H1, on the other hand, the value of Xj determines which distribution Yj is drawn from, and hence X is clearly related
with Y. Hence the MI is higher, and so, typically, is the estimated MI. Therefore we reject
H0 if the estimated MI statistic is 'large'. A more precise rejection criterion is discussed in Section
2. It can easily be shown that the MI between a continuous random variable and a binary random
variable is bounded by 1, and hence values of MI close to 1 suggest a high degree of dependence
between X and Y, which in turn may be considered strong evidence against H0.
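The construction just described can be sketched in a few lines. This is an illustrative sketch with synthetic data; the distributions, seed and variable names are our own choices, not the paper's.

```python
import numpy as np

# Illustrative sketch of the setup above, with synthetic data.
rng = np.random.default_rng(0)

n, m = 100, 100
Y0 = rng.normal(0.0, 1.0, size=n)   # sample of size n from F0
Y1 = rng.normal(0.0, 1.5, size=m)   # sample of size m from F1

# Combined vector Y and the 0-1 label vector X: n zeroes followed by m ones.
Y = np.concatenate([Y0, Y1])
X = np.concatenate([np.zeros(n, dtype=int), np.ones(m, dtype=int)])
N = n + m

# Under H0 (F0 = F1), X carries no information about Y; under H1 the label
# Xj determines which distribution Yj came from, so the true MI is positive.
```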
The rest of the paper is organised as follows. We develop an estimate of the mutual information
based test statistic using kernel density estimates and study its properties in Section 2. We also
provide the asymptotic distribution of the mutual information statistic under certain regularity
conditions, and describe the test procedure in greater detail. In Section 3 we compare the MI
test with the previously mentioned existing tests through simulation. The passport data and our
experiments to explore the information leaks are discussed in Section 4. Section 5 concludes.
2 Mutual Information and the Test Statistic
In this section we introduce an estimate of the mutual information statistic between one discrete
and one continuous random variable, describe its asymptotic distribution under some regularity
conditions when the two variables are independent, and finally discuss the construction of the
critical region of the test based on the MI statistic introduced in Section 1.
2.1 Estimation of Mutual Information
An estimate of the MI between X and Y, I_{XY}, can be used as a check for dependence by testing
whether I_{XY} is significantly different from zero. Many competing estimates exist; for some
examples see Moddemeijer (1989), Paninski (2003) and Antos and Kontoyiannis (2001). A popular
choice is the "plug-in" estimate (Antos and Kontoyiannis 2001). It is obtained by substituting
suitable density estimates into (2):
\hat{I}_{XY} = \int\!\!\int_{(x,y): \hat{p}_{XY}(x,y) > 0} \log \left( \frac{\hat{p}_{XY}(x, y)}{\hat{p}_X(x) \hat{p}_Y(y)} \right) \hat{p}_{XY}(x, y) \, d\mu(x, y).    (6)
In the hybrid situation, (6) reduces to
\hat{I}_{XY} = \int_{y: \hat{p}(0,y) > 0} \log \left( \frac{\hat{p}(0, y)}{\hat{p}_0 \hat{p}(y)} \right) \hat{p}(0, y) \, dy + \int_{y: \hat{p}(1,y) > 0} \log \left( \frac{\hat{p}(1, y)}{\hat{p}_1 \hat{p}(y)} \right) \hat{p}(1, y) \, dy.    (7)
The obvious choice for \hat{p}_1 is the sample proportion of 1's, i.e.

\hat{p}_1 = \frac{1}{N} \sum_{i=1}^{N} \chi_{\{X_i = 1\}},
where χ_B is the indicator function of a set B. The continuous densities may be estimated
using a suitable kernel K as follows:

\hat{p}(y) = \frac{1}{N h_N} \sum_{i=1}^{N} K\left( \frac{Y_i - y}{h_N} \right);

\hat{p}(1, y) = \frac{1}{N h_N} \sum_{i=1}^{N} K\left( \frac{Y_i - y}{h_N} \right) \chi_{\{X_i = 1\}}, \qquad \hat{p}(0, y) = \hat{p}(y) - \hat{p}(1, y),    (8)
where hN is an appropriately chosen bandwidth. We discuss the choice of kernels and bandwidths
in Section 2.2.
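As an illustration, the plug-in estimate (7) with the kernel estimates (8) can be sketched in a few lines of Python. This is a sketch only, not the authors' implementation: the function names, the Epanechnikov kernel and the grid-based numerical integration are our own choices, and the bandwidth follows the Gaussian rule of thumb discussed later in Section 2.3.

```python
import numpy as np

def epanechnikov(u):
    # Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero outside.
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def mi_plugin(X, Y, h=None, grid_size=512):
    """Plug-in estimate of the MI in (7) between a 0-1 vector X and a
    continuous vector Y, using the kernel estimates in (8)."""
    X, Y = np.asarray(X), np.asarray(Y)
    N = len(Y)
    if h is None:
        h = 1.06 * Y.std(ddof=1) * N ** (-0.2)   # rule-of-thumb bandwidth
    p1 = X.mean()
    p0 = 1.0 - p1
    # Evaluate the kernel density estimates on a grid covering the data.
    y = np.linspace(Y.min() - h, Y.max() + h, grid_size)
    W = epanechnikov((Y[None, :] - y[:, None]) / h) / (N * h)
    p_y = W.sum(axis=1)                    # estimate of p(y)
    p_1y = (W * (X == 1)).sum(axis=1)      # estimate of p(1, y)
    p_0y = p_y - p_1y                      # estimate of p(0, y)
    # Riemann-sum approximation of the two integrals in (7), base-2 logs.
    dy = y[1] - y[0]
    mi = 0.0
    for pj, pu in ((p_0y, p0), (p_1y, p1)):
        mask = pj > 0
        mi += np.sum(pj[mask] * np.log2(pj[mask] / (pu * p_y[mask]))) * dy
    return mi
```

By construction the estimate uses base-2 logarithms, matching the convention of Section 1, so for a binary X its population value lies between 0 and 1.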
2.2 A Distribution of the Mutual Information Statistic Under Independence
It is known that there is no universal rate at which the error in estimation of MI goes to zero, no
matter which estimator we pick; see Antos and Kontoyiannis (2001) and Paninski (2003). However,
a more positive result can be obtained under reasonable regularity conditions for a
smaller class of distributions. We now derive such a result.
Let us assume that the pairs {Xi, Yi}, 1 ≤ i ≤ N are independent and identically distributed
(IID) satisfying
A1. Yi’s are bounded continuous real-valued random variables with finite support;
A2. For u = 0, 1, p(u, y) has a continuous bounded second derivative in y;
A3. K has a finite support symmetric around zero, and integrates to 1.
A4. h_N → 0, N h_N^2 → ∞ and N h_N^4 → 0 as N → ∞.
The subscript N on hN will be omitted henceforth.
Under the null hypothesis
H0 : X and Y are independent, (9)
a large sample distribution for \hat{I}_{XY} may be given by the following theorem.

Theorem 1. Under H0 and assumptions A1-A4, N h^{1/2} \left( \hat{I}_{XY} / \log(e) - C_1/(Nh) \right) converges to a
normal distribution with mean 0 and variance C_2 = 0.5 \int \left( \int K(w) K(v + w) \, dw \right)^2 dv \int \chi_{p(y) > 0} \, dy as
N → ∞, where C_1 = 0.5 \int K^2(v) \, dv \int \chi_{p(y) > 0} \, dy.
An outline of the proof is given in Appendix A.
One may utilize the large sample distribution of the MI statistic from Theorem 1 to test H0
against the alternative

H1 : X and Y are not independent.    (10)

Notice that H0 is equivalent to saying that I_{XY} = 0, and similarly H1 is equivalent to saying that
I_{XY} > 0.
The idea of Theorem 1 is similar to Proposition 1 of Fernandes and Neri (2008), which discusses
the large sample distribution of a generalised entropic measure, of which the MI is a special case,
between two continuous processes in the time series situation. Similar to their result, in the hybrid
situation as well the asymptotic distribution of the MI statistic depends on X and Y only through
the length of the support of Y. While this may be considered an advantage, for Theorem 1 to
apply Y needs to be bounded. However, in real-life situations this restriction may not be too critical,
as we rarely observe a distribution with infinite support in practice.
The presence of bias, which grows as h^{-1/2} in the hybrid setup, is one of the known problems of
MI estimates, and it makes estimating the bias essential. For simple kernels and for bounded
distributions whose length of support is known, Theorem 1 can be used to estimate the bias.
In more general situations, where Theorem 1 may not apply, the data-driven methods we discuss in
Section 3 can be employed to estimate the bias.
2.3 A Two Sample Test Using Mutual Information
We have already seen in Section 1 that the MI can be utilised to test whether two or more samples
were obtained from the same or different distributions. We restrict our discussion in this work to
the two sample case, but a generalization can easily be achieved.
Let us now describe the test procedure in more detail. As set out in Section 1, we are going to discuss
the method to construct a test statistic when the two samples are from continuous distributions,
but a test when they are from discrete distributions can also be obtained; for motivation see
Chatzikokolakis, Chothia and Guha (2010).
Suppose we have n independent observations Y01, Y02, · · · , Y0n from a distribution F0 and fur-
ther m independent observations Y11, Y12, · · · , Y1m from a possibly different distribution F1. Let
us write Y0 := {Y01, Y02, · · · , Y0n} and Y1 := {Y11, Y12, · · · , Y1m}. We want to test the null
hypothesis H0: F0 = F1 against the alternative H1: F0 ≠ F1. We further combine Y = (Y0, Y1)
and create a 0 − 1 valued vector X, whose j-th element, Xj, is 0 or 1 according to whether the
jth element of Y , Yj, is from Y0 or Y1, i.e. X is a vector of length N := (n + m) with n zeroes
followed by m 1s.
The test statistic to be used is \hat{I}_{XY}, as described in (7). Note that the choice of bandwidth
is an issue. The optimum choice of bandwidth for a kernel density estimate is well studied, and
a number of options are available to suit different situations; see Silverman (1986) and Sheather
and Jones (1991). Following Silverman (1986), to expedite the computation process we choose a
"rule-of-thumb" optimal bandwidth, which is best suited when the original distribution
is Gaussian but is also known to work well for distributions which are not heavily skewed:

h_{OPT} = 1.06 \, SD(Y) \, n^{-1/5},    (11)

where SD(Y) is the standard deviation of Y. Although this choice works fairly well in the situations
we encounter during simulations and data analysis, an optimal choice of bandwidth for mutual
information estimates remains a subject of future studies.
The asymptotic normality of \hat{I}_{XY} under H0 : F0 = F1 and assumptions A1-A4 could be utilized
to construct a critical region for the MI test. However, the normal approximation often does not
work very well for MI estimates except for large samples, a phenomenon also observed by
Fernandes and Neri (2008). Moreover, the assumption of finite support of Y is restrictive, and
requires the estimation of the length of support from the sample. A more robust procedure is to
use bootstrap techniques to obtain the critical region. An advantage is that this technique also
works when the samples are not necessarily from distributions with finite support, and hence is
less restrictive.
A bootstrap-based critical region for the above mentioned test at level of significance α when
comparing the two samples Y0 and Y1 of size n and m respectively can be obtained through the
following steps:
1. Combine Y0 and Y1 in one single sample which we denote by Y.
2. Simulate a random sample of size N := m + n, say X1, from the Bernoulli distribution with
p1 = m/N and compute its mutual information with Y. Let us denote the value of the
estimated mutual information by I1. We repeat this step a large number of times, say K,
and obtain Ij, j = 1, 2, · · · , K.
3. Use the 100(1− α)th percentile of the sampling distribution of I1, · · · , IK as the cut-off for
the test with level α.
Alternatively, an estimated p-value of the test statistic can be reported when a test sample is
available, which may be computed as the proportion of I1, · · · , IK exceeding the observed MI for
the test sample, say I.
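The three steps above, together with the p-value variant, can be sketched as follows. For brevity the sketch uses a histogram plug-in estimate of the MI in place of the kernel estimate of Section 2.1; all function and variable names are our own illustrative choices, not the paper's.

```python
import numpy as np

def mi_hist(X, Y, bins=20):
    # Histogram plug-in MI between a 0-1 vector X and a continuous vector Y
    # (a simplification standing in for the kernel estimate), in bits.
    edges = np.histogram_bin_edges(Y, bins=bins)
    N = len(Y)
    p_y = np.histogram(Y, bins=edges)[0] / N
    mi = 0.0
    for u in (0, 1):
        p_uy = np.histogram(Y[X == u], bins=edges)[0] / N  # joint cells
        p_u = np.mean(X == u)                              # marginal of X
        mask = p_uy > 0
        mi += np.sum(p_uy[mask] * np.log2(p_uy[mask] / (p_u * p_y[mask])))
    return mi

def bootstrap_pvalue(Y0, Y1, K=1000, rng=None):
    """Steps 1-3 plus the p-value variant: simulate K Bernoulli label vectors
    independent of the pooled sample and compare the observed MI with the
    resulting null distribution."""
    rng = np.random.default_rng() if rng is None else rng
    n, m = len(Y0), len(Y1)
    N = n + m
    Y = np.concatenate([Y0, Y1])                           # step 1
    X_obs = np.concatenate([np.zeros(n, dtype=int), np.ones(m, dtype=int)])
    I_obs = mi_hist(X_obs, Y)
    I_null = np.array([mi_hist(rng.binomial(1, m / N, size=N), Y)
                       for _ in range(K)])                 # step 2
    return np.mean(I_null >= I_obs)                        # estimated p-value
```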
3 Some Simulations
We now compare the power of the MI test to the power of some conventional tests, namely the
KS test, the CVM test, the AD test and the BWS test at 5% level of significance.
To obtain the power of the MI test when m samples from the distribution F0 and n samples
from distribution F1 are to be compared, we use an Epanechnikov kernel based estimate with
the bandwidth chosen according to (11) to obtain the MI test statistics, and use the following
algorithm to compute the power of the MI test:
1. Obtain samples Y0 of length m from F0 and Y1 of length n from F1. Combine them in one
single vector which we denote by Y.
2. We simulate a random sample of size N := m + n, say X1, from the Bernoulli distribution
with P(1) = n/N and compute its mutual information with Y. Let us denote the value
of the estimated mutual information by I1. We repeat this step 10,000 times, and obtain
Ij, j = 1, 2, · · · , 10,000.
3. We use the 100(1−α)th percentile of the sampling distribution of I1, · · · , I10,000 as the cut-off
to be used for rejection for the test with level α.
4. We now again simulate samples Y0 of length m from F0 and Y1 of length n from F1, and
again combine them in one single sample which we denote by Y.
5. Define a 0-1 valued vector X, whose j-th element, Xj, is 0 or 1 according to whether the
jth element of Y, Yj, is from Y0 or Y1, i.e. X is a vector of length N := (m + n) with m
zeroes followed by n 1s. Compute the MI estimate between the Y obtained in step 4 and this
X. H0 is rejected if the obtained mutual information estimate is greater than the
cut-off obtained in step 3.
6. Repeat steps 4 and 5 a total of 1,000 times to estimate the power of the test, given by the
proportion of rejections of H0.
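Steps 1-6 can be sketched as follows, with smaller replication counts than the 10,000 and 1,000 used in the paper so that the sketch runs quickly. For brevity a histogram plug-in MI estimate stands in for the kernel estimate of Section 2.1, and all function and variable names are our own.

```python
import numpy as np

def mi_hist(X, Y, bins=20):
    # Histogram plug-in MI between a 0-1 vector X and a continuous Y, in bits.
    edges = np.histogram_bin_edges(Y, bins=bins)
    N = len(Y)
    p_y = np.histogram(Y, bins=edges)[0] / N
    mi = 0.0
    for u in (0, 1):
        p_uy = np.histogram(Y[X == u], bins=edges)[0] / N
        p_u = np.mean(X == u)
        mask = p_uy > 0
        mi += np.sum(p_uy[mask] * np.log2(p_uy[mask] / (p_u * p_y[mask])))
    return mi

def power_mi_test(sample0, sample1, m=100, n=100, alpha=0.05,
                  n_null=500, n_rep=100, rng=None):
    """Sketch of steps 1-6; sample0 and sample1 are callables drawing
    samples from F0 and F1 respectively."""
    rng = np.random.default_rng() if rng is None else rng
    N = m + n
    # Steps 1-3: null distribution of the MI and its (1 - alpha) cut-off.
    Y = np.concatenate([sample0(m, rng), sample1(n, rng)])
    I_null = [mi_hist(rng.binomial(1, n / N, size=N), Y)
              for _ in range(n_null)]
    cutoff = np.quantile(I_null, 1 - alpha)
    # Steps 4-6: fresh samples with fixed labels; count rejections.
    X = np.concatenate([np.zeros(m, dtype=int), np.ones(n, dtype=int)])
    rejections = 0
    for _ in range(n_rep):
        Y = np.concatenate([sample0(m, rng), sample1(n, rng)])
        if mi_hist(X, Y) > cutoff:
            rejections += 1
    return rejections / n_rep
```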
Following the methods described in Xiao et al. (2007) and Baumgartner, Weiß and Schindler
(1998), the p-values and the cut-offs of the CVM and the BWS tests at the 5% level are estimated
using the appropriate quantiles of the sampling distributions of the said test statistics for the relevant
values of m and n, based on samples of sizes m and n from the standard normal distribution.
The sampling distribution is computed based on 10,000 samples. For the KS and AD tests we
use the asymptotic properties of the test statistics following Conover (1971) and Scholz (1987)
respectively. To estimate the power of the tests we compare the percentage of rejections of H0 by
the tests based on 1,000 random pairs of samples with m = n = 100 and 500. When the sample
sizes grow larger, all the tests start performing well; we have also explored the power of the tests
for samples of size 2,500. As all the tests perform very well at that size, we omit the larger
samples from our discussion for the sake of brevity. It may be noted here that the
proposed MI test statistic is based on density estimates, so it should not be used for very small
sample sizes.
A number of efficient tests of location already exist. When comparing
two normal distributions with equal variances, the UMPU test is the two-sample t-test (Lehmann
and Romano 2010). However, the main goal of our work, as will further be explained in the
next section, is to obtain a test for the case when the two samples are matched for their location. Hence,
during the simulation exercise we primarily concentrate on comparing the power of the various tests in situations
where the first moments of the distributions are matched. Towards that, we compare samples from
the standard normal distribution with samples from normal distributions with zero mean and variance
increasing away from 1.
increasing away from 1. The results are summarized in Figure 1. The MI test clearly appears
to be the most powerful, as can be seen in Figure 1, followed closely by the AD and the BWS
tests. As examples of distributions where the support is finite, so that Theorem 1 holds and
Figure 1: Plot of the estimated power of various tests based on 1000 replications comparing N(0, 1) samples with N(0, σ²) samples; (a) sample size m = n = 100, (b) sample size m = n = 500. [Plot not reproduced; tests shown: MI, KS, CVM, AD, BWS.]

Figure 2: Plot of the estimated power of various tests based on 1000 replications comparing N(0, 1) samples with samples from t distributions with various degrees of freedom; (a) sample size m = n = 100, (b) sample size m = n = 500. [Plot not reproduced; tests shown: MI, KS, CVM, AD, BWS.]
the MI based test statistic works well, we compare the uniform distribution on (−1, 1) with other
symmetric uniform distributions on intervals (−a, a), and also the uniform distribution on the
unit interval with the uniform distributions on intervals (0, a), for values of a gradually increasing
from 1. The results are shown respectively in Figures 3 and 4. The MI test once again appears
the most powerful.
Figure 3: Plot of the estimated power of various tests based on 1000 replications comparing U(−1, 1) samples with samples from U(−a, a) distributions for different values of a; (a) sample size m = n = 100, (b) sample size m = n = 500. [Plot not reproduced; tests shown: MI, KS, CVM, AD, BWS.]

Figure 4: Plot of the estimated power of various tests based on 1000 replications comparing U(0, 1) samples with samples from U(0, a) distributions for different values of a; (a) sample size m = n = 100, (b) sample size m = n = 500. [Plot not reproduced; tests shown: MI, KS, CVM, AD, BWS.]
It is of interest to observe the discriminating power of the tests when a sequence of distributions
is compared with its eventual limit. Towards that, we compare samples from the standard
normal distribution with samples from t-distributions with gradually increasing degrees of freedom. The results
are presented in Figure 2. The MI test exhibits superior discriminating power when comparing
the standard normal distribution with t-distributions with different degrees of freedom. Whereas
for the other tests the power decreases to around 0.05 very quickly, the power of the MI test decreases
much more slowly; for example, when tested with samples of size 500, the MI test is the only
test with any discriminating power between the standard normal and the t distribution with 20
degrees of freedom.
Finally, for further comparison of two distributions with different shapes but equal means and
variances, we compare the N(0, 1/12) distribution with the uniform distribution on the interval
Figure 5: Plot of the estimated power of various tests based on 1000 replications comparing N(0, 1/12) samples with samples from the U(−0.5, 0.5) distribution for different sample sizes. [Plot not reproduced; tests shown: MI, KS, CVM, AD, BWS.]
(−0.5, 0.5) for various sample sizes. The results are presented in Figure 5. Similar to the previous
cases, the MI test is again the most powerful followed by the BWS test and the AD test, whereas
the KS and the CVM tests do not perform very well.
4 Analysis of Passport Information Leakage
4.1 e-Passports
In this section we use the MI test to analyze the security of the radio frequency identification (RFID)
chips embedded in e-passports. These chips broadcast the information printed on the passport
and a JPEG copy of the picture with the aim of making immigration controls faster and more
secure. These chips may also optionally include fingerprints, iris scans and additional personal
information. e-Passports are specified by the International Civil Aviation Organisation (ICAO)
and most countries implement their own version.
Read access to the data on the RFID chip is protected by a cryptographic key based on the
date of birth of the passport holder, the date of expiry of the passport and the passport number.
The idea behind this is to allow read access to anyone who has physical access to the passport, but
to stop covert "skimming" of the data without the owner's knowledge. These passports also aim
to be untraceable: if you do not know the passport’s cryptographic key, it should be impossible
to distinguish one passport from any other, and in particular, it should be impossible to recognize
a passport that you have seen before from the radio messages it transmits.
A reader that knows the date of birth, date of expiry and passport number can use these to
generate an encryption key and a message authentication code (MAC) key (which is used for error
checking). Both these keys are unique to the individual passport. The reader and the passport
then exchange a number of messages that let them prove to each other that they both know the
cryptographic keys. The reader powers up the passport, and the passport then sends a random
number back to the reader. The reader then generates its own random number and encrypts both
the random numbers using the passport's encryption key. The MAC key is used to generate a short
error checking code for the encrypted numbers, and then both the encrypted numbers and the
error checking code are sent to the passport.
The passport uses its unique MAC key to check that it has received the message containing the
encrypted numbers correctly. If this check is passed, the passport then decrypts the message and
checks that it contains the random number it sent to the reader. This proves to the passport that
the reader really knows the passport’s unique cryptographic key and is not, for instance, replaying
an old message.
The passport then encrypts the reader’s random number and sends it back to the reader, with
another MAC error checking code. Once the reader successfully receives its own number back
it can be sure that it is communicating with the passport (or, at least, a device that knows the
passport’s key). After this exchange of messages the passport will allow the reader access to the
information stored on the chip.
4.1.1 Tracing e-Passports
To an outside observer, all of the messages exchanged by the passport and reader appear to be
completely random. If an attacker tried to record a message and replay it to the passport during
a later session the messages would be rejected, because the random numbers would not match.
However, while investigating actual passports, we found that there was a way to identify a passport
without knowing the passport’s cryptographic key.
To be able to trace a passport attackers must first observe an exchange between the passport
they wish to trace and a legitimate reader. While doing so, they must record the message from
the reader that includes the passport’s random number and the error checking code, produced
using its unique MAC key.
When the attacker comes across a passport later and wants to remotely check if it is the same
passport as before, it starts a new run of the protocol. The passport sends the attacker a random
number, and the attacker then replays the message it previously recorded. There are now two
possibilities. First, it may be a different passport than before; in this case the check of the MAC
fails, because each passport uses a different MAC key, and the passport sends an error message.
Second, it might be the same passport again. In this case the check of the MAC succeeds and the
message is then decrypted. However, the random number in this message would be from the old
session, so it would not match the random number the passport expects, and the passport would
stop the exchange and send an error message.
In both cases the replayed message is rejected and the attacker is denied access to the data on
the chip. However, when we experimented with actual passports we observed that it took longer
for a passport to reject the replayed message in the second case, i.e., when it was the passport
we were trying to trace. This is because the message uses the passport’s unique MAC key, so the
MAC check succeeds and the passport has to go on to do the decryption. On the other hand, if
it was a different passport the MAC check would have failed and the message would have been
rejected sooner.
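The two timing paths described above can be summarised schematically; the numbers below are illustrative placeholders, not measured response times.

```python
# Schematic of the rejection-time behaviour described above.
# t_mac and t_decrypt are illustrative placeholder durations, not measurements.
def passport_reject_time(same_passport, t_mac=2.0, t_decrypt=1.0):
    # The MAC check always runs; decryption is attempted only when the MAC
    # verifies, i.e. when the replayed message was built with this
    # passport's unique MAC key.
    t = t_mac
    if same_passport:
        t += t_decrypt  # MAC succeeds; the stale nonce is found only after decryption
    return t            # either way, the passport replies with an error message

# The observable side channel: the traced passport takes longer to reject.
delta = passport_reject_time(True) - passport_reject_time(False)
```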
This difference in response times can be used to detect particular passports. While the range
of the RFID chips is limited, it would certainly be possible to, for instance, build a device that
would sit next to a doorway and remotely detect when certain particular people entered or left a
building. When the existence of this attack was announced at a computer science conference
(Chothia and Smirnov 2010), it gained some media attention; see Goodin (2010). In this paper,
we present a full analysis of the timing information and consider ways to fix this information leak.
4.2 Analysing e-Passports
We now apply the MI test to analyze the extent to which passports from different countries can be
traced and to assess the effectiveness of some possible fixes. In this setting, we replay a message
to a passport and look for any relationship between the time it takes the passport to respond and
whether or not the message came from that particular passport. In terms of our setup in Section
1, X is 1 if the passport we replay the message to is the same one used in the session where the
message was recorded, and X is 0 if the message did not come from this particular passport. The
continuous variable Y in this example is the time it takes to reject the message. The passport is
considered to be secure if, and only if, no evidence of dependence between X and Y is present.
4.2.1 Data Collection
Each country implements its own version of the e-passport, and the times taken to
communicate with a reader have the same distribution for passports of the same nationality.
We therefore tested one passport each from four different countries: Germany, Greece, Ireland and
the UK. For each of these we first calculated the passport’s cryptographic key from the date of
birth, date of expiry and passport number. Then, using a basic RFID reader, we ran the access
protocol and recorded the message we needed to replay. For the German, Greek and Irish passports
we replayed the message to the passport 500 times, and then sent a random message 500 times
(simulating a message from a different passport). For the British passport we replayed the message
to the passport 1000 times, and then sent a random message 1000 times. We added a clock to our
computer program to exactly measure the time between when the replayed message was sent and
when the passport’s error message was received by the reader.
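The timing measurement can be sketched as below. `send_message` is a hypothetical stand-in for the RFID reader call used in the experiment; any callable that blocks until the passport's error reply arrives would do.

```python
import time

def measure_response_times(send_message, message, repetitions):
    """Record the elapsed time between sending a (replayed or random)
    message and receiving the passport's error reply, for each repetition.
    `send_message` is a hypothetical reader call, not a real API."""
    times = []
    for _ in range(repetitions):
        start = time.perf_counter()
        send_message(message)            # blocks until the error reply returns
        times.append(time.perf_counter() - start)
    return times
```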
4.2.2 Analyzing the Times
The response times are shown in Figure 6. Here the solid lines show the response time when
the passport is the same, and the dotted lines show the times when the passport is different. In
each of these cases the time difference is clear; however, there is some overlap between the times.
The first two columns of Table 1 present the values of the MI test statistics computed following
the methods described in Section 3, together with the corresponding p-value estimates based on 10000
bootstrapped samples. The MI estimates are very close to 1 for all four passports
considered, so the passports can clearly be traced.
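The resampled p-value computation can be approximated by a sketch such as the following. A permutation scheme is shown for simplicity; the paper resamples with replacement (bootstrap), but both schemes mimic the null hypothesis of independence. The `statistic` argument would be the MI estimate of Section 3.

```python
import numpy as np

def resampled_p_value(x, y, statistic, n_resamples=10000, seed=0):
    """Estimate a p-value for the null hypothesis that Y is independent of
    the binary label X, by recomputing `statistic` on label-shuffled data
    and counting how often the shuffled value reaches the observed one."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    y = np.asarray(y)
    observed = statistic(x, y)
    exceed = 0
    for _ in range(n_resamples):
        # shuffling the labels breaks any X-Y dependence, simulating H0
        if statistic(rng.permutation(x), y) >= observed:
            exceed += 1
    return exceed / n_resamples
```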
4.2.3 Testing a Fix for the Passport
To fix the leak in the passports, an intuitive remedy is to have the passport proceed to the
decryption stage after the MAC check regardless of the result of the check, which would block the
information leak discussed in Section 4.2.1. However, looking at the plots in Figure 6, it may
seem that a quick fix could be achieved by applying an artificial “time padding” (i.e., a shift of
location) to the response time whenever the passport does not go into the decryption stage.

[Figure 6: Response times for replaying a message to passports. Each panel compares the estimated
densities for the original passport and a different passport: (a) UK passport on reader; (b) Greek
passport on reader; (c) Irish passport on reader; (d) German passport on reader.]
To examine this idea, we experimented with “time padding” by various constants, including the
difference of means and difference of medians of the response times for the same and the different
passports. Adding the difference of medians seemed to work the best in terms of reduction of the
MI estimates. The corresponding MI test statistics are presented in the third column of Table 1.
All the MI values show a significant reduction, so the fix may seem to be working. However,
the estimated p-values presented in the last column of Table 1 tell a different
story. The p-value increases from 0 only for the Greek passport; hence the problem is not solved
for any of the other three passports. For the Greek passport, though, at a 5% level of significance
it may be concluded that it has been “fixed”.
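The median-based padding itself can be sketched as below. The shift is applied here to recorded times, for illustration only; an actual fix would require the chip to add the delay online.

```python
import numpy as np

def pad_by_median_difference(times_same, times_other):
    """Shift the fast (different-passport) response times up by the
    difference of group medians, so the two groups share a common median.
    This models the 'time padding' evaluated in Table 1."""
    times_same = np.asarray(times_same)
    times_other = np.asarray(times_other)
    shift = np.median(times_same) - np.median(times_other)
    return times_other + shift
```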
Table 2 compares the p-values of the other nonparametric tests discussed in this paper for
the fix applied to the various passports. They agree with the conclusion of the MI test for the
British, Irish and Greek passports. However, every other test fails to detect a leak in the fixed
German passport, which clearly demonstrates the superior sensitivity of the MI test and justifies
its use in this situation.

Nationality   MI before padding   p-value   MI after padding   p-value
British       0.9542736           0         0.09446402         0
Irish         0.9999755           0         0.04872853         0
Greek         0.9795026           0         0.01775579         0.075
German        0.983794            0         0.03101871         0

Table 1: A comparison of the mutual information estimates obtained for different passports before
and after applying the time padding based on the difference of medians.

Nationality   MI test   KS test     CVM test   AD test   BWS test
British       0         3×10⁻¹²     0          0         0
Irish         0         8×10⁻¹⁰     0          0         0
Greek         0.075     0.718       0.544      0.3671    0.408
German        0         0.2574316   0.7425     0.3017    0.2705

Table 2: A comparison of the p-values of different test statistics obtained for different passports
after applying the time padding based on the difference of medians.
The case of the Greek passport also needs further discussion. All the tests agree that
the “fix” works for the Greek passport, at least at a 5% level of significance. However, the
p-value of the MI test is only 0.075 (see Table 1), which indicates that there might still be some
difference.
Hence, it seems that overall this fix is not fully effective, although it does work at least
partially and significantly reduces the chance of detecting the leak. However,
to devise a completely leak-proof passport, a better fix is clearly required.
5 Discussion
In this work we proposed an MI based two-sample test for samples from continuous distributions.
We have discussed an estimate of the test statistic based on kernel density estimation and provided
an asymptotic distribution of the test statistic, under some restrictive conditions, when the two
samples are from the same distribution. It was established through simulations that the test works
well even when some of the conditions of the asymptotic result, most notably the condition that
the support of the distributions be finite, are violated.

[Figure 7: Comparing the response times after padding the reply times for the Greek and the
German passports by the difference of the median response times. Each panel compares the
estimated densities for the original passport and a different passport: (a) Greek passport on
reader; (b) German passport on reader, after median correction of the response to the incorrect
passport.]

We presented some simulation-based
comparison with other tests in various situations; the MI test appeared superior in terms of
power both when comparing samples differing in scale and for samples from different distributions
with identical location and scale parameters. Finally, we justified its use in the present analysis by
demonstrating an example where it found a leak in a “fixed” e-passport where other tests failed.
Among the examples mentioned above, the cases of the German and the Greek passports, for which
all other tests failed to show a leak but a visual inspection (see Figure 7) made us suspicious,
were the main motivation to think of a more sensitive test. This led to our idea of the
MI based test, which supported our suspicion that there still was a substantial leak. Whereas the
use of MI is quite popular in various areas of applied sciences, as far as we know our application
of this statistic to a two-sample test is new. A point to be noted here is that we do not claim that
our test is the best in all situations. For example, if the object of interest is a test of location,
one might as well use the t-test, which is UMPU for normal distributions, is quite robust
in other situations, and is by far the simplest to understand and apply. Our interest is mostly
in detecting differences beyond a difference in location, and as we have established here through
simulations and examples, the MI test has superior power in a variety of situations. Similarly,
dedicated tests of scale, e.g. the test proposed by Levene (1960), may work better than general
tests in specific situations, but may be of no use in others, for example in the
problem of improving a passport that we discuss in Section 4. That is why all our studies have
been dedicated to comparing tests of a more general nature.
It should be noted that although Theorem 1 only applies to distributions with bounded sup-
port, the simulations in Section 3 establish that the MI test works fairly well in situations with
unbounded supports as well. We hope to explore these more general situations in a future work.
Perhaps the best feature of the MI test is that we can extend this procedure quite naturally to
compare k (≥ 2) samples, checking whether all of them are from the same distribution by testing
whether the MI between the combined sample and an index vector, with a different indicator for
each sample, is 0 or not. The expression for the MI is obtained by a simple extension of (3) to
the case where X can take values 0, 1, ..., k − 1. We have taken this up in a parallel ongoing
work. It should be noted here that the other two-sample tests discussed here also have extensions
to multi-sample cases; see Zhang and Wu (2007) for a detailed discussion. However, those
extensions depend on a choice of weight functions, and the optimum combination of
weights is not obvious. Such issues do not arise with MI based tests.
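A sketch of this k-sample extension, using a simplified Gaussian-kernel plug-in estimate (an invented helper, not the paper's exact estimator): label the i-th group with X = i and estimate the MI between the combined sample and the label.

```python
import numpy as np

def mi_k_sample(samples, bandwidth):
    """Plug-in MI between a combined sample and its group index, extending
    the two-sample idea to k groups. Under H0 (all groups drawn from one
    distribution) the estimate should be near 0."""
    y = np.concatenate(samples)
    n = len(y)

    def kde(points, at):
        # Gaussian kernel density estimate of `points`, evaluated at `at`.
        u = (at[:, None] - points[None, :]) / bandwidth
        return np.exp(-0.5 * u ** 2).sum(axis=1) / (
            len(points) * bandwidth * np.sqrt(2 * np.pi))

    p_y = kde(y, y)                          # marginal density at each point
    mi, start = 0.0, 0
    for group in samples:
        group = np.asarray(group)
        stop = start + len(group)
        pj = len(group) / n                  # estimate of P(X = j)
        # average log density ratio over this group's own points
        mi += pj * np.mean(np.log(kde(group, y[start:stop]) / p_y[start:stop]))
        start = stop
    return mi
```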
A notable difference of the MI test from the well-known nonparametric two-sample tests is that
it is not a rank-based test. Whereas rank-based tests have many advantages, their problems
with ties are well documented and often require special treatment. The problems with ties
are usually more severe for discrete data, although they may also arise with larger samples from
continuous distributions when the data values are rounded off, as is often the case in real life.
Some versions of the rank-based tests do exist for discrete data in special cases; for example, see
Scholz and Stephens (1987) for the AD test. However, their application is not easy and often suffers from a loss
of power. The MI has a more obvious extension to the discrete situation as it can be estimated
based on the frequencies of the different values of the discrete variables, and hence ties would not
impact the performance of the test statistic computed based on MI. See Chatzikokolakis, Chothia
and Guha (2010) and Biswas and Guha (2009) for some examples of usage of the MI in the discrete
situation.
Finally, setting h = N^{-(1/4+δ)} for a small δ, 0 < δ < 1/4, a rate of convergence of N^{-(3/4-δ)} for
the MI estimate (to 0) can be achieved, which is superior to the best rate for the class of estimates
discussed by Stone (1980). Hence, it may be concluded that despite being biased, the MI estimate
is an efficient one due to its superior rate of convergence, and hence the performance of the MI
test statistic may also be expected to improve more quickly with larger sample sizes than other
nonparametric tests.
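The bandwidth choice and the implied rate can be written out directly; `delta` here is the reader's choice within (0, 1/4).

```python
def bandwidth(n, delta=0.05):
    """h = N^-(1/4 + delta) for a small delta in (0, 1/4)."""
    assert 0 < delta < 0.25
    return n ** -(0.25 + delta)

def mi_rate(n, delta=0.05):
    """The corresponding N^-(3/4 - delta) rate at which the MI estimate
    converges to 0 under H0."""
    return n ** -(0.75 - delta)
```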
References
Alvim, M., Andres, M. & Palamidessi, C. (2010), “Entropy and Attack Models in Information
Flow”, Theoretical Computer Science, IFIP Advances in Information and Communication
Technology, 323, 53–54.
Anderson, T. W. (1962), “On the Distribution of the Two-Sample Cramer-von Mises Criterion”,
Annals of Mathematical Statistics 33, 1148–1159.
Antos, A. & Kontoyiannis, Y. (2001), “Convergence properties of functional estimates for discrete
distributions”, Random Structures & Algorithms 19, 163–193.
Biswas, A. & Guha, A. (2009), “Time series analysis of categorical data on infant sleep status using
auto-mutual information”, Journal of Statistical Planning and Inference 139, 3076–3087.
Biswas, A. & Guha, A. (2010), “Time series analysis of hybrid neurophysiological data and appli-
cation of mutual information”, Journal of Computational Neuroscience 29, 35–47.
Brillinger, D. R. (2004), “Some data analysis using mutual information”, Brazilian Journal of
Probability and Statistics 18, 163–183.
Brillinger, D. R. & Guha, A. (2007), “Mutual information in the frequency domain”, Journal of
Statistical Planning and Inference 137, 1074–1086.
Baumgartner, W., Weiß, P. & Schindler, H. (1998), “A Nonparametric Test for the General Two-
Sample Problem”, Biometrics 54, 1129–1135.
Bosq, D. (1996), Nonparametric Statistics for Stochastic Processes, Springer-Verlag, New York.
Chatzikokolakis, K., Palamidessi, C. & Panangaden, P. (2008), “Anonymity protocols as noisy
channels”, Information and Computation 206, 378–401.
Chatzikokolakis, K., Chothia, T. & Guha, A. (2010), “Statistical Measurement of Information
Leakage”. Proceedings of TACAS 2010, 390–404.
Chothia, T. & Smirnov V. (2010), “A traceability attack against e-passports”, Proceedings of the
14th International Conference on Financial Cryptography and Data Security, Springer, LNCS
6052.
Conover, W. J. (1971), Practical Nonparametric Statistics, John Wiley & Sons, New York.
Cover, T. M. & Thomas, J. A., (1991), Elements of Information Theory, Wiley, New York.
Fernandes, M. & Neri, B. (2008), “Nonparametric entropy-based tests of independence between
stochastic processes”, Econometric Reviews, 276–306.
Gibbons, J. D. & Chakraborti, S. (2010), Nonparametric Statistical Inference, 5th edition, Chap-
man and Hall, London.
Goodin, D. (2010), “Defects in e-passports allow real-time tracking”, The Register, www.
theregister.co.uk/2010/01/26/epassport_rfid_weakness/
Guha, A. (2005), Analysis of Dependence Structures of Hybrid Stochastic Processes Using Mutual
Information, Ph.D. Thesis, University of California, Berkeley.
Hall, P. & Heyde, C. C. (1980), Martingale Limit Theory and Its Application, Academic Press,
San Francisco.
Jeske, D. R., Lockhart, R. A., Stephens, M. A. & Zhang, Q. (2008), “Cramer-von Mises tests
for the compatibility of two software operating environments”, Technometrics 50, 53–63.
Lehmann, E. L. & Romano, J. P. (2010), Testing Statistical Hypotheses, 3rd edition, Springer,
New York.
Levene, H. (1960), “Robust tests for equality of variances”, in Contributions to Probability and
Statistics, Stanford University Press, Stanford, 278–292.
Moddemeijer, R. (1989), “On estimation of entropy and mutual information of continuous distri-
butions”, Signal Processing 16, 233–248.
Neuhauser, M. (2005), “Exact tests based on the Baumgartner-Weiss-Schindler statistic - a sur-
vey”, Statistical Papers, 46, 1–29.
Paninski, L. (2003), “Estimation of entropy and mutual information”, Neural Computation
15, 1191–1253.
Pettitt, A. N. (1976), “A two-sample Anderson-Darling rank statistic”, Biometrika 63, 161–168.
Scholz, F. W. & Stephens, M. A. (1987), “K-sample Anderson-Darling Tests”, Journal of the
American Statistical Association 82, 918–924.
Shannon, C. E. (1948), “A mathematical theory of communication”, Bell System Technical Journal
27, 379–423 & 623–656.
Sheather, S. J. & Jones, M. C. (1991), “A reliable data-based bandwidth selection method for
kernel density estimation”, Journal of the Royal Statistical Society B, 53, 683–690.
Silverman, B. W. (1986), Density estimation for statistics and data analysis, Chapman and Hall,
London.
Stone, C. J. (1980), “Optimal rates of convergence for nonparametric estimators”, Annals of
Statistics 8, 1348–1360.
Wald, A. & Wolfowitz, J. (1940), “On a test whether two samples are from the same population”,
Annals of Mathematical Statistics 11, 147–162.
Wylupek, G. (2010), “Data-driven k-sample tests”. Technometrics 52, 107–123.
Xiao, Y., Gordon, A. & Yakovlev, A. (2007), “A C++ program for the Cramer-von Mises two
sample test”, Journal of Statistical Software 17.
Zhang, J. (2006), “Powerful two sample tests based on the likelihood ratio”, Technometrics 48, 95–
103.
Zhang, J. & Wu, Y. (2007), “k-sample tests based on the likelihood ratio”, Computational
Statistics & Data Analysis 51, 4682–4691.
A Proof of Theorem 1
We now provide a brief outline of the proof of Theorem 1. The details are similar to Guha (2005)
and are provided as supplementary materials.
Firstly, note that
$$ I_{XY}/\log(e) = \int_{y:\,p(0,y)>0} \ln\!\left(\frac{p(0,y)}{p_0\,p(y)}\right) p(0,y)\,dy + \int_{y:\,p(1,y)>0} \ln\!\left(\frac{p(1,y)}{p_1\,p(y)}\right) p(1,y)\,dy, $$
where $I_{XY}$ is as in (6), and $\ln(x) = \log_e(x)$. For simplification of notation, denote
$$ K_{hi}(y) = K\!\left(\frac{Y_i - y}{h}\right); \quad \chi_{ij} = \chi_{X_i = j}; \quad \chi^0_{ij} = \chi_{ij} - p_j; \quad j = 0, 1. \tag{12} $$
Now, using arguments similar to Fernandes and Neri (2008), it can be shown that when assumption
A2 holds and $H_0$ is true, then
$$ I_{XY} = \frac{1}{2}\int \left(\frac{f^2(0,y)}{p(0,y)} + \frac{f^2(1,y)}{p(1,y)} - \frac{f^2(y)}{p(y)}\right) dy + o_p\!\left(\frac{1}{N h^{1/2}}\right), \tag{13} $$
where
$$ f(y) = \hat{p}(y) - p(y); \quad f(j,y) = \hat{p}(j,y) - p(j,y); \quad f_j = \hat{p}_j - p_j \quad \text{for } j = 0, 1. \tag{14} $$
When $H_0$ is true, the first expression on the right-hand side of (13) can be broken into a sum of
two parts as
$$ \frac{1}{2(Nh)^2 p_1 p_0}\,(T_1 + T_2), \tag{15} $$
where
$$ T_1 = \sum_{i=1}^{N} {\chi^0_{i1}}^2 \int \frac{K^2_{hi}(y)}{p(y)}\,dy; \qquad T_2 = 2\sum_{i=1}^{N}\sum_{j<i} \chi^0_{i1}\chi^0_{j1} \int \frac{K_{hi}(y)K_{hj}(y)}{p(y)}\,dy. \tag{16} $$
Now, by an application of Billingsley's inequality (Bosq 1996), under assumptions A1–A4 and
$H_0$ it can be shown that
$$ E(T_1) = \sum_{i=1}^{N} E\!\left({\chi^0_{i1}}^2 \int \frac{K^2_{hi}(y)}{p(y)}\,dy\right) = Nh\,p_1 p_0 \int I_{p(y)>0}\,dy \int K^2(u)\,du + o(Nh), $$
$$ \mathrm{Var}(T_1) = \mathrm{Var}\!\left(\sum_{i=1}^{N} {\chi^0_{i1}}^2 \int \frac{K^2_{hi}(y)}{p(y)}\,dy\right) = O(Nh); \tag{17} $$
$$ E(T_2) = 0; \qquad \mathrm{Var}(T_2) = 2N^2 h^3 \int \left(\int K(w)K(u+w)\,du\right)^2 dw\,(p_1 p_0)^2 + o(N^2 h^3). $$
From the above it follows that
$$ E\!\left(\frac{1}{2 p_1 p_0} \int \frac{\left(\frac{1}{Nh}\sum_i K_{hi}(y)\chi^0_{i1}\right)^2}{p(y)}\,dy\right) \approx \frac{C_1}{Nh}; \qquad \mathrm{Var}\!\left(\frac{1}{2 p_1 p_0} \int \frac{\left(\frac{1}{Nh}\sum_i K_{hi}(y)\chi^0_{i1}\right)^2}{p(y)}\,dy\right) \approx \frac{C_2}{N^2 h}. $$
Let us next define
$$ \kappa_{h;ij} = \int \frac{1}{p(y)}\,K_{hi}(y)K_{hj}(y)\,dy, \tag{18} $$
so that $T_2 = 2\sum_{i=1}^{N}\sum_{j<i} \kappa_{h;ij}\,\chi^0_{i1}\chi^0_{j1}$. Now define
$$ Z_i = \frac{1}{N h^{3/2}} \sum_{j<i} \kappa_{h;ij}\,\chi^0_{i1}\chi^0_{j1}. \tag{19} $$
To prove Theorem 1, it is now enough to show that
$$ \sum_{i=1}^{N} Z_i \Rightarrow N(0,\; C_2\,p_0^2 p_1^2) \tag{20} $$
if assumptions A1–A4 are true. This result follows by an application of Theorem 3.2 of Hall
and Heyde (1980).