A Test for the Two-Sample Problem using Mutual Information to Fix Information Leak in e-Passports

Apratim Guha∗
School of Mathematics, University of Birmingham, Birmingham, U.K.

Tom Chothia
School of Computer Science, University of Birmingham, Birmingham, U.K.

February 8, 2011
Abstract

For two independently drawn samples from continuous distributions, a statistical test based on the mutual information statistic is proposed for the null hypothesis that both the samples originate from the same population. It is demonstrated through simulation that this test is more powerful than the commonly used nonparametric tests in many situations. As an application we discuss the information leak of e-passports. Using the mutual information based tests, we show that it is possible to trace a passport by comparing the time taken to respond to a recorded message. We establish that the mutual information based tests detect differences where other tests fail. We also explore the effect of adding an artificial fixed-time delay in specific situations to stop the information leak, and observe that the mutual information based test detects a leak in a situation where the other non-parametric tests fail.
Keywords: Nonparametric Test, Information Theory, Test of Independence, Kernel Density Estimation, Bandwidth, Anonymity.
∗We are grateful to Y. Xiao for his comments on the manuscript as well as his help with the coding. The main theorem in this paper derives from A. Guha's Ph.D. thesis supervised by D. Brillinger, whom he thanks for his encouragement, help and guidance.
1 Introduction
The motivation of this work arises from an anonymity problem in computer science. Chothia and
Smirnov (2010) discuss a time-based traceability attack on an e-passport: by eavesdropping
on its communication with a reader, it is possible to trace the passport later. In that work, one
session between the passport and a legitimate reader was recorded, and then, by comparing the
response time to a previously recorded message, a leak was inferred on the basis of a visual inspection
of the plot of the response times for the same and a different passport. However, no quantitative
measure was used. To quantify the difference in response times between different passports, we
introduce a mutual information (MI) based test statistic to compare two independently drawn samples
from continuous distributions, testing the null hypothesis that both samples originate from the
same population.
When the underlying distributions are continuous, several well-known non-parametric tests
are available. Most of them are based on the empirical distribution function: the Kolmogorov-
Smirnov (KS) test, the Wilcoxon test (Gibbons and Chakraborti 2010), the Anderson-Darling
(AD) test (Pettitt 1976) and the Cramer-von Mises (CVM) test (Anderson 1962) are some of the
most popular ones. A two-sample t-test can be used when a difference in location is suspected. A
modification of the Wilcoxon test proposed by Baumgartner, Weiß and Schindler (1998), henceforth
referred to as the BWS test, has superior power compared to the CVM and the KS tests
in a wide variety of situations. However, it should be used with caution, as this test does not
control the type I error rate in some cases; see Neuhauser (2005). Among the available parametric
and the semiparametric choices for two-sample tests, Zhang (2006) and Wylupek (2010) provide
two recent examples: the former introduces a likelihood-ratio based parametric test, the latter
discusses a "data-driven" semi-parametric test. However, as we may observe from the density
estimates of the response times of various passports in Chothia and Smirnov (2010), a parametric
or semiparametric distribution-based approach would not work well for the passports, as some of
them have bimodal response times and some do not. Hence in this article we limit our discussion
to non-parametric tests only.
For distributions with the same location parameter but otherwise different, neither the t test nor the
Wilcoxon test works well. Such a situation arises during the analysis of the e-passport information
leak in Section 4. The AD test, the CVM test, the KS test and the BWS test do work reasonably
well in such situations; however, we will see in Sections 3 and 4 that the MI test works better than
these tests in many situations.
Examples of applications of various two-sample tests in analysing computer science data can
be found in the literature; for a recent example see Jeske, Lockhart, Stephens and Zhang (2008). The
application of MI in computer science data analysis is also popular; some recent examples are
Alvim, Andres and Palamidessi (2010), Chatzikokolakis, Palamidessi and Panangden (2008) and
Chatzikokolakis, Chothia and Guha (2010). Applications also exist in other areas of science; for
some examples see Paninski (2003), Biswas and Guha (2010) and the references therein. MI has
also been used in the time series context; for example see Brillinger (2004) and Brillinger and Guha
(2007). However, as far as we know, a two-sample test based on MI has never been used before.
To fix ideas, let us start with k random variables X_1, X_2, \cdots, X_k with joint density p_{X_1 X_2 \cdots X_k}(\cdot)
with respect to some measure µ. Shannon (1948) introduced the concept of mutual information
(he called it 'relative entropy'), defined as
I_{X_1, X_2, \cdots, X_k} = \int_{p_{X_1 X_2 \cdots X_k}(x) > 0} \log_2 \left( \frac{p_{X_1 X_2 \cdots X_k}(x)}{p_{X_1}(x_1) p_{X_2}(x_2) \cdots p_{X_k}(x_k)} \right) p_{X_1 X_2 \cdots X_k}(x) \, d\mu(x)    (1)
where x = (x_1, x_2, \cdots, x_k) and p_{X_j}(\cdot), j = 1, \cdots, k are the marginal densities of X_1, X_2, \cdots, X_k
respectively. Notice that I_{X_1, X_2, \cdots, X_k} = 0 if the variables are independent. It may be noted
here that it is customary in the area of information theory and computer science to consider the
logarithm in entropic measures to the base 2. We are following that convention in this paper; and
henceforth the base in a logarithmic expression, unless mentioned otherwise, should be understood
to be 2.
When the joint distribution is continuous, the Lebesgue measure can be used as the dominating
measure and in a discrete setup, one may use the counting measure. In a hybrid setup, i.e. when
some random variables are discrete and the rest are continuous, µ is an appropriate product of the
two measures. The mutual information (MI) is non-negative; it is zero when the random variables
are mutually independent and attains its maximum when the concerned random variables have a
perfect functional relationship (Cover and Thomas 1991; Biswas and Guha 2009). Hence it is a
very useful extension of correlation techniques, which are only useful for studying linear dependence.
As a special case of (1), the MI statistic for two random variables X and Y with joint density
function p_{XY}(x, y) with respect to some dominating measure µ may be defined as
I_{XY} = \int\!\!\int_{p_{XY}(x,y) > 0} \log \left( \frac{p_{XY}(x, y)}{p_X(x) p_Y(y)} \right) p_{XY}(x, y) \, d\mu(x, y);    (2)
where pX(x) and pY (y) are the respective marginals.
Now, consider a hybrid pair (X, Y ) where X is a binary random variable and Y is a continuous
random variable. We may write
I_{XY} = \int_{y: p(0,y) > 0} \log \left( \frac{p(0, y)}{p_0 p(y)} \right) p(0, y) \, dy + \int_{y: p(1,y) > 0} \log \left( \frac{p(1, y)}{p_1 p(y)} \right) p(1, y) \, dy,    (3)
where the joint density parameters are defined as
P[X = 0, y < Y < y + dy] = p(0, y) \, dy, \qquad P[X = 1, y < Y < y + dy] = p(1, y) \, dy,    (4)
and the order 1 parameters are given by
P[y < Y < y + dy] = p(y) \, dy; \qquad P[X = 1] = P(1) = p_1, \quad P[X = 0] = p_0 = 1 - p_1,    (5)
so that p(y) = p(0, y) + p(1, y).
In this paper, we utilize the form of the MI statistic as described in (3) to assess the independence
of two samples. Towards that, let us denote the two samples by Y0 := {Y01, Y02, · · · , Y0n}
and Y1 := {Y11, Y12, · · · , Y1m}. We assume here that the samples are from two continuous
distributions: Y01, Y02, · · · , Y0n are independently sampled from the distribution F0, and further
Y11, Y12, · · · , Y1m are independently sampled from the distribution F1. Let us set the null
hypothesis H0: F0 = F1 and the alternative H1: F0 ≠ F1.
The idea of this test comes from the fact that the MI between two random variables is zero only
when they are independent. To utilise this idea, we combine Y0 and Y1 in one single vector Y,
and create a 0-1 valued vector X, whose j-th element, say Xj, is 0 or 1 according to whether the
jth element of Y , say Yj, is from Y0 or Y1. In other words, X is a vector of length N := (n+m)
with n zeroes followed by m 1s. Under H0, Y would be independent of X in the sense that whether
Xj is 0 or 1 will have no bearing on the value of Yj. Hence the estimated MI between Y and
X would differ only by random error from the estimated MI between Y′, a typical sample of length
N from F0, and X′, a typical sample of length N from the Bernoulli distribution with P(1) = p1 = m/N.
We may note here that as F0 and the above mentioned Bernoulli distribution are
independent, the true MI between these two distributions is 0.
Under H1, on the other hand, the value of Xj determines which distribution Yj is drawn from, and hence X is clearly related
with Y. Hence the MI is higher, and so, typically, is the estimated MI. Therefore we reject
H0 if the estimated MI statistic is 'large'. A more precise rejection criterion is discussed in Section
2. It can easily be shown that the MI between a continuous random variable and a binary random
variable is bounded by 1, and hence values of MI close to 1 suggest a high degree of dependence
between X and Y, which in turn may be considered strong evidence against H0.
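The construction just described can be sketched in a few lines. This is an illustrative sketch with synthetic data; the distributions, seed and variable names are our own choices, not the paper's.

```python
import numpy as np

# Illustrative sketch of the setup above, with synthetic data.
rng = np.random.default_rng(0)

n, m = 100, 100
Y0 = rng.normal(0.0, 1.0, size=n)   # sample of size n from F0
Y1 = rng.normal(0.0, 1.5, size=m)   # sample of size m from F1

# Combined vector Y and the 0-1 label vector X: n zeroes followed by m ones.
Y = np.concatenate([Y0, Y1])
X = np.concatenate([np.zeros(n, dtype=int), np.ones(m, dtype=int)])
N = n + m

# Under H0 (F0 = F1), X carries no information about Y; under H1 the label
# Xj determines which distribution Yj came from, so the true MI is positive.
```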
The rest of the paper is organised as follows. We develop an estimate of the mutual information
based test statistic using kernel density estimates and study its properties in Section 2. We also
provide the asymptotic distribution of the mutual information statistic under certain regularity
conditions, and describe the test procedure in greater detail. In Section 3 we compare the MI
test with the previously mentioned existing tests through simulation. The passport data and our
experiments to explore the information leaks are discussed in Section 4. Section 5 concludes.
2 Mutual Information and the Test Statistic
In this section we introduce an estimate of the mutual information statistic between one discrete
and one continuous random variable, describe its asymptotic distribution under some regularity
conditions when the two variables are independent, and finally discuss the construction of the
critical region of the test based on the MI statistic introduced in Section 1.
2.1 Estimation of Mutual Information
An estimate of the MI between X and Y, I_{XY}, can be used as a check for dependence by testing
whether I_{XY} is significantly different from zero. Many competing estimates exist; for some
examples see Moddemeijer (1989), Paninski (2003) and Antos and Kontoyiannis (2001). A popular
choice is the "plug-in" estimate (Antos and Kontoyiannis 2001). It is obtained by substituting
suitable density estimates into (2):
\hat{I}_{XY} = \int\!\!\int_{(x,y): \hat{p}_{XY}(x,y) > 0} \log \left( \frac{\hat{p}_{XY}(x, y)}{\hat{p}_X(x) \hat{p}_Y(y)} \right) \hat{p}_{XY}(x, y) \, d\mu(x, y).    (6)
In the hybrid situation, (6) reduces to
\hat{I}_{XY} = \int_{y: \hat{p}(0,y) > 0} \log \left( \frac{\hat{p}(0, y)}{\hat{p}_0 \hat{p}(y)} \right) \hat{p}(0, y) \, dy + \int_{y: \hat{p}(1,y) > 0} \log \left( \frac{\hat{p}(1, y)}{\hat{p}_1 \hat{p}(y)} \right) \hat{p}(1, y) \, dy.    (7)
The obvious choice for \hat{p}_1 is the sample proportion of 1's, i.e.

\hat{p}_1 = \frac{1}{N} \sum_{i=1}^{N} \chi_{\{X_i = 1\}},
where χ_B is the indicator function of a set B. The continuous densities may be estimated
using a suitable kernel K as follows:

\hat{p}(y) = \frac{1}{N h_N} \sum_{i=1}^{N} K\left( \frac{Y_i - y}{h_N} \right);

\hat{p}(1, y) = \frac{1}{N h_N} \sum_{i=1}^{N} K\left( \frac{Y_i - y}{h_N} \right) \chi_{\{X_i = 1\}}, \qquad \hat{p}(0, y) = \hat{p}(y) - \hat{p}(1, y),    (8)
where hN is an appropriately chosen bandwidth. We discuss the choice of kernels and bandwidths
in Section 2.2.
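As an illustration, the plug-in estimate (7) with the kernel estimates (8) can be sketched in a few lines of Python. This is a sketch only, not the authors' implementation: the function names, the Epanechnikov kernel and the grid-based numerical integration are our own choices, and the bandwidth follows the Gaussian rule of thumb discussed later in Section 2.3.

```python
import numpy as np

def epanechnikov(u):
    # Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero outside.
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def mi_plugin(X, Y, h=None, grid_size=512):
    """Plug-in estimate of the MI in (7) between a 0-1 vector X and a
    continuous vector Y, using the kernel estimates in (8)."""
    X, Y = np.asarray(X), np.asarray(Y)
    N = len(Y)
    if h is None:
        h = 1.06 * Y.std(ddof=1) * N ** (-0.2)   # rule-of-thumb bandwidth
    p1 = X.mean()
    p0 = 1.0 - p1
    # Evaluate the kernel density estimates on a grid covering the data.
    y = np.linspace(Y.min() - h, Y.max() + h, grid_size)
    W = epanechnikov((Y[None, :] - y[:, None]) / h) / (N * h)
    p_y = W.sum(axis=1)                    # estimate of p(y)
    p_1y = (W * (X == 1)).sum(axis=1)      # estimate of p(1, y)
    p_0y = p_y - p_1y                      # estimate of p(0, y)
    # Riemann-sum approximation of the two integrals in (7), base-2 logs.
    dy = y[1] - y[0]
    mi = 0.0
    for pj, pu in ((p_0y, p0), (p_1y, p1)):
        mask = pj > 0
        mi += np.sum(pj[mask] * np.log2(pj[mask] / (pu * p_y[mask]))) * dy
    return mi
```

By construction the estimate uses base-2 logarithms, matching the convention of Section 1, so for a binary X its population value lies between 0 and 1.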
2.2 A Distribution of the Mutual Information Statistic Under Independence
It is known that there is no universal rate at which the error in estimation of MI goes to zero, no
matter which estimator we pick; see Antos and Kontoyiannis (2001) and Paninski (2003). However,
a more positive result can be obtained under reasonable regularity conditions for a
smaller class of distributions. We now derive such a result.
Let us assume that the pairs {Xi, Yi}, 1 ≤ i ≤ N are independent and identically distributed
(IID) satisfying
A1. Yi’s are bounded continuous real-valued random variables with finite support;
A2. For u = 0, 1, p(u, y) has a continuous bounded second derivative in y;
A3. K has a finite support symmetric around zero, and integrates to 1.
A4. h_N → 0, N h_N^2 → ∞ and N h_N^4 → 0 as N → ∞.
The subscript N on hN will be omitted henceforth.
Under the null hypothesis
H0 : X and Y are independent, (9)
a large sample distribution for \hat{I}_{XY} may be given by the following theorem.

Theorem 1. Under H0 and assumptions A1-A4, N h^{1/2} \left( \hat{I}_{XY} / \log(e) - C_1/(Nh) \right) converges to a
normal distribution with mean 0 and variance C_2 = 0.5 \int \left( \int K(w) K(v + w) \, dw \right)^2 dv \int \chi_{p(y) > 0} \, dy as
N → ∞, where C_1 = 0.5 \int K^2(v) \, dv \int \chi_{p(y) > 0} \, dy.
An outline of the proof is given in Appendix A.
One may utilize the large sample distribution of the MI statistic from Theorem 1 to test H0
against the alternative

H1 : X and Y are not independent.    (10)

Notice that H0 is equivalent to saying that I_{XY} = 0, and similarly H1 is equivalent to saying that
I_{XY} > 0.
The idea of Theorem 1 is similar to Proposition 1 of Fernandes and Neri (2008), which discusses
the large sample distribution of a generalised entropic measure, of which the MI is a special case,
between two continuous processes in the time series situation. Similar to their result, in the hybrid
situation as well the asymptotic distribution of the MI statistic depends on X and Y only through
the length of the support of Y. While this may be considered an advantage, for Theorem 1 to
apply Y needs to be bounded. However, in real-life situations this restriction may not be too critical,
as we rarely observe a distribution with infinite support in practice.
The presence of bias, which grows as h^{-1/2} in the hybrid setup, is one of the known problems of
MI estimates, and it makes estimating the bias essential. For simple kernels and for bounded
distributions whose length of support is known, Theorem 1 can be used to estimate the bias.
In more general situations, where Theorem 1 may not apply, the data-driven methods we discuss in
Section 3 can be employed to estimate the bias.
2.3 A Two Sample Test Using Mutual Information
We have already seen in Section 1 that the MI can be utilised to test whether two or more samples
were obtained from the same or different distributions. We restrict our discussion in this work to
the two sample case, but a generalization can easily be achieved.
Let us now describe the test procedure in more detail. As set out in Section 1, we are going to discuss
the method to construct a test statistic when the two samples are from continuous distributions,
but a test when they are from discrete distributions can also be obtained; for motivation see
Chatzikokolakis, Chothia and Guha (2010).
Suppose we have n independent observations Y01, Y02, · · · , Y0n from a distribution F0 and fur-
ther m independent observations Y11, Y12, · · · , Y1m from a possibly different distribution F1. Let
us write Y0 := {Y01, Y02, · · · , Y0n} and Y1 := {Y11, Y12, · · · , Y1m}. We want to test the null
hypothesis H0: F0 = F1 against the alternative H1: F0 ≠ F1. We further combine Y = (Y0, Y1)
and create a 0 − 1 valued vector X, whose j-th element, Xj, is 0 or 1 according to whether the
jth element of Y , Yj, is from Y0 or Y1, i.e. X is a vector of length N := (n + m) with n zeroes
followed by m 1s.
The test statistic to be used is \hat{I}_{XY}, as described in (7). Note that the choice of bandwidth
is an issue. The optimum choice of bandwidth for a kernel density estimate is well studied, and
a number of options are available to suit different situations; see Silverman (1986) and Sheather
and Jones (1991). Following Silverman (1986), to expedite the computation process we choose a
"rule-of-thumb" optimal bandwidth, which is best suited when the original distribution
is Gaussian but is also known to work well for distributions which are not heavily skewed:

h_{OPT} = 1.06 \, SD(Y) \, n^{-1/5},    (11)

where SD(Y) is the standard deviation of Y. Although this choice works fairly well in the situations
we encounter during simulations and data analysis, an optimal choice of bandwidth for mutual
information estimates remains a subject of future studies.
The asymptotic normality of \hat{I}_{XY} under H0 : F0 = F1 and assumptions A1-A4 could be utilized
to construct a critical region for the MI test. However, the normal approximation often does not
work very well for MI estimates except for large samples, a phenomenon also observed by
Fernandes and Neri (2008). Moreover, the assumption of finite support of Y is restrictive, and
requires the estimation of the length of support from the sample. A more robust procedure is to
use bootstrap techniques to obtain the critical region. An advantage is that this technique also
works when the samples are not necessarily from distributions with finite support, and hence is
less restrictive.
A bootstrap-based critical region for the above mentioned test at level of significance α when
comparing the two samples Y0 and Y1 of size n and m respectively can be obtained through the
following steps:
1. Combine Y0 and Y1 in one single sample which we denote by Y.
2. Simulate a random sample of size N := m + n, say X1, from the Bernoulli distribution with
p1 = m/N and compute its mutual information with Y. Let us denote the value of the
estimated mutual information by I1. We repeat this step a large number of times, say K,
and obtain Ij, j = 1, 2, · · · , K.
3. Use the 100(1− α)th percentile of the sampling distribution of I1, · · · , IK as the cut-off for
the test with level α.
Alternatively, an estimated p-value of the test statistic can be reported when a test sample is
available, which may be computed as the proportion of I1, · · · , IK exceeding the observed MI for
the test sample, say I.
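The three steps above, together with the p-value variant, can be sketched as follows. For brevity the sketch uses a histogram plug-in estimate of the MI in place of the kernel estimate of Section 2.1; all function and variable names are our own illustrative choices, not the paper's.

```python
import numpy as np

def mi_hist(X, Y, bins=20):
    # Histogram plug-in MI between a 0-1 vector X and a continuous vector Y
    # (a simplification standing in for the kernel estimate), in bits.
    edges = np.histogram_bin_edges(Y, bins=bins)
    N = len(Y)
    p_y = np.histogram(Y, bins=edges)[0] / N
    mi = 0.0
    for u in (0, 1):
        p_uy = np.histogram(Y[X == u], bins=edges)[0] / N  # joint cells
        p_u = np.mean(X == u)                              # marginal of X
        mask = p_uy > 0
        mi += np.sum(p_uy[mask] * np.log2(p_uy[mask] / (p_u * p_y[mask])))
    return mi

def bootstrap_pvalue(Y0, Y1, K=1000, rng=None):
    """Steps 1-3 plus the p-value variant: simulate K Bernoulli label vectors
    independent of the pooled sample and compare the observed MI with the
    resulting null distribution."""
    rng = np.random.default_rng() if rng is None else rng
    n, m = len(Y0), len(Y1)
    N = n + m
    Y = np.concatenate([Y0, Y1])                           # step 1
    X_obs = np.concatenate([np.zeros(n, dtype=int), np.ones(m, dtype=int)])
    I_obs = mi_hist(X_obs, Y)
    I_null = np.array([mi_hist(rng.binomial(1, m / N, size=N), Y)
                       for _ in range(K)])                 # step 2
    return np.mean(I_null >= I_obs)                        # estimated p-value
```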
3 Some Simulations
We now compare the power of the MI test to the power of some conventional tests, namely the
KS test, the CVM test, the AD test and the BWS test at 5% level of significance.
To obtain the power of the MI test when m samples from the distribution F0 and n samples
from distribution F1 are to be compared, we use an Epanechnikov kernel based estimate with
the bandwidth chosen according to (11) to obtain the MI test statistics, and use the following
algorithm to compute the power of the MI test:
1. Obtain samples Y0 of length m from F0 and Y1 of length n from F1. Combine them in one
single vector which we denote by Y.
2. We simulate a random sample of size N := m + n, say X1, from the Bernoulli distribution
with P(1) = n/N and compute its mutual information with Y. Let us denote the value
of the estimated mutual information by I1. We repeat this step 10,000 times, and obtain
Ij, j = 1, 2, · · · , 10,000.
3. We use the 100(1−α)th percentile of the sampling distribution of I1, · · · , I10,000 as the cut-off
to be used for rejection for the test with level α.
4. We now again simulate samples Y0 of length m from F0 and Y1 of length n from F1, and
again combine them in one single sample which we denote by Y.
5. Define a 0-1 valued vector X, whose j-th element, Xj, is 0 or 1 according to whether the
jth element of Y, Yj, is from Y0 or Y1, i.e. X is a vector of length N := (m + n) with m
zeroes followed by n 1s. Compute the MI estimate between the Y obtained in step 4 and this
X. H0 is rejected if the obtained mutual information estimate is greater than the
cut-off obtained in step 3.
6. Repeat steps 4 and 5 a total of 1,000 times to estimate the power of the test, given by the
proportion of rejections of H0.
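Steps 1-6 can be sketched as follows, with smaller replication counts than the 10,000 and 1,000 used in the paper so that the sketch runs quickly. For brevity a histogram plug-in MI estimate stands in for the kernel estimate of Section 2.1, and all function and variable names are our own.

```python
import numpy as np

def mi_hist(X, Y, bins=20):
    # Histogram plug-in MI between a 0-1 vector X and a continuous Y, in bits.
    edges = np.histogram_bin_edges(Y, bins=bins)
    N = len(Y)
    p_y = np.histogram(Y, bins=edges)[0] / N
    mi = 0.0
    for u in (0, 1):
        p_uy = np.histogram(Y[X == u], bins=edges)[0] / N
        p_u = np.mean(X == u)
        mask = p_uy > 0
        mi += np.sum(p_uy[mask] * np.log2(p_uy[mask] / (p_u * p_y[mask])))
    return mi

def power_mi_test(sample0, sample1, m=100, n=100, alpha=0.05,
                  n_null=500, n_rep=100, rng=None):
    """Sketch of steps 1-6; sample0 and sample1 are callables drawing
    samples from F0 and F1 respectively."""
    rng = np.random.default_rng() if rng is None else rng
    N = m + n
    # Steps 1-3: null distribution of the MI and its (1 - alpha) cut-off.
    Y = np.concatenate([sample0(m, rng), sample1(n, rng)])
    I_null = [mi_hist(rng.binomial(1, n / N, size=N), Y)
              for _ in range(n_null)]
    cutoff = np.quantile(I_null, 1 - alpha)
    # Steps 4-6: fresh samples with fixed labels; count rejections.
    X = np.concatenate([np.zeros(m, dtype=int), np.ones(n, dtype=int)])
    rejections = 0
    for _ in range(n_rep):
        Y = np.concatenate([sample0(m, rng), sample1(n, rng)])
        if mi_hist(X, Y) > cutoff:
            rejections += 1
    return rejections / n_rep
```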
Following the methods described in Xiao et al. (2007) and Baumgartner, Weiß and Schindler
(1998), the p-values and the cut-offs of the CVM and the BWS tests at the 5% level are estimated
using the appropriate quantiles of the sampling distributions of the said test statistics for the relevant
values of m and n, based on samples of sizes m and n from the standard normal distribution.
The sampling distribution is computed based on 10,000 samples. For the KS and AD tests we
use the asymptotic properties of the test statistics following Conover (1971) and Scholz (1987)
respectively. To estimate the power of the tests we compare the percentage of rejections of H0 by
the tests based on 1,000 random pairs of samples with m = n = 100 and 500. When the sample
sizes grow larger, all the tests start performing well; we have also explored the power of the tests
for samples of size 2,500. As all the tests perform very well at that size, we omit the larger
samples from our discussion for the sake of brevity. It may be noted here that the
proposed MI test statistic is based on density estimates, so it should not be used for very small
sample sizes.
A number of efficient tests of location already exist. When comparing
two normal distributions with equal variances, the UMPU test is the two-sample t-test (Lehmann
and Romano 2010). However, the main goal of our work, as will further be explained in the
next section, is to obtain a test for the case when the two samples are matched for their location. Hence,
during the simulation exercise we primarily concentrate on comparing the power of the various tests in situations
where the first moments of the distributions are matched. Towards that, we compare samples from
the standard normal distribution with samples from normal distributions with zero mean and variance
increasing away from 1.
increasing away from 1. The results are summarized in Figure 1. The MI test clearly appears
to be the most powerful, as can be seen in Figure 1, followed closely by the AD and the BWS
tests. As examples of distributions where the support is finite, so that Theorem 1 holds and
Figure 1: Plot of the estimated power of various tests based on 1000 replications comparing N(0, 1) samples with N(0, σ²) samples; (a) sample size m = n = 100, (b) sample size m = n = 500. [Plot not reproduced; tests shown: MI, KS, CVM, AD, BWS.]

Figure 2: Plot of the estimated power of various tests based on 1000 replications comparing N(0, 1) samples with samples from t distributions with various degrees of freedom; (a) sample size m = n = 100, (b) sample size m = n = 500. [Plot not reproduced; tests shown: MI, KS, CVM, AD, BWS.]
the MI based test statistic works well, we compare the uniform distribution on (−1, 1) with other
symmetric uniform distributions on intervals (−a, a), and also the uniform distribution on the
unit interval with the uniform distributions on intervals (0, a), for values of a gradually increasing
from 1. The results are shown respectively in Figures 3 and 4. The MI test once again appears
the most powerful.
Figure 3: Plot of the estimated power of various tests based on 1000 replications comparing U(−1, 1) samples with samples from U(−a, a) distributions for different values of a; (a) sample size m = n = 100, (b) sample size m = n = 500. [Plot not reproduced; tests shown: MI, KS, CVM, AD, BWS.]

Figure 4: Plot of the estimated power of various tests based on 1000 replications comparing U(0, 1) samples with samples from U(0, a) distributions for different values of a; (a) sample size m = n = 100, (b) sample size m = n = 500. [Plot not reproduced; tests shown: MI, KS, CVM, AD, BWS.]
It is of interest to observe the discriminating power of the tests when a sequence of distributions
is compared with its eventual limit. Towards that, we compare samples from the standard
normal distribution with samples from t-distributions with gradually increasing degrees of freedom. The results
are presented in Figure 2. The MI test exhibits superior discriminating power when comparing
the standard normal distribution with t-distributions with different degrees of freedom. Whereas
for the other tests the power decreases to around 0.05 very quickly, the power of the MI test decreases
much more slowly; for example, when tested with samples of size 500, the MI test is the only
test with any discriminating power between the standard normal and the t distribution with 20
degrees of freedom.
Finally, for further comparison of two distributions with different shapes but equal means and
variances, we compare the N(0, 1/12) distribution with the uniform distribution on the interval
Figure 5: Plot of the estimated power of various tests based on 1000 replications comparing N(0, 1/12) samples with samples from the U(−0.5, 0.5) distribution for different sample sizes. [Plot not reproduced; tests shown: MI, KS, CVM, AD, BWS.]
(−0.5, 0.5) for various sample sizes. The results are presented in Figure 5. Similar to the previous
cases, the MI test is again the most powerful followed by the BWS test and the AD test, whereas
the KS and the CVM tests do not perform very well.
4 Analysis of Passport Information Leakage
4.1 e-Passports
In this section we use the MI test to analyze the security of the radio frequency identification (RFID)
chips embedded in e-passports. These chips broadcast the information printed on the passport
and a JPEG copy of the picture with the aim of making immigration controls faster and more
secure. These chips may also optionally include fingerprints, iris scans and additional personal
information. e-Passports are specified by the International Civil Aviation Organisation (ICAO)
and most countries implement their own version.
Read access to the data on the RFID chip is protected by a cryptographic key based on the
date of birth of the passport holder, the date of expiry of the passport and the passport number.
The idea behind this is to allow read access to anyone who has physical access to the passport, but
to stop covert "skimming" of the data without the owner's knowledge. These passports also aim
to be untraceable: if you do not know the passport’s cryptographic key, it should be impossible
to distinguish one passport from any other, and in particular, it should be impossible to recognize
a passport that you have seen before from the radio messages it transmits.
A reader that knows the date of birth, date of expiry and passport number can use these to
generate an encryption key and a message authentication code (MAC) key (which is used for error
checking). Both these keys are unique to the individual passport. The reader and the passport
then exchange a number of messages that let them prove to each other that they both know the
cryptographic keys. The reader powers up the passport, and the passport then sends a random
number back to the reader. The reader then generates its own random number and encrypts both
the random numbers using the passport's encryption key. The MAC key is used to generate a short
error checking code for the encrypted numbers, and then both the encrypted numbers and the
error checking code are sent to the passport.
The passport uses its unique MAC key to check that it has received the message containing the
encrypted numbers correctly. If this check is passed, the passport then decrypts the message and
checks that it contains the random number it sent to the reader. This proves to the passport that
the reader really knows the passport’s unique cryptographic key and is not, for instance, replaying
an old message.
The passport then encrypts the reader’s random number and sends it back to the reader, with
another MAC error checking code. Once the reader successfully receives its own number back
it can be sure that it is communicating with the passport (or, at least, a device that knows the
passport’s key). After this exchange of messages the passport will allow the reader access to the
information stored on the chip.
4.1.1 Tracing e-Passports
To an outside observer, all of the messages exchanged by the passport and reader appear to be
completely random. If an attacker tried to record a message and replay it to the passport during
a later session the messages would be rejected, because the random numbers would not match.
However, while investigating actual passports, we found that there was a way to identify a passport
without knowing the passport’s cryptographic key.
To be able to trace a passport attackers must first observe an exchange between the passport
they wish to trace and a legitimate reader. While doing so, they must record the message from
the reader that includes the passport’s random number and the error checking code, produced
using its unique MAC key.
When the attacker comes across a passport later and wants to remotely check if it is the same
passport as before, it starts a new run of the protocol. The passport sends the attacker a random
number, and the attacker then replays the message it previously recorded. There are now two
possibilities. First, it may be a different passport than before; in this case the check of the MAC
fails, because each passport uses a different MAC key, and the passport sends an error message.
Second, it might be the same passport again. In this case the check of the MAC succeeds and the
message is then decrypted. However, the random number in this message would be from the old
session, so it would not match the random number the passport expects, and the passport would
stop the exchange and send an error message.
In both cases the replayed message is rejected and the attacker is denied access to the data on
the chip. However, when we experimented with actual passports we observed that it took longer
for a passport to reject the replayed message in the second case, i.e., when it was the passport
we were trying to trace. This is because the message uses the passport’s unique MAC key, so the
MAC check succeeds and the passport has to go on to do the decryption. On the other hand, if
it was a different passport the MAC check would have failed and the message would have been
rejected sooner.
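The two timing paths described above can be summarised schematically; the numbers below are illustrative placeholders, not measured response times.

```python
# Schematic of the rejection-time behaviour described above.
# t_mac and t_decrypt are illustrative placeholder durations, not measurements.
def passport_reject_time(same_passport, t_mac=2.0, t_decrypt=1.0):
    # The MAC check always runs; decryption is attempted only when the MAC
    # verifies, i.e. when the replayed message was built with this
    # passport's unique MAC key.
    t = t_mac
    if same_passport:
        t += t_decrypt  # MAC succeeds; the stale nonce is found only after decryption
    return t            # either way, the passport replies with an error message

# The observable side channel: the traced passport takes longer to reject.
delta = passport_reject_time(True) - passport_reject_time(False)
```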
This difference in response times can be used to detect particular passports. While the range
of the RFID chips is limited, it would certainly be possible to, for instance, build a device that
would sit next to a doorway and remotely detect when certain particular people entered or left a
building. When the existence of this attack was announced at a computer science conference
(Chothia and Smirnov 2010), it gained some media attention; see Goodin (2010). In this paper,
we present a full analysis of the timing information and consider ways to fix this information leak.
4.2 Analysing e-Passports
We now apply the MI test to analyze the extent to which passports from different countries can be
traced and to assess the effectiveness of some possible fixes. In this setting, we replay a message
to a passport and look for any relationship between the time it takes the passport to respond and
whether or not the message came from that particular passport. In terms of our setup in Section
1, X is 1 if the passport we replay the message to is the same one used in the session where the
message was recorded, and X is 0 if the message did not come from this particular passport. The
continuous variable Y in this example is the time it takes to reject the message. The passport is
considered to be secure if, and only if, no evidence of dependence between X and Y is present.
4.2.1 Data Collection
Each country implements its own version of the e-passport, and the times taken to
communicate with a reader have the same distribution for passports of the same nationality.
We therefore tested one passport each from four different countries: Germany, Greece, Ireland and
the UK. For each of these we first calculated the passport’s cryptographic key from the date of
birth, date of expiry and passport number. Then, using a basic RFID reader, we ran the access
protocol and recorded the message we needed to replay. For the German, Greek and Irish passports
we replayed the message to the passport 500 times, and then sent a random message 500 times
(simulating a message from a different passport). For the British passport we replayed the message
to the passport 1000 times, and then sent a random message 1000 times. We added a clock to our
computer program to exactly measure the time between when the replayed message was sent and
when the passport’s error message was received by the reader.
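The timing measurement can be sketched as below. `send_message` is a hypothetical stand-in for the RFID reader call used in the experiment; any callable that blocks until the passport's error reply arrives would do.

```python
import time

def measure_response_times(send_message, message, repetitions):
    """Record the elapsed time between sending a (replayed or random)
    message and receiving the passport's error reply, for each repetition.
    `send_message` is a hypothetical reader call, not a real API."""
    times = []
    for _ in range(repetitions):
        start = time.perf_counter()
        send_message(message)            # blocks until the error reply returns
        times.append(time.perf_counter() - start)
    return times
```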
4.2.2 Analyzing the Times
The response times are shown in Figure 6. Here the solid lines show the response time when
the passport is the same, and the dotted lines show the times when the passport is different. In
each of these cases the time difference is clear; however, there is some overlap between the times.
The first two columns of Table 1 present the values of the MI test statistics computed following
the methods described in Section 3, together with the corresponding p-value estimates based on 10000
bootstrapped samples. The MI estimates are very close to 1 for all four passports
considered, so the passports can clearly be traced.
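The resampled p-value computation can be approximated by a sketch such as the following. A permutation scheme is shown for simplicity; the paper resamples with replacement (bootstrap), but both schemes mimic the null hypothesis of independence. The `statistic` argument would be the MI estimate of Section 3.

```python
import numpy as np

def resampled_p_value(x, y, statistic, n_resamples=10000, seed=0):
    """Estimate a p-value for the null hypothesis that Y is independent of
    the binary label X, by recomputing `statistic` on label-shuffled data
    and counting how often the shuffled value reaches the observed one."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    y = np.asarray(y)
    observed = statistic(x, y)
    exceed = 0
    for _ in range(n_resamples):
        # shuffling the labels breaks any X-Y dependence, simulating H0
        if statistic(rng.permutation(x), y) >= observed:
            exceed += 1
    return exceed / n_resamples
```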
4.2.3 Testing a Fix for the Passport
To fix the leak in the passports, an intuitive remedy is to have the passport proceed to the
decryption stage after the MAC check regardless of the result of the check, which would block the
information leak discussed in Section 4.2.1. However, looking at the plots in Figure 6, it may
seem that a quick fix could be achieved by applying an artificial “time padding” (i.e., a shift of
location) to the response time whenever the passport does not go into the decryption stage.

[Figure 6: Response times for replaying a message to passports. Each panel compares the estimated
densities for the original passport and a different passport: (a) UK passport on reader; (b) Greek
passport on reader; (c) Irish passport on reader; (d) German passport on reader.]
To examine this idea, we experimented with “time padding” by various constants, including the
difference of means and difference of medians of the response times for the same and the different
passports. Adding the difference of medians seemed to work the best in terms of reduction of the
MI estimates. The corresponding MI test statistics are presented in the third column of Table 1.
All the MI values show a significant reduction, so the fix may seem to be working. However,
the estimated p-values presented in the last column of Table 1 tell a different
story. The p-value increases from 0 only for the Greek passport; hence the problem is not solved
for any of the other three passports. For the Greek passport, though, at a 5% level of significance
it may be concluded that it has been “fixed”.
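The median-based padding itself can be sketched as below. The shift is applied here to recorded times, for illustration only; an actual fix would require the chip to add the delay online.

```python
import numpy as np

def pad_by_median_difference(times_same, times_other):
    """Shift the fast (different-passport) response times up by the
    difference of group medians, so the two groups share a common median.
    This models the 'time padding' evaluated in Table 1."""
    times_same = np.asarray(times_same)
    times_other = np.asarray(times_other)
    shift = np.median(times_same) - np.median(times_other)
    return times_other + shift
```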
Table 2 compares the p-values of the other nonparametric tests discussed in this paper for
the fix applied to the various passports. They agree with the conclusion of the MI test for the
British, Irish and Greek passports. However, every other test fails to detect a leak in the fixed
German passport, which clearly demonstrates the superior sensitivity of the MI test and justifies
its use in this situation.

Nationality   MI before padding   p-value   MI after padding   p-value
British       0.9542736           0         0.09446402         0
Irish         0.9999755           0         0.04872853         0
Greek         0.9795026           0         0.01775579         0.075
German        0.983794            0         0.03101871         0

Table 1: A comparison of the mutual information estimates obtained for different passports before
and after applying the time padding based on the difference of medians.

Nationality   MI test   KS test     CVM test   AD test   BWS test
British       0         3×10⁻¹²     0          0         0
Irish         0         8×10⁻¹⁰     0          0         0
Greek         0.075     0.718       0.544      0.3671    0.408
German        0         0.2574316   0.7425     0.3017    0.2705

Table 2: A comparison of the p-values of different test statistics obtained for different passports
after applying the time padding based on the difference of medians.
The case of the Greek passport also needs further discussion. All the tests agree that
the “fix” works for the Greek passport, at least at a 5% level of significance. However, the
p-value of the MI test is only 0.075 (see Table 1), which indicates that there might still be some
difference.
Hence, it seems that overall this fix is not fully effective, although it does work at least
partially and significantly reduces the chance of detecting the leak. However,
to devise a completely leak-proof passport, a better fix is clearly required.
5 Discussion
In this work we proposed an MI based two-sample test for samples from continuous distributions.
We have discussed an estimate of the test statistic based on kernel density estimation and provided
an asymptotic distribution of the test statistic, under some restrictive conditions, when the two
samples are from the same distribution. It was established through simulations that the test works
well even when some of the conditions of the asymptotic result, most notably the condition that
the support of the distributions be finite, are violated.

[Figure 7: Comparing the response times after padding the reply times for the Greek and the
German passports by the difference of the median response times. Each panel compares the
estimated densities for the original passport and a different passport: (a) Greek passport on
reader; (b) German passport on reader, after median correction of the response to the incorrect
passport.]

We presented some simulation-based
comparison with other tests in various situations; the MI test appeared superior in terms of
power both when comparing samples differing in scale and for samples from different distributions
with identical location and scale parameters. Finally, we justified its use in the present analysis by
demonstrating an example where it found a leak in a “fixed” e-passport where other tests failed.
Among the examples mentioned above, the cases of the German and the Greek passports, for which
all other tests failed to show a leak but a visual inspection (see Figure 7) made us suspicious,
were the main motivation to think of a more sensitive test. This led to our idea of the
MI based test, which supported our suspicion that there still was a substantial leak. Whereas the
use of MI is quite popular in various areas of applied sciences, as far as we know our application
of this statistic to a two-sample test is new. A point to be noted here is that we do not claim that
our test is the best in all situations. For example, if the object of interest is a test of location,
one might as well use the t-test, which is UMPU for normal distributions, is quite robust
in other situations, and is by far the simplest to understand and apply. Our interest is mostly
in detecting differences beyond a difference in location, and as we have established here through
simulations and examples, the MI test has superior power in a variety of situations. Similarly,
dedicated tests of scale, e.g. the test proposed by Levene (1960), may work better than general
tests in specific situations, but may be of no use in others, for example in the
problem of improving a passport that we discuss in Section 4. That is why all our studies have
been dedicated to comparing tests of a more general nature.
It should be noted that although Theorem 1 only applies to distributions with bounded sup-
port, the simulations in Section 3 establish that the MI test works fairly well in situations with
unbounded supports as well. We hope to explore these more general situations in a future work.
Perhaps the best feature of the MI test is that we can extend this procedure quite naturally to
compare k (≥ 2) samples, checking whether all of them are from the same distribution by testing
whether the MI between the combined sample and an index vector, with a different indicator for
each sample, is 0 or not. The expression for the MI is obtained by a simple extension of (3) to
the case where X can take values 0, 1, ..., k − 1. We have taken this up in a parallel ongoing
work. It should be noted here that the other two-sample tests discussed here also have extensions
to multi-sample cases; see Zhang and Wu (2007) for a detailed discussion. However, those
extensions depend on a choice of weight functions, and the optimum combination of
weights is not obvious. Such issues do not arise with MI based tests.
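A sketch of this k-sample extension, using a simplified Gaussian-kernel plug-in estimate (an invented helper, not the paper's exact estimator): label the i-th group with X = i and estimate the MI between the combined sample and the label.

```python
import numpy as np

def mi_k_sample(samples, bandwidth):
    """Plug-in MI between a combined sample and its group index, extending
    the two-sample idea to k groups. Under H0 (all groups drawn from one
    distribution) the estimate should be near 0."""
    y = np.concatenate(samples)
    n = len(y)

    def kde(points, at):
        # Gaussian kernel density estimate of `points`, evaluated at `at`.
        u = (at[:, None] - points[None, :]) / bandwidth
        return np.exp(-0.5 * u ** 2).sum(axis=1) / (
            len(points) * bandwidth * np.sqrt(2 * np.pi))

    p_y = kde(y, y)                          # marginal density at each point
    mi, start = 0.0, 0
    for group in samples:
        group = np.asarray(group)
        stop = start + len(group)
        pj = len(group) / n                  # estimate of P(X = j)
        # average log density ratio over this group's own points
        mi += pj * np.mean(np.log(kde(group, y[start:stop]) / p_y[start:stop]))
        start = stop
    return mi
```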
A notable difference of the MI test from the well-known nonparametric two-sample tests is that
it is not a rank-based test. Whereas rank-based tests have many advantages, their problems
with ties are well documented and often require special treatment. The problems with ties
are usually more severe for discrete data, although they may also arise with larger samples from
continuous distributions when the data values are rounded off, as is often the case in real life.
Some versions of the rank-based tests do exist for discrete data in special cases; for example, see
Scholz and Stephens (1987) for the AD test. However, their application is not easy and often suffers from a loss
of power. The MI has a more obvious extension to the discrete situation as it can be estimated
based on the frequencies of the different values of the discrete variables, and hence ties would not
impact the performance of the test statistic computed based on MI. See Chatzikokolakis, Chothia
and Guha (2010) and Biswas and Guha (2009) for some examples of usage of the MI in the discrete
situation.
Finally, setting h = N^{-(1/4+δ)} for a small δ, 0 < δ < 1/4, a rate of convergence of N^{-(3/4-δ)} for
the MI estimate (to 0) can be achieved, which is superior to the best rate for the class of estimates
discussed by Stone (1980). Hence, it may be concluded that despite being biased, the MI estimate
is an efficient one due to its superior rate of convergence, and hence the performance of the MI
test statistic may also be expected to improve more quickly with larger sample sizes than other
nonparametric tests.
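The bandwidth choice and the implied rate can be written out directly; `delta` here is the reader's choice within (0, 1/4).

```python
def bandwidth(n, delta=0.05):
    """h = N^-(1/4 + delta) for a small delta in (0, 1/4)."""
    assert 0 < delta < 0.25
    return n ** -(0.25 + delta)

def mi_rate(n, delta=0.05):
    """The corresponding N^-(3/4 - delta) rate at which the MI estimate
    converges to 0 under H0."""
    return n ** -(0.75 - delta)
```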
References
Alvim, M., Andres, M. & Palamidessi, C. (2010), “Entropy and Attack Models in Information
Flow”, Theoretical Computer Science, IFIP Advances in Information and Communication
Technology, 323, 53–54.
Anderson, T. W. (1962), “On the Distribution of the Two-Sample Cramer-von Mises Criterion”,
Annals of Mathematical Statistics 33, 1148–1159.
Antos, A. & Kontoyiannis, Y. (2001), “Convergence properties of functional estimates for discrete
distributions”, Random Structures & Algorithms 19, 163–193.
Biswas, A. & Guha, A. (2009), “Time series analysis of categorical data on infant sleep status using
auto-mutual information”, Journal of Statistical Planning and Inference 139, 3076–3087.
Biswas, A. & Guha, A. (2010), “Time series analysis of hybrid neurophysiological data and appli-
cation of mutual information”, Journal of Computational Neuroscience 29, 35–47.
Brillinger, D. R. (2004), “Some data analysis using mutual information”, Brazilian Journal of
Probability and Statistics 18, 163–183.
Brillinger, D. R. & Guha, A. (2007), “Mutual information in the frequency domain”, Journal of
Statistical Planning and Inference 137, 1074–1086.
Baumgartner, W., Weiß, P. & Schindler, H. (1998), “A Nonparametric Test for the General Two-
Sample Problem”, Biometrics 54, 1129–1135.
Bosq, D. (1996), Nonparametric Statistics for Stochastic Processes, Springer-Verlag, New York.
Chatzikokolakis, K., Palamidessi, C. & Panangaden, P. (2008), “Anonymity protocols as noisy
channels”, Information and Computation 206, 378–401.
Chatzikokolakis, K., Chothia, T. & Guha, A. (2010), “Statistical Measurement of Information
Leakage”. Proceedings of TACAS 2010, 390–404.
Chothia, T. & Smirnov V. (2010), “A traceability attack against e-passports”, Proceedings of the
14th International Conference on Financial Cryptography and Data Security, Springer, LNCS
6052.
Conover, W. J. (1971), Practical Nonparametric Statistics, John Wiley & Sons, New York.
Cover, T. M. & Thomas, J. A., (1991), Elements of Information Theory, Wiley, New York.
Fernandes, M. & Neri, B. (2008), “Nonparametric entropy-based tests of independence between
stochastic processes”, Econometric Reviews, 276–306.
Gibbons, J. D. & Chakraborti, S. (2010), Nonparametric Statistical Inference, 5th edition, Chap-
man and Hall, London.
Goodin, D. (2010), “Defects in e-passports allow real-time tracking”, The Register, www.
theregister.co.uk/2010/01/26/epassport_rfid_weakness/
Guha, A. (2005), Analysis of Dependence Structures of Hybrid Stochastic Processes Using Mutual
Information, Ph.D. Thesis, University of California, Berkeley.
Hall, P. & Heyde, C. C. (1980), Martingale Limit Theory and Its Application, Academic Press,
San Francisco.
Jeske, D. R., Lockhart, R. A., Stephens, M. A. & Zhang, Q. (2008), “Cramer-von Mises tests
for the compatibility of two software operating environments”, Technometrics 50, 53–63.
Lehmann, E. L. & Romano, J. P. (2010), Testing Statistical Hypotheses, 3rd edition, Springer,
New York.
Levene, H. (1960), “Robust tests for equality of variances”, in Contributions to Probability and
Statistics, Stanford University Press, Stanford, 278–292.
Moddemeijer, R. (1989), “On estimation of entropy and mutual information of continuous distri-
butions”, Signal Processing 16, 233–248.
Neuhauser, M. (2005), “Exact tests based on the Baumgartner-Weiss-Schindler statistic - a sur-
vey”, Statistical Papers, 46, 1–29.
Paninski, L. (2003), “Estimation of entropy and mutual information”, Neural Computation
15, 1191–1253.
Pettitt, A. N. (1976), “A two-sample Anderson-Darling rank statistic”, Biometrika 63, 161–168.
Scholz, F. W. & Stephens, M. A. (1987), “K-sample Anderson-Darling Tests”, Journal of the
American Statistical Association 82, 918–924.
Shannon, C. E. (1948), “A mathematical theory of communication”, Bell System Technical Journal
27, 379–423 & 623–656.
Sheather, S. J. & Jones, M. C. (1991), “A reliable data-based bandwidth selection method for
kernel density estimation”, Journal of the Royal Statistical Society B, 53, 683–690.
Silverman, B. W. (1986), Density estimation for statistics and data analysis, Chapman and Hall,
London.
Stone, C. J. (1980), “Optimal rates of convergence for nonparametric estimators”, Annals of
Statistics 8, 1348–1360.
Wald, A. & Wolfowitz, J. (1940), “On a test whether two samples are from the same population”,
Annals of Mathematical Statistics 11, 147–162.
Wylupek, G. (2010), “Data-driven k-sample tests”. Technometrics 52, 107–123.
Xiao, Y., Gordon, A. & Yakovlev, A. (2007), “A C++ program for the Cramer-von Mises two
sample test”, Journal of Statistical Software 17.
Zhang, J. (2006), “Powerful two sample tests based on the likelihood ratio”, Technometrics 48, 95–
103.
Zhang, J. & Wu, Y. (2007), “k-sample tests based on the likelihood ratio”, Computational
Statistics & Data Analysis 51, 4682–4691.
A Proof of Theorem 1
We now provide a brief outline of the proof of Theorem 1. The details are similar to Guha (2005)
and are provided as supplementary materials.
Firstly, note that
$$ I_{XY}/\log(e) = \int_{y:\,p(0,y)>0} \ln\!\left(\frac{p(0,y)}{p_0\,p(y)}\right) p(0,y)\,dy + \int_{y:\,p(1,y)>0} \ln\!\left(\frac{p(1,y)}{p_1\,p(y)}\right) p(1,y)\,dy, $$
where $I_{XY}$ is as in (6), and $\ln(x) = \log_e(x)$. For simplification of notation, denote
$$ K_{hi}(y) = K\!\left(\frac{Y_i - y}{h}\right); \quad \chi_{ij} = \chi_{X_i = j}; \quad \chi^0_{ij} = \chi_{ij} - p_j; \quad j = 0, 1. \tag{12} $$
Now, using arguments similar to Fernandes and Neri (2008), it can be shown that when assumption
A2 holds and $H_0$ is true, then
$$ I_{XY} = \frac{1}{2}\int \left(\frac{f^2(0,y)}{p(0,y)} + \frac{f^2(1,y)}{p(1,y)} - \frac{f^2(y)}{p(y)}\right) dy + o_p\!\left(\frac{1}{N h^{1/2}}\right), \tag{13} $$
where
$$ f(y) = \hat{p}(y) - p(y); \quad f(j,y) = \hat{p}(j,y) - p(j,y); \quad f_j = \hat{p}_j - p_j \quad \text{for } j = 0, 1. \tag{14} $$
When $H_0$ is true, the first expression on the right-hand side of (13) can be broken into a sum of
two parts as
$$ \frac{1}{2(Nh)^2 p_1 p_0}\,(T_1 + T_2), \tag{15} $$
where
$$ T_1 = \sum_{i=1}^{N} {\chi^0_{i1}}^2 \int \frac{K^2_{hi}(y)}{p(y)}\,dy; \qquad T_2 = 2\sum_{i=1}^{N}\sum_{j<i} \chi^0_{i1}\chi^0_{j1} \int \frac{K_{hi}(y)K_{hj}(y)}{p(y)}\,dy. \tag{16} $$
Now, by an application of Billingsley's inequality (Bosq 1996), under assumptions A1–A4 and
$H_0$ it can be shown that
$$ E(T_1) = \sum_{i=1}^{N} E\!\left({\chi^0_{i1}}^2 \int \frac{K^2_{hi}(y)}{p(y)}\,dy\right) = Nh\,p_1 p_0 \int I_{p(y)>0}\,dy \int K^2(u)\,du + o(Nh), $$
$$ \mathrm{Var}(T_1) = \mathrm{Var}\!\left(\sum_{i=1}^{N} {\chi^0_{i1}}^2 \int \frac{K^2_{hi}(y)}{p(y)}\,dy\right) = O(Nh); \tag{17} $$
$$ E(T_2) = 0; \qquad \mathrm{Var}(T_2) = 2N^2 h^3 \int \left(\int K(w)K(u+w)\,du\right)^2 dw\,(p_1 p_0)^2 + o(N^2 h^3). $$
From the above it follows that
$$ E\!\left(\frac{1}{2 p_1 p_0} \int \frac{\left(\frac{1}{Nh}\sum_i K_{hi}(y)\chi^0_{i1}\right)^2}{p(y)}\,dy\right) \approx \frac{C_1}{Nh}; \qquad \mathrm{Var}\!\left(\frac{1}{2 p_1 p_0} \int \frac{\left(\frac{1}{Nh}\sum_i K_{hi}(y)\chi^0_{i1}\right)^2}{p(y)}\,dy\right) \approx \frac{C_2}{N^2 h}. $$
Let us next define
$$ \kappa_{h;ij} = \int \frac{1}{p(y)}\,K_{hi}(y)K_{hj}(y)\,dy, \tag{18} $$
so that $T_2 = 2\sum_{i=1}^{N}\sum_{j<i} \kappa_{h;ij}\,\chi^0_{i1}\chi^0_{j1}$. Now define
$$ Z_i = \frac{1}{N h^{3/2}} \sum_{j<i} \kappa_{h;ij}\,\chi^0_{i1}\chi^0_{j1}. \tag{19} $$
To prove Theorem 1, it is now enough to show that
$$ \sum_{i=1}^{N} Z_i \Rightarrow N(0,\; C_2\,p_0^2 p_1^2) \tag{20} $$
if assumptions A1–A4 are true. This result follows by an application of Theorem 3.2 of Hall
and Heyde (1980).