Cross-validation of Nearest-Neighbour Discriminant Analysis
A.P. White¹
Computer Centre, Elms Road, University of Birmingham, Edgbaston, Birmingham B15 2TT, United Kingdom
Abstract
The SAS statistical package contains a general-purpose discriminant procedure, DISCRIM. Among the options available for this procedure are ones for performing nearest-neighbour discriminant analysis and cross-validation. Each of these works well enough when used separately but, when the two options are used together, an optimistic bias in cross-validated performance emerges. For certain parameter values, this bias can be dramatically large. The cause of the problem is analyzed mathematically for the two-class case with uniformly distributed data and demonstrated by simulation for normal data. The corresponding misbehaviour for multiple classes is also demonstrated by Monte Carlo simulation. A modification to the procedure, which would remove the bias, is proposed.
Key Words: Cross-validation; nearest-neighbour discriminant analysis; SAS; optimistic bias.
¹ A.P. White is also an Associate Member of the School of Mathematics and Statistics at the University of Birmingham
1 Introduction
The general discriminant problem is one of deriving a model for classifying observations of unknown class membership by making use of a set of observations from the same population, whose class membership is known. To be more specific, let S be a set of n observations, each of the form (ω, x), where ω denotes membership of one of c classes. Each observation has measurements on m variables, giving the vector x.
Let the prior probability for membership of class i be p(ω_i). Also, let the unconditional and class-specific probability densities at x be f(x) and f_i(x), respectively. Bayes' theorem then gives the posterior probability that an observation at x belongs to class i:
p(ω_i | x) = p(ω_i) f_i(x) / f(x)    (1)
Because the classes are mutually exclusive and jointly exhaustive, this can be re-written as:

p(ω_i | x) = p(ω_i) f_i(x) / Σ_j p(ω_j) f_j(x)    (2)
Classification of a new observation at x is then carried out from the posterior probabilities. Thus x is predicted as belonging to class ω_j if p(ω_j | x) = max_i p(ω_i | x).
Now, different approaches to discriminant analysis employ different methods of estimating the class-specific probability densities, f_i(x). In the parametric case, Fisher's linear discriminant analysis derives these estimates by assuming a multivariate normal distribution for the data, based on class-specific sample means and a pooled covariance matrix. (The quadratic version is similar but uses separate covariance matrices for each class). The k-nearest-neighbour (k-NN) method, on the other hand, is nonparametric and makes no such distributional assumptions. In this approach, a kernel is formed in the measurement space. This kernel has the shape of a hypersphere and is centred on x. The volume, V, of the kernel is such that it is just large enough to contain k observations. Let k_i of these observations belong to class ω_i and let there be n_i observations in S belonging to class ω_i. Thus, summing over all c classes gives Σ k_i = k and Σ n_i = n.
Hand (1981) shows how posterior probabilities can be estimated in such a situation. The essence of his argument is as follows. The class-specific probability density for class ω_i at x is estimated by:

f_i(x) = k_i / (n_i V)    (3)

Similarly, the unconditional probability density at x is estimated by:

f(x) = k / (nV)    (4)
If the sample sizes, n_i, are proportional to the prior probabilities, p(ω_i), then the priors can also be estimated by:

p(ω_i) = n_i / n    (5)

Substituting from Equations 3, 4 and 5 into 1 gives estimates for the posterior probabilities:

p(ω_i | x) = k_i / k    (6)
On the other hand, if the sample sizes are not proportional to the priors, then an adjustment is required. In this case, let:

r_i = p(ω_i) / (n_i / n)    (7)

where the various r_i are adjustment factors for the lack of proportionality. The estimates for the posterior probabilities now become:

p(ω_i | x) = r_i k_i / Σ_j r_j k_j    (8)
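The estimation scheme of Equations 3 to 8 translates directly into a few lines of code. The following Python sketch is illustrative only (the function name and interface are not from the paper): given the kernel counts k_i and the class sample sizes n_i, it returns the estimated posteriors, applying Equation 6 under proportional sampling and the r_i adjustment of Equations 7 and 8 otherwise.

```python
def knn_posteriors(k_counts, n_counts, priors=None):
    """Estimate posterior class probabilities from k-NN kernel counts.

    k_counts[i] -- neighbours of class i inside the kernel
    n_counts[i] -- sample size n_i of class i
    priors[i]   -- specified prior p(omega_i); if None, the sample sizes
                   are taken as proportional to the priors (Equation 5).
    """
    k = sum(k_counts)
    n = sum(n_counts)
    if priors is None:
        # Proportional sampling: Equation 6, p(omega_i | x) = k_i / k
        return [ki / k for ki in k_counts]
    # Non-proportional sampling: adjustment factors r_i (Equation 7),
    # then Equation 8, p(omega_i | x) = r_i k_i / sum_j r_j k_j
    r = [p / (ni / n) for p, ni in zip(priors, n_counts)]
    scores = [ri * ki for ri, ki in zip(r, k_counts)]
    total = sum(scores)
    return [s / total for s in scores]
```

For example, with k = 5 neighbours split 3:2 between two equally sampled classes, the posteriors are simply 0.6 and 0.4; if class 2 is sampled at twice the rate of class 1 but the priors are specified as equal, the adjustment shifts the posteriors to 0.75 and 0.25.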
2 Cross-validation Anomaly in SAS
The statistical package SAS contains a multi-purpose discriminant procedure, called DISCRIM. This procedure has options for k-NN discriminant analysis and for cross-validation. These options may be used in combination. However, the way in which this is implemented in SAS is responsible for a rather strange difficulty which arises under cross-validation, in the form of a parameter-dependent bias in the cross-validated error rate estimate. In certain
circumstances, this bias can be dramatically large. This anomaly is, perhaps,
best introduced by means of an example, leaving a more general treatment of
the problem until later in this paper.
Suppose that cross-validation is being performed on a data set in which
the measurement space consists of a single uniformly distributed variable, x,
and that observations belong to one of two equiprobable classes. Suppose that
x contains no information at all about class membership and that the sample sizes n_1 and n_2 are equal. Now, consider the behaviour of an algorithm, operating as previously specified, with a parameter setting of k = 2. The distribution of kernel membership over the cross-validation procedure will be very nearly binomial (k, p), where p = n_1/n. In this case, p = 1/2. (The distribution is not exactly binomial because p changes very slightly, according to the actual class membership of the observation being classified, but this small detail is unimportant here).
The focus of interest is the consequences which follow from tied class
membership in the kernels. In this example, approximately half the kernels
would be expected to have one neighbour belonging to each class. Without
loss of generality, consider what happens when a member of class 1 is subjected
to cross-validatory classification in this situation, in order to estimate e_cv (the cross-validated error rate). From Equation (3), it can be seen that:
f_1(x) = 1 / (V(n_1 - 1))    (9)

and also that:

f_2(x) = 1 / (V n_2)    (10)
As the prior probabilities are equal, it follows that p(ω_1 | x) > p(ω_2 | x) and the case will be classified correctly. Kernels with pure class membership will obviously produce classification in the expected direction. This leads to the expected value for the cross-validated error rate, E(e_cv), being only 1/4, rather than 1/2 as expected under a random assignment of observations to predicted classes. With these parameter values, it is easy to see that when k is changed, the parity of k has a marked effect on the estimated error rate under cross-validation, because of the effect of ties in the kernel membership when k is even but not when it is odd. Thus, for odd values of k, E(e_cv) = 1/2 but for even values of k, an optimistic bias in E(e_cv) is clearly evident:

E(e_cv) = (1/2)(1 - C(k, k/2) / 2^k)    (for k even)    (11)

where C(k, k/2) denotes the binomial coefficient.
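The anomaly is easy to reproduce outside SAS. The Python sketch below is a minimal imitation of the mechanism described above, not SAS itself: under leave-one-out deletion, the density estimate for the deleted case's own class uses the divisor n_1 - 1 (Equation 9) while the other class uses n_2 (Equation 10). On an uninformative uniform variable with two equiprobable classes, the cross-validated error rate comes out near 1/4 for k = 2, instead of the 1/2 obtained for odd k.

```python
import numpy as np

def sas_style_cv_error(x, y, k, rng):
    """Leave-one-out k-NN error with SAS-style density estimates: the
    deleted case's own class uses the divisor n_i - 1 (Equation 9),
    every other class uses its full sample size n_j (Equation 10)."""
    counts = np.bincount(y)              # class sample sizes n_i
    errors = 0
    for i in range(len(x)):
        d = np.abs(x - x[i])
        d[i] = np.inf                    # delete case i from the sample
        nbrs = y[np.argsort(d)[:k]]      # classes of the k nearest neighbours
        k_c = np.bincount(nbrs, minlength=len(counts)).astype(float)
        div = counts.astype(float).copy()
        div[y[i]] -= 1.0                 # the biased divisor for the true class
        score = k_c / div                # proportional to the density estimate
        best = np.flatnonzero(score == score.max())
        errors += rng.choice(best) != y[i]   # break exact score ties at random
    return errors / len(x)

rng = np.random.default_rng(1)
n = 1000
x = rng.random(n)                        # uniform, carries no class information
y = np.repeat([0, 1], n // 2)            # two equiprobable classes, n_1 = n_2
err_k2 = sas_style_cv_error(x, y, 2, rng)
err_k3 = sas_style_cv_error(x, y, 3, rng)
print(err_k2, err_k3)                    # roughly 0.25 and 0.5
```

For k = 2, every tied kernel (one neighbour from each class) is resolved in favour of the deleted case's own class, which is exactly the source of the optimistic bias.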
Another disturbing feature of this approach is the relationship that emerges between e_rs (the resubstitution error estimate) and e_cv. Under resubstitution, for the same parameter values, the effect of ties in kernel membership is different. In these circumstances, the class-specific probability densities will be exactly equal, leading to a tie in the posterior probabilities. In SAS these ties are evaluated conservatively (i.e. as classification errors). Consequently, for k > 1, the following relationship holds:

e_rs^(k+1) = e_cv^(k)    (12)

(Of course, for k = 1, e_rs is zero, because each observation is its own nearest neighbour). In fact, this relationship is quite general and can be shown to hold for any number of equiprobable classes. Moving from cross-validation with k nearest neighbours to resubstitution with k + 1 increases the number of neighbours of the same class as the test case by one. Because of the different consequences of having ties for the majority in kernel membership under resubstitution and cross-validation, this means that the judgement of majority membership will not differ between the two schemes.
Now, all these strange properties follow from the fact that, for tied kernel membership, the density estimates for the two classes are not equal (as might be naively expected) but are biased in favour of the class to which the deleted observation belongs. In the parametric situation, by contrast, this does not happen. Consider a particular observation being classified using Fisher's linear discriminant analysis. Suppose that, under resubstitution, the observation lies at a point exactly mid-way between the two group means (and hence the group-specific densities are equal). Under cross-validation, the group mean of the class to which the deleted case belongs will have moved slightly farther away from the observation itself (because this observation no longer makes a contribution to the computation of the mean) and hence the class-specific density estimate will be somewhat lower for the true class than for the other class.
When the sample sizes are not equal, the effect of sample size on the estimates of group-specific density, and hence on the estimated posterior probabilities, is easily calculated for ties in kernel membership. The results are summarised in Table 1. It can easily be seen that, for |n_1 - n_2| ≤ 1, there is an optimistic bias in the classification behaviour under cross-validation. Outside these limits, the mean error rate has the appropriate theoretical value but the performance is markedly different for observations from the two different classes.
               Actual Class
n_1 - n_2      1          2
   > 1       wrong      correct
     1       tie        correct
     0       correct    correct
    -1       correct    tie
  < -1       correct    wrong

Table 1: Classification behaviour under cross-validation, for ties in kernel membership, as a function of differences in sample size. See text for further explanation.
3 Differences in Class Location
For uniformly distributed data, it is a simple matter to generalise the argument
just presented to the situation where there are differences in location between
two equiprobable classes. Let one class have uniformly distributed data lying
in the range (0, 1) and the other have uniform data in the range (s, 1 + s). Thus
s is the separation distance between the class means and the classes overlap
for a distance 1 - s on the data line. A Bayes' decision rule will give errors
only where the classes overlap in the data space. Thus, the Bayes' error rate,
e_b, is given simply by:

e_b = (1 - s) / 2    (13)
In this situation, the k-NN classification rule should approximate to the Bayes' rate, provided that k is small compared with the sample sizes. In SAS, under cross-validation, this is the case only for odd values of k. For even values of k, the effect of kernel ties in the overlap region produces an optimistic bias in E(e_cv). Just as before, this is clearly evident:

E(e_cv^(k)) = e_b (1 - C(k, k/2) / 2^k)    (for k even)    (14)
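Equation 14 is straightforward to evaluate numerically. The short Python sketch below (illustrative only, not from the paper) tabulates the biased expectation for even k; setting s = 0, so that e_b = 1/2, recovers the value of 1/4 for k = 2 noted earlier.

```python
from math import comb

def biased_ecv(k, s):
    """Expected SAS cross-validated error rate for even k (Equation 14)."""
    assert k % 2 == 0
    e_b = (1 - s) / 2                     # Bayes' rate for the uniform case (Eq. 13)
    return e_b * (1 - comb(k, k // 2) / 2 ** k)

for k in (2, 4, 6):
    print(k, biased_ecv(k, 0.0))          # 0.25, 0.3125, 0.34375
```

Note that the bias shrinks as k grows, because the probability of an exact tie, C(k, k/2) / 2^k, decreases with k.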
4 Unequal Class Probabilities
Any attempt to extend the argument just presented to the situation involving
unequal class probabilities immediately runs into a complicating factor in the
performance of the nearest-neighbour algorithm. This arises from the fact that
the k-NN classification rule is optimal only for the special case of equal class
probabilities. Cover & Hart (1967) proved that, for any number of classes,
the single nearest-neighbour decision rule produces an error rate, e~), which
is bounded below by the Bayes' error rate and above by twice the Bayes' rate:
(15)
This is easily illustrated with the sort of discrimination problem under consideration in this paper. For a two-class problem with the sample sizes proportional to the prior probabilities, then provided that the data are smoothly distributed, the following error rate analysis is applicable. Within the region of overlap, for an observation in class 1, the probability of a classification error is simply the probability that its nearest neighbour in the data space is in class 2, i.e. p(ω_2). Likewise, the probability of mis-classifying a case in class 2 is p(ω_1). Without loss of generality, let p(ω_1) be the smaller of the two class probabilities, i.e. p(ω_1) ≤ p(ω_2). Taking class separation into account, weighting the classes by their prior probabilities and writing p(ω_2) as 1 - p(ω_1) gives:

E(e_cv^(1)) = 2(1 - s) p(ω_1)(1 - p(ω_1))    (16)
By contrast, the Bayes' error rate is given by the simple decision rule of classifying according to the most frequent class. Hence:

e_b = (1 - s) p(ω_1)    (17)
Clearly, E(e_cv^(1)) = e_b when p(ω_1) = 1/2 and tends to the value 2e_b as p(ω_1) approaches zero.
Thus, it is easier to deal with the general values of k without reference to the Bayes' error rate. For the general k-NN rule, within the region of overlap, the expected error rates for odd values and even values of k can be obtained from binomial expansions, as shown below. To simplify the notation, let the prior probability of one of the classes be p. Then, by the same sort of argument used earlier, odd values of k give:

e_o = p Σ_{i=0}^{(k-1)/2} C(k, i) p^i (1 - p)^{k-i} + (1 - p) Σ_{i=(k+1)/2}^{k} C(k, i) p^i (1 - p)^{k-i}    (18)

For even values of k, two quantities can be defined:

e'_e = p Σ_{i=0}^{k/2-1} C(k, i) p^i (1 - p)^{k-i} + (1 - p) Σ_{i=k/2+1}^{k} C(k, i) p^i (1 - p)^{k-i}    (19)

and

e_e = e'_e + (1/2) C(k, k/2) p^{k/2} (1 - p)^{k/2}    (20)

Thus the true expected cross-validated error rates are:

E(e_cv) = (1 - s) e_o    (for k odd)    (21)

and

E(e_cv) = (1 - s) e_e    (for k even)    (22)

However, because of the way that kernel ties are evaluated in SAS:

E(e_cv) = (1 - s) e'_e    (for k even, in SAS)    (23)

Thus, for even values of k, an optimistic bias in the cross-validated error rate is noticeable as:

(1 - s)(e_e - e'_e) = ((1 - s)/2) C(k, k/2) p^{k/2} (1 - p)^{k/2}    (24)
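The binomial expressions in Equations 18 to 20 translate directly into code. This Python sketch (the function names are mine, not the paper's) computes e_o, e'_e and e_e; at p = 1/2 it recovers e_o = e_e = 1/2 and, for k = 2, e'_e = 1/4, in agreement with the earlier analysis.

```python
from math import comb

def binom_pmf(k, i, p):
    """P(exactly i of the k kernel neighbours belong to the class with prior p)."""
    return comb(k, i) * p ** i * (1 - p) ** (k - i)

def e_odd(k, p):
    """Equation 18: expected overlap error rate for odd k."""
    return (p * sum(binom_pmf(k, i, p) for i in range(0, (k - 1) // 2 + 1))
            + (1 - p) * sum(binom_pmf(k, i, p) for i in range((k + 1) // 2, k + 1)))

def e_even_sas(k, p):
    """Equation 19: even k with ties resolved in favour of the true class."""
    return (p * sum(binom_pmf(k, i, p) for i in range(0, k // 2))
            + (1 - p) * sum(binom_pmf(k, i, p) for i in range(k // 2 + 1, k + 1)))

def e_even(k, p):
    """Equation 20: even k with each tie contributing half an error."""
    return e_even_sas(k, p) + 0.5 * binom_pmf(k, k // 2, p)

print(e_odd(3, 0.5), e_even(2, 0.5), e_even_sas(2, 0.5))
```

The difference e_even(k, p) - e_even_sas(k, p) is exactly the bias term of Equation 24 before the (1 - s) weighting.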
5 The Normal Error Distribution
So far, the analysis in this paper of cross-validation of the k-NN algorithm in SAS has examined its behaviour in dealing with the uniform error distribution only. This distribution was chosen because of the simplicity that it lends to the theoretical analysis. This simplicity arises from the fact that the range boundaries allow the data space to be partitioned into regions which differ strongly in class-specific density. Also, within each region, the class-specific densities are necessarily constant.
If we turn to examining behaviour with other distributions, such as the normal, then theoretical analysis becomes more difficult because of the fact that the class-specific densities change continuously with the value of x. Furthermore, the normal curve cannot be integrated analytically, which adds to the problem. For this reason, it was thought preferable to examine the behaviour of the algorithm with normally-distributed data by means of Monte Carlo simulation.
Twelve thousand observations were drawn from the normal (0, 1) distribution and two binary class membership indicators were simulated independently. One had exactly equal numbers of cases in each class, while the other was arranged to have exactly two-thirds of the cases in the most frequent class. Two situations were examined. In one, x was uncorrelated with class. In the other, a new variable, x1, was derived from x, simply by adding the binary class membership indicator, thereby introducing a class separation distance of unity between the respective class means. Thus four conditions were arranged, as follows:
• EQRAND: equiprobable classes and no class separation;
• EQDIST: equiprobable classes and unit class separation;
• NEQRAND: class membership odds of 2 : 1 and no class separation;
• NEQDIST: class membership odds of 2 : 1 and unit class separation.
For each condition, the classification performance of the k-NN discriminant algorithm in SAS was estimated with the cross-validation option. Prior probabilities were estimated from the data. Each condition was tested using values of k from 1 to 12. For comparison purposes, Fisher's linear discriminant analysis was also applied to the same data.
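A scaled-down version of the EQRAND and EQDIST conditions can be simulated directly. The Python sketch below is an imitation of the mechanism, not of SAS, and uses 1,200 observations rather than the paper's 12,000 to keep the leave-one-out loop fast; the characteristic even-k optimism (including EQDIST error estimates below the Bayes' rate of about 0.31 for k = 2) still appears clearly.

```python
import numpy as np

def sas_style_cv_error(x, y, k, rng):
    """Leave-one-out k-NN error with the SAS-style n_i - 1 divisor for the
    deleted case's own class and priors proportional to sample size."""
    counts = np.bincount(y).astype(float)
    errors = 0
    for i in range(len(x)):
        d = np.abs(x - x[i])
        d[i] = np.inf                          # delete case i
        k_c = np.bincount(y[np.argsort(d)[:k]],
                          minlength=len(counts)).astype(float)
        div = counts.copy()
        div[y[i]] -= 1.0                       # biased divisor for the true class
        score = (counts / len(x)) * k_c / div  # prior times density estimate
        best = np.flatnonzero(score == score.max())
        errors += rng.choice(best) != y[i]
    return errors / len(x)

rng = np.random.default_rng(2)
n = 1200                                       # the paper used 12,000
y = np.repeat([0, 1], n // 2)                  # equiprobable classes
x = rng.normal(size=n)
results = {}
for k in (2, 3):
    results[k] = (sas_style_cv_error(x, y, k, rng),      # EQRAND: no separation
                  sas_style_cv_error(x + y, y, k, rng))  # EQDIST: unit separation
print(results)
```

The k = 2 estimates land near 0.25 (EQRAND) and 0.20 (EQDIST), while k = 3 gives the expected values near 0.50 and 0.36, mirroring the pattern of Table 2.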
                 Experimental Condition
   k     EQRAND    EQDIST    NEQRAND   NEQDIST
   1     0.4941    0.3901    0.4305    0.3546
   2     0.2572    0.2042    0.2233    0.1851
   3     0.4933    0.3591    0.4107    0.3223
   4     0.3172    0.2424    0.2727    0.2156
   5     0.4973    0.3481    0.3959    0.3088
   6     0.3513    0.2588    0.2978    0.2319
   7     0.4993    0.3402    0.3854    0.3007
   8     0.3695    0.2739    0.3052    0.2445
   9     0.4944    0.3345    0.3727    0.2983
  10     0.3781    0.2797    0.3120    0.2485
  11     0.5018    0.3357    0.3665    0.2967
  12     0.3923    0.2874    0.3163    0.2547
param.   0.4988    0.3051    0.3333    0.2788
Bayes    0.5000    0.3085    0.3333    0.2698

Table 2: Cross-validation estimates of error rates as a function of k and experimental condition, obtained in Monte Carlo simulation. The parametric cross-validation estimates for Fisher's linear discriminant analysis are also given, in addition to the Bayes error rate. See text for further explanation.
The resulting error rates are shown in Table 2. For each experimental condition, there is the same type of optimistic bias in E(e_cv), for even values of
k, as was deduced analytically for the uniform error distribution. It is obvious
that these estimates are impossibly optimistic because they are smaller than
the Bayes' rate, most noticeably for k = 2. These results are hardly surprising,
because the essence of the problem lies in the high frequency of kernel ties and
the way that they are evaluated in SAS and is not dependent on the particular
within-class error distribution.
6 Extension to Multiple Classes by Monte Carlo Simulation
If more than two classes are considered, the position becomes more complex because many different possibilities for ties emerge for majority kernel membership. For example, if c = 3 and k = 6, a three-way tie is possible, as well as three possible two-way ties. Also, ties emerge even when k is not a multiple of c. For example, with k = 6 and four classes, a kernel might have two observations from each of two classes and a single observation from each of the other two classes.
Thus, an exact analytic approach to examining the effect of ties for more than two classes is extremely tedious and not worth the trouble. For this reason, a simple Monte Carlo simulation was performed, as follows. Twelve thousand observations were drawn from a uniform distribution and class membership indicator variables for 2, 3, 4, 5 and 6 classes were simulated (independently of the continuous variable), so as to have exactly equal numbers of observations in each class. Thus five data sets were generated, all sharing the same independent variable, which conveyed no information about class membership. The classification performance of k-NN discriminant analysis was estimated using the DISCRIM procedure in SAS, with the cross-validation option. Each data set was tested using values of k from 1 to 12. The resulting error rates are shown in Table 3. The following points should be noted.
1. The error rates for c = 2 are as expected from Equation 11.
2. In all cases where k = 1, the error rates approximate closely to the
theoretical expected values.
3. In all situations where c > 2 and k > 1, the results show clearly that
the estimated error rates are substantially lower than the corresponding
theoretical expected values.
4. Resubstitution estimates of the error rates were also recorded. For k > 1, they confirmed exactly the relationship with the cross-validation estimates stated in Equation 12.
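The multi-class behaviour can likewise be checked by a small simulation (again an imitation of the mechanism, not SAS, and with a reduced sample size). For c = 3 equiprobable classes and uninformative data, k = 1 should give an error rate near (c - 1)/c = 2/3, while for k = 2 the tie bias pulls the estimate down to about 4/9, matching the c = 3, k = 2 entry of Table 3.

```python
import numpy as np

def sas_style_cv_error(x, y, k, c, rng):
    """Leave-one-out k-NN error with the SAS-style n_i - 1 divisor
    for the deleted case's own class (equiprobable classes)."""
    counts = np.bincount(y, minlength=c).astype(float)
    errors = 0
    for i in range(len(x)):
        d = np.abs(x - x[i])
        d[i] = np.inf                        # delete case i
        k_c = np.bincount(y[np.argsort(d)[:k]], minlength=c).astype(float)
        div = counts.copy()
        div[y[i]] -= 1.0                     # biased divisor for the true class
        score = k_c / div
        best = np.flatnonzero(score == score.max())
        errors += rng.choice(best) != y[i]
    return errors / len(x)

rng = np.random.default_rng(3)
c, n = 3, 1200                               # the paper used 12,000 observations
x = rng.random(n)                            # uniform, no class information
y = np.repeat(np.arange(c), n // c)          # exactly equal class sizes
err1 = sas_style_cv_error(x, y, 1, c, rng)   # near (c - 1)/c = 2/3
err2 = sas_style_cv_error(x, y, 2, c, rng)   # biased: near 4/9, well below 2/3
print(err1, err2)
```

The 4/9 figure for k = 2 follows from the kernel composition probabilities: the case is misclassified only when both neighbours share some other class (probability 2/9) or when the two neighbours tie between the two other classes (probability 2/9).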
              Number of Classes
   k       2        3        4        5        6
   1    0.4966   0.6675   0.7513   0.8004   0.8391
   2    0.2514   0.4464   0.5598   0.6394   0.7016
   3    0.5040   0.5134   0.5638   0.6123   0.6523
   4    0.3155   0.5821   0.6442   0.6566   0.6583
   5    0.4983   0.5381   0.6617   0.7035   0.7388
   6    0.3465   0.5486   0.6318   0.6954   0.7616
   7    0.4966   0.5964   0.6437   0.6852   0.7512
   8    0.3663   0.5618   0.6560   0.7023   0.7433
   9    0.4943   0.5729   0.6638   0.7166   0.7488
  10    0.3759   0.5998   0.6549   0.7227   0.7666
  11    0.5022   0.5798   0.6620   0.7221   0.7729
  12    0.3919   0.5838   0.6663   0.7225   0.7680
param.  0.4967   0.6611   0.7448   0.7973   0.8311
Bayes   0.5000   0.6667   0.7500   0.8000   0.8333

Table 3: Cross-validation estimates of error rates as a function of k and number of classes, obtained in Monte Carlo simulation. The parametric cross-validation estimates for Fisher's linear discriminant analysis are also given, in addition to the Bayes error rate. See text for further explanation.
7 A Possible Remedy
The basis of the problem lies in peculiarities of the density estimation procedure in the k-NN algorithm under cross-validation, compounded by the high frequency of kernel ties. However, it is possible to compensate for this by making adjustments to the estimates of the prior probabilities. Hence, one solution to the problem is to estimate the prior probabilities from the data after case deletion, rather than fix them from the outset as is done conventionally.² If this course of action is taken then, under cross-validation, if the deleted case belongs to class i, the prior probability for membership of class i is then estimated by:

p(ω_i) = (n_i - 1) / (n - 1)    (25)

²Note that this adaptation is proposed for the nonparametric algorithm only.
However, the prior probability for membership of any of the other classes, j, is given by:

p(ω_j) = n_j / (n - 1)    (26)

The corresponding class-specific densities at x are estimated as:

f_i(x) = k_i / (V(n_i - 1))    (27)

and

f_j(x) = k_j / (V n_j)    (28)

and the unconditional density is estimated by:

f(x) = k / ((n - 1) V)    (29)

Thus the corresponding estimated posterior probabilities become:

p(ω_i | x) = k_i / k    (30)

and

p(ω_j | x) = k_j / k    (31)
In these circumstances, if there is a tie for majority kernel membership involving the class to which the deleted case belongs, then there will also be a tie in the estimated posterior probabilities. It is proposed that a random classification choice is made between the tied classes in these cases. If this is done, then the resulting error rate will be essentially unbiased.
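The proposed remedy is simple to implement. In the Python sketch below (an illustrative implementation, not the paper's code), estimating the priors after case deletion makes the posterior for every class reduce to k_i / k (Equations 30 and 31), so classification becomes a plain majority vote over the kernel with ties broken at random. On uninformative data with two equiprobable classes, the cross-validated error rate for k = 2 then returns to the unbiased value of about 1/2.

```python
import numpy as np

def corrected_cv_error(x, y, k, rng):
    """Leave-one-out k-NN error under the proposed remedy: priors are
    estimated after case deletion (Equations 25-26), so every posterior
    reduces to k_i / k (Equations 30-31), and ties for majority kernel
    membership are broken by a random choice between the tied classes."""
    errors = 0
    for i in range(len(x)):
        d = np.abs(x - x[i])
        d[i] = np.inf                        # delete case i
        k_c = np.bincount(y[np.argsort(d)[:k]], minlength=y.max() + 1)
        best = np.flatnonzero(k_c == k_c.max())
        errors += rng.choice(best) != y[i]   # random choice between tied classes
    return errors / len(x)

rng = np.random.default_rng(4)
n = 1000
x = rng.random(n)                  # x carries no information about class
y = np.repeat([0, 1], n // 2)      # two equiprobable classes
err_k2 = corrected_cv_error(x, y, 2, rng)
print(err_k2)                      # near 1/2, the unbiased value
```

Comparing this with the SAS-style estimate of roughly 1/4 for the same configuration makes the size of the correction plain.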
Of course, the approach just described is appropriate only when the samples have been drawn so as to be representative of the populations that they are intended to represent. If the priors are non-proportional, then this approach needs to be modified. In this situation, the priors must be specified initially by the user. To begin with, this additional information is ignored and the computation is performed as previously specified, up to the point where the posterior probabilities are estimated. However, the posterior probabilities then need to be adjusted for the lack of proportionality before the assignment to classes is made. Thus, if π_i is the user-specified prior for class ω_i, then the adjustment factor required is:

r_i = π_i / p(ω_i)    (32)
The appropriately adjusted estimate of the posterior probability is then given
by Equation 8.
8 Conclusion
This is an interesting example of a problem occurring in statistical software which is caused, not by a computing error, but by a mathematical one. The misbehaviour of the k-NN algorithm under cross-validation is entirely deducible from the mathematics given in the SAS manual (SAS Institute, 1989). In this respect, the nature of the problem is similar to the one reported by White & Liu (1993), in which a stepwise discriminant algorithm is improperly cross-validated. In both cases, the respective problems arose because of lack of consideration of the effects of combining techniques. In the case of SAS, the k-NN algorithm works well enough when considered in isolation and so does the cross-validation technique. The difficulty arises when the two techniques are used in combination. Apart from SAS, nonparametric discriminant techniques are not available in the commonly used statistical software with which the author is familiar. Hence, problems caused by combining these two techniques do not seem to have been encountered elsewhere.
The solution offered in this paper keeps as close as possible to the original philosophies of both cross-validation and k-NN discriminant analysis. It involves estimating the prior probabilities from the data after the case deletion which forms part of the cross-validation procedure. The only possibly contentious aspect is the proposed use of random choice between predicted classes in the case of ties in posterior probabilities. One feature of this approach is that the procedure is non-repeatable. However, there are precedents for this type of procedure. Tocher (1950) proposed a modification to Fisher's exact probability test which utilised random choice in order to achieve specified α values for significance testing purposes. Also, the use of approximate randomization techniques for conducting significance tests has been described by Edgington (1980), Still & White (1981) and White & Still (1984, 1987). The proposal here is to make use of random choice to achieve an unbiased estimate of classification performance when kernel ties are encountered.
Acknowledgements
The author would like to thank Prof. J.B. Copas (from the Department of
Statistics at the University of Warwick) and Prof. A.J. Lawrance and Dr. P.
Davies (both from the School of Mathematics and Statistics, at the University
of Birmingham) for their helpful comments.
References
Cover, T.M. and Hart, P.E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, IT-13, 21-27.
Edgington, E.S. (1980). Randomization Tests, New York: Marcel Dekker.
Hand, D.J. (1981). Discrimination and Classification, New York: John Wiley & Sons Ltd.
SAS Institute Inc. (1989). SAS/STAT User's Guide, Version 6, Fourth Edition, Volume 1, Cary, NC, USA: SAS Institute Inc.
Still, A.W. and White, A.P. (1981). The approximate randomization test as an alternative to the F test in analysis of variance. British Journal of Mathematical and Statistical Psychology, 34, 243-252.
Tocher, K.D. (1950). Extension of the Neyman-Pearson theory of tests to discontinuous variates. Biometrika, 37, 130-144.
White, A.P. and Liu, W.Z. (1993). The jackknife with a stepwise discriminant algorithm - a warning to BMDP users. Journal of Applied Statistics, 20, (1), 187-190.
White, A.P. and Still, A.W. (1984). Monte Carlo analysis of variance. In Proceedings of the Sixth Symposium in Computational Statistics (Prague). Vienna: Physica-Verlag.
White, A.P. and Still, A.W. (1987). Monte Carlo randomization tests: A reply to Bradbury. British Journal of Mathematical and Statistical Psychology, 40, 188-191.