
Learning Vector Quantization and K-Nearest Neighbor

Jia Li
Department of Statistics
The Pennsylvania State University
Email: [email protected]
http://www.stat.psu.edu/~jiali


Learning Vector Quantization

- Developed by Kohonen. A package with documentation is available at http://www.cis.hut.fi/nnrc/nnrc-programs.html.

- When positioning the prototypes, LVQ uses the information given by class labels, in contrast to k-means, which selects prototypes without using class labels.

- It often works better than k-means.

- The idea is to move a prototype toward training samples of its own class and away from samples of other classes.


The Algorithm

1. Start from a set of initial prototypes with classes assigned. Denote the M prototypes by Z = {z1, ..., zM} and their associated classes by C(zm), m = 1, 2, ..., M.

   - The initial prototypes can be provided by k-means.

2. Sweep through the training samples and update zm after visiting each sample.

   - Suppose xi is assigned to the mth prototype zm by the nearest-neighbor rule: ‖xi − zm‖ ≤ ‖xi − zm′‖, ∀m′ ≠ m, 1 ≤ m′ ≤ M.

   - If gi = C(zm), move zm towards the training sample: zm ← zm + ε(xi − zm), where ε is the learning rate.

   - If gi ≠ C(zm), move zm away from the training sample: zm ← zm − ε(xi − zm).

3. Step 2 can be repeated a number of times (a code sketch of one pass follows).
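A minimal sketch of one LVQ pass, assuming NumPy arrays; the function and variable names (lvq_sweep, proto_labels, etc.) are illustrative and not taken from the Kohonen package:

```python
import numpy as np

def lvq_sweep(X, y, prototypes, proto_labels, eps=0.1):
    """One LVQ pass: visit each training sample and nudge its nearest prototype.

    X: (n, p) training samples; y: (n,) class labels.
    prototypes: (M, p) initial prototypes (e.g., from k-means);
    proto_labels: (M,) classes assigned to the prototypes.
    """
    Z = np.asarray(prototypes, dtype=float).copy()
    for xi, gi in zip(X, y):
        # Nearest-prototype rule: find m minimizing ||xi - zm||.
        m = np.argmin(np.linalg.norm(Z - xi, axis=1))
        if proto_labels[m] == gi:
            Z[m] += eps * (xi - Z[m])   # move toward a same-class sample
        else:
            Z[m] -= eps * (xi - Z[m])   # move away from an other-class sample
    return Z
```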


Experiments

- Use the diabetes data set.

- Use prototypes obtained by k-means as the initial prototypes.

- Use LVQ with ε = 0.1 (a usage sketch follows this list).

- Results obtained after 1, 2, and 5 passes are shown below.

- Classification is not guaranteed to improve after adjusting the prototypes.

- One pass with a small ε usually helps, but don't overdo it.
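As a hypothetical usage of the lvq_sweep sketch above (kmeans_prototypes, proto_labels, X_train, and y_train are placeholders for the diabetes data and its k-means prototypes):

```python
# Placeholder setup: prototypes and their labels would come from k-means on the training data.
Z = kmeans_prototypes.copy()
for _ in range(5):   # run 1, 2, or 5 passes over the training samples
    Z = lvq_sweep(X_train, y_train, Z, proto_labels, eps=0.1)
```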

Comments:

- Fine-tuning often helps:

  - Select initial prototypes.
  - Adjust the learning rate ε.

- Read the package documentation for details.


[Figure omitted. Error rates: 27.74% and 27.61%.]


[Figure omitted. Error rates: 27.86% and 32.37%.]


K-Nearest Neighbor Classifiers

- Given a query point x0, find the k training samples x(r), r = 1, ..., k, closest in distance to x0, and then classify by majority vote among the k neighbors (a code sketch follows this list).

- Feature normalization is often performed in pre-processing.

- Classification boundaries become smoother as k increases.
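A minimal k-nearest-neighbor sketch in NumPy, assuming features have already been normalized; the names knn_classify, X_train, and y_train are illustrative:

```python
import numpy as np
from collections import Counter

def knn_classify(x0, X_train, y_train, k=5):
    """Classify query x0 by majority vote among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x0, axis=1)   # Euclidean distances to all samples
    neighbors = np.argsort(dists)[:k]              # indices of the k closest samples
    return Counter(np.asarray(y_train)[neighbors]).most_common(1)[0][0]
```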


A Comparative Study (ElemStatLearn)

- Two simulated problems.

- There are 10 independent features Xj, each uniformly distributed on [0, 1].

- The two-class 0-1 target variable is defined as follows (a data-generation sketch follows the definitions):

  problem 1: “easy”

      Y = I(X1 > 1/2);

  problem 2: “difficult”

      Y = I( sign( ∏_{j=1}^{3} (Xj − 1/2) ) > 0 ).
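A sketch of the data generation for the two problems, assuming NumPy; the function name simulate is illustrative:

```python
import numpy as np

def simulate(n, problem="easy", p=10, rng=np.random.default_rng(0)):
    """n samples with p independent features, each uniform on [0, 1]."""
    X = rng.uniform(0.0, 1.0, size=(n, p))
    if problem == "easy":
        y = (X[:, 0] > 0.5).astype(int)                                  # Y = I(X1 > 1/2)
    else:
        y = (np.sign(np.prod(X[:, :3] - 0.5, axis=1)) > 0).astype(int)   # "difficult"
    return X, y

# One run of the study: 100 training and 1000 test samples.
X_train, y_train = simulate(100, "difficult")
X_test, y_test = simulate(1000, "difficult")
```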


- For problem 1, all features except X1 are “noise”; for problem 2, all features except X1, X2, and X3 are “noise”.

- The Bayes error rates are zero.

- In each run, 100 samples are used for training and 1000 for testing.


- The figure shows the mean and standard deviation of the misclassification error for nearest neighbors, K-means, and LVQ over 10 realizations (10 simulated data sets), as the tuning parameters change.

- K-means and LVQ give almost identical results.

- For the first problem, K-means and LVQ outperform nearest neighbors, assuming the best choice of tuning parameters for each. For the second problem, they perform similarly.

- The optimal k for k-nearest-neighbor classification differs significantly between the two problems.


Adaptive Nearest-Neighbor Methods

- When the dimension is high, data become relatively sparse.

- Implicit in nearest-neighbor classification is the assumption that the class probabilities are roughly constant in the neighborhood, and hence simple averages give good estimates.

- In a high-dimensional space, the neighborhood represented by the few nearest samples may not be local.

- Consider N data points uniformly distributed in the unit cube [−1/2, 1/2]^p. Let R be the radius of a 1-nearest-neighborhood centered at the origin. Then

      median(R) = v_p^{−1/p} (1 − (1/2)^{1/N})^{1/p},

  where v_p r^p is the volume of the sphere of radius r in p dimensions.

- The median radius quickly approaches 0.5, the distance to the edge of the cube, as the dimension increases (a numerical check follows this list).
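A quick numerical check of this formula (a sketch; the helper name median_radius is illustrative):

```python
from math import gamma, pi

def median_radius(p, N):
    """median(R) = v_p^(-1/p) * (1 - (1/2)^(1/N))^(1/p),
    where v_p = pi^(p/2) / Gamma(p/2 + 1) is the volume of the unit-radius sphere in p dimensions."""
    v_p = pi ** (p / 2) / gamma(p / 2 + 1)
    return v_p ** (-1.0 / p) * (1.0 - 0.5 ** (1.0 / N)) ** (1.0 / p)

# The median radius grows rapidly with p, quickly reaching the order of 0.5,
# the distance to the edge of the cube.
for p in (1, 2, 5, 10, 20):
    print(p, round(median_radius(p, N=500), 3))
```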


- Adjust the distance metric locally, so that the resulting neighborhoods stretch out in directions for which the class probabilities don't change much.


Discriminant Adaptive Nearest-Neighbor (DANN)

- At each query point, a neighborhood of, say, 50 points is formed.

- Class probabilities are NOT assumed constant in this neighborhood. The neighborhood is used only to decide how to define the adapted metric.

- After the metric is decided, a normal k-nearest-neighbor rule is applied to classify the query.

- The metric changes with the query.


- The DANN metric at a query point x0 is defined by

      D(x, x0) = (x − x0)^T Σ (x − x0),

  where

      Σ = W^{−1/2} [(W^{−1/2})^T B W^{−1/2} + εI] (W^{−1/2})^T
        = W^{−1/2} [B* + εI] (W^{−1/2})^T.

- W is the pooled within-class covariance matrix and B is the between-class covariance matrix (a code sketch follows).
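A sketch of computing Σ and the resulting distance, assuming NumPy and using the symmetric square root of W (so W^{−1/2} equals its transpose); the names dann_metric and dann_distance are illustrative, and the default ε is only a placeholder:

```python
import numpy as np

def dann_metric(W, B, eps=1.0):
    """Sigma = W^{-1/2} [W^{-1/2} B W^{-1/2} + eps*I] W^{-1/2}.
    W: pooled within-class covariance in the query's neighborhood (e.g., ~50 points);
    B: between-class covariance in that neighborhood."""
    vals, vecs = np.linalg.eigh(W)                              # eigen-decomposition of W
    W_inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T   # symmetric inverse square root
    B_star = W_inv_sqrt @ B @ W_inv_sqrt
    return W_inv_sqrt @ (B_star + eps * np.eye(W.shape[0])) @ W_inv_sqrt

def dann_distance(x, x0, Sigma):
    d = x - x0
    return d @ Sigma @ d   # D(x, x0) = (x - x0)^T Sigma (x - x0)
```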


Intuition

- We compute W and B as in LDA.

- Recall that we did a similar computation when deriving discriminant coordinates.

- Eigen-decomposition of B*: B* = V* D_B V*^T. Then

      Σ = W^{−1/2} [B* + εI] (W^{−1/2})^T
        = W^{−1/2} [V* D_B V*^T + εI] (W^{−1/2})^T
        = W^{−1/2} V* (D_B + εI) (W^{−1/2} V*)^T.

- Note that the column vectors of W^{−1/2} V* are simply the discriminant coordinates.

- εI is added to avoid using samples far away from the query point.


Geometric interpretation:

- To compute the DANN metric, x − x0 is projected onto the discriminant coordinates.

- The projection values on the significant discriminant coordinates are magnified; those on the insignificant DCs are shrunk.

- The implication is that the neighborhood is stretched in the directions of the insignificant DCs and squeezed in those of the significant ones.

- (W^{−1/2})^T is introduced to sphere the data so that the within-class covariance matrix is the identity matrix.


- Significant discriminant coordinates represent directions in which class probabilities change substantially. Hence, when we form a neighborhood, we want it to have a small span in these directions. Conversely, we want the neighborhood to have a large span in directions in which class probabilities don't change much.

- To summarize: we want to form a neighborhood that contains as many samples as possible but at the same time has approximately constant class probabilities.


Global Dimension Reduction

- At each training sample xi, the between-centroids sum-of-squares matrix Bi is computed, using a neighborhood of, say, 50 points.

- Average Bi over all training samples:

      B̄ = (1/N) ∑_{i=1}^{N} Bi.


- Let v1, v2, ..., vp be the eigenvectors of the matrix B̄, ordered from largest to smallest eigenvalues θl. Then a rank-L (L < p) approximation to B̄ is

      B̄[L] = ∑_{l=1}^{L} θl vl vl^T.

  B̄[L] is optimal in the sense of solving the least-squares problem

      min_{rank(M)=L} trace[(B̄ − M)^2],

  and hence solves (see the sketch below)

      min_{rank(M)=L} ∑_{i=1}^{N} trace[(Bi − M)^2].
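A sketch of the rank-L approximation to B̄ with NumPy; the function name rank_L_approx is illustrative:

```python
import numpy as np

def rank_L_approx(B_bar, L):
    """Best rank-L approximation to B_bar, keeping the top-L eigen-pairs:
    B_bar[L] = sum_{l=1}^{L} theta_l v_l v_l^T."""
    vals, vecs = np.linalg.eigh(B_bar)    # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:L]    # indices of the L largest eigenvalues
    theta, V = vals[order], vecs[:, order]
    return V @ np.diag(theta) @ V.T, V    # the approximation and the L leading directions

# B_bar is the average of the local between-centroids matrices Bi;
# the columns of V span the reduced subspace used for global dimension reduction.
```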
