Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna
![Page 1: Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna](https://reader036.fdocuments.in/reader036/viewer/2022062323/56815946550346895dc681c6/html5/thumbnails/1.jpg)
Pairwise Nearest Neighbor Method Revisited
Parittainen yhdistelymenetelmä uudistettuna
UNIVERSITY OF JOENSUU DEPARTMENT OF COMPUTER SCIENCE JOENSUU, FINLAND
Olli Virmajoki
11.12.2004
![Page 2: Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna](https://reader036.fdocuments.in/reader036/viewer/2022062323/56815946550346895dc681c6/html5/thumbnails/2.jpg)
Clustering
An important combinatorial optimization problem that must often be solved as part of more complicated tasks in data analysis, pattern recognition, data mining, and other fields of science and engineering
Entails partitioning a data set so that similar objects are grouped together and dissimilar objects are placed in separate groups
![Page 3: Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna](https://reader036.fdocuments.in/reader036/viewer/2022062323/56815946550346895dc681c6/html5/thumbnails/3.jpg)
Example of data sets
Employment statistics:

| ID | Postal zone | Self-employed | Civil servants | Clerks | Manual workers |
|---|---|---|---|---|---|
| 800 | Munchen | 56750 | 57218 | 300201 | 242375 |
| 801 | Munchen-land ost | 7684 | 5790 | 20279 | 23491 |
| 802 | Munchen-land sued | 3780 | 1977 | 11058 | 7398 |
| 803 | Munchen-land west | 7226 | 5623 | 25571 | 20380 |
| 804 | Munchen-land nord | 2226 | 1305 | 9347 | 12432 |
| 805 | Freising | 8187 | 5140 | 14632 | 24377 |
| 806 | Dachau | 8165 | 2763 | 11638 | 24489 |
| 807 | Ingolstadt | 5810 | 5212 | 15019 | 30532 |

RGB data:

| R | G | B |
|---|---|---|
| 26 | 20 | 45 |
| 28 | 5 | 46 |
| 28 | 12 | 44 |
| 23 | 13 | 46 |
| 31 | 4 | 51 |
![Page 4: Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna](https://reader036.fdocuments.in/reader036/viewer/2022062323/56815946550346895dc681c6/html5/thumbnails/4.jpg)
Summary of data sets

| Data set | Type of data set | Number of data vectors (N) | Number of clusters (M) | Dimension of data vector |
|---|---|---|---|---|
| Bridge | Gray-scale | 4096 | 256 | 16 |
| House | RGB | 34112 | 256 | 3 |
| Miss America | Residual vectors | 6480 | 256 | 16 |
| S1-S4 | Synthetic | 5000 | 15 | 2 |
![Page 5: Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna](https://reader036.fdocuments.in/reader036/viewer/2022062323/56815946550346895dc681c6/html5/thumbnails/5.jpg)
Data sets
![Page 6: Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna](https://reader036.fdocuments.in/reader036/viewer/2022062323/56815946550346895dc681c6/html5/thumbnails/6.jpg)
An example of clustering
![Page 7: Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna](https://reader036.fdocuments.in/reader036/viewer/2022062323/56815946550346895dc681c6/html5/thumbnails/7.jpg)
Clustering
Given a set of N data vectors $X = \{x_1, x_2, \ldots, x_N\}$ in K-dimensional space, clustering aims at solving the partition $P = \{p_1, p_2, \ldots, p_N\}$, which defines for each data vector the index of the cluster to which it is assigned.
Cluster: $s_a = \{x_i \mid p_i = a\}$
Clustering: $S = \{s_1, s_2, \ldots, s_M\}$
Codebook: $C = \{c_1, c_2, \ldots, c_M\}$
Cost function:
$$f(P, C) = \frac{1}{N} \sum_{i=1}^{N} \lVert x_i - c_{p_i} \rVert^2$$
Combinatorial optimization problem
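The cost function f(P, C), the mean squared distance between each data vector and the code vector of its cluster, can be evaluated directly from the definitions. The following is a minimal sketch in plain Python; the function names and the toy data are illustrative assumptions, and indices are 0-based here whereas the slides count from 1.

```python
def squared_distance(x, c):
    """Squared Euclidean distance between two K-dimensional vectors."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def mse(X, P, C):
    """f(P, C) = (1/N) * sum_i ||x_i - c_{p_i}||^2."""
    return sum(squared_distance(x, C[p]) for x, p in zip(X, P)) / len(X)


# Toy example (hypothetical data): four 2-D vectors in two clusters.
X = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
P = [0, 0, 1, 1]                 # cluster index of each data vector
C = [(0.0, 1.0), (10.0, 1.0)]    # code vectors (cluster centroids)
cost = mse(X, P, C)
```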
![Page 8: Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna](https://reader036.fdocuments.in/reader036/viewer/2022062323/56815946550346895dc681c6/html5/thumbnails/8.jpg)
Clustering algorithms
Heuristic methods
Optimization methods: K-means, genetic algorithms
Graph-theoretical methods
Hierarchical methods: divisive, agglomerative (merging)
![Page 9: Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna](https://reader036.fdocuments.in/reader036/viewer/2022062323/56815946550346895dc681c6/html5/thumbnails/9.jpg)
Agglomerative clustering
N = 22 (number of data points)
M = 3 (number of final clusters)
![Page 10: Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna](https://reader036.fdocuments.in/reader036/viewer/2022062323/56815946550346895dc681c6/html5/thumbnails/10.jpg)
Ward’s method (PNN in VQ)
Merge cost:
$$d_{a,b} = \frac{n_a n_b}{n_a + n_b} \lVert c_a - c_b \rVert^2$$
Local optimization strategy:
$$(a, b) = \arg\min_{i, j \in [1, N],\ i \neq j} d_{i,j}$$
Nearest neighbor search is needed for:
(1) finding the cluster pair to be merged
(2) updating the NN pointers
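The merge cost and the local optimization strategy can be sketched as a straightforward full-search PNN. Here n_a and n_b are taken to be the cluster sizes and c_a, c_b the centroids; the function names and the test data are assumptions made for illustration, and no speed-up is applied.

```python
def merge_cost(na, ca, nb, cb):
    """Ward's merge cost: n_a*n_b/(n_a+n_b) * ||c_a - c_b||^2."""
    dist2 = sum((u - v) ** 2 for u, v in zip(ca, cb))
    return na * nb / (na + nb) * dist2


def best_pair(sizes, cents):
    """Full search over all cluster pairs for the cheapest merge."""
    best = None
    for i in range(len(cents)):
        for j in range(i + 1, len(cents)):
            d = merge_cost(sizes[i], cents[i], sizes[j], cents[j])
            if best is None or d < best[0]:
                best = (d, i, j)
    return best


def pnn(X, M):
    """Start with one cluster per vector and merge until M clusters remain."""
    sizes = [1] * len(X)
    cents = [tuple(x) for x in X]
    while len(cents) > M:
        _, a, b = best_pair(sizes, cents)
        na, nb = sizes[a], sizes[b]
        # The merged centroid is the size-weighted mean of the two centroids.
        cents[a] = tuple((na * u + nb * v) / (na + nb)
                         for u, v in zip(cents[a], cents[b]))
        sizes[a] = na + nb
        del sizes[b], cents[b]
    return sizes, cents


sizes, cents = pnn([(0.0,), (1.0,), (10.0,), (11.0,), (12.0,)], 2)
```

Each of the N - M merge steps scans all pairs, which is what makes the unaccelerated method slow for large N.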
![Page 11: Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna](https://reader036.fdocuments.in/reader036/viewer/2022062323/56815946550346895dc681c6/html5/thumbnails/11.jpg)
The PNN method
M=5000, M=4999, M=4988, ..., M=50, ..., M=16, M=15
[Figure: clusterings at M=5000, M=50, M=16 and M=15]
![Page 12: Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna](https://reader036.fdocuments.in/reader036/viewer/2022062323/56815946550346895dc681c6/html5/thumbnails/12.jpg)
Nearest neighbor pointers
[Figure: nearest neighbor pointers between clusters a, b, c, d, e, f and g]
Fast exact PNN method: reduces the number of nearest neighbor searches in each iteration, O(N^3) → Ω(N^2)
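One way to use nearest neighbor pointers is sketched below: every cluster stores its cheapest merge partner, and after a merge only the merged cluster and the clusters that pointed at one of the two merged clusters are searched again. This relies on the known property of Ward's cost that a merged cluster is never cheaper to join than the cheaper of its two parts. The function names are hypothetical and the code is an illustration, not the author's implementation.

```python
def merge_cost(na, ca, nb, cb):
    dist2 = sum((u - v) ** 2 for u, v in zip(ca, cb))
    return na * nb / (na + nb) * dist2


def find_nn(i, sizes, cents, alive):
    """Full O(N) search for the cheapest merge partner of cluster i."""
    best_j, best_d = None, float("inf")
    for j in alive:
        if j != i:
            d = merge_cost(sizes[i], cents[i], sizes[j], cents[j])
            if d < best_d:
                best_j, best_d = j, d
    return best_j, best_d


def fast_pnn(X, M):
    sizes = {i: 1 for i in range(len(X))}
    cents = {i: tuple(x) for i, x in enumerate(X)}
    alive = set(sizes)
    # Initial NN pointers: one full search per cluster.
    nn = {i: find_nn(i, sizes, cents, alive) for i in alive}
    while len(alive) > M:
        # The pair to merge is the cluster with the cheapest pointer.
        a = min(alive, key=lambda i: nn[i][1])
        b = nn[a][0]
        na, nb = sizes[a], sizes[b]
        cents[a] = tuple((na * u + nb * v) / (na + nb)
                         for u, v in zip(cents[a], cents[b]))
        sizes[a] = na + nb
        alive.remove(b)
        del sizes[b], cents[b], nn[b]
        # Re-search only the merged cluster and the clusters whose
        # pointer referred to a or b; all other pointers stay valid.
        for i in alive:
            if i == a or nn[i][0] in (a, b):
                nn[i] = find_nn(i, sizes, cents, alive)
    order = sorted(alive)
    return [sizes[i] for i in order], [cents[i] for i in order]


sizes, cents = fast_pnn([(0.0,), (1.0,), (10.0,), (11.0,), (12.0,)], 2)
```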
![Page 13: Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna](https://reader036.fdocuments.in/reader036/viewer/2022062323/56815946550346895dc681c6/html5/thumbnails/13.jpg)
Combining the PNN and k-means
[Figure: codebook size (1, M, M0, N) for the standard PNN and the combined PNN; labels: random selection, k-means, standard PNN, combined PNN]
![Page 14: Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna](https://reader036.fdocuments.in/reader036/viewer/2022062323/56815946550346895dc681c6/html5/thumbnails/14.jpg)
PNN as a crossover method in the genetic algorithm
Two random codebooks (M=15)
Combined codebook (M=30) and final codebook (M=15)
[Figure panels: Initial 1, Initial 2, Combined (union), Result of PNN]
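The crossover idea, taking the union of two parent codebooks and reducing it back to M code vectors with PNN, can be sketched as follows. The representation of a codebook as (cluster size, centroid) pairs and the function names are assumptions for illustration; a real genetic algorithm would typically also combine the partitions of the parents, which this sketch leaves out.

```python
def merge_cost(na, ca, nb, cb):
    dist2 = sum((u - v) ** 2 for u, v in zip(ca, cb))
    return na * nb / (na + nb) * dist2


def pnn_reduce(sizes, cents, M):
    """Merge the cheapest pair repeatedly until M code vectors remain."""
    sizes, cents = list(sizes), [tuple(c) for c in cents]
    while len(cents) > M:
        _, a, b = min(
            (merge_cost(sizes[i], cents[i], sizes[j], cents[j]), i, j)
            for i in range(len(cents)) for j in range(i + 1, len(cents)))
        na, nb = sizes[a], sizes[b]
        cents[a] = tuple((na * u + nb * v) / (na + nb)
                         for u, v in zip(cents[a], cents[b]))
        sizes[a] = na + nb
        del sizes[b], cents[b]
    return sizes, cents


def pnn_crossover(parent1, parent2, M):
    """Union of two codebooks (lists of (size, centroid)), reduced to M."""
    union = parent1 + parent2
    return pnn_reduce([n for n, _ in union], [c for _, c in union], M)


# Hypothetical parents with M = 2 code vectors each.
parent1 = [(2, (0.0,)), (2, (10.0,))]
parent2 = [(2, (1.0,)), (2, (11.0,))]
sizes, cents = pnn_crossover(parent1, parent2, 2)
```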
![Page 15: Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna](https://reader036.fdocuments.in/reader036/viewer/2022062323/56815946550346895dc681c6/html5/thumbnails/15.jpg)
Publication 1: Speed-up methods
Partial distortion search (PDS)
Mean-distance-ordered search (MPS)
Uses the component means of the vectors
Derives a precondition for the distance calculations
Reduction of the run time to 2 to 15%
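Both speed-ups can be illustrated on an ordinary nearest code vector search. PDS abandons a distance calculation as soon as the partial sum exceeds the best distance found so far. The MPS precondition used here follows from the inequality K·(mean(x) - mean(c))² ≤ ||x - c||², so a candidate whose component mean is too far away can be skipped without computing the distance. This is a simplified sketch with assumed names: it applies the precondition only, without sorting the codebook by mean, and it uses the plain squared distance rather than the PNN merge cost.

```python
def nearest_pds_mps(x, codebook):
    """Index and squared distance of the code vector nearest to x."""
    K = len(x)
    mean_x = sum(x) / K
    best_i, best_d = None, float("inf")
    for i, c in enumerate(codebook):
        mean_c = sum(c) / K  # would be precomputed once in practice
        # MPS precondition: the mean difference gives a lower bound on
        # the distance, so this candidate cannot beat the current best.
        if K * (mean_x - mean_c) ** 2 >= best_d:
            continue
        d = 0.0
        for xi, ci in zip(x, c):
            d += (xi - ci) ** 2
            if d >= best_d:  # PDS: the partial sum is already too large
                break
        else:
            best_i, best_d = i, d  # full distance computed and smaller
    return best_i, best_d


codebook = [(0.0, 0.0), (5.0, 5.0), (1.0, 1.0), (9.0, 9.0)]
index, dist = nearest_pds_mps((1.2, 0.9), codebook)
```

Both tests only skip work that cannot change the result, so the search remains exact.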
![Page 16: Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna](https://reader036.fdocuments.in/reader036/viewer/2022062323/56815946550346895dc681c6/html5/thumbnails/16.jpg)
Example of the MPS method
[Figure: two panels with the points A, B, C and A', B', C'; the input vector and the best candidate are marked]
![Page 17: Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna](https://reader036.fdocuments.in/reader036/viewer/2022062323/56815946550346895dc681c6/html5/thumbnails/17.jpg)
Publication 2: Graph-based PNN
Based on the exact PNN method
NN search is limited to the k clusters that are connected by the graph structure
Reduces the time complexity of every search from O(N) to O(k)
Reduction in the run time to 1 to 4%
![Page 18: Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna](https://reader036.fdocuments.in/reader036/viewer/2022062323/56815946550346895dc681c6/html5/thumbnails/18.jpg)
Why graph structure ?
O(N) searches with the full search (N=4096)
Only O(k) searches with the graph structure! (k = 3)
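The difference between the two searches can be sketched with a k nearest neighbor graph over the clusters. In this illustration the graph is built by brute force, which is itself expensive, and the update of the graph after a merge is omitted; both are simplifying assumptions, as are the function names.

```python
def merge_cost(na, ca, nb, cb):
    dist2 = sum((u - v) ** 2 for u, v in zip(ca, cb))
    return na * nb / (na + nb) * dist2


def build_knn_graph(sizes, cents, k):
    """For every cluster, keep its k cheapest merge partners."""
    graph = {}
    for i in range(len(cents)):
        costs = sorted(
            (merge_cost(sizes[i], cents[i], sizes[j], cents[j]), j)
            for j in range(len(cents)) if j != i)
        graph[i] = [j for _, j in costs[:k]]
    return graph


def best_pair_in_graph(sizes, cents, graph):
    """Cheapest merge among graph edges: O(k) candidates per cluster."""
    best = (float("inf"), None, None)
    for i, neighbors in graph.items():
        for j in neighbors:
            d = merge_cost(sizes[i], cents[i], sizes[j], cents[j])
            if d < best[0]:
                best = (d, i, j)
    return best


cents = [(0.0,), (1.0,), (3.0,), (10.0,), (10.5,)]
sizes = [1] * len(cents)
graph = build_knn_graph(sizes, cents, k=2)
cost, a, b = best_pair_in_graph(sizes, cents, graph)
```

Before any merge the cheapest pair is always an edge of the graph, because every cluster's own nearest neighbor is among its k neighbors.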
![Page 19: Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna](https://reader036.fdocuments.in/reader036/viewer/2022062323/56815946550346895dc681c6/html5/thumbnails/19.jpg)
Sample graph
![Page 20: Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna](https://reader036.fdocuments.in/reader036/viewer/2022062323/56815946550346895dc681c6/html5/thumbnails/20.jpg)
Publication 3: Multilevel thresholding
Can be considered a special case of vector quantization (VQ), where the vectors are 1-dimensional
Existing method: O(N^2)
PNN thresholding can be implemented in O(N·logN)
The proposed method works in real time for any number of thresholds
![Page 21: Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna](https://reader036.fdocuments.in/reader036/viewer/2022062323/56815946550346895dc681c6/html5/thumbnails/21.jpg)
Distances in heap structure
[Figure: distances stored in a heap structure; labels: minimum distance, remove, update; O(1) and O(log N)]
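A heap-based PNN thresholding for the 1-dimensional case might look like the sketch below. It assumes that only adjacent clusters are merged, so that every cluster stays an interval of gray levels, and it handles outdated heap entries by lazy invalidation with per-cluster stamps. The function names, the histogram input, and the stamp mechanism are illustrative assumptions rather than details taken from the publication.

```python
import heapq


def merge_cost(na, ca, nb, cb):
    return na * nb / (na + nb) * (ca - cb) ** 2


def pnn_thresholds(histogram, M):
    """Return M-1 thresholds; each is the last gray level of a cluster."""
    levels = [g for g, h in enumerate(histogram) if h > 0]
    if not levels:
        return []
    n = [histogram[g] for g in levels]      # cluster sizes
    mean = [float(g) for g in levels]       # cluster centroids
    last = list(levels)                     # largest level in each cluster
    prev = list(range(-1, len(levels) - 1))
    nxt = list(range(1, len(levels) + 1))
    nxt[-1] = -1
    alive = [True] * len(levels)
    stamp = [0] * len(levels)               # bumped when a cluster changes
    heap = []

    def push(i):
        j = nxt[i]
        if j != -1:
            d = merge_cost(n[i], mean[i], n[j], mean[j])
            heapq.heappush(heap, (d, i, j, stamp[i], stamp[j]))

    for i in range(len(levels)):
        push(i)
    count = len(levels)
    while count > M and heap:
        _, i, j, si, sj = heapq.heappop(heap)   # O(log N) per removal
        if not (alive[i] and alive[j]) or stamp[i] != si or stamp[j] != sj:
            continue                            # outdated entry, skip it
        mean[i] = (n[i] * mean[i] + n[j] * mean[j]) / (n[i] + n[j])
        n[i] += n[j]
        last[i] = last[j]
        alive[j] = False
        nxt[i] = nxt[j]
        if nxt[i] != -1:
            prev[nxt[i]] = i
        stamp[i] += 1
        count -= 1
        push(i)                  # new cost to the right-hand neighbor
        if prev[i] != -1:
            push(prev[i])        # new cost from the left-hand neighbor
    return [last[i] for i in range(len(levels)) if alive[i] and nxt[i] != -1]


histogram = [5, 5, 0, 0, 0, 0, 4, 6, 0, 0, 0, 0, 0, 7, 3]
thresholds = pnn_thresholds(histogram, 3)
```

Each merge pushes at most two new entries, so the heap holds O(N) entries in total and the whole run takes O(N·logN) time.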
![Page 22: Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna](https://reader036.fdocuments.in/reader036/viewer/2022062323/56815946550346895dc681c6/html5/thumbnails/22.jpg)
Publication 4: Iterative shrinking (IS)
Generates the clustering by a sequence of cluster removal operations
In the IS method the vectors can be reassigned more freely than in the PNN method
Can be applied as a crossover method in the genetic algorithm (GAIS)
GAIS outperforms all other clustering algorithms in the comparison
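The cluster removal idea can be sketched as follows: in every step the cluster whose removal increases the distortion least is deleted, and each of its vectors moves to its nearest remaining cluster, so the vectors of one cluster may end up in several different clusters. The removal cost used here is a simplification that ignores the shift of the receiving centroids, and the function names are assumptions; it is not the exact cost function of the publication.

```python
def dist2(x, c):
    return sum((u - v) ** 2 for u, v in zip(x, c))


def centroid(vectors):
    return tuple(sum(col) / len(vectors) for col in zip(*vectors))


def removal_cost(a, clusters, cents):
    """Distortion increase if each vector of cluster a moves to its
    nearest other centroid (the centroids are treated as fixed)."""
    cost = 0.0
    for x in clusters[a]:
        nearest = min(dist2(x, cents[j])
                      for j in range(len(cents)) if j != a)
        cost += nearest - dist2(x, cents[a])
    return cost


def iterative_shrinking(X, M):
    clusters = [[tuple(x)] for x in X]
    cents = [tuple(x) for x in X]
    while len(clusters) > M:
        a = min(range(len(clusters)),
                key=lambda i: removal_cost(i, clusters, cents))
        orphans = clusters.pop(a)
        cents.pop(a)
        # Each vector is reassigned independently of the others.
        for x in orphans:
            j = min(range(len(cents)), key=lambda j: dist2(x, cents[j]))
            clusters[j].append(x)
        cents = [centroid(s) for s in clusters]
    return clusters, cents


clusters, cents = iterative_shrinking(
    [(0.0,), (1.0,), (10.0,), (11.0,), (12.0,)], 2)
```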
![Page 23: Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna](https://reader036.fdocuments.in/reader036/viewer/2022062323/56815946550346895dc681c6/html5/thumbnails/23.jpg)
Example of the PNN method
[Figure: clusters S1 to S5 before and after the cluster merge. Code vectors: vectors to be merged, remaining vectors. Data vectors: data vectors of the clusters to be merged, other data vectors]
![Page 24: Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna](https://reader036.fdocuments.in/reader036/viewer/2022062323/56815946550346895dc681c6/html5/thumbnails/24.jpg)
Example of the iterative shrinking method
[Figure: clusters S1 to S5 before and after the cluster removal. Code vectors: vector to be removed, remaining vectors. Data vectors: data vectors of the cluster to be removed, other data vectors]
![Page 25: Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna](https://reader036.fdocuments.in/reader036/viewer/2022062323/56815946550346895dc681c6/html5/thumbnails/25.jpg)
The PNN and IS in the search of the number of clusters
[Figure: data set S4. F-ratio (0.000080 to 0.000120) versus number of clusters (25 down to 5) for IS and PNN; the minimum is marked]
![Page 26: Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna](https://reader036.fdocuments.in/reader036/viewer/2022062323/56815946550346895dc681c6/html5/thumbnails/26.jpg)
Time-distortion performance
[Figure: MSE (160 to 190) versus time in seconds (logarithmic scale, up to 100000) for repeated K-means, RLS, GAIS, PNN, IS and SAGA]
![Page 27: Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna](https://reader036.fdocuments.in/reader036/viewer/2022062323/56815946550346895dc681c6/html5/thumbnails/27.jpg)
Publication 5: Optimal clustering
Can be found by considering all possible merge sequences and finding the one that minimizes the optimization function
Can be implemented as a branch-and-bound (BB) technique
Two suboptimal but polynomial-time variants: piecewise optimization and look-ahead optimization
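For very small data sets the search over merge sequences can be sketched as a depth-first branch-and-bound. The bound works because every merge can only increase the total squared error, so a partial sequence that already costs at least as much as the best complete one can be abandoned. The names below are assumptions, clusterings already visited are skipped with a simple set rather than with the non-redundant tree of the publication, and the running time is exponential.

```python
def sse(cluster, X):
    """Sum of squared distances of a cluster's vectors to their centroid."""
    pts = [X[i] for i in cluster]
    c = [sum(col) / len(pts) for col in zip(*pts)]
    return sum((u - v) ** 2 for p in pts for u, v in zip(p, c))


def optimal_clustering(X, M):
    best = {"cost": float("inf"), "clusters": None}
    seen = set()

    def search(clusters, cost):
        # Bound: merging never decreases the cost, so prune here.
        if cost >= best["cost"] or clusters in seen:
            return
        seen.add(clusters)
        if len(clusters) == M:
            best["cost"], best["clusters"] = cost, clusters
            return
        items = sorted(clusters, key=min)
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                merged = items[i] | items[j]
                # Cost increase caused by this merge.
                delta = (sse(merged, X) - sse(items[i], X)
                         - sse(items[j], X))
                search((clusters - {items[i], items[j]}) | {merged},
                       cost + delta)

    start = frozenset(frozenset([i]) for i in range(len(X)))
    search(start, 0.0)
    return best["cost"], best["clusters"]


X = [(0.0,), (1.0,), (10.0,), (11.0,), (12.0,)]
cost, clusters = optimal_clustering(X, 2)
```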
![Page 28: Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna](https://reader036.fdocuments.in/reader036/viewer/2022062323/56815946550346895dc681c6/html5/thumbnails/28.jpg)
Example of non-redundant search tree
[Figure: non-redundant search tree over the clusters A, B, C, D and E]
Branches that do not have any valid clustering have been cut out
![Page 29: Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna](https://reader036.fdocuments.in/reader036/viewer/2022062323/56815946550346895dc681c6/html5/thumbnails/29.jpg)
Illustration of the Piecewise optimization
[Figure: N clusters, N - Z clusters, N - 2Z clusters, N - 3Z clusters, ..., M clusters (final result), with Z merge steps between consecutive stages]
![Page 30: Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna](https://reader036.fdocuments.in/reader036/viewer/2022062323/56815946550346895dc681c6/html5/thumbnails/30.jpg)
Comparative results
[Figure: Bridge. MSE (160 to 180) versus running time in seconds (logarithmic scale, 1 to 100000). Methods: GAIS (short), GAIS (long), IS, PNN, Standard k-means, PNN+PDS+MPS+Lazy, Graph-PNN, Graph-PNN+K-means, K-means+PDS+MPS+Activity]
![Page 31: Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna](https://reader036.fdocuments.in/reader036/viewer/2022062323/56815946550346895dc681c6/html5/thumbnails/31.jpg)
Comparative results
[Figure: House. MSE (5.5 to 8.0) versus running time in seconds (logarithmic scale, 1 to 1000000). Methods: GAIS (long), GAIS (short), Standard k-means, PNN, IS, PNN+PDS+MPS+Lazy, Graph-PNN+K-means, Graph-PNN, K-means+PDS+MPS+Activity]
![Page 32: Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna](https://reader036.fdocuments.in/reader036/viewer/2022062323/56815946550346895dc681c6/html5/thumbnails/32.jpg)
Comparative results
[Figure: Miss America. MSE (5.0 to 6.0) versus running time in seconds (logarithmic scale, 1 to 1000000). Methods: Standard k-means, GAIS (long), GAIS (short), IS, PNN, PNN+PDS+MPS+Lazy, Graph-PNN, Graph-PNN+K-means, K-means+PDS+MPS+Activity]
![Page 33: Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna](https://reader036.fdocuments.in/reader036/viewer/2022062323/56815946550346895dc681c6/html5/thumbnails/33.jpg)
Example of clustering
[Figure panels: k-means, agglomerative clustering]
![Page 34: Pairwise Nearest Neighbor Method Revisited Parittainen yhdistelymenetelmä uudistettuna](https://reader036.fdocuments.in/reader036/viewer/2022062323/56815946550346895dc681c6/html5/thumbnails/34.jpg)
Conclusions
Several speed-up methods: projection-based search, partial distortion search, k nearest neighbor graph
Efficient O(N·logN) time implementation for the 1-dimensional case
Generalization of the merge phase by the cluster removal philosophy (IS) for better quality
Optimal clustering based on the PNN method