Posted on 18-Jan-2018
Compiled by: Raj Gaurang Tiwari, Assistant Professor, SRMGPC, Lucknow
Iterative Optimization and Cluster Validation
Exhaustive enumeration
Once a criterion function has been selected, clustering becomes a well-defined problem: find those partitions of the set of samples that extremize the criterion function. Since the sample set is finite, there are only a finite number of possible partitions; thus, in theory, the clustering problem can always be solved by exhaustive enumeration. However, the computational complexity renders such an approach unthinkable for all but the simplest problems: there are approximately c^n/c! ways of partitioning a set of n elements into c subsets. For example, an exhaustive search for the best set of 5 clusters in 100 samples would require considering more than 5^97 (≈ 10^67) partitions. Thus exhaustive search is completely infeasible.
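The size of this number is easy to verify numerically. A minimal sketch (the helper name approx_partitions is ours; c^n/c! is the approximation quoted above):

```python
import math

def approx_partitions(n: int, c: int) -> float:
    """Approximate number of partitions of n samples into c clusters: c**n / c!."""
    return c ** n / math.factorial(c)

# 5 clusters over 100 samples: on the order of 10^67 partitions
print(f"{approx_partitions(100, 5):.2e}")
```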
Iterative Optimization
The basic idea is to find some reasonable initial partition and to "move" samples from one group to another if such a move will improve the value of the criterion function.
Different starting points can lead to different solutions, and one never knows whether or not the best solution has been found.
Despite these limitations, the fact that the computational requirements are bearable makes this approach attractive.
Let us consider the use of iterative improvement to minimize the sum-of-squared-error criterion Je, written as

    Je = Σ_{i=1..c} Ji,   where Ji = Σ_{x ∈ Di} ||x − m_i||²

and m_i is the mean of the n_i samples in cluster Di. If a sample x̂ currently in cluster Di is tentatively moved to cluster Dj, the move decreases Je whenever

    (n_j / (n_j + 1)) ||x̂ − m_j||²  <  (n_i / (n_i − 1)) ||x̂ − m_i||²,

which typically happens whenever x̂ is closer to m_j than to m_i.
Basic iterative minimum-squared-error clustering
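The procedure can be sketched as follows. This is our reading of the single-sample transfer scheme described above (function and variable names are ours), not a reference implementation:

```python
import numpy as np

def iterative_mse_clustering(X, c, max_passes=100, seed=0):
    """Basic iterative minimum-squared-error clustering via single-sample moves."""
    rng = np.random.default_rng(seed)
    n = len(X)
    labels = rng.permutation(n) % c          # initial partition; every cluster non-empty
    means = np.array([X[labels == j].mean(axis=0) for j in range(c)])
    counts = np.array([(labels == j).sum() for j in range(c)])
    for _ in range(max_passes):
        moved = False
        for idx in range(n):
            i = labels[idx]
            if counts[i] <= 1:               # never empty a cluster
                continue
            x = X[idx]
            d2 = ((means - x) ** 2).sum(axis=1)
            rho = counts / (counts + 1) * d2              # cost of moving x into cluster j
            rho[i] = counts[i] / (counts[i] - 1) * d2[i]  # saving from removing x from i
            j = int(np.argmin(rho))
            if j != i:                        # the transfer strictly lowers Je
                means[i] = (counts[i] * means[i] - x) / (counts[i] - 1)
                means[j] = (counts[j] * means[j] + x) / (counts[j] + 1)
                counts[i] -= 1
                counts[j] += 1
                labels[idx] = j
                moved = True
        if not moved:                         # a full pass with no transfers: converged
            break
    return labels, means
```

Because each accepted move strictly decreases Je and there are finitely many partitions, the loop terminates at a local minimum, which is why different starting partitions can yield different answers.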
CS583, Bing Liu, UIC
Cluster Validation
The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how do we decide what constitutes a good clustering? The best criterion depends heavily on the final aim of the clustering. It is the user who must supply this criterion, in such a way that the result of the clustering will suit their needs. For instance, one could be interested in finding representatives for homogeneous groups (data reduction), in finding natural clusters and describing their unknown properties (natural data types), in finding useful and suitable groupings (useful data classes), or in finding unusual data objects (outlier detection). Whatever the intention of the clustering may be, the number of clusters sought is almost always an unknown quantity.
Cluster Validity
For cluster analysis, the analogous question is: how do we evaluate the "goodness" of the resulting clusters? But "clusters are in the eye of the beholder"! Why, then, do we want to evaluate them?
- To avoid finding patterns in noise
- To compare clustering algorithms
- To compare two sets of clusters
- To compare two clusters
The Problem of Validity
When clustering is done by extremizing a criterion function, a common approach is to repeat the clustering procedure for c = 1, c = 2, c = 3, etc., and to see how the criterion function changes with c. For example, the sum-of-squared-error criterion Je must decrease monotonically with c, since the squared error can be reduced each time c is increased. If the n samples are really grouped into ĉ compact, well-separated clusters, one would expect Je to decrease rapidly until c = ĉ, and to decrease much more slowly thereafter until it reaches zero at c = n.
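This Je-versus-c behavior is the familiar "elbow" heuristic, and it is easy to observe on synthetic data. A minimal sketch using a plain Lloyd's k-means as the criterion-minimizing procedure (the data, seeds, and function names here are illustrative assumptions, not from the text):

```python
import numpy as np

def sum_squared_error(X, labels, c):
    """Je: sum over clusters of squared distances to the cluster mean."""
    return sum(((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum()
               for j in range(c) if (labels == j).any())

def kmeans_je(X, c, iters=50, seed=0):
    """Je obtained by a plain Lloyd's k-means run with c clusters."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=c, replace=False)]
    for _ in range(iters):
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(c):                      # recompute non-empty cluster means
            if (labels == j).any():
                means[j] = X[labels == j].mean(axis=0)
    return sum_squared_error(X, labels, c)

# Two well-separated blobs: Je should drop sharply from c = 1 to c = 2,
# then decrease only slowly for larger c (the "elbow" at ĉ = 2).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(8, 0.5, (30, 2))])
jes = [kmeans_je(X, c) for c in range(1, 6)]
```

Plotting jes against c makes the knee at c = ĉ visible; past it, further clusters only shave off within-cluster noise.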