ClusterI
Transcript of ClusterI
-
8/8/2019 ClusterI
1/24
Cluster AnalysisWhat is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning MethodsHierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
-
8/8/2019 ClusterI
2/24
What is Cluster Analysis?Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters
Cluster analysis Grouping a set of data objects into clustersClustering is unsupervised classification : nopredefined classesTypical applications As a stand-alone tool to get insight into data
distribution As a preprocessing step for other algorithms
-
8/8/2019 ClusterI
3/24
General Applications of Clustering
Pattern RecognitionSpatial Data Analysis create thematic maps in GIS by clustering feature
spaces detect spatial clusters and explain them in spatial
data miningImage ProcessingEconomic Science (especially market research)WWW Document classification Cluster Weblog data to discover groups of similar
access patterns
-
8/8/2019 ClusterI
4/24
Examples of Clustering
ApplicationsMarketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketingprograms
Land use: Identification of areas of similar land use in an earthobservation database
Insurance: Identifying groups of motor insurance policy holders witha high average claim cost
City-planning: Identifying groups of houses according to their house
type, value, and geographical locationEarth-quake studies: Observed earth quake epicenters should beclustered along continent faults
-
8/8/2019 ClusterI
5/24
What Is Good Clustering?
A good clustering method will produce high qualityclusters with high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on both thesimilarity measure used by the method and itsimplementation.
The quality of a clustering method is also measured byits ability to discover some or all of the hidden patterns.
-
8/8/2019 ClusterI
6/24
Classification is more objective
Can compare various algorithms
x
xx
x
xx
x
x
x
xo
oo
oo
o
o
o
o o
o
o
ox
x
x
o
o
-
8/8/2019 ClusterI
7/24
Clustering is very subjective
Cluster the following animals: Sheep, lizard, cat, dog, sparrow, blue shark, viper, seagull, gold
fish, frog, red-mullet
1. By the way they bear their progeny
2. By the existence of lungs
3. By the environment that they live in
4. By the way the bear their progeny and the existence of their lungs
Which way is correct? Depends
-
8/8/2019 ClusterI
8/24
Requirements of Clustering in Data
MiningScalabilityAbility to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine inputparameters
Able to deal with noise and outliers
Insensitive to order of input records
High dimensionality
Incorporation of user-specified constraints
Interpretability and usability
-
8/8/2019 ClusterI
9/24
Cluster AnalysisWhat is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning MethodsHierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
-
8/8/2019 ClusterI
10/24
Distance measureDissimilarity/Similarity metric: Similarity is expressed in terms of a distancefunction, which is typically metric: d (i, j )Similarity measure: Small means close Ex: Sequence similarity
Distance = cost of insertion + 2*cost of changing C to A + cost of changing A to TDissimilarity measure: small means close
TATATAAGT -TCCA
TATATAAGGCTAAT
TATATAAGTTCCA TATATAAGGCTAAT
-
8/8/2019 ClusterI
11/24
Data Structures
Asymmetrical distance
Symmetrical distance
-
np x ...nf x ...n1 x ... ... ... ... ...
ip x ...if x ...i1 x
... ... ... ... ...1p x ...1f x ...11 x
-
0...)2,()1,(
:::
)2,3()
...d d
0d d(3,10d(2,1)
0
-
8/8/2019 ClusterI
12/24
Measure the Quality of Clustering
There is a separate quality function thatmeasures the goodness of a cluster.The definitions of distance functions are usuallyvery different for interval-scaled, boolean,categorical, ordinal and ratio variables.Weights should be associated with differentvariables based on applications and datasemantics.
It is hard to define similar enough or goodenough the answer is typically highly subjective.
-
8/8/2019 ClusterI
13/24
Type of data in clustering analysis
Interval-scaled variables:
Binary variables:
Nominal, ordinal, and ratio variables:
Variables of mixed types:
-
8/8/2019 ClusterI
14/24
Interval-valued variables
Standardize data
Calculate the mean absolute deviation:
where
Calculate the standardized measurement ( z-score )
Using mean absolute deviation is more robust than using
standard deviation
.)...211
nf f f f x x(xnm!
|)|...|||(|1 21f nf f f f f f
m xm xm xn
s !
f
f if
if s
m x z !
-
8/8/2019 ClusterI
15/24
Similarity and Dissimilarity Between
ObjectsDistances are normally used to measure the similarity or dissimilarity between two data objects
Some popular ones include: M ink ows ki di sta nce :
where i = ( x i1, x i2, , x ip) and j = ( x j1, x j2, , x jp) are two p -dimensional data objects, and q is a positive integer
If q = 1, d is Manhattan distance
qq
p p
qq
j x
i x
j x
i x
j x
i x jid )||...|||(|),(
2211!
||...||||),(2211 p p j xi x j xi x j xi x jid
!
-
8/8/2019 ClusterI
16/24
Similarity and Dissimilarity Between
Objects (Cont.)If q = 2 , d is Euclidean distance:
Properties
d(i,j) u 0d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) e d(i,k) + d(k,j)
Also, one can use weighted distance, parametricPearson product moment correlation, or other disimilaritymeasures
)||...|||(|),( 2222
2
11 p p j x
i x
j x
i x
j x
i x jid !
-
8/8/2019 ClusterI
17/24
Binary Variables
A contingency table for binary data
Simple matching coefficient (invariant, if the binary variable is
symmetr i c ):
Jaccard coefficient (noninvariant if the binary variable is
asymmetr i c ):
d cbacb jid !),(
pd bca sum
d cd c
baba
sum
0
1
01
cbacb jid !),(
Object i
Object j
-
8/8/2019 ClusterI
18/24
18
Dissimilarity between Binary
VariablesExample
gender is a symmetric attribute the remaining attributes are asymmetric binary
let the values Y and P be set to 1, and the value N be set to 0
Name Gender Fever Cough Test- 1 Test- 2 Test-3 Test-4Jack M Y N P N N NMary F Y N P N P NJim M Y P N N N N
75.0211
21),(
67.0111
11),(
33.0102
10),(
!!
!!
!!
mary ji md
ji m j ack d
mary j ack d
-
8/8/2019 ClusterI
19/24
Nominal Variables
A generalization of the binary variable in that it can takemore than 2 states, e.g., red, yellow, blue, green
Method 1: Simple matching m : # of matches, p : total # of variables
Method 2: use a large number of binary variables creating a new binary variable for each of the M nominal states
pm p jid !),(
-
8/8/2019 ClusterI
20/24
Ordinal Variables
An ordinal variable can be discrete or continuous
Order is important, e.g., rank
Can be treated like interval-scaled replace x i f by their rank
map the range of each variable onto [ 0 , 1] by replacing i -th objectin the f -th variable by
compute the dissimilarity using methods for interval-scaledvariables
1
1!
f
if if
M
r z
},...,1{ f if M r
-
8/8/2019 ClusterI
21/24
Ratio-Scaled Variables
Ratio-scaled variable: a positive measurement on anonlinear scale, approximately at exponential scale,
such as A e Bt or A e -Bt
Methods: treat them like interval-scaled variables not a g oo d cho i ce!
(why?the scale can be distorted)
apply logarithmic transformation
y i f = l og(x i f ) treat them as continuous ordinal data treat their rank as interval-
scaled
-
8/8/2019 ClusterI
22/24
Variables of MixedTypes
A database may contain all the six types of variables symmetric binary, asymmetric binary, nominal, ordinal, interval
and ratio
One may use a weighted formula to combine their effects
f is binary or nominal:d ij(f) = 0 if xif = x jf , or d ij(f) = 1 o.w.
f is interval-based: use the normalized distance f is ordinal or ratio-scaled
compute ranks r if andand treat z if as interval-scaled
)(1
)()(1),(
f i j
p f
f i j
f i j
p f d jid
H
H
!
!
7
7!
1
1!
f
if
M r z if
-
8/8/2019 ClusterI
23/24
Cluster AnalysisWhat is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning MethodsHierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
-
8/8/2019 ClusterI
24/24
Major Clustering Approaches
Partitioning algorithms: Construct various partitions and then
evaluate them by some criterion
Hierarchy algorithms: Create a hierarchical decomposition of the set
of data (or objects) using some criterionDensity-based: based on connectivity and density functions
Grid-based: based on a multiple-level granularity structure
Model-based: A model is hypothesized for each of the clusters and
the idea is to find the best fit of that model to each other