week 05 - Clustering
Clustering
Clustering
What is Clustering?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
What is Clustering?
Clustering of data is a method by which large sets of data are grouped into clusters of smaller sets of similar data.
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Clustering is unsupervised classification: no predefined classes
What is Clustering?
Typical applications:
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
Use cluster detection when you suspect that there are natural groupings that may represent groups of customers or products that have a lot in common.
When there are many competing patterns in the data, making it hard to spot a single pattern, creating clusters of similar records reduces the complexity within clusters so that other data mining techniques are more likely to succeed.
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
Land use: Identification of areas of similar land use in an earth observation database
Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
City-planning: Identifying groups of houses according to their house type, value, and geographical location
Earthquake studies: Observed earthquake epicenters should be clustered along continent faults
Clustering definition
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
data points in one cluster are more similar to one another (high intra-class similarity)
data points in separate clusters are less similar to one another (low inter-class similarity)
Similarity measures: e.g. Euclidean distance if attributes are continuous.
Requirements of Clustering in Data Mining
Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Able to deal with noise and outliers
Insensitive to order of input records
High dimensionality
Incorporation of user-specified constraints
Interpretability and usability
[Figure: http://webscripts.softpedia.com/screenshots/Efficient-K-Means-Clustering-using-JIT_1.png]
[Figure: http://api.ning.com/files/uI4 osegkS5tF JjFYZai3mGuslDuBQ1rFsozaAaDw9IBdc99OjNas3FPKIrdgPXAz34DU0KsbZwl7G8tM5-n4DXTk6Fab/clustering.gif]
[Figure: http://wiki.na-mic.org/Wiki/index.php/Progress_Report:DTI_Clustering - project aiming at developing tools in the 3D Slicer for automatic clustering of tractographic paths through diffusion tensor MRI (DTI) data, to 'characterize the strength of connectivity between selected regions in the brain']
Notion of a Cluster is Ambiguous
[Figure: the same set of initial points grouped as two clusters, four clusters, and six clusters.]
Clustering
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Data Matrix
Represents n objects with p variables (attributes, measures)
A relational table:

    x11 ... x1f ... x1p
    ...
    xi1 ... xif ... xip
    ...
    xn1 ... xnf ... xnp
Dissimilarity Matrix
Proximities of pairs of objects
d(i,j): dissimilarity between objects i and j
Nonnegative
Close to 0: similar

    0
    d(2,1)  0
    d(3,1)  d(3,2)  0
    ...     ...          ...
    d(n,1)  d(n,2)  ...  0
Types of data in clustering analysis
Continuous variables
Binary variables
Nominal and ordinal variables
Variables of mixed types
Continuous variables
To avoid dependence on the choice of measurement units, the data should be standardized.

Standardize data:
Calculate the mean absolute deviation:

    s_f = (1/n) (|x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f|)

where

    m_f = (1/n) (x_1f + x_2f + ... + x_nf)

Calculate the standardized measurement (z-score):

    z_if = (x_if - m_f) / s_f

Using the mean absolute deviation is more robust than using the standard deviation. Since the deviations are not squared, the effect of outliers is somewhat reduced, but their z-scores do not become too small; therefore, the outliers remain detectable.
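As a sketch of the standardization above (my own illustration in Python; the lecture's own code is R, and the sample values here are hypothetical):

```python
# Standardization with the mean absolute deviation, as defined above.
def standardize(values):
    """Return z-scores z_if = (x_if - m_f) / s_f for one continuous variable."""
    n = len(values)
    m = sum(values) / n                          # mean m_f
    s = sum(abs(x - m) for x in values) / n      # mean absolute deviation s_f
    return [(x - m) / s for x in values]

# Hypothetical measurements in centimetres; rescaling the units does not
# change the z-scores, which is the point of standardizing.
print(standardize([150.0, 160.0, 170.0, 180.0]))  # [-1.5, -0.5, 0.5, 1.5]
```

The same data expressed in metres yields the same z-scores, so a distance computed on standardized variables no longer depends on the measurement units.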
Similarity/Dissimilarity Between Objects
Distances are normally used to measure the similarity or dissimilarity between two data objects.
Euclidean distance is probably the most commonly chosen type of distance. It is the geometric distance in the multidimensional space:

    d(i,j) = sqrt( sum_{k=1..p} (x_ik - x_jk)^2 )

Required properties for a distance function:
d(i,j) ≥ 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) ≤ d(i,k) + d(k,j)
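A minimal sketch (my own illustration, not from the slides) of the Euclidean distance together with a spot check of the four required properties on three hypothetical points:

```python
import math

def euclidean(x, y):
    """d(i,j) = sqrt(sum_k (x_ik - x_jk)^2) over the p dimensions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

i, j, k = (0.0, 0.0), (3.0, 4.0), (6.0, 8.0)
print(euclidean(i, j))                                        # 5.0
assert euclidean(i, j) >= 0                                   # d(i,j) >= 0
assert euclidean(i, i) == 0                                   # d(i,i) = 0
assert euclidean(i, j) == euclidean(j, i)                     # d(i,j) = d(j,i)
assert euclidean(i, j) <= euclidean(i, k) + euclidean(k, j)   # triangle inequality
```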
Similarity/Dissimilarity Between Objects
City-block (Manhattan) distance. This distance is simply the sum of differences across dimensions. In most cases, this distance measure yields results similar to the Euclidean distance. However, note that in this measure the effect of single large differences (outliers) is dampened (since they are not squared).

    d(i,j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|

The properties stated for the Euclidean distance also hold for this measure.
Manhattan distance = distance if you had to travel along coordinates only.

Example: x = (5,5), y = (9,8)
Euclidean: dist(x,y) = √(4² + 3²) = 5
Manhattan: dist(x,y) = 4 + 3 = 7
Similarity/Dissimilarity Between Objects
Minkowski distance. Sometimes one may want to increase or decrease the progressive weight that is placed on dimensions on which the respective objects are very different. This measure makes it possible to accomplish that and is computed as:

    d(i,j) = ( |x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q )^(1/q)
Similarity/Dissimilarity Between Objects
Weighted distances
If we have some idea of the relative importance that should be assigned to each variable, then we can weight them and obtain a weighted distance measure:

    d(i,j) = sqrt( w_1 (x_i1 - x_j1)^2 + ... + w_p (x_ip - x_jp)^2 )
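The Minkowski and weighted distances above can be sketched as follows (my own illustration; q = 1 recovers the Manhattan distance and q = 2 the Euclidean distance, checked here on the slide's x = (5,5), y = (9,8) example):

```python
import math

def minkowski(x, y, q):
    """d(i,j) = (sum_k |x_ik - x_jk|^q)^(1/q)."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def weighted_euclidean(x, y, w):
    """d(i,j) = sqrt(sum_k w_k (x_ik - x_jk)^2)."""
    return math.sqrt(sum(wf * (a - b) ** 2 for wf, a, b in zip(w, x, y)))

x, y = (5.0, 5.0), (9.0, 8.0)                  # the worked example above
print(minkowski(x, y, 1))                      # 7.0  (Manhattan)
print(minkowski(x, y, 2))                      # 5.0  (Euclidean)
print(weighted_euclidean(x, y, (1.0, 1.0)))    # 5.0  (equal weights)
```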
Binary Variables
A binary variable has only two states: 0 or 1.
A binary variable is symmetric if both of its states are equally valuable, that is, there is no preference on which outcome should be coded as 1.
A binary variable is asymmetric if the outcomes of the states are not equally important, such as positive or negative outcomes of a disease test.
Similarity that is based on symmetric binary variables is called invariant similarity.
Binary Variables
A contingency table for binary data:

                  Object j
                  1     0    sum
    Object i  1   a     b    a+b
              0   c     d    c+d
            sum  a+c   b+d    p
Binary Variables
Symmetric binary dissimilarity:

    d(i,j) = (b + c) / (a + b + c + d)

Jaccard coefficient (asymmetric binary dissimilarity):

    d(i,j) = (b + c) / (a + b + c)
Dissimilarity between Binary Variables - Example

    Name  Gender Fever Cough Test-1 Test-2 Test-3 Test-4
    Jack    M     Y     N     P      N      N      N
    Mary    F     Y     N     P      N      P      N
    Jim     M     Y     P     N      N      N      N

gender is a symmetric attribute
the remaining attributes are asymmetric binary
let the values Y and P be set to 1, and the value N be set to 0

Jaccard coefficient (for the asymmetric variables):

    d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
    d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
    d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
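The three values above can be reproduced with a small helper (my own sketch, not code from the slides; Y/P coded as 1, N as 0):

```python
def jaccard_dissimilarity(x, y):
    """(b + c) / (a + b + c) over asymmetric binary attributes."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)  # both 1
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)  # i is 1, j is 0
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)  # i is 0, j is 1
    return (b + c) / (a + b + c)

# Fever, Cough, Test-1 ... Test-4 for each patient, from the table above.
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(jaccard_dissimilarity(jack, mary), 2))  # 0.33
print(round(jaccard_dissimilarity(jack, jim), 2))   # 0.67
print(round(jaccard_dissimilarity(jim, mary), 2))   # 0.75
```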
Nominal Variables
A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green.
Method 1: simple matching
m: # of matches, p: total # of variables

    d(i,j) = (p - m) / p

Method 2: use a large number of binary variables
creating a new binary variable for each of the M nominal states
Ordinal Variables
On ordinal variables order is important, e.g. Gold, Silver, Bronze.
Can be treated like continuous:
the ordered states define the ranking 1, ..., M_f
replace x_if by its rank r_if ∈ {1, ..., M_f}
map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

    z_if = (r_if - 1) / (M_f - 1)

compute the dissimilarity using methods for continuous variables
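The rank mapping above is short enough to sketch directly (my own illustration, using the Gold/Silver/Bronze example):

```python
def ordinal_to_unit_interval(value, ordered_states):
    """Map an ordinal state to [0, 1] via z_if = (r_if - 1) / (M_f - 1)."""
    r = ordered_states.index(value) + 1   # rank r_if in {1, ..., M_f}
    m = len(ordered_states)               # M_f distinct ordered states
    return (r - 1) / (m - 1)

medals = ["Bronze", "Silver", "Gold"]     # ordered from lowest to highest
print(ordinal_to_unit_interval("Bronze", medals))  # 0.0
print(ordinal_to_unit_interval("Silver", medals))  # 0.5
print(ordinal_to_unit_interval("Gold", medals))    # 1.0
```

After this mapping, the distance functions for continuous variables apply unchanged.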
Variables of Mixed Types
A database may contain several/all types of variables: continuous, symmetric binary, asymmetric binary, nominal and ordinal.
One may use a weighted formula to combine their effects:

    d(i,j) = ( sum_{f=1..p} δ_ij^(f) d_ij^(f) ) / ( sum_{f=1..p} δ_ij^(f) )

δ_ij^(f) = 0 if x_if is missing, or if x_if = x_jf = 0 and the variable f is asymmetric binary
δ_ij^(f) = 1 otherwise
continuous and ordinal variables: d_ij^(f) is the normalized absolute distance
binary and nominal variables: d_ij^(f) = 0 if x_if = x_jf; otherwise d_ij^(f) = 1
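One reasonable implementation of this combined formula looks like the sketch below (my own assumption about the details; the variable descriptions and the range used for normalization are hypothetical and would normally come from the data):

```python
def mixed_dissimilarity(x, y, var_types, ranges):
    """d(i,j) = sum_f delta_f * d_f / sum_f delta_f over mixed-type variables."""
    num, den = 0.0, 0.0
    for f, (xf, yf) in enumerate(zip(x, y)):
        if xf is None or yf is None:                    # missing value: delta = 0
            continue
        if var_types[f] == "asymmetric_binary" and xf == 0 and yf == 0:
            continue                                    # a 0/0 match carries no information
        if var_types[f] == "continuous":
            d = abs(xf - yf) / ranges[f]                # normalized absolute distance
        else:                                           # binary or nominal
            d = 0.0 if xf == yf else 1.0
        num += d
        den += 1.0
    return num / den

x = (180.0, 1, "red")
y = (120.0, 0, "red")
types = ("continuous", "asymmetric_binary", "nominal")
print(round(mixed_dissimilarity(x, y, types, ranges={0: 300.0}), 2))  # 0.4
```

Here the continuous variable contributes 60/300 = 0.2, the asymmetric binary mismatch contributes 1, the matching nominal value contributes 0, and the average over the three usable variables is 0.4.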
Clustering
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Major Clustering Approaches
Partitioning algorithms: Construct various partitions and then evaluate them by some criterion
Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion
Density-based: Based on connectivity and density functions. Able to find clusters of arbitrary shape. Continues growing a cluster as long as the density of points in the neighborhood exceeds a specified limit.
Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of that model to the data
Clustering
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Partitioning Algorithms: Basic Concept
Partitioning method: Construct a partition of a database D of n objects into a set of k clusters
Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
Global optimal: exhaustively enumerate all partitions
Heuristic methods: k-means and k-medoids algorithms
k-means: Each cluster is represented by the center of the cluster
k-medoids or PAM (Partition Around Medoids): Each cluster is represented by one of the objects in the cluster
The K-Means Clustering Method
Given k, the k-means algorithm is implemented in 4 steps:
1. Partition objects into k nonempty subsets
2. Compute the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
3. Assign each object to the cluster with the nearest centroid.
4. Go back to Step 2; stop when no new assignments occur.
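The four steps above can be sketched as a short loop (my own illustration with a naive first-k initialization and hypothetical 2-D points; production code would use R's kmeans as shown later, or an equivalent library):

```python
import math

def kmeans(points, k):
    """The 4-step k-means loop from the slide (naive, first-k seeding)."""
    centroids = list(points[:k])                  # step 1: initial representatives
    while True:
        # step 3: assign each object to the cluster with the nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # step 2: recompute each centroid as the mean point of its cluster
        new_centroids = [
            tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else centroids[c]
            for c, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:            # step 4: stop when nothing changes
            return centroids, clusters
        centroids = new_centroids

# Two well-separated hypothetical groups of three points each.
pts = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, clusters = kmeans(pts, k=2)
print(sorted(len(c) for c in clusters))           # [3, 3]
```

Even though both seeds start inside the same group here, the recomputed centroids drift apart and the loop settles on the two natural groups, which illustrates steps 2-4 in action.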
K-means clustering (k=3)
Comments on the K-Means Method
Strengths & Weaknesses
Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
Often terminates at a local optimum
Applicable only when mean is defined
Need to specify k, the number of clusters, in advance
Sensitive to noise and outliers as a small number of such points can influence the mean value
Not suitable to discover clusters with non-convex shapes
Importance of Choosing Initial Centroids
[Figure: six scatter plots (Iteration 1 through Iteration 6, x vs. y) showing how the k-means clusters evolve from one choice of initial centroids.]
Importance of Choosing Initial Centroids
[Figure: five scatter plots (Iteration 1 through Iteration 5, x vs. y) showing how the k-means clusters evolve from a different choice of initial centroids.]
Getting k Right
Try different k, looking at the change in the average distance to centroid as k increases.
The average falls rapidly until the right k, then changes little.
[Figure: average distance to centroid vs. k; the best value of k is where the curve flattens.]
Example
file: http://orion.math.iastate.edu/burkardt/data/hartigan/file06.txt

    Name                Energy Protein Fat Calcium Iron
    Braised beef           340      20  28       9  2.6
    Hamburger              245      21  17       9  2.7
    Roast beef             420      15  39       7  2
    Beefsteak              375      19  32       9  2.6
    Canned beef            180      22  10      17  3.7
    Broiled chicken        115      20   3       8  1.4
    Canned chicken         170      25   7      12  1.5
    Beef heart             160      26   5      14  5.9
    Roast lamb leg         265      20  20       9  2.6
    Roast lamb shoulder    300      18  25       9  2.3
    Smoked ham             340      20  28       9  2.5
    Pork roast             340      19  29       9  2.5
    Pork simmered          355      19  30       9  2.4
    Beef tongue            205      18  14       7  2.5
    Veal cutlet            185      23   9       9  2.7
    Baked bluefish         135      22   4      25  0.6
    Raw clams               70      11   1      82  6
    Canned clams            45       7   1      74  5.4
    Canned crabmeat         90      14   2      38  0.8
    Fried haddock          135      16   5      15  0.5
    Broiled mackerel       200      19  13       5  1
    Canned mackerel        155      16   9     157  1.8
    Fried perch            195      16  11      14  1.3
    Canned salmon          120      17   5     159  0.7
    Canned sardines        180      22   9     367  2.5
    Canned tuna            170      25   7       7  1.2
    Canned shrimp          110      23   1      98  2.6

R instructions:

    > data <- read.table("nutrients_levels.txt", header = TRUE, sep = "\t")
Example (cont.)

    > pairs(data[,2:6])

[Figure: scatterplot matrix of the five nutrient variables.]
Example (cont.)
k-means in R:

    > library(cluster)
    > cl2 <- kmeans(data[,2:6], centers=2)
    > cl2
    K-means clustering with 2 clusters of sizes 18, 9
    Cluster means:
        Energy Protein       Fat   Calcium     Iron
    1 145.5556      19  6.444444 61.555556 2.338889
    2 331.1111      19 27.555556  8.777778 2.466667
    Clustering vector:
     [1] 2 2 2 2 1 1 1 1 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    Within cluster sum of squares by cluster:
    [1] 178738.40  23751.03
    Available components:
    [1] "cluster"  "centers"  "withinss"  "size"
Example (cont.)

    > cl3 <- kmeans(data[,2:6], centers=3)
    ...
    > cl8 <- kmeans(data[,2:6], centers=8)
    > erros <- c(mean(cl2$withinss), mean(cl3$withinss), mean(cl4$withinss),
                 mean(cl5$withinss), mean(cl6$withinss), mean(cl7$withinss),
                 mean(cl8$withinss))
    > plot(erros)
    > plot(data[,2:6], col = cl4$cluster)
    > data_w_cl4 <- cbind(data, cl4$cluster)
    > data_w_cl4

    Name                Energy Protein Fat Calcium Iron cl4$cluster
    Braised beef           340      20  28       9  2.6           1
    Roast beef             420      15  39       7  2             1
    Beefsteak              375      19  32       9  2.6           1
    Smoked ham             340      20  28       9  2.5           1
    Pork roast             340      19  29       9  2.5           1
    Pork simmered          355      19  30       9  2.4           1
    Hamburger              245      21  17       9  2.7           2
    Roast lamb leg         265      20  20       9  2.6           2
    Roast lamb shoulder    300      18  25       9  2.3           2
    Raw clams               70      11   1      82  6             3
    Canned clams            45       7   1      74  5.4           3
    Canned mackerel        155      16   9     157  1.8           3
    Canned salmon          120      17   5     159  0.7           3
    Canned sardines        180      22   9     367  2.5           3
    Canned shrimp          110      23   1      98  2.6           3
    Canned beef            180      22  10      17  3.7           4
    Broiled chicken        115      20   3       8  1.4           4
    Canned chicken         170      25   7      12  1.5           4
    Beef heart             160      26   5      14  5.9           4
    Beef tongue            205      18  14       7  2.5           4
    Veal cutlet            185      23   9       9  2.7           4
    Baked bluefish         135      22   4      25  0.6           4
    Canned crabmeat         90      14   2      38  0.8           4
    Fried haddock          135      16   5      15  0.5           4
    Broiled mackerel       200      19  13       5  1             4
    Fried perch            195      16  11      14  1.3           4
    Canned tuna            170      25   7       7  1.2           4
    > data_w_cl3 <- cbind(data, cl3$cluster)
    > data_w_cl3

    Name                Energy Protein Fat Calcium Iron cl3$cluster
    Braised beef           340      20  28       9  2.6           1
    Roast beef             420      15  39       7  2             1
    Beefsteak              375      19  32       9  2.6           1
    Roast lamb leg         265      20  20       9  2.6           1
    Roast lamb shoulder    300      18  25       9  2.3           1
    Smoked ham             340      20  28       9  2.5           1
    Pork roast             340      19  29       9  2.5           1
    Pork simmered          355      19  30       9  2.4           1
    Hamburger              245      21  17       9  2.7           2
    Canned beef            180      22  10      17  3.7           2
    Broiled chicken        115      20   3       8  1.4           2
    Canned chicken         170      25   7      12  1.5           2
    Beef heart             160      26   5      14  5.9           2
    Beef tongue            205      18  14       7  2.5           2
    Veal cutlet            185      23   9       9  2.7           2
    Baked bluefish         135      22   4      25  0.6           2
    Canned crabmeat         90      14   2      38  0.8           2
    Fried haddock          135      16   5      15  0.5           2
    Broiled mackerel       200      19  13       5  1             2
    Fried perch            195      16  11      14  1.3           2
    Canned tuna            170      25   7       7  1.2           2
    Raw clams               70      11   1      82  6             3
    Canned clams            45       7   1      74  5.4           3
    Canned mackerel        155      16   9     157  1.8           3
    Canned salmon          120      17   5     159  0.7           3
    Canned sardines        180      22   9     367  2.5           3
    Canned shrimp          110      23   1      98  2.6           3
Davies-Bouldin Validity Index
R package 'clusterSim' - index.DB

Let Ci be a cluster of vectors, Xj a vector in Ci, and Ai the centroid of Ci. Let Si be a measure of scatter within the cluster:

    S_i = ( (1/T_i) sum_{Xj in Ci} |Xj - Ai|^q )^(1/q)

where T_i is the number of vectors in Ci.

Let M_i,j be a measure of separation between clusters Ci and Cj:

    M_i,j = ( sum_{k=1..N} |a_k,i - a_k,j|^p )^(1/p)

where a_k,i is the k-th component of the centroid Ai and the sum runs over the N components. Define:

    R_i,j = (S_i + S_j) / M_i,j
    R_i = max_{j ≠ i} R_i,j

    DB = (1/N) sum_{i=1..N} R_i    (N = number of clusters)

DB should be small for good clustering.
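The definitions above translate directly into code; the following is my own sketch with q = p = 2 and hypothetical clusters (in R one would simply call clusterSim's index.DB):

```python
import math

def centroid(cluster):
    return tuple(sum(v) / len(cluster) for v in zip(*cluster))

def scatter(cluster, center):
    """S_i with q = 2: root mean squared distance to the centroid."""
    return math.sqrt(sum(math.dist(x, center) ** 2 for x in cluster) / len(cluster))

def davies_bouldin(clusters):
    centers = [centroid(c) for c in clusters]
    scatters = [scatter(c, a) for c, a in zip(clusters, centers)]
    n = len(clusters)
    total = 0.0
    for i in range(n):
        # R_i = max over j != i of (S_i + S_j) / M_i,j, with M_i,j Euclidean (p = 2)
        total += max(
            (scatters[i] + scatters[j]) / math.dist(centers[i], centers[j])
            for j in range(n) if j != i
        )
    return total / n

tight = [[(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)], [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]]
loose = [[(0.0, 0.0), (0.0, 5.0), (5.0, 0.0)], [(10.0, 10.0), (10.0, 15.0), (15.0, 10.0)]]
print(davies_bouldin(tight) < davies_bouldin(loose))  # True: smaller DB is better
```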
Example
[Figure: two example clusters c1 and c2 with their centroids a_C1 and a_C2.]

With q = p = 2:

    S_1 = 4.57
    S_2 = 5.59
    M_1,2 = 17.89
    R_1,2 = R_2,1 = (4.57 + 5.59) / 17.89 = 0.568
    DB = (1/2)(0.568 + 0.568) = 0.568
Silhouette Validation Method
R package 'clusterSim' - index.S

For each point i, let a(i) be the average dissimilarity of i with all other points within the same cluster.
Any measure of dissimilarity can be used (e.g. Euclidean distance).
Then, find the average dissimilarity of i with the data in another single cluster. Repeat this for every other cluster.
b(i) is the lowest such average dissimilarity of i to any other cluster.
We can now define for each point i:

    s(i) = (b(i) - a(i)) / max{a(i), b(i)}

The average s(i) of a cluster is a measure of how tightly grouped the data in the cluster is.
Therefore, the average s(i) over the entire dataset is a measure of how appropriately the data has been clustered.
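Following the definition above, a(i), b(i) and s(i) can be computed in a few lines (my own sketch on two hypothetical clusters; R users would call cluster::silhouette or the corresponding clusterSim index):

```python
import math

def silhouette_scores(clusters):
    """Return s(i) = (b(i) - a(i)) / max{a(i), b(i)} for every point."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for point in cluster:
            # a(i): average dissimilarity to the other points of its own cluster
            a = sum(math.dist(point, q) for q in cluster if q is not point) / (len(cluster) - 1)
            # b(i): lowest average dissimilarity to any other single cluster
            b = min(
                sum(math.dist(point, q) for q in other) / len(other)
                for cj, other in enumerate(clusters) if cj != ci
            )
            scores.append((b - a) / max(a, b))
    return scores

clusters = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 10.0), (10.0, 11.0)]]
avg = sum(silhouette_scores(clusters)) / 4
print(avg > 0.9)  # True: well-separated clusters give s(i) close to 1
```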
Hartigan Index
R package 'clusterSim' - index.H

Assume a dataset with N samples X1, ..., XN, each sample with M dimensions.
For k clusters the overall fitness can be expressed as the squared error over all samples, where d is the distance between sample Xj and the centroid Xci of its cluster:

    err(k) = sum_{i=1..k} sum_{j in Ci} d(Xj, Xci)^2

The Hartigan index H(k) is defined as follows:

    H(k) = (N - k - 1) (err(k) - err(k+1)) / err(k+1)

The multiplier correction term (N - k - 1) is a penalty factor for a large number of cluster partitions. The optimal number of clusters k is the one that maximizes H(k).
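The two formulas above can be exercised on a tiny hypothetical dataset (my own sketch; err(k) is computed here from fixed clusterings rather than full k-means runs, just to show the arithmetic; in R one would call clusterSim's index.H):

```python
def err(clusters):
    """Sum of squared distances of each sample to its cluster centroid."""
    total = 0.0
    for cluster in clusters:
        center = [sum(v) / len(cluster) for v in zip(*cluster)]
        total += sum(sum((a - b) ** 2 for a, b in zip(x, center)) for x in cluster)
    return total

def hartigan(err_k, err_k1, n_samples, k):
    """H(k) = (N - k - 1) * (err(k) - err(k+1)) / err(k+1)."""
    return (n_samples - k - 1) * (err_k - err_k1) / err_k1

data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
one = [data]                                    # k = 1: everything together
two = [[data[0], data[1]], [data[2], data[3]]]  # k = 2: the natural split
print(hartigan(err(one), err(two), len(data), 1))  # 400.0
```

The large H(1) reflects the sharp drop in squared error from err(1) = 201 to err(2) = 1, signaling that moving from one cluster to two is worthwhile.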
Some links
HARTIGAN - Data for Clustering Algorithms
http://orion.math.iastate.edu/burkardt/data/hartigan/hartigan.html
R packages
cluster - http://cran.r-project.org/web/packages/cluster/index.html
clusterSim - http://cran.r-project.org/web/packages/clusterSim/index.html