week 05 - Clustering
Clustering
Clustering
What is Clustering?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
What is Clustering?
Clustering of data is a method by which large sets of data are grouped into clusters of smaller sets of similar data.
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Clustering is unsupervised classification: no predefined classes
What is Clustering?
Typical applications:
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
Use cluster detection when you suspect that there are natural groupings that may represent groups of customers or products that have a lot in common.
When there are many competing patterns in the data, making it hard to spot a single pattern, creating clusters of similar records reduces the complexity within clusters so that other data mining techniques are more likely to succeed.
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
Land use: Identification of areas of similar land use in an earth observation database
Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
City-planning: Identifying groups of houses according to their house type, value, and geographical location
Earthquake studies: Observed earthquake epicenters should be clustered along continent faults
Clustering definition
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
data points in one cluster are more similar to one another (high intra-class similarity)
data points in separate clusters are less similar to one another (low inter-class similarity)
Similarity measures: e.g. Euclidean distance if attributes are continuous.
Requirements of Clustering in Data Mining
Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Able to deal with noise and outliers
Insensitive to order of input records
High dimensionality
Incorporation of user-specified constraints
Interpretability and usability
[Figure: http://webscripts.softpedia.com/screenshots/Efficient-K-Means-Clustering-using-JIT_1.png]
[Figure: http://api.ning.com/files/uI4 osegkS5tF JjFYZai3mGuslDuBQ1rFsozaAaDw9IBdc99OjNas3FPKIrdgPXAz34DU0KsbZwl7G8tM5-n4DXTk6Fab/clustering.gif]
[Figure: http://wiki.na-mic.org/Wiki/index.php/Progress_Report:DTI_Clustering - project aiming at developing tools in the 3D Slicer for automatic clustering of tractographic paths through diffusion tensor MRI (DTI) data, to 'characterize the strength of connectivity between selected regions in the brain']
Notion of a Cluster is Ambiguous
[Figure: the same set of initial points grouped as two clusters, four clusters, and six clusters.]
Clustering
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Data Matrix
Represents n objects with p variables (attributes, measures)
A relational table:

    x11 ... x1f ... x1p
    ...
    xi1 ... xif ... xip
    ...
    xn1 ... xnf ... xnp
Dissimilarity Matrix
Proximities of pairs of objects
d(i,j): dissimilarity between objects i and j
Nonnegative
Close to 0: similar

    0
    d(2,1)  0
    d(3,1)  d(3,2)  0
    ...     ...          ...
    d(n,1)  d(n,2)  ...  0
Types of data in clustering analysis
Continuous variables
Binary variables
Nominal and ordinal variables
Variables of mixed types
Continuous variables
To avoid dependence on the choice of measurement units, the data should be standardized.

Standardize data:
Calculate the mean absolute deviation:

    s_f = (1/n) (|x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f|)

where

    m_f = (1/n) (x_1f + x_2f + ... + x_nf)

Calculate the standardized measurement (z-score):

    z_if = (x_if - m_f) / s_f

Using the mean absolute deviation is more robust than using the standard deviation. Since the deviations are not squared, the effect of outliers is somewhat reduced, but their z-scores do not become too small; therefore, the outliers remain detectable.
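As a sketch of the standardization above (my own illustration in Python; the lecture's own code is R, and the sample values here are hypothetical):

```python
# Standardization with the mean absolute deviation, as defined above.
def standardize(values):
    """Return z-scores z_if = (x_if - m_f) / s_f for one continuous variable."""
    n = len(values)
    m = sum(values) / n                          # mean m_f
    s = sum(abs(x - m) for x in values) / n      # mean absolute deviation s_f
    return [(x - m) / s for x in values]

# Hypothetical measurements in centimetres; rescaling the units does not
# change the z-scores, which is the point of standardizing.
print(standardize([150.0, 160.0, 170.0, 180.0]))  # [-1.5, -0.5, 0.5, 1.5]
```

The same data expressed in metres yields the same z-scores, so a distance computed on standardized variables no longer depends on the measurement units.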
Similarity/Dissimilarity Between Objects
Distances are normally used to measure the similarity or dissimilarity between two data objects.
Euclidean distance is probably the most commonly chosen type of distance. It is the geometric distance in the multidimensional space:

    d(i,j) = sqrt( sum_{k=1..p} (x_ik - x_jk)^2 )

Required properties for a distance function:
d(i,j) ≥ 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) ≤ d(i,k) + d(k,j)
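A minimal sketch (my own illustration, not from the slides) of the Euclidean distance together with a spot check of the four required properties on three hypothetical points:

```python
import math

def euclidean(x, y):
    """d(i,j) = sqrt(sum_k (x_ik - x_jk)^2) over the p dimensions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

i, j, k = (0.0, 0.0), (3.0, 4.0), (6.0, 8.0)
print(euclidean(i, j))                                        # 5.0
assert euclidean(i, j) >= 0                                   # d(i,j) >= 0
assert euclidean(i, i) == 0                                   # d(i,i) = 0
assert euclidean(i, j) == euclidean(j, i)                     # d(i,j) = d(j,i)
assert euclidean(i, j) <= euclidean(i, k) + euclidean(k, j)   # triangle inequality
```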
Similarity/Dissimilarity Between Objects
City-block (Manhattan) distance. This distance is simply the sum of differences across dimensions. In most cases, this distance measure yields results similar to the Euclidean distance. However, note that in this measure the effect of single large differences (outliers) is dampened (since they are not squared).

    d(i,j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|

The properties stated for the Euclidean distance also hold for this measure.
Manhattan distance = distance if you had to travel along coordinates only.

Example: x = (5,5), y = (9,8)
Euclidean: dist(x,y) = √(4² + 3²) = 5
Manhattan: dist(x,y) = 4 + 3 = 7
Similarity/Dissimilarity Between Objects
Minkowski distance. Sometimes one may want to increase or decrease the progressive weight that is placed on dimensions on which the respective objects are very different. This measure makes it possible to accomplish that and is computed as:

    d(i,j) = ( |x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q )^(1/q)
Similarity/Dissimilarity Between Objects
Weighted distances
If we have some idea of the relative importance that should be assigned to each variable, then we can weight them and obtain a weighted distance measure:

    d(i,j) = sqrt( w_1 (x_i1 - x_j1)^2 + ... + w_p (x_ip - x_jp)^2 )
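The Minkowski and weighted distances above can be sketched as follows (my own illustration; q = 1 recovers the Manhattan distance and q = 2 the Euclidean distance, checked here on the slide's x = (5,5), y = (9,8) example):

```python
import math

def minkowski(x, y, q):
    """d(i,j) = (sum_k |x_ik - x_jk|^q)^(1/q)."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def weighted_euclidean(x, y, w):
    """d(i,j) = sqrt(sum_k w_k (x_ik - x_jk)^2)."""
    return math.sqrt(sum(wf * (a - b) ** 2 for wf, a, b in zip(w, x, y)))

x, y = (5.0, 5.0), (9.0, 8.0)                  # the worked example above
print(minkowski(x, y, 1))                      # 7.0  (Manhattan)
print(minkowski(x, y, 2))                      # 5.0  (Euclidean)
print(weighted_euclidean(x, y, (1.0, 1.0)))    # 5.0  (equal weights)
```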
Binary Variables
A binary variable has only two states: 0 or 1.
A binary variable is symmetric if both of its states are equally valuable, that is, there is no preference on which outcome should be coded as 1.
A binary variable is asymmetric if the outcomes of the states are not equally important, such as positive or negative outcomes of a disease test.
Similarity that is based on symmetric binary variables is called invariant similarity.
Binary Variables
A contingency table for binary data:

                  Object j
                  1     0    sum
    Object i  1   a     b    a+b
              0   c     d    c+d
            sum  a+c   b+d    p
Binary Variables
Symmetric binary dissimilarity:

    d(i,j) = (b + c) / (a + b + c + d)

Jaccard coefficient (asymmetric binary dissimilarity):

    d(i,j) = (b + c) / (a + b + c)
Dissimilarity between Binary Variables - Example

    Name  Gender Fever Cough Test-1 Test-2 Test-3 Test-4
    Jack    M     Y     N     P      N      N      N
    Mary    F     Y     N     P      N      P      N
    Jim     M     Y     P     N      N      N      N

gender is a symmetric attribute
the remaining attributes are asymmetric binary
let the values Y and P be set to 1, and the value N be set to 0

Jaccard coefficient (for the asymmetric variables):

    d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
    d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
    d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
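The three values above can be reproduced with a small helper (my own sketch, not code from the slides; Y/P coded as 1, N as 0):

```python
def jaccard_dissimilarity(x, y):
    """(b + c) / (a + b + c) over asymmetric binary attributes."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)  # both 1
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)  # i is 1, j is 0
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)  # i is 0, j is 1
    return (b + c) / (a + b + c)

# Fever, Cough, Test-1 ... Test-4 for each patient, from the table above.
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(jaccard_dissimilarity(jack, mary), 2))  # 0.33
print(round(jaccard_dissimilarity(jack, jim), 2))   # 0.67
print(round(jaccard_dissimilarity(jim, mary), 2))   # 0.75
```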
Nominal Variables
A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green.
Method 1: simple matching
m: # of matches, p: total # of variables

    d(i,j) = (p - m) / p

Method 2: use a large number of binary variables
creating a new binary variable for each of the M nominal states
Ordinal Variables
On ordinal variables order is important, e.g. Gold, Silver, Bronze.
Can be treated like continuous:
the ordered states define the ranking 1, ..., M_f
replace x_if by its rank r_if ∈ {1, ..., M_f}
map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

    z_if = (r_if - 1) / (M_f - 1)

compute the dissimilarity using methods for continuous variables
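The rank mapping above is short enough to sketch directly (my own illustration, using the Gold/Silver/Bronze example):

```python
def ordinal_to_unit_interval(value, ordered_states):
    """Map an ordinal state to [0, 1] via z_if = (r_if - 1) / (M_f - 1)."""
    r = ordered_states.index(value) + 1   # rank r_if in {1, ..., M_f}
    m = len(ordered_states)               # M_f distinct ordered states
    return (r - 1) / (m - 1)

medals = ["Bronze", "Silver", "Gold"]     # ordered from lowest to highest
print(ordinal_to_unit_interval("Bronze", medals))  # 0.0
print(ordinal_to_unit_interval("Silver", medals))  # 0.5
print(ordinal_to_unit_interval("Gold", medals))    # 1.0
```

After this mapping, the distance functions for continuous variables apply unchanged.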
Variables of Mixed Types
A database may contain several/all types of variables: continuous, symmetric binary, asymmetric binary, nominal and ordinal.
One may use a weighted formula to combine their effects:

    d(i,j) = ( sum_{f=1..p} δ_ij^(f) d_ij^(f) ) / ( sum_{f=1..p} δ_ij^(f) )

δ_ij^(f) = 0 if x_if is missing, or if x_if = x_jf = 0 and the variable f is asymmetric binary
δ_ij^(f) = 1 otherwise
continuous and ordinal variables: d_ij^(f) is the normalized absolute distance
binary and nominal variables: d_ij^(f) = 0 if x_if = x_jf; otherwise d_ij^(f) = 1
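One reasonable implementation of this combined formula looks like the sketch below (my own assumption about the details; the variable descriptions and the range used for normalization are hypothetical and would normally come from the data):

```python
def mixed_dissimilarity(x, y, var_types, ranges):
    """d(i,j) = sum_f delta_f * d_f / sum_f delta_f over mixed-type variables."""
    num, den = 0.0, 0.0
    for f, (xf, yf) in enumerate(zip(x, y)):
        if xf is None or yf is None:                    # missing value: delta = 0
            continue
        if var_types[f] == "asymmetric_binary" and xf == 0 and yf == 0:
            continue                                    # a 0/0 match carries no information
        if var_types[f] == "continuous":
            d = abs(xf - yf) / ranges[f]                # normalized absolute distance
        else:                                           # binary or nominal
            d = 0.0 if xf == yf else 1.0
        num += d
        den += 1.0
    return num / den

x = (180.0, 1, "red")
y = (120.0, 0, "red")
types = ("continuous", "asymmetric_binary", "nominal")
print(round(mixed_dissimilarity(x, y, types, ranges={0: 300.0}), 2))  # 0.4
```

Here the continuous variable contributes 60/300 = 0.2, the asymmetric binary mismatch contributes 1, the matching nominal value contributes 0, and the average over the three usable variables is 0.4.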
Clustering
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Major Clustering Approaches
Partitioning algorithms: Construct various partitions and then evaluate them by some criterion
Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion
Density-based: Based on connectivity and density functions. Able to find clusters of arbitrary shape. Continues growing a cluster as long as the density of points in the neighborhood exceeds a specified limit.
Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of that model to the data
Clustering
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Partitioning Algorithms: Basic Concept
Partitioning method: Construct a partition of a database D of n objects into a set of k clusters
Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
Global optimal: exhaustively enumerate all partitions
Heuristic methods: k-means and k-medoids algorithms
k-means: Each cluster is represented by the center of the cluster
k-medoids or PAM (Partition Around Medoids): Each cluster is represented by one of the objects in the cluster
The K-Means Clustering Method
Given k, the k-means algorithm is implemented in 4 steps:
1. Partition objects into k nonempty subsets
2. Compute the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
3. Assign each object to the cluster with the nearest centroid.
4. Go back to Step 2; stop when no new assignments occur.
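The four steps above can be sketched as a short loop (my own illustration with a naive first-k initialization and hypothetical 2-D points; production code would use R's kmeans as shown later, or an equivalent library):

```python
import math

def kmeans(points, k):
    """The 4-step k-means loop from the slide (naive, first-k seeding)."""
    centroids = list(points[:k])                  # step 1: initial representatives
    while True:
        # step 3: assign each object to the cluster with the nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # step 2: recompute each centroid as the mean point of its cluster
        new_centroids = [
            tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else centroids[c]
            for c, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:            # step 4: stop when nothing changes
            return centroids, clusters
        centroids = new_centroids

# Two well-separated hypothetical groups of three points each.
pts = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, clusters = kmeans(pts, k=2)
print(sorted(len(c) for c in clusters))           # [3, 3]
```

Even though both seeds start inside the same group here, the recomputed centroids drift apart and the loop settles on the two natural groups, which illustrates steps 2-4 in action.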
K-means clustering (k=3)
Comments on the K-Means Method
Strengths & Weaknesses
Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
Often terminates at a local optimum
Applicable only when mean is defined
Need to specify k, the number of clusters, in advance
Sensitive to noise and outliers as a small number of such points can influence the mean value
Not suitable to discover clusters with non-convex shapes
Importance of Choosing Initial Centroids
[Figure: six scatter plots (Iteration 1 through Iteration 6, x vs. y) showing how the k-means clusters evolve from one choice of initial centroids.]
Importance of Choosing Initial Centroids
[Figure: five scatter plots (Iteration 1 through Iteration 5, x vs. y) showing how the k-means clusters evolve from a different choice of initial centroids.]
Getting k Right
Try different k, looking at the change in the average distance to centroid as k increases.
The average falls rapidly until the right k, then changes little.
[Figure: average distance to centroid vs. k; the best value of k is where the curve flattens.]
Example
file: http://orion.math.iastate.edu/burkardt/data/hartigan/file06.txt

    Name                Energy Protein Fat Calcium Iron
    Braised beef           340      20  28       9  2.6
    Hamburger              245      21  17       9  2.7
    Roast beef             420      15  39       7  2
    Beefsteak              375      19  32       9  2.6
    Canned beef            180      22  10      17  3.7
    Broiled chicken        115      20   3       8  1.4
    Canned chicken         170      25   7      12  1.5
    Beef heart             160      26   5      14  5.9
    Roast lamb leg         265      20  20       9  2.6
    Roast lamb shoulder    300      18  25       9  2.3
    Smoked ham             340      20  28       9  2.5
    Pork roast             340      19  29       9  2.5
    Pork simmered          355      19  30       9  2.4
    Beef tongue            205      18  14       7  2.5
    Veal cutlet            185      23   9       9  2.7
    Baked bluefish         135      22   4      25  0.6
    Raw clams               70      11   1      82  6
    Canned clams            45       7   1      74  5.4
    Canned crabmeat         90      14   2      38  0.8
    Fried haddock          135      16   5      15  0.5
    Broiled mackerel       200      19  13       5  1
    Canned mackerel        155      16   9     157  1.8
    Fried perch            195      16  11      14  1.3
    Canned salmon          120      17   5     159  0.7
    Canned sardines        180      22   9     367  2.5
    Canned tuna            170      25   7       7  1.2
    Canned shrimp          110      23   1      98  2.6

R instructions:

    > data <- read.table("nutrients_levels.txt", header = TRUE, sep = "\t")
Example (cont.)

    > pairs(data[,2:6])

[Figure: scatterplot matrix of the five nutrient variables.]
Example (cont.)
k-means in R:

    > library(cluster)
    > cl2 <- kmeans(data[,2:6], centers=2)
    > cl2
    K-means clustering with 2 clusters of sizes 18, 9
    Cluster means:
        Energy Protein       Fat   Calcium     Iron
    1 145.5556      19  6.444444 61.555556 2.338889
    2 331.1111      19 27.555556  8.777778 2.466667
    Clustering vector:
     [1] 2 2 2 2 1 1 1 1 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    Within cluster sum of squares by cluster:
    [1] 178738.40  23751.03
    Available components:
    [1] "cluster"  "centers"  "withinss"  "size"
Example (cont.)

    > cl3 <- kmeans(data[,2:6], centers=3)
    ...
    > cl8 <- kmeans(data[,2:6], centers=8)
    > erros <- c(mean(cl2$withinss), mean(cl3$withinss), mean(cl4$withinss),
                 mean(cl5$withinss), mean(cl6$withinss), mean(cl7$withinss),
                 mean(cl8$withinss))
    > plot(erros)
    > plot(data[,2:6], col = cl4$cluster)
    > data_w_cl4 <- cbind(data, cl4$cluster)
    > data_w_cl4

    Name                Energy Protein Fat Calcium Iron cl4$cluster
    Braised beef           340      20  28       9  2.6           1
    Roast beef             420      15  39       7  2             1
    Beefsteak              375      19  32       9  2.6           1
    Smoked ham             340      20  28       9  2.5           1
    Pork roast             340      19  29       9  2.5           1
    Pork simmered          355      19  30       9  2.4           1
    Hamburger              245      21  17       9  2.7           2
    Roast lamb leg         265      20  20       9  2.6           2
    Roast lamb shoulder    300      18  25       9  2.3           2
    Raw clams               70      11   1      82  6             3
    Canned clams            45       7   1      74  5.4           3
    Canned mackerel        155      16   9     157  1.8           3
    Canned salmon          120      17   5     159  0.7           3
    Canned sardines        180      22   9     367  2.5           3
    Canned shrimp          110      23   1      98  2.6           3
    Canned beef            180      22  10      17  3.7           4
    Broiled chicken        115      20   3       8  1.4           4
    Canned chicken         170      25   7      12  1.5           4
    Beef heart             160      26   5      14  5.9           4
    Beef tongue            205      18  14       7  2.5           4
    Veal cutlet            185      23   9       9  2.7           4
    Baked bluefish         135      22   4      25  0.6           4
    Canned crabmeat         90      14   2      38  0.8           4
    Fried haddock          135      16   5      15  0.5           4
    Broiled mackerel       200      19  13       5  1             4
    Fried perch            195      16  11      14  1.3           4
    Canned tuna            170      25   7       7  1.2           4
    > data_w_cl3 <- cbind(data, cl3$cluster)
    > data_w_cl3

    Name                Energy Protein Fat Calcium Iron cl3$cluster
    Braised beef           340      20  28       9  2.6           1
    Roast beef             420      15  39       7  2             1
    Beefsteak              375      19  32       9  2.6           1
    Roast lamb leg         265      20  20       9  2.6           1
    Roast lamb shoulder    300      18  25       9  2.3           1
    Smoked ham             340      20  28       9  2.5           1
    Pork roast             340      19  29       9  2.5           1
    Pork simmered          355      19  30       9  2.4           1
    Hamburger              245      21  17       9  2.7           2
    Canned beef            180      22  10      17  3.7           2
    Broiled chicken        115      20   3       8  1.4           2
    Canned chicken         170      25   7      12  1.5           2
    Beef heart             160      26   5      14  5.9           2
    Beef tongue            205      18  14       7  2.5           2
    Veal cutlet            185      23   9       9  2.7           2
    Baked bluefish         135      22   4      25  0.6           2
    Canned crabmeat         90      14   2      38  0.8           2
    Fried haddock          135      16   5      15  0.5           2
    Broiled mackerel       200      19  13       5  1             2
    Fried perch            195      16  11      14  1.3           2
    Canned tuna            170      25   7       7  1.2           2
    Raw clams               70      11   1      82  6             3
    Canned clams            45       7   1      74  5.4           3
    Canned mackerel        155      16   9     157  1.8           3
    Canned salmon          120      17   5     159  0.7           3
    Canned sardines        180      22   9     367  2.5           3
    Canned shrimp          110      23   1      98  2.6           3
Davies-Bouldin Validity Index
R package 'clusterSim' - index.DB

Let Ci be a cluster of vectors, Xj a vector in Ci, and Ai the centroid of Ci. Let Si be a measure of scatter within the cluster:

    S_i = ( (1/T_i) sum_{Xj in Ci} |Xj - Ai|^q )^(1/q)

where T_i is the number of vectors in Ci.

Let M_i,j be a measure of separation between clusters Ci and Cj:

    M_i,j = ( sum_{k=1..N} |a_k,i - a_k,j|^p )^(1/p)

where a_k,i is the k-th component of the centroid Ai and the sum runs over the N components. Define:

    R_i,j = (S_i + S_j) / M_i,j
    R_i = max_{j ≠ i} R_i,j

    DB = (1/N) sum_{i=1..N} R_i    (N = number of clusters)

DB should be small for good clustering.
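The definitions above translate directly into code; the following is my own sketch with q = p = 2 and hypothetical clusters (in R one would simply call clusterSim's index.DB):

```python
import math

def centroid(cluster):
    return tuple(sum(v) / len(cluster) for v in zip(*cluster))

def scatter(cluster, center):
    """S_i with q = 2: root mean squared distance to the centroid."""
    return math.sqrt(sum(math.dist(x, center) ** 2 for x in cluster) / len(cluster))

def davies_bouldin(clusters):
    centers = [centroid(c) for c in clusters]
    scatters = [scatter(c, a) for c, a in zip(clusters, centers)]
    n = len(clusters)
    total = 0.0
    for i in range(n):
        # R_i = max over j != i of (S_i + S_j) / M_i,j, with M_i,j Euclidean (p = 2)
        total += max(
            (scatters[i] + scatters[j]) / math.dist(centers[i], centers[j])
            for j in range(n) if j != i
        )
    return total / n

tight = [[(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)], [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]]
loose = [[(0.0, 0.0), (0.0, 5.0), (5.0, 0.0)], [(10.0, 10.0), (10.0, 15.0), (15.0, 10.0)]]
print(davies_bouldin(tight) < davies_bouldin(loose))  # True: smaller DB is better
```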
Example
[Figure: two example clusters c1 and c2 with their centroids a_C1 and a_C2.]

With q = p = 2:

    S_1 = 4.57
    S_2 = 5.59
    M_1,2 = 17.89
    R_1,2 = R_2,1 = (4.57 + 5.59) / 17.89 = 0.568
    DB = (1/2)(0.568 + 0.568) = 0.568
Silhouette Validation Method
R package 'clusterSim' - index.S

For each point i, let a(i) be the average dissimilarity of i with all other points within the same cluster.
Any measure of dissimilarity can be used (e.g. Euclidean distance).
Then, find the average dissimilarity of i with the data in another single cluster. Repeat this for every other cluster.
b(i) is the lowest such average dissimilarity of i to any other cluster.
We can now define for each point i:

    s(i) = (b(i) - a(i)) / max{a(i), b(i)}

The average s(i) of a cluster is a measure of how tightly grouped the data in the cluster is.
Therefore, the average s(i) over the entire dataset is a measure of how appropriately the data has been clustered.
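Following the definition above, a(i), b(i) and s(i) can be computed in a few lines (my own sketch on two hypothetical clusters; R users would call cluster::silhouette or the corresponding clusterSim index):

```python
import math

def silhouette_scores(clusters):
    """Return s(i) = (b(i) - a(i)) / max{a(i), b(i)} for every point."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for point in cluster:
            # a(i): average dissimilarity to the other points of its own cluster
            a = sum(math.dist(point, q) for q in cluster if q is not point) / (len(cluster) - 1)
            # b(i): lowest average dissimilarity to any other single cluster
            b = min(
                sum(math.dist(point, q) for q in other) / len(other)
                for cj, other in enumerate(clusters) if cj != ci
            )
            scores.append((b - a) / max(a, b))
    return scores

clusters = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 10.0), (10.0, 11.0)]]
avg = sum(silhouette_scores(clusters)) / 4
print(avg > 0.9)  # True: well-separated clusters give s(i) close to 1
```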
Hartigan Index
R package 'clusterSim' - index.H

Assume a dataset with N samples X1, ..., XN, each sample with M dimensions.
For k clusters the overall fitness can be expressed as the squared error over all samples, where d is the distance between sample Xj and the centroid Xci of its cluster:

    err(k) = sum_{i=1..k} sum_{j in Ci} d(Xj, Xci)^2

The Hartigan index H(k) is defined as follows:

    H(k) = (N - k - 1) (err(k) - err(k+1)) / err(k+1)

The multiplier correction term (N - k - 1) is a penalty factor for a large number of cluster partitions. The optimal number of clusters k is the one that maximizes H(k).
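The two formulas above can be exercised on a tiny hypothetical dataset (my own sketch; err(k) is computed here from fixed clusterings rather than full k-means runs, just to show the arithmetic; in R one would call clusterSim's index.H):

```python
def err(clusters):
    """Sum of squared distances of each sample to its cluster centroid."""
    total = 0.0
    for cluster in clusters:
        center = [sum(v) / len(cluster) for v in zip(*cluster)]
        total += sum(sum((a - b) ** 2 for a, b in zip(x, center)) for x in cluster)
    return total

def hartigan(err_k, err_k1, n_samples, k):
    """H(k) = (N - k - 1) * (err(k) - err(k+1)) / err(k+1)."""
    return (n_samples - k - 1) * (err_k - err_k1) / err_k1

data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
one = [data]                                    # k = 1: everything together
two = [[data[0], data[1]], [data[2], data[3]]]  # k = 2: the natural split
print(hartigan(err(one), err(two), len(data), 1))  # 400.0
```

The large H(1) reflects the sharp drop in squared error from err(1) = 201 to err(2) = 1, signaling that moving from one cluster to two is worthwhile.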
Some links
HARTIGAN - Data for Clustering Algorithms
http://orion.math.iastate.edu/burkardt/data/hartigan/hartigan.html
R packages
cluster - http://cran.r-project.org/web/packages/cluster/index.html
clusterSim - http://cran.r-project.org/web/packages/clusterSim/index.html