ClusterI

8/8/2019 ClusterI


Cluster Analysis

What is Cluster Analysis?

Types of Data in Cluster Analysis

A Categorization of Major Clustering Methods

Partitioning Methods

Hierarchical Methods

Density-Based Methods

Grid-Based Methods

Model-Based Clustering Methods

Outlier Analysis

Summary


What is Cluster Analysis?

Cluster: a collection of data objects
- Similar to one another within the same cluster
- Dissimilar to the objects in other clusters

Cluster analysis
- Grouping a set of data objects into clusters

Clustering is unsupervised classification: no predefined classes

Typical applications
- As a stand-alone tool to get insight into data distribution
- As a preprocessing step for other algorithms


General Applications of Clustering

Pattern Recognition

Spatial Data Analysis
- create thematic maps in GIS by clustering feature spaces
- detect spatial clusters and explain them in spatial data mining

Image Processing

Economic Science (especially market research)

WWW
- Document classification
- Cluster Weblog data to discover groups of similar access patterns


Examples of Clustering Applications

Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

Land use: Identification of areas of similar land use in an earth observation database

Insurance: Identifying groups of motor insurance policy holders with a high average claim cost

City-planning: Identifying groups of houses according to their house type, value, and geographical location

Earthquake studies: Observed earthquake epicenters should be clustered along continent faults


What Is Good Clustering?

A good clustering method will produce high quality clusters with
- high intra-class similarity
- low inter-class similarity

The quality of a clustering result depends on both the similarity measure used by the method and its implementation.

The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.


Classification is more objective

Can compare various algorithms

[Figure: scatter plot of data points from two labeled classes, marked x and o]


Clustering is very subjective

Cluster the following animals: sheep, lizard, cat, dog, sparrow, blue shark, viper, seagull, goldfish, frog, red mullet

1. By the way they bear their progeny

2. By the existence of lungs

3. By the environment that they live in

4. By the way they bear their progeny and the existence of lungs

Which way is correct? It depends on the criterion chosen.


Requirements of Clustering in Data Mining

Scalability

Ability to deal with different types of attributes

Discovery of clusters with arbitrary shape

Minimal requirements for domain knowledge to determine input parameters

Able to deal with noise and outliers

Insensitive to order of input records

High dimensionality

Incorporation of user-specified constraints

Interpretability and usability


Distance measure

Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function, which is typically a metric: d(i, j)

Similarity measure: large means close

Dissimilarity measure: small means close

Ex: Sequence similarity

Unaligned sequences:
TATATAAGTTCCA
TATATAAGGCTAAT

Aligned sequences:
TATATAAGT-TCCA
TATATAAGGCTAAT

Distance = cost of insertion + 2*cost of changing C to A + cost of changing A to T


Data Structures

Asymmetrical distance

Symmetrical distance

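Data matrix (n objects, p variables):

\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots & & \vdots & & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots & & \vdots & & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]

Dissimilarity matrix (n by n, lower triangular):

\[
\begin{bmatrix}
0 & & & & \\
d(2,1) & 0 & & & \\
d(3,1) & d(3,2) & 0 & & \\
\vdots & \vdots & \vdots & \ddots & \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]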
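As a rough illustration of how the two structures relate, the sketch below builds a lower-triangular dissimilarity matrix from a small data matrix. The function name `dissimilarity_matrix`, the choice of Euclidean distance, and the sample values are assumptions made for this example; the slides do not prescribe them.

```python
from math import sqrt

def dissimilarity_matrix(data):
    """Build the lower-triangular dissimilarity matrix for a list of rows.

    Assumption: Euclidean distance is used as d(i, j).
    """
    def euclidean(u, v):
        return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    # Row i holds d(i, 0), ..., d(i, i); the diagonal entry d(i, i) is 0.
    return [[euclidean(data[i], data[j]) for j in range(i + 1)]
            for i in range(len(data))]

# Made-up data matrix: n = 3 objects, p = 2 variables.
data = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = dissimilarity_matrix(data)
for row in D:
    print(row)
```

Only the lower triangle is stored because d(i, j) = d(j, i) for a symmetric distance.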


Measure the Quality of Clustering

There is a separate "quality" function that measures the "goodness" of a cluster.

The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.

Weights should be associated with different variables based on applications and data semantics.

It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.


Type of data in clustering analysis

Interval-scaled variables:

Binary variables:

Nominal, ordinal, and ratio variables:

Variables of mixed types:


Interval-valued variables

Standardize data

Calculate the mean absolute deviation:

\[ s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right) \]

where

\[ m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right) \]

Calculate the standardized measurement (z-score):

\[ z_{if} = \frac{x_{if} - m_f}{s_f} \]

Using mean absolute deviation is more robust than using standard deviation
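The standardization steps above can be sketched for a single variable f as follows. The function name `standardize` and the sample values are illustrative assumptions, and the guard against a zero deviation is an addition the slides do not mention.

```python
def standardize(values):
    """Return z-scores of one variable f using the mean absolute deviation."""
    n = len(values)
    m_f = sum(values) / n                          # mean m_f
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation s_f
    if s_f == 0:
        # Assumption: treat a constant variable as an error rather than divide by zero.
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - m_f) / s_f for x in values]

# Made-up measurements: m_f = 5.0 and s_f = 2.0.
z = standardize([2.0, 4.0, 6.0, 8.0])
print(z)  # [-1.5, -0.5, 0.5, 1.5]
```

Because the deviations are not squared, a single extreme value inflates s_f less than it would inflate the standard deviation, which is the robustness the slide refers to.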


Similarity and Dissimilarity Between Objects

Distances are normally used to measure the similarity or dissimilarity between two data objects

Some popular ones include: Minkowski distance:

\[ d(i,j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q} \]

where \( i = (x_{i1}, x_{i2}, \ldots, x_{ip}) \) and \( j = (x_{j1}, x_{j2}, \ldots, x_{jp}) \) are two p-dimensional data objects, and q is a positive integer

If q = 1, d is Manhattan distance:

\[ d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}| \]
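A minimal sketch of the Minkowski distance follows, assuming each object is given as a plain sequence of numbers. The function name `minkowski` and the sample points are made up for illustration.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two p-dimensional objects."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of variables")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

# Made-up objects with p = 2 variables.
i = (1.0, 2.0)
j = (4.0, 6.0)
manhattan = minkowski(i, j, 1)   # |1 - 4| + |2 - 6| = 7
euclidean = minkowski(i, j, 2)   # sqrt(3**2 + 4**2) = 5
print(manhattan, euclidean)
```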


Similarity and Dissimilarity Between Objects (Cont.)

If q = 2, d is Euclidean distance:

\[ d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2} \]

Properties
- d(i,j) ≥ 0
- d(i,i) = 0
- d(i,j) = d(j,i)
- d(i,j) ≤ d(i,k) + d(k,j)

Also, one can use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures
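The four properties can be spot-checked numerically on a handful of points. The sketch below is only an illustration on made-up data, not a proof; the helper names `satisfies_metric_properties` and `squared_euclidean` are hypothetical. It contrasts Euclidean distance with squared Euclidean distance, which fails the triangle inequality.

```python
from itertools import permutations
from math import dist  # Euclidean distance between two points (Python 3.8+)

def satisfies_metric_properties(points, d, tol=1e-12):
    """Check the four listed properties of d on every combination of points."""
    for i in points:
        if abs(d(i, i)) > tol:                        # d(i,i) = 0
            return False
    for i, j in permutations(points, 2):
        if d(i, j) < 0:                               # d(i,j) >= 0
            return False
        if abs(d(i, j) - d(j, i)) > tol:              # d(i,j) = d(j,i)
            return False
    for i, j, k in permutations(points, 3):
        if d(i, j) > d(i, k) + d(k, j) + tol:         # triangle inequality
            return False
    return True

def squared_euclidean(u, v):
    return dist(u, v) ** 2

# Made-up points; three of them are collinear on purpose.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 4.0)]
euclidean_ok = satisfies_metric_properties(points, dist)
squared_ok = satisfies_metric_properties(points, squared_euclidean)
print(euclidean_ok, squared_ok)  # True False
```

The squared distance from (0, 0) to (2, 0) is 4, which exceeds the sum 1 + 1 of the squared distances through (1, 0), so it is a dissimilarity measure but not a metric.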


Binary Variables

A contingency table for binary data

                      Object j
                   1       0       sum
Object i    1      a       b       a+b
            0      c       d       c+d
            sum    a+c     b+d     p

Simple matching coefficient (invariant, if the binary variable is symmetric):

\[ d(i,j) = \frac{b + c}{a + b + c + d} \]

Jaccard coefficient (noninvariant if the binary variable is asymmetric):

\[ d(i,j) = \frac{b + c}{a + b + c} \]


Dissimilarity between Binary Variables

Example

Name   Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack   M       Y      N      P       N       N       N
Mary   F       Y      N      P       N       P       N
Jim    M       Y      P      N       N       N       N

- gender is a symmetric attribute
- the remaining attributes are asymmetric binary
- let the values Y and P be set to 1, and the value N be set to 0

\[ d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33 \]

\[ d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67 \]

\[ d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75 \]
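The three values in this example can be reproduced with a short script. The sketch below encodes the six asymmetric attributes (Fever, Cough, Test-1 to Test-4) as 0/1 vectors and applies the Jaccard coefficient; the helper names `contingency` and `jaccard` are illustrative, and gender is left out because it is symmetric.

```python
def contingency(x, y):
    """Count the cells a, b, c, d of the 2x2 contingency table for two 0/1 vectors."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d

def jaccard(x, y):
    """Jaccard dissimilarity (b + c) / (a + b + c); undefined if a + b + c = 0."""
    a, b, c, _ = contingency(x, y)
    return (b + c) / (a + b + c)

ENCODE = {"Y": 1, "P": 1, "N": 0}
# Attribute order: Fever, Cough, Test-1, Test-2, Test-3, Test-4.
records = {"jack": "YNPNNN", "mary": "YNPNPN", "jim": "YPNNNN"}
vectors = {name: [ENCODE[ch] for ch in row] for name, row in records.items()}

d_jack_mary = jaccard(vectors["jack"], vectors["mary"])
d_jack_jim = jaccard(vectors["jack"], vectors["jim"])
d_jim_mary = jaccard(vectors["jim"], vectors["mary"])
print(round(d_jack_mary, 2), round(d_jack_jim, 2), round(d_jim_mary, 2))  # 0.33 0.67 0.75
```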


Nominal Variables

A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green

Method 1: Simple matching
- m: # of matches, p: total # of variables

\[ d(i,j) = \frac{p - m}{p} \]

Method 2: use a large number of binary variables
- creating a new binary variable for each of the M nominal states
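Both methods can be sketched in a few lines. The function names `nominal_dissimilarity` and `one_hot`, and the sample objects, are assumptions made for illustration.

```python
def nominal_dissimilarity(x, y):
    """Method 1: simple matching, d = (p - m) / p."""
    p = len(x)                                   # total number of variables
    m = sum(1 for u, v in zip(x, y) if u == v)   # number of matches
    return (p - m) / p

def one_hot(value, states):
    """Method 2: one new binary variable per nominal state."""
    return [1 if value == s else 0 for s in states]

# Made-up objects described by three nominal variables.
obj_i = ("red", "small", "round")
obj_j = ("red", "large", "round")
d_ij = nominal_dissimilarity(obj_i, obj_j)       # one mismatch out of three

colors = ["red", "yellow", "blue", "green"]
encoded = one_hot("blue", colors)
print(d_ij, encoded)
```

After the encoding of Method 2, the binary-variable coefficients can be applied to the resulting 0/1 vectors.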


Ordinal Variables

An ordinal variable can be discrete or continuous

Order is important, e.g., rank

Can be treated like interval-scaled
- replace \( x_{if} \) by their rank

\[ r_{if} \in \{1, \ldots, M_f\} \]

- map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

\[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]

- compute the dissimilarity using methods for interval-scaled variables
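The rank mapping can be sketched as below, assuming the ordered list of states is known in advance and has at least two entries. The function name `ordinal_to_unit_interval` and the medal example are hypothetical.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Map ordinal values to z = (r - 1) / (M - 1), where r is the rank in 1..M."""
    M = len(ordered_states)
    if M < 2:
        raise ValueError("need at least two ordered states")
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[v] - 1) / (M - 1) for v in values]

# Made-up ordinal variable with M = 3 states, listed from lowest to highest.
states = ["bronze", "silver", "gold"]
z = ordinal_to_unit_interval(["bronze", "gold", "silver"], states)
print(z)  # [0.0, 1.0, 0.5]
```

The resulting values lie in [0, 1], so any interval-scaled distance can be applied to them directly.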


Ratio-Scaled Variables

Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as \( Ae^{Bt} \) or \( Ae^{-Bt} \)

Methods:
- treat them like interval-scaled variables: not a good choice! (why? the scale can be distorted)
- apply logarithmic transformation: \( y_{if} = \log(x_{if}) \)
- treat them as continuous ordinal data and treat their rank as interval-scaled
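To see why the logarithmic transformation helps, the sketch below generates values on an exponential scale and compares the gaps between consecutive measurements before and after taking logs. The constants A and B and the time steps are made up for illustration.

```python
from math import exp, log

# Made-up ratio-scaled measurements x = A * e^(B * t) at t = 0, 1, 2, 3, 4.
A, B = 2.0, 0.5
x = [A * exp(B * t) for t in range(5)]

# Logarithmic transformation y = log(x) = log(A) + B * t, which is linear in t.
y = [log(v) for v in x]

raw_gaps = [x[k + 1] - x[k] for k in range(len(x) - 1)]   # grow with t
log_gaps = [y[k + 1] - y[k] for k in range(len(y) - 1)]   # all close to B

print([round(g, 3) for g in raw_gaps])
print([round(g, 3) for g in log_gaps])
```

On the raw scale, later measurements dominate any distance computation; after the transformation, equal time steps contribute equal differences.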


Variables of Mixed Types

A database may contain all the six types of variables
- symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio

One may use a weighted formula to combine their effects

\[ d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} \]

- f is binary or nominal: \( d_{ij}^{(f)} = 0 \) if \( x_{if} = x_{jf} \), or \( d_{ij}^{(f)} = 1 \) otherwise
- f is interval-based: use the normalized distance
- f is ordinal or ratio-scaled: compute ranks \( r_{if} \) and

\[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]

  and treat \( z_{if} \) as interval-scaled
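The weighted formula can be sketched as follows. Several details are assumptions, since the slide does not spell them out: the indicator δ is taken to be 0 when either value is missing (`None`) and 1 otherwise, the "normalized distance" for interval variables is taken to be the absolute difference divided by the variable's range, and ordinal variables are expected to be pre-mapped to [0, 1] and passed as interval variables with range 1. The function name `mixed_dissimilarity` is hypothetical.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Combine per-variable dissimilarities d_f, weighted by an indicator delta_f."""
    numerator = 0.0
    denominator = 0.0
    for xf, yf, kind, rng in zip(x, y, kinds, ranges):
        if xf is None or yf is None:
            continue                      # assumption: delta = 0 for missing values
        if kind == "nominal":             # also covers binary variables
            d_f = 0.0 if xf == yf else 1.0
        elif kind == "interval":
            d_f = abs(xf - yf) / rng      # assumption: range-normalized difference
        else:
            raise ValueError(f"unknown variable kind: {kind}")
        numerator += d_f                  # delta = 1 for this variable
        denominator += 1.0
    if denominator == 0:
        raise ValueError("no variable is comparable for this pair")
    return numerator / denominator

# Made-up objects: a color, an age (observed range 40), and an ordinal value in [0, 1].
kinds = ("nominal", "interval", "interval")
ranges = (None, 40.0, 1.0)
obj_i = ("red", 30.0, 0.5)
obj_j = ("blue", 50.0, 1.0)
d_ij = mixed_dissimilarity(obj_i, obj_j, kinds, ranges)   # (1 + 0.5 + 0.5) / 3
print(round(d_ij, 3))
```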


Major Clustering Approaches

Partitioning algorithms: Construct various partitions and then evaluate them by some criterion

Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion

Density-based: based on connectivity and density functions

Grid-based: based on a multiple-level granularity structure

Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of the data to that model
