ClusterI

  • 8/8/2019 ClusterI

    Cluster Analysis

    What is Cluster Analysis?

    Types of Data in Cluster Analysis

    A Categorization of Major Clustering Methods

    Partitioning Methods

    Hierarchical Methods

    Density-Based Methods

    Grid-Based Methods

    Model-Based Clustering Methods

    Outlier Analysis

    Summary

    What is Cluster Analysis?

    Cluster: a collection of data objects
      - Similar to one another within the same cluster
      - Dissimilar to the objects in other clusters

    Cluster analysis
      - Grouping a set of data objects into clusters

    Clustering is unsupervised classification: no predefined classes

    Typical applications
      - As a stand-alone tool to get insight into data distribution
      - As a preprocessing step for other algorithms

    General Applications of Clustering

    Pattern Recognition

    Spatial Data Analysis
      - create thematic maps in GIS by clustering feature spaces
      - detect spatial clusters and explain them in spatial data mining

    Image Processing

    Economic Science (especially market research)

    WWW
      - Document classification
      - Cluster Weblog data to discover groups of similar access patterns

    Examples of Clustering Applications

    Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

    Land use: Identification of areas of similar land use in an earth observation database

    Insurance: Identifying groups of motor insurance policy holders with a high average claim cost

    City-planning: Identifying groups of houses according to their house type, value, and geographical location

    Earthquake studies: Observed earthquake epicenters should be clustered along continental faults

    What Is Good Clustering?

    A good clustering method will produce high quality clusters with
      - high intra-class similarity
      - low inter-class similarity

    The quality of a clustering result depends on both the similarity measure used by the method and its implementation.

    The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

    Classification is more objective

    Can compare various algorithms

    [Figure: scatter plot of points labeled x and o]

    Clustering is very subjective

    Cluster the following animals: sheep, lizard, cat, dog, sparrow, blue shark, viper, seagull, goldfish, frog, red mullet

    1. By the way they bear their progeny

    2. By the existence of lungs

    3. By the environment that they live in

    4. By the way they bear their progeny and the existence of lungs

    Which way is correct? It depends on the purpose.

    Requirements of Clustering in Data Mining

    Scalability

    Ability to deal with different types of attributes

    Discovery of clusters with arbitrary shape

    Minimal requirements for domain knowledge to determine input parameters

    Able to deal with noise and outliers

    Insensitive to order of input records

    High dimensionality

    Incorporation of user-specified constraints

    Interpretability and usability

    Distance measure

    Dissimilarity/Similarity metric: similarity is expressed in terms of a distance function, which is typically a metric: d(i, j)

    Similarity measure: large means close

    Dissimilarity measure: small means close

    Ex: Sequence similarity

        TATATAAGTTCCA
        TATATAAGGCTAAT

    Aligned:

        TATATAAGT-TCCA
        TATATAAGGCTAAT

    Distance = cost of insertion + 2 * cost of changing C to A + cost of changing A to T
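    The sequence example sums operation costs along an alignment. A minimal sketch of that idea, assuming the standard dynamic-programming edit distance with unit costs (the slide gives no numeric costs); the function name `edit_distance` and the parameters `indel_cost` and `change_cost` are illustrative, not from the slides:

```python
def edit_distance(s, t, indel_cost=1, change_cost=1):
    """Minimum total cost of insertions, deletions, and changes turning s into t."""
    # prev[j] is the cost of converting the current prefix of s into t[:j].
    prev = [j * indel_cost for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * indel_cost]  # converting s[:i] into the empty string
        for j in range(1, len(t) + 1):
            change = prev[j - 1] + (0 if s[i - 1] == t[j - 1] else change_cost)
            delete = prev[j] + indel_cost
            insert = curr[j - 1] + indel_cost
            curr.append(min(change, delete, insert))
        prev = curr
    return prev[-1]


# Small means close: identical sequences are at distance 0.
same = edit_distance("TATA", "TATA")
slide_pair = edit_distance("TATATAAGTTCCA", "TATATAAGGCTAAT")
```

    A single `change_cost` treats every substitution alike; per-pair costs such as "C to A" versus "A to T" would need a cost table instead.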

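    Data Structures

    Asymmetrical distance

    Symmetrical distance

    Data matrix:

    \[
    \begin{bmatrix}
    x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
    \vdots & & \vdots & & \vdots \\
    x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
    \vdots & & \vdots & & \vdots \\
    x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
    \end{bmatrix}
    \]

    Dissimilarity matrix:

    \[
    \begin{bmatrix}
    0 \\
    d(2,1) & 0 \\
    d(3,1) & d(3,2) & 0 \\
    \vdots & \vdots & \vdots & \ddots \\
    d(n,1) & d(n,2) & \cdots & \cdots & 0
    \end{bmatrix}
    \]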
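    As a rough illustration of how the two structures relate, the sketch below builds a lower-triangular dissimilarity matrix from a small data matrix. It assumes a symmetric distance (so only entries below the diagonal are stored) and uses Manhattan distance purely as an example; the data values and function names are hypothetical:

```python
def dissimilarity_matrix(data, dist):
    """Lower-triangular dissimilarity matrix for a list of objects."""
    n = len(data)
    # Row i holds d(i, 0), ..., d(i, i-1), followed by the zero diagonal entry.
    return [[dist(data[i], data[j]) for j in range(i)] + [0.0] for i in range(n)]


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical data matrix: 3 objects (rows) described by 2 variables (columns).
data = [(1.0, 2.0), (4.0, 6.0), (1.0, 3.0)]
matrix = dissimilarity_matrix(data, manhattan)
# matrix == [[0.0], [7.0, 0.0], [1.0, 6.0, 0.0]]
```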

    Measure the Quality of Clustering

    There is a separate "quality" function that measures the "goodness" of a cluster.

    The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.

    Weights should be associated with different variables based on applications and data semantics.

    It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.

    Type of data in clustering analysis

    Interval-scaled variables:

    Binary variables:

    Nominal, ordinal, and ratio variables:

    Variables of mixed types:

    Interval-valued variables

    Standardize data

    Calculate the mean absolute deviation:

    \[ s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right) \]

    where

    \[ m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right) \]

    Calculate the standardized measurement (z-score):

    \[ z_{if} = \frac{x_{if} - m_f}{s_f} \]

    Using mean absolute deviation is more robust than using standard deviation
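    The standardization steps can be sketched as follows for a single variable f. The function name `standardize` and the sample measurements are hypothetical; the error raised for a constant variable is an assumption, since the slide does not say how to handle s_f = 0:

```python
def standardize(values):
    """Z-scores of one variable, scaled by the mean absolute deviation."""
    n = len(values)
    m_f = sum(values) / n                        # mean of the variable
    s_f = sum(abs(x - m_f) for x in values) / n  # mean absolute deviation
    if s_f == 0:
        # Assumption: a constant variable carries no information to standardize.
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - m_f) / s_f for x in values]


heights = [150.0, 160.0, 170.0, 180.0]  # hypothetical measurements
z = standardize(heights)
# m_f = 165, s_f = 10, so z == [-1.5, -0.5, 0.5, 1.5]
```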

    Similarity and Dissimilarity Between Objects

    Distances are normally used to measure the similarity or dissimilarity between two data objects

    Some popular ones include the Minkowski distance:

    \[ d(i,j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q} \]

    where i = (x_{i1}, x_{i2}, ..., x_{ip}) and j = (x_{j1}, x_{j2}, ..., x_{jp}) are two p-dimensional data objects, and q is a positive integer

    If q = 1, d is Manhattan distance:

    \[ d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}| \]
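    A minimal sketch of the Minkowski formula, with q = 1 giving the Manhattan distance and q = 2 the Euclidean distance. The function name and the two sample objects are illustrative only:

```python
def minkowski(i, j, q):
    """Minkowski distance of order q between two equal-length numeric vectors."""
    if len(i) != len(j):
        raise ValueError("objects must have the same number of variables")
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1.0 / q)


obj_i = (1.0, 2.0)  # hypothetical 2-dimensional objects
obj_j = (4.0, 6.0)
manhattan = minkowski(obj_i, obj_j, 1)  # |1 - 4| + |2 - 6| = 7
euclidean = minkowski(obj_i, obj_j, 2)  # sqrt(9 + 16) = 5
```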

    Similarity and Dissimilarity Between Objects (Cont.)

    If q = 2, d is Euclidean distance:

    \[ d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2} \]

    Properties
      - d(i,j) ≥ 0
      - d(i,i) = 0
      - d(i,j) = d(j,i)
      - d(i,j) ≤ d(i,k) + d(k,j)

    Also, one can use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures

    Binary Variables

    A contingency table for binary data

                         Object j
                         1      0      sum
        Object i    1    a      b      a+b
                    0    c      d      c+d
                    sum  a+c    b+d    p

    Simple matching coefficient (invariant, if the binary variable is symmetric):

    \[ d(i,j) = \frac{b + c}{a + b + c + d} \]

    Jaccard coefficient (noninvariant if the binary variable is asymmetric):

    \[ d(i,j) = \frac{b + c}{a + b + c} \]
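    To make the contingency counts concrete, here is a small sketch that derives a, b, c, d from two 0/1 vectors and evaluates both coefficients. The function names and sample vectors are hypothetical, and returning 0 for two all-zero objects in `jaccard` is an assumption the slides do not address:

```python
def contingency(i, j):
    """Counts (a, b, c, d) of 1/1, 1/0, 0/1, and 0/0 pairs for two 0/1 vectors."""
    pairs = list(zip(i, j))
    a = sum(1 for x, y in pairs if x == 1 and y == 1)
    b = sum(1 for x, y in pairs if x == 1 and y == 0)
    c = sum(1 for x, y in pairs if x == 0 and y == 1)
    d = sum(1 for x, y in pairs if x == 0 and y == 0)
    return a, b, c, d


def simple_matching(i, j):
    """Dissimilarity for symmetric binary variables: 0/0 matches count."""
    a, b, c, d = contingency(i, j)
    return (b + c) / (a + b + c + d)


def jaccard(i, j):
    """Dissimilarity for asymmetric binary variables: 0/0 matches are ignored."""
    a, b, c, _ = contingency(i, j)
    if a + b + c == 0:
        return 0.0  # assumption: two all-zero objects are treated as identical
    return (b + c) / (a + b + c)


obj_i = [1, 0, 1, 0, 0, 0]  # hypothetical binary objects
obj_j = [1, 0, 1, 0, 1, 0]
counts = contingency(obj_i, obj_j)  # (2, 0, 1, 3)
```

    With these vectors the simple matching value is 1/6 and the Jaccard value is 1/3; the gap comes entirely from the three 0/0 matches that Jaccard ignores.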

    Dissimilarity between Binary Variables

    Example

      - gender is a symmetric attribute
      - the remaining attributes are asymmetric binary
      - let the values Y and P be set to 1, and the value N be set to 0

        Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
        Jack  M       Y      N      P       N       N       N
        Mary  F       Y      N      P       N       P       N
        Jim   M       Y      P      N       N       N       N

    \[ d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33 \]

    \[ d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67 \]

    \[ d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75 \]
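    The three values in this example can be checked with a short script. It encodes only the asymmetric attributes (Fever, Cough, Test-1 to Test-4) and applies the Jaccard coefficient; the string encoding and the function name are illustrative choices, not part of the slides:

```python
# Asymmetric attributes in column order: Fever, Cough, Test-1, Test-2, Test-3, Test-4.
patients = {
    "Jack": "YNPNNN",
    "Mary": "YNPNPN",
    "Jim": "YPNNNN",
}
# Y and P are set to 1, N is set to 0.
encoded = {name: [0 if v == "N" else 1 for v in row] for name, row in patients.items()}


def jaccard_dissimilarity(i, j):
    """(b + c) / (a + b + c); assumes the objects share at least one 1 or differ somewhere."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    mismatches = sum(1 for x, y in zip(i, j) if x != y)  # b + c
    return mismatches / (a + mismatches)


d_jack_mary = jaccard_dissimilarity(encoded["Jack"], encoded["Mary"])  # 1/3
d_jack_jim = jaccard_dissimilarity(encoded["Jack"], encoded["Jim"])    # 2/3
d_jim_mary = jaccard_dissimilarity(encoded["Jim"], encoded["Mary"])    # 3/4
```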

    Nominal Variables

    A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green

    Method 1: Simple matching
      - m: # of matches, p: total # of variables

    \[ d(i,j) = \frac{p - m}{p} \]

    Method 2: use a large number of binary variables
      - creating a new binary variable for each of the M nominal states
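    Both methods can be sketched in a few lines. The function names, the sample objects, and the list of states are hypothetical:

```python
def nominal_dissimilarity(i, j):
    """Method 1: (p - m) / p, where m counts variables on which i and j agree."""
    p = len(i)
    m = sum(1 for x, y in zip(i, j) if x == y)
    return (p - m) / p


def one_hot(value, states):
    """Method 2: one new binary variable per nominal state."""
    return [1 if value == s else 0 for s in states]


obj_i = ("red", "small", "round")  # hypothetical objects with p = 3 nominal variables
obj_j = ("red", "large", "round")
d = nominal_dissimilarity(obj_i, obj_j)  # m = 2, so d = 1/3

colors = ["red", "yellow", "blue", "green"]
blue_bits = one_hot("blue", colors)  # [0, 0, 1, 0]
```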

    Ordinal Variables

    An ordinal variable can be discrete or continuous

    Order is important, e.g., rank

    Can be treated like interval-scaled
      - replace x_{if} by their rank

    \[ r_{if} \in \{1, \ldots, M_f\} \]

      - map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

    \[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]

      - compute the dissimilarity using methods for interval-scaled variables
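    A brief sketch of the rank mapping, assuming the ordered list of states is known in advance and contains at least two states. The function name and the medal example are hypothetical:

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Map ordinal values to z = (r - 1) / (M - 1), where r is the 1-based rank."""
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    m_f = len(ordered_states)  # number of ordered states, M_f (must be >= 2)
    return [(rank[v] - 1) / (m_f - 1) for v in values]


medals = ["bronze", "gold", "silver"]  # hypothetical observations
z = ordinal_to_unit_interval(medals, ["bronze", "silver", "gold"])
# ranks are 1, 3, 2, so z == [0.0, 1.0, 0.5]
```

    The resulting z values can then be fed to any interval-scaled distance, such as Manhattan or Euclidean.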

    Ratio-Scaled Variables

    Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as A e^{Bt} or A e^{-Bt}

    Methods:
      - treat them like interval-scaled variables: not a good choice! (why? the scale can be distorted)
      - apply logarithmic transformation: y_{if} = log(x_{if})
      - treat them as continuous ordinal data and treat their rank as interval-scaled
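    The effect of the logarithmic transformation can be seen on values that grow by a constant factor. This sketch assumes the natural logarithm (the slide does not fix a base) and uses hypothetical measurements:

```python
import math


def log_transform(values):
    """Apply y = log(x) to positive ratio-scaled measurements."""
    if any(x <= 0 for x in values):
        raise ValueError("ratio-scaled values must be positive")
    return [math.log(x) for x in values]


counts = [1.0, 10.0, 100.0, 1000.0]  # hypothetical exponential growth
y = log_transform(counts)
raw_gaps = [b - a for a, b in zip(counts, counts[1:])]  # 9, 90, 900: distorted scale
log_gaps = [b - a for a, b in zip(y, y[1:])]            # each gap is about log(10)
```

    On the raw scale the last gap dominates any distance computation; after the transformation every step contributes equally.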

    Variables of Mixed Types

    A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio

    One may use a weighted formula to combine their effects

    \[ d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} \]

      - f is binary or nominal: d_{ij}^{(f)} = 0 if x_{if} = x_{jf}, or d_{ij}^{(f)} = 1 otherwise
      - f is interval-based: use the normalized distance
      - f is ordinal or ratio-scaled: compute ranks r_{if} and

    \[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]

        and treat z_{if} as interval-scaled
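    A rough sketch of the weighted formula follows. Several details are assumptions not stated on the slide: the weight delta is taken as 0 when either value is missing and 1 otherwise, the "normalized distance" for interval variables is taken as the absolute difference divided by the variable's range, and ordinal or ratio-scaled variables are assumed to be already mapped to z_{if}. All names and data are hypothetical:

```python
def mixed_dissimilarity(i, j, kinds, ranges):
    """d(i, j) = sum(delta_f * d_f) / sum(delta_f) over the variables f."""
    numerator = 0.0
    denominator = 0.0
    for f, kind in enumerate(kinds):
        x, y = i[f], j[f]
        if x is None or y is None:
            continue  # assumption: delta = 0 when a value is missing
        if kind == "nominal":  # also used for binary variables
            d_f = 0.0 if x == y else 1.0
        elif kind == "interval":  # ordinal/ratio values assumed pre-mapped to z
            d_f = abs(x - y) / ranges[f]  # assumption: normalize by max - min
        else:
            raise ValueError(f"unknown variable kind: {kind}")
        numerator += d_f
        denominator += 1.0  # delta = 1 for every comparable variable
    if denominator == 0:
        raise ValueError("no comparable variables")
    return numerator / denominator


kinds = ["nominal", "interval", "interval"]
ranges = [None, 50.0, 1.0]  # hypothetical range (max - min) of each interval variable
obj_i = ("red", 20.0, 0.0)
obj_j = ("blue", 45.0, 0.5)
d = mixed_dissimilarity(obj_i, obj_j, kinds, ranges)  # (1 + 0.5 + 0.5) / 3
```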

    Major Clustering Approaches

    Partitioning algorithms: Construct various partitions and then evaluate them by some criterion

    Hierarchical algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion

    Density-based: based on connectivity and density functions

    Grid-based: based on a multiple-level granularity structure

    Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of the data to that model