2002 Spring CS525 Lecture 2

Transcript of 2002 Spring CS525 Lecture 2


    What is Cluster Analysis?

    Cluster: a collection of data objects

    Similar to one another within the same cluster

    Dissimilar to the objects in other clusters

    Cluster analysis: grouping a set of data objects into clusters

    Clustering is unsupervised classification: no predefined classes

    Typical applications:

    As a stand-alone tool to get insight into data distribution

    As a preprocessing step for other algorithms


    Examples of Clustering Applications

    Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

    Land use: Identification of areas of similar land use in an earth observation database

    Insurance: Identifying groups of motor insurance policy holders with a high average claim cost

    City planning: Identifying groups of houses according to their house type, value, and geographical location

    Earthquake studies: Observed earthquake epicenters should be clustered along continental faults


    What Is Good Clustering?

    A good clustering method will produce high-quality clusters with:

    high intra-class similarity

    low inter-class similarity

    The quality of a clustering result depends on both the similarity measure used by the method and its implementation.

    The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.


    Requirements of Clustering in Data Mining

    Scalability

    Ability to deal with different types of attributes

    Discovery of clusters with arbitrary shape

    Minimal requirements for domain knowledge to determine input parameters

    Ability to deal with noise and outliers

    Insensitivity to the order of input records

    High dimensionality

    Incorporation of user-specified constraints

    Interpretability and usability


    Data Structures

    Data matrix (two modes):

    $$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

    Dissimilarity matrix (one mode):

    $$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
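    To connect the two structures, here is a small illustrative sketch (my own Python, with assumed names, not from the lecture) that builds a one-mode dissimilarity matrix from a two-mode data matrix using Euclidean distance:

```python
import numpy as np

def dissimilarity_matrix(X):
    """Build an n x n dissimilarity matrix from an n x p data matrix,
    here using Euclidean distance. The result is symmetric with a zero
    diagonal, so only the lower triangle carries information."""
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i):
            D[i, j] = D[j, i] = np.linalg.norm(X[i] - X[j])
    return D

# Example: five objects described by two attributes (a two-mode data matrix)
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6]])
print(dissimilarity_matrix(X).round(2))
```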


    Measure the Quality of Clustering

    Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function, which is typically a metric:

    d(i, j)

    There is a separate quality function that measures the goodness of a cluster.

    The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.

    Weights should be associated with different variables

    based on applications and data semantics.

    It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.


    Major Clustering Approaches

    Partitioning algorithms: Construct various partitions and

    then evaluate them by some criterion

    Hierarchy algorithms: Create a hierarchical decomposition

    of the set of data (or objects) using some criterion

    Density-based: based on connectivity and density functions

    Grid-based: based on a multiple-level granularity structure

    Model-based: A model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given model


    Partitioning Algorithms: Basic Concept

    Partitioning method: Construct a partition of a database D of n objects into a set of k clusters

    Given a k, find a partition of k clusters that

    optimizes the chosen partitioning criterion

    Global optimal: exhaustively enumerate all partitions

    Heuristic methods: k-means and k-medoids algorithms

    k-means (MacQueen, 1967): Each cluster is represented by the center of the cluster

    k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw, 1987): Each cluster is represented by one of the objects in the cluster


    The K-Means Clustering Method

    Given k, the k-means algorithm is implemented in four steps:

    Partition objects into k nonempty subsets

    Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)

    Assign each object to the cluster with the nearest seed point

    Go back to Step 2; stop when no more new assignments occur
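    The four steps above translate almost directly into code. As a minimal illustrative sketch (my own Python/NumPy, not from the lecture; it seeds the centers by arbitrarily picking k objects, as in the example that follows):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Minimal k-means on an (n, p) data matrix X."""
    rng = np.random.default_rng(seed)
    # Step 1: arbitrarily pick k objects as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed point
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # Step 4: stop when assignments no longer change
        labels = new_labels
        # Step 2: recompute seed points as the centroids of the current partition
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers

# Example usage on a toy 2-D data set
labels, centers = k_means(np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.1], [5.3, 4.9]]), k=2)
print(labels, centers)
```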


    The K-Means Clustering Method: Example

    [Figure: k-means iterations on a 2-D scatter plot with K = 2. Arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means and reassign again until assignments stop changing.]


    Comments on the K-Means Method

    Strength: Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.


    Variations of the K-Means Method

    A few variants of k-means differ in:

    Selection of the initial k means

    Dissimilarity calculations

    Strategies to calculate cluster means

    Handling categorical data: k-modes (Huang, 1998)

    Replacing means of clusters with modes

    Using new dissimilarity measures to deal with categorical objects

    Using a frequency-based method to update modes of clusters

    A mixture of categorical and numerical data: the k-prototype method
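    As a small sketch of the k-modes ingredients just listed (my own code; the simple-matching measure is a commonly used choice, and the helper names are assumptions):

```python
from collections import Counter

def matching_dissimilarity(x, y):
    """Count of attributes on which two categorical objects disagree."""
    return sum(a != b for a, b in zip(x, y))

def cluster_mode(objects):
    """Frequency-based mode: the most common value of each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*objects))

# Tiny example with three categorical attributes
cluster = [("red", "S", "yes"), ("red", "M", "no"), ("blue", "M", "yes")]
print(cluster_mode(cluster))                           # ('red', 'M', 'yes')
print(matching_dissimilarity(cluster[0], cluster[1]))  # 2
```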


    What is the problem with the k-Means Method?

    The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.

    K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.

    [Figure: two 2-D scatter plots (0-10 axes) contrasting the cluster representatives; no further content recoverable.]


    The K-Medoids Clustering Method

    Find representative objects, called medoids, in clusters

    PAM (Partitioning Around Medoids, 1987)

    starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering

    PAM works effectively for small data sets, but does not scale well for large data sets

    CLARA (Kaufmann & Rousseeuw, 1990)

    CLARANS (Ng & Han, 1994): Randomized sampling

    Focusing + spatial data structure (Ester et al., 1995)


    Typical k-medoids algorithm (PAM)

    [Figure: PAM iterations on a 2-D scatter plot with K = 2. Arbitrarily choose k objects as the initial medoids; assign each remaining object to the nearest medoid (total cost = 20); randomly select a non-medoid object O_random; compute the total cost of swapping (total cost = 26); swap O and O_random if the quality is improved; loop until no change.]


    PAM (Partitioning Around Medoids) (1987)

    PAM (Kaufman and Rousseeuw, 1987), built into S-Plus

    Use real objects to represent the clusters:

    1. Select k representative objects arbitrarily

    2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_ih

    3. For each pair of i and h, if TC_ih < 0, i is replaced by h; then assign each non-selected object to the most similar representative object

    4. Repeat steps 2-3 until there is no change
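    A minimal sketch of this loop in Python (not the original S-Plus implementation): Euclidean distance, an exhaustive swap search, and the names pam and total_cost are my own illustrative choices.

```python
import numpy as np

def total_cost(X, medoid_idx):
    """Sum of distances from every object to its nearest medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))   # step 1
    improved = True
    while improved:                                             # step 4
        improved = False
        for i in list(medoids):                                 # steps 2-3
            for h in range(len(X)):
                if h in medoids:
                    continue
                candidate = [h if m == i else m for m in medoids]
                # a negative TC_ih means the swap lowers the total distance
                if total_cost(X, candidate) < total_cost(X, medoids):
                    medoids, improved = candidate, True
    # finally, assign each object to its most similar representative object
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return medoids, d.argmin(axis=1)
```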


    PAM Clustering: Total swapping cost $TC_{ih} = \sum_{j} C_{jih}$

    [Four figures, one per case, each showing a non-selected object j, the current medoid i, the candidate medoid h, and another medoid t:]

    Case 1: $C_{jih} = 0$

    Case 2: $C_{jih} = d(j, h) - d(j, i)$

    Case 3: $C_{jih} = d(j, t) - d(j, i)$

    Case 4: $C_{jih} = d(j, h) - d(j, t)$
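    Read together, the four cases all say the same thing: C_jih is object j's new nearest-medoid distance minus its old one after i is swapped with h. A small sketch that computes this directly (my own code and names, not from the slides):

```python
def reassignment_cost(d, j, i, h, medoids):
    """C_jih: the change in object j's nearest-medoid distance when
    medoid i is swapped out for candidate h. Reduces to the four cases
    on the slide depending on where j was assigned and where it goes."""
    new_medoids = [h if m == i else m for m in medoids]
    return min(d(j, m) for m in new_medoids) - min(d(j, m) for m in medoids)

def total_swapping_cost(d, i, h, medoids, objects):
    """TC_ih = sum over non-selected objects j of C_jih."""
    return sum(reassignment_cost(d, j, i, h, medoids)
               for j in objects if j not in medoids and j != h)
```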


    What is the problem with PAM?

    PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean

    PAM works efficiently for small data sets but does not scale well for large data sets:

    O(k(n-k)^2) for each iteration, where n is # of data objects and k is # of clusters

    Sampling-based method: CLARA (Clustering LARge Applications)


    CLARA (Clustering LARge Applications) (1990)

    CLARA (Kaufmann and Rousseeuw, 1990)

    Built into statistical analysis packages, such as S+

    It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output

    Strength: deals with larger data sets than PAM

    Weakness:

    Efficiency depends on the sample size

    A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
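    A minimal sketch of the CLARA idea just described (my own code; it assumes the hypothetical pam() and total_cost() helpers sketched earlier are in scope, and the sample sizes are arbitrary):

```python
import numpy as np

def clara(X, k, n_samples=5, sample_size=40, seed=0):
    """Sketch of CLARA: run PAM on several random samples and keep the
    medoid set that has the lowest total cost on the whole data set."""
    rng = np.random.default_rng(seed)
    best_medoids, best_cost = None, np.inf
    for s in range(n_samples):
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        sample_medoids, _ = pam(X[idx], k, seed=s)       # cluster the sample only
        medoids = [int(m) for m in idx[sample_medoids]]  # map back to full-data indices
        cost = total_cost(X, medoids)                    # but score on all objects
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    d = np.linalg.norm(X[:, None, :] - X[best_medoids][None, :, :], axis=2)
    return best_medoids, d.argmin(axis=1)
```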


    K-Means Example

    Given: {2,4,10,12,3,20,30,11,25}, k=2

    Randomly assign means: m1 = 3, m2 = 4

    Solve for the rest.

    Similarly try for k-medoids
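    A worked sketch of this exercise (my own code, not from the lecture); with these seeds the one-dimensional run should converge to the clusters {2, 3, 4, 10, 11, 12} and {20, 25, 30}, with means 7 and 25:

```python
data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
means = [3.0, 4.0]                       # the given initial means m1 = 3, m2 = 4

while True:
    # assign every point to the nearest current mean
    clusters = [[], []]
    for x in data:
        clusters[0 if abs(x - means[0]) <= abs(x - means[1]) else 1].append(x)
    # recompute the means of the two clusters
    new_means = [sum(c) / len(c) for c in clusters]
    if new_means == means:               # no change: the algorithm has converged
        break
    means = new_means

print(clusters, means)
```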


    Clustering Approaches

    Clustering

        Hierarchical: Agglomerative, Divisive

        Partitional

        Categorical

        Large DB: Sampling, Compression


    Cluster Summary Parameters


    Distance Between Clusters

    Single Link: smallest distance between points

    Complete Link: largest distance between points

    Average Link: average distance between points

    Centroid: distance between centroids
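    As an illustrative sketch of these four inter-cluster distances (my own Python/NumPy, with assumed names):

```python
import numpy as np

def single_link(A, B):
    """Smallest distance between a point of A and a point of B."""
    return min(np.linalg.norm(a - b) for a in A for b in B)

def complete_link(A, B):
    """Largest distance between a point of A and a point of B."""
    return max(np.linalg.norm(a - b) for a in A for b in B)

def average_link(A, B):
    """Average distance over all cross-cluster pairs of points."""
    return sum(np.linalg.norm(a - b) for a in A for b in B) / (len(A) * len(B))

def centroid_distance(A, B):
    """Distance between the two cluster centroids (mean points)."""
    return np.linalg.norm(np.mean(A, axis=0) - np.mean(B, axis=0))

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 4.0], [4.0, 4.0]])
print(single_link(A, B), complete_link(A, B), average_link(A, B), centroid_distance(A, B))
```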


    Hierarchical Clustering

    Use the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition

    [Figure: objects a, b, c, d, e over steps 0-4; agglomerative (AGNES) proceeds left to right, merging a b, then d e, then c d e, then a b c d e; divisive (DIANA) proceeds in the reverse order, right to left.]


    Hierarchical Clustering

    Clusters are created in levels, actually creating sets of clusters at each level.

    Agglomerative

    Initially each item is in its own cluster

    Iteratively, clusters are merged together

    Bottom Up

    Divisive

    Initially all items are in one cluster

    Large clusters are successively divided

    Top Down


    Hierarchical Algorithms

    Single Link

    MST Single Link

    Complete Link

    Average Link


    Dendrogram

    Dendrogram: a tree data structure which illustrates hierarchical clustering techniques.

    Each level shows the clusters for that level.

    Leaf: individual clusters

    Root: one cluster

    A cluster at level i is the union of its children clusters at level i+1.


    Levels of Clustering


    Agglomerative Example

    Distance matrix:

      A B C D E
    A 0 1 2 2 3
    B 1 0 2 4 3
    C 2 2 0 1 5
    D 2 4 1 0 3
    E 3 3 5 3 0

    [Figure: graph on nodes A, B, C, D, E, and the resulting dendrogram over A B C D E drawn against a distance-threshold axis from 1 to 5.]
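    As an illustrative sketch of agglomerative clustering on this matrix (my own code, assuming single-link merging, which is consistent with the threshold dendrogram); with this matrix the merges happen at distances 1, 1, 2 and 3:

```python
# Single-link agglomerative clustering of the example distance matrix above.
labels = ["A", "B", "C", "D", "E"]
D = {("A", "B"): 1, ("A", "C"): 2, ("A", "D"): 2, ("A", "E"): 3,
     ("B", "C"): 2, ("B", "D"): 4, ("B", "E"): 3,
     ("C", "D"): 1, ("C", "E"): 5, ("D", "E"): 3}

def dist(x, y):
    return D.get((x, y), D.get((y, x), 0))

def cluster_distance(c1, c2):
    # single link: smallest pairwise distance between the two clusters
    return min(dist(x, y) for x in c1 for y in c2)

clusters = [{l} for l in labels]
while len(clusters) > 1:
    # find the closest pair of clusters and merge them
    i, j = min(((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
               key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]))
    d = cluster_distance(clusters[i], clusters[j])
    merged = clusters[i] | clusters[j]
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    print(f"merge at distance {d}: {sorted(merged)}")
```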


    MST Example

      A B C D E
    A 0 1 2 2 3
    B 1 0 2 4 3
    C 2 2 0 1 5
    D 2 4 1 0 3
    E 3 3 5 3 0

    [Figure: the same five objects drawn as a graph on nodes A, B, C, D, E.]
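    A small sketch (my own) of computing the minimum spanning tree for this matrix with Prim's algorithm; single-link clusters can then be read off by removing the longest MST edges:

```python
labels = ["A", "B", "C", "D", "E"]
M = [[0, 1, 2, 2, 3],
     [1, 0, 2, 4, 3],
     [2, 2, 0, 1, 5],
     [2, 4, 1, 0, 3],
     [3, 3, 5, 3, 0]]

# Prim's algorithm: grow the tree from A, always adding the cheapest
# edge that connects a node outside the tree to a node inside it.
in_tree = {0}
edges = []
while len(in_tree) < len(labels):
    w, u, v = min((M[u][v], u, v)
                  for u in in_tree
                  for v in range(len(labels)) if v not in in_tree)
    in_tree.add(v)
    edges.append((labels[u], labels[v], w))

print(edges)  # MST edges; cutting the largest ones yields single-link clusters
```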


    Agglomerative Algorithm


    Single Link

    View all items as a graph with links (distances) between them. Find the maximal connected components of this graph.

    Two clusters are merged if there is at least one edge which connects them.

    Uses threshold distances at each level.

    Could be agglomerative or divisive.
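    A minimal sketch of this graph view (my own code): at a given threshold, the clusters are the connected components of the graph that keeps only edges no longer than the threshold; raising the threshold level by level reproduces the agglomerative hierarchy.

```python
def threshold_clusters(labels, dist, threshold):
    """Connected components of the graph whose edges are the pairs
    with distance <= threshold (single-link clusters at that level)."""
    # start with every item in its own component
    parent = {x: x for x in labels}

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    # merge two components whenever an edge is short enough
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            if dist(a, b) <= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for x in labels:
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())

# With the A-E distance matrix from the agglomerative example, one would expect:
# threshold 1 -> {A,B}, {C,D}, {E};  threshold 2 -> {A,B,C,D}, {E};  threshold 3 -> one cluster
```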


    MST Single Link Algorithm


    Single Link Clustering


    DIANA (Divisive Analysis)

    Introduced in Kaufmann and Rousseeuw (1990)

    Implemented in statistical analysis packages, e.g., Splus

    Inverse order of AGNES

    Eventually each node forms a cluster on its own

    [Figure: three 2-D scatter plots (0-10 axes) illustrating the divisive splitting process.]
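    The slides do not give DIANA's pseudocode, so the following is only a simplified sketch (my own) of the usual splinter-group formulation, under these assumptions: the cluster with the largest diameter is split first, and objects move one at a time to the splinter group while they are on average closer to it than to the rest.

```python
import numpy as np

def diameter(D, cluster):
    """Largest pairwise dissimilarity inside a cluster (0 for singletons)."""
    return max(D[i, j] for i in cluster for j in cluster)

def split(D, cluster):
    """One DIANA-style split: seed a splinter group with the object having the
    largest average dissimilarity to the rest, then move over any object that
    is on average closer to the splinter group than to the remaining objects."""
    rest = list(cluster)
    seed = max(rest, key=lambda i: np.mean([D[i, j] for j in rest if j != i]))
    splinter = [seed]
    rest.remove(seed)
    while len(rest) > 1:
        movers = [i for i in rest
                  if np.mean([D[i, j] for j in splinter])
                  < np.mean([D[i, j] for j in rest if j != i])]
        if not movers:
            break
        splinter.append(movers[0])
        rest.remove(movers[0])
    return splinter, rest

def diana(D):
    """Top-down clustering: repeatedly split the cluster with the largest
    diameter until every object is a cluster on its own (inverse of AGNES)."""
    clusters = [list(range(len(D)))]
    while any(len(c) > 1 for c in clusters):
        target = max((c for c in clusters if len(c) > 1),
                     key=lambda c: diameter(D, c))
        clusters.remove(target)
        clusters.extend(split(D, target))
    return clusters

# Example: the five-object distance matrix used in the agglomerative example
D = np.array([[0, 1, 2, 2, 3], [1, 0, 2, 4, 3], [2, 2, 0, 1, 5],
              [2, 4, 1, 0, 3], [3, 3, 5, 3, 0]], dtype=float)
print(diana(D))
```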


    Readings

    CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling. George Karypis, Eui-Hong Han, Vipin Kumar. IEEE Computer 32(8): 68-75, 1999 (http://glaros.dtc.umn.edu/gkhome/node/152)

    A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96)

    BIRCH: A New Data Clustering Algorithm and Its Applications. Data Mining and Knowledge Discovery, Volume 1, Issue 2 (1997)
