Lecture 9 Clustering


Transcript of Lecture 9 Clustering

  • Clustering

  • Pattern Recognition

    Learning/Testing

    Supervised learning (data with class labels): Bayesian, KNN, neural networks, etc. These are the "Classifiers" in Weka.

    Unsupervised learning (data with no labels): K-Means, Cobweb, EM, etc. These are under "Cluster" in Weka, so this topic is generally called clustering.

  • Supervised learning vs. unsupervised learning

    Supervised learning: discover patterns in the data that relate data attributes to a target (class) attribute. These patterns are then used to predict the values of the target attribute in future data instances.

    Unsupervised learning: the data have no target attribute. We want to explore the data to find some intrinsic structure in them.

  • Clustering

    Clustering is a technique for finding similarity groups in data, called clusters. That is, it groups data instances that are similar to (near) each other into one cluster, and data instances that are very different from (far away from) each other into different clusters.

    Clustering is often called an unsupervised learning task, as no class values denoting an a priori grouping of the data instances are given, as is the case in supervised learning.

  • An illustration

    The data set has three natural groups of data points, i.e., three natural clusters.

  • What is clustering for?

    Let us see some real-life examples.

    Example 1: group people of similar sizes together to make small, medium, and large T-shirts.
    Tailor-made for each person: too expensive.
    One-size-fits-all: does not fit all.

    Example 2: in marketing, segment customers according to their similarities, to do targeted marketing.

  • What is clustering for? (cont.)

    Example 3: given a collection of text documents, we want to organize them according to their content similarities, to produce a topic hierarchy.

    In fact, clustering is one of the most utilized data mining techniques. It has a long history and is used in almost every field, e.g., medicine, psychology, botany, sociology, biology, archeology, marketing, insurance, and libraries.

    In recent years, due to the rapid increase of online documents, text clustering has become important.

  • Aspects of clustering

    A clustering algorithm:
      Partitional clustering
      Hierarchical clustering

    A distance (similarity, or dissimilarity) function.

    Clustering quality:
      Inter-cluster distance maximized
      Intra-cluster distance minimized

    The quality of a clustering result depends on the algorithm, the distance function, and the application.
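
    The worked examples later in this lecture use Euclidean distance as the distance function. As a minimal sketch in Python (the helper name and the sample points are illustrative, not from the lecture):

    import math

    def euclidean_distance(p, q):
        """Euclidean distance between two equal-length numeric points."""
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    print(euclidean_distance((0, 0), (3, 4)))  # 5.0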

  • K-means clustering

    K-means is a partitional clustering algorithm. Let the set of data points (or instances) D be {x1, x2, ..., xn}, where xi = (xi1, xi2, ..., xir) is a vector in a real-valued space X ⊆ R^r, and r is the number of attributes (dimensions) in the data.

    The k-means algorithm partitions the given data into k clusters. Each cluster has a cluster center, called the centroid.

    k is specified by the user.
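
    A standard way to make this precise (not stated on this slide, but consistent with the "objective function" values quoted later in the lecture) is that k-means seeks the partition minimizing the sum of squared errors:

    % SSE, the usual k-means objective: C_j is the j-th cluster
    % and m_j its centroid (the mean of the points in C_j).
    \mathrm{SSE} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - m_j \rVert^2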

  • Algorithm: k-means

    1. Decide on a value for k.
    2. Initialize the k cluster centers (randomly, if necessary).
    3. Decide the class memberships of the N objects by assigning them to the nearest cluster center.
    4. Re-estimate the k cluster centers, by assuming the memberships found above are correct.
    5. If none of the N objects changed membership in the last iteration, exit. Otherwise go to 3.
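
    As a minimal sketch of steps 1-5 in plain Python (the function name, the random initialization, and the toy data are illustrative, not from the lecture):

    import math
    import random

    def kmeans(points, k, seed=0):
        """Minimal k-means over a list of equal-length numeric tuples."""
        rng = random.Random(seed)
        centers = rng.sample(points, k)          # step 2: random initial centers
        while True:
            # Step 3: assign each point to its nearest center (Euclidean).
            clusters = [[] for _ in range(k)]
            for p in points:
                j = min(range(k), key=lambda c: math.dist(p, centers[c]))
                clusters[j].append(p)
            # Step 4: re-estimate each center as the mean of its cluster.
            centers_new = [
                tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[j]
                for j, cl in enumerate(clusters)
            ]
            # Step 5: stop when the centers (hence memberships) are unchanged.
            if centers_new == centers:
                return centers, clusters
            centers = centers_new

    For example, kmeans([(1, 1), (1, 2), (8, 8), (9, 8)], 2) should return one center near (1, 1.5) and one near (8.5, 8).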

  • K-means algorithm (cont.)

  • K-means Clustering: Step 1

    [Figure: scatter plot on a 0-5 by 0-5 grid showing centroids k1, k2, k3. Algorithm: k-means; distance metric: Euclidean distance.]

  • K-means Clustering: Step 2

    [Figure: scatter plot on a 0-5 by 0-5 grid showing centroids k1, k2, k3. Algorithm: k-means; distance metric: Euclidean distance.]

  • K-means Clustering: Step 3

    [Figure: scatter plot on a 0-5 by 0-5 grid showing centroids k1, k2, k3. Algorithm: k-means; distance metric: Euclidean distance.]

  • K-means Clustering: Step 4

    [Figure: scatter plot on a 0-5 by 0-5 grid showing centroids k1, k2, k3. Algorithm: k-means; distance metric: Euclidean distance.]

  • K-means Clustering: Step 5

    [Figure: scatter plot (axes: expression in condition 1 vs. expression in condition 2) showing centroids k1, k2, k3. Algorithm: k-means; distance metric: Euclidean distance.]

  • How can we tell the right number of clusters?

    In general, this is an unsolved problem. However, there are many approximate methods; in the next few slides we will see an example.

    For our example, we will use the data set on the left [Figure: scatter plot on a 1-10 by 1-10 grid]. However, in this case we are imagining that we do NOT know the class labels; we are only clustering on the X and Y axis values.

  • When k = 1, the objective function is 873.0

    [Figure: the data set clustered with k = 1.]

  • When k = 2, the objective function is 173.1

    [Figure: the data set clustered with k = 2.]

  • When k = 3, the objective function is 133.6

    [Figure: the data set clustered with k = 3.]

  • We can plot the objective function values for k = 1 to 6

    [Figure: objective function vs. k, with the y-axis running from 0.00E+00 to 1.00E+03.]

    The abrupt change at k = 2 is highly suggestive of two clusters in the data. This technique for determining the number of clusters is known as knee finding or elbow finding.

    Note that the results are not always as clear-cut as in this toy example.
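
    As a sketch of how such an elbow plot can be computed, reusing the kmeans helper sketched earlier (the SSE definition here is the standard sum of squared distances, an assumption consistent with the objective values above):

    import math

    def sse(centers, clusters):
        """Sum of squared distances from each point to its cluster center."""
        return sum(
            math.dist(p, centers[j]) ** 2
            for j, cl in enumerate(clusters)
            for p in cl
        )

    points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]  # toy data
    for k in range(1, 7):
        centers, clusters = kmeans(points, k)
        print(k, sse(centers, clusters))  # look for the k where the drop levels off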

  • Image Segmentation Results

    [Figure: an image (I) and the three-cluster image (J) obtained by clustering the gray values of I.]
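
    A sketch of how such a segmentation could be produced with the kmeans helper from earlier: treat each pixel's gray value as a one-dimensional data point, cluster with k = 3, and recolor each pixel with its nearest centroid (the nested-list image here is an illustrative stand-in for real image data):

    image = [[10, 12, 200], [11, 198, 202], [13, 14, 199]]  # toy gray values

    pixels = [(v,) for row in image for v in row]   # 1-D points: gray values
    centers, _ = kmeans(pixels, 3)

    # Replace each pixel by the gray value of its nearest centroid.
    segmented = [
        [min(centers, key=lambda c: abs(c[0] - v))[0] for v in row]
        for row in image
    ]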

  • Strengths of k-means

    Simple: easy to understand and to implement.

    Efficient: the time complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations. Since both k and t are small, k-means is considered a linear algorithm.

    K-means is the most popular clustering algorithm.

  • Weaknesses of k-means

    The algorithm is only applicable if the mean is defined. For categorical data, k-modes is used: the centroid is represented by the most frequent values (see the sketch after this list).

    The user needs to specify k.

    The algorithm is sensitive to outliers. Outliers are data points that are very far away from other data points. They could be errors in the data recording or some special data points with very different values.
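
    A minimal sketch of the k-modes centroid update mentioned above: for categorical data, the "center" of a cluster takes, in each attribute, the most frequent value (the function name and the sample records are illustrative):

    from collections import Counter

    def mode_centroid(cluster):
        """Per-attribute most frequent value: the k-modes analogue of a mean."""
        return tuple(
            Counter(col).most_common(1)[0][0]
            for col in zip(*cluster)
        )

    print(mode_centroid([("red", "S"), ("red", "M"), ("blue", "M")]))
    # -> ('red', 'M')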

  • Weaknesses of k-means: Problems with outliers

  • Weaknesses of k-means: To deal with outliers

    One method is to remove, during the clustering process, data points that are much further away from the centroids than other data points (see the sketch below).

    Another method is to perform random sampling. Since in sampling we choose only a small subset of the data points, the chance of selecting an outlier is very small.
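
    A sketch of the removal idea (the cutoff of three times a cluster's mean distance to its centroid is an illustrative choice, not from the lecture):

    import math

    def drop_far_points(cluster, center, factor=3.0):
        """Keep only points within factor * (mean distance to the center)."""
        dists = [math.dist(p, center) for p in cluster]
        cutoff = factor * (sum(dists) / len(dists))
        return [p for p, d in zip(cluster, dists) if d <= cutoff]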

  • Weaknesses of k-means (cont.)

    The algorithm is sensitive to initial seeds.

  • Weaknesses of k-means (cont.)

    If we use different seeds, we can get good results. There are some methods to help choose good seeds; one example is sketched below.
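
    One well-known seed-selection method (not named on the slide; k-means++ is offered here as an example) picks each new seed with probability proportional to its squared distance from the nearest seed chosen so far, which tends to spread the initial centers apart. A minimal sketch:

    import math
    import random

    def kmeans_pp_seeds(points, k, seed=0):
        """k-means++ style seeding."""
        rng = random.Random(seed)
        centers = [rng.choice(points)]       # first seed: uniform at random
        while len(centers) < k:
            # Squared distance from each point to its nearest chosen seed.
            d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
            # Next seed: drawn with probability proportional to d2.
            centers.append(rng.choices(points, weights=d2, k=1)[0])
        return centers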