2003 Data Mining Tut 3


    Intelligent Data Analysis and Probabilistic Inference

Data Mining Tutorial 3: Clustering and Association Rules

1.

i. Explain the operation of the k-means clustering algorithm using pseudocode.

ii. Given the following eight points, and assuming initial cluster centroids given by A, B, and C, and that a Euclidean distance function is used for measuring distance between points, use k-means to show the three clusters and calculate their new centroids after the second round of execution.

2.

i. Explain the meaning of support and confidence in the context of association rule discovery algorithms, and explain how the Apriori heuristic can be used to improve the efficiency of such algorithms.

ii. Given the transactions described below, find all rules between single items that have support >= 60%. For each rule, report both support and confidence.

1: (Beer)
2: (Cola, Beer)
3: (Cola, Beer)
4: (Nuts, Beer)
5: (Nuts, Cola, Beer)
6: (Nuts, Cola, Beer)
7: (Crisps, Nuts, Cola)
8: (Crisps, Nuts, Cola, Beer)
9: (Crisps, Nuts, Cola, Beer)
10: (Crisps, Nuts, Cola, Beer)
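The Apriori pruning and the rule statistics asked for in 2.ii can be checked with a short script (a minimal sketch for these ten transactions, not an efficient Apriori implementation; variable and function names are illustrative):

```python
from itertools import combinations

# Transactions from the question, one item set per transaction ID 1..10.
transactions = [
    {"Beer"},
    {"Cola", "Beer"},
    {"Cola", "Beer"},
    {"Nuts", "Beer"},
    {"Nuts", "Cola", "Beer"},
    {"Nuts", "Cola", "Beer"},
    {"Crisps", "Nuts", "Cola"},
    {"Crisps", "Nuts", "Cola", "Beer"},
    {"Crisps", "Nuts", "Cola", "Beer"},
    {"Crisps", "Nuts", "Cola", "Beer"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / n

# Apriori step 1: keep only single items with support >= 60%.
items = {i for t in transactions for i in t}
frequent = {i for i in items if support({i}) >= 0.6}

# Apriori step 2: candidate pairs are built only from frequent items,
# since no superset of an infrequent item can be frequent.
rules = []
for a, b in combinations(sorted(frequent), 2):
    s = support({a, b})
    if s >= 0.6:
        rules.append((a, b, s, s / support({a})))  # rule a -> b
        rules.append((b, a, s, s / support({b})))  # rule b -> a

for lhs, rhs, s, c in rules:
    print(f"{lhs} -> {rhs}: support={s:.0%}, confidence={c:.1%}")
```

With a 60% threshold, Crisps (support 4/10) is pruned at the single-item stage, so no candidate pair containing Crisps is ever generated.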

    [email protected], [email protected] 16th Dec2003

ID X Y
A 2 10
B 2 5
C 8 4
D 5 8
E 7 5
F 6 4
G 1 2
H 4 9
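As a cross-check on question 1.ii, the two rounds of k-means can be run directly on the table above (a minimal sketch; `math.dist` needs Python 3.8+, and names are illustrative):

```python
import math

# The eight points from the table; A, B, C are the initial centroids.
points = {"A": (2, 10), "B": (2, 5), "C": (8, 4), "D": (5, 8),
          "E": (7, 5), "F": (6, 4), "G": (1, 2), "H": (4, 9)}

def assign(centroids):
    """Assign each point to its nearest centroid (Euclidean distance)."""
    clusters = [[] for _ in centroids]
    for name, p in points.items():
        d = [math.dist(p, c) for c in centroids]
        clusters[d.index(min(d))].append(name)
    return clusters

def recompute(clusters):
    """New centroid = coordinate-wise mean of the member points."""
    return [tuple(sum(points[m][i] for m in cl) / len(cl) for i in range(2))
            for cl in clusters]

centroids = [points["A"], points["B"], points["C"]]
for _ in range(2):  # two rounds of execution
    clusters = assign(centroids)
    centroids = recompute(clusters)
print(clusters)
print(centroids)
```

The second round reproduces the same three clusters as the first, so the centroids are unchanged, matching the answer given later in the sheet.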


3. a. Explain how hierarchical clustering algorithms work; make sure your answer describes what is meant by a linkage method and how it is used.

b. Explain the advantages and disadvantages of hierarchical clustering compared to k-means clustering.
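For 3.a, the key idea is that a linkage method lifts a pointwise distance function to a distance between whole clusters; the common choices can be sketched as follows (illustrative helper names, not from the tutorial):

```python
# A linkage method turns pointwise distances into a cluster-to-cluster
# distance. Each function takes two clusters (lists of points) and a
# pointwise distance function `d`.

def single_linkage(c1, c2, d):
    """Distance between the closest pair of points, one from each cluster."""
    return min(d(p, q) for p in c1 for q in c2)

def complete_linkage(c1, c2, d):
    """Distance between the farthest pair of points across the clusters."""
    return max(d(p, q) for p in c1 for q in c2)

def average_linkage(c1, c2, d):
    """Mean of all pairwise distances between the two clusters."""
    return sum(d(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))
```

At every step the agglomerative algorithm merges the two clusters whose linkage distance is smallest, then recomputes distances from the merged cluster to all the rest.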

4. The following table shows the distance matrix between five genes:

    G1 G2 G3 G4 G5

    G1 0

    G2 9 0

    G3 3 7 0

    G4 6 5 9 0

    G5 11 10 2 8 0

i. Based on a complete linkage method, show the distance matrix between the first formed cluster and the other data points.

ii. Draw a dendrogram showing the full hierarchical clustering tree for the five points based on complete linkage.

iii. Draw a dendrogram showing the full hierarchical clustering tree for the five points based on single linkage.



    Data Mining Tutorial 3: Answers

1.

Clusters after 1st iteration:
Cluster 1: A (2, 10), D (5, 8), H (4, 9)
Cluster 2: B (2, 5), G (1, 2)
Cluster 3: C (8, 4), E (7, 5), F (6, 4)

Centroids after 1st iteration:
Cluster 1 centroid: (3.67, 9)
Cluster 2 centroid: (1.5, 3.5)
Cluster 3 centroid: (7, 4.33)

Clusters after 2nd iteration (no change):
Cluster 1: A (2, 10), D (5, 8), H (4, 9)
Cluster 2: B (2, 5), G (1, 2)
Cluster 3: C (8, 4), E (7, 5), F (6, 4)

Centroids after 2nd iteration (no change):
Cluster 1 centroid: (3.67, 9)
Cluster 2 centroid: (1.5, 3.5)
Cluster 3 centroid: (7, 4.33)

2.

Initial supports:
Beer: support = 9/10
Cola: support = 8/10
Nuts: support = 7/10
Crisps: support = 4/10 (drop Crisps)

Pair supports:
Beer, Cola: support = 7/10
Beer, Nuts: support = 6/10
Cola, Nuts: support = 6/10

Rules:
Beer -> Cola (support = 70%, confidence = 7/9 = 77.8%)
Cola -> Beer (support = 70%, confidence = 7/8 = 87.5%)
Beer -> Nuts (support = 60%, confidence = 6/9 = 66.7%)
Nuts -> Beer (support = 60%, confidence = 6/7 = 85.7%)
Cola -> Nuts (support = 60%, confidence = 6/8 = 75%)
Nuts -> Cola (support = 60%, confidence = 6/7 = 85.7%)



    4. The first cluster will be formed from G3 and G5 since they have the minimum distance.

    G35 G1 G2 G4

    G35 0

    G1 11 0

    G2 10 9 0

    G4 9 6 5 0
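The first merge and the updated matrix row can be verified mechanically from the original distance matrix (a sketch; the dictionary-based matrix representation is just for brevity):

```python
# Original distance matrix from question 4 (symmetric, zero diagonal),
# stored as upper-triangle entries.
D = {
    ("G1", "G2"): 9, ("G1", "G3"): 3, ("G1", "G4"): 6, ("G1", "G5"): 11,
    ("G2", "G3"): 7, ("G2", "G4"): 5, ("G2", "G5"): 10,
    ("G3", "G4"): 9, ("G3", "G5"): 2, ("G4", "G5"): 8,
}

def d(a, b):
    """Symmetric lookup into the triangular matrix."""
    return 0 if a == b else D.get((a, b), D.get((b, a)))

# The first merge joins the closest pair: G3 and G5 at distance 2.
pair = min(D, key=D.get)

# Complete linkage: the distance from the merged cluster {G3, G5} to
# each remaining gene is the *maximum* of the two original distances.
merged = {g: max(d("G3", g), d("G5", g)) for g in ["G1", "G2", "G4"]}
print(pair, merged)
```

This reproduces the G35 row above: 11 to G1, 10 to G2, and 9 to G4; the distances among G1, G2, and G4 are unchanged.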


[Dendrograms for the five points omitted: one for single linkage, one for complete linkage]