
Page 1

Information Retrieval and Organisation

Chapter 17: Hierarchical Clustering

Dell Zhang, Birkbeck, University of London

Page 2

Hierarchical Clustering

Build a tree-like hierarchical taxonomy (dendrogram) from a set of unlabeled documents.

[Figure: example taxonomy tree. The root "animal" splits into "vertebrate" (fish, reptile, amphibian, mammal) and "invertebrate" (worm, insect, crustacean).]

Page 3

Dendrogram – Example: http://dir.yahoo.com/science

[Figure: part of the Yahoo! science directory drawn as a dendrogram. Top-level categories: agriculture, biology, physics, CS, space; lower-level clusters include dairy, crops, agronomy, forestry, AI, HCI, craft, missions, botany, evolution, cell, magnetism, relativity, courses.]

Page 4

Dendrogram – Example

Clusters of News Stories: Reuters RCV1

Page 5

Dendrogram Clusters

A clustering can be obtained by cutting the dendrogram at a desired level: each connected component then forms a cluster.

The number of clusters does not need to be specified in advance.
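As an illustration (not from the slides), the sketch below builds a dendrogram over a few made-up 2-D document vectors with SciPy and then cuts it at an assumed distance threshold of 1.0; the data and the threshold are arbitrary choices for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D "document" vectors (made up for the example).
docs = np.array([[0.0, 0.1], [0.1, 0.0],
                 [1.0, 1.1], [1.1, 1.0]])

# Build the dendrogram (merge history) with single-link HAC.
Z = linkage(docs, method='single')

# Cut the dendrogram at distance 1.0: every connected component below the cut
# becomes one cluster, so the number of clusters is not fixed in advance.
labels = fcluster(Z, t=1.0, criterion='distance')
print(labels)   # e.g. [1 1 2 2]: two clusters emerge from this cut
```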

Page 6

Divisive vs. Agglomerative

Divisive (Top-Down)
Start with all documents belonging to the same cluster. Eventually each document forms a cluster of its own.
Achieved by recursive application of a (flat) partitional clustering algorithm, e.g., k-means with k = 2 (bisecting k-means); see the sketch below.

Agglomerative (Bottom-Up)
Start with each document being a single cluster. Eventually all documents belong to the same cluster.
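As a rough sketch of the divisive route (my own illustration, not the slides' code), bisecting k-means can be written as a loop that repeatedly splits the largest cluster with a k = 2 k-means; the toy data and the stopping rule (a fixed target number of clusters) are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, num_clusters):
    clusters = [np.arange(len(X))]        # start: one cluster holding all docs
    while len(clusters) < num_clusters:
        # Pick the largest current cluster and split it into two with k-means.
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(X[members])
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters

X = np.random.rand(20, 5)                               # 20 toy document vectors
print([len(c) for c in bisecting_kmeans(X, 4)])         # sizes of the 4 clusters
```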

Page 7

HAC Algorithm

Hierarchical Agglomerative Clustering starts with each document in a separate cluster, then repeats until there is only one cluster:

Among the current clusters, determine the closest pair of clusters, ci and cj.
Merge ci and cj into a single cluster.

The history of merging forms a binary tree or hierarchy (dendrogram).
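A minimal, naive sketch of this loop (my own illustration, assuming single-link as the closeness criterion and a precomputed similarity matrix):

```python
import numpy as np

def naive_hac(sim):
    # sim: n x n document similarity matrix; returns the merge history.
    n = len(sim)
    clusters = {i: [i] for i in range(n)}   # each doc starts as its own cluster
    merges = []                              # the dendrogram as a list of merges
    while len(clusters) > 1:
        # Find the closest pair of clusters (single-link: max member similarity).
        ci, cj = max(((a, b) for a in clusters for b in clusters if a < b),
                     key=lambda p: max(sim[x][y] for x in clusters[p[0]]
                                                 for y in clusters[p[1]]))
        clusters[ci] = clusters[ci] + clusters.pop(cj)   # merge cj into ci
        merges.append((ci, cj))
    return merges

sim = np.array([[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]])
print(naive_hac(sim))   # e.g. [(0, 1), (0, 2)]
```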

Page 8

HAC Alg.

Page 9

Time Complexity

In the initialization step, compute the similarity of all pairs of the n documents, which is O(n²).

In each of the subsequent n - 2 merging iterations, compute the similarity between the most recently created cluster and all other existing clusters.

The overall time complexity is often O(n³) if done naively (each merge rescans the O(n²) similarity matrix to find the closest pair), or O(n² log n) if done more cleverly using a priority queue (O(n²) heap operations at O(log n) each).
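To make the priority-queue claim concrete, here is a sketch (my own, assuming single-link, where cluster similarities only grow as merges happen): every pairwise similarity is pushed onto a heap once, each merge pushes O(n) updated entries, and stale entries are skipped lazily when popped, giving O(n²) heap operations at O(log n) each.

```python
import heapq
import numpy as np

def pq_hac_single_link(sim):
    n = len(sim)
    # Push all O(n^2) pairwise similarities once (max-heap via negation).
    heap = [(-sim[i][j], i, j) for i in range(n) for j in range(i + 1, n)]
    heapq.heapify(heap)
    active = set(range(n))
    merges = []
    while len(active) > 1:
        neg_s, i, j = heapq.heappop(heap)       # most similar pair on the heap
        if i not in active or j not in active:
            continue                            # stale entry: skip it lazily
        merges.append((i, j))
        active.remove(j)                        # cluster i absorbs cluster j
        for k in active - {i}:                  # single-link update: take the max
            s = max(sim[i][k], sim[j][k])
            sim[i][k] = sim[k][i] = s
            heapq.heappush(heap, (-s, min(i, k), max(i, k)))
    return merges

sim = np.array([[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]])
print(pq_hac_single_link(sim))                  # e.g. [(0, 1), (0, 2)]
```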

Page 10

HAC Variants

How to define the closest pair of clusters?

Single-Link: maximum similarity between pairs of docs
Complete-Link: minimum similarity between pairs of docs
Average-Link: average similarity between pairs of docs
Centroid: maximum similarity between cluster centroids
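These four criteria correspond directly to the `method` argument of SciPy's linkage function; a quick sketch on made-up data (note that SciPy works with distances, Euclidean by default, rather than cosine similarities):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.rand(10, 3)               # 10 toy document vectors

for method in ('single', 'complete', 'average', 'centroid'):
    Z = linkage(X, method=method)       # (n-1) x 4 merge table (the dendrogram)
    print(method, Z[-1, 2])             # distance at which the final merge happens
```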

Page 11

Page 12

Single-Link

The similarity between a pair of clusters is defined by the single strongest link (i.e., maximum cosine similarity) between their members:

$\mathrm{sim}(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} \mathrm{sim}(x, y)$

After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:

$\mathrm{sim}((c_i \cup c_j), c_k) = \max\big(\mathrm{sim}(c_i, c_k),\ \mathrm{sim}(c_j, c_k)\big)$
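In practice this update rule means single-link HAC can maintain a cluster-level similarity matrix and just take element-wise maxima whenever two clusters merge; a small sketch of that step (my own illustration):

```python
import numpy as np

def merge_single_link(S, i, j):
    # Single-link update: the merged cluster's similarity to every other
    # cluster k is the maximum of the two old similarities.
    S[i, :] = np.maximum(S[i, :], S[j, :])
    S[:, i] = S[i, :]
    # Drop row/column j: cluster j no longer exists on its own.
    return np.delete(np.delete(S, j, axis=0), j, axis=1)

S = np.array([[1.0, 0.9, 0.3],
              [0.9, 1.0, 0.6],
              [0.3, 0.6, 1.0]])
print(merge_single_link(S, 0, 1))   # similarity of the merged cluster to the third is 0.6
```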

Page 13

HAC – Example

[Figure: five documents d1–d5 being clustered. First d1 and d2 merge, then d4 and d5, then d3 joins {d4, d5} to form {d3, d4, d5}.]

As clusters agglomerate, documents fall into a dendrogram.

Page 14

HAC – Example

Single-Link

Page 15

HAC – Exercise

[Exercise figure: points labelled "Digital Camera", "Megapixel", "Zoom".]

Page 16

HAC – Exercise

Page 17

Chaining Effect

Single-Link HAC can result in “straggly” (long and thin) clusters due to the chaining effect.
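To see the chaining effect concretely, here is a small sketch (my own, with made-up data): for points spaced evenly along a line, single-link chains everything into one long cluster, while complete-link keeps the clusters compact when the dendrogram is cut at the same distance.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

line = np.array([[i, 0.0] for i in range(10)])   # 10 points in a row, 1 apart

for method in ('single', 'complete'):
    Z = linkage(line, method=method)
    labels = fcluster(Z, t=1.5, criterion='distance')
    # single-link: 1 straggly cluster spanning the whole line;
    # complete-link: several small, compact clusters.
    print(method, len(set(labels)), 'clusters')
```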