Clustering and NLP
Transcript of Clustering and NLP
CIS 8590 – Fall 2008 NLP
Clustering and NLP
Slides by me, Sidhartha Shakya, Pedro Domingos, D. Gunopulos, A.L. Yuille,
Andrew Moore, and others
Outline
• Clustering Overview
• Sample Clustering Techniques for NLP
  – K-means
  – Agglomerative
  – Model-based (EM)
Clustering Overview
What is clustering?
• Given a collection of objects, clustering is a procedure that detects the presence of distinct groups and assigns each object to a group.
[Figure: example data plot]
Another example
Why should we care about clustering?
• Clustering is a basic step in most data mining procedures:
Examples :
Clustering movie viewers for movie ranking.
Clustering proteins by their functionality.
Clustering text documents for content similarity.
Clustering is one of the most widely used tools for exploratory data analysis. The social sciences, biology, astronomy, computer science, and many other fields all apply clustering to gain a first understanding of the structure of large data sets.
Clustering as Data Exploration
“Clustering” is an ill defined problem
There are many different clustering tasks, leading to different clustering paradigms:
There are Many Clustering Tasks
Some more examples
[Figure: a 2-d data set]
[Figure: compact partitioning into two strata]
[Figure: unsupervised learning]
Issues
The clustering problem:
Given a set of objects, find groups of similar objects
1. What is similar?
   Define appropriate metrics.
2. What makes a good group?
   Groups that contain the highest average similarity between all pairs?
   Groups that are most separated from neighboring groups?
3. How can you evaluate a clustering algorithm?
Formal Definition
Given a data set S and a clustering “objective” function f, find a partition P of S that maximizes (or minimizes) f(P).
A partition is a set of subsets of S such that the subsets don’t intersect, and their union is equal to S.
Sample Objective Functions
• Objective 1: Minimize the average distance between points in the same cluster
• Objective 2: Maximize the margin (smallest distance) between neighboring clusters
• Objective 3 (Minimum Description Length): Minimize the number of bits needed to describe the clustering and the number of bits needed to describe the points in each cluster.
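Objective 1 can be made concrete in a few lines. A minimal sketch (NumPy; the function name and toy points are illustrative, not from the slides) that scores a clustering by the average distance between points sharing a cluster:

```python
import numpy as np

def avg_intra_cluster_distance(points, labels):
    """Objective 1: mean distance over all pairs of points in the same cluster."""
    total, count = 0.0, 0
    for c in np.unique(labels):
        members = points[labels == c]
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                total += np.linalg.norm(members[i] - members[j])
                count += 1
    return total / count if count else 0.0

# Two tight clusters score lower (better) than an arbitrary split of the same points.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
good = np.array([0, 0, 1, 1])
bad = np.array([0, 1, 0, 1])
assert avg_intra_cluster_distance(pts, good) < avg_intra_cluster_distance(pts, bad)
```

Note that minimizing this objective alone is degenerate (N singleton clusters score 0), one reason the "real" objective f is hard to pin down.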
More Issues
1. Having an objective function f gives a way of evaluating a clustering.
But the real f is usually not known!
2. Efficiency
   Comparing N points to each other means making O(N²) comparisons.
3. Curse of Dimensionality
   The more features in your data, the more likely the clustering algorithm is to get it wrong.
Clustering as “Unsupervised” Learning
X1 X2 X3 X4 Y
1 1 0 0 1
1 0 1 1 0
0 1 1 0 0
1 0 0 0 1
H = space of boolean functions
Input Output
f = X1 ∧ ¬X3 ∧ ¬X4
Clustering as “Unsupervised” Learning
X1 X2 X3 X4 Y
1 1 0 0 ?
1 0 1 1 ?
0 1 1 0 ?
1 0 0 0 ?
H = space of boolean functions
Input Output
f = X1 ∧ ¬X3 ∧ ¬X4
Clustering is just like ML, except…
Clustering as “Unsupervised” Learning
• Supervised learning has:
  – Labeled training examples
  – A space Y of possible labels
• Unsupervised learning has:
  – Unlabeled training examples
  – No information (or limited information) about the space of possible labels
Some Notes on Complexity
• The ML example used a space of Boolean functions of N Boolean variables
  – 2^(2^N) possible functions
  – But many possibilities are eliminated by the training data and assumptions
• How many possible clusterings?
  – ~K^N / K! partitions, for K clusters (K > 1)
  – No possibilities are eliminated by training data
  – Need to search for a good one efficiently!
Clustering Problem Formulation
• General Assumptions
  – Each data item is a tuple (vector)
  – Values of tuples are nominal, ordinal, or numerical
  – A similarity (or distance) function is provided
• For pure numerical tuples, for example:
  – Sim(di, dj) = Σk di,k · dj,k (dot product)
  – Sim(di, dj) = cos(di, dj)
  – …and many more (slide after next)
Similarity Measures in Data Analysis
• For Ordinal Values
  – E.g., "small," "medium," "large," "X-large"
  – Convert to numerical values, assuming constant spacing, on a normalized [0, 1] scale, where max(v) = 1, min(v) = 0, and the others interpolate
  – E.g., "small" = 0, "medium" = 0.33, etc.
  – Then, use numerical similarity measures
  – Or, use a similarity matrix (see next slide)
Similarity Measures (cont.)
• For Nominal Values
  – E.g., "Boston", "LA", "Pittsburgh", or "male", "female", or "diffuse", "globular", "spiral", "pinwheel"
  – Binary rule: if di,k = dj,k, then sim = 1, else 0
  – Use an underlying semantic property: e.g., Sim(Boston, LA) = dist(Boston, LA)^-1, or Sim(Boston, LA) = (|size(Boston) − size(LA)|) / max(size(cities))
  – Or, use a similarity matrix
Similarity Matrix

         tiny  little  small  medium  large  huge
tiny     1.0   0.8     0.7    0.5     0.2    0.0
little         1.0     0.9    0.7     0.3    0.1
small                  1.0    0.7     0.3    0.2
medium                        1.0     0.5    0.3
large                                 1.0    0.8
huge                                         1.0
– The diagonal must be 1.0
– The monotonicity property must hold
– No linearity (value interpolation) is assumed
– A qualitative transitive property must hold
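Such a matrix is naturally a lookup table. A minimal sketch (values copied from the matrix above; only the upper triangle is stored and symmetry supplies the rest; all names are illustrative):

```python
# The similarity matrix above, stored as its upper triangle.
SIM = {
    ("tiny", "tiny"): 1.0, ("tiny", "little"): 0.8, ("tiny", "small"): 0.7,
    ("tiny", "medium"): 0.5, ("tiny", "large"): 0.2, ("tiny", "huge"): 0.0,
    ("little", "little"): 1.0, ("little", "small"): 0.9, ("little", "medium"): 0.7,
    ("little", "large"): 0.3, ("little", "huge"): 0.1,
    ("small", "small"): 1.0, ("small", "medium"): 0.7, ("small", "large"): 0.3,
    ("small", "huge"): 0.2,
    ("medium", "medium"): 1.0, ("medium", "large"): 0.5, ("medium", "huge"): 0.3,
    ("large", "large"): 1.0, ("large", "huge"): 0.8,
    ("huge", "huge"): 1.0,
}

def sim(a, b):
    """Symmetric lookup: sim(a, b) == sim(b, a)."""
    return SIM.get((a, b), SIM.get((b, a)))
```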
Document Clustering Techniques
• Similarity or Distance Measure: Alternative Choices
  – Cosine similarity: sim(x, y) = (x · y) / (‖x‖ ‖y‖)
  – Euclidean distance: d(x, y) = √(Σk (xk − yk)²)
  – Kernel functions
  – Language modeling: P(y | modelx), where x and y are documents
Document Clustering Techniques
– Kullback-Leibler divergence ("relative entropy"): KL(p ‖ q) = Σk pk log(pk / qk)
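These measures are short enough to sketch directly (pure Python; the term-frequency vectors below are made-up toy documents, not from the slides):

```python
import math

def cosine_sim(x, y):
    """Cosine of the angle between two term vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def euclidean(x, y):
    """Straight-line distance between two term vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def kl_divergence(p, q):
    """Relative entropy D(p || q); assumes q[k] > 0 wherever p[k] > 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

# Toy term-frequency vectors for two short "documents"
d1, d2 = [2, 1, 0, 1], [1, 1, 1, 0]
print(round(cosine_sim(d1, d2), 3))   # → 0.707

# KL compares normalized term distributions; note it is not symmetric
p, q = [0.5, 0.25, 0.0, 0.25], [0.4, 0.3, 0.2, 0.1]
```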
Some Clustering Methods
• K-Means and K-medoids algorithms
  – CLARANS, [Ng and Han, VLDB 1994]
• Hierarchical algorithms
  – CURE, [Guha et al, SIGMOD 1998]
  – BIRCH, [Zhang et al, SIGMOD 1996]
  – CHAMELEON, [Karypis et al, IEEE Computer 32]
• Density-based algorithms
  – DENCLUE, [Hinneburg, Keim, KDD 1998]
  – DBSCAN, [Ester et al, KDD 96]
• Clustering with obstacles, [Tung et al, ICDE 2001]
K-Means
K-means and K-medoids algorithms
• Objective function: Minimize the sum of square distances of points to a cluster representative (centroid)
• Efficient iterative algorithms (O(n) per iteration)
K-Means Clustering
1. Select K seed centroids s.t. d(ci, cj) > dmin
2. Assign points to clusters by minimum distance to centroid:
   Cluster(pi) = argmin_{1 ≤ j ≤ K} d(pi, cj)
3. Compute new cluster centroids:
   ci = (1/ni) Σ_{pj ∈ Cluster i} pj
4. Iterate steps 2 & 3 until no points change clusters
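A minimal sketch of these four steps (Lloyd's algorithm in NumPy; note that the plain random seeding here skips step 1's d(ci, cj) > dmin check):

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Steps 1-4 above: seed, assign by nearest centroid, recompute means, repeat."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]  # step 1
    for _ in range(iters):
        # Step 2: Cluster(p_i) = argmin_j d(p_i, c_j)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: c_i = mean of the points assigned to cluster i
        new = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):  # step 4: stop when centroids stabilize
            break
        centroids = new
    return labels, centroids
```

Each iteration costs O(nK) distance computations, which is why the search feels linear even though the space of clusterings is exponential: it only follows one greedy path to a local optimum.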
K-Means Clustering: Initial Data Points
Step 1: Select k random seeds (k = 3) s.t. d(ci, cj) > dmin

K-Means Clustering: First-Pass Clusters
Step 2: Assign points to clusters by minimum distance:
Cluster(pi) = argmin_{1 ≤ j ≤ K} d(pi, cj)

K-Means Clustering: Seeds → Centroids
Step 3: Compute new cluster centroids:
ci = (1/ni) Σ_{pj ∈ Cluster i} pj

K-Means Clustering: Second-Pass Clusters
Step 4: Reassign points to clusters by minimum distance to the new centroids

K-Means Clustering: Iterate Until Stability
Compute new centroids, and so on.
Question
If space of possible clusterings is exponential, why is it that K-Means can find one in O(n) time?
Problems with K-means type algorithms
• Clusters are assumed to be approximately spherical
• High dimensionality is a problem
• The value of K is an input parameter
Agglomerative Clustering
Hierarchical Clustering
• Quadratic algorithms
• Running time can be improved using sampling [Guha et al, SIGMOD 1998] [Kollios et al, ICDE 2001]
Lecture 7 Information Retrieval and Digital Libraries Page 38
Hierarchical Agglomerative Clustering
• Create N single-document clusters
• For i in 1..N:
  merge the two clusters with greatest similarity
Hierarchical Agglomerative Clustering
Hierarchical agglomerative clustering gives a hierarchy of clusters
• This makes it easier to explore the set of possible k-cluster values to choose the best number of clusters
High density variations
• Intuitively “correct” clustering
• HAC-generated clusters
Document Clustering Techniques
• Example: group documents based on similarity. Similarity matrix:
  [similarity matrix over documents 1–6]
  Thresholding at a similarity value of .9 yields:
  – the complete graph C1 = {1, 4, 5}, namely Complete Linkage
  – the connected graph C2 = {1, 4, 5, 6}, namely Single Linkage
For clustering we need three things:
• A similarity measure for pairwise comparison between documents
• A clustering criterion (Complete Link, Single Link, …)
• A clustering algorithm
Document Clustering Techniques
• Clustering Criterion: Alternative Linkages
  – Single-link ("nearest neighbor"): sim(Ci, Cj) = max_{x ∈ Ci, y ∈ Cj} sim(x, y)
  – Complete-link: sim(Ci, Cj) = min_{x ∈ Ci, y ∈ Cj} sim(x, y)
  – Average-link ("group average clustering", or GAC): sim(Ci, Cj) = (1 / (|Ci| |Cj|)) Σ_{x ∈ Ci, y ∈ Cj} sim(x, y)
Hierarchical Agglomerative Clustering Methods
• Generic Agglomerative Procedure (Salton '89), resulting in nested clusters via iteration:
1. Compute all pairwise document-document similarity coefficients
2. Place each of the n documents into a class of its own
3. Merge the two most similar clusters into one:
   - replace the two clusters by the new cluster
   - recompute intercluster similarity scores w.r.t. the new cluster
4. Repeat step 3 until only k clusters are left (note k could be 1)
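The procedure above can be sketched naively in pure Python (O(n³) as written; the single-link criterion is shown; the `sim` function and 1-d toy data are stand-ins):

```python
def single_link_hac(docs, sim, k):
    """Generic agglomerative procedure with single-link similarity.
    docs: list of items; sim: pairwise similarity function; k: target cluster count."""
    clusters = [[d] for d in docs]           # step 2: one cluster per document
    while len(clusters) > k:                 # step 4: stop at k clusters
        best, pair = float("-inf"), None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single link: similarity of the closest pair across the two clusters
                s = max(sim(a, b) for a in clusters[i] for b in clusters[j])
                if s > best:
                    best, pair = s, (i, j)
        i, j = pair
        clusters[i] += clusters.pop(j)       # step 3: merge the most similar pair
    return clusters

# 1-d toy data, with similarity defined as negative distance
points = [0.0, 0.2, 0.3, 9.0, 9.1]
out = single_link_hac(points, lambda a, b: -abs(a - b), k=2)
```

Swapping `max` for `min` in the inner comparison gives complete-link, and averaging gives GAC.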
Group Agglomerative Clustering
[Figure: dendrogram over documents 1–9]
Expectation-Maximization
Clustering as Model Selection
Let’s look at clustering as a probabilistic modeling problem:
I have some set of clusters C1, C2, and C3. Each one has a certain probability distribution for generating points:
P(xi | C1), P(xi | C2), P(xi | C3)
Clustering as Model Selection
How can I determine which points belong to which cluster?
Cluster for xi = argmaxj P(xi | Cj)
So, all I need is to figure out what P(xi | Cj) is, for each i and j.
But without training data! How can I do that?
Super Simple Example
Coin I and Coin II. (Weighted.)
Pick a coin at random (uniform).
Flip it 4 times.
Repeat.
What are the parameters of the model?
Data

Coin I   Coin II
HHHT     TTTH
HTHH     THTT
HTTH     TTHT
THHH     HTHT
HHHH     HTTT
Probability of Data Given Model
p: probability of H from Coin I
q: probability of H from Coin II
Let's say h heads and t tails for Coin I, and h′ heads and t′ tails for Coin II.
Pr(D|M) = p^h (1−p)^t q^h′ (1−q)^t′
How do we maximize this quantity?
Maximizing p
d/dp (p^h (1−p)^t q^h′ (1−q)^t′) = 0
d/dp(p^h) · (1−p)^t + p^h · d/dp((1−p)^t) = 0
h p^{h−1} (1−p)^t = p^h t (1−p)^{t−1}
h (1−p) = p t
h = p t + h p
p = h / (h + t)
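The closed form p = h/(h+t) is easy to sanity-check numerically. A tiny sketch with made-up counts:

```python
# Sanity check: the likelihood p^h (1-p)^t peaks at p = h/(h+t).
h, t = 7, 3

def likelihood(p):
    return p ** h * (1 - p) ** t

grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=likelihood)
print(best)   # → 0.7, which equals h/(h+t)
```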
Duh…
Missing Data

HHHT  HTTH
TTTH  HTHH
THTT  HTTT
TTHT  HHHH
THHH  HTHT
Oh Boy, Now What!
If we knew the labels (which flips came from which coin), we could find ML values for p and q.
What could we use to compute labels?
p and q!
Computing Labels
p = 3/4, q = 3/10
Pr(Coin I | HHTH)
  = Pr(HHTH | Coin I) Pr(Coin I) / c
  = (3/4)^3 (1/4) (1/2) / c = .052734375 / c
Pr(Coin II | HHTH)
  = Pr(HHTH | Coin II) Pr(Coin II) / c
  = (3/10)^3 (7/10) (1/2) / c = .00945 / c
Expected Labels

        I    II           I    II
HHHT   .85  .15    HTTH  .44  .56
TTTH   .10  .90    HTHH  .85  .15
THTT   .10  .90    HTTT  .10  .90
TTHT   .10  .90    HHHH  .98  .02
THHH   .85  .15    HTHT  .44  .56
2 Unknowns
• We don’t know the labels (which coins generated which sequences), and we don’t know the probabilities for the coins
• If we knew the labels, we could calculate the probabilities
• If we knew the probabilities, we could calculate the labels
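The chicken-and-egg loop the slides are heading toward can be sketched for this exact two-coin setup (pure Python; the starting guesses p = 0.6, q = 0.5 are arbitrary):

```python
def em_two_coins(sequences, p=0.6, q=0.5, iters=50):
    """EM for the two-coin problem: alternate expected labels (E) and
    maximum-likelihood estimates (M). Assumes each coin is picked with
    probability 1/2, as on the slides."""
    n = len(sequences[0])
    heads = [s.count("H") for s in sequences]
    for _ in range(iters):
        # E-step: expected label w = Pr(Coin I | sequence) for each sequence
        w = []
        for h in heads:
            a = p ** h * (1 - p) ** (n - h)   # Pr(sequence | Coin I)
            b = q ** h * (1 - q) ** (n - h)   # Pr(sequence | Coin II)
            w.append(a / (a + b))
        # M-step: ML estimates of p and q from the fractionally labeled flips
        p = sum(wi * h for wi, h in zip(w, heads)) / (n * sum(w))
        q = sum((1 - wi) * h for wi, h in zip(w, heads)) / (n * sum(1 - wi for wi in w))
    return p, q

data = ["HHHT", "TTTH", "THTT", "TTHT", "THHH", "HTTH", "HTHH", "HTTT", "HHHH", "HTHT"]
p, q = em_two_coins(data)   # p drifts toward the heads-heavy coin, q toward the tails-heavy one
```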
Wait, I Have an Idea
Pick some model M0.
Expectation
• Compute expected labels via Mi
Maximization
• Compute ML model Mi+1
Repeat
Could This Work?
Expectation-Maximization (EM)
Theorem: Pr(D|Mi) will not decrease.
Sound familiar? It's a type of search.
GMMs – Gaussian Mixture Models
[Figure: 1000 data points scattered in the (w, h) plane]
Suppose we have 1000 data points in 2D space (w, h).
Assume each data point is normally distributed. Obviously, there are 5 sets of underlying gaussians.
The GMM assumption
• There are K components (gaussians)
• Each component j is specified by three parameters: weight, mean, covariance matrix:
  θj = {wj, μj, Σj},  0 ≤ wj ≤ 1,  Σ_{j=1..K} wj = 1
• The total density function is:
  f_θ(x) = Σ_{j=1..K} wj · (1 / ((2π)^{d/2} √det(Σj))) · exp(−(1/2) (x − μj)^T Σj^{−1} (x − μj))
The EM algorithm (Dempster, Laird and Rubin, 1977)
[Figure: raw data → fitted GMMs (K = 6) → total density function]
EM Basics
Objective: Given N data points, find the maximum likelihood estimate of θ:
  θ* = argmax_θ f_θ(x1, …, xN)
Algorithm:
1. Guess an initial θ
2. Perform E step (expectation): based on θ, associate each data point with a specific gaussian
3. Perform M step (maximization): based on the data-point clustering, maximize θ
4. Repeat 2–3 until convergence (typically tens of iterations)
EM Details
E-Step (estimate the probability that point t is associated with gaussian j):

  w_{t,j} = wj f(x_t | μj, Σj) / Σ_{i=1..K} wi f(x_t | μi, Σi),   j = 1,…,K,  t = 1,…,N

M-Step (estimate new parameters):

  wj_new = (1/N) Σ_{t=1..N} w_{t,j}

  μj_new = (Σ_{t=1..N} w_{t,j} x_t) / (Σ_{t=1..N} w_{t,j})

  Σj_new = (Σ_{t=1..N} w_{t,j} (x_t − μj_new)(x_t − μj_new)^T) / (Σ_{t=1..N} w_{t,j})
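The E and M steps above can be sketched in NumPy, simplified to one dimension so scalar variances stand in for the covariance matrices (initializing the means at the data extremes is an arbitrary choice for the sketch):

```python
import numpy as np

def gmm_em(x, k=2, iters=100):
    """EM for a 1-D gaussian mixture: the E and M steps above with scalar variances."""
    w = np.full(k, 1.0 / k)                  # mixture weights w_j
    mu = np.linspace(x.min(), x.max(), k)    # initial means
    var = np.full(k, x.var())                # initial variances
    for _ in range(iters):
        # E-step: responsibility w_{t,j} of gaussian j for point t
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = w * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        nj = resp.sum(axis=0)
        w = nj / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nj
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nj
    return w, mu, var
```

The full d-dimensional version replaces the scalar density with the multivariate gaussian from the GMM slide and the variance update with the outer-product covariance update.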
EM Example
[Figure: gaussian j, data point t; blue shading shows w_{t,j}]
Back to Clustering
We want to label "close" pixels with the same label.
Proposed metric: label pixels from the same gaussian with the same label.
Label according to max probability:
  label(t) = argmax_j w_{t,j}
The number of labels = K.