Clustering and NLP


Transcript of Clustering and NLP

Page 1: Clustering and NLP

CIS 8590 – Fall 2008 NLP

Clustering and NLP

Slides by me, Sidhartha Shakya, Pedro Domingos, D. Gunopulos, A.L. Yuille,

Andrew Moore, and others

Page 2: Clustering and NLP

Outline

• Clustering Overview

• Sample Clustering Techniques for NLP
  – K-means
  – Agglomerative
  – Model-based (EM)


Page 3: Clustering and NLP

Clustering Overview


Page 4: Clustering and NLP

What is clustering?

• Given a collection of objects, clustering is a procedure that detects the presence of distinct groups and assigns objects to those groups.


Page 5: Clustering and NLP

Another example

Page 6: Clustering and NLP

Why should we care about clustering?

• Clustering is a basic step in most data mining procedures:

Examples :

Clustering movie viewers for movie ranking.

Clustering proteins by their functionality.

Clustering text documents for content similarity.

Page 7: Clustering and NLP

Clustering as Data Exploration

Clustering is one of the most widely used tools for exploratory data analysis. The social sciences, biology, astronomy, computer science, and many other fields all apply clustering to gain a first understanding of the structure of large data sets.

Page 8: Clustering and NLP

“Clustering” is an ill-defined problem

There are many different clustering tasks, leading to different clustering paradigms:

There are Many Clustering Tasks


Page 10: Clustering and NLP

Some more examples

(figure panels: a 2-d data set; a compact partitioning into two strata; unsupervised learning)

Page 11: Clustering and NLP

Issues

The clustering problem:

Given a set of objects, find groups of similar objects

1. What is similar?

Define appropriate metrics

2. What makes a good group?

Groups that contain the highest average similarity between all pairs?

Groups that are most separated from neighboring groups?

3. How can you evaluate a clustering algorithm?

Page 12: Clustering and NLP

Formal Definition

Given a data set S and a clustering “objective” function f, find a partition P of S that maximizes (or minimizes) f(P).

A partition is a set of subsets of S such that the subsets don’t intersect, and their union is equal to S.


Page 13: Clustering and NLP

Sample Objective Functions

• Objective 1: Minimize the average distance between points in the same cluster

• Objective 2: Maximize the margin (smallest distance) between neighboring clusters

• Objective 3 (Minimum Description Length): Minimize the number of bits needed to describe the clustering and the number of bits needed to describe the points in each cluster.


Page 14: Clustering and NLP

More Issues

1. Having an objective function f gives a way of evaluating a clustering.

But the real f is usually not known!

2. Efficiency: comparing N points to each other means making O(N^2) comparisons.

3. Curse of dimensionality: the more features in your data, the more likely the clustering algorithm is to get it wrong.

Page 15: Clustering and NLP

Clustering as “Unsupervised” Learning

X1 X2 X3 X4 | Y
 1  1  0  0 | 1
 1  0  1  1 | 0
 0  1  1  0 | 0
 1  0  0  0 | 1


H = space of boolean functions

Input Output

f = X1 ∧ ¬X3 ∧ ¬X4

Page 16: Clustering and NLP

Clustering as “Unsupervised” Learning

X1 X2 X3 X4 | Y
 1  1  0  0 | ?
 1  0  1  1 | ?
 0  1  1  0 | ?
 1  0  0  0 | ?


H = space of boolean functions

Input Output

f = X1 ∧ ¬X3 ∧ ¬X4

Clustering is just like ML, except ….:

Page 17: Clustering and NLP

Clustering as “Unsupervised” Learning

• Supervised learning has:
  – Labeled training examples
  – A space Y of possible labels

• Unsupervised learning has:
  – Unlabeled training examples
  – No information (or limited information) about the space of possible labels


Page 18: Clustering and NLP

Some Notes on Complexity

• The ML example used a space of Boolean functions of N Boolean variables
  – 2^(2^N) possible functions
  – But many possibilities are eliminated by training data and assumptions

• How many possible clusterings?
  – Roughly K^N / K!, for K clusters (K > 1)
  – No possibilities are eliminated by training data
  – Need to search for a good one efficiently!


Page 19: Clustering and NLP

Clustering Problem Formulation

• General Assumptions
  – Each data item is a tuple (vector)
  – Values of tuples are nominal, ordinal, or numerical
  – A similarity (or distance) function is provided

• For pure numerical tuples, for example:
  – Sim(d_i, d_j) = Σ_k d_{i,k} · d_{j,k} (inner product)
  – Sim(d_i, d_j) = cos(d_i, d_j)
  – …and many more (slide after next)
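As a quick illustration of the two numerical similarity measures above, here is a minimal NumPy sketch (the vectors are made-up examples, not from the slides):

import numpy as np

def dot_similarity(di, dj):
    # Inner-product similarity: sum over k of d_{i,k} * d_{j,k}
    return float(np.dot(di, dj))

def cosine_similarity(di, dj):
    # Cosine of the angle between the two document vectors
    return float(np.dot(di, dj) / (np.linalg.norm(di) * np.linalg.norm(dj)))

di = np.array([1.0, 0.0, 2.0])
dj = np.array([2.0, 1.0, 0.0])
print(dot_similarity(di, dj))     # 2.0
print(cosine_similarity(di, dj))  # 2 / (sqrt(5) * sqrt(5)) = 0.4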

Page 20: Clustering and NLP

Similarity Measures in Data Analysis

• For Ordinal Values
  – E.g. "small," "medium," "large," "X-large"
  – Convert to numerical, assuming constant spacing, on a normalized [0,1] scale, where max(v)=1, min(v)=0, and the others interpolate
  – E.g. "small"=0, "medium"=0.33, etc.
  – Then, use numerical similarity measures
  – Or, use a similarity matrix (see next slide)

Page 21: Clustering and NLP

Similarity Measures (cont.)

• For Nominal Values
  – E.g. "Boston", "LA", "Pittsburgh", or "male", "female", or "diffuse", "globular", "spiral", "pinwheel"
  – Binary rule: if d_{i,k} = d_{j,k}, then sim = 1, else 0
  – Or use an underlying semantic property, e.g. Sim(Boston, LA) = dist(Boston, LA)^-1, or Sim(Boston, LA) = |size(Boston) - size(LA)| / max(size(cities))
  – Or, use a similarity matrix

Page 22: Clustering and NLP

Similarity Matrix

         tiny  little  small  medium  large  huge
tiny     1.0   0.8     0.7    0.5     0.2    0.0
little         1.0     0.9    0.7     0.3    0.1
small                  1.0    0.7     0.3    0.2
medium                        1.0     0.5    0.3
large                                 1.0    0.8
huge                                         1.0

– Diagonal must be 1.0
– Monotonicity property must hold
– No linearity (value interpolation) assumed
– Qualitative transitive property must hold

Page 23: Clustering and NLP

Document Clustering Techniques

• Similarity or Distance Measure: Alternative Choices
  – Cosine similarity
  – Euclidean distance
  – Kernel functions, e.g.,
  – Language modeling: P(y | model_x), where x and y are documents

Page 24: Clustering and NLP

Document Clustering Techniques

– Kullback-Leibler distance ("relative entropy")
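The formula on this slide did not survive extraction; for reference, the standard definition of the KL divergence between term distributions p and q is:

D_{\mathrm{KL}}(p \,\|\, q) = \sum_{w} p(w) \log \frac{p(w)}{q(w)}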

Page 25: Clustering and NLP

Some Clustering Methods

• K-Means and K-medoids algorithms:
  – CLARANS [Ng and Han, VLDB 1994]

• Hierarchical algorithms
  – CURE [Guha et al, SIGMOD 1998]
  – BIRCH [Zhang et al, SIGMOD 1996]
  – CHAMELEON [Karypis et al, IEEE Computer, 32]

• Density-based algorithms
  – DENCLUE [Hinneburg and Keim, KDD 1998]
  – DBSCAN [Ester et al, KDD 96]

• Clustering with obstacles, [Tung et al, ICDE 2001]

Page 26: Clustering and NLP

K-Means


Page 27: Clustering and NLP

K-means and K-medoids algorithms

• Objective function: Minimize the sum of squared distances of points to a cluster representative (centroid)

• Efficient iterative algorithms (O(n))
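In symbols (standard notation, not taken verbatim from the slide), the objective is:

\min_{C_1, \ldots, C_K} \sum_{j=1}^{K} \sum_{x \in C_j} \| x - c_j \|^2, \quad \text{where } c_j = \frac{1}{|C_j|} \sum_{x \in C_j} x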

Page 28: Clustering and NLP

K-Means Clustering

1. Select K seed centroids s.t. d(ci,cj) > dmin

2. Assign points to clusters by minimum distance to centroid

3. Compute new cluster centroids:

4. Iterate steps 2 & 3 until no points change clusters

Centroid update (step 3):

c_j = \frac{1}{n_j} \sum_{p_i \in \mathrm{Cluster}(j)} p_i

Cluster assignment (step 2):

\mathrm{Cluster}(p_i) = \operatorname{argmin}_{1 \le j \le K} d(p_i, c_j)
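A minimal NumPy sketch of the four steps above (it uses simple random seeding rather than the d_min check of step 1; the function and variable names are my own):

import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K distinct data points as seed centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)].copy()
    assign = None
    for _ in range(max_iters):
        # Step 2: assign each point to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # Step 4: no point changed clusters
        assign = new_assign
        # Step 3: recompute each centroid as the mean of its assigned points
        for j in range(K):
            if np.any(assign == j):
                centroids[j] = X[assign == j].mean(axis=0)
    return assign, centroids

# Example: three well-separated 2-d blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(50, 2)) for m in (0.0, 3.0, 6.0)])
labels, centers = kmeans(X, K=3)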

Page 29: Clustering and NLP

Initial Seeds (k=3)

Step 1: Select k random seeds s.t. d(ci,cj) > dmin

K-Means Clustering: Initial Data Points

Page 30: Clustering and NLP

Initial Seeds

Step 2: Assign points to clusters by min dist.

K-Means Clustering: First-Pass Clusters

\mathrm{Cluster}(p_i) = \operatorname{argmin}_{1 \le j \le K} d(p_i, c_j)

Page 31: Clustering and NLP

New Centroids

Step 3: Compute new cluster centroids:

K-Means Clustering: Seeds Centroids

c_j = \frac{1}{n_j} \sum_{p_i \in \mathrm{Cluster}(j)} p_i

Page 32: Clustering and NLP

Centroids

Step 4: Recompute

K-Means Clustering: Second Pass Clusters

\mathrm{Cluster}(p_i) = \operatorname{argmin}_{1 \le j \le K} d(p_i, c_j)

Page 33: Clustering and NLP

New Centroids

And so on.

K-Means Clustering: Iterate Until Stability

Page 34: Clustering and NLP

Question

If the space of possible clusterings is exponential, why is it that K-means can find one in O(n) time?


Page 35: Clustering and NLP

Problems with K-means type algorithms

• Clusters are assumed to be approximately spherical

• High dimensionality is a problem

• The value of K is an input parameter

Page 36: Clustering and NLP

Agglomerative Clustering


Page 37: Clustering and NLP

Hierarchical Clustering

• Quadratic algorithms
• Running time can be improved using sampling [Guha et al, SIGMOD 1998] [Kollios et al, ICDE 2001]

Page 38: Clustering and NLP


Hierarchical Agglomerative Clustering

• Create N single-document clusters

• For i in 1..N: merge the two clusters with the greatest similarity


Page 41: Clustering and NLP


Hierarchical Agglomerative Clustering

Hierarchical agglomerative clustering gives a hierarchy of clusters

• This makes it easier to explore different values of k and choose the best number of clusters


Page 42: Clustering and NLP


High density variations

• Intuitively “correct” clustering

Page 43: Clustering and NLP


High density variations

• Intuitively “correct” clustering

• HAC-generated clusters

Page 44: Clustering and NLP

Document Clustering Techniques

• Example: group documents based on similarity. Similarity matrix:

Thresholding at a similarity value of .9 yields:
  – complete graph C1 = {1,4,5} (Complete Linkage)
  – connected graph C2 = {1,4,5,6} (Single Linkage)

For clustering we need three things:
• A similarity measure for pairwise comparison between documents
• A clustering criterion (complete link, single link, …)
• A clustering algorithm

Page 45: Clustering and NLP

Document Clustering Techniques

• Clustering Criterion: Alternative Linkages
  – Single-link ("nearest neighbor"):
  – Complete-link:
  – Average-link ("group average clustering", or GAC):

Page 46: Clustering and NLP

Hierarchical Agglomerative Clustering Methods

• Generic Agglomerative Procedure (Salton '89): results in nested clusters via iterations

1. Compute all pairwise document-document similarity coefficients
2. Place each of the n documents into a class of its own
3. Merge the two most similar clusters into one
   – replace the two clusters by the new cluster
   – recompute intercluster similarity scores w.r.t. the new cluster
4. Repeat step 3 until there are only k clusters left (note k could = 1)

(A code sketch of this procedure follows.)
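A minimal, unoptimized sketch of the generic agglomerative procedure using average-link similarity over a precomputed similarity matrix (cubic time as written; names are my own, not from the slides):

import numpy as np

def agglomerative(sim, k):
    # sim: n x n pairwise document-document similarity matrix (step 1, precomputed)
    n = sim.shape[0]
    clusters = [[i] for i in range(n)]           # step 2: one cluster per document
    while len(clusters) > k:                     # step 4: stop at k clusters
        best, best_pair = -np.inf, None
        for a in range(len(clusters)):           # step 3: find the two most similar clusters
            for b in range(a + 1, len(clusters)):
                # average-link similarity between clusters a and b
                s = np.mean([sim[i, j] for i in clusters[a] for j in clusters[b]])
                if s > best:
                    best, best_pair = s, (a, b)
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]  # merge: replace the two clusters by their union
        del clusters[b]                          # intercluster scores are recomputed on the next pass
    return clusters

# Example: sim could be cosine similarities between tf-idf document vectors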

Page 47: Clustering and NLP

Group Agglomerative Clustering


Page 48: Clustering and NLP

Expectation-Maximization


Page 49: Clustering and NLP

Clustering as Model Selection

Let’s look at clustering as a probabilistic modeling problem:

I have some set of clusters C1, C2, and C3. Each one has a certain probability distribution for generating points:

P(xi | C1), P(xi | C2), P(xi | C3)


Page 50: Clustering and NLP

Clustering as Model Selection

How can I determine which points belong to which cluster?

Cluster for x_i = argmax_j P(x_i | C_j)

So, all I need is to figure out what P(x_i | C_j) is, for each i and j.

But without training data! How can I do that?


Page 51: Clustering and NLP

Super Simple Example

Coin I and Coin II. (Weighted.)

Pick a coin at random (uniform).

Flip it 4 times.

Repeat.

What are the parameters of the model?

Page 52: Clustering and NLP

Data

Coin I   Coin II

HHHT TTTH

HTHH THTT

HTTH TTHT

THHH HTHT

HHHH HTTT

Page 53: Clustering and NLP

Probability of Data Given Model

p: probability of H from Coin I

q: probability of H from Coin II

Let's say h heads and t tails for Coin I; h' and t' for Coin II.

Pr(D|M) = p^h (1-p)^t q^{h'} (1-q)^{t'}

How do we maximize this quantity?

Page 54: Clustering and NLP

Maximizing pDp(ph (1-p)t qh’ (1-q)t’ ) = 0

Dp(ph)(1-p)t + ph Dp((1-p)t) = 0

h ph-1 (1-p)t = ph t(1-p)t-1

h (1-p) = p t

h = p t + hp

h/(t+h) = p

Duh…Duh…

Page 55: Clustering and NLP

Missing Data

HHHT HTTH

TTTH HTHH

THTT HTTT

TTHT HHHH

THHH HTHT

Page 56: Clustering and NLP

Oh Boy, Now What!

If we knew the labels (which flips came from which coin), we could find ML values for p and q.

What could we use to label?

p and q!

Page 57: Clustering and NLP

Computing Labels

p = 3/4, q = 3/10

Pr(Coin I | HHTH)

= Pr(HHTH | Coin I) Pr(Coin I) / c

= (3/4)^3 (1/4) (1/2) / c = .052734375 / c

Pr(Coin II | HHTH)

= Pr(HHTH | Coin II) Pr(Coin II) / c

= (3/10)^3 (7/10) (1/2) / c = .00945 / c

Page 58: Clustering and NLP

Expected Labels

        I    II              I    II
HHHT   .85  .15     HTTH   .44  .56
TTTH   .10  .90     HTHH   .85  .15
THTT   .10  .90     HTTT   .10  .90
TTHT   .10  .90     HHHH   .98  .02
THHH   .85  .15     HTHT   .44  .56

Page 59: Clustering and NLP

2 Unknowns

• We don’t know the labels (which coins generated which sequences), and we don’t know the probabilities for the coins

• If we knew the labels, we could calculate the probabilities

• If we knew the probabilities, we could calculate the labels


Page 60: Clustering and NLP

Wait, I Have an Idea

Pick some model M_0

Expectation
• Compute expected labels via M_i

Maximization
• Compute ML model M_{i+1}

Repeat (a sketch for the coin example follows)
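A minimal sketch of this loop for the two-coin example (uniform 1/2 prior on the coin choice; variable names are my own, not from the slides):

def likelihood(seq, r):
    # Probability of a 4-flip sequence given P(H) = r
    h = seq.count("H")
    return r ** h * (1 - r) ** (len(seq) - h)

def em(sequences, p=0.75, q=0.3, iters=20):
    for _ in range(iters):
        # E-step: expected label Pr(Coin I | sequence) for each sequence
        w = []
        for s in sequences:
            a = 0.5 * likelihood(s, p)   # Coin I, prior 1/2
            b = 0.5 * likelihood(s, q)   # Coin II, prior 1/2
            w.append(a / (a + b))
        # M-step: maximum-likelihood p and q from the softly-labeled heads and tails
        h1 = sum(wi * s.count("H") for wi, s in zip(w, sequences))
        t1 = sum(wi * s.count("T") for wi, s in zip(w, sequences))
        h2 = sum((1 - wi) * s.count("H") for wi, s in zip(w, sequences))
        t2 = sum((1 - wi) * s.count("T") for wi, s in zip(w, sequences))
        p, q = h1 / (h1 + t1), h2 / (h2 + t2)
    return p, q

data = ["HHHT", "TTTH", "THTT", "TTHT", "THHH", "HTTH", "HTHH", "HTTT", "HHHH", "HTHT"]
print(em(data))  # p drifts toward the mostly-heads coin, q toward the mostly-tails coin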

Page 61: Clustering and NLP

Could This Work?

Expectation-Maximization (EM)

Theorem: Pr(D | M_i) will not decrease.

Sound familiar? It's a type of (local) search.

Page 62: Clustering and NLP


GMMs – Gaussian Mixture Models

(figure: scatter plot of the data, axes W and H)

Suppose we have 1000 data points in 2D space (w,h)

Page 63: Clustering and NLP


GMMs – Gaussian Mixture Models

Assume each data point is normally distributed. Obviously, there are 5 underlying Gaussians.

Page 64: Clustering and NLP


The GMM assumption

There are K components (Gaussians). Each component j is specified by three parameters: a weight \alpha_j, a mean \mu_j, and a covariance matrix \Sigma_j, i.e. \theta_j = \{\alpha_j, \mu_j, \Sigma_j\}, with 0 \le \alpha_j \le 1 and \sum_{j=1}^{K} \alpha_j = 1.

The total density function is:

f(x) = \sum_{j=1}^{K} \alpha_j \, \frac{1}{(2\pi)^{d/2} \sqrt{\det(\Sigma_j)}} \exp\!\left( -\tfrac{1}{2} (x - \mu_j)^T \Sigma_j^{-1} (x - \mu_j) \right)

Page 65: Clustering and NLP


The EM algorithm (Dempster, Laird and Rubin, 1977)

(figure panels: raw data; GMMs (K = 6); total density function)

Page 66: Clustering and NLP


EM Basics

Objective: given N data points, find the maximum likelihood estimate of \theta:

\theta^* = \arg\max_{\theta} f_{\theta}(x_1, \ldots, x_N)

Algorithm:
1. Guess an initial \theta
2. Perform E step (expectation): based on \theta, associate each data point with a specific Gaussian
3. Perform M step (maximization): based on the data-point clustering, re-estimate \theta to maximize the likelihood
4. Repeat 2-3 until convergence (typically tens of iterations)

Page 67: Clustering and NLP


EM Details

E-Step (estimate the probability that point t is associated with Gaussian j):

w_{t,j} = \frac{\alpha_j f(x_t; \theta_j)}{\sum_{i=1}^{K} \alpha_i f(x_t; \theta_i)}, \quad j = 1, \ldots, K, \quad t = 1, \ldots, N

M-Step (estimate new parameters):

\alpha_j^{new} = \frac{1}{N} \sum_{t=1}^{N} w_{t,j}

\mu_j^{new} = \frac{\sum_{t=1}^{N} w_{t,j} \, x_t}{\sum_{t=1}^{N} w_{t,j}}

\Sigma_j^{new} = \frac{\sum_{t=1}^{N} w_{t,j} \, (x_t - \mu_j^{new})(x_t - \mu_j^{new})^T}{\sum_{t=1}^{N} w_{t,j}}
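A minimal NumPy sketch of one EM iteration for a GMM, following the E-step and M-step formulas above (function names and the toy data are my own, not from the slides):

import numpy as np

def gaussian_pdf(X, mu, Sigma):
    # Multivariate normal density evaluated at every row of X
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    norm = 1.0 / (np.power(2 * np.pi, d / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm * np.exp(-0.5 * np.einsum('nd,de,ne->n', diff, inv, diff))

def em_step(X, alphas, mus, Sigmas):
    N, K = X.shape[0], len(alphas)
    # E-step: responsibilities w[t, j] = alpha_j f(x_t) / sum_i alpha_i f(x_t)
    w = np.column_stack([alphas[j] * gaussian_pdf(X, mus[j], Sigmas[j]) for j in range(K)])
    w /= w.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means, and covariances
    Nj = w.sum(axis=0)
    new_alphas = Nj / N
    new_mus = [(w[:, j, None] * X).sum(axis=0) / Nj[j] for j in range(K)]
    new_Sigmas = []
    for j in range(K):
        diff = X - new_mus[j]
        new_Sigmas.append((w[:, j, None, None] * np.einsum('nd,ne->nde', diff, diff)).sum(axis=0) / Nj[j])
    return new_alphas, new_mus, new_Sigmas, w

# Toy run with K = 2 well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
alphas, mus, Sigmas = np.array([0.5, 0.5]), [X[0], X[-1]], [np.eye(2), np.eye(2)]
for _ in range(50):
    alphas, mus, Sigmas, w = em_step(X, alphas, mus, Sigmas)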

Page 68: Clustering and NLP


EM Example

(figure: Gaussian j and data point t; blue shading indicates w_{t,j})

Page 69: Clustering and NLP


EM Example

Page 70: Clustering and NLP


EM Example

Page 71: Clustering and NLP


EM Example

Page 72: Clustering and NLP


EM Example

Page 73: Clustering and NLP


EM Example

Page 74: Clustering and NLP


EM Example

Page 75: Clustering and NLP


EM Example

Page 76: Clustering and NLP


Back to Clustering

We want to label "close" pixels with the same label.

Proposed approach: label pixels from the same Gaussian with the same label.

Label according to max probability:

Number of labels = K

\mathrm{label}(t) = \arg\max_{j} \, w_{t,j}
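Given the responsibility matrix from the EM sketch earlier (assumed here to be an N x K NumPy array w), this labeling rule is one line:

labels = w.argmax(axis=1)   # one label per point; at most K distinct labels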