Clustering and NLP
Transcript of Clustering and NLP
CIS 8590 – Fall 2008 NLP
Clustering and NLP
Slides by me, Sidhartha Shakya, Pedro Domingos, D. Gunopulos, A.L. Yuille,
Andrew Moore, and others
Outline
• Clustering Overview
• Sample Clustering Techniques for NLP
  – K-means
  – Agglomerative
  – Model-based (EM)
Clustering Overview
What is clustering?
• Given a collection of objects, clustering is a procedure that detects the presence of distinct groups and assigns each object to a group.
[Figure: example data plot]
Another example
Why should we care about clustering?
• Clustering is a basic step in most data mining procedures:
Examples :
Clustering movie viewers for movie ranking.
Clustering proteins by their functionality.
Clustering text documents for content similarity.
Clustering is one of the most widely used tools for exploratory data analysis. The social sciences, biology, astronomy, computer science, and many other fields all apply clustering to gain a first understanding of the structure of large data sets.
Clustering as Data Exploration
“Clustering” is an ill defined problem
There are many different clustering tasks, leading to different clustering paradigms:
There are Many Clustering Tasks
Some more examples
[Figure: a 2-d data set]
[Figure: compact partitioning into two strata]
[Figure: unsupervised learning]
Issues
The clustering problem:
Given a set of objects, find groups of similar objects
1. What is similar?
   Define appropriate metrics.
2. What makes a good group?
   Groups that contain the highest average similarity between all pairs?
   Groups that are most separated from neighboring groups?
3. How can you evaluate a clustering algorithm?
Formal Definition
Given a data set S and a clustering “objective” function f, find a partition P of S that maximizes (or minimizes) f(P).
A partition is a set of subsets of S such that the subsets don’t intersect, and their union is equal to S.
Sample Objective Functions
• Objective 1: Minimize the average distance between points in the same cluster
• Objective 2: Maximize the margin (smallest distance) between neighboring clusters
• Objective 3 (Minimum Description Length): Minimize the number of bits needed to describe the clustering and the number of bits needed to describe the points in each cluster.
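Objective 1 can be made concrete in a few lines. A minimal sketch (NumPy; the function name and toy points are illustrative, not from the slides) that scores a clustering by the average distance between points sharing a cluster:

```python
import numpy as np

def avg_intra_cluster_distance(points, labels):
    """Objective 1: mean distance over all pairs of points in the same cluster."""
    total, count = 0.0, 0
    for c in np.unique(labels):
        members = points[labels == c]
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                total += np.linalg.norm(members[i] - members[j])
                count += 1
    return total / count if count else 0.0

# Two tight clusters score lower (better) than an arbitrary split of the same points.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
good = np.array([0, 0, 1, 1])
bad = np.array([0, 1, 0, 1])
assert avg_intra_cluster_distance(pts, good) < avg_intra_cluster_distance(pts, bad)
```

Note that minimizing this objective alone is degenerate (N singleton clusters score 0), one reason the "real" objective f is hard to pin down.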
More Issues
1. Having an objective function f gives a way of evaluating a clustering.
But the real f is usually not known!
2. Efficiency
   Comparing N points to each other means making O(N²) comparisons.
3. Curse of Dimensionality
   The more features in your data, the more likely the clustering algorithm is to get it wrong.
Clustering as “Unsupervised” Learning
X1 X2 X3 X4 Y
1 1 0 0 1
1 0 1 1 0
0 1 1 0 0
1 0 0 0 1
H = space of boolean functions
Input Output
f = X1 ∧ ¬X3 ∧ ¬X4
Clustering as “Unsupervised” Learning
X1 X2 X3 X4 Y
1 1 0 0 ?
1 0 1 1 ?
0 1 1 0 ?
1 0 0 0 ?
H = space of boolean functions
Input Output
f = X1 ∧ ¬X3 ∧ ¬X4
Clustering is just like ML, except…
Clustering as “Unsupervised” Learning
• Supervised learning has:
  – Labeled training examples
  – A space Y of possible labels
• Unsupervised learning has:
  – Unlabeled training examples
  – No information (or limited information) about the space of possible labels
Some Notes on Complexity
• The ML example used a space of Boolean functions of N Boolean variables
  – 2^(2^N) possible functions
  – But many possibilities are eliminated by the training data and assumptions
• How many possible clusterings?
  – ~K^N / K! partitions, for K clusters (K > 1)
  – No possibilities are eliminated by training data
  – Need to search for a good one efficiently!
Clustering Problem Formulation
• General Assumptions
  – Each data item is a tuple (vector)
  – Values of tuples are nominal, ordinal, or numerical
  – A similarity (or distance) function is provided
• For pure numerical tuples, for example:
  – Sim(di, dj) = Σk di,k · dj,k (dot product)
  – Sim(di, dj) = cos(di, dj)
  – …and many more (slide after next)
Similarity Measures in Data Analysis
• For Ordinal Values
  – E.g., "small," "medium," "large," "X-large"
  – Convert to numerical values, assuming constant spacing, on a normalized [0, 1] scale, where max(v) = 1, min(v) = 0, and the others interpolate
  – E.g., "small" = 0, "medium" = 0.33, etc.
  – Then, use numerical similarity measures
  – Or, use a similarity matrix (see next slide)
Similarity Measures (cont.)
• For Nominal Values
  – E.g., "Boston", "LA", "Pittsburgh", or "male", "female", or "diffuse", "globular", "spiral", "pinwheel"
  – Binary rule: if di,k = dj,k, then sim = 1, else 0
  – Use an underlying semantic property: e.g., Sim(Boston, LA) = dist(Boston, LA)^-1, or Sim(Boston, LA) = (|size(Boston) − size(LA)|) / max(size(cities))
  – Or, use a similarity matrix
Similarity Matrix

         tiny  little  small  medium  large  huge
tiny     1.0   0.8     0.7    0.5     0.2    0.0
little         1.0     0.9    0.7     0.3    0.1
small                  1.0    0.7     0.3    0.2
medium                        1.0     0.5    0.3
large                                 1.0    0.8
huge                                         1.0
– The diagonal must be 1.0
– The monotonicity property must hold
– No linearity (value interpolation) is assumed
– A qualitative transitive property must hold
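Such a matrix is naturally a lookup table. A minimal sketch (values copied from the matrix above; only the upper triangle is stored and symmetry supplies the rest; all names are illustrative):

```python
# The similarity matrix above, stored as its upper triangle.
SIM = {
    ("tiny", "tiny"): 1.0, ("tiny", "little"): 0.8, ("tiny", "small"): 0.7,
    ("tiny", "medium"): 0.5, ("tiny", "large"): 0.2, ("tiny", "huge"): 0.0,
    ("little", "little"): 1.0, ("little", "small"): 0.9, ("little", "medium"): 0.7,
    ("little", "large"): 0.3, ("little", "huge"): 0.1,
    ("small", "small"): 1.0, ("small", "medium"): 0.7, ("small", "large"): 0.3,
    ("small", "huge"): 0.2,
    ("medium", "medium"): 1.0, ("medium", "large"): 0.5, ("medium", "huge"): 0.3,
    ("large", "large"): 1.0, ("large", "huge"): 0.8,
    ("huge", "huge"): 1.0,
}

def sim(a, b):
    """Symmetric lookup: sim(a, b) == sim(b, a)."""
    return SIM.get((a, b), SIM.get((b, a)))
```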
Document Clustering Techniques
• Similarity or Distance Measure: Alternative Choices
  – Cosine similarity: sim(x, y) = (x · y) / (‖x‖ ‖y‖)
  – Euclidean distance: d(x, y) = √(Σk (xk − yk)²)
  – Kernel functions
  – Language modeling: P(y | modelx), where x and y are documents
Document Clustering Techniques
– Kullback-Leibler divergence ("relative entropy"): KL(p ‖ q) = Σk pk log(pk / qk)
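These measures are short enough to sketch directly (pure Python; the term-frequency vectors below are made-up toy documents, not from the slides):

```python
import math

def cosine_sim(x, y):
    """Cosine of the angle between two term vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def euclidean(x, y):
    """Straight-line distance between two term vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def kl_divergence(p, q):
    """Relative entropy D(p || q); assumes q[k] > 0 wherever p[k] > 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

# Toy term-frequency vectors for two short "documents"
d1, d2 = [2, 1, 0, 1], [1, 1, 1, 0]
print(round(cosine_sim(d1, d2), 3))   # → 0.707

# KL compares normalized term distributions; note it is not symmetric
p, q = [0.5, 0.25, 0.0, 0.25], [0.4, 0.3, 0.2, 0.1]
```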
Some Clustering Methods
• K-Means and K-medoids algorithms
  – CLARANS, [Ng and Han, VLDB 1994]
• Hierarchical algorithms
  – CURE, [Guha et al, SIGMOD 1998]
  – BIRCH, [Zhang et al, SIGMOD 1996]
  – CHAMELEON, [Karypis et al, IEEE Computer 32]
• Density-based algorithms
  – DENCLUE, [Hinneburg, Keim, KDD 1998]
  – DBSCAN, [Ester et al, KDD 96]
• Clustering with obstacles, [Tung et al, ICDE 2001]
K-Means
K-means and K-medoids algorithms
• Objective function: Minimize the sum of square distances of points to a cluster representative (centroid)
• Efficient iterative algorithms (O(n) per iteration)
K-Means Clustering
1. Select K seed centroids s.t. d(ci, cj) > dmin
2. Assign points to clusters by minimum distance to centroid:
   Cluster(pi) = argmin_{1 ≤ j ≤ K} d(pi, cj)
3. Compute new cluster centroids:
   ci = (1/ni) Σ_{pj ∈ Cluster i} pj
4. Iterate steps 2 & 3 until no points change clusters
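A minimal sketch of these four steps (Lloyd's algorithm in NumPy; note that the plain random seeding here skips step 1's d(ci, cj) > dmin check):

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Steps 1-4 above: seed, assign by nearest centroid, recompute means, repeat."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]  # step 1
    for _ in range(iters):
        # Step 2: Cluster(p_i) = argmin_j d(p_i, c_j)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: c_i = mean of the points assigned to cluster i
        new = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):  # step 4: stop when centroids stabilize
            break
        centroids = new
    return labels, centroids
```

Each iteration costs O(nK) distance computations, which is why the search feels linear even though the space of clusterings is exponential: it only follows one greedy path to a local optimum.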
K-Means Clustering: Initial Data Points
Step 1: Select k random seeds (k = 3) s.t. d(ci, cj) > dmin

K-Means Clustering: First-Pass Clusters
Step 2: Assign points to clusters by minimum distance:
Cluster(pi) = argmin_{1 ≤ j ≤ K} d(pi, cj)

K-Means Clustering: Seeds → Centroids
Step 3: Compute new cluster centroids:
ci = (1/ni) Σ_{pj ∈ Cluster i} pj

K-Means Clustering: Second-Pass Clusters
Step 4: Reassign points to clusters by minimum distance to the new centroids

K-Means Clustering: Iterate Until Stability
Compute new centroids, and so on.
Question
If space of possible clusterings is exponential, why is it that K-Means can find one in O(n) time?
Problems with K-means type algorithms
• Clusters are assumed to be approximately spherical
• High dimensionality is a problem
• The value of K is an input parameter
Agglomerative Clustering
Hierarchical Clustering
• Quadratic algorithms
• Running time can be improved using sampling [Guha et al, SIGMOD 1998] [Kollios et al, ICDE 2001]
Lecture 7 Information Retrieval and Digital Libraries Page 38
Hierarchical Agglomerative Clustering
• Create N single-document clusters
• For i in 1..N:
  merge the two clusters with greatest similarity
Hierarchical Agglomerative Clustering
Hierarchical agglomerative clustering gives a hierarchy of clusters
• This makes it easier to explore the set of possible k-cluster values to choose the best number of clusters
High density variations
• Intuitively “correct” clustering
• HAC-generated clusters
Document Clustering Techniques
• Example: group documents based on similarity. Similarity matrix:
  [similarity matrix over documents 1–6]
  Thresholding at a similarity value of .9 yields:
  – the complete graph C1 = {1, 4, 5}, namely Complete Linkage
  – the connected graph C2 = {1, 4, 5, 6}, namely Single Linkage
For clustering we need three things:
• A similarity measure for pairwise comparison between documents
• A clustering criterion (Complete Link, Single Link, …)
• A clustering algorithm
Document Clustering Techniques
• Clustering Criterion: Alternative Linkages
  – Single-link ("nearest neighbor"): sim(Ci, Cj) = max_{x ∈ Ci, y ∈ Cj} sim(x, y)
  – Complete-link: sim(Ci, Cj) = min_{x ∈ Ci, y ∈ Cj} sim(x, y)
  – Average-link ("group average clustering", or GAC): sim(Ci, Cj) = (1 / (|Ci| |Cj|)) Σ_{x ∈ Ci, y ∈ Cj} sim(x, y)
Hierarchical Agglomerative Clustering Methods
• Generic Agglomerative Procedure (Salton '89), resulting in nested clusters via iteration:
1. Compute all pairwise document-document similarity coefficients
2. Place each of the n documents into a class of its own
3. Merge the two most similar clusters into one:
   - replace the two clusters by the new cluster
   - recompute intercluster similarity scores w.r.t. the new cluster
4. Repeat step 3 until only k clusters are left (note k could be 1)
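The procedure above can be sketched naively in pure Python (O(n³) as written; the single-link criterion is shown; the `sim` function and 1-d toy data are stand-ins):

```python
def single_link_hac(docs, sim, k):
    """Generic agglomerative procedure with single-link similarity.
    docs: list of items; sim: pairwise similarity function; k: target cluster count."""
    clusters = [[d] for d in docs]           # step 2: one cluster per document
    while len(clusters) > k:                 # step 4: stop at k clusters
        best, pair = float("-inf"), None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single link: similarity of the closest pair across the two clusters
                s = max(sim(a, b) for a in clusters[i] for b in clusters[j])
                if s > best:
                    best, pair = s, (i, j)
        i, j = pair
        clusters[i] += clusters.pop(j)       # step 3: merge the most similar pair
    return clusters

# 1-d toy data, with similarity defined as negative distance
points = [0.0, 0.2, 0.3, 9.0, 9.1]
out = single_link_hac(points, lambda a, b: -abs(a - b), k=2)
```

Swapping `max` for `min` in the inner comparison gives complete-link, and averaging gives GAC.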
Group Agglomerative Clustering
[Figure: dendrogram over documents 1–9]
Expectation-Maximization
Clustering as Model Selection
Let’s look at clustering as a probabilistic modeling problem:
I have some set of clusters C1, C2, and C3. Each one has a certain probability distribution for generating points:
P(xi | C1), P(xi | C2), P(xi | C3)
Clustering as Model Selection
How can I determine which points belong to which cluster?
Cluster for xi = argmaxj P(xi | Cj)
So, all I need is to figure out what P(xi | Cj) is, for each i and j.
But without training data! How can I do that?
Super Simple Example
Coin I and Coin II. (Weighted.)
Pick a coin at random (uniform).
Flip it 4 times.
Repeat.
What are the parameters of the model?
Data

Coin I   Coin II
HHHT     TTTH
HTHH     THTT
HTTH     TTHT
THHH     HTHT
HHHH     HTTT
Probability of Data Given Model
p: probability of H from Coin I
q: probability of H from Coin II
Let's say h heads and t tails for Coin I, and h′ heads and t′ tails for Coin II.
Pr(D|M) = p^h (1−p)^t q^h′ (1−q)^t′
How do we maximize this quantity?
Maximizing p
d/dp (p^h (1−p)^t q^h′ (1−q)^t′) = 0
d/dp(p^h) · (1−p)^t + p^h · d/dp((1−p)^t) = 0
h p^{h−1} (1−p)^t = p^h t (1−p)^{t−1}
h (1−p) = p t
h = p t + h p
p = h / (h + t)
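The closed form p = h/(h+t) is easy to sanity-check numerically. A tiny sketch with made-up counts:

```python
# Sanity check: the likelihood p^h (1-p)^t peaks at p = h/(h+t).
h, t = 7, 3

def likelihood(p):
    return p ** h * (1 - p) ** t

grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=likelihood)
print(best)   # → 0.7, which equals h/(h+t)
```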
Duh…
Missing Data

HHHT  HTTH
TTTH  HTHH
THTT  HTTT
TTHT  HHHH
THHH  HTHT
Oh Boy, Now What!
If we knew the labels (which flips came from which coin), we could find ML values for p and q.
What could we use to compute labels?
p and q!
Computing Labels
p = 3/4, q = 3/10
Pr(Coin I | HHTH)
  = Pr(HHTH | Coin I) Pr(Coin I) / c
  = (3/4)^3 (1/4) (1/2) / c = .052734375 / c
Pr(Coin II | HHTH)
  = Pr(HHTH | Coin II) Pr(Coin II) / c
  = (3/10)^3 (7/10) (1/2) / c = .00945 / c
Expected Labels

        I    II           I    II
HHHT   .85  .15    HTTH  .44  .56
TTTH   .10  .90    HTHH  .85  .15
THTT   .10  .90    HTTT  .10  .90
TTHT   .10  .90    HHHH  .98  .02
THHH   .85  .15    HTHT  .44  .56
2 Unknowns
• We don’t know the labels (which coins generated which sequences), and we don’t know the probabilities for the coins
• If we knew the labels, we could calculate the probabilities
• If we knew the probabilities, we could calculate the labels
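The chicken-and-egg loop the slides are heading toward can be sketched for this exact two-coin setup (pure Python; the starting guesses p = 0.6, q = 0.5 are arbitrary):

```python
def em_two_coins(sequences, p=0.6, q=0.5, iters=50):
    """EM for the two-coin problem: alternate expected labels (E) and
    maximum-likelihood estimates (M). Assumes each coin is picked with
    probability 1/2, as on the slides."""
    n = len(sequences[0])
    heads = [s.count("H") for s in sequences]
    for _ in range(iters):
        # E-step: expected label w = Pr(Coin I | sequence) for each sequence
        w = []
        for h in heads:
            a = p ** h * (1 - p) ** (n - h)   # Pr(sequence | Coin I)
            b = q ** h * (1 - q) ** (n - h)   # Pr(sequence | Coin II)
            w.append(a / (a + b))
        # M-step: ML estimates of p and q from the fractionally labeled flips
        p = sum(wi * h for wi, h in zip(w, heads)) / (n * sum(w))
        q = sum((1 - wi) * h for wi, h in zip(w, heads)) / (n * sum(1 - wi for wi in w))
    return p, q

data = ["HHHT", "TTTH", "THTT", "TTHT", "THHH", "HTTH", "HTHH", "HTTT", "HHHH", "HTHT"]
p, q = em_two_coins(data)   # p drifts toward the heads-heavy coin, q toward the tails-heavy one
```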
Wait, I Have an Idea
Pick some model M0.
Expectation
• Compute expected labels via Mi
Maximization
• Compute ML model Mi+1
Repeat
Could This Work?
Expectation-Maximization (EM)
Theorem: Pr(D|Mi) will not decrease.
Sound familiar? It's a type of search.
GMMs – Gaussian Mixture Models
[Figure: 1000 data points scattered in the (w, h) plane]
Suppose we have 1000 data points in 2D space (w, h).
Assume each data point is normally distributed. Obviously, there are 5 sets of underlying gaussians.
The GMM assumption
• There are K components (gaussians)
• Each component j is specified by three parameters: weight, mean, covariance matrix:
  θj = {wj, μj, Σj},  0 ≤ wj ≤ 1,  Σ_{j=1..K} wj = 1
• The total density function is:
  f_θ(x) = Σ_{j=1..K} wj · (1 / ((2π)^{d/2} √det(Σj))) · exp(−(1/2) (x − μj)^T Σj^{−1} (x − μj))
The EM algorithm (Dempster, Laird and Rubin, 1977)
[Figure: raw data → fitted GMMs (K = 6) → total density function]
EM Basics
Objective: Given N data points, find the maximum likelihood estimate of θ:
  θ* = argmax_θ f_θ(x1, …, xN)
Algorithm:
1. Guess an initial θ
2. Perform E step (expectation): based on θ, associate each data point with a specific gaussian
3. Perform M step (maximization): based on the data-point clustering, maximize θ
4. Repeat 2–3 until convergence (typically tens of iterations)
EM Details
E-Step (estimate the probability that point t is associated with gaussian j):

  w_{t,j} = wj f(x_t | μj, Σj) / Σ_{i=1..K} wi f(x_t | μi, Σi),   j = 1,…,K,  t = 1,…,N

M-Step (estimate new parameters):

  wj_new = (1/N) Σ_{t=1..N} w_{t,j}

  μj_new = (Σ_{t=1..N} w_{t,j} x_t) / (Σ_{t=1..N} w_{t,j})

  Σj_new = (Σ_{t=1..N} w_{t,j} (x_t − μj_new)(x_t − μj_new)^T) / (Σ_{t=1..N} w_{t,j})
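The E and M steps above can be sketched in NumPy, simplified to one dimension so scalar variances stand in for the covariance matrices (initializing the means at the data extremes is an arbitrary choice for the sketch):

```python
import numpy as np

def gmm_em(x, k=2, iters=100):
    """EM for a 1-D gaussian mixture: the E and M steps above with scalar variances."""
    w = np.full(k, 1.0 / k)                  # mixture weights w_j
    mu = np.linspace(x.min(), x.max(), k)    # initial means
    var = np.full(k, x.var())                # initial variances
    for _ in range(iters):
        # E-step: responsibility w_{t,j} of gaussian j for point t
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = w * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        nj = resp.sum(axis=0)
        w = nj / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nj
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nj
    return w, mu, var
```

The full d-dimensional version replaces the scalar density with the multivariate gaussian from the GMM slide and the variance update with the outer-product covariance update.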
EM Example
[Figure: gaussian j, data point t; blue shading shows w_{t,j}]
Back to Clustering
We want to label "close" pixels with the same label.
Proposed metric: label pixels from the same gaussian with the same label.
Label according to max probability:
  label(t) = argmax_j w_{t,j}
The number of labels = K.