Transcript of For Wednesday

Page 1: For  Wednesday

For Wednesday

• No reading
• No homework

Page 2: For  Wednesday

Program 4

• Any questions?
• Really no comparisons to make (easily), but I would like a brief discussion of RL, your experience with it, and the results you get.

• Extra credit extensions are certainly possible, but I’m going to let you suggest extensions to me as opposed to the other way around.

Page 3: For  Wednesday

Clustering

• Partition unlabeled examples into disjoint subsets (clusters), such that:
– Examples within a cluster are very similar
– Examples in different clusters are very different

• Discover new categories in an unsupervised manner (no sample category labels provided).

Page 4: For  Wednesday

Clustering Example

[Figure: scatter plot of unlabeled points falling into several natural groupings]

Page 5: For  Wednesday


Hierarchical Clustering

• Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples.

• Recursive application of a standard clustering algorithm can produce a hierarchical clustering.

[Figure: example dendrogram]
animal
  vertebrate: fish, reptile, amphib., mammal
  invertebrate: worm, insect, crustacean

Page 6: For  Wednesday

Agglomerative vs. Divisive Clustering

• Agglomerative (bottom-up) methods start with each example in its own cluster and iteratively combine them to form larger and larger clusters.

• Divisive (partitional, top-down) methods separate all examples immediately into clusters.

Page 7: For  Wednesday


Direct Clustering Method

• Direct clustering methods require a specification of the number of clusters, k, desired.

• A clustering evaluation function assigns a real-valued quality measure to a clustering.

• The number of clusters can be determined automatically by explicitly generating clusterings for multiple values of k and choosing the best result according to a clustering evaluation function.
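A minimal sketch of that model-selection loop, where cluster and evaluate are hypothetical stand-ins for a concrete clustering algorithm (e.g. k-means) and a real-valued quality measure:

```python
# Hypothetical sketch: generate clusterings for several values of k and keep
# the one the evaluation function scores highest.

def choose_k(examples, cluster, evaluate, k_values):
    best_k, best_score, best_clustering = None, float("-inf"), None
    for k in k_values:
        clustering = cluster(examples, k)   # e.g. run k-means with k clusters
        score = evaluate(clustering)        # real-valued quality measure
        if score > best_score:
            best_k, best_score, best_clustering = k, score, clustering
    return best_k, best_clustering
```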

Page 8: For  Wednesday


Hierarchical Agglomerative Clustering (HAC)

• Assumes a similarity function for determining the similarity of two instances.

• Starts with each instance in a separate cluster and then repeatedly joins the two clusters that are most similar until there is only one cluster.

• The history of merging forms a binary tree or hierarchy.

Page 9: For  Wednesday


HAC Algorithm

Start with each instance in its own cluster.
Until there is only one cluster:
    Among the current clusters, determine the two clusters, ci and cj, that are most similar.
    Replace ci and cj with a single cluster ci ∪ cj.
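A minimal Python sketch of this loop, assuming an instance-level similarity function sim(x, y) and using single-link similarity between clusters for concreteness; this is the naive version, not the O(n²) bookkeeping discussed on a later slide:

```python
# Naive HAC sketch: repeatedly merge the two most similar clusters.

def hac(instances, sim):
    clusters = [[x] for x in instances]        # each instance starts alone
    history = []                               # record of merges (the tree)
    while len(clusters) > 1:
        best = None                            # most similar pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = max(sim(x, y) for x in clusters[i] for y in clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        history.append((clusters[i], clusters[j]))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return history
```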

Page 10: For  Wednesday

Cluster Similarity

• Assume a similarity function that determines the similarity of two instances: sim(x, y).
– Cosine similarity of document vectors.

• How to compute the similarity of two clusters, each possibly containing multiple instances?
– Single Link: similarity of the two most similar members.
– Complete Link: similarity of the two least similar members.
– Group Average: average similarity between members.
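A small sketch of these criteria in Python, assuming instances are equal-length numeric vectors. Note that this simple group_average averages only the between-cluster pairs; the variant on a later slide averages over all ordered pairs in the merged cluster instead.

```python
import math

def cosine(x, y):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def single_link(ci, cj, sim=cosine):
    return max(sim(x, y) for x in ci for y in cj)   # two most similar members

def complete_link(ci, cj, sim=cosine):
    return min(sim(x, y) for x in ci for y in cj)   # two least similar members

def group_average(ci, cj, sim=cosine):
    pairs = [(x, y) for x in ci for y in cj]
    return sum(sim(x, y) for x, y in pairs) / len(pairs)  # mean over member pairs
```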

Page 11: For  Wednesday


Single Link Agglomerative Clustering

• Use maximum similarity of pairs:

sim(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} sim(x, y)

• Can result in “straggly” (long and thin) clusters due to the chaining effect.
– Appropriate in some domains, such as clustering islands.

Page 12: For  Wednesday


Single Link Example

Page 13: For  Wednesday


Complete Link Agglomerative Clustering

• Use minimum similarity of pairs:

sim(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} sim(x, y)

• Makes tighter, more spherical clusters that are typically preferable.

Page 14: For  Wednesday


Complete Link Example

Page 15: For  Wednesday


Computational Complexity

• In the first iteration, all HAC methods need to compute the similarity of all pairs of n individual instances, which is O(n²).

• In each of the subsequent n − 2 merging iterations, the algorithm must compute the distance between the most recently created cluster and all other existing clusters.

• In order to maintain an overall O(n²) performance, computing the similarity to each other cluster must be done in constant time.

Page 16: For  Wednesday


Computing Cluster Similarity

• After merging ci and cj, the similarity of the resulting cluster to any other cluster, ck, can be computed by:

– Single Link:

sim((c_i \cup c_j), c_k) = \max(sim(c_i, c_k), sim(c_j, c_k))

– Complete Link:

sim((c_i \cup c_j), c_k) = \min(sim(c_i, c_k), sim(c_j, c_k))
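A sketch of that constant-time update, assuming pairwise cluster similarities are cached in a dict keyed by frozensets of cluster ids (the cache layout is an assumption, not something from the slides):

```python
# After merging ci and cj into `merged`, refresh the cached similarity to
# every other cluster in constant time per cluster.

def update_similarities(sims, ci, cj, merged, others, linkage=max):
    # linkage=max gives single link, linkage=min gives complete link
    for ck in others:
        sims[frozenset((merged, ck))] = linkage(
            sims[frozenset((ci, ck))],
            sims[frozenset((cj, ck))],
        )
```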

Page 17: For  Wednesday


Group Average Agglomerative Clustering

• Use average similarity across all pairs within the merged cluster to measure the similarity of two clusters.

• Compromise between single and complete link.

• Averaged across all ordered pairs in the merged cluster instead of unordered pairs between the two clusters to encourage tight clusters.

sim(c_i, c_j) = \frac{1}{|c_i \cup c_j|\,(|c_i \cup c_j| - 1)} \sum_{x \in (c_i \cup c_j)} \sum_{y \in (c_i \cup c_j),\, y \neq x} sim(x, y)

Page 18: For  Wednesday


Computing Group Average Similarity

• Assume cosine similarity and normalized vectors with unit length.

• Always maintain sum of vectors in each cluster.

• Compute similarity of clusters in constant time:

Sum of the vectors in cluster c_j:

s(c_j) = \sum_{x \in c_j} x

Constant-time group-average similarity of the merged cluster:

sim(c_i, c_j) = \frac{(s(c_i) + s(c_j)) \cdot (s(c_i) + s(c_j)) - (|c_i| + |c_j|)}{(|c_i| + |c_j|)\,(|c_i| + |c_j| - 1)}
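A sketch of this constant-time computation, assuming unit-length vectors stored as Python lists and clusters that carry their size along with the sum vector s(c):

```python
def add(u, v):
    return [a + b for a, b in zip(u, v)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def group_average_sim(s_i, n_i, s_j, n_j):
    """s_i, s_j: sum vectors of the two clusters; n_i, n_j: their sizes."""
    s = add(s_i, s_j)
    n = n_i + n_j
    # dot(s, s) sums sim(x, y) over every ordered pair, including each vector
    # with itself; the self-pairs contribute n because the vectors have unit
    # length, so subtracting n leaves only the x != y pairs.
    return (dot(s, s) - n) / (n * (n - 1))
```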

Page 19: For  Wednesday


Non-Hierarchical Clustering

• Typically must provide the number of desired clusters, k.

• Randomly choose k instances as seeds, one per cluster.

• Form initial clusters based on these seeds.
• Iterate, repeatedly reallocating instances to different clusters to improve the overall clustering.
• Stop when the clustering converges or after a fixed number of iterations.

Page 20: For  Wednesday


K-Means

• Assumes instances are real-valued vectors.

• Clusters based on centroids (the center of gravity, or mean, of the points in a cluster c):

\mu(c) = \frac{1}{|c|} \sum_{x \in c} x

• Reassignment of instances to clusters is based on distance to the current cluster centroids.

Page 21: For  Wednesday


Distance Metrics

• Euclidean distance (L2 norm):

L_2(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}

• L1 norm:

L_1(x, y) = \sum_{i=1}^{m} |x_i - y_i|

• Cosine similarity (transformed to a distance by subtracting from 1):

1 - \frac{x \cdot y}{\|x\|\,\|y\|}
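Straightforward Python versions of the three measures, assuming equal-length numeric vectors:

```python
import math

def l2(x, y):
    """Euclidean (L2) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def l1(x, y):
    """L1 (Manhattan) distance."""
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_distance(x, y):
    """1 minus cosine similarity."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1 - dot / norm
```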

Page 22: For  Wednesday


K-Means Algorithm

Let d be the distance measure between instances.
Select k random instances {s1, s2, …, sk} as seeds.
Until clustering converges or other stopping criterion:
    For each instance xi:
        Assign xi to the cluster cj such that d(xi, sj) is minimal.
    (Update the seeds to the centroid of each cluster)
    For each cluster cj:
        sj = μ(cj)
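A minimal Python sketch following this outline; d is any of the distance functions sketched earlier, and empty-cluster handling is kept deliberately simple:

```python
import random

def kmeans(instances, k, d, max_iters=100):
    seeds = random.sample(instances, k)                      # k random seeds
    clusters = []
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for x in instances:                                  # assignment step
            j = min(range(k), key=lambda j: d(x, seeds[j]))
            clusters[j].append(x)
        new_seeds = []
        for j, c in enumerate(clusters):                     # update step
            if c:
                m = len(c)
                new_seeds.append([sum(v) / m for v in zip(*c)])  # centroid
            else:
                new_seeds.append(seeds[j])      # keep the seed of an empty cluster
        if new_seeds == seeds:                               # converged
            break
        seeds = new_seeds
    return clusters, seeds
```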

Page 23: For  Wednesday


K-Means Example (K = 2)

[Figure: a 2-D point set traced through the algorithm]
Pick seeds → Reassign clusters → Compute centroids → Reassign clusters → Compute centroids → Reassign clusters → Converged!

Page 24: For  Wednesday

Time Complexity

• Assume computing the distance between two instances is O(m), where m is the dimensionality of the vectors.
• Reassigning clusters: O(kn) distance computations, or O(knm).
• Computing centroids: each instance vector gets added once to some centroid: O(nm).
• Assume these two steps are each done once for I iterations: O(Iknm).
• Linear in all relevant factors, assuming a fixed number of iterations; more efficient than O(n²) HAC.

Page 25: For  Wednesday

K-Means Objective

• The objective of k-means is to minimize the total sum of the squared distances of every point to its corresponding cluster centroid:

\sum_{l=1}^{K} \sum_{x_i \in X_l} \|x_i - \mu_l\|^2

• Finding the global optimum is NP-hard.
• The k-means algorithm is guaranteed to converge to a local optimum.
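A small sketch of this objective, written to pair with the kmeans sketch above (clusters and their centroids as parallel lists); lower values indicate a tighter clustering:

```python
def kmeans_objective(clusters, centroids):
    """Total squared distance of every point to its cluster's centroid."""
    total = 0.0
    for c, mu in zip(clusters, centroids):
        for x in c:
            total += sum((a - b) ** 2 for a, b in zip(x, mu))
    return total
```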

Page 26: For  Wednesday


Seed Choice

• Results can vary based on random seed selection.

• Some seeds can result in poor convergence rate, or convergence to sub-optimal clusterings.

• Select good seeds using a heuristic or the results of another method.

Page 27: For  Wednesday


Buckshot Algorithm

• Combines HAC and K-Means clustering.
• First randomly take a sample of instances of size √n.
• Run group-average HAC on this sample, which takes only O(n) time.
• Use the results of HAC as initial seeds for K-means.
• The overall algorithm is O(n) and avoids problems of bad seed selection.
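A sketch of the seeding step, where hac_into_k (group-average HAC cut off at k clusters) and centroid are hypothetical stand-ins for the routines sketched earlier; the returned seeds replace the random seeds in k-means:

```python
import math
import random

def buckshot_seeds(instances, k, hac_into_k, centroid):
    """Return k seeds for k-means: centroids of HAC clusters on a sqrt(n) sample."""
    n = len(instances)
    sample_size = max(k, int(math.sqrt(n)))          # sample of size sqrt(n)
    sample = random.sample(instances, sample_size)
    sample_clusters = hac_into_k(sample, k)          # HAC on the sample is O(n)
    return [centroid(c) for c in sample_clusters]    # initial seeds for k-means
```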

Page 28: For  Wednesday

Text Clustering

• HAC and K-Means have been applied to text in a straightforward way.
• Typically use normalized, TF/IDF-weighted vectors and cosine similarity (see the sketch after this list).
• Optimize computations for sparse vectors.
• Applications:

– During retrieval, add other documents in the same cluster as the initial retrieved documents to improve recall.

– Clustering of results of retrieval to present more organized results to the user (à la Northernlight folders).

– Automated production of hierarchical taxonomies of documents for browsing purposes (à la Yahoo & DMOZ).
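A sketch of the usual preprocessing, assuming documents are given as lists of tokens: TF/IDF weighting followed by length normalization, so that a sparse dot product of two vectors is their cosine similarity.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists -> list of normalized sparse TF/IDF vectors."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))   # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: tf[t] * math.log(n / df[t]) for t in tf}    # tf * idf
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors

def cosine_sparse(u, v):
    """Dot product of sparse unit-length vectors = cosine similarity."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())
```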

Page 29: For  Wednesday


Soft Clustering

• Clustering typically assumes that each instance is given a “hard” assignment to exactly one cluster.

• Does not allow uncertainty in class membership or for an instance to belong to more than one cluster.

• Soft clustering gives probabilities that an instance belongs to each of a set of clusters.

• Each instance is assigned a probability distribution across a set of discovered categories (probabilities of all categories must sum to 1).
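One simple way to produce such a distribution (an illustration only, not the method implied by the slides) is to pass an instance's negative distances to the k centroids through a softmax:

```python
import math

def soft_assignment(x, centroids, d, temperature=1.0):
    """Return a probability distribution over clusters for instance x."""
    weights = [math.exp(-d(x, mu) / temperature) for mu in centroids]
    total = sum(weights)
    return [w / total for w in weights]      # probabilities sum to 1
```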