Introduction to Machine Learning
Introduction to Machine Learning Amo G. Tong 1
Lecture 13: Unsupervised Learning
• K-means Framework
• Cut-based Framework
• Agglomerative Framework
• Divisive Framework
• Some materials are courtesy of Vibhav Gogate, Carlos Guestrin, Dan Klein & Luke Zettlemoyer, Eric Xing, and Hastie.
• All pictures belong to their creators.
Machine Learning
• Supervised Learning: learn 𝒇(𝒙)
  • Parametric. Regression vs. classification (continuous vs. discrete output); linear vs. non-linear.
    • Methods: linear regression, decision trees, neural networks, …
  • Non-parametric. Instance-based learning, e.g., kNN.
• Reinforcement Learning
• Unsupervised Learning
  • Clustering
Clustering
• Input: some data
• Goal: infer group information
• E.g., grouping emails, grouping search results, detecting styles.
source: http://ogrisel.github.io/scikit-learn.org/sklearn-tutorial/_images/plot_cluster_comparison_11.png
source: Edge Foci Interest Points, DOI: 10.1109/ICCV.2011.6126263
• Clustering is subjective (Eric Xing).
• What does "similar" mean here? The choice of similarity measure is part of the problem.
• Output: a partition of the data such that some pattern reflects the group information.
• We have data, but there are no labels.
• We do not know how many clusters there are.
• We do not know which data point belongs to which cluster.
• We do not even know whether a hidden pattern exists.
• BUT we never give up.
• Two families of approaches:
  • Partition-based frameworks
  • Hierarchical clustering frameworks
K-means Framework
• We have some data.
• We can define (a) the similarity between two instances and (b) the center of a set of instances.
• E.g., Euclidean space (real vectors):
  • Distance: ‖𝒙₁ − 𝒙₂‖₂
  • Similarity: 1/distance
  • Center of 𝑥₁, …, 𝑥ₙ: 𝒙̄ = (Σᵢ 𝒙ᵢ)/𝑛
• Suppose there are 𝑘 clusters:
  • Randomly select 𝑘 centers.
  • Repeat:
    • Assign each instance to the closest center (now we have 𝑘 clusters).
    • Recompute the center of each cluster.
  • Until convergence, or until some other criterion is met.
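As a concrete sketch, the framework above fits in a few lines of Python with NumPy. The function name, the initialization-by-sampling, and the toy blob data are my own choices for illustration:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize: randomly pick k distinct instances as the centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each instance joins its closest center (Euclidean).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: recompute each center as the mean of its cluster
        # (keep the old center if a cluster happens to become empty).
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):    # centers stopped moving: converged
            break
        centers = new
    return labels, centers

# Usage: two well-separated Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
labels, centers = kmeans(X, k=2)
```

On well-separated data this typically recovers the blobs; with an unlucky initialization it can converge to a poorer local optimum, one of the weaknesses discussed later in the lecture.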
K-means Framework (Bishop)
• Example (Euclidean space). Suppose 𝑘 = 2.
  • Step 1: randomly pick two centers.
  • Step 2: assign each point to the closest center.
  • Step 3: calculate the center of each cluster.
  • Step 4: assign each point to the closest center again.
  • Repeat until convergence.
• Example (image segmentation). Represent each pixel by its color and cluster the pixels.
  • Formally: partition an image into regions, each of which has a reasonably homogeneous visual appearance.
  • Informally: identify the main elements in an image.
K-means Framework
• Repeat
  • Update the assignment.
  • Update the means (centers).
• Until convergence, or until some other criterion is met.
• Given an assignment 𝐶, let 𝐶(𝑥) be the mean (center) of the cluster containing 𝑥, and consider the squared Euclidean distance 𝑑𝑖𝑠𝑡(𝒙₁, 𝒙₂) = ‖𝒙₁ − 𝒙₂‖₂².
• Will it converge? Yes!
  • Consider the potential function 𝑓 = Σ_{𝑥∈𝐷} 𝑑𝑖𝑠𝑡(𝑥, 𝐶(𝑥)).
  • Updating the assignment will not increase 𝒇: each point moves to a center that is at least as close.
  • Recalculating the means will not increase 𝒇: for a fixed cluster, the mean is the point that minimizes the sum of squared distances to the cluster's members (try the Lagrange multiplier method yourself).
  • 𝑓 never increases and 𝑓 is bounded below by 0 => it will converge.
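The non-increase of 𝑓 can be checked numerically. The toy script below (my own setup; it uses the squared Euclidean distance, for which the mean is the minimizer) performs one assignment update and one mean update and evaluates 𝑓 after each:

```python
import numpy as np

# Toy check: the potential f = Σ_x dist(x, C(x)) (squared Euclidean)
# does not increase under either k-means step.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
labels = rng.integers(0, 2, size=30)              # random initial assignment, k = 2

def potential(X, labels, centers):
    return sum(np.sum((x - centers[l]) ** 2) for x, l in zip(X, labels))

def cluster_means(X, labels, k, fallback):
    # Mean of each cluster; fall back if a cluster is empty.
    return np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                     else fallback[j] for j in range(k)])

centers = cluster_means(X, labels, 2, np.zeros((2, 2)))
f_before = potential(X, labels, centers)

# Assignment update: each point moves to its closest center.
d = np.linalg.norm(X[:, None] - centers[None], axis=2)
labels = d.argmin(axis=1)
f_mid = potential(X, labels, centers)             # f_mid <= f_before

# Mean update: recompute the centers as cluster means.
centers = cluster_means(X, labels, 2, centers)
f_after = potential(X, labels, centers)           # f_after <= f_mid
```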
K-means Framework
• Simple.
• Intuitive: (implicitly) minimizes Σ_{𝑥∈𝐷} dist(𝑥, 𝐶(𝑥)).
• Not time consuming: O(𝑡𝑘𝑛) (𝑘: number of clusters, 𝑡: number of iterations, 𝑛: number of instances).
• BUT:
  • K-means may converge to a local optimum.
  • How many clusters are there?
  • How do we measure the distance between clusters?
  • How do we define the mean? What if the attributes are not real numbers?
  • Cannot handle noise.
  • Not suitable for non-convex patterns (recall the decision pattern of kNN).
Cut-based Clustering
• Two intuitions behind a good clustering:
  • (a) weak connections between objects in different clusters
  • (b) strong connections between objects within a cluster
• Ground set 𝑈 = {𝑣₁, …, 𝑣ₙ}; a similarity 𝑠𝑖𝑚(𝑣ᵢ, 𝑣ⱼ) between every two elements; a partition 𝐶₁, …, 𝐶ₖ of 𝑈.
• Inner-sim(𝐶ᵢ) = Σ_{𝑢,𝑣∈𝐶ᵢ} 𝑠𝑖𝑚(𝑢, 𝑣)
• Inter-sim(𝐶ᵢ) = Σ_{𝑢∈𝐶ᵢ, 𝑣∉𝐶ᵢ} 𝑠𝑖𝑚(𝑢, 𝑣) (the cut)
• How to measure the goodness of a clustering? Cost of a clustering 𝐶₁, …, 𝐶ₖ:
  cost = Σᵢ Inter-sim(𝐶ᵢ) / Inner-sim(𝐶ᵢ)
• Goal: find a clustering that minimizes cost = Σᵢ Inter-sim(𝐶ᵢ) / Inner-sim(𝐶ᵢ).
• An optimal solution exists, but it is hard to find. Enumerate all partitions? Far too many. A polynomial-time algorithm?
• A heuristic algorithm:
  • Initialize 𝐶₁, …, 𝐶ₖ randomly.
  • Repeat until convergence:
    • Unlock all elements.
    • Repeat until all elements are locked:
      • Randomly select one cluster 𝐶ᵢ.
      • Randomly select one unlocked element 𝑣 ∈ 𝐶ᵢ, if any.
      • Move 𝑣 to the cluster such that the 𝒄𝒐𝒔𝒕 is maximally decreased.
      • Lock 𝑣.
• Example (𝑘 = 2). Three elements 𝑎, 𝑏, 𝑐 with 𝑠𝑖𝑚(𝑎, 𝑏) = 1, 𝑠𝑖𝑚(𝑏, 𝑐) = 2, 𝑠𝑖𝑚(𝑎, 𝑐) = 3.
  • Convention: Inner-sim(𝐶ᵢ) = ∞ if |𝐶ᵢ| = 1, so a singleton contributes 0 to the cost. Alternatively, you can do some smoothing by assigning a base similarity.
  • Start with clusters {𝑏, 𝑐} and {𝑎}: cost = 0 + (3 + 1)/2 = 2.
  • If we move 𝑐 to 𝑎's cluster: cost = (1 + 2)/3 + 0 = 1.
  • If we move 𝑏 to 𝑎's cluster: cost = (3 + 2)/1 + 0 = 5.
  • So moving 𝑐 is the cost-minimizing move.
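The arithmetic in this example can be reproduced with a short script (the helper names are mine; the similarity values 𝑠𝑖𝑚(𝑎, 𝑏) = 1, 𝑠𝑖𝑚(𝑏, 𝑐) = 2, 𝑠𝑖𝑚(𝑎, 𝑐) = 3 are read off the slide's figure):

```python
from itertools import combinations

# Pairwise similarities for the three-element example.
sim = {frozenset("ab"): 1, frozenset("bc"): 2, frozenset("ac"): 3}

def inner_sim(cluster):
    return sum(sim[frozenset(p)] for p in combinations(sorted(cluster), 2))

def inter_sim(cluster, universe):
    return sum(sim[frozenset((u, v))] for u in cluster for v in universe - cluster)

def cost(clusters):
    universe = set().union(*clusters)
    total = 0.0
    for c in clusters:
        if len(c) <= 1:          # singleton: inner-sim treated as infinite -> term is 0
            continue
        total += inter_sim(c, universe) / inner_sim(c)
    return total

print(cost([{"b", "c"}, {"a"}]))      # 2.0  ({b,c}: (3+1)/2, {a}: 0)
print(cost([{"a", "c"}, {"b"}]))      # 1.0  (after moving c)
print(cost([{"a", "b"}, {"c"}]))      # 5.0  (after moving b)
```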
• This is a heuristic algorithm: the result may not be optimal.
• Is the solution good? Reasonable: the cost is iteratively decreased.
• Does it converge? Yes.
  • The inner loop terminates: every element gets locked after one pass.
  • The outer loop terminates: the cost decreases whenever a move is made, and there are only finitely many partitions, so the cost cannot decrease forever.
• Are there other choices within this framework? Yes.
  • For example, instead of picking an unlocked element at random, you may select the one that, after being considered, can maximally decrease the cost.
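For concreteness, here is a sketch of the whole lock-and-move heuristic in Python. The naming and structure are my own; it assumes positive pairwise similarities given as a dict over pairs, and treats a singleton's inner-sim as infinite, as in the example above:

```python
import random
from itertools import combinations

def cut_cost(clusters, sim, universe):
    """cost = Σ inter-sim(C)/inner-sim(C); a singleton (or empty) cluster
    contributes 0, since its inner-sim is treated as infinite."""
    total = 0.0
    for c in clusters:
        if len(c) <= 1:
            continue
        inner = sum(sim[frozenset(p)] for p in combinations(c, 2))
        inter = sum(sim[frozenset((u, v))] for u in c for v in universe - c)
        total += inter / inner
    return total

def cut_cluster(universe, sim, k, seed=0, max_rounds=50):
    rng = random.Random(seed)
    elems = list(universe)
    clusters = [set() for _ in range(k)]
    for v in elems:                              # random initialization
        clusters[rng.randrange(k)].add(v)
    for _ in range(max_rounds):
        improved = False
        rng.shuffle(elems)                       # "unlock all": visit each element once
        for v in elems:
            src = next(c for c in clusters if v in c)
            best, best_cost = src, cut_cost(clusters, sim, universe)
            for dst in clusters:                 # try every possible move of v
                if dst is src:
                    continue
                src.remove(v); dst.add(v)
                c = cut_cost(clusters, sim, universe)
                if c < best_cost:
                    best, best_cost = dst, c
                dst.remove(v); src.add(v)        # undo the tentative move
            if best is not src:                  # commit the best move, then "lock" v
                src.remove(v)
                best.add(v)
                improved = True
        if not improved:                         # a full pass made no move: converged
            break
    return clusters

universe = {"a", "b", "c"}
sim = {frozenset("ab"): 1, frozenset("bc"): 2, frozenset("ac"): 3}
clusters = cut_cluster(universe, sim, k=2)
```

Note that on this tiny instance the heuristic happily merges everything into one cluster (cost 0), leaving the other cluster empty; this degenerate behavior is one motivation for adding size constraints, as in the equal-sized exercise at the end of the lecture.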
• Compared to k-means:
  • Both assume the number of clusters is known in advance.
  • Both need some initialization.
  • Both iteratively improve the solution.
  • Cut-based clustering considers both the inter-cluster and the inner-cluster similarity; k-means considers only the inner-cluster similarity.
Agglomerative Clustering
• Idea: combine small clusters.
• Framework:
  • Maintain a set of clusters.
  • Initially, each instance is its own cluster.
  • Repeat: merge the two closest clusters.
  • Until there is only one cluster left.
• Key question: how do we define the closeness of two clusters?
• First, define the closeness of each pair of instances. The closeness of two clusters can then be:
  • the closest pair (single-link clustering);
  • the farthest pair (complete-link clustering; the diameter);
  • the sum, or the average, of all pairs (average-link clustering);
  • Ward's method: if you can define a within-cluster distance, merge the pair of clusters that results in the minimum increase in within-cluster distance.
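The merge loop can be sketched naively in Python (my own code; `linkage=min` gives single-link and `linkage=max` complete-link, and stopping once 𝑘 clusters remain yields a flat clustering):

```python
import numpy as np

def agglomerative(X, k=1, linkage=min):
    """Merge the two closest clusters (under `linkage` over pairwise point
    distances) until only k clusters remain."""
    clusters = [[i] for i in range(len(X))]
    d = np.linalg.norm(X[:, None] - X[None, :], axis=2)   # all pairwise distances
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = linkage(d[p, q] for p in clusters[i] for q in clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] += clusters[j]                         # merge the closest pair
        del clusters[j]
    return clusters

# Usage: two tight pairs of points; single-link with k = 2 separates them.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
groups = agglomerative(X, k=2)
```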
Agglomerative Clustering (Hastie)
• The result of agglomerative clustering is a hierarchy of clusters, drawn as a dendrogram.
• So what if we want 𝑘 clusters? Cut the dendrogram at the level where exactly 𝑘 branches remain.
• The dendrogram can also be used to detect outliers: a point that only merges into the hierarchy late, at a large distance, is a candidate outlier.
Divisive Clustering
• Idea: split a large cluster into two
• Framework:
  • Maintain a set of clusters.
  • Initially, all instances form one cluster.
  • Repeat: split one cluster into two.
  • Until each cluster is a singleton.
• Key questions: which cluster should we split, and how do we split it?
Divisive Clustering (Andrea)
• Which cluster should we split?
  • If we grow the entire dendrogram and the splitting rule is local, it does not matter.
  • Otherwise, you may select the cluster with the highest cost.
• How do we split it? There are many choices:
  • Partition the cluster into two equal halves such that the cost is minimized.
  • DIANA (DIvisive ANAlysis).
• DIANA: to divide the selected cluster, the algorithm first looks for its most disparate observation, i.e., the one with the largest average dissimilarity to the other observations of the selected cluster. This observation initiates the "splinter group". In subsequent steps, the algorithm reassigns observations that are closer to the "splinter group" than to the "old party". The result is a division of the selected cluster into two new clusters.
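The split step described above can be sketched as follows (my own code; taking "dissimilarity" to be Euclidean distance and reassigning by average distance to each group are both assumptions):

```python
import numpy as np

def diana_split(X, idx):
    """One DIANA split (sketch): seed the "splinter group" with the most
    disparate observation, then reassign observations that are closer, on
    average, to the splinter group than to the rest (the "old party")."""
    idx = list(idx)
    n = len(idx)
    d = np.linalg.norm(X[idx][:, None] - X[idx][None, :], axis=2)
    # Most disparate observation: largest average dissimilarity to the others.
    splinter = [int((d.sum(axis=1) / (n - 1)).argmax())]
    rest = [i for i in range(n) if i not in splinter]
    moved = True
    while moved:
        moved = False
        for i in list(rest):
            if len(rest) == 1:                 # never empty out the old party
                break
            to_rest = d[i, [j for j in rest if j != i]].mean()
            to_splinter = d[i, splinter].mean()
            if to_splinter < to_rest:          # closer to the splinter group: move
                rest.remove(i)
                splinter.append(i)
                moved = True
    return [idx[i] for i in splinter], [idx[i] for i in rest]

# Usage: two tight groups on a line; the split separates {0,1} from {2,3}.
X = np.array([[0.0], [0.1], [5.0], [5.1]])
g1, g2 = diana_split(X, range(4))
```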
Hierarchical Clustering - Summary
• No need to specify the number of clusters in advance.
• Can be time consuming: the time complexity is at least O(𝑛²), where 𝑛 is the total number of objects.
• The hierarchical structure matches intuition in some domains.
• But the interpretation is subjective.
Summary
• K-means
• Cut-based measurements
• Agglomerative clustering
• Divisive clustering
Spring 2019 Amo G. Tong 52
Equal-sized k-clustering
• Recall cut-based 𝑘-clustering: cost = Σᵢ Inter-sim(𝐶ᵢ)/Inner-sim(𝐶ᵢ), minimized by the heuristic:
  • Initialize 𝐶₁, …, 𝐶ₖ randomly.
  • Repeat until convergence:
    • Unlock all elements.
    • Repeat until all elements are locked:
      • Randomly select one cluster 𝐶ᵢ and one unlocked element 𝑣 ∈ 𝐶ᵢ, if any.
      • Move 𝑣 to the cluster such that the 𝒄𝒐𝒔𝒕 is maximally decreased; lock 𝑣.
• Given a set of 𝑘 ⋅ 𝑚 elements, we want an equal-sized 𝑘-clustering; that is, each cluster has exactly 𝑚 elements.
• Please describe a cut-based algorithm for this purpose.
• Hint: how can you take the sizes of the clusters into account?