Clustering algorithms
Machine Learning
Hamid Beigy
Sharif University of Technology
Fall 1394
Table of contents
1 Supervised vs. Unsupervised Learning
2 Hierarchical Clustering
3 Non-Hierarchical Clustering
Supervised vs. Unsupervised Learning
Supervised Learning
Classification: partition examples into groups according to pre-defined categories
Regression: assign a value to feature vectors
Requires labeled data for training
Unsupervised Learning
Clustering: partition examples into groups when no pre-defined categories/classes are available
Novelty detection: find new labels of data
Concept drift detection: find changes in the distribution of data
Outlier detection: find unusual events (e.g. hackers)
Only instances are required, no labels
Why do Unsupervised Learning?
Cost
Raw data is cheap
Labeled data is expensive
Save memory/computation
Reduce noise in high-dimensional data
Useful in exploratory data analysis
Often pre-processing step for supervised learning
Clustering
Partition unlabeled examples into disjoint subsets of clusters, such that:
Examples within a cluster are similar
Examples in different clusters are different
Discover new categories in an unsupervised manner
(no sample category labels provided)
Similarity Measures
Euclidean distance (L2 norm):

$L_2(\vec{x}, \vec{x}') = \sqrt{\sum_{i=1}^{N} (x_i - x_i')^2}$

L1 norm:

$L_1(\vec{x}, \vec{x}') = \sum_{i=1}^{N} |x_i - x_i'|$

Cosine similarity:

$\cos(\vec{x}, \vec{x}') = \frac{\vec{x} \cdot \vec{x}'}{\|\vec{x}\| \, \|\vec{x}'\|}$
Kernels
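A minimal NumPy sketch of these three measures (the function names are illustrative, not part of the course materials):

import numpy as np

def l2_distance(x, y):
    # Euclidean (L2) distance: square root of the sum of squared differences
    return np.sqrt(np.sum((x - y) ** 2))

def l1_distance(x, y):
    # L1 (Manhattan) distance: sum of absolute differences
    return np.sum(np.abs(x - y))

def cosine_similarity(x, y):
    # cosine of the angle between x and y
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 1.0])
print(l2_distance(x, y), l1_distance(x, y), cosine_similarity(x, y))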
Applications of Clustering
Cluster retrieved documents
to present more organized and understandable results to the user → "diversified retrieval"
Detecting near duplicates
Entity resolution
E.g. "Thorsten Joachims" == "Thorsten B Joachims"
Exploratory data analysis
Automated (or semi-automated) creation of taxonomies
E.g. Yahoo, DMOZ
Comparison
Applications of Clustering
Text Mining
Applications of Clustering
Image Segmentation
http://people.cs.uchicago.edu/~pff/segment
Clustering Example

(A sequence of figures illustrating how example points are grouped into clusters; the figures are not reproduced in this transcript.)
Clustering Algorithms Categories
Hierarchical clustering
E.g. Linkage clustering
Centroid-based clustering
E.g. K-means clustering
Distribution-based clustering
E.g. EM algorithm
Density-based clustering
E.g. DBSCAN
Other methods
Hierarchical Clustering
Organize the clusters in a hierarchical way
Produces a rooted tree (Dendrogram)
Animal
  Vertebrate: Fish, Reptile, Amphibian, Mammal
  Invertebrate: Worm, Insect, Crustacean

Recursive application of a standard clustering algorithm can produce a hierarchical clustering
Agglomerative vs. Divisive Clustering
Agglomerative (bottom-up) methods start with each example in its own cluster and iteratively combine them to form larger and larger clusters.
Divisive (top-down) methods separate all examples recursively into smaller clusters.
Agglomerative (bottom up)
Assumes a similarity function for determining the similarity of two clusters.
Starts with all instances in a separate cluster and then repeatedly joins the two clusters that are most similar until there is only one cluster.
The history of merging forms a binary tree or hierarchy.
Basic algorithm (sketched in Python below):
Start with all instances in their own cluster.
Until there is only one cluster:
  Among the current clusters, determine the two clusters, c_i and c_j, that are most similar.
  Replace c_i and c_j with a single cluster c_i ∪ c_j.
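A minimal Python sketch of this agglomerative procedure, using single-link similarity between clusters; the names and the O(n^3) implementation are illustrative, chosen for clarity rather than efficiency:

import numpy as np

def agglomerative(points, sim, num_clusters=1):
    # start with each instance in its own cluster
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > num_clusters:
        best = None
        # find the two most similar clusters (single link: most similar pair of members)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = max(sim(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        _, a, b = best
        # replace c_i and c_j with the single cluster c_i ∪ c_j
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

pts = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
similarity = lambda x, y: -np.linalg.norm(x - y)   # similarity = negative Euclidean distance
print(agglomerative(pts, similarity, num_clusters=2))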
Cluster Similarity
How to compute similarity of two clusters each possibly containing multiple instances?
Single Linkage: similarity of the two most similar members.
Complete Linkage: similarity of the two least similar members.
Group Average: average similarity between members.
Single-Link (bottom-up)
$\text{sim}(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} \text{sim}(x, y)$

(the similarity of the two most similar members)
Complete-Link (bottom-up)

$\text{sim}(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} \text{sim}(x, y)$

(the similarity of the two least similar members)
Computational Complexity of HAC
In the first iteration, all HAC methods need to compute the similarity of all pairs of the n individual instances, which is $O(n^2)$.
In each of the subsequent $O(n)$ merging iterations, the algorithm must find the smallest-distance pair of clusters; maintaining a heap gives $O(n^2 \log n)$ overall.
In each of the subsequent $O(n)$ merging iterations, it must also compute the distance between the most recently created cluster and all other existing clusters. Can this be done in constant time per cluster, so that the overall cost stays $O(n^2 \log n)$?
Computing Cluster Similarity
After merging $c_i$ and $c_j$, the similarity of the resulting cluster to any other cluster $c_k$ can be computed by (see the sketch below):

Single Linkage:

$\text{sim}((c_i \cup c_j), c_k) = \max(\text{sim}(c_i, c_k), \text{sim}(c_j, c_k))$

Complete Linkage:

$\text{sim}((c_i \cup c_j), c_k) = \min(\text{sim}(c_i, c_k), \text{sim}(c_j, c_k))$
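A small sketch of this update, assuming a cluster-similarity matrix S is maintained (names are illustrative): the row for the merged cluster is an element-wise max (single link) or min (complete link) of the two old rows, so each entry costs constant time, which answers the question from the previous slide:

import numpy as np

def merged_row(S, i, j, linkage="single"):
    # S is the current cluster-similarity matrix; return the similarities of c_i ∪ c_j
    # to every existing cluster without revisiting individual instances
    combine = np.maximum if linkage == "single" else np.minimum
    return combine(S[i], S[j])

S = np.array([[1.0, 0.8, 0.2],
              [0.8, 1.0, 0.5],
              [0.2, 0.5, 1.0]])
print(merged_row(S, 0, 1, "single"))   # entry for the third cluster is max(0.2, 0.5) = 0.5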
Group Average Bottom-up Clustering
Use the average similarity across all pairs within the merged cluster to measure the similarity of two clusters:

$\text{sim}(c_i, c_j) = \frac{1}{|c_i \cup c_j|\,(|c_i \cup c_j| - 1)} \sum_{\vec{x} \in c_i \cup c_j} \; \sum_{\vec{y} \in c_i \cup c_j,\, \vec{y} \neq \vec{x}} \text{sim}(\vec{x}, \vec{y})$
Compromise between single and complete link.
Non-Hierarchical Clustering
K-means clustering ("hard")
Mixtures of Gaussians trained via the Expectation-Maximization (EM) algorithm ("soft")
Clustering Criterion
An evaluation function that assigns a (usually real-valued) score to a clustering.
The clustering criterion is typically a function of:
within-cluster similarity and
between-cluster dissimilarity
Optimization: find the clustering that maximizes the criterion.
Global optimization (often intractable)
Greedy search
Approximation algorithms
Centroid-Based Clustering
Assumes instances are real-valued vectors.
Clusters are represented via centroids, i.e. the average of the points in cluster $c$:

$\vec{\mu}(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}$

Reassignment of instances to clusters is based on distance to the current cluster centroids.
K-Means Algorithm
Input: $k$ = number of clusters, distance measure $d$.
Select $k$ random instances $s_1, s_2, \ldots, s_k$ as seeds.
Until clustering converges or another stopping criterion is met:
  For each instance $x_i$:
    Assign $x_i$ to the cluster $c_j$ such that $d(x_i, s_j)$ is minimal.
  For each cluster $c_j$, update its centroid: $s_j = \vec{\mu}(c_j)$.
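A minimal NumPy sketch of this loop, assuming Euclidean distance; the names and the convergence test are illustrative:

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    seeds = X[rng.choice(len(X), size=k, replace=False)]       # k random instances as seeds
    for _ in range(max_iter):
        # assignment step: each x_i goes to the cluster with the nearest seed
        d = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)   # N x k distances
        labels = d.argmin(axis=1)
        # update step: s_j becomes the centroid of the instances assigned to c_j
        new_seeds = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else seeds[j]
                              for j in range(k)])
        if np.allclose(new_seeds, seeds):                       # stop when centroids no longer move
            break
        seeds = new_seeds
    return labels, seeds

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.2, 4.8]])
print(kmeans(X, k=2))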
Time Complexity
Assume computing the distance between two instances is O(D), where D is the dimensionality of the vectors.
Reassigning clusters for N points: O(kN) distance computations, or O(kDN).
Computing centroids: Each instance gets added once to some centroid: O(DN).
Assume these two steps are each done once for i iterations: O(ikDN).
Linear in all relevant factors; assuming a fixed number of iterations, this is more efficient than HAC.
Seed Choice
Problem
Results can vary based on random seed selection, especially for high-dimensional data.
Some seeds can result in poor convergence rate, or convergence to sub-optimal clusterings.
Idea: Combine HAC and K-means clustering.
First randomly take a sample of instances of size
Run group-average HAC on this sample
Use the results of HAC as initial seeds for K-means.
Overall algorithm is efficient and avoids problems of bad seed selection.
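A hedged sketch of this combined approach using scikit-learn; the sample size, data, and parameter choices below are illustrative assumptions, not values from the slides:

import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                                   # illustrative data
k = 4

# run group-average HAC on a small random sample
sample = X[rng.choice(len(X), size=100, replace=False)]
hac = AgglomerativeClustering(n_clusters=k, linkage="average").fit(sample)
seeds = np.array([sample[hac.labels_ == j].mean(axis=0) for j in range(k)])

# use the HAC cluster means as initial seeds for K-means on the full data
km = KMeans(n_clusters=k, init=seeds, n_init=1).fit(X)
print(km.cluster_centers_)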
Gaussian Mixtures and EM
Gaussian Mixture Models
Assume each mixture component is a Gaussian with mean $\vec{\mu}_j$ and unit variance.
EM Algorithm
Assume $P(Y)$ and $k$ are known, and $\Sigma = I$ (unit variance).

REPEAT:

E-step: compute the responsibilities

$P(Y = j \mid X = \vec{x}_i, \vec{\mu}_1, \ldots, \vec{\mu}_k) = \frac{P(X = \vec{x}_i \mid Y = j, \vec{\mu}_j)\, P(Y = j)}{\sum_{l=1}^{k} P(X = \vec{x}_i \mid Y = l, \vec{\mu}_l)\, P(Y = l)} = \frac{e^{-0.5\,\|\vec{x}_i - \vec{\mu}_j\|^2}\, P(Y = j)}{\sum_{l=1}^{k} e^{-0.5\,\|\vec{x}_i - \vec{\mu}_l\|^2}\, P(Y = l)}$

M-step: re-estimate the means

$\vec{\mu}_j = \frac{\sum_{i=1}^{n} P(Y = j \mid X = \vec{x}_i, \vec{\mu}_1, \ldots, \vec{\mu}_k)\, \vec{x}_i}{\sum_{i=1}^{n} P(Y = j \mid X = \vec{x}_i, \vec{\mu}_1, \ldots, \vec{\mu}_k)}$
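A minimal NumPy sketch of these updates for unit-variance Gaussians with known mixing weights P(Y); all names and the example data are illustrative:

import numpy as np

def em_means(X, priors, mus, num_iter=50):
    for _ in range(num_iter):
        # E-step: responsibilities P(Y = j | X = x_i, mu_1, ..., mu_k)
        sq = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)    # N x k squared distances
        unnorm = np.exp(-0.5 * sq) * priors[None, :]
        resp = unnorm / unnorm.sum(axis=1, keepdims=True)
        # M-step: mu_j = responsibility-weighted mean of the data
        mus = (resp.T @ X) / resp.sum(axis=0)[:, None]
    return mus

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.0, 1.0, size=(50, 2)),
                    rng.normal(4.0, 1.0, size=(50, 2))])
print(em_means(X, priors=np.array([0.5, 0.5]), mus=np.array([[0.0, 0.0], [1.0, 1.0]])))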