Basic Machine Learning: Clustering
CS 315 – Web Search and Data Mining
What is Machine Learning?
Algorithms for inferring new results from known data
Not magic, though presented as such
Not infallible, though sometimes presented as such
Mainly based on probability and statistics
Examples:
Detecting email spam
Recognizing handwritten text
Reading license plates and human faces
Recognizing speech
Recommending new movies to watch
Supervised vs. Unsupervised Learning
Two Fundamental Methods in Machine Learning
Supervised Learning (“learn from my example”)
Goal: a program that performs a task as well as humans.
TASK – well defined (the target function)
EXPERIENCE – training data provided by a human
PERFORMANCE – error/accuracy on the task
Unsupervised Learning (“see what you can find”)
Goal: to find some kind of structure in the data.
TASK – vaguely defined
No EXPERIENCE
No PERFORMANCE (but there are some evaluation metrics)
What is Clustering?
The most common form of unsupervised learning
Clustering is the process of grouping a set of physical or abstract objects into classes (“clusters”) of similar objects.
It can be used in IR:
To improve recall in search
For better navigation of search results
Ex1: Cluster to Improve Recall
Cluster hypothesis: Documents with similar text are related
Thus, when a query matches a document D, also return other documents in the cluster containing D.
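As an illustrative sketch (not from the slides; the data structures here are assumptions), this is all the query processor needs to do once a clustering exists:

```python
# Minimal sketch of cluster-based recall expansion (illustrative only).
# Assumes `clusters` maps a cluster id to its member document ids and
# `doc_cluster` maps each document id to its cluster id.

def expand_with_cluster(matched_docs, doc_cluster, clusters):
    """Return the matched documents plus all of their cluster-mates."""
    expanded = set(matched_docs)
    for d in matched_docs:
        expanded.update(clusters[doc_cluster[d]])
    return expanded

# Example: the query matched only doc 2, but doc 2's cluster also
# contains docs 5 and 7, so they are returned as well.
clusters = {0: {2, 5, 7}, 1: {1, 3}}
doc_cluster = {2: 0, 5: 0, 7: 0, 1: 1, 3: 1}
print(expand_with_cluster({2}, doc_cluster, clusters))  # {2, 5, 7}
```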
Ex2: Cluster for Better Navigation
Clustering Characteristics
Flat Clustering vs. Hierarchical Clustering
Flat: just dividing objects into groups (clusters)
Hierarchical: organize clusters in a hierarchy
Evaluating Clustering
Internal Criteria
The intra-cluster similarity is high (tightness)
The inter-cluster similarity is low (separateness)
External Criteria
Did we discover the hidden classes? (We need gold-standard data for this evaluation.)
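One common external measure is purity; a minimal sketch (my illustration, the slides name the idea but not a formula):

```python
from collections import Counter

def purity(cluster_labels, gold_labels):
    """Fraction of points whose cluster's majority gold class matches their own.

    `cluster_labels[i]` is the cluster assigned to point i,
    `gold_labels[i]` its gold-standard class.
    """
    by_cluster = {}
    for c, g in zip(cluster_labels, gold_labels):
        by_cluster.setdefault(c, []).append(g)
    # For each cluster, count how many points belong to its most common gold class.
    majority = sum(Counter(golds).most_common(1)[0][1]
                   for golds in by_cluster.values())
    return majority / len(gold_labels)

# Two clusters; cluster 0 is mostly class "A", cluster 1 entirely "B".
print(purity([0, 0, 0, 1, 1], ["A", "A", "B", "B", "B"]))  # 0.8
```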
Clustering for Web IR
Representation for clustering
Document representation
Need a notion of similarity/distance
How many clusters?
Fixed a priori?
Completely data-driven?
Avoid “trivial” clusters – too large or too small
Recall: Documents as vectors
Each doc j is a vector of tf.idf values, one component for each term. Can normalize to unit length.
Vector space: terms are axes (aka features)
N docs live in this space
Even with stemming, we may have 20,000+ dimensions
$w_{i,j} = \mathrm{tf}_{i,j} \cdot \mathrm{idf}_i$, where $\mathrm{idf}_i = \log(N / \mathrm{df}_i)$.
Each document vector $\vec{d}_j$ can be normalized to unit length: $\hat{d}_j = \vec{d}_j / \|\vec{d}_j\|$, where $\|\vec{d}_j\| = \sqrt{\sum_i w_{i,j}^2}$.
[Figure: a two-dimensional vector space with term axes $t_1$ and $t_2$ and document vectors D1–D4.]
What makes documents related?
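A minimal sketch of this weighting (my illustration; the tokenized input format and the exact idf variant are assumptions):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build unit-length tf.idf vectors for a list of tokenized documents."""
    n = len(docs)
    # df[t] = number of documents containing term t
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # w_{i,j} = tf_{i,j} * log(N / df_i)
        vec = {t: tf[t] * math.log(n / df[t]) for t in tf}
        # Normalize to unit length (guard against all-zero vectors).
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

docs = [["jaguar", "car", "speed"], ["jaguar", "cat", "jungle"]]
print(tfidf_vectors(docs)[0])
```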
What makes documents related?
Ideal: semantic similarity.
Practical: statistical similarity.
We will use cosine similarity.
We will describe algorithms in terms of cosine similarity.
This is known as the “normalized inner product”.
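Concretely, a small sketch over the sparse vectors built above:

```python
import math

def cosine_sim(u, v):
    """Cosine similarity (normalized inner product) of sparse term-weight vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

print(cosine_sim({"a": 1.0, "b": 1.0}, {"a": 1.0}))  # ~0.707
```

For vectors already normalized to unit length, the denominator is 1 and this reduces to the plain inner product.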
Clustering Algorithms
Hierarchical algorithms
Bottom-up, agglomerative clustering
Partitioning “flat” algorithms
Usually start with a random partitioning
Refine it iteratively
The famous k-means partitioning algorithm:
Given: a set of n documents and the number k
Compute: a partition of k clusters that optimizes the chosen partitioning criterion
K-means
Assumes documents are real-valued vectors.
Clusters $C_i$ based on the centroid of the points $\vec{x}$ in a cluster (= the center of gravity or mean):
$\vec{\mu}(C_i) = \frac{1}{|C_i|} \sum_{\vec{x} \in C_i} \vec{x}$
Reassignment of instances to clusters tries to maximize cohesion (= minimize distance) to the current cluster centroids.
K-Means Algorithm
Select K points as initial centroids
Repeat
  Form K clusters by assigning each point to its closest centroid
  Recompute the centroid of each cluster
Until centroids do not change
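A compact sketch of this loop (my illustration; names are mine, distance is squared Euclidean, and an empty cluster simply keeps its old centroid):

```python
import random

def kmeans(points, k, max_iters=100):
    """Lloyd's k-means over equal-length numeric vectors."""
    centroids = [list(p) for p in random.sample(points, k)]
    for _ in range(max_iters):
        # Assignment step: each point goes to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = [
            [sum(dim) / len(cl) for dim in zip(*cl)] if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # stop when centroids do not change
            break
        centroids = new_centroids
    return centroids, clusters
```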
K-means: Different Issues
When to stop?
When a fixed number of iterations is reached
When centroid positions do not change
Seed Choice
Results can vary based on random seed selection.
Try out multiple starting points.
Example showing sensitivity to seeds:
[Figure: six points A–F. Starting with centroids B and E converges to one clustering; starting with centroids D and F converges to a different one.]
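A minimal way to “try out multiple starting points” (my sketch, reusing the `kmeans` function above and scoring each run by its sum of squared errors):

```python
def sse(centroids, clusters):
    """Sum of squared distances from points to their centroids (lower is tighter)."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[i]))
        for i, cl in enumerate(clusters) for p in cl
    )

def kmeans_restarts(points, k, restarts=10):
    """Run k-means several times with random seeds and keep the lowest-SSE run."""
    runs = [kmeans(points, k) for _ in range(restarts)]
    return min(runs, key=lambda run: sse(*run))
```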
K-Means in Orange
http://orange.biolab.si/
Machine learning software with a visual programming component
Starting tutorial at http://wiki.sdakak.com/ml:getting-started-with-orange
Hierarchical clustering
Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples.
[Example taxonomy:]
animal
  vertebrate: fish, reptile, amphib., mammal
  invertebrate: worm, insect, crustacean
Hierarchical Agglomerative Clustering
We assume there is a similarity function that determines the similarity of two instances.
Algorithm:
Start with all instances in their own cluster.
Until there is only one cluster:
  Among the current clusters, determine the two clusters, ci and cj, that are most similar.
  Replace ci and cj with a single cluster ci ∪ cj.
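A naive sketch of this algorithm (my illustration; it rescans all cluster pairs at every step, fine for small collections but not web-scale):

```python
def hac(items, item_sim, cluster_sim):
    """Agglomerative clustering; returns the merge history.

    `item_sim(x, y)` scores two instances; `cluster_sim(ci, cj, item_sim)`
    combines pairwise scores into a cluster-to-cluster score (next slide).
    """
    clusters = [[x] for x in items]
    merges = []
    while len(clusters) > 1:
        # Find the most similar pair of current clusters.
        i, j = max(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: cluster_sim(clusters[ab[0]], clusters[ab[1]], item_sim),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges
```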
What is the most similar cluster?
Single-link
Similarity of the closest pair of points, the most cosine-similar
Complete-link
Similarity of the “furthest” points, the least cosine-similar
Group-average agglomerative clustering
Average cosine between pairs of elements
Centroid clustering
Similarity of clusters’ centroids
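These criteria plug directly into the `cluster_sim` parameter of the `hac` sketch above; for example (illustrative):

```python
def single_link(ci, cj, item_sim):
    """Similarity of the closest pair across the two clusters."""
    return max(item_sim(x, y) for x in ci for y in cj)

def complete_link(ci, cj, item_sim):
    """Similarity of the furthest pair across the two clusters."""
    return min(item_sim(x, y) for x in ci for y in cj)
```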
Single link clustering
1) Use maximum similarity of pairs:
$sim(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} sim(x, y)$
2) After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:
$sim(c_i \cup c_j, c_k) = \max(sim(c_i, c_k), sim(c_j, c_k))$
Complete link clustering
1) Use minimum similarity of pairs:
$sim(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} sim(x, y)$
2) After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:
$sim(c_i \cup c_j, c_k) = \min(sim(c_i, c_k), sim(c_j, c_k))$
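Putting the sketches together (an illustrative run; `hac` and `single_link` are the hypothetical helpers defined above, and the similarity here is negative squared Euclidean distance rather than cosine):

```python
# Four tiny 2-D points: two close pairs, far apart from each other.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
euclid_sim = lambda x, y: -sum((a - b) ** 2 for a, b in zip(x, y))

# Prints the merge history: the two tight pairs merge first.
for a, b in hac(points, euclid_sim, single_link):
    print("merge", a, "+", b)
```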
Hierarchical Clustering in Orange
Major issue: labeling
After the clustering algorithm finds clusters, how can they be useful to the end user?
Need a concise label for each cluster
In search results, say “Animal” or “Car” in the jaguar example.
In topic trees (Yahoo), need navigational cues.
Often done by hand, a posteriori.
How to Label Clusters
Show titles of typical documents
Titles are easy to scan
Authors create them for quick scanning!
But you can only show a few titles, which may not fully represent the cluster
Show words/phrases prominent in cluster
More likely to fully represent the cluster
Use distinguishing words/phrases
But harder to scan
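One simple way to pick distinguishing words (a sketch of the idea; the scoring rule is my assumption, not a method from the slides): rank terms by how much heavier they are in the cluster than in the collection overall.

```python
def label_cluster(cluster_vecs, all_vecs, top_n=3):
    """Label a cluster with the terms most over-represented in it.

    Vectors are sparse dicts of term weights, as built earlier.
    """
    def mean_weight(vecs):
        totals = {}
        for v in vecs:
            for t, w in v.items():
                totals[t] = totals.get(t, 0.0) + w
        return {t: w / len(vecs) for t, w in totals.items()}

    in_cluster = mean_weight(cluster_vecs)
    overall = mean_weight(all_vecs)
    # Score each term by its cluster weight relative to the collection.
    scored = sorted(in_cluster,
                    key=lambda t: in_cluster[t] - overall.get(t, 0.0),
                    reverse=True)
    return scored[:top_n]
```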
Further issues
Complexity:
Clustering is computationally expensive.
Implementations need careful balancing of needs.
How to decide how many clusters are best?
Evaluating the “goodness” of clustering:
There are many techniques; some focus on implementation issues (complexity/time), some on the quality of the resulting clusters.