CSC446: Pattern Recognition
Lecture Note 9:
Unsupervised Learning
(Chapter 10, Pattern Classification)
CSC446 : Pattern Recognition
Prof. Dr. Mostafa Gadal-Haqq
Faculty of Computer & Information Science
Computer Science Department
AIN SHAMS UNIVERSITY
Unsupervised Learning & Clustering
10.1 Introduction
10.2 Mixture Densities Model
10.3 Maximum Likelihood Estimates
10.4 Application to Normal Mixtures
10.4.3 K-means algorithm
10.6 Data description and clustering
10.6.1 similarity measures
10.7 Criterion function for clustering
10.9 Hierarchical clustering
Introduction
• So far, we have investigated “Supervised Learning”, in which training samples were labeled.
𝑫 = { 𝒙𝒊 , 𝒕𝒊 ; 𝒊 = 𝟏, … , 𝒏}
• We now investigate a number of “Unsupervised Learning” procedures, in which unlabeled training samples are used.
𝑫 = { 𝒙𝒊 ; 𝒊 = 𝟏, … , 𝒏}
Introduction
Supervised learning vs. Unsupervised learning:
• Supervised learning: discover patterns in the data that relate data attributes to a target (class) attribute.
– These patterns are then utilized to predict the values of the target attribute in future data instances.
• Unsupervised learning: the data have no target attribute.
– We want to explore the data to find some intrinsic structure in them.
Introduction
• Motivations in unsupervised learning:
1. Labeling large amounts of sample data can be very costly. For example: speech recognition.
2. Train classifiers with large amounts of (less expensive) unlabeled samples, and only then use supervision to label the groupings found. For example: data mining.
3. Unsupervised learning can be used to track slowly varying features over time, which improves performance.
4. Unsupervised methods can be used to identify features that will then be useful for categorization.
5. Unsupervised procedures can be used to gain some insight into the nature (or structure) of the data.
Introduction
• Is it possible, in principle, to learn anything from unlabeled data?
• The answer to this question depends on the assumptions one is willing to accept!
• We shall begin with the assumption that the functional forms for the underlying probability densities are known and that the only thing that must be learned is the value of an unknown parameter vector.
Mixture Densities Model
• We make the following assumptions:
1. The samples come from a known number of classes c.
2. The prior probabilities P(ω_j) for each class are known (j = 1, …, c).
3. The forms of the class-conditional probability densities p(x | ω_j, θ_j) are known (j = 1, …, c).
4. The values of the c parameter vectors θ_1, θ_2, …, θ_c are unknown.
5. The category labels are unknown.
Mixture Densities Model
• Thus, the probability density function for the samples is given by:
  p(x|θ) = Σ_{j=1}^{c} p(x|ω_j, θ_j) P(ω_j)
– where θ = (θ_1, …, θ_c) is the parameter vector,
– the density function p(x|θ) is called the mixture density,
– the density functions p(x|ω_j, θ_j) are called the component densities, and
– the prior probabilities P(ω_j) are called the mixing parameters.
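To make this concrete, here is a minimal NumPy sketch (not from the slides) that evaluates a two-component univariate Gaussian mixture p(x|θ) = Σ_j p(x|ω_j, θ_j) P(ω_j); the priors, means and variances are hypothetical values chosen only for illustration.

    import numpy as np

    def gaussian_pdf(x, mean, var):
        # univariate normal density N(x; mean, var)
        return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

    def mixture_density(x, priors, means, variances):
        # p(x | theta) = sum_j P(w_j) * p(x | w_j, theta_j)
        return sum(P * gaussian_pdf(x, m, v)
                   for P, m, v in zip(priors, means, variances))

    # hypothetical two-component mixture
    priors, means, variances = [0.3, 0.7], [-2.0, 3.0], [1.0, 2.0]
    print(mixture_density(0.0, priors, means, variances))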
• Our basic goal will be to use samples drawn from this mixture density to estimate the unknown parameter vector θ.
• Once we know θ, we can decompose the mixture into its components, and
• we can then use a maximum a posteriori classifier on the derived component densities, if classification is our final goal.
Maximum Likelihood Estimates
• Suppose that we have a set D = {x_1, …, x_n} of n unlabeled samples drawn independently from the mixture density
  p(x|θ) = Σ_{j=1}^{c} p(x|ω_j, θ_j) P(ω_j)
  where θ is fixed but unknown.
• For i.i.d. data the likelihood is
  p(D|θ) = Π_{k=1}^{n} p(x_k|θ)
  and the log-likelihood is
  l(θ) = Σ_{k=1}^{n} ln p(x_k|θ)
• The gradient of the log-likelihood with respect to θ_i is
  ∇_{θ_i} l = Σ_{k=1}^{n} (1 / p(x_k|θ)) ∇_{θ_i} [ Σ_{j=1}^{c} p(x_k|ω_j, θ_j) P(ω_j) ]
• We assume that the elements of θ_i and θ_j are functionally independent if i ≠ j, and we introduce the posterior probability
  P(ω_i|x_k, θ) = p(x_k|ω_i, θ_i) P(ω_i) / p(x_k|θ)
• Then we can write:
  ∇_{θ_i} l = Σ_{k=1}^{n} P(ω_i|x_k, θ) ∇_{θ_i} ln p(x_k|ω_i, θ_i)    (1)
• Then θ_i is estimated by maximizing l, i.e., by solving
  Σ_{k=1}^{n} P(ω_i|x_k, θ̂) ∇_{θ_i} ln p(x_k|ω_i, θ̂_i) = 0
Maximum Likelihood Estimates
• We can generalize these results to include the prior probabilities P(ω_i) among the unknown quantities.
• Then the MLE can be summarized as: find the parameters θ̂_i and the priors P̂(ω_i) such that
  P̂(ω_i) = (1/n) Σ_{k=1}^{n} P̂(ω_i|x_k, θ̂)    (1)
  and
  Σ_{k=1}^{n} P̂(ω_i|x_k, θ̂) ∇_{θ_i} ln p(x_k|ω_i, θ̂_i) = 0    (2)
  where
  P̂(ω_i|x_k, θ̂) = p(x_k|ω_i, θ̂_i) P̂(ω_i) / Σ_{j=1}^{c} p(x_k|ω_j, θ̂_j) P̂(ω_j)    (3)
Applications to Normal Mixtures
• The following table shows a few of the different cases that can arise depending upon which parameters are unknown (?) and which are known (×).

  Case   μ_i   Σ_i   P(ω_i)   c
   1      ?     ×      ×      ×
   2      ?     ?      ?      ×
   3      ?     ?      ?      ?
Case 1: Unknown mean vectors (θ_i = μ_i)
• The log-likelihood of a component is
  ln p(x|ω_i, μ_i) = −ln[(2π)^{d/2} |Σ_i|^{1/2}] − (1/2)(x − μ_i)^t Σ_i^{-1}(x − μ_i)
• The gradient of the log-likelihood is
  ∇_{μ_i} ln p(x|ω_i, μ_i) = Σ_i^{-1}(x − μ_i)
• Thus, the ML estimate of μ_i, according to Eq. 1, must satisfy
  μ̂_i = Σ_{k=1}^{n} P(ω_i|x_k, μ̂) x_k / Σ_{k=1}^{n} P(ω_i|x_k, μ̂)    (2)
• Unfortunately, Eq. 2 does not give μ̂_i explicitly.
• However, if we have some way of obtaining good initial estimates μ̂_i(0) for the unknown means, Eq. 2 can be seen as an iterative process for improving the estimates:
  μ̂_i(j+1) = Σ_{k=1}^{n} P(ω_i|x_k, μ̂(j)) x_k / Σ_{k=1}^{n} P(ω_i|x_k, μ̂(j))
• This is a gradient ascent or hill-climbing procedure for maximizing the log-likelihood function.
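As an illustration only, a minimal sketch of this iterative mean update. It assumes spherical, unit-variance Gaussian components and known priors; those assumptions, the synthetic data and the starting values μ̂(0) are mine, not the slides'.

    import numpy as np

    def posteriors(X, mus, priors):
        # P(w_i | x_k, mu) for unit-variance spherical Gaussians (assumption)
        likes = np.exp(-0.5 * ((X[:, None, :] - mus[None, :, :]) ** 2).sum(-1))
        num = priors * likes                         # n x c
        return num / num.sum(axis=1, keepdims=True)

    def update_means(X, mus, priors, n_iter=50):
        # mu_i(j+1) = sum_k P(w_i|x_k, mu(j)) x_k / sum_k P(w_i|x_k, mu(j))
        for _ in range(n_iter):
            post = posteriors(X, mus, priors)        # n x c
            mus = (post.T @ X) / post.sum(axis=0)[:, None]
        return mus

    X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4.0])
    mus0 = np.array([[1.0, 1.0], [3.0, 3.0]])        # initial guesses mu(0)
    print(update_means(X, mus0, np.array([0.5, 0.5])))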
The k-means clustering algorithm
• We are tempted to call it the c-means procedure, since its goal is to find the c mean vectors μ_1, μ_2, …, μ_c. However, it is now popular under the name k-means.
• It is clear that the posterior probability P̂(ω_i|x_k, μ̂) is large when the squared Mahalanobis distance (x_k − μ̂_i)^t Σ̂_i^{-1}(x_k − μ̂_i) is small.
• Suppose instead that we compute the squared Euclidean distance ‖x_k − μ̂_i‖²,
• find the mean μ̂_m nearest to x_k, and approximate the posterior as
  P̂(ω_i|x_k, μ̂) ≈ 1 if i = m, and 0 otherwise.
– Then use the iterative scheme above to find μ̂_1, μ̂_2, …, μ̂_c.
• If we denote the number of patterns as n and the desired number of clusters as c, the k-means algorithm is:
  Begin
    initialize n, c, μ_1, μ_2, …, μ_c
    do  classify the n samples according to the nearest μ_i
        recompute μ_i
    until no change in μ_i
    return μ_1, μ_2, …, μ_c
  End
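A compact NumPy sketch of the loop above. The initialization from c randomly chosen samples and the synthetic two-blob data are hypothetical choices, not part of the slide.

    import numpy as np

    def k_means(X, c, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        mus = X[rng.choice(len(X), size=c, replace=False)]    # initialize mu_1..mu_c
        labels = np.zeros(len(X), dtype=int)
        for _ in range(max_iter):
            # classify the n samples according to the nearest mu_i
            labels = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(-1).argmin(axis=1)
            # recompute mu_i (keep the old mean if a cluster becomes empty)
            new_mus = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                else mus[i] for i in range(c)])
            if np.allclose(new_mus, mus):                     # until no change in mu_i
                break
            mus = new_mus
        return mus, labels

    X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 5.0])
    mus, labels = k_means(X, c=2)
    print(mus)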
Data Description and Clustering
• If the goal is to find subclasses, a more direct alternative is to use a clustering procedure, which produces a data description in terms of clusters, or groups of data points that possess strong internal similarities.
• Formal clustering procedures use a criterion function, such as the sum of the squared distances from the cluster centers, and seek the grouping that extremizes the criterion function.
Data Description and Clustering
• To define what we mean by a natural grouping, we need to answer the following two questions:
1. How should we measure the similarity between samples?
2. How should we evaluate a partitioning of a set of samples into clusters?
• The most obvious measure of similarity (or dissimilarity) between two samples is a metric distance between them.
• Once such a distance is defined, one would expect the distance between samples of the same cluster to be significantly less than the distance between samples in different clusters.
Similarity Measures: Metric Distance
• A metric D(·, ·) is a function that gives a generalized scalar distance between two arguments (patterns).
• A metric must have four properties:
– non-negativity: D(a, b) ≥ 0
– reflexivity: D(a, b) = 0 if and only if a = b
– symmetry: D(a, b) = D(b, a)
– triangle inequality: D(a, b) + D(b, c) ≥ D(a, c)
• The most popular metric functions (in d dimensions) are:
– The Manhattan or city-block distance:
  C(a, b) = Σ_{i=1}^{d} |a_i − b_i|
– The Euclidean distance:
  D(a, b) = ( Σ_{k=1}^{d} (a_k − b_k)² )^{1/2}
– The Minkowski distance or Lk norm:
  L_k(a, b) = ( Σ_{i=1}^{d} |a_i − b_i|^k )^{1/k}
• Note that the L1 and L2 norms give the Manhattan and Euclidean metrics, respectively.
• [Figure: the Minkowski distance for different values of k.]
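For illustration, a small NumPy sketch of the three metrics (the function names are mine):

    import numpy as np

    def manhattan(a, b):
        # city-block / L1 distance
        return np.sum(np.abs(a - b))

    def euclidean(a, b):
        # L2 distance
        return np.sqrt(np.sum((a - b) ** 2))

    def minkowski(a, b, k):
        # Lk norm: k = 1 gives Manhattan, k = 2 gives Euclidean
        return np.sum(np.abs(a - b) ** k) ** (1.0 / k)

    a, b = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
    print(manhattan(a, b), euclidean(a, b), minkowski(a, b, 3))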
Similarity Measures
• To achieve invariance, one can normalize the data, e.g., so that they all have zero mean and unit variance, or use principal components for invariance to rotation.
• One can also use a nonmetric similarity function s(x, x') to compare two vectors.
• For example, the normalized inner product (cosine similarity):
  s(x, x') = xᵗx' / (‖x‖ ‖x'‖)
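A one-function sketch of this normalized inner product (the example vectors are arbitrary):

    import numpy as np

    def cosine_similarity(x, xp):
        # s(x, x') = x^t x' / (||x|| ||x'||)
        return float(x @ xp) / (np.linalg.norm(x) * np.linalg.norm(xp))

    print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))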
• The Euclidean distance is one possible metric: a possible criterion is to assign samples to the same cluster if their distance is less than a threshold value d0.
• Clusters defined by Euclidean distance are invariant to translations and rotations of the feature space, but not invariant to general transformations that distort the distance relationships.
Criterion Functions for Clustering
• The second issue: how should we evaluate a partitioning of a set into clusters?
• Clustering can be posed as the optimization of a criterion function.
– Scatter criteria: the sum-of-squared-error criterion
  J_e = Σ_{i=1}^{c} Σ_{x∈D_i} ‖x − m_i‖²
– where n_i is the number of samples in D_i and m_i is the mean of those samples:
  m_i = (1/n_i) Σ_{x∈D_i} x
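A minimal sketch that evaluates J_e for a given labeling; it assumes the labels are integers 0, …, c−1 (a convention of mine, not the slides'):

    import numpy as np

    def sse_criterion(X, labels, c):
        # J_e = sum_i sum_{x in D_i} ||x - m_i||^2
        Je = 0.0
        for i in range(c):
            Di = X[labels == i]
            if len(Di):
                mi = Di.mean(axis=0)           # m_i = (1/n_i) sum_{x in D_i} x
                Je += np.sum((Di - mi) ** 2)
        return Je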
• This criterion defines clusters by their mean vectors m_i, in the sense that it minimizes the sum of the squared lengths of the errors x − m_i.
• The optimal partition is defined as the one that minimizes J_e; it is also called the minimum-variance partition.
• It works fine when the clusters form well-separated compact clouds, but less well when there are great differences in the number of samples in different clusters.
• Scatter criteria
– These use the scatter matrices of multiple discriminant analysis, i.e., the within-cluster scatter matrix S_W and the between-cluster scatter matrix S_B, with
  S_T = S_B + S_W
  where the total scatter matrix S_T depends only on the set of samples (not on the partitioning).
– The criterion can be to minimize the within-cluster scatter or to maximize the between-cluster scatter.
– The trace (sum of diagonal elements) is the simplest scalar measure of a scatter matrix, as it is proportional to the sum of the variances in the coordinate directions.
• The trace of the within-cluster scatter matrix is
  tr[S_W] = Σ_{i=1}^{c} Σ_{x∈D_i} ‖x − m_i‖² = J_e
  which is in practice the sum-of-squared-error criterion.
• As tr[S_T] = tr[S_W] + tr[S_B] and tr[S_T] is independent of the partitioning, no new results are obtained by treating tr[S_B] as a separate criterion.
• Rather, seeking to minimize the within-cluster criterion J_e = tr[S_W] is equivalent to maximizing the between-cluster criterion
  tr[S_B] = Σ_{i=1}^{c} n_i ‖m_i − m‖²
  where m is the total mean vector:
  m = (1/n) Σ_x x = (1/n) Σ_{i=1}^{c} n_i m_i
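A sketch (under the same hypothetical labeling convention as above) that builds S_W and S_B and lets one check numerically that tr[S_W] equals J_e and that S_T = S_W + S_B:

    import numpy as np

    def scatter_matrices(X, labels, c):
        # assumes every label 0..c-1 occurs at least once
        d = X.shape[1]
        m = X.mean(axis=0)                     # total mean vector
        SW = np.zeros((d, d))
        SB = np.zeros((d, d))
        for i in range(c):
            Di = X[labels == i]
            mi = Di.mean(axis=0)
            SW += (Di - mi).T @ (Di - mi)              # within-cluster scatter
            SB += len(Di) * np.outer(mi - m, mi - m)   # between-cluster scatter
        ST = (X - m).T @ (X - m)               # total scatter: ST = SW + SB
        return SW, SB, ST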
Iterative optimization
• Once a criterion function has been selected, clustering becomes a problem of discrete optimization.
• As the sample set is finite, there is a finite number of possible partitions, and the optimal one can always be found by exhaustive search.
• Most frequently, an iterative optimization procedure is adopted to select the partition.
• The basic idea is to start from a reasonable initial partition and "move" samples from one cluster to another, trying to minimize the criterion function.
• In general, these kinds of approaches guarantee local, NOT global, optimization.
• Let us consider an iterative procedure to minimize the sum-of-squared-error criterion
  J_e = Σ_{i=1}^{c} J_i,   where   J_i = Σ_{x∈D_i} ‖x − m_i‖²
  and J_i is the effective error per cluster.
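The slide illustrating this step is not reproduced in the transcript; as one possible (hypothetical) realization of the idea, the sketch below tentatively moves single samples between clusters and keeps a move only if it lowers J_e. It is a sketch of the general approach, not the textbook's specific update rule.

    import numpy as np

    def sse(X, labels, c):
        # J_e for an integer labeling 0..c-1
        return sum(np.sum((X[labels == i] - X[labels == i].mean(axis=0)) ** 2)
                   for i in range(c) if np.any(labels == i))

    def improve_partition(X, labels, c, max_passes=10):
        # move samples one at a time, keeping only moves that decrease J_e
        for _ in range(max_passes):
            changed = False
            for k in range(len(X)):
                best_i, best_J = labels[k], sse(X, labels, c)
                for i in range(c):
                    if i == labels[k]:
                        continue
                    trial = labels.copy()
                    trial[k] = i
                    J = sse(X, trial, c)
                    if J < best_J:
                        best_i, best_J = i, J
                if best_i != labels[k]:
                    labels[k] = best_i
                    changed = True
            if not changed:
                break
        return labels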
• This procedure is a sequential version of the k-means algorithm, with the difference that k-means waits until all n samples have been reclassified before updating, whereas the latter updates each time a sample is reclassified.
• This procedure is more prone to being trapped in local minima, and it depends on the order of presentation of the samples.
• The starting point is always a problem:
  • Random cluster centers.
  • Repetition with different random initializations.
  • Use, as the c-cluster starting point, the solution of the (c−1)-cluster problem plus the sample farthest from the nearest cluster center.
Hierarchical Clustering
• Many times, clusters are not disjoint: a cluster may have subclusters, which in turn have sub-subclusters, etc.
• Consider a sequence of partitions of the n samples into c clusters:
  • The first is a partition into n clusters, each containing exactly one sample.
  • The second is a partition into n−1 clusters, the third into n−2, and so on, until the n-th, in which there is only one cluster containing all of the samples.
  • At level k in the sequence, c = n − k + 1.
• Given any two samples x and x', they will be grouped together at some level, and once they are grouped at level k, they remain grouped at all higher levels.
• Hierarchical clustering has a tree representation called a dendrogram.
• The similarity values may help to determine whether the groupings are natural or forced; but if they are evenly distributed, no such information can be gained.
• Another representation is based on sets, e.g., on Venn diagrams.
• Hierarchical clustering can be divided into agglomerative and divisive methods.
• Agglomerative (bottom-up, clumping): start with n singleton clusters and form the sequence by merging clusters.
• Divisive (top-down, splitting): start with all of the samples in one cluster and form the sequence by successively splitting clusters.
Agglomerative hierarchical clustering
• The procedure terminates when the specified number of clusters has been obtained, and it returns the clusters as sets of points, rather than a mean or a representative vector for each cluster.
• At any level, the distance between the nearest clusters can provide the dissimilarity value for that level.
• To find the nearest clusters, one can use:
  d_min(D_i, D_j) = min_{x∈D_i, x'∈D_j} ‖x − x'‖
  d_max(D_i, D_j) = max_{x∈D_i, x'∈D_j} ‖x − x'‖
  d_avg(D_i, D_j) = (1/(n_i n_j)) Σ_{x∈D_i} Σ_{x'∈D_j} ‖x − x'‖
  d_mean(D_i, D_j) = ‖m_i − m_j‖
  which behave quite similarly if the clusters are hyperspherical and well separated.
• The computational complexity is O(c n² d²), with n >> c.
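A direct NumPy sketch of the four cluster-to-cluster distances; each cluster is passed as an array of points, and the function names are mine:

    import numpy as np

    def d_min(Di, Dj):
        # nearest-neighbor (single-link) distance
        return min(np.linalg.norm(x - xp) for x in Di for xp in Dj)

    def d_max(Di, Dj):
        # farthest-neighbor (complete-link) distance
        return max(np.linalg.norm(x - xp) for x in Di for xp in Dj)

    def d_avg(Di, Dj):
        # average of all pairwise distances
        return (sum(np.linalg.norm(x - xp) for x in Di for xp in Dj)
                / (len(Di) * len(Dj)))

    def d_mean(Di, Dj):
        # distance between the cluster means
        return np.linalg.norm(np.mean(Di, axis=0) - np.mean(Dj, axis=0))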
Nearest-neighbor algorithm
• When d_min is used, the algorithm is called the nearest-neighbor algorithm.
• If it is terminated when the distance between the nearest clusters exceeds an arbitrary threshold, it is called the single-linkage algorithm.
• If data points are thought of as nodes of a graph, with edges forming a path between the nodes in the same subset D_i, the merging of D_i and D_j corresponds to adding an edge between the nearest pair of nodes in D_i and D_j.
• The resulting graph has no closed loops and is a tree; if all subsets are linked, we have a spanning tree.
• The use of d_min as the distance measure in agglomerative clustering generates a minimal spanning tree.
• Chaining effect: a defect of this distance measure.
The farthest-neighbor algorithm
• When d_max is used, the algorithm is called the farthest-neighbor algorithm.
• If it is terminated when the distance between the nearest clusters exceeds an arbitrary threshold, it is called the complete-linkage algorithm.
• This method discourages the growth of elongated clusters.
• In the terminology of graph theory, every cluster constitutes a complete subgraph, and the distance between two clusters is determined by the most distant nodes in the two clusters.
• When two clusters are merged, the graph is changed by adding edges between every pair of nodes in the two clusters.
• All the procedures involving minima or maxima are sensitive to outliers; using d_mean or d_avg is a natural compromise.
The problem of the number of clusters
• Typically, the number of clusters is known.
• When it is not, there are several ways to proceed.
• When clustering is done by extremizing a criterion function, a common approach is to repeat the clustering with c = 1, c = 2, c = 3, etc., and examine how the criterion changes.
• Another approach is to set a threshold for the creation of a new cluster; this is suited to on-line cases, but it depends on the order of presentation of the data.
• These approaches are similar to model-selection procedures, typically used to determine the topology and number of states (e.g., clusters, parameters) of a model, given a specific application.
Graph-theoretic methods
• Graph theory makes it possible to exploit particular structures in the data.
• The procedure of setting a distance threshold for placing two points in the same cluster can be generalized to arbitrary similarity measures.
• If s0 is a threshold value, we can say that x_i is similar to x_j if s(x_i, x_j) > s0.
• Hence, we define a similarity matrix S = [s_ij], with
  s_ij = 1 if s(x_i, x_j) > s0, and 0 otherwise.
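A sketch of the thresholded similarity matrix, plus the connected components of the induced graph, which are exactly the single-linkage clusters discussed on the next slide. The similarity function s is passed in; everything else is a hypothetical choice of mine.

    import numpy as np

    def similarity_matrix(X, s, s0):
        # s_ij = 1 if s(x_i, x_j) > s0, 0 otherwise
        n = len(X)
        return np.array([[1 if s(X[i], X[j]) > s0 else 0 for j in range(n)]
                         for i in range(n)])

    def connected_components(S):
        # cluster labels = connected components of the similarity graph
        n = len(S)
        label = [-1] * n
        current = 0
        for start in range(n):
            if label[start] != -1:
                continue
            stack, label[start] = [start], current
            while stack:
                i = stack.pop()
                for j in range(n):
                    if S[i][j] and label[j] == -1:
                        label[j] = current
                        stack.append(j)
            current += 1
        return label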
• This matrix induces a similarity graph, dual to S, in which nodes correspond to points and an edge joins nodes i and j iff s_ij = 1.
• Single-linkage algorithm: two samples x and x' are in the same cluster if there exists a chain x, x1, x2, …, xk, x' such that x is similar to x1, x1 to x2, and so on; the clusters are the connected components of the graph.
• Complete-linkage algorithm: all samples in a given cluster must be similar to one another, and no sample can be in more than one cluster.
• The nearest-neighbor algorithm is a method for finding the minimum spanning tree, and vice versa.
– Removal of the longest edge produces a 2-cluster grouping, removal of the next longest edge produces a 3-cluster grouping, and so on.
• This is a divisive hierarchical procedure, and it suggests ways of dividing the graph into subgraphs.
– E.g., in selecting an edge to remove, compare its length with the lengths of the other edges incident on its nodes.
• One useful statistic to estimate from the minimal spanning tree is the edge-length distribution.
• For instance, consider the case of two dense clusters immersed in a sparse set of points.
What’s Next?
Pattern Recognition Application:
Speech Recognition