Clustering I - Sharif
Machine Learning
Clustering I
Hamid R. Rabiee, Jafar Muhammadi, Nima Pourdamghani
Spring 2015
http://ce.sharif.edu/courses/93-94/2/ce717-1
Agenda
Unsupervised Learning
Quality Measurement
Similarity Measures
Major Clustering Approaches
Distance Measuring
Partitioning Methods
Hierarchical Methods
Density Based Methods
Spectral Clustering
Other Methods
Constraint Based Clustering
Clustering as Optimization
Unsupervised Learning
Clustering or unsupervised classification is aimed at discovering natural groupings in a set
of data.
Note: All samples in the training set are unlabeled.
Applications for clustering:
Spatial data analysis: Create thematic maps in GIS by clustering feature space
Image processing: Segmentation
Economic science: Discover distinct groups in customer bases
Internet: Document classification
Classifier design: gain insight into the structure of the data prior to designing a classifier
Quality Measurement
High quality clusters must have
high intra-class similarity
low inter-class similarity
Some other measures
Ability to discover hidden patterns
Judged by the user
Purity
Suppose we know the labels of the data; assign to each cluster its most frequent class
Purity is the number of correctly assigned points divided by the total number of data points (a small sketch follows)
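To make the definition concrete, here is a minimal sketch of the purity computation (a hypothetical helper, not course code; it assumes integer-coded cluster assignments and true labels):

```python
import numpy as np

def purity(clusters, labels):
    """Assign each cluster its most frequent true class, then count
    the fraction of correctly assigned points."""
    clusters = np.asarray(clusters)
    labels = np.asarray(labels)
    correct = 0
    for c in np.unique(clusters):
        members = labels[clusters == c]        # true labels inside cluster c
        correct += np.bincount(members).max()  # size of the majority class
    return correct / len(labels)

# Example: 6 points in 2 clusters; 5 of 6 points match their cluster's majority class
print(purity([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]))  # 5/6 ~ 0.833
```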
Similarity Measures
Distances are normally used to measure the similarity or dissimilarity between two
data objects
Some popular distances are Minkowski and Mahalanobis.
Distance between binary strings (Hamming distance):
$$d(S_1, S_2) = |\{\, i : s_{1,i} \neq s_{2,i} \,\}|$$
Distance between vector objects (cosine similarity):
$$d(X, Y) = \frac{X^T Y}{\|X\|\,\|Y\|}$$
(A short sketch of these distances follows.)
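For concreteness, a short sketch of these distances via SciPy (standard scipy.spatial.distance calls; the inverse covariance for Mahalanobis is estimated here from illustrative random data):

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

print(distance.minkowski(x, y, p=2))        # Minkowski with p=2 (Euclidean)
VI = np.linalg.inv(np.cov(np.random.randn(100, 3), rowvar=False))
print(distance.mahalanobis(x, y, VI))       # Mahalanobis; VI = inverse covariance

s1, s2 = [1, 0, 1, 1], [1, 1, 1, 0]
print(distance.hamming(s1, s2) * len(s1))   # count of differing positions
print(1 - distance.cosine(x, y))            # cosine similarity X.Y / (|X||Y|)
```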
Major Clustering Approaches
Partitioning approach
Construct various partitions and then evaluate them by some criterion (ex. k-means, c-means, k-medoids)
Hierarchical approach
Create a hierarchical decomposition of the set of data using some criterion (ex. AGNES)
Density-based approach
Based on connectivity and density functions (ex. DBSCAN, OPTICS)
Graph-based approach (Spectral Clustering)
approximately optimizing the normalized cut criterion
Grid-based approach
based on a multiple-level granularity structure (ex. STING, WaveCluster, CLIQUE)
Model-based
A model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the models (ex. EM, SOM)
Distance Measuring
Single link
smallest distance between an element in one cluster
and an element in the other
Complete link
largest distance between an element in one cluster
and an element in the other
Average
avg distance between an element in one cluster and an element in the other
Centroid
distance between the centroids of two clusters
Used in k-means
Medoid
distance between the medoids of two clusters
Medoid: A representative object whose average dissimilarity to all the objects in the cluster is minimal
Partitioning Methods
Construct a partition of n data points into a set of k clusters so as to minimize the sum of squared distances
$$\min \sum_{m=1}^{k} \sum_{x_j \in \text{Cluster}_m} \big(x_j - C_m\big)^2$$
where the C_m are the cluster representatives.
Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
Global optimal: exhaustively enumerate all partitions
Heuristic methods: k-means, c-means and k-medoids algorithms
k-means: Each cluster is represented by the center of the cluster
c-means: The fuzzy version of k-means
k-medoids: Each cluster is represented by one of the samples in the cluster
Partitioning Methods: k-means
k-means
Suppose we know there are K categories and each category is represented by its
sample mean
Given a set of unlabeled training samples, how to estimate the means?
Algorithm k-means(k)
1. Partition the samples into k non-empty subsets (random initialization)
2. Compute the mean points of the clusters of the current partition
3. Assign each sample to the cluster with the nearest mean point
4. Go back to Step 2; stop when the assignments no longer change
(A minimal NumPy sketch follows.)
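As an illustration, a minimal NumPy sketch of this loop (a hypothetical implementation, not the course's reference code; it initializes the means from k random samples, a common variant of step 1):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1 (variant): initialize the means with k distinct random samples
    means = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assign = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 3: assign each sample to the cluster with the nearest mean
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_assign = d2.argmin(axis=1)
        if np.array_equal(new_assign, assign):  # Step 4: no new assignments -> stop
            break
        assign = new_assign
        # Step 2: recompute the mean point of each non-empty cluster
        for j in range(k):
            if np.any(assign == j):
                means[j] = X[assign == j].mean(axis=0)
    return means, assign
```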
Partitioning Methods: k-means
Some notes on k-means
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers (Why?)
Not suitable to discover clusters with non-convex shapes (Why?)
Algorithm is sensitive to
number of cluster centers,
choice of initial cluster centers
sequence in which data are processed (Why?)
Convergence to the global optimum is not guaranteed, but results are acceptable when the clusters are well separated
Partitioning Methods: c-means
The membership function μ_il expresses to what degree x_l belongs to class C_i.
Crisp clustering: x_l can belong to one class only:
$$\mu_{il} = \begin{cases} 1 & \text{if } x_l \in C_i \\ 0 & \text{if } x_l \notin C_i \end{cases}$$
Fuzzy clustering: x_l belongs to all classes simultaneously with varying degrees of membership:
$$\mu_{il} = \frac{\big(1/d(z_i^{(m)}, x_l)\big)^{1/(q-1)}}{\sum_{j=1}^{k} \big(1/d(z_j^{(m)}, x_l)\big)^{1/(q-1)}}$$
where the z^{(m)} are the cluster means and q is a fuzziness index with 1 < q < 2.
Fuzzy clustering becomes crisp clustering when q → 1.
Observe that
$$\sum_{i=1}^{k} \mu_{il} = 1, \qquad l = 1, 2, \ldots, N.$$
c-means minimizes
$$J_e^f = \sum_{i=1}^{k} J_i^f, \qquad J_i^f = \sum_{l=1}^{N} \mu_{il}^{\,q}\, \big(z_i^{(m)} - x_l\big)^2$$
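A sketch of these updates (assuming squared Euclidean distance for d; the eps guard against division by zero is an added safeguard, not part of the slide):

```python
import numpy as np

def fcm_memberships(X, Z, q=1.5, eps=1e-12):
    """mu[i, l]: degree to which sample x_l belongs to cluster i,
    given the current cluster means Z (k x d) and fuzziness index q."""
    d = ((Z[:, None, :] - X[None, :, :]) ** 2).sum(axis=2) + eps  # d(z_i, x_l)
    u = d ** (-1.0 / (q - 1.0))               # (1/d)^(1/(q-1))
    return u / u.sum(axis=0, keepdims=True)   # normalize: each column sums to 1

def fcm_means(X, mu, q=1.5):
    """Minimizing J_f with respect to the means gives a mu^q-weighted average."""
    w = mu ** q
    return (w @ X) / w.sum(axis=1, keepdims=True)
```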
Partitioning Methods: k-medoids
k-medoids
Instead of taking the mean value of the samples in a cluster as a reference point, medoids can be used
Note that choosing the new medoids is slightly different from choosing the new means in the k-means algorithm
Algorithm k-medoids(k)
1. Select k representative samples arbitrarily
2. Associate each data point with the closest medoid
3. For each medoid m and data point o: swap m and o and compute the total cost of the configuration
4. Select the configuration with the lowest cost
5. Repeat steps 2-4 until there is no change
(A compact sketch follows.)
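A compact sketch of this swap procedure (a hypothetical PAM-style implementation; the total cost is taken here as the sum of distances to the nearest medoid):

```python
import numpy as np

def k_medoids(X, k, max_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(max_iter):
        cost = D[:, medoids].min(axis=1).sum()   # total cost of current configuration
        best = (cost, medoids)
        # Try swapping each medoid with each non-medoid data point
        for i in range(k):
            for o in range(len(X)):
                if o in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = o
                c = D[:, trial].min(axis=1).sum()
                if c < best[0]:
                    best = (c, trial)
        if best[0] >= cost:      # no improving swap: converged
            break
        medoids = best[1]
    labels = D[:, medoids].argmin(axis=1)  # associate points with the closest medoid
    return medoids, labels
```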
Partitioning Methods: k-medoids
Some notes on k-medoids
k-medoids is more robust than k-means in the presence of noise and outliers (Why?)
Works effectively for small data sets, but does not scale well to large data sets
For large data sets we can use sampling-based methods (How?)
Hierarchical Methods
Clusters have sub-clusters and sub-clusters can have sub-sub-clusters, …
Use distance matrix as clustering criteria.
This method does not require the number of clusters k as an input, but needs a
termination condition
[Figure: five points a, b, c, d, e. Agglomerative clustering (AGNES) proceeds Step 0 → Step 3, merging {a, b} and {d, e}, then {c, d, e}, then {a, b, c, d, e}; divisive clustering (DIANA) runs the same steps in reverse order.]
Hierarchical Methods
Agglomerative Hierarchical Clustering
AGNES (Agglomerative Nesting)
Uses the Single-Link method
Merge nodes (clusters) that have the maximum similarity
Divisive Hierarchical Clustering
DIANA (Divisive Analysis)
Inverse order of AGNES
Eventually each node forms a cluster on its own
Hierarchical Methods
Dendrogram
Shows how the clusters are merged
Decomposes the samples into several levels of nested partitioning (a tree of clusters), called a dendrogram
A clustering of the samples is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster
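As a sketch, SciPy's hierarchical-clustering routines implement exactly this workflow: build the merge tree with a chosen linkage criterion, then cut the dendrogram at the desired level (the data below are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.randn(20, 2)               # illustrative data

Z = linkage(X, method='single')          # AGNES-style merging with single link
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree into 3 clusters
# dendrogram(Z) plots the merge tree (requires matplotlib)
```

The methods 'complete', 'average', and 'centroid' select the other inter-cluster distance criteria listed under Distance Measuring.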
Density Based Methods
Clustering based on density (local cluster criterion), such as density-connected
points
Major features:
Discover clusters of arbitrary shapes
Handle noise
Need density parameters as termination condition
Density Based Methods
Main Concepts:
parameters:
Eps: Maximum radius of the neighborhood
MinPts: Minimum number of points in an Eps-neighbourhood
of that point
Sample q is directly density-reachable from sample p, if
d(p,q)<=Eps and p has MinPts points in its neighborhood.
Sample q is density-reachable from a sample p if there is a chain
of points p1, …, pn, p1 = p, pn = q such that pi+1 is directly density-
reachable from pi
Sample p is density-connected to sample q if there is a sample o
such that both, p and q are density-reachable from o.
Density Based Methods: DBSCAN
DBSCAN (Density Based Spatial Clustering of Applications with Noise)
Relies on a density-based notion of cluster: A cluster is defined as a maximal set of
density-connected points
Discovers clusters of arbitrary shape in spatial data with noise
Algorithm DBSCAN(Eps, MinPts)
Arbitrarily select a sample p.
Retrieve all samples density-reachable from p w.r.t. Eps and MinPts.
If p is a core sample (at least MinPts samples lie in its Eps-neighborhood), a cluster is formed.
If p is a border sample (no samples are density-reachable from p), DBSCAN visits the next sample of the database.
Continue the process until all of the samples have been processed.
(A sketch with scikit-learn follows.)
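A sketch using scikit-learn's DBSCAN, whose eps and min_samples parameters play the roles of Eps and MinPts (illustrative data; the label -1 marks noise samples):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.randn(200, 2)              # illustrative data

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)      # eps ~ Eps, min_samples ~ MinPts
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # clusters found (-1 = noise)
```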
Graph-based Clustering
Represent data points as the vertices V of a graph G.
All pairs of vertices are connected by an edge E.
Edges have weights W.
Large weights mean that the adjacent vertices are very similar; small weights imply
dissimilarity.
Graph-based Clustering
Clustering on a graph is equivalent to partitioning the vertices of the graph.
A loss function for a partition of V into sets A and B:
$$\mathrm{cut}(A, B) = \sum_{u \in A,\, v \in B} W_{u,v}$$
In a good partition, vertices in different partitions will be dissimilar.
Mincut criterion: find a partition A, B that minimizes cut(A, B)
The mincut criterion ignores the size of the subgraphs formed
Graph-based Clustering
The normalized cut criterion favors balanced partitions:
$$\mathrm{Ncut}(A, B) = \frac{\mathrm{cut}(A, B)}{\sum_{u \in A,\, v \in V} W_{u,v}} + \frac{\mathrm{cut}(A, B)}{\sum_{u \in B,\, v \in V} W_{u,v}}$$
Minimizing the normalized cut criterion exactly is NP-hard.
One way of approximately optimizing the normalized cut criterion leads to spectral clustering.
Spectral Clustering
Spectral clustering
Looks for a new representation of the original data points such that
the edge weights are preserved, and
convex cluster shapes in the new space represent non-convex ones in the original space.
Cluster the points in the new space using any clustering scheme (say k-means).
We only describe the resulting algorithm here.
For more information about the derivations, refer to U. von Luxburg, "A Tutorial on Spectral Clustering".
Spectral Clustering
Inputs
A set of points S = {s_1, ..., s_n}, s_i ∈ R^l, and the number of clusters k
Algorithm
Form the edge weights matrix W ∈ R^{n×n}; for example
$$W_{ij} = \begin{cases} \exp\big(-\|s_i - s_j\|^2 / 2\sigma^2\big) & \text{if } i \neq j \\ 0 & \text{else} \end{cases}$$
with the scaling parameter σ chosen by the user
Define D, a diagonal matrix whose (i, i) element is the sum of W's row i
Form the matrix
$$L = D^{-1/2}\, W\, D^{-1/2}$$
Find the k largest eigenvectors of L to form the matrix X ∈ R^{n×k}
Spectral Clustering
Algorithm (cont.)
Normalize the rows of X_{n×k} to unit length to form the matrix Y_{n×k}:
$$Y_{ij} = X_{ij} \Big/ \Big( \sum_j X_{ij}^2 \Big)^{1/2}$$
Treat each row of Y as a point in R^k (data dimensionality reduction from n to k)
Cluster the new data into k clusters via k-means
(A NumPy transcription follows.)
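A direct NumPy transcription of the algorithm above (a sketch: sigma is the user-chosen scaling parameter, and the final step reuses scikit-learn's k-means):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(S, k, sigma=1.0):
    # Edge-weight matrix W with Gaussian affinities and zero diagonal
    d2 = ((S[:, None, :] - S[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # D: diagonal matrix of row sums; L = D^{-1/2} W D^{-1/2}
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # X: the k largest eigenvectors of L (eigh returns ascending eigenvalues)
    _, vecs = np.linalg.eigh(L)
    X = vecs[:, -k:]
    # Y: row-normalize X, then cluster the rows of Y with k-means
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)
    return KMeans(n_clusters=k, n_init=10).fit_predict(Y)
```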
Spectral Clustering
Example
A simple edge weights matrix, where d(x_i, x_j) denotes the Euclidean distance between points x_i and x_j, and θ = 1:
$$W(i,j) = W(j,i) = \begin{cases} 1 & \text{if } d(x_i, x_j) \le \theta \\ 0 & \text{otherwise} \end{cases}$$
For four points a, b, c, d (with a near b and c near d):
$$W = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \end{pmatrix} \;(\text{order } a,b,c,d), \qquad \tilde{W} = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{pmatrix} \;(\text{order } a,c,b,d)$$
Leading eigenvectors: e_1 = (0.7, 0.7, 0, 0)^T and e_2 = (0, 0, 0.7, 0.7)^T for W; e_1 = (0.7, 0, 0.7, 0)^T and e_2 = (0, 0.7, 0, 0.7)^T for the reordered matrix.
Spectral Clustering
Another example
Other Methods
Grid based methods
Uses a multi-resolution grid data structure.
1. Create the grid structure, i.e., partition the data space into a finite number of cells
2. Calculate the cell density for each cell
3. Sort the cells according to their densities
4. Identify cluster centers
5. Traverse the neighbor cells
Other Methods
Model based methods
Attempt to optimize the fit between the given data and some mathematical
model
Based on the assumption that the data are generated by a mixture of underlying probability distributions
Typical methods
Statistical approach: EM (Expectation maximization) – will be discussed later
Neural network approach: SOM (Self-Organizing Feature Map)
Constraint Based Clustering
Why constraint based clustering?
Need user feedback: Users know their applications the best
Fewer parameters but more user-desired constraints, e.g., an ATM allocation problem with obstacles and desired clusters
Different constraints in cluster analysis:
Constraints on individual samples (do selection first)
Cluster on samples which …
Constraints on distance or similarity functions
Weighted functions, obstacles
Constraints on the selection of clustering parameters
Number of clusters, limitation of each cluster size
User-specified constraints
Some samples must be placed in the same cluster and some others must not!
Semi-supervised: small training sets are given as "constraints" or hints
Constraint Based Clustering
A sample data set and two answers (one taking the constraints into account, one not)
Constraints: the data on different sides of each "wall" should be in different clusters
Clustering as Optimization
Clustering can be posed as the optimization of a criterion function
The sum-of-squared-error criterion
Scatter criteria
The given criterion function is optimized through iterative optimization
Any Question?
End of Lecture 19
Thank you!
Spring 2015
http://ce.sharif.edu/courses/93-94/2/ce717-1
Machine Learning
Clustering II
Hamid R. Rabiee
[Slides are based on Bishop Book]
Spring 2015
http://ce.sharif.edu/courses/93-94/2/ce717-1
K-means Clustering
The problem of identifying groups, or clusters, of data points in a multidimensional space
Partition the data set into some number K of clusters
Cluster: a group of data points whose inter-point distances are small compared with the distances to points outside of the cluster
Goal: an assignment of data points to clusters such that the sum of the squares of the distances of each data point to its closest vector (the center of its cluster) is a minimum
Objective function, called the distortion measure:
$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \| x_n - \mu_k \|^2$$
where r_nk ∈ {0, 1} indicates whether point x_n is assigned to cluster k
K-means Clustering
Two-stage optimization
In the 1st stage: minimize J with respect to the r_nk, keeping the μ_k fixed (assign each point to its nearest center)
In the 2nd stage: minimize J with respect to the μ_k, keeping the r_nk fixed, which gives
$$\mu_k = \frac{\sum_n r_{nk}\, x_n}{\sum_n r_{nk}}$$
the mean of all of the data points assigned to cluster k
K-means Clustering
Mixtures of Gaussians
The Gaussian mixture distribution can be written as a linear superposition of Gaussians:
$$p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k) \qquad (*)$$
An equivalent formulation of the Gaussian mixture involves an explicit latent variable (see the graphical representation of the mixture model)
A binary random variable z having a 1-of-K representation
The marginal distribution of x is a Gaussian mixture of the form (*) (for every observed data point x_n, there is a corresponding latent variable z_n)
Mixtures of Gaussians
γ(z_k) ≡ p(z_k = 1 | x) = π_k N(x | μ_k, Σ_k) / Σ_j π_j N(x | μ_j, Σ_j) can also be viewed as the responsibility that component k takes for explaining the observation x
Mixtures of Gaussians
Generating random samples distributed according to the Gaussian mixture
model
Generate a value ẑ for z from the marginal distribution p(z), and then generate a value for x from the conditional distribution p(x | ẑ)
(A short sampling sketch follows.)
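A short sketch of this ancestral sampling procedure (the two-component parameters below are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-component mixture in 2-D
pi   = np.array([0.3, 0.7])                       # mixing coefficients p(z)
mus  = np.array([[0.0, 0.0], [3.0, 3.0]])
covs = np.array([np.eye(2), [[1.0, 0.8], [0.8, 1.0]]])

def sample_gmm(n):
    z = rng.choice(len(pi), size=n, p=pi)         # z-hat ~ p(z)
    # x ~ p(x | z-hat): draw from the chosen component's Gaussian
    x = np.array([rng.multivariate_normal(mus[k], covs[k]) for k in z])
    return x, z

X, z = sample_gmm(500)
```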
Mixtures of Gaussians
a. The three states of z, corresponding to the three components of the mixture, depicted in red, green, and blue
b. The corresponding samples from the marginal distribution p(x)
c. The same samples, with colors representing the value of the responsibilities γ(z_nk) associated with each data point
The responsibilities are illustrated by evaluating the posterior probability of each component in the mixture distribution from which this data set was generated
Maximum likelihood
Graphical representation of a Gaussian mixture model for a set of N i.i.d. data points {x_n}, with corresponding latent points {z_n}
The log of the likelihood function:
$$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\} \qquad (*1)$$
Maximum likelihood
For simplicity, consider a Gaussian mixture whose components have covariance matrices given by Σ_k = σ_k² I
Suppose that one of the components of the mixture model has its mean μ_j exactly equal to one of the data points, so that μ_j = x_n
This data point will contribute a term to the likelihood function of the form N(x_n | x_n, σ_j² I), which goes to infinity as σ_j → 0
Once there are at least two components in the mixture, one of the components can have a finite variance and therefore assign finite probability to all of the data points, while the other component can shrink onto one specific data point and thereby contribute an ever-increasing additive value to the log likelihood: the over-fitting problem
Maximum likelihood
Over-fitting problem
In applying maximum likelihood to Gaussian mixture models, there should be steps to avoid finding such pathological solutions and instead seek local maxima of the likelihood function that are well behaved
Identifiability problem
A K-component mixture will have a total of K! equivalent solutions corresponding to the K! ways of assigning K sets of parameters to K components
Difficulty of maximizing the log likelihood function: the presence of the summation over k inside the logarithm means there is no closed-form solution, unlike the single-Gaussian case
EM for Gaussian mixtures
I. Assign some initial values for the means, covariances, and mixing coefficients
II. Expectation or E step
• Using the current value for the parameters to evaluate the posterior probabilities
or responsibilities
III. Maximization or M step
• Using the result of II to re-estimate the means, covariances, and mixing coefficients
It is common to run the K-means algorithm first in order to find suitable initial values:
• the covariance matrices can be initialized to the sample covariances of the clusters found by the K-means algorithm
• the mixing coefficients can be set to the fractions of data points assigned to the respective clusters
EM for Gaussian mixtures
Given a Gaussian mixture model, the goal is to maximize the likelihood
function with respect to the parameters
1. Initialize the means μ_k, covariances Σ_k and mixing coefficients π_k
2. E step: evaluate the responsibilities using the current parameter values
$$\gamma(z_{nk}) = \frac{\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$
3. M step: re-estimate the parameters using the current responsibilities, with N_k = Σ_n γ(z_nk):
$$\mu_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n, \qquad \Sigma_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k^{new})(x_n - \mu_k^{new})^T, \qquad \pi_k^{new} = \frac{N_k}{N}$$
EM for Gaussian mixtures
Given a Gaussian mixture model, the goal is to maximize the likelihood
function with respect to the parameters
4. Evaluate the log likelihood
$$\ln p(X \mid \mu, \Sigma, \pi) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\} \qquad (*2)$$
and check for convergence of either the parameters or the log likelihood; if not converged, return to step 2
(A compact sketch of these four steps follows.)
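A compact sketch of these four steps in NumPy/SciPy (a real implementation would also regularize the covariances and test the log likelihood for convergence):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    N, D = X.shape
    rng = np.random.default_rng(seed)
    # 1. Initialize means, covariances, and mixing coefficients
    mu = X[rng.choice(N, K, replace=False)]
    Sigma = np.array([np.cov(X, rowvar=False)] * K)
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # 2. E step: responsibilities gamma(z_nk)
        dens = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                         for k in range(K)], axis=1)          # shape (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # 3. M step: re-estimate parameters with effective counts N_k
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            d = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * d).T @ d / Nk[k]
        pi = Nk / N
        # 4. Log likelihood (monitor this for convergence)
        ll = np.log(dens.sum(axis=1)).sum()
    return mu, Sigma, pi, ll
```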
EM for Gaussian mixtures
Setting the derivatives of (*2) with respect to the means of the Gaussian components to zero gives
$$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n, \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk})$$
a weighted mean of all of the points in the data set: each data point is weighted by the corresponding posterior probability (the responsibility γ(z_nk)), and the denominator N_k is the effective number of points associated with the corresponding component
Setting the derivatives of (*2) with respect to the covariances of the Gaussian components to zero gives
$$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^T$$
EM for Gaussian mixtures
An Alternative View of EM
In maximizing the log likelihood function
$$\ln p(X \mid \Theta) = \ln \Big\{ \sum_{Z} p(X, Z \mid \Theta) \Big\}$$
the summation over Z prevents the logarithm from acting directly on the joint distribution
Instead, the log likelihood function for the complete data set {X, Z}, ln p(X, Z | Θ), is straightforward to maximize
In practice, since we are not given the complete data set, we consider instead its expected value Q under the posterior distribution p(Z | X, Θ) of the latent variables
An Alternative View of EM
General EM
1. Choose an initial setting for the parameters Θ^old
2. E step: evaluate p(Z | X, Θ^old)
3. M step: evaluate Θ^new given by
Θ^new = argmax_Θ Q(Θ, Θ^old)
Q(Θ, Θ^old) = Σ_Z p(Z | X, Θ^old) ln p(X, Z | Θ)
4. If the convergence criterion is not satisfied, let Θ^old ← Θ^new and return to step 2
Gaussian mixtures revisited
Maximizing the likelihood for the complete data {X, Z}:
$$p(X, Z \mid \mu, \Sigma, \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \big[ \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \big]^{z_{nk}}$$
The logarithm now acts directly on the Gaussian distribution, giving a much simpler solution to the maximum likelihood problem: the maximization with respect to a mean or a covariance is exactly as for a single Gaussian (closed form)
Gaussian mixtures revisited
Unknown latent variables: consider the expectation of the complete-data log likelihood with respect to the posterior distribution of the latent variables
Posterior distribution (ref. eqs. (9.10), (9.11)):
$$p(Z \mid X, \mu, \Sigma, \pi) \propto p(X \mid Z, \mu, \Sigma)\, p(Z) = \prod_{n=1}^{N} \prod_{k=1}^{K} \big[ \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \big]^{z_{nk}}$$
The expected value of the indicator variable under this posterior distribution:
$$E[z_{nk}] = p(z_{nk} = 1 \mid x_n) = \frac{p(z_{nk} = 1)\, p(x_n \mid z_{nk} = 1)}{p(x_n)} = \frac{\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} = \gamma(z_{nk})$$
The expected value of the complete-data log likelihood function:
$$E_Z[\ln p(X, Z \mid \mu, \Sigma, \pi)] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \big\{ \ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \big\} \qquad (*3)$$
Relation to K-means
K-means performs a hard assignment of data points to clusters (each data point is associated uniquely with one cluster)
EM makes a soft assignment based on the posterior probabilities
K-means can be derived as a particular limit of EM for Gaussian mixtures: consider a mixture whose components share the covariance εI. As ε gets smaller, the terms for which ||x_n − μ_j||² is largest go to zero most quickly; hence the responsibilities all go to zero except for the term k with the smallest ||x_n − μ_k||², whose responsibility goes to unity
Relation to K-means
With shared covariances Σ_k = εI,
$$p(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi\varepsilon)^{D/2}} \exp\Big\{ -\frac{\|x - \mu_k\|^2}{2\varepsilon} \Big\}, \qquad \gamma(z_{nk}) = \frac{\pi_k \exp\{ -\|x_n - \mu_k\|^2 / 2\varepsilon \}}{\sum_j \pi_j \exp\{ -\|x_n - \mu_j\|^2 / 2\varepsilon \}}$$
In the limit ε → 0,
$$E_Z[\ln p(X, Z \mid \mu, \Sigma, \pi)] \to -\frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \|x_n - \mu_k\|^2 + \text{const.}$$
Thus maximizing the expected complete-data log likelihood is equivalent to minimizing the distortion measure J for K-means
(In elliptical K-means, the covariance is estimated as well.)
Mixtures of Bernoulli distributions
Mixtures of Bernoulli distributions
Single component:
$$p(x \mid \mu) = \prod_{i=1}^{D} \mu_i^{x_i} (1 - \mu_i)^{1 - x_i}$$
where x = (x_1, ..., x_D)^T, μ = (μ_1, ..., μ_D)^T, E[x] = μ, and Cov[x] = diag{μ_i(1 − μ_i)}.
Mixture:
$$p(x \mid \mu, \pi) = \sum_{k=1}^{K} \pi_k\, p(x \mid \mu_k), \qquad p(x \mid \mu_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}$$
where μ = {μ_1, ..., μ_K} and π = {π_1, ..., π_K}.
Moments of the mixture:
$$E[x] = \sum_{k=1}^{K} \pi_k \mu_k, \qquad Cov[x] = \sum_{k=1}^{K} \pi_k \big( \Sigma_k + \mu_k \mu_k^T \big) - E[x]\, E[x]^T$$
where Σ_k = diag{μ_ki(1 − μ_ki)} (note that the individual variables x_i are independent, given μ_k).
(Worked example on binary vectors such as x = (1, 0, 0, 1, 1): a single component p(x | μ) has diagonal covariance Cov[x] = diag{μ_i(1 − μ_i)}, whereas a two-component mixture π_1 p(x | μ_1) + π_2 p(x | μ_2) has mean π_1 μ_1 + π_2 μ_2 and a non-diagonal covariance.)
* Because the covariance matrix Cov[x] is no longer diagonal, the mixture distribution can capture correlations between the variables, unlike a single Bernoulli distribution.
Mixtures of Bernoulli distributions
Log likelihood:
$$\ln p(X \mid \mu, \pi) = \sum_{n=1}^{N} \ln \Big\{ \sum_{k=1}^{K} \pi_k\, p(x_n \mid \mu_k) \Big\}$$
Latent-variable formulation (z = (z_1, ..., z_K)^T is a binary 1-of-K indicator variable):
$$p(x \mid z, \mu) = \prod_{k=1}^{K} p(x \mid \mu_k)^{z_k}, \qquad p(z \mid \pi) = \prod_{k=1}^{K} \pi_k^{z_k}$$
Complete-data log likelihood:
$$\ln p(X, Z \mid \mu, \pi) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \Big[ \ln \pi_k + \sum_{i=1}^{D} \big\{ x_{ni} \ln \mu_{ki} + (1 - x_{ni}) \ln(1 - \mu_{ki}) \big\} \Big]$$
Its expectation under the posterior of Z:
$$E_Z[\ln p(X, Z \mid \mu, \pi)] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \Big[ \ln \pi_k + \sum_{i=1}^{D} \big\{ x_{ni} \ln \mu_{ki} + (1 - x_{ni}) \ln(1 - \mu_{ki}) \big\} \Big]$$
(E-step):
$$\gamma(z_{nk}) = E[z_{nk}] = \frac{\pi_k\, p(x_n \mid \mu_k)}{\sum_{j=1}^{K} \pi_j\, p(x_n \mid \mu_j)}, \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk}), \qquad \bar{x}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n$$
(M-step):
$$\mu_k = \bar{x}_k, \qquad \pi_k = \frac{N_k}{N}$$
* In contrast to the mixture of Gaussians, there are no singularities in which the likelihood goes to infinity
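A sketch of this EM loop for binary data (it assumes X is an (N, D) 0/1 array; the clipping that keeps mu away from 0 and 1 is an added numerical safeguard, not part of the derivation):

```python
import numpy as np

def em_bernoulli_mixture(X, K, n_iter=50, seed=0):
    N, D = X.shape
    rng = np.random.default_rng(seed)
    mu = rng.uniform(0.25, 0.75, size=(K, D))   # component means mu_ki
    pi = np.full(K, 1.0 / K)                    # mixing coefficients
    for _ in range(n_iter):
        # E step: gamma(z_nk) proportional to pi_k * prod_i mu_ki^x_ni (1-mu_ki)^(1-x_ni)
        log_p = X @ np.log(mu).T + (1 - X) @ np.log(1 - mu).T + np.log(pi)
        log_p -= log_p.max(axis=1, keepdims=True)   # numerical stability
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M step: mu_k = x_bar_k (responsibility-weighted mean), pi_k = N_k / N
        Nk = gamma.sum(axis=0)
        mu = np.clip((gamma.T @ X) / Nk[:, None], 1e-6, 1 - 1e-6)
        pi = Nk / N
    return mu, pi
```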
Mixtures of Bernoulli distributions
N = 600 digit images, K = 3 mixture components
A mixture of K = 3 Bernoulli distributions fitted by 10 EM iterations, showing the parameters for each of the three components compared with a single multivariate Bernoulli fit
The analysis of Bernoulli mixtures can be extended to discrete variables having M > 2 states (Ex. 9.19)
References
Slides for Chapter 9 of the Bishop book, adapted from the Biointelligence Laboratory, Seoul National University, http://bi.snu.ac.kr/
Any Question?
End of Lecture 20
Thank you!
Spring 2015
http://ce.sharif.edu/courses/93-94/2/ce717-1