Machine Learning
Devdatt Dubhashi
Department of Computer Science and Engineering
Chalmers University, Gothenburg, Sweden
LP3 2007
Outline
1 k-Means Clustering
2 Mixtures of Gaussians and EM Algorithm
Clustering
Data set {x_1, ..., x_N} of N observations of a random d-dimensional Euclidean variable x.
The goal is to partition the data set into K clusters (K known).
Intuitively, the points within a cluster must be "close" to each other compared to points outside the cluster.
Cluster centers and assignments
Find a set of centers µ_k, k ∈ [K].
Assign each data point to one of the centers so as to minimize the sum of the squares of the distances to the assigned centers.
Assignment and Distortion
Introduce binary indicator variables
r_{n,k} := \begin{cases} 1, & \text{if } x_n \text{ is assigned to } \mu_k \\ 0, & \text{otherwise} \end{cases}
Minimize the distortion measure
J := \sum_{n \in [N]} \sum_{k \in [K]} r_{n,k} \, \|x_n - \mu_k\|^2 .
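As an illustration (not part of the original slides), a minimal NumPy sketch of this distortion measure, assuming the data are held in an N × d array X, the centers in a K × d array mu, and the assignments in an N-vector assign of cluster indices:

import numpy as np

def distortion(X, mu, assign):
    """Sum of squared distances from each point to its assigned center."""
    # X: (N, d) data, mu: (K, d) centers, assign: (N,) cluster index per point
    diffs = X - mu[assign]            # difference of each point to its assigned center
    return float(np.sum(diffs ** 2))  # J = sum_n ||x_n - mu_assign(n)||^2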
Two Step Optimization
Start with some initial values of µ_k. The basic iteration consists of two steps, repeated until convergence:
E: minimize J with respect to r_{n,k}, keeping µ_k fixed;
M: minimize J with respect to µ_k, keeping r_{n,k} fixed.
Two Step Optimization: E Step
Minimize J with respect to r_{n,k}, keeping µ_k fixed:
r_{n,k} := \begin{cases} 1, & \text{if } k = \arg\min_j \|x_n - \mu_j\|^2 \\ 0, & \text{otherwise.} \end{cases}
Two Step Optimization: M Step
Minimize J with respect to µ_k, keeping r_{n,k} fixed. J is a quadratic function of µ_k, so setting the derivative to zero gives
\sum_{n \in [N]} r_{n,k} (x_n - \mu_k) = 0,
hence
\mu_k = \frac{\sum_n r_{n,k} x_n}{\sum_n r_{n,k}} .
In words: set µ_k to be the mean of the points assigned to cluster k, hence the name K-means algorithm.
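Putting the two steps together, here is a minimal NumPy sketch of the iteration described above; the random initialization from K data points and the fixed iteration cap are illustrative assumptions, not part of the slides:

import numpy as np

def k_means(X, K, n_iter=100, seed=0):
    """Alternate the assignment (E) and mean-update (M) steps until the centers stop moving."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]                # initial centers
    for _ in range(n_iter):
        # E step: assign each point to its nearest center
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)    # (N, K) squared distances
        assign = d2.argmin(axis=1)
        # M step: set each center to the mean of the points assigned to it
        new_mu = np.array([X[assign == k].mean(axis=0) if np.any(assign == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                                  # converged
            break
        mu = new_mu
    return mu, assign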
K-Means Algorithm Analysis
Since J decreases at each iteration, convergence is guaranteed.
But it may converge to a local rather than a global optimum.
K-Means Algorithm: Example
[Figure: successive iterations of the K-means algorithm on a two-dimensional data set, panels (a)–(i); axes from −2 to 2.]
K-Means and Image Segmentation
Image segmentation problem: partition an image into regions of homogeneous visual appearance, corresponding to objects or parts of objects.
Each pixel is a 3-dimensional point corresponding to the intensities of the red, green and blue channels.
Perform K-means and redraw the image, replacing each pixel by the corresponding center µ_k.
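A minimal sketch of this procedure, assuming the image is held as an (H, W, 3) uint8 array and reusing the hypothetical k_means function sketched earlier:

import numpy as np

def segment_image(image, K):
    """Quantize an (H, W, 3) RGB image to K colors with K-means."""
    H, W, _ = image.shape
    pixels = image.reshape(-1, 3).astype(float)   # one 3-dimensional point per pixel
    mu, assign = k_means(pixels, K)               # cluster the pixel colors
    # Redraw the image, replacing each pixel by its cluster center
    return mu[assign].reshape(H, W, 3).astype(np.uint8)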
K-Means Algorithm: Example
[Figure: the original image and its K-means segmentations.]
K-Means and Data Compression
Lossy compression, as opposed to lossless compression: we accept some errors in reconstruction in return for a higher rate of compression.
Instead of storing all N data points, store only the identity of the assigned cluster for each point, together with the cluster centers.
Significant savings provided K << N.
Each data point is approximated by its nearest center µ_k: the code-book vectors.
New data are compressed by finding the nearest center and storing only the label k of the corresponding cluster.
This scheme is called vector quantization.
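A minimal sketch of the encode/decode steps of such a vector quantizer, reusing the hypothetical cluster centers mu produced by the earlier k_means sketch:

import numpy as np

def vq_encode(X_new, mu):
    """Store only the label of the nearest code-book vector for each new point."""
    d2 = ((X_new[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (M, K) squared distances
    return d2.argmin(axis=1)                                       # one label per point

def vq_decode(labels, mu):
    """Reconstruct each point as its code-book vector (lossy)."""
    return mu[labels]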
K-Means and Data Compression: Example
Suppose the original image has N pixels, each comprising {R, G, B} values stored with 8-bit precision. Then the total space required is 24N bits.
If instead we first run K-means and transmit only the label of the corresponding cluster for each pixel, this takes ⌈log₂ K⌉ bits per pixel, for a total of N⌈log₂ K⌉ bits.
We also need to transmit the K code-book vectors, which requires 24K bits.
In the example, the original image has 240 × 180 = 43,200 pixels, requiring 24 × 43,200 = 1,036,800 bits.
The compressed images require 43,248 bits (K = 2), 86,472 bits (K = 3) and 173,040 bits (K = 10).
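A quick check of these figures (a small sketch using whole bits per pixel, i.e. ⌈log₂ K⌉, which reproduces the numbers quoted above):

import math

N = 240 * 180                                  # number of pixels in the example image
for K in (2, 3, 10):
    bits_per_pixel = math.ceil(math.log2(K))   # label bits per pixel
    total = N * bits_per_pixel + 24 * K        # pixel labels plus code-book vectors
    print(K, total)                            # prints 43248, 86472 and 173040 bits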
Mixtures of Gaussians: Motivation
Pure Gaussian distributions have limitations when it comes to modelling real-life data.
Example: "Old Faithful" eruption durations.
The data forms two dominant clumps.
A single Gaussian cannot model this data well.
Linear superposition of two Gaussians does much better.
Old Faithful Eruptions
[Figure: the Old Faithful eruption data, two panels; horizontal axis from 1 to 6, vertical axis from 40 to 100.]
Mixtures of Gaussians: Modelling
A linear combination of Gaussians can give rise to complex distributions.
By using a sufficient number of Gaussians and adjusting their means and covariances, as well as the linear combination coefficients, one can model almost any continuous density to arbitrary accuracy.
[Figure: a one-dimensional density p(x) formed as a mixture of Gaussians.]
Mixtures of Gaussians: Definition
Superposition of Gaussians of the form
p(x) := \sum_{k \in [K]} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k) .
Each Gaussian density \mathcal{N}(x \mid \mu_k, \Sigma_k) is a component of the mixture, with its own mean and covariance.
The parameters π_k are the mixing coefficients and satisfy 0 ≤ π_k ≤ 1 and \sum_k \pi_k = 1.
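For concreteness, a minimal sketch that evaluates such a mixture density at a point; the helper names and argument shapes are illustrative assumptions:

import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate normal density N(x | mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)                # (x - mu)^T Sigma^{-1} (x - mu)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))   # normalizing constant
    return np.exp(-0.5 * quad) / norm

def mixture_pdf(x, pis, mus, Sigmas):
    """p(x) = sum_k pi_k N(x | mu_k, Sigma_k)."""
    return sum(pi * gaussian_pdf(x, mu, S) for pi, mu, S in zip(pis, mus, Sigmas))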
Mixtures of Gaussians: Definition
[Figure: a mixture of three Gaussians with mixing coefficients 0.5, 0.3 and 0.2, plotted on the unit square.]
Equivalent Definition: Latent Variable
Introduce a latent binary vector z in which exactly one component z_k equals 1 and the rest are zero, with p(z_k = 1) = π_k. This variable indicates which component generated x. Given z, the conditional distribution is
p(x \mid z_k = 1) = \mathcal{N}(x \mid \mu_k, \Sigma_k) .
Inverting this, using Bayes' rule,
\gamma(z_k) := p(z_k = 1 \mid x) = \frac{p(z_k = 1)\, p(x \mid z_k = 1)}{\sum_j p(z_j = 1)\, p(x \mid z_j = 1)} = \frac{\pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \mathcal{N}(x \mid \mu_j, \Sigma_j)}
is the posterior probability, or responsibility, that component k takes for the observation x.
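A minimal sketch of this computation, reusing the hypothetical gaussian_pdf helper from the previous sketch:

import numpy as np

def responsibilities(x, pis, mus, Sigmas):
    """gamma(z_k) = pi_k N(x | mu_k, Sigma_k) / sum_j pi_j N(x | mu_j, Sigma_j)."""
    weighted = np.array([pi * gaussian_pdf(x, mu, S)
                         for pi, mu, S in zip(pis, mus, Sigmas)])
    return weighted / weighted.sum()   # normalize so the responsibilities sum to 1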
Mixtures and Responsibilities
[Figure: illustration of a mixture of Gaussians and the corresponding responsibilities, panels (a)–(c), plotted on the unit square.]
Learning Mixtures
Suppose we have a data set of observations represented by an N × D matrix X := {x_1, ..., x_N}, and we want to model it as a mixture of K Gaussians.
We need to find the mixing coefficients π_k and the parameters of the component models, µ_k and Σ_k.
Learning Mixtures: The Means
Start with the log-likelihood function:
\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n \in [N]} \ln \left( \sum_{k \in [K]} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right) .
Setting the derivative with respect to µ_k to zero, and assuming Σ_k is invertible, gives:
\mu_k = \frac{1}{N_k} \sum_{n \in [N]} \gamma(z_{n,k})\, x_n ,
where N_k := \sum_{n \in [N]} \gamma(z_{n,k}).
Learning Mixtures: The Means
Interpret N_k as the "effective number of points" assigned to cluster k.
Note that the mean µ_k for the k-th Gaussian component is given by a weighted mean of all the points in the data set.
The weighting factor for data point x_n is the posterior probability, or responsibility, of component k for generating x_n.
Learning Mixtures: The Covariances
Setting the derivative with respect to Σ_k to zero, and assuming Σ_k is invertible, gives:
\Sigma_k = \frac{1}{N_k} \sum_{n \in [N]} \gamma(z_{n,k}) (x_n - \mu_k)(x_n - \mu_k)^T ,
which is the same as the single-Gaussian solution, but with each data point weighted by the corresponding posterior probability.
Learning Mixtures: Mixing Coefficients
Setting the derivative with respect to π_k to zero, and taking into account the constraint \sum_k \pi_k = 1 (Lagrange multipliers!), gives
\pi_k = \frac{N_k}{N} .
The mixing coefficient for the k-th component is the average responsibility that the component takes for explaining the data set.
Learning Mixtures: EM Algorithm
1 Initialize the means, covariances and mixing coefficients, and repeat the following two steps until convergence.
2 E step: evaluate the responsibilities using the current parameters:
\gamma(z_{n,k}) = \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}
3 M step: re-estimate the parameters using the current responsibilities:
\mu_k^{\text{new}} = \frac{1}{N_k} \sum_n \gamma(z_{n,k})\, x_n
\Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_n \gamma(z_{n,k}) (x_n - \mu_k^{\text{new}})(x_n - \mu_k^{\text{new}})^T
\pi_k^{\text{new}} = \frac{N_k}{N} ,
where N_k := \sum_n \gamma(z_{n,k}).
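A compact NumPy sketch of these updates; the initialization and the fixed number of iterations are illustrative assumptions, and gaussian_pdf is the hypothetical helper sketched earlier:

import numpy as np

def em_gmm(X, K, n_iter=100, seed=0):
    """EM for a mixture of Gaussians: E step computes responsibilities, M step re-estimates parameters."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(N, size=K, replace=False)]                  # initial means
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    pis = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E step: responsibilities gamma(z_{n,k}), shape (N, K)
        gamma = np.array([[pis[k] * gaussian_pdf(x, mus[k], Sigmas[k]) for k in range(K)]
                          for x in X])
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M step: re-estimate means, covariances and mixing coefficients
        Nk = gamma.sum(axis=0)                                     # effective number of points per component
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        pis = Nk / N
    return pis, mus, Sigmas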
EM Algorithm: Example
[Figure: successive iterations of the EM algorithm on a two-dimensional data set, panels (a)–(f); axes from −2 to 2.]
EM vs K-Means
K-means performs a hard assignment of data points to clusters, i.e. each data point is assigned to a unique cluster.
The EM algorithm makes a soft assignment based on posterior probabilities.
K-means can be derived as a limit of the EM algorithm applied to a particular instance of Gaussian mixtures (see the sketch below).
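A minimal sketch of this limit, under the assumption (not stated on the slides) that every component has the same isotropic covariance \Sigma_k = \epsilon I:

\gamma(z_{n,k}) = \frac{\pi_k \exp\!\left(-\|x_n - \mu_k\|^2 / 2\epsilon\right)}{\sum_j \pi_j \exp\!\left(-\|x_n - \mu_j\|^2 / 2\epsilon\right)} \;\longrightarrow\; r_{n,k} \quad \text{as } \epsilon \to 0,

so the responsibilities collapse to the hard assignments r_{n,k} of the nearest center, and the EM update for \mu_k reduces to the K-means mean update.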