EE462 MLCV

Lecture 3-4: Clustering (1hr), Gaussian Mixture and EM (1hr)

Tae-Kyun Kim


Vector Clustering

Data points (green), 2D vectors, are grouped into two homogeneous clusters (blue and red). Clustering is achieved by an iterative algorithm (left to right). The cluster centres are marked ×.


Pixel Clustering (Image Quantisation)

Image pixels are represented by 3D vectors x ∈ R³ of R, G, B values. The vectors are grouped into K = 10, 3, 2 clusters, and each pixel is represented by the mean value of its cluster.


Patch Clustering

Image patches are harvested around interest points from a large number of images. Each patch, e.g. a 20×20 pixel patch represented by its raw pixels (dimension D = 400) or by a SIFT descriptor, is a finite-dimensional vector, and the vectors are clustered to form a visual dictionary of K codewords.

Lecture 9-10 (BoW)


Image Clustering

Whole images are represented as finite-dimensional vectors. Homogeneous vectors are grouped together in Euclidean space.

Lecture 9-10 (BoW)


K-means vs GMM

Hard clustering: each data point is assigned to exactly one cluster.

Soft clustering: each data point is explained probabilistically by a mixture of multiple Gaussians.

Two standard methods are K-means and the Gaussian Mixture Model (GMM). K-means assigns each data point to the nearest cluster, while a GMM represents the data by multiple Gaussian densities.


Matrix and Vector Derivatives

Matrix and vector derivatives are obtained by first taking element-wise derivatives and then reassembling them into matrices and vectors.
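For example, two standard identities that follow directly from this element-wise procedure (stated here as a reminder; they are not reproduced from the slides):

```latex
% Differentiating y = a^T x = \sum_i a_i x_i element-wise gives
% \partial y / \partial x_i = a_i, which reassembles to
\frac{\partial\,(\mathbf{a}^{\top}\mathbf{x})}{\partial \mathbf{x}} = \mathbf{a},
\qquad
% and for the quadratic form y = x^T A x = \sum_{i,j} x_i A_{ij} x_j,
\frac{\partial\,(\mathbf{x}^{\top} A \mathbf{x})}{\partial \mathbf{x}} = (A + A^{\top})\,\mathbf{x}.
```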


K-means Clustering

Given a data set {x1,…, xN} of N observations in a D-dimensional space, our goal is to partition the data set into K clusters or groups.

The vector μk, where k = 1,...,K, represents the k-th cluster, e.g. its centre.

For each data point xn, binary indicator variables rnk ∈ {0, 1}, where k = 1,...,K, are defined.

1-of-K coding scheme: if xn is assigned to cluster k, then rnk = 1 and rnj = 0 for j ≠ k.


The objective function that measures distortion is

J = Σn Σk rnk ‖xn − μk‖².

We seek the values of {rnk} and {μk} that minimise J.


• Iterative solution (repeated till convergence):

First we choose some initial values for μk.

Step 1: We minimise J with respect to rnk, keeping μk fixed. Since J is a linear function of rnk, there is a closed-form solution: assign each xn to its nearest cluster centre, i.e. rnk = 1 if k = argminj ‖xn − μj‖², and rnk = 0 otherwise.

Step 2: We minimise J with respect to μk, keeping rnk fixed. J is a quadratic function of μk. Setting its derivative with respect to μk to zero gives

μk = Σn rnk xn / Σn rnk,

i.e. the mean of the data points assigned to cluster k.
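A minimal NumPy sketch of these two alternating steps (illustrative only; the function and variable names are my own and not from the course code):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means (Lloyd's algorithm) on an (N, D) data matrix X."""
    rng = np.random.default_rng(seed)
    # Initialise the K cluster centres with randomly chosen data points.
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Step 1: assign each point to its nearest centre (this sets r_nk).
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        labels = dists.argmin(axis=1)
        # Step 2: move each centre to the mean of its assigned points.
        new_mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):  # converged
            break
        mu = new_mu
    return mu, labels
```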


(Figure: K-means iterations on 2D data with K = 2, showing the means μ1, μ2 and the assignments rnk.)


Each iteration decreases J, which provides a convergence proof. However, K-means finds only a local minimum: its result depends on the initial values of μk.


Generalisation of K-means

• K-means can be generalised using a more generic dissimilarity measure V(xn, μk). The objective function to minimise is

J = Σn Σk rnk V(xn, μk),

for example the Mahalanobis distance

V(xn, μk) = (xn − μk)ᵀ Σk⁻¹ (xn − μk),

where Σk denotes the covariance matrix of cluster k, e.g. in 2D

Σk = [ σx²  σxy ; σyx  σy² ].

• Cluster shapes obtained by different Σk:

Σk = I: circles of the same size


Σk = σk²I (an isotropic matrix): circles of different sizes

Σk a diagonal matrix: axis-aligned ellipses

Σk a full matrix: rotated ellipses
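A small NumPy sketch of the Mahalanobis dissimilarity used above (an illustrative helper, not part of the referenced toolbox):

```python
import numpy as np

def mahalanobis_sq(X, mu_k, Sigma_k):
    """V(x_n, mu_k) = (x_n - mu_k)^T Sigma_k^{-1} (x_n - mu_k) for every row of X."""
    diff = X - mu_k                             # (N, D)
    sol = np.linalg.solve(Sigma_k, diff.T)      # Sigma_k^{-1} (x_n - mu_k), shape (D, N)
    return np.einsum('nd,dn->n', diff, sol)     # (N,)

# With Sigma_k = I this reduces to the squared Euclidean distance of standard K-means.
```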


Statistical Pattern Recognition Toolbox for Matlab

http://cmp.felk.cvut.cz/cmp/software/stprtool/

…\stprtool\probab\cmeans.m


Mixture of Gaussians

Denote by z a 1-of-K representation: zk ∈ {0, 1} and Σk zk = 1.

We define the joint distribution p(x, z) by a marginal distribution p(z) and a conditional distribution p(x|z).

Lecture 11-12 (Prob. Graphical models)

z is the hidden variable; x is the observable variable (the data).


The marginal distribution over z is written in terms of the mixing coefficients πk,

p(zk = 1) = πk,  where 0 ≤ πk ≤ 1 and Σk πk = 1.

Because z uses the 1-of-K representation, the marginal distribution is in the form of

p(z) = Πk πk^zk.

Similarly, the conditional distribution of x given z is

p(x | zk = 1) = N(x | μk, Σk),  so that  p(x | z) = Πk N(x | μk, Σk)^zk.


The marginal distribution of x is

p(x) = Σz p(z) p(x | z) = Σk πk N(x | μk, Σk),

which is a linear superposition of Gaussians.
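As a quick illustration, the mixture density above can be evaluated directly (a sketch using SciPy; the function name and arguments are my own):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(X, pis, mus, Sigmas):
    """p(x) = sum_k pi_k N(x | mu_k, Sigma_k), evaluated for each row of X."""
    return sum(pi * multivariate_normal.pdf(X, mean=mu, cov=Sigma)
               for pi, mu, Sigma in zip(pis, mus, Sigmas))
```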


The conditional probability p(zk = 1 | x), denoted by γ(zk), is obtained by Bayes' theorem,

γ(zk) = p(zk = 1 | x) = πk N(x | μk, Σk) / Σj πj N(x | μj, Σj).

We view πk as the prior probability of zk = 1, and γ(zk) as the corresponding posterior probability.

γ(zk) is the responsibility that the k-th component takes for explaining the observation x.
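The same formula in a vectorised sketch (hypothetical helper names; the per-component densities are computed as in the previous snippet):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, Sigmas):
    """gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)."""
    weighted = np.column_stack([pi * multivariate_normal.pdf(X, mean=mu, cov=S)
                                for pi, mu, S in zip(pis, mus, Sigmas)])   # (N, K)
    return weighted / weighted.sum(axis=1, keepdims=True)
```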


Maximum Likelihood Estimation

Given a data set X = {x1,…, xN}, the log of the likelihood function is

ln p(X | π, μ, Σ) = Σn ln { Σk πk N(xn | μk, Σk) },

to be maximised s.t. 0 ≤ πk ≤ 1 and Σk πk = 1.


Setting the derivative of ln p(X | π, μ, Σ) with respect to μk to zero, we obtain

μk = (1 / Nk) Σn γ(znk) xn,  where Nk = Σn γ(znk).


Similarly, setting the derivative of ln p(X | π, μ, Σ) with respect to Σk to zero, we obtain

Σk = (1 / Nk) Σn γ(znk) (xn − μk)(xn − μk)ᵀ.

Finally, we maximise ln p(X | π, μ, Σ) with respect to the mixing coefficients πk. Because the πk must sum to one, we use a Lagrange multiplier λ and maximise

ln p(X | π, μ, Σ) + λ (Σk πk − 1).

(Constrained optimisation: to maximise an objective function f(x) subject to a constraint g(x) = 0, we maximise the Lagrangian f(x) + λ g(x). Refer to the Optimisation course or http://en.wikipedia.org/wiki/Lagrange_multiplier.)


which gives

0 = Σn N(xn | μk, Σk) / ( Σj πj N(xn | μj, Σj) ) + λ = Nk / πk + λ.

Multiplying both sides by πk and summing over k, we find λ = −N and

πk = Nk / N.


EM (Expectation Maximisation) for Gaussian Mixtures

1. Initialise the means μk, covariances Σk and mixing coefficients πk.

2. E step: Evaluate the responsibilities using the current parameter values,

γ(znk) = πk N(xn | μk, Σk) / Σj πj N(xn | μj, Σj).

3. M step: Re-estimate the parameters using the current responsibilities,

μk^new = (1 / Nk) Σn γ(znk) xn
Σk^new = (1 / Nk) Σn γ(znk) (xn − μk^new)(xn − μk^new)ᵀ
πk^new = Nk / N,  where Nk = Σn γ(znk).


4. Evaluate the log likelihood

ln p(X | π, μ, Σ) = Σn ln { Σk πk N(xn | μk, Σk) }

and check for convergence of either the parameters or the log likelihood. If the convergence criterion is not satisfied, return to step 2.
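A compact NumPy/SciPy sketch of this EM loop (illustrative only: the names are my own, and it omits practical safeguards beyond a small ridge on the covariances):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, tol=1e-6, seed=0):
    """Fit a K-component Gaussian mixture to X (shape (N, D)) by EM."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    # 1. Initialise means, covariances and mixing coefficients.
    mus = X[rng.choice(N, size=K, replace=False)]
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    pis = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iters):
        # 2. E step: responsibilities gamma[n, k].
        weighted = np.column_stack(
            [pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k]) for k in range(K)])
        gamma = weighted / weighted.sum(axis=1, keepdims=True)
        # 3. M step: re-estimate the parameters from the responsibilities.
        Nk = gamma.sum(axis=0)                      # effective number of points per component
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pis = Nk / N
        # 4. Evaluate the log likelihood and check for convergence.
        ll = np.log(weighted.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pis, mus, Sigmas
```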


Statistical Pattern Recognition Toolbox for Matlab

http://cmp.felk.cvut.cz/cmp/software/stprtool/

…\stprtool\visual\pgmm.m

…\stprtool\demos\demo_emgmm.m


Information Theory

The amount of information can be viewed as the degree of surprise on learning the value of x.

If we have two events x and y that are unrelated, then h(x, y) = h(x) + h(y). Since p(x, y) = p(x) p(y), h(x) is given by the logarithm of p(x) as

h(x) = − log₂ p(x),

where the minus sign ensures that the information is positive or zero.

Lecture 7 (Random forest)



The average amount of information (called the entropy) is given by

H[x] = − Σx p(x) log₂ p(x).

The differential entropy for a multivariate continuous variable x is

H[x] = − ∫ p(x) ln p(x) dx.
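A tiny NumPy sketch of the discrete entropy above (illustrative; the eps guard against log(0) is my own addition):

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy H[x] = -sum_x p(x) log2 p(x) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()                       # normalise, in case raw counts are passed
    return float(-(p * np.log2(p + eps)).sum())

print(entropy([0.5, 0.5]))   # a fair coin carries 1 bit of information
```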