Advanced Topics in Learning and Vision

Ming-Hsuan Yang

Lecture 4 (draft)

Overview

• EM Algorithm

• Mixture of Factor Analyzers

• Mixture of Probabilistic Principal Component Analyzers

• Isometric Mapping

• Locally Linear Embedding

• Linear regression

• Logistic regression

• Linear classifier

• Fisher linear discriminant


Announcements

• More course material available on the course web page

• Code: PCA, FA, MoG, MFA, MPPCA, LLE, and Isomap

• Reading (due Oct 25):

- Fisher linear discriminant: Fisherface vs. Eigenface [1]
- Support vector machine: [3] or [2]

References

[1] P. Belhumeur, J. Hespanha, and D. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):711–720, 1997.

[2] A. Mohan, C. Papageorgiou, and T. Poggio. Example-based object detection in images by components. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(4):349–361, 2001.

[3] M. Pontil and A. Verri. Support vector machines for 3D object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(6):637–646, 1998.


Mixture of Gaussians

p(x) = \sum_{k=1}^{K} \pi_k N(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1, \quad 0 \le \pi_k \le 1    (1)

where πk is the mixing parameter, describing the contribution of the k-th Gaussian component in explaining x.
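
As a concrete illustration of (1), the following minimal numpy/scipy sketch (illustrative names, not the course code) evaluates a mixture-of-Gaussians density at a set of points:

import numpy as np
from scipy.stats import multivariate_normal

def mog_density(X, pis, mus, Sigmas):
    """Evaluate p(x) = sum_k pi_k N(x | mu_k, Sigma_k) at each row of X."""
    p = np.zeros(X.shape[0])
    for pi_k, mu_k, Sigma_k in zip(pis, mus, Sigmas):
        p += pi_k * multivariate_normal.pdf(X, mean=mu_k, cov=Sigma_k)
    return p

# Toy example: two 2-D components with equal mixing weights.
pis = [0.5, 0.5]
mus = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]
print(mog_density(np.random.randn(5, 2), pis, mus, Sigmas))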

• Given data X = {x1, . . . , xN}, we want to determine the model parameters θ = {πk, µk, Σk}.

- X is observable
- The contribution of each data point xi to the j-th Gaussian component, γj(xi), is a hidden variable that can be derived from X and θ
- θ is unknown

• If we know θ, we can compute γj(xi).

• If we know γj(xi), we can compute θ.

• Chicken and egg problem.


EM algorithm

• Expectation Maximization

• First take some initial guess of the model parameters and compute the expectation of the hidden variables

• Iterative procedure

• Start with some initial guess and refine it

• Very useful technique

• Variational learning

• Generalized EM algorithm


EM Algorithm for Mixture of Gaussians

• Log likelihood function

\ln p(X \mid \pi, \mu, \Sigma) = \sum_{i=1}^{N} \ln \Big\{ \sum_{k=1}^{K} \pi_k N(x_i \mid \mu_k, \Sigma_k) \Big\}    (2)

• No closed-form solution (the sum over components sits inside the logarithm)

• E (Expectation) step: Given the current model parameters, compute the expectation of the hidden variables

• M (Maximization) step: Maximize the log likelihood with respect to the model parameters


EM Algorithm for Mixture of Gaussians: M Step

\ln L = \sum_{i=1}^{N} \ln \Big( \sum_{k=1}^{K} \pi_k N_{ki} \Big)    (3)

where

N_{ki} = N(x_i \mid \mu_k, \Sigma_k)    (4)

Taking the derivative of \ln L w.r.t. \mu_j,

\frac{\partial \ln L}{\partial \mu_j} = \sum_{i=1}^{N} \frac{\pi_j N_{ji}}{\sum_{k=1}^{K} \pi_k N_{ki}} \, \frac{1}{N_{ji}} \frac{\partial N_{ji}}{\partial \mu_j} = 0    (5)

Note that

\frac{1}{N_{ji}} \frac{\partial N_{ji}}{\partial \mu_j} = \Sigma_j^{-1} (x_i - \mu_j)    (6)

Let \gamma_j(x_i) = \frac{\pi_j N_{ji}}{\sum_{k=1}^{K} \pi_k N_{ki}}, i.e., the normalized probability of x_i being generated from the j-th Gaussian component. Setting the derivative to zero gives

\sum_{i=1}^{N} \gamma_j(x_i) \, \Sigma_j^{-1} (x_i - \mu_j) = 0    (7)

Thus,

\mu_j = \frac{\sum_{i=1}^{N} \gamma_j(x_i) \, x_i}{\sum_{i=1}^{N} \gamma_j(x_i)}    (8)

Likewise, taking partial derivatives w.r.t. \pi_j and \Sigma_j,

\pi_j = \frac{1}{N} \sum_{i=1}^{N} \gamma_j(x_i)    (9)

\Sigma_j = \frac{\sum_{i=1}^{N} \gamma_j(x_i) (x_i - \mu_j)(x_i - \mu_j)^T}{\sum_{i=1}^{N} \gamma_j(x_i)}    (10)

Note that γj(xi) plays the weighting role in these updates (see the sketch below)
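
A minimal numpy sketch of the M-step updates (8)-(10), assuming the responsibilities γj(xi) have already been collected into an N × K array called gamma; the names are illustrative, not the course code:

import numpy as np

def m_step(X, gamma):
    """M step for a mixture of Gaussians; X is (N, d), gamma is (N, K)."""
    N = X.shape[0]
    Nk = gamma.sum(axis=0)                                # effective counts sum_i gamma_j(x_i)
    pis = Nk / N                                          # Eq. (9)
    mus = (gamma.T @ X) / Nk[:, None]                     # Eq. (8)
    Sigmas = []
    for j in range(gamma.shape[1]):
        Xc = X - mus[j]                                   # center data at mu_j
        Sigmas.append((gamma[:, j] * Xc.T) @ Xc / Nk[j])  # Eq. (10)
    return pis, mus, np.stack(Sigmas)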

Lecture 4 (draft) 7

EM Algorithm for Mixture of Gaussians: E Step

• Compute the expected value of hidden variable γj

\gamma_j(x_i) = \frac{\pi_j N(x_i \mid \mu_j, \Sigma_j)}{\sum_{k=1}^{K} \pi_k N(x_i \mid \mu_k, \Sigma_k)}    (11)

• Interpret the mixing coefficients as prior probabilities

p(x_i) = \sum_{j=1}^{K} \pi_j N(x_i \mid \mu_j, \Sigma_j) = \sum_{j=1}^{K} p(j) \, p(x_i \mid j)    (12)

• Thus, γj(xi) corresponds to posterior probabilities (responsibilities)

p(j \mid x_i) = \frac{p(j) \, p(x_i \mid j)}{p(x_i)} = \frac{\pi_j N(x_i \mid \mu_j, \Sigma_j)}{\sum_{k=1}^{K} \pi_k N(x_i \mid \mu_k, \Sigma_k)} = \gamma_j(x_i)    (13)
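
The E step (11)/(13) can be sketched in the same illustrative style, reusing scipy for the Gaussian densities:

import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pis, mus, Sigmas):
    """Responsibilities gamma_j(x_i) = pi_j N(x_i|mu_j,Sigma_j) / sum_k pi_k N(x_i|mu_k,Sigma_k)."""
    joint = np.column_stack([pis[j] * multivariate_normal.pdf(X, mean=mus[j], cov=Sigmas[j])
                             for j in range(len(pis))])   # (N, K) terms pi_j N_ji
    return joint / joint.sum(axis=1, keepdims=True)       # normalize over components, Eq. (11)

Alternating this E step with the M step sketched earlier, starting from an initial guess of θ, gives the complete EM loop; the log likelihood (2) is non-decreasing over the iterations.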

Lecture 4 (draft) 8

EM for Factor Analysis

• Factor analysis: x = Λz + ε

• Log likelihood: L = \log \prod_{i} (2\pi)^{-d/2} |\Psi|^{-1/2} \exp\big\{ -\tfrac{1}{2} (x_i - \Lambda z)^T \Psi^{-1} (x_i - \Lambda z) \big\}

• Hidden variable: z, model parameters: Λ, Ψ.

• E-step:

E[z \mid x] = \beta x, \qquad E[zz^T \mid x] = \mathrm{Var}(z \mid x) + E[z \mid x] E[z \mid x]^T = I - \beta\Lambda + \beta x x^T \beta^T    (14)

where \beta = \Lambda^T (\Psi + \Lambda\Lambda^T)^{-1}.

• M-step:

\Lambda^{\mathrm{new}} = \Big( \sum_{i=1}^{N} x_i E[z \mid x_i]^T \Big) \Big( \sum_{i=1}^{N} E[zz^T \mid x_i] \Big)^{-1}

\Psi^{\mathrm{new}} = \frac{1}{N} \mathrm{diag}\Big\{ \sum_{i=1}^{N} x_i x_i^T - \Lambda^{\mathrm{new}} E[z \mid x_i] \, x_i^T \Big\}    (15)

where the diag operator sets all off-diagonal elements to zero.
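
Below is a hedged numpy sketch of one EM iteration for a single factor analyzer following (14)-(15), assuming the data have been centered; the names are illustrative, not the course code:

import numpy as np

def fa_em_step(X, Lambda, Psi_diag):
    """One EM iteration for factor analysis on centered data X (N, d).
    Lambda is the (d, m) loading matrix, Psi_diag the (d,) diagonal of Psi."""
    N = X.shape[0]
    m = Lambda.shape[1]
    # E step: beta = Lambda^T (Psi + Lambda Lambda^T)^{-1}, Eq. (14)
    beta = Lambda.T @ np.linalg.inv(np.diag(Psi_diag) + Lambda @ Lambda.T)
    Ez = X @ beta.T                                    # rows are E[z | x_i]
    # sum_i E[z z^T | x_i] = N (I - beta Lambda) + beta (sum_i x_i x_i^T) beta^T
    sum_Ezz = N * (np.eye(m) - beta @ Lambda) + beta @ (X.T @ X) @ beta.T
    # M step, Eq. (15)
    Lambda_new = (X.T @ Ez) @ np.linalg.inv(sum_Ezz)
    Psi_new = np.diag(X.T @ X - Lambda_new @ (Ez.T @ X)) / N   # keep only the diagonal
    return Lambda_new, Psi_new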


Mixture of Factor Analyzers (MFA)

• Assume that we have K factor analyzers indexed by ωk, k = 1, . . . , K. ωk = 1 when the data point was generated by the k-th factor analyzer.

• The generative mixture model:

p(x) = \sum_{k=1}^{K} \int p(x \mid z, \omega_k) \, p(z \mid \omega_k) \, p(\omega_k) \, dz    (16)

where

p(z \mid \omega_k) = p(z) = N(0, I)    (17)

• Allow each factor analyzer to model the data covariance structure in a different part of the input space:

p(x \mid z, \omega_k) = N(\mu_k + \Lambda_k z, \Psi)    (18)
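
For intuition, a small sketch of ancestral sampling from the generative model (16)-(18): pick a component with probability πk, draw z from N(0, I), then draw x from N(µk + Λk z, Ψ). The names are placeholders, not the course code:

import numpy as np

def sample_mfa(n, pis, mus, Lambdas, Psi_diag, rng=None):
    """Draw n samples from a mixture of factor analyzers."""
    if rng is None:
        rng = np.random.default_rng()
    d, m = Lambdas[0].shape
    X = np.empty((n, d))
    for i in range(n):
        k = rng.choice(len(pis), p=pis)               # omega_k ~ Categorical(pi)
        z = rng.standard_normal(m)                    # z ~ N(0, I), Eq. (17)
        noise = rng.standard_normal(d) * np.sqrt(Psi_diag)
        X[i] = mus[k] + Lambdas[k] @ z + noise        # x ~ N(mu_k + Lambda_k z, Psi), Eq. (18)
    return X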


EM for Mixture of Factor Analyzers

• For the E step, we need to compute the expectations of all hidden variables (see the sketch after this list):

E[\omega_k \mid x_i] \propto p(x_i, \omega_k) = p(\omega_k) \, p(x_i \mid \omega_k) = \pi_k N(x_i - \mu_k, \Lambda_k \Lambda_k^T + \Psi)

E[\omega_k z \mid x_i] = E[\omega_k \mid x_i] \, E[z \mid \omega_k, x_i]

E[\omega_k z z^T \mid x_i] = E[\omega_k \mid x_i] \, E[z z^T \mid \omega_k, x_i]    (19)

• The model parameters are {(µk,Λk, πk)Kk=1,Ψ}.

• For the M step, take derivative of log likelihood with respect to modelparameters for new µk, Λk, πk, and Ψ.

• Read “The EM Algorithm for Mixtures of Factor Analyzers,” by Ghahramani and Hinton for details.

• Also read Ghahramani’s lecture notes.
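
For example, the responsibilities in (19) use each analyzer's low-rank-plus-diagonal marginal covariance ΛkΛk^T + Ψ; a minimal numpy/scipy sketch with illustrative names:

import numpy as np
from scipy.stats import multivariate_normal

def mfa_responsibilities(X, pis, mus, Lambdas, Psi_diag):
    """E[omega_k | x_i] proportional to pi_k N(x_i | mu_k, Lambda_k Lambda_k^T + Psi)."""
    joint = np.column_stack([
        pis[k] * multivariate_normal.pdf(X, mean=mus[k],
                                         cov=Lambdas[k] @ Lambdas[k].T + np.diag(Psi_diag))
        for k in range(len(pis))])
    return joint / joint.sum(axis=1, keepdims=True)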


EM for Mixture of Probabilistic PCA

• Based on factor analyzers

• Read “Mixtures of Probabilistic Principal Component Analyzers,” by Tipping and Bishop.


MFA: Applications

• Modeling the manifolds of images of handwritten digits with a mixture of factor analyzers [Hinton et al. 97].

• Modeling the multimodal density of faces for recognition and detection [Frey et al. 98] [Yang et al. 00].

• Analyze layers of appearance and motion [Frey and Jojic 99]

• A mixture of factor analyzers concurrently performs clustering and dimensionality reduction.

• Able to model the nonlinear manifold well.


Nonlinear Principal Component Analysis (NLPCA)

• Aim to better model nonlinear manifold

• Based on a multi-layer (5-layer) perceptron

• The layer in the middle represents the feature space of the NLPCA transform.

• Two additional layers are used for nonlinearity.

• Auto-encoder, auto-associator, bottleneck network.
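
A minimal sketch of such a bottleneck network using PyTorch (my own illustration under assumed layer sizes, not part of the course material); the middle layer holds the NLPCA features:

import torch
import torch.nn as nn

d, h, m = 64, 32, 2              # input, hidden, and bottleneck dimensions (illustrative)
autoencoder = nn.Sequential(
    nn.Linear(d, h), nn.Tanh(),  # nonlinear mapping layer
    nn.Linear(h, m),             # linear bottleneck: the NLPCA feature space
    nn.Linear(m, h), nn.Tanh(),  # nonlinear demapping layer
    nn.Linear(h, d),             # linear reconstruction layer
)
opt = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)

def train_step(X):
    """One reconstruction step on a float tensor X of shape (batch, d)."""
    loss = nn.functional.mse_loss(autoencoder(X), X)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()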


Recap

• Linear dimensionality reduction:

- Assume data is generated from a subspace
- Determine the subspace with PCA or FA (i.e., the subspace is spanned by the principal components)

• Nonlinear dimensionality reduction:

- Model data with a mixture of locally linear subspaces
- Use mixture of PCA or mixture of FA

• Mixture methods have local coordinate systems

• Need to find transformation between coordinate systems


Isometric Mapping (Isomap) [Tenenbaum et al. 00]

• Preserving pairwise distance structure

• Approximate geodesic distance

• Nonlinear dimensionality reduction

• Use a global coordinate system

• Aim to find intrinsic dimensionality


Multidimensional Scaling (MDS)

• Analyze pairwise similarities of entities to gain insight into the underlying structure

• Based on a matrix of pairwise similarities

• Metric or non-metric

• Useful for data visualization

• Can be used for dimensionality reduction

• Preserve the pairwise similarity measure


Isomap: Algorithm

• Isomap:

- Construct the neighborhood graph:
  Define a graph G over all data points by connecting points i and j if they are neighbors.

- Compute the shortest paths:
  For any pair of points i and j, compute their shortest path in G, and obtain the matrix of graph distances D_G.

- Construct the M-dimensional embedding:
  Apply classical MDS to D_G to construct an M-dimensional Euclidean space Y. The coordinates y_i are obtained by minimizing

E = \| \tau(D_G) - \tau(D_Y) \|_{L^2}    (20)

where \tau converts distances into inner products that uniquely characterize the geometry of the data.

• The global minimum of (20) is obtained by setting the coordinates y_i to the top M eigenvectors of \tau(D_G).
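
A hedged numpy/scipy sketch of the three steps above (kNN graph, graph shortest paths for D_G, classical MDS on τ(D_G) = −½ H D_G² H with H the centering matrix); this is an illustration, not the course Isomap code, and it assumes the neighborhood graph is connected:

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=6, n_components=2):
    """Minimal Isomap: kNN graph -> geodesic distances -> classical MDS."""
    N = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)    # pairwise Euclidean distances
    # 1. Neighborhood graph: keep edges to the k nearest neighbors of each point.
    cols = np.argsort(D, axis=1)[:, 1:n_neighbors + 1].ravel()
    rows = np.repeat(np.arange(N), n_neighbors)
    G = csr_matrix((D[rows, cols], (rows, cols)), shape=(N, N))
    # 2. Geodesic distances D_G: shortest paths on the undirected graph.
    DG = shortest_path(G, directed=False)
    # 3. Classical MDS on tau(D_G) = -1/2 H D_G^2 H; embed with the top eigenvectors.
    H = np.eye(N) - np.ones((N, N)) / N
    tau = -0.5 * H @ (DG ** 2) @ H
    vals, vecs = np.linalg.eigh(tau)
    top = np.argsort(vals)[::-1][:n_components]
    return vecs[:, top] * np.sqrt(vals[top])                      # rows are the coordinates y_i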


Isomap: Applications

[Figures: intrinsic low-dimensional embedding; interpolation using the low-dimensional embedding]


Isomap: Applications

• Object recognition: memory-based recognition

• Object tracking: trajectory along inferred nonlinear manifold

• Video synthesis: interpolate along trajectory on nonlinear manifold

“Representation analysis and synthesis of lip images using dimensionality reduction,” Aharon and Kimmel, IJCV 2005.


Isomap: Applications

• States of a moving object move smoothly along a low dimensional manifold.

• Discover the underlying manifold using Isomap

• Learn the mapping between the input data and the corresponding points on the low-dimensional manifold using a mixture of factor analyzers.

• Learn a dynamical model based on the points on the low-dimensional manifold.

• Use particle filter for tracking.

“Learning object intrinsic structure for robust visual tracking,” Wang et al., CVPR 2003.


Locally Linear Embedding (LLE) [Roweis et al. 00]

• Capture local geometry by linear reconstruction

• Map high dimensional data to global internal coordinates


Locally Linear Embedding: Algorithm

• LLE

- For each point, determine its neighbors
- Reconstruct each point with linear weights on its neighbors by minimizing

E(W) = \sum_{i} \Big| x_i - \sum_{j} w_{ij} x_j \Big|^2    (21)

to find the weights w
- Map to embedded coordinates:
  Fix w and project x ∈ R^d to y ∈ R^M (M < d) by minimizing

\phi(y) = \sum_{i} \Big| y_i - \sum_{j} w_{ij} y_j \Big|^2    (22)

• The embedding cost (22) defines an unconstrained optimization problem. Adding a normalization constraint turns it into an eigenvalue problem.

• The bottom M eigenvectors with nonzero eigenvalues provide an ordered set of orthogonal coordinates.
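
A hedged numpy sketch of the two LLE steps: solve (21) for each point's weights through its local Gram matrix, then minimize (22) under the usual normalization constraint by taking the bottom eigenvectors of (I − W)^T(I − W); the names are illustrative, not the course code:

import numpy as np

def lle(X, n_neighbors=8, n_components=2, reg=1e-3):
    """Minimal LLE: local reconstruction weights, then spectral embedding."""
    N = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    knn = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]
    W = np.zeros((N, N))
    for i in range(N):
        Z = X[knn[i]] - X[i]                           # neighbors centered on x_i
        C = Z @ Z.T                                    # local Gram matrix
        C += reg * np.trace(C) * np.eye(n_neighbors)   # regularize for stability
        w = np.linalg.solve(C, np.ones(n_neighbors))   # minimize |x_i - sum_j w_ij x_j|^2
        W[i, knn[i]] = w / w.sum()                     # weights sum to one
    M = (np.eye(N) - W).T @ (np.eye(N) - W)            # embedding cost matrix for Eq. (22)
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, 1:n_components + 1]                 # bottom eigenvectors, skipping the constant one

For real use, scikit-learn provides reference implementations of both methods (sklearn.manifold.Isomap and sklearn.manifold.LocallyLinearEmbedding).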


LLE: Applications

Learn the embedding of facial expression images for synthesis

Learn the embedding of lip images for synthesis


Isomap and LLE

• Embedding rather than mapping function

• No probabilistic interpretation

• No generative model

• Do not take temporal information into consideration

• Unsupervised learning


Further Study

• Kernel PCA.

• Principal Curve.

• Laplacian Eigenmap.

• Hessian Isomap.

• Spectral clustering.

• Unified view of spectral embedding and clustering.

• Global coordination of local generative models:

- Global coordination of local linear representations.
- Automatic alignment of local representations.


Big Picture

“A unifying review of linear Gaussian models,” Roweis and Ghahramani 99

• Deterministic/probabilistic

• Static/dynamic

• Linear/nonlinear

• Mixture, hierarchical
